NORMALIZING FLOWS AIDED VARIATIONAL INFERENCE FOR UNCERTAINTY QUANTIFICATION

By

Sumegha Premchandar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2024

ABSTRACT

Bayesian statistics is a powerful tool for quantifying uncertainties when estimating unknown model parameters. It is often the case that the posterior distributions arising from the Bayesian paradigm are intractable. This may be due to complex statistical model choices and high-dimensionality of the parameter space. Previously, Markov Chain Monte Carlo (MCMC) methods have been the preferred approach for sampling from posterior distributions with an unknown normalizing constant. However, MCMC methods run into a number of issues in practice. For instance, they do not always scale well to multimodal distributions defined on a high-dimensional support. Variational Inference (VI) has emerged as a scalable alternative to MCMC for sampling from intractable posterior distributions. Recently, Normalizing Flows aided VI (FAVI) has been used for sampling from complex and multimodal posterior distributions to overcome the limitations of existing mean-field and structured VI approaches. FAVI has had a significant impact across fields in applications such as computer vision, computational biology, and physics-based modelling. Despite its impact, there is limited research on the theoretical properties of the approximate posterior arising from FAVI. The computational cost of FAVI depends heavily on the choice of Normalizing Flow (NF) family, but there is no work quantifying the nature of the approximate posterior from FAVI at a particular complexity of the NF, especially with respect to uncertainty quantification. In this dissertation, we study the properties of the FAVI posterior with a focus on: (i) The trade-off between accurate recovery of the posterior samples and complexity of the selected NF family.
(ii) Uncertainty quantification. We first provide background on FAVI and compare it to popular competitors (Mean-Field VI (MF-VI) and MCMC) over some basic statistical applications. Our results demonstrate that FAVI lies between MCMC and MF-VI in both statistical accuracy and computational efficiency. In the second part of this dissertation, we use the framework of Bayesian linear regression with 2 predictor variables to rigorously study the optimal Kullback-Leibler divergence between the FAVI approximation with Inverse Auto-regressive Flows (IAF) and the true posterior. We also derive the uncertainty quantification (credible interval coverage) resulting from using FAVI to approximate the posterior, as a function of the correlation between the regression predictors. We contrast this coverage with MF-VI (the most popular VI approach in the literature) and find that, given sufficient complexity of the NF, there is virtually no loss in coverage from FAVI relative to the true posterior, regardless of the correlation. On the other hand, the loss in coverage for MF-VI increases monotonically in the correlation. Next, we extend our results to the case of an arbitrary number $p > 2$ of regression predictors. Our results, presented across complexity levels of the IAF transformations, demonstrate that given sufficient complexity of IAF, FAVI can completely recover the true posterior. To our knowledge, this is the first theoretical exploration of this kind. Finally, we discuss ongoing research and plans for future work where we will leverage our findings to use FAVI for Bayesian inference in high-dimensional linear models with spike and slab priors. Preliminary results show that FAVI can capture dependencies in the posterior more effectively than MF-VI.
FAVI is one among many novel computational tools that have originated in the machine learning literature for scalable Bayesian computation, but there has been little previous work analyzing its statistical properties and reliability for uncertainty quantification. By studying the FAVI posterior through a statistical lens, this dissertation bridges some of the gap between machine learning and statistics, and takes strides towards building reliable computational tools for Bayesian inference.

Copyright by
SUMEGHA PREMCHANDAR
2024

This thesis is dedicated to my parents - Jayanthi Premchandar and M.R. Premchandar.

ACKNOWLEDGEMENTS

First and foremost, I would like to express my gratitude to my advisors Dr. Tapabrata Maiti and Dr. Shrijita Bhattacharya, who have been instrumental in shaping my thesis. Their enthusiasm for tackling interesting and challenging research problems has been an example for me and has molded me into the researcher I am today. I have greatly appreciated their support and patience these past 3 years while I learned the ins and outs of doing research. I would also like to express my gratitude for the generous funding provided to me by my advisors during my third and fourth years of the PhD. This funding gave me the space and time to focus on my research. I am additionally very thankful to my committee members Dr. Yimin Xiao and Dr. Selin Aviyente for the time they spent attending committee meetings and reading my work, and for the valuable feedback they provided that greatly improved its quality. I count myself lucky to have had some wonderful professors and mentors over the course of my PhD, including Dr. Yimin Xiao, Dr. Haolei Weng and Dr. Shrijita Bhattacharya, whose instruction in the prelim courses helped me build a strong foundational knowledge in probability and statistics. I am additionally grateful for the mentorship of Dr.
Sandeep Madireddy, for providing me with the opportunity to intern at Argonne National Laboratory and for introducing me to the exciting area of Neural Architecture Search. My time at Argonne was well spent, picking up computational and research skills that have been put to good use throughout my PhD. Aside from my professors and mentors, I would like to express my appreciation for the STT staff, particularly Andy, Tami and Ashlynn, for making administrative processes far easier and MSU technology much more accessible. I have had a small but loyal army of friends and family by my side who have made this journey possible. To my friends Tathagata, Sikta and Hema: thank you for providing me with a sense of community and family in East Lansing. Thanks also to my other friends at MSU, Phuong, Satabdi, Soyeong, Nian, Haoxiang, Sang Kyu, Arka, Anirban, Sampriti, Alex and Sanket, for making East Lansing feel much less lonely. I am especially indebted to Arka, who allowed me to pick his brain about my research many times and who has always been a source of useful advice for me. My support system outside of STT has been just as important as the people inside it. I greatly value my family members, who have provided unwavering love and support these past few years. They have made the effort to keep in touch even when my busy schedule sometimes left me with little energy to do the same. I reserve a special mention for all of my grandparents, who have been proud of even the smallest of my wins. Thanks to my friends Mahevash and Shailja for coming all the way to Chennai to visit me on my first trip back to India and for our enduring long-distance friendship. I owe a great deal to my friends Anisha and Nayantara, who have always been there for me at my lowest points and have reminded me of my worth outside of research. It is difficult for me to find the words to describe how much their friendship has meant to me.
The near-daily pictures of Ingee, my favourite ginger cat, and her sage advice communicated via Anisha kept my spirits up on many an occasion. I can state with certainty that I would not have made it to the finish line without the love and support of my partner Nisarg. He has helped me with the more frustrating parts of using LaTeX, debugged my code on many occasions and acted as a sounding board for my ideas. But more than this, thank you for having faith in me when I had none in myself, for encouraging me to treat myself more kindly and for bringing so much levity to even the most stressful parts of my life. My deepest appreciation is for my sister Sucharita, whose unconventional and bite-sized wisdom got me through the more arduous phases of this PhD. Last and most important, I would like to thank my parents Jayanthi Premchandar and M.R. Premchandar for everything they have done for me. Thanks Appa, for doing the sometimes difficult job of pushing me when it was required, and thank you Amma for being such a source of strength, resilience and unconditional love these past 5 years and more. This thesis belongs to all of you.

TABLE OF CONTENTS

LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

CHAPTER 1   INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . .  1
  1.1 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . .  2
  1.2 The basic Variational Inference algorithms . . . . . . . . . . . . . . . .  3
  1.3 Normalizing Flows aided Variational Inference . . . . . . . . . . . . . .  4
  1.4 Preliminary Notation . . . . . . . . . . . . . . . . . . . . . . . . . . .  5
  1.5 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . .  6

CHAPTER 2   NORMALIZING FLOWS AIDED VARIATIONAL INFERENCE: BACKGROUND, EXAMPLES AND COMPARISONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
  2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
  2.2 When should we use Normalizing Flows VI? . . . . . . . . . . . . . . . . .  9
  2.3 Normalizing Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . .  9
  2.4 Discrete and Continuous-Time Flows . . . . . . . . . . . . . . . . . . . . 10
  2.5 Neural Auto-regressive Flows . . . . . . . . . . . . . . . . . . . . . . . 13
  2.6 Illustrative Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 16
  2.7 Looking Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

CHAPTER 3   STATISTICAL PROPERTIES OF THE FAVI POSTERIOR: A CASE STUDY WITH LINEAR REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
  3.1 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
  3.2 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
  3.3 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  3.4 Technical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
  3.5 Extensions to higher dimensions . . . . . . . . . . . . . . . . . . . . . 57

CHAPTER 4   LINEAR REGRESSION WITH SPIKE AND SLAB PRIORS . . . . . . . . . . . . 69
  4.1 Variational Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
  4.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . 73
  4.3 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
  4.4 Limitations and future work . . . . . . . . . . . . . . . . . . . . . . . 81

CHAPTER 5   CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

APPENDIX A   INFORMATION ON ENERGY DENSITY FUNCTIONS . . . . . . . . . . . . . . 89
APPENDIX B   RUN-TIME COMPARISONS FOR LOGISTIC REGRESSION . . . . . . . . . . . 90
APPENDIX C   ADDITIONAL RESULTS FOR SPIKE AND SLAB REGRESSION . . . . . . . . . 91
LIST OF ABBREVIATIONS

BVS    Bayesian variable selection
CDF    Cumulative distribution function
ECDF   Empirical cumulative distribution function
FAVI   Flows aided variational inference
GLM    Generalized linear model
IAF    Inverse auto-regressive flow
KDE    Kernel density estimates
MCMC   Markov chain Monte Carlo
MF     Mean-field
MH     Metropolis-Hastings
NAF    Neural auto-regressive flows
NF     Normalizing flows
PDF    Probability density function
PMF    Probability mass function
RW     Random-walk
SGA    Stochastic gradient ascent
SSE    Sum of squared error
SVI    Structured variational inference
VI     Variational inference
DSF    Deep sigmoidal flow

CHAPTER 1
INTRODUCTION

In various scientific fields, statistical and machine learning models play a significant role in inference and decision making. It is crucial to have models that accurately represent uncertainties in unknown model parameters, as this helps in building robust decision-making processes. This is especially important in safety-critical applications such as medical image analysis and autonomous driving. Bayesian statistics is a potent tool for quantifying uncertainties without needing to resort to complex asymptotic results. It also has the added advantage of being able to incorporate domain-specific prior knowledge into the statistical model. Bayesian statistics derives all inference about an unknown model parameter from the posterior distribution. In numerous applications, the posterior distribution does not have a closed form. In such a situation we say that the posterior distribution is "intractable" and we use approximate inference methods to sample from it. A key challenge in Bayesian statistics is developing scalable and statistically accurate computational tools to generate samples from complex, intractable posterior distributions. It is difficult to balance the trade-offs between statistical accuracy and computational efficiency for approximate inference methods.
Normalizing Flows aided Variational Inference (FAVI) is an algorithm for sampling from intractable distributions that originated in the machine learning literature [33]. It was introduced to recover complex and multimodal distributions, while retaining some of the scalability that characterizes VI. FAVI has been impactful across scientific fields such as computer vision [20], computational biology [7] and physics-based modelling ([43], [46]). Given its popularity across application areas, it is crucial to address some of the fundamental theoretical gaps in our knowledge surrounding this valuable tool. This can then enable its wider adoption for statistical inference and variable selection in high dimensions. In the rest of this chapter we provide an overview of concepts relevant to our work and discuss important related literature. At the end of this chapter, we provide an outline for the rest of this dissertation.

1.1 Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC) methods generate samples from a Markov chain with a stationary distribution equal to the target posterior distribution we wish to sample from. There are a number of MCMC methods in the literature, the most well known of which are the Metropolis-Hastings algorithm [9] and Gibbs sampling [8]. The Metropolis-Hastings algorithm uses a proposal distribution that serves as a transition kernel for a Markov chain. The proposal distribution is used to generate samples (a new state) based on some previous state. This new state is then accepted or rejected with a probability that depends on the un-normalized target posterior and the proposal distribution. One of the most popular choices for the proposal is a Gaussian distribution with mean equal to the previous state. This is known as the Gaussian Random-Walk Metropolis-Hastings (RW-MH) algorithm and we use it for our comparison studies in Chapter 2. Gibbs sampling is a special case of the Metropolis-Hastings algorithm.
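The RW-MH update just described can be sketched in a few lines. This is a toy illustration for a one-dimensional target known only up to a normalizing constant (the target, step size and sample counts are illustrative choices, not from this dissertation):

```python
import numpy as np

def rw_mh(log_target, n_samples, init=0.0, step=1.0, seed=None):
    """Gaussian Random-Walk Metropolis-Hastings for a 1-D target.

    log_target: log of the (un-normalized) target density.
    step: standard deviation of the Gaussian proposal.
    """
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x, logp = init, log_target(init)
    for t in range(n_samples):
        # Propose a new state centered at the current one.
        x_new = x + step * rng.normal()
        logp_new = log_target(x_new)
        # Accept with probability min(1, target ratio); the symmetric
        # proposal cancels out, so the normalizing constant is never needed.
        if np.log(rng.uniform()) < logp_new - logp:
            x, logp = x_new, logp_new
        samples[t] = x
    return samples

# Un-normalized N(3, 2^2) target: only the log-kernel is supplied.
draws = rw_mh(lambda x: -0.5 * ((x - 3.0) / 2.0) ** 2,
              n_samples=50_000, step=2.5, seed=0)
```

After discarding an initial burn-in, the sample mean and standard deviation should be close to 3 and 2 respectively.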
Gibbs sampling generates samples from a multivariate distribution, one dimension at a time. If we have a joint distribution $\pi(\theta)$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_p) \in \mathbb{R}^p$, it leverages information about the complete conditional distributions $\pi(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_p)$, $1 \le i \le p$, to produce samples.

MCMC methods have the desirable property that the generated Markov chain samples are guaranteed to converge to samples from the target distribution. However, they run into a number of issues in practice. MCMC can be slow to converge, especially for high-dimensional state spaces and multimodal target distributions. The RW-MH algorithm requires a computational budget of $O(p^2)$ to generate two approximately independent samples from the stationary distribution [26]. If a large number of samples are required, this can be computationally prohibitive. Gibbs sampling generally has faster mixing times than the RW-MH algorithm and does not require tuning of the proposal distribution, but we may not always have information on the complete conditionals. There are more contemporary MCMC methods such as Hamiltonian Monte Carlo (HMC) that leverage gradient information of the target distribution to explore the state space much more efficiently than the MH algorithm. However, as discussed later in Section 2.7, even HMC runs into issues for highly multimodal target distributions.

Lastly, assessing convergence of the Markov chain to the stationary distribution is often challenging [35]. Empirical convergence diagnostics such as the Gelman-Rubin diagnostic [10] are popularly used; however, they do not always guarantee convergence of the Markov chain. Further, calculating such diagnostics requires running multiple parallel chains from which many of the samples are discarded. This can be computationally expensive.
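The coordinate-wise Gibbs updates described above can be made concrete with a toy target where the complete conditionals are known exactly: a standard bivariate normal with correlation $\rho$, for which $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$ and symmetrically for $\theta_2$ (an illustrative sketch, not this dissertation's code):

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, seed=None):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each complete conditional is Gaussian, so every coordinate update
    is an exact draw from its full conditional distribution.
    """
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)
    theta = np.zeros(2)
    out = np.empty((n_samples, 2))
    for t in range(n_samples):
        # Sweep through the coordinates, conditioning on the latest values.
        theta[0] = rho * theta[1] + cond_sd * rng.normal()
        theta[1] = rho * theta[0] + cond_sd * rng.normal()
        out[t] = theta
    return out

draws = gibbs_bivariate_normal(rho=0.8, n_samples=50_000, seed=0)
```

The empirical correlation of the chain (after burn-in) should recover $\rho$; as $\rho \to 1$ the conditional variances shrink and the chain mixes more slowly, a small-scale version of the convergence issues noted above.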
Other measures, such as upper bounds on the total variation distance between the density of the MCMC samples and the stationary distribution, are difficult to obtain for many statistical models.

1.2 The basic Variational Inference algorithms

Variational Inference (VI) surfaced in the machine learning literature as a scalable means of sampling from intractable posterior distributions [18]. VI is widely used in applications such as computer vision, topic modeling, and computational biology ([5], [7]). A common theme in these applications is the presence of large datasets and high-dimensional parameter spaces. In VI, a variational family of distributions $Q$ is proposed, from which a distribution $q_{\phi^*}$ is selected based on its "closeness" to the target posterior. Keeping with the formulation in a majority of the VI literature, we will use the Kullback-Leibler (KL) divergence as the measure of closeness. More precisely, VI reframes the problem of sampling from a target distribution $\Pi(\theta \mid D)$ into the following optimization:
$$q_{\phi^*} \in \operatorname*{arg\,min}_{q_\phi \in Q} \mathrm{KL}\left(q_\phi \,\|\, \Pi(\cdot \mid D)\right). \quad (1.1)$$
The choice of family $Q$ drives the trade-off between statistical accuracy and computational efficiency that characterizes VI. Mean-Field VI (MF-VI) enables computational efficiency by assuming that any $q_\phi \in Q$ can be factorized as $q_\phi(\theta) = \prod_{i=1}^{p} q_{\phi_i}(\theta_i)$. Structured VI (SVI) allows for some level of dependence among the $\theta_i$ by estimating a non-diagonal covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$, but it sacrifices some of the computational efficiency of MF-VI. A caveat of both MF-VI and SVI is that they cannot recover multimodal target distributions.

1.3 Normalizing Flows aided Variational Inference

Some of the earliest mentions of the idea of using a Normalizing Flow for probabilistic modelling can be found in [39] and [38]. A Normalizing Flow (NF) is nothing but a composition of continuously differentiable mappings with differentiable inverse, applied to samples from a base distribution.
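As a toy illustration of this definition (with hypothetical maps, not ones used later in the dissertation): two strictly increasing differentiable functions are composed and applied to base samples, and because each is a diffeomorphism, the composition can be inverted layer by layer.

```python
import numpy as np

# Two simple diffeomorphisms on R and their inverses.
T1 = lambda u: 2.0 * u + 1.0          # affine map: strictly monotone
T1_inv = lambda z: (z - 1.0) / 2.0
T2 = lambda z: np.arcsinh(z)          # smooth, strictly increasing on R
T2_inv = lambda y: np.sinh(y)

# A Normalizing Flow pushes base samples through the composition.
rng = np.random.default_rng(0)
u = rng.normal(size=1_000)            # base draws from N(0, 1)
z = T2(T1(u))                         # flow samples

# Invertibility of the composition: undo the maps in reverse order.
u_back = T1_inv(T2_inv(z))
```

Recovering the original base samples exactly is what makes the density of the transformed variable computable via the change-of-variables formula discussed in Chapter 2.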
Normalizing Flows aided VI (FAVI) was introduced in [33] to alleviate some of the issues encountered by MF-VI and SVI in sampling from complex and multimodal target distributions. FAVI generates a family of distributions by sampling from a base distribution $q_0(\theta_0)$ and applying differentiable and invertible mappings $T_s : \mathbb{R}^p \to \mathbb{R}^p$ such that $\theta_S = T_S \circ T_{S-1} \circ \cdots \circ T_1(\theta_0)$. If $\Phi \subseteq \mathbb{R}^m$ denotes the space of parameters for the transformations $\{T_s\}_{s=1}^{S}$, then any $\phi \in \Phi$ results in a distribution $q_\phi(\theta_S)$. We then choose an optimal distribution $q_{\phi^*}$ based on the minimization in (1.1). More complex choices of $T_s$ yield more expressive variational families, albeit at added computational cost. Among the many ways to choose $T_s$, we limit our scope throughout this dissertation to the popular Auto-Regressive Flows, for which the cost of computing $q(\theta_S)$ from $\theta_0$ is $O(Sp)$ [28].

FAVI has demonstrated a lot of potential as a computational tool for Bayesian inference in complex statistical models ([28], [22]). However, very little is known about the theoretical behaviour of the variational posterior arising from FAVI. Previous theoretical works on VI have mostly focussed on asymptotic posterior consistency properties of the approximate posterior arising from MF-VI or SVI ([3], [31], [42]). These results take a frequentist viewpoint and are focussed more on the impact of the variational approximation on central tendency estimates.¹ They do not provide an in-depth exploration of the uncertainty quantification obtained from the variational posterior. Certain families of NFs are known to be highly expressive and, in theory, they can model any target distribution with non-zero support on $\mathbb{R}^p$ (see Chapter 2). Given these properties, FAVI should demonstrate improved uncertainty quantification and recovery of the posterior samples for finite sample sizes, when compared to simpler variational families (e.g. mean-field).
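To make the $O(Sp)$ cost of an auto-regressive layer concrete, here is a simplified affine autoregressive step in the spirit of IAF, with toy conditioner functions standing in for the neural networks used in practice (the conditioners and inputs below are hypothetical): output coordinate $i$ depends only on input coordinates $1, \ldots, i$, so the Jacobian is triangular and its log-determinant is just a sum of log-scales.

```python
import numpy as np

def affine_autoregressive_step(u, shift_fn, log_scale_fn):
    """One affine autoregressive map theta_i = mu_i(u_{<i}) + sigma_i(u_{<i}) * u_i.

    shift_fn / log_scale_fn map the prefix u_{<i} to mu_i and log(sigma_i).
    Returns the transformed vector and log|det J| = sum_i log(sigma_i),
    which costs O(p) per layer (hence O(Sp) for a depth-S flow).
    """
    p = len(u)
    theta = np.empty(p)
    log_det = 0.0
    for i in range(p):
        mu = shift_fn(u[:i])
        log_sigma = log_scale_fn(u[:i])
        theta[i] = mu + np.exp(log_sigma) * u[i]
        log_det += log_sigma  # triangular Jacobian: product of diagonal scales
    return theta, log_det

# Toy conditioners in place of neural networks.
shift = lambda prefix: 0.5 * prefix.sum()
log_scale = lambda prefix: 0.1 * len(prefix)

u = np.array([0.3, -1.2, 0.7])
theta, log_det = affine_autoregressive_step(u, shift, log_scale)
```

In real IAF layers the conditioners are masked neural networks, but the triangular-Jacobian bookkeeping is exactly the sum computed here.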
Consequently, we believe that while studying the theoretical properties of the approximate posterior obtained from FAVI, it is crucial to look beyond frequentist consistency results to assess the overall statistical accuracy of posterior samples from a Bayesian perspective, especially with respect to uncertainty quantification. Chapters 2 and 3 provide more background on FAVI, as well as the current state of research in the area, open problems and necessary technical details.

¹By frequentist viewpoint we mean the assumption that there exists a true unknown parameter $\theta_0$ that generates the observed data $D$ by means of a probability distribution $P(D \mid \theta_0)$.

1.4 Preliminary Notation

We use lowercase letters to denote scalars ($x \in \mathbb{R}$), boldface letters to denote vectors ($\mathbf{x} \in \mathbb{R}^m$) and capital letters to denote matrices ($X \in \mathbb{R}^{m \times d}$). The symbols $\Phi(\cdot)$ and $\phi(\cdot)$ represent the standard normal cumulative distribution function (CDF) and probability density function (PDF) respectively. The notation $\perp\!\!\!\perp$ is used to represent pair-wise independence between any two random variables. Let $\mathbb{I}$ denote an indicator function; that is, for any real-valued random variable $Y$ and Borel set $B \subseteq \mathbb{R}$,
$$\mathbb{I}(Y \in B) = \begin{cases} 1, & \text{if } Y \in B \\ 0, & \text{otherwise.} \end{cases}$$
For any scalar $s \in \mathbb{R}$ we use $\mathrm{sign}(s)$ for the sign of $s$, that is:
$$\mathrm{sign}(s) = \begin{cases} 1, & \text{if } s > 0 \\ 0, & \text{if } s = 0 \\ -1, & \text{if } s < 0. \end{cases}$$
The size-$m$ identity matrix is denoted $I_m \in \mathbb{R}^{m \times m}$.

We use the symbol $Q$ to denote a variational family of distributions, and $q_\phi$ is a member of $Q$ with variational parameters $\phi$. We use $q_{\phi^*}$ to refer to the variational posterior, that is, the optimal distribution from $Q$ that minimizes the KL divergence as per equation (1.1). We will sometimes use $q_{\phi^*}$ to refer to a specific member of $Q$ that may not satisfy the minimization (1.1), but we will make it clear if that is the case.
The symbol $\theta$ is used to denote the unknown parameter of interest or latent variables, and $D$ refers to observed data. We will use $p$ to denote the dimensionality of the parameter space ($\theta \in \mathbb{R}^p$) and $\Pi(\theta \mid D)$ or $\Pi(\cdot \mid D)$ for the target posterior distribution. The Kullback-Leibler (KL) divergence between two probability distributions $q$ and $p$ is defined as:
$$\mathrm{KL}(q \,\|\, p) = \int_{\mathbf{x}} q(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})} \, d\mathbf{x}.$$
See Tables 3.1 and 3.2 for more details on mathematical notation used in this dissertation.

1.5 Dissertation Outline

In this dissertation we study the properties of the variational posterior arising from FAVI with a dual focus on: (i) the trade-off between accurate recovery of the posterior samples and the complexity of the NF transformations $T_s$; (ii) uncertainty quantification.

Chapter 2 serves as an exposition of FAVI and is written to be accessible to a broad scientific audience. It also includes comparisons to popular competitors (Mean-Field VI (MF-VI) and MCMC) over a variety of classical statistical applications. The results of our comparison studies indicate that FAVI lies somewhere between MCMC and MF-VI in statistical accuracy and computational efficiency. This motivates the problems we consider in the subsequent chapters of this dissertation.

In Chapter 3 we begin with a rigorous study of the optimal Kullback-Leibler divergence and loss in uncertainty quantification between the FAVI approximation with Inverse Auto-regressive Flows (IAF) and the true posterior, within the specific context of Bayesian linear regression with 2 predictors. We also contrast the loss in uncertainty quantification (credible interval coverage) from using the FAVI approximate posterior with IAF to that of MF-VI, as a function of the correlation between regression predictors. We then extend our results to the case of $p > 2$ regression predictors. Our theoretical results highlight the benefits of FAVI for uncertainty quantification.
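A well-known closed-form fact previews why mean-field coverage degrades with correlation: the KL-optimal Gaussian mean-field approximation to a correlated Gaussian target $N(\mathbf{0}, \Sigma)$ has factor variances $1/\Lambda_{ii}$, where $\Lambda = \Sigma^{-1}$ is the precision matrix. A small sketch with toy numbers (not the regression setting analyzed in Chapter 3):

```python
import numpy as np

# Correlated bivariate Gaussian target N(0, Sigma).
rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)  # precision matrix of the target

# KL-optimal mean-field Gaussian factors: q_i = N(0, 1 / Lambda_ii).
mf_var = 1.0 / np.diag(Lambda)

# True marginal variances are diag(Sigma) = 1, but mean-field shrinks
# them to 1 - rho^2, so credible intervals built from q_phi* are too
# narrow whenever the coordinates are correlated.
```

For $\rho = 0.9$ the mean-field variances are $1 - \rho^2 = 0.19$, i.e. roughly a fifth of the true marginal variance, which is exactly the kind of under-coverage that worsens monotonically in the correlation.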
We follow this with details on ongoing research, where we adapt FAVI for Bayesian inference in high-dimensional linear models with spike and slab priors (Chapter 4). We demonstrate its usefulness in capturing dependencies across latent variables, as well as emphasize its limitations. Based on these limitations we suggest possible directions for future work.

CHAPTER 2
NORMALIZING FLOWS AIDED VARIATIONAL INFERENCE: BACKGROUND, EXAMPLES AND COMPARISONS

A modified version of this chapter was first published in Notices of the American Mathematical Society, volume 70, number 7, 2023; published by the American Mathematical Society. © 2023 American Mathematical Society.

2.1 Background

A major area of contemporary statistics research is learning to model probability distributions of varying complexity. The problem of learning to characterize probability distributions broadly takes two forms: estimating a probability density given samples from it, and approximating densities that are known only up to a normalizing constant. The latter avenue of research has applications in Bayesian inference, where we wish to generate samples from the posterior distribution of model parameters given observed data. This chapter discusses the use of Normalizing Flows for Variational Inference (VI), a method wherein we can approximate and sample from complex probability densities [33]. This type of probabilistic modeling lies in the second avenue of research, where we do not have a normalizing constant for the probability densities of interest.

VI is a tool that emerged in machine learning to approximate probability densities. It is often applied in Bayesian statistics as a more scalable alternative to Markov Chain Monte Carlo (MCMC) methods for large datasets. Although scalable, earlier approaches such as mean-field or structured VI are limited when approximating more complex and multimodal probability distributions.
Normalizing Flows are mappings from a simple base distribution to a more complex probability distribution. They are primarily used for modeling continuous distributions and can be used to specify very flexible probability models, thus improving the statistical accuracy of VI algorithms. There already exist comprehensive reviews of Normalizing Flow methods in general. An overview of different Normalizing Flow families is provided in [22], while [28] goes into depth on each family of flow models and extends this discussion to newer areas, such as flows for discrete variables. These reviews are an overarching look at flows for probabilistic modelling and are focussed on applications in the machine learning literature. Discussion of applications of a more classical statistical nature is limited. An excellent exposition and survey of VI from a statistical lens is given in [4]. However, they only cover variational families with a known parametric form, such as mean-field and structured VI. We extend the discussion to variational families specified by Normalizing Flows. Further, this chapter is written to be accessible to readers new to the area.

In latent variable modeling, we aim to learn the conditional distribution of latent variables $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ given observed data $D$, that is, $\Pi(\theta \mid D)$.¹ We explain how solving this problem is useful in Bayesian statistics. In parametric statistics, stochasticity in the observed data is often described using a specific probability distribution $p(D \mid \theta)$, where $\theta$ needs to be estimated based on the data $D$. In Bayesian inference, we assume a prior distribution $\pi(\theta)$ on $\theta$, representing our beliefs about the model parameter prior to observing the data. Based on the data, we update our beliefs via the posterior distribution $\Pi(\theta \mid D)$. The posterior can be calculated by Bayes' theorem:
$$\Pi(\theta \mid D) = \frac{p(D \mid \theta)\, \pi(\theta)}{\int p(D \mid \theta)\, \pi(\theta)\, d\theta}.$$
For cases where the marginal likelihood $m(D) = \int p(D \mid \theta)\, \pi(\theta)\, d\theta$ is intractable, we resort to approximate inference.
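By way of contrast, here is a toy case where the marginal likelihood is tractable, so Bayes' theorem can be applied exactly with no approximate inference (hypothetical numbers for illustration): a Beta prior on a success probability with a Binomial likelihood is conjugate, giving both the posterior and $m(D)$ in closed form.

```python
import math

# Beta(a, b) prior on a success probability theta; k successes in n trials.
a, b, n, k = 2.0, 2.0, 20, 14

# Conjugacy: posterior is Beta(a + k, b + n - k), and the marginal
# likelihood m(D) = C(n, k) * B(a + k, b + n - k) / B(a, b) is a ratio
# of Beta functions -- no numerical integration required.
beta_fn = lambda x, y: math.gamma(x) * math.gamma(y) / math.gamma(x + y)
m_D = math.comb(n, k) * beta_fn(a + k, b + n - k) / beta_fn(a, b)

post_mean = (a + k) / (a + b + n)  # posterior mean of theta
```

Outside such conjugate pairs, the integral defining $m(D)$ rarely has a closed form, which is precisely the situation MCMC and VI are designed for.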
MCMC methods have long been the go-to for sampling from posterior distributions when $m(D)$ cannot be computed. MCMC algorithms generate samples from a Markov chain whose stationary distribution converges to the target distribution of interest. One prominent example is the Metropolis-Hastings method [9], of which the Gibbs sampling algorithm [8] is a special case. However, these methods may not always scale well to high-dimensional models and can be slow to converge for multimodal distributions. VI has shown promise as a scalable alternative to MCMC. In VI, the target distribution is approximated by a family of distributions $Q$, among which we choose the optimal distribution $q_{\phi^*}$ that is "closest" to the target. To determine "closeness", the KL-divergence is often used. Intuitively, the KL-divergence is something akin to a distance between two probability distributions. Thus, probabilistic modeling with VI becomes an optimization problem:
$$q_{\phi^*} \in \operatorname*{arg\,min}_{q_\phi \in Q} \mathrm{KL}\left(q_\phi \,\|\, \Pi(\cdot \mid D)\right).$$
Mean-Field VI (MF-VI) is a popular approach in which the variational family $Q$ is defined based on the assumption that the latent variables are independent of each other. The mean-field assumption is useful for faster computations during optimization, but it restricts the complexity of the densities we can approximate. Structured VI takes this one step further by allowing dependencies across latent variables. However, even with Structured VI we cannot guarantee that we can approximate any density arbitrarily well. This is where Normalizing Flows come in.

¹In much of the variational inference literature, $\mathbf{z}$ is used for the unknown parameters (latent variables) instead of $\theta$. We use $\theta$ to be consistent with the statistics literature.

2.2 When should we use Normalizing Flows VI?
In [4], the authors observe that "VI is suited to large data sets and scenarios where we want to quickly explore many models; MCMC is suited to smaller data sets and scenarios where we happily pay a higher computational cost for more precise samples." While this is generally true of MF-VI, Normalizing Flows VI lies somewhere between MCMC and other variational approximation approaches in terms of computational efficiency and accuracy. To shed some light on how Normalizing Flows VI compares to other sampling methods such as MCMC and MF-VI, we implement variational inference with Neural Auto-regressive Flows [16] for several examples. These examples cover classical Bayesian statistical applications in exponential family models, Gaussian linear regression and logistic regression. We cover scenarios of varying dimension and complexity of the target distribution. This gives us a high-level idea of the scalability vs. accuracy of these methods. We begin the following section by introducing Normalizing Flows and elaborate on how to use them for VI. We then proceed to examples in Section 2.6. Finally, we discuss some important takeaways and remaining challenges in the area in Section 2.7.

2.3 Normalizing Flows

The main idea behind Normalizing Flows is to transform some simple continuous base distribution into a "target" distribution that is usually more complex, via a series of bijective, continuously differentiable transformations with differentiable inverse [28]. These functions are often referred to as "diffeomorphisms". Let $Z \in \mathbb{R}^p$ be a random variable whose density we wish to model. We begin with a random variable $U$ sampled from some base distribution $p_U(\mathbf{u})$ defined on support $\mathbb{R}^p$ and apply a diffeomorphism $T : \mathbb{R}^p \to \mathbb{R}^p$ such that $Z = T(U)$. The density of $Z$ is then given by the change of variable formula [36]:
$$p_Z(\mathbf{z}) = p_U(\mathbf{u})\, |\det J_T(\mathbf{u})|^{-1},$$
where $|\det J_T(\mathbf{u})|$ denotes the absolute value of the determinant of the Jacobian of $T$ with respect to $\mathbf{u}$.
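As a quick numerical sanity check of the change-of-variables formula (a toy sketch with $T(u) = \exp(u)$, chosen here for illustration): pushing a standard normal through $\exp$ yields a log-normal, and the flow density computed from the formula matches the log-normal density exactly.

```python
import numpy as np

# Base density: standard normal p_U.
p_U = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

# Diffeomorphism T(u) = exp(u), so u = T^{-1}(z) = ln(z) and the
# Jacobian factor is |dT/du| = exp(u) = z.
def p_Z(z):
    u = np.log(z)
    return p_U(u) / np.exp(u)   # change-of-variables formula

# Direct log-normal density for comparison.
def lognormal_pdf(z):
    return np.exp(-0.5 * np.log(z) ** 2) / (z * np.sqrt(2 * np.pi))

z = np.linspace(0.1, 5.0, 50)
```

On the grid `z`, `p_Z(z)` and `lognormal_pdf(z)` agree to machine precision, confirming that the Jacobian factor correctly rescales the base density.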
Thus, the function $T$ transforms the density $p_U(\mathbf{u})$ into $p_Z(\mathbf{z})$. This process, wherein samples from one probability density "flow" through a mapping to obtain another density, is called a Normalizing Flow.

A natural question to ask is whether Normalizing Flows can be used to transform a simple base distribution (e.g., a uniform or standard normal distribution) into any target distribution. The paper [28] contains a constructive argument showing that Normalizing Flows can indeed recover any target density under rather general conditions. In practice, this is heavily dependent on the transformations $T$ that we employ.

2.4 Discrete and Continuous-Time Flows

Normalizing Flows are mainly of two types - discrete time (finite flows) and continuous time (infinitesimal flows) [28]. Discrete-time Normalizing Flows are constructed by choosing a finite sequence of transformations $T_1, T_2, \dots, T_S$ and applying them successively to some base distribution $p_U(\mathbf{u})$, such that $\mathbf{z}_S = T_S \circ T_{S-1} \circ \cdots \circ T_1(\mathbf{u})$. Since we choose all transformations to be diffeomorphisms, the change of variables formula applies and we have:

$$p_{Z_S}(\mathbf{z}_S) = p_U(\mathbf{u}) \times \lvert \det J_{T_S}(\mathbf{z}_{S-1}) \rvert^{-1} \times \lvert \det J_{T_{S-1}}(\mathbf{z}_{S-2}) \rvert^{-1} \times \cdots \times \lvert \det J_{T_1}(\mathbf{u}) \rvert^{-1}.$$

The number of transformations, $S$, is often called the flow depth. Increasing the flow depth can help us model progressively more complex densities, at the expense of increased computational cost due to the calculation of the Jacobians $J_{T_s}(\mathbf{z}_{s-1})$.

We can think of discrete-time flows as modelling the evolution of a probability density at $S$-many time points [28]. In contrast, continuous-time Normalizing Flows model this evolution continuously from some time $t = 0$ to $S$ as an ordinary differential equation $\frac{d\mathbf{z}_t}{dt} = f(t, \mathbf{z}_t)$. A well-known example of a continuous-time flow is the Hamiltonian flow, which is used for MCMC sampling [30].

2.4.1 Normalizing Flows for Variational Inference

We now expand on how Normalizing Flows are used to aid VI. We revert to our previous notation of using $\theta = (\theta_1, \theta_2, \dots, \theta_p)$
to represent the latent variables, $D$ for the observed data, and $\Pi(\theta \mid D)$ for the target conditional distribution we wish to sample from:

$$\Pi(\theta \mid D) = \frac{p(D \mid \theta)\,\pi(\theta)}{m(D)}.$$

Recall that VI approximates the target distribution by choosing a family of distributions $\mathcal{Q} = \{q_\phi \mid \phi \in \Phi\}$ and selecting the optimal distribution in this family, $q_\phi^*$, closest to the target density in terms of KL-divergence:

$$q_\phi^* \in \arg\min_{q_\phi \in \mathcal{Q}} \mathrm{KL}\left(q_\phi \,\|\, \Pi(\cdot \mid D)\right). \qquad (2.1)$$

Other metrics, such as the more generalized $\alpha$-divergence measures [24], can be used in place of the KL-divergence. However, the KL-divergence is popular due to its versatility and relative ease of implementation. The optimization in (2.1) is difficult to work with due to the presence of the intractable marginal likelihood $m(D)$. In practice, we maximize the Evidence Lower Bound (ELBO) with respect to the variational parameters $\phi$, due to its equivalence to (2.1). The ELBO is the negative KL-divergence between the variational distribution $q_\phi$ and the joint distribution $p(D, \theta)$ of the latent variables and the observed data:

$$\max_{q_\phi \in \mathcal{Q}} \mathrm{ELBO}(q_\phi) = \max_{q_\phi \in \mathcal{Q}} \left\{ \mathbb{E}_{q_\phi(\theta)}\left[\ln p(D, \theta)\right] - \mathbb{E}_{q_\phi(\theta)}\left[\ln q_\phi(\theta)\right] \right\}. \qquad (2.2)$$

Using Normalizing Flows to aid Variational Inference was first popularized in [33]. The idea is to start with some base distribution $q_0(\theta_0)$ and then apply diffeomorphisms $T_1, T_2, \dots, T_S$ successively, so that $\theta_S = T_S \circ \cdots \circ T_1(\theta_0)$. The transformations $T_s$, parameterized by $\phi$, induce a flexible variational family $\mathcal{Q} = \{q_\phi(\theta) \mid \phi \in \Psi\}$. In this case, the symbol $\Psi$ denotes the parameter space for the transformations $(T_s)_{s=1}^{S}$. We have the following useful relations:

$$\ln q_\phi(\theta_S) = \ln q_0(\theta_0) - \sum_{s=1}^{S} \ln \left\lvert \det \frac{\partial T_s}{\partial \theta_{s-1}} \right\rvert \qquad (2.3)$$

$$\mathbb{E}_{q_\phi(\theta)}\left[h(\theta)\right] = \mathbb{E}_{q_0(\theta_0)}\left[h\left(T_S \circ \cdots \circ T_1(\theta_0)\right)\right] \qquad (2.4)$$

(2.3) follows from the change of variables formula, and (2.4) is a well-known property of expectation (sometimes termed the law of the unconscious statistician).
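To make relations (2.3) and (2.4) concrete, the sketch below forms a Monte Carlo estimate of the ELBO for a one-transformation affine flow $T(\theta_0) = a\theta_0 + b$ over a toy target (the target and all names here are illustrative assumptions, not the chapter's actual examples). Because the toy "joint" is a normalized density, the ELBO equals $-\mathrm{KL}(q_\phi \| \Pi)$, so it is maximized at zero exactly when the flow reproduces the target.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_joint(theta):
    """Stand-in normalized target: log-density of N(2, 1)."""
    return -0.5 * (theta - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q0(theta0):
    """Standard normal base log-density."""
    return -0.5 * theta0 ** 2 - 0.5 * np.log(2 * np.pi)

def elbo_estimate(a, b, n=200_000):
    """Monte Carlo ELBO: sample the base, push samples through the
    affine flow T(theta0) = a*theta0 + b (the reparametrization in
    (2.4)), and compute log q via the change of variables in (2.3),
    where log|det J_T| = log|a|."""
    theta0 = rng.normal(size=n)
    theta = a * theta0 + b
    log_q = log_q0(theta0) - np.log(abs(a))
    return np.mean(log_joint(theta) - log_q)

# At a = 1, b = 2 the flow reproduces the target, so the ELBO is ~0 ...
assert abs(elbo_estimate(1.0, 2.0)) < 1e-2
# ... and any other flow parameters give a strictly lower ELBO.
assert elbo_estimate(0.5, 0.0) < elbo_estimate(1.0, 2.0)
```

In real FAVI the gradient of this Monte Carlo objective with respect to the flow parameters $\phi = (a, b)$ would be followed by an optimizer; the point here is only that (2.3) and (2.4) reduce the ELBO to an expectation over the fixed base distribution.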
We simplify the maximization of the ELBO in (2.2) as follows:

$$\max_{q_\phi \in \mathcal{Q}} \mathbb{E}_{q_\phi(\theta)}\left[\ln p(D \mid \theta)\pi(\theta) - \ln q_\phi(\theta)\right]$$

$$= \max_{q_\phi \in \mathcal{Q}} \left\{ \mathbb{E}_{q_0(\theta_0)}\left[\ln p(D \mid \theta_S)\pi(\theta_S)\right] + \mathbb{E}_{q_0(\theta_0)}\left[\sum_{s=1}^{S} \ln \left\lvert \det \frac{\partial T_s}{\partial \theta_{s-1}} \right\rvert\right] - \mathbb{E}_{q_0(\theta_0)}\left[\ln q_0(\theta_0)\right] \right\} \qquad (2.5)$$

$$= \max_{q_\phi \in \mathcal{Q}} \left\{ \mathbb{E}_{q_0(\theta_0)}\left[\ln p(D \mid \theta_S)\pi(\theta_S)\right] + \mathbb{E}_{q_0(\theta_0)}\left[\sum_{s=1}^{S} \ln \left\lvert \det \frac{\partial T_s}{\partial \theta_{s-1}} \right\rvert\right] \right\} \qquad (2.6)$$

Equations (2.3) and (2.4) jointly imply (2.5); we are essentially re-parametrizing the expectation in terms of the base distribution $q_0$. In (2.6), we are able to drop $\mathbb{E}_{q_0(\theta_0)}\left[\ln q_0(\theta_0)\right]$ because it is free of the parameter $\phi$. In practice, optimizing over $q_\phi \in \mathcal{Q}$ effectively becomes optimizing over the parameters $\phi$ of the transformations $(T_s)_{s=1}^{S}$. We will sometimes refer to $\phi$ as the flow parameters.

In general, for $p$-dimensional latent variables $\theta$, calculating the determinant of the Jacobian $\det\left(\frac{\partial T_s}{\partial \theta_{s-1}}\right)$ takes $O(p^3)$ time [28]. Therefore, in addition to $T_1, T_2, \dots, T_S$ being diffeomorphisms, they are often selected so that the computational complexity of calculating this determinant is $O(p)$.

There are myriad ways in which we can choose the Normalizing Flow transformations. Intuitively, if we choose the $T_s$ to be deep neural networks, we should be able to approximate almost any well-behaved function. But how do we ensure computational feasibility? Neural Auto-regressive Flows (NAF) were proposed in [16] as an attempt at achieving this balance between expressivity and computational feasibility. NAF satisfy the "universal approximation property": they can approximate any probability distribution within an arbitrarily small error margin, in the weak convergence sense, provided the widths of the neural network transformations used in the flow are large enough. Further, the auto-regressive structure of these flows ensures the Jacobian determinants can be computed in $O(p)$ time. Note that this is just one among many families of Normalizing Flows.
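The $O(p)$ claim can be seen directly: an auto-regressive map has a triangular Jacobian, so its log-determinant is just the sum of the log-diagonal entries. The sketch below (using an arbitrary random triangular matrix as a stand-in for a flow's Jacobian) checks this shortcut against numpy's generic $O(p^3)$ determinant computation.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6

# A lower-triangular Jacobian, as produced by an auto-regressive map:
# entry (i, j) is zero whenever j > i.
J = np.tril(rng.normal(size=(p, p)))
J[np.diag_indices(p)] = rng.uniform(0.5, 2.0, size=p)  # positive diagonal

# O(p): log|det J| for a triangular matrix is the sum of log-diagonals ...
fast = np.sum(np.log(np.abs(np.diag(J))))

# ... which matches the generic O(p^3) computation.
sign, slow = np.linalg.slogdet(J)
assert sign == 1.0
assert abs(fast - slow) < 1e-10
```

This is exactly why families such as NAF restrict each output coordinate to depend only on earlier input coordinates: the restriction buys a triangular Jacobian, and with it the $O(p)$ log-determinant needed inside the ELBO of (2.6).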
Given these properties, we choose to use NAF for our examples in Section 2.6.

2.5 Neural Auto-regressive Flows

Auto-regressive flows are among the most popular Normalizing Flows discussed in the literature. We discuss some of the principles behind auto-regressive Normalizing Flows. We concentrate on describing NAF, since we use these for the examples in which we contrast Normalizing Flows aided VI, MCMC, and MF-VI.

Continuing with similar notation, we denote the input from the base distribution by $\theta^0 = (\theta^0_1, \theta^0_2, \dots, \theta^0_p)$ and the transformed latent variable by $\theta^1 = (\theta^1_1, \theta^1_2, \dots, \theta^1_p)$. For any vector $\theta^s \in \mathbb{R}^p$, we let $\theta^s_{i:j} = (\theta^s_i, \theta^s_{i+1}, \dots, \theta^s_j)$ be the sub-vector of $\theta^s$ running from the $i$-th to the $j$-th element, where $1 \le i \le j \le p$. Auto-regressive flows are constructed such that each transformed variable $\theta^1_i$, $1 \le i \le p$, depends only on the first $i$ elements $\theta^0_{1:i}$ of the input. More specifically, the transformation $T$ is made up of $p$-many diffeomorphisms $\tau_1, \tau_2, \dots, \tau_p$ such that:

$$\theta^1_1 = \tau_1(\theta^0_1;\, c_1)$$
$$\theta^1_i = \tau_i\left(\theta^0_i;\, c_i(\theta^0_{1:i-1})\right), \qquad 2 \le i \le p,$$

where $\tau_i$ is parameterized by the vector $c_i(\theta^0_{1:i-1})$. The functions $c_i : \mathbb{R}^{i-1} \to \mathbb{R}^m$, $2 \le i \le p$, are referred to as conditioners, and they enforce the auto-regressive property for the Normalizing Flow. See Figure 2.1 for a visualization of auto-regressive flows.

As the name suggests, NAF uses a neural network for $\tau_i$. The two types of transformations used are:
(i) Deep Sigmoidal Flow (DSF) - This neural network uses a single hidden layer.
(ii) Dense Deep Sigmoidal Flow (DDSF) - This uses a deep neural network.

Figure 2.1 Visualization of auto-regressive flows.

For readers who are unfamiliar with the topic, think of a neural network as a somewhat complex function that takes some inputs and applies a series of operations and transformations to them.
They generally involve multiplication of the inputs by weight matrices, translation, and the application of certain "activation" functions. The DSF network is formally defined as:

$$\theta^1_i = \sigma^{-1}\left( \mathbf{w}_i^{\top}\, \sigma(\mathbf{a}_i \cdot \theta^0_i + \mathbf{b}_i) \right), \qquad \mathbf{a}_i, \mathbf{w}_i, \mathbf{b}_i \in \mathbb{R}^K, \quad 1 \le i \le p.$$

Here $K$ is the number of nodes in the hidden layer and $\sigma(x) = 1/(1 + e^{-x})$ is an activation function (sigmoid activation). The parameters $\mathbf{w}_i$, $\mathbf{a}_i$, $\mathbf{b}_i$ are the outputs of conditioner networks $c^{w}_\phi(\cdot)$, $c^{a}_\phi(\cdot)$, $c^{b}_\phi(\cdot)$. Further, $\mathbf{a}_i$ and $\mathbf{w}_i$ are constrained so that $a_{i,j} > 0$ for all $i, j$, and $0 < w_{i,j} < 1$ with $\sum_j w_{i,j} = 1$. This ensures invertibility of $\tau_i$ [16]. Since the DDSF transformation leverages a