CONSISTENT BAYESIAN LEARNING FOR NEURAL NETWORK MODELS: THEORY AND COMPUTATION

By

Sanket Rajendra Jantre

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2022

ABSTRACT

CONSISTENT BAYESIAN LEARNING FOR NEURAL NETWORK MODELS: THEORY AND COMPUTATION

By

Sanket Rajendra Jantre

The Bayesian framework adapted for neural network learning, Bayesian neural networks, has received widespread attention and has been successfully applied to various applications. Bayesian inference for neural networks promises improved predictions with reliable uncertainty estimates, robustness, principled model comparison, and decision-making under uncertainty. In this dissertation, we propose novel theoretically consistent Bayesian neural network models and provide their computationally efficient posterior inference algorithms.

In Chapter 2, we introduce a Bayesian quantile regression neural network assuming an asymmetric Laplace distribution for the response variable. The normal-exponential mixture representation of the asymmetric Laplace density is utilized to derive a Gibbs sampling algorithm coupled with Metropolis-Hastings steps for the posterior inference. We establish the posterior consistency under a misspecified asymmetric Laplace density model. We illustrate the proposed method with simulation studies and real data examples.

Traditional Bayesian learning methods are limited in their scalability to large data and feature spaces due to expensive inference approaches; however, recent developments in variational inference techniques and sparse learning have brought renewed interest to this area. Sparse deep neural networks have proven to be efficient for predictive model building in large-scale studies. Although several works have studied theoretical and numerical properties of sparse neural architectures, they have primarily focused on edge selection. In Chapter 3, we propose a sparse Bayesian technique using a spike-and-slab Gaussian prior to allow for automatic node selection. The spike-and-slab prior alleviates the need for an ad-hoc thresholding rule for pruning. In addition, we adopt a variational Bayes approach to circumvent the computational challenges of traditional Markov chain Monte Carlo implementation. In the context of node selection, we establish the variational posterior consistency together with the layer-wise characterization of prior inclusion probabilities. We empirically demonstrate that our proposed approach outperforms the edge selection method in computational complexity with similar or better predictive performance.

The structured sparsity (e.g., node sparsity) in deep neural networks provides low latency inference, higher data throughput, and reduced energy consumption. Alternatively, there is a vast and growing literature demonstrating the shrinkage efficiency and theoretical optimality in linear models of two sparse parameter estimation techniques: lasso and horseshoe. In Chapter 4, we propose structurally sparse Bayesian neural networks which systematically prune excessive nodes with (i) Spike-and-Slab Group Lasso, and (ii) Spike-and-Slab Group Horseshoe priors, and develop computationally tractable variational inference. We demonstrate the competitive performance of our proposed models compared to the Bayesian baseline models in prediction accuracy, model compression, and inference latency.
Deep neural network ensembles that appeal to model diversity have been used successfully to improve predictive performance and model robustness in several applications. However, most ensembling techniques require multiple parallel and costly evaluations and have been proposed primarily for deterministic models. In Chapter 5, we propose sequential ensembling of dynamic Bayesian neural subnetworks to generate a diverse ensemble in a single forward pass. The ensembling strategy consists of an exploration phase that finds high-performing regions of the parameter space and multiple exploitation phases that effectively exploit the compactness of the sparse model to quickly converge to different minima in the energy landscape corresponding to high-performing subnetworks, yielding diverse ensembles. We empirically demonstrate that our proposed approach surpasses the baselines of the dense frequentist and Bayesian ensemble models in prediction accuracy, uncertainty estimation, and out-of-distribution robustness. Furthermore, we found that our approach produced the most diverse ensembles compared to the approaches with a single forward pass and even compared to the approaches with multiple forward passes in some cases.

To Dr. Rahman for introducing me to the Bayesian way

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank the people who have extended their support and assistance throughout my Ph.D. journey. First, I would like to express my deepest gratitude to my advisors, Dr. Tapabrata Maiti and Dr. Shrijita Bhattacharya, for their support, guidance, and encouragement. Dr. Maiti’s exemplary commitment to research will continue to serve as a guide in my own academic pursuits. Thank you, Dr. Bhattacharya, for consistently providing me with hands-on help in my research.

I would also like to extend my sincere appreciation to the rest of my dissertation committee members, Dr. Yuehua Cui and Dr. Andrew Finley, for providing a careful review of my dissertation.

I would like to extend many thanks to my research collaborators, Dr. Sandeep Madireddy, Dr. Zichao Di, and Dr. Prasanna Balaprakash from Argonne National Laboratory for providing me the opportunity to work on developing practical models useful for scientific problems and allowing me to participate in exciting interdisciplinary projects at Argonne. Moreover, I want to thank Dr. Sandeep Madireddy and Dr. Prasanna Balaprakash for funding my research throughout the final year of my Ph.D. My sincere thanks to the Michigan State University Graduate School for awarding me the Dissertation Completion Fellowship during Summer 2022.

Lastly, I am grateful to my family. Thanks to my mother for her unequivocal support during all my studies. Thanks to my sisters, Smita and Sanchita, and brother-in-law, Sampanna, for always cheering me on throughout my dissertation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
  1.1 Neural Networks
    1.1.1 Feedforward Neural Networks
    1.1.2 Bayesian Neural Networks
  1.2 Markov Chain Monte Carlo Sampling
    1.2.1 Metropolis-Hastings Algorithm
    1.2.2 Gibbs Sampling Algorithm
  1.3 Variational Bayesian Inference
  1.4 Posterior Consistency
  1.5 Dissertation Outline
CHAPTER 2 BAYESIAN QUANTILE REGRESSION NEURAL NETWORKS
  2.1 Introduction
  2.2 Bayesian Quantile Regression
  2.3 Bayesian Quantile Regression Neural Networks
    2.3.1 Model
    2.3.2 Algorithm
  2.4 Theoretical Results
  2.5 Numerical Experiments
    2.5.1 Simulation Studies
    2.5.2 Real Data Examples
  2.6 Conclusion and Discussion
  APPENDICES
    APPENDIX A LEMMAS FOR POSTERIOR CONSISTENCY PROOF
    APPENDIX B POSTERIOR CONSISTENCY THEOREM PROOFS
CHAPTER 3 LAYER ADAPTIVE NODE SELECTION IN BAYESIAN NEURAL NETWORKS
  3.1 Introduction
  3.2 Nonparametric Modeling: Deep Learning Approach
  3.3 Spike-and-Slab Independent Gaussian Node Selection
    3.3.1 Model
    3.3.2 Algorithm
  3.4 Theoretical Results
  3.5 Numerical Experiments
    3.5.1 Simulation Study - I
    3.5.2 Simulation Study - II
    3.5.3 UCI Regression Datasets
    3.5.4 Image Classification Datasets
  3.6 Conclusion and Discussion
  APPENDICES
    APPENDIX A PROOFS OF SS-IG THEORETICAL RESULTS
    APPENDIX B ADDITIONAL NUMERICAL EXPERIMENTS DETAILS
CHAPTER 4 COMPACT BAYESIAN NEURAL NETWORKS WITH STRUCTURED SPARSITY
  4.1 Introduction
    4.1.1 Proposed Methods
  4.2 Structured Sparsity: Spike-and-Slab Hierarchical Priors
    4.2.1 Spike-and-Slab Group Lasso (SS-GL)
    4.2.2 Spike-and-Slab Group Horseshoe (SS-GHS)
    4.2.3 Algorithm and Computational Details
  4.3 Numerical Experiments
    4.3.1 MLP MNIST Classification
    4.3.2 LeNet-5-Caffe Experiments
    4.3.3 Residual Network Experiments
  4.4 Conclusion and Discussion
  APPENDIX
CHAPTER 5 SEQUENTIAL BAYESIAN NEURAL SUBNETWORK ENSEMBLES
  5.1 Introduction
    5.1.1 Related Work
  5.2 Sequential Bayesian Neural Subnetwork Ensembles
    5.2.1 Base Learner
    5.2.2 Sequential Ensembling and Bayesian Neural Subnetworks
  5.3 Numerical Experiments
  5.4 Sequential BNN Ensemble Analysis
    5.4.1 Function Space Analysis
    5.4.2 Dynamic Sparsity Learning
    5.4.3 Effect of Ensemble Size
  5.5 Conclusion and Discussion
  APPENDICES
    APPENDIX A REPRODUCIBILITY CONSIDERATIONS
    APPENDIX B OUT-OF-DISTRIBUTION EXPERIMENT RESULTS
    APPENDIX C EFFECT OF THE ENSEMBLE SIZE
    APPENDIX D EFFECT OF THE MONTE CARLO SAMPLE SIZE
    APPENDIX E EFFECT OF THE PERTURBATION FACTOR
    APPENDIX F EFFECT OF THE CYCLIC LEARNING RATE SCHEDULE
CHAPTER 6 EPILOGUE
  6.1 Summary
  6.2 Broader Impacts
  6.3 Future Research
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Simulation study I results
Table 2.2 Simulation study II results
Table 2.3 Real data applications results
Table 3.1 Simulation study II results
Table 3.2 UCI regression datasets results
Table 4.1 ResNet-20/CIFAR-10 and ResNet-32/CIFAR-10 experiments results
Table 5.1 ResNet-32/CIFAR10 experiment results
Table 5.2 ResNet-56/CIFAR100 experiment results
Table 5.3 Diversity metrics in ResNet-32/CIFAR-10 and ResNet-56/CIFAR100 experiments
Table B.1 OoD detection results in ResNet-32/CIFAR10 experiment
Table C.1 Ensemble size effect results in ResNet-32/CIFAR10 experiment
Table D.1 Monte Carlo sample size effect results in ResNet-32/CIFAR10 experiment
Table E.1 Perturbation factor effect results in ResNet-32/CIFAR10 experiment
Table E.2 Diversity metrics for models trained with different perturbation factors in ResNet-32/CIFAR-10 experiment
Table F.1 Cyclic learning rate schedules results in ResNet-32/CIFAR10 experiment
Table F.2 Diversity metrics for models trained with different cyclic learning rate schedules in ResNet-32/CIFAR10 experiment

LIST OF FIGURES

Figure 1.1 Single-layer neural network
Figure 1.2 ReLU and SiLU activations
Figure 2.1 Quantile loss function
Figure 3.1 Sparse neural network with node selection
Figure 3.2 Simulation study I results
Figure 3.3 MLP/MNIST and MLP/Fashion-MNIST experiments results
Figure 3.4 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results
Figure B.1 Simulation study I: additional experiment results
Figure B.2 MNIST experiment results for varying hidden layer widths
Figure 4.1 MNIST experiment results: motivation for group shrinkage priors over Gaussian prior
Figure 4.2 SS-GL penalty parameter choice experiment results
Figure 4.3 SS-GHS regularization constant choice experiment results
Figure 4.4 MLP/MNIST experiment results
Figure 4.5 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results
Figure 5.1 Training trajectories of base learners in ResNet32/CIFAR10 experiment
Figure 5.2 Dynamic sparsity and FLOPs curves
Figure 5.3 Predictive performance results of the base learners and the sequential ensembles as the ensemble size varies in ResNet32/CIFAR10 experiment
Figure F.1 Cyclic learning rate schedules

LIST OF ALGORITHMS

Algorithm 3.1 Variational inference in SS-IG Bayesian neural networks
Algorithm 4.1 Variational inference in SS-GL and SS-GHS Bayesian neural networks
Algorithm 5.1 Sequential Bayesian neural subnetwork ensemble (SeBayS) algorithm

CHAPTER 1

INTRODUCTION

Artificial neural networks (ANN) are biologically inspired predictive models involving computations and mathematics, which simulate the human-brain processes. Many of the recent successes of artificial intelligence, such as image and voice recognition and robotics, are powered by ANNs. However, ANNs still suffer from many fundamental issues from the perspective of statistical modeling. One of the major challenges is their ability to model the uncertainty, and hence build reliable and robust models, while capturing complex data dependencies and being computationally tractable.
Probabilistic approaches, and especially the systematic Bayesian framework, provide an exciting avenue to address this challenge. In this dissertation, we propose novel Bayesian neural network models which are theoretically consistent, along with their computationally efficient implementations for model inference.

In this chapter, we briefly introduce the main concepts which form a fundamental part of this dissertation: neural networks and their Bayesian counterpart (Section 1.1), Markov chain Monte Carlo sampling methods (Section 1.2), variational Bayesian inference (Section 1.3), and posterior consistency preliminaries (Section 1.4). In addition, we discuss some existing work that has had a significant impact on this field. Finally, we provide a brief outline of the rest of the chapters in Section 1.5.

1.1 Neural Networks

1.1.1 Feedforward Neural Networks

A feedforward neural network can approximate any continuous function f(.) : R^p → R arbitrarily well. A neural network tries to simulate the human brain, so it has many layers of "neurons" just like the neurons in our brain. Originating from the multi-layer perceptron (MLP) (Rosenblatt, 1961), which has multiple hidden layers compared to its single hidden layer counterpart, the perceptron (Rosenblatt, 1957), neural networks are getting deeper to be more accurate in approximating a continuous function.

Figure 1.1 Single-layer neural network (inputs I_1, ..., I_p, hidden nodes H_1, ..., H_k, and output O).

In Figure 1.1 we illustrate a single hidden layer neural network, or shallow neural network. The input to the hidden layer consists of a linear combination of the model inputs passed through a non-linear activation function. Let us consider the input vector x ∈ R^p. Then the output of the shallow neural network in Figure 1.1 is given by
$$\eta(x) = \beta_0 + \sum_{j=1}^{k} \beta_j \times \psi\Big(\gamma_{j0} + \sum_{h=1}^{p} \gamma_{jh} x_h\Big)$$
where γ_{jh} (j = 1, ..., k; h = 0, ..., p) are the input layer to hidden layer weights (including the intercept h = 0), β_j (j = 0, ..., k) are the hidden layer to output layer weights (including the intercept j = 0), and ψ(.) denotes the non-linear activation function.

The universal approximation theorem (Cybenko, 1989) states the approximation power of shallow neural networks.

Theorem 1.1.1 (Universal approximation theorem). Let ψ(.) be such that ψ(t) → 0 as t → −∞ and ψ(t) → 1 as t → ∞. Then for a continuous function f on [0, 1]^p and an arbitrary ϵ > 0, there exist k and parameters β_j (j = 0, ..., k) and γ_{jh} (j = 1, ..., k; h = 0, ..., p) such that
$$|f(x) - \eta(x)| < \epsilon, \quad \forall x \in [0, 1]^p.$$

The universal approximation capacity of neural networks, along with available computing power, explains the widespread use of deep learning nowadays. In this dissertation, we use the following non-linear activation functions.

• Sigmoid: ψ(x) = exp(x)/(1 + exp(x)).
• Rectified Linear Unit (ReLU): ψ(x) = x⁺ = max(x, 0).
• Sigmoid Linear Unit (SiLU or Swish): ψ(x) = x × Sigmoid(x).

ReLU is one of the most popular activation functions used in many deep neural architectures. However, it sometimes suffers from the dead ReLU problem, where ReLU neurons become inactive and only output 0 for any input (Lu et al., 2020). We encounter this problem ourselves in our spike-and-slab models, and there, instead of ReLU, we use the SiLU activation (Elfwing et al., 2018; Ramachandran et al., 2017), which unlike ReLU is smooth and nonmonotonic (Figure 1.2).

Figure 1.2 ReLU and SiLU activations.
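To make the shallow network η(x) and the activations above concrete, the following is a minimal NumPy sketch (not part of the dissertation's code base); the array shapes and the randomly drawn weights are purely illustrative assumptions.

```python
import numpy as np

# Activation functions listed above
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def silu(x):
    return x * sigmoid(x)

def shallow_net(x, beta0, beta, gamma0, gamma, psi=sigmoid):
    """Single-hidden-layer network: eta(x) = beta0 + sum_j beta_j * psi(gamma_j0 + gamma_j^T x).

    x      : (p,) input vector
    beta0  : scalar output bias
    beta   : (k,) hidden-to-output weights
    gamma0 : (k,) hidden-node biases
    gamma  : (k, p) input-to-hidden weights
    """
    hidden = psi(gamma0 + gamma @ x)   # (k,) hidden-node outputs
    return beta0 + beta @ hidden       # scalar network output

# Illustrative call with p = 3 inputs and k = 4 hidden nodes
rng = np.random.default_rng(0)
p, k = 3, 4
x = rng.uniform(size=p)
eta = shallow_net(x, beta0=0.1, beta=rng.normal(size=k),
                  gamma0=rng.normal(size=k), gamma=rng.normal(size=(k, p)))
print(eta)
```

Swapping `psi=relu` or `psi=silu` in the call changes only the hidden-layer nonlinearity, mirroring the activation choices discussed above.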
1.1.2 Bayesian Neural Networks

Bayesian neural networks (BNN) differ from deterministic neural networks in that their weights are assigned a probability distribution instead of a single value or point estimate. These probability distributions describe the uncertainty in the weights and can be used to estimate uncertainty in predictions. A neural network model can be viewed as a probabilistic model p(y|x, θ), where θ denotes the neural network weights. For classification, y is a set of classes and p(y|x, θ) is a categorical distribution. For regression, y is a continuous variable and p(y|x, θ) is a Gaussian distribution.

In the Bayesian framework, instead of optimizing over a single probabilistic model p(y|x, θ), we discover all likely models via posterior inference over the model parameters. First, we place a prior distribution p(θ) on the neural network weights θ. Bayes' rule provides the exact posterior distribution as follows,
$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{\int_{\theta} p(\mathcal{D}|\theta)\, p(\theta)\, d\theta} \qquad (1.1)$$
where p(D|θ) denotes the likelihood of D given the model parameters θ.

The main goal of the neural network is prediction on new inputs. Given the posterior in (1.1), we predict the label corresponding to a new example x_new by Bayesian model averaging:
$$p(y_{\mathrm{new}}|x_{\mathrm{new}}, \mathcal{D}) = \int p(y_{\mathrm{new}}|x_{\mathrm{new}}, \theta)\, p(\theta|\mathcal{D})\, d\theta$$
The key property distinguishing the Bayesian approach from the deterministic one is marginalization instead of optimization, where we represent solutions given by all settings of the parameters weighted by their posterior probabilities, rather than bet everything on a single setting of parameters (Wilson and Izmailov, 2020). Bayesian procedures adapted for deep learning have received widespread attention, and applications of BNNs are found in several fields, e.g., computer vision (Kendall and Gal, 2017), civil engineering (Bateni et al., 2007; Arangio and Bontempi, 2015), astronomy (Perreault Levasseur et al., 2017; Cobb et al., 2019), and medicine (Kwon et al., 2020; Beker et al., 2020).

1.2 Markov Chain Monte Carlo Sampling

Markov chain Monte Carlo (MCMC) methods have been used in several physics problems for many years and later have been widely applied to Bayesian statistical modeling (Neal, 1996). MCMC methods do not make any assumptions regarding the form of the distribution to be sampled, for instance whether a given distribution can be approximated by a Gaussian. Ideally, they are supposed to cover all the modes of a target distribution during sampling. However, the high computational complexity of MCMC methods is their major disadvantage in complex Bayesian models and large-scale datasets. In what follows, we describe two well-known MCMC sampling algorithms. The combination of these two algorithms is used in Chapter 2 for posterior inference in the BQRNN model.

1.2.1 Metropolis-Hastings Algorithm

One of the most common algorithms for sampling from the posterior p(θ|D) is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). In the Markov chain defined by the MH algorithm, the new state θ^(t+1) is generated given the current state θ^(t) by first sampling a candidate state θ* from a proposal density g_{θ^(t)} and subsequently accepting the proposed candidate state with probability
$$\begin{cases} \min\left\{ \dfrac{p(\theta^*|\mathcal{D})\, g_{\theta^*}(\theta^{(t)})}{p(\theta^{(t)}|\mathcal{D})\, g_{\theta^{(t)}}(\theta^*)},\ 1 \right\} & \text{if } p(\theta^{(t)}|\mathcal{D})\, g_{\theta^{(t)}}(\theta^*) > 0, \\ 1 & \text{otherwise.} \end{cases}$$
By taking a sufficient number of trial steps, all of the state space is explored and the MH algorithm ensures that the points are distributed according to the required target distribution. Typically, the proposal distribution is chosen to be symmetrical, satisfying the condition g_{θ'}(θ) = g_{θ}(θ'). Hence, the acceptance probability simplifies to min{p(θ*|D)/p(θ^(t)|D), 1}, which yields the so-called random walk Metropolis algorithm. A commonly used symmetrical proposal distribution is the Gaussian with density g_θ = N(θ, Σ), where Σ is a constant covariance matrix. In our BQRNN model, we use the normal proposal density for the parameters sampled using the MH algorithm.
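The random walk Metropolis step just described can be written down in a few lines; the following NumPy sketch is not taken from the dissertation's code, and the generic unnormalized log-target and the Gaussian proposal scale are illustrative assumptions.

```python
import numpy as np

def random_walk_metropolis(log_target, theta0, n_iter=5000, prop_scale=0.5, seed=0):
    """Random walk Metropolis sampler with a symmetric Gaussian proposal.

    log_target : function returning the log of the (unnormalized) target density
    theta0     : starting value (1-D array)
    prop_scale : standard deviation of the Gaussian proposal
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))
    log_p = log_target(theta)
    for t in range(n_iter):
        proposal = theta + prop_scale * rng.normal(size=theta.size)
        log_p_prop = log_target(proposal)
        # Symmetric proposal: accept with probability min{p(proposal)/p(theta), 1}
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
        samples[t] = theta
    return samples

# Example: sample from a standard bivariate normal target
draws = random_walk_metropolis(lambda th: -0.5 * np.sum(th**2), theta0=np.zeros(2))
print(draws.mean(axis=0), draws.std(axis=0))
```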
1.2.2 Gibbs Sampling Algorithm

The Gibbs algorithm was formally described by Geman and Geman (1984), and later Gelfand and Smith (1990) showed its potential in a wide variety of conventional statistical problems. In its basic version, Gibbs sampling is a special case of the MH algorithm. The point of Gibbs sampling is that, given a multivariate distribution over parameters θ = {θ_1, ..., θ_p}, it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution. This allows us to simulate a Markov chain in which θ^(t+1) is generated from θ^(t) as follows:
$$\begin{aligned}
&\text{Pick } \theta_1^{(t+1)} \text{ from the distribution of } \theta_1 \text{ given } \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)} \\
&\text{Pick } \theta_2^{(t+1)} \text{ from the distribution of } \theta_2 \text{ given } \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)} \\
&\quad\vdots \\
&\text{Pick } \theta_j^{(t+1)} \text{ from the distribution of } \theta_j \text{ given } \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_p^{(t)} \\
&\quad\vdots \\
&\text{Pick } \theta_p^{(t+1)} \text{ from the distribution of } \theta_p \text{ given } \theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{p-1}^{(t+1)}
\end{aligned}$$
The samples obtained using the above procedure for all the parameters together approximate their joint distribution. The marginal distribution of any subset of variables can be approximated by simply considering the samples for that subset of variables while ignoring the rest. The expected value of any variable can be approximated by averaging over all the samples obtained from the above procedure.

Although Gibbs sampling is easier to implement, it is only useful when the posterior distribution of one parameter conditional on given values of the other parameters has a known distributional form. For many conventional statistical problems, these conditional distributions are of standard forms, hence efficient Gibbs sampling procedures are implemented with ease. On the contrary, in neural networks, the conditional posterior distribution for the hidden layer weights in the network given values for the rest of the weights can be extremely messy, with multiple modes. This is the case we encounter for the hidden layer weights in our BQRNN model, where we implement a combination of the random walk MH and Gibbs sampling algorithms for posterior inference.
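The systematic-scan updates above are easiest to see on a toy target whose full conditionals are available in closed form. The sketch below, which is an illustration of my own and not a model from this dissertation, uses a bivariate normal with correlation ρ, for which each full conditional is N(ρ·other, 1 − ρ²).

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, seed=0):
    """Gibbs sampler for a bivariate normal with zero means, unit variances,
    and correlation rho. Each full conditional is N(rho * other, 1 - rho**2)."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    cond_sd = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        # Update theta1 given the current theta2, then theta2 given the new theta1
        theta1 = rng.normal(rho * theta2, cond_sd)
        theta2 = rng.normal(rho * theta1, cond_sd)
        samples[t] = (theta1, theta2)
    return samples

draws = gibbs_bivariate_normal()
print(np.corrcoef(draws.T)[0, 1])  # should be close to 0.8
```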
1.3 Variational Bayesian Inference

Although Markov chain Monte Carlo sampling is the gold standard for inference in Bayesian models, it is computationally inefficient (Izmailov et al., 2021). As an alternative, variational inference or variational approximation tends to be faster and scales well on complex Bayesian learning tasks with large datasets (Blei et al., 2017). Variational Bayesian (VB) learning recasts the sampling problem as an optimization problem minimizing the Kullback-Leibler (KL) divergence, which intuitively measures the dissimilarity between a surrogate distribution (called a variational distribution) q(θ) and the true posterior distribution p(θ|D) (Jordan et al., 1999).

Definition 1.3.1 (Kullback-Leibler (KL) divergence). For two probability measures P_1 and P_2 over a set X such that P_1 is absolutely continuous with respect to P_2, the KL divergence of P_1 with respect to P_2 is defined as
$$d_{KL}(P_1, P_2) = \int_{\mathcal{X}} \log\left(\frac{dP_1}{dP_2}\right) dP_1$$
where dP_1/dP_2 is the Radon-Nikodym derivative of P_1 with respect to P_2. If P_1 is not absolutely continuous with respect to P_2, then d_{KL}(P_1, P_2) = ∞.

As a first step in variational learning, we define a family of surrogate distributions, also called the variational family Q, which consists of distributions of simpler form than the true posterior (1.1) (e.g., a family of Gaussian distributions),
$$\mathcal{Q} = \{q(\theta, \nu) : q \text{ is a simple candidate distribution used for approximation}\},$$
where ν denotes the parameters of the variational distribution, also known as the variational parameters. For instance, if Q is a Gaussian family then ν includes the mean (µ) and standard deviation (σ) of a Gaussian candidate distribution. Once we select an appropriate variational family, VB infers the parameters of a distribution on the model parameters q(θ) that minimizes the Kullback-Leibler (KL) distance from the true posterior p(θ|D):
$$q^*(\theta) = \underset{q(\theta) \in \mathcal{Q}}{\operatorname{argmin}}\ d_{KL}(q(\theta), p(\theta|\mathcal{D})) \qquad (1.2)$$
We simplify d_{KL}(q(θ), p(θ|D)):
$$\begin{aligned}
d_{KL}(q(\theta), p(\theta|\mathcal{D})) &= E_{q(\theta)}[\log q(\theta) - \log p(\theta|\mathcal{D})] \\
&= -E_{q(\theta)}[\log p(\theta, \mathcal{D}) - \log q(\theta)] + \log m(\mathcal{D}) \\
&= -E_{q(\theta)}[\log p(\mathcal{D}|\theta)] + d_{KL}(q(\theta), p(\theta)) + \log m(\mathcal{D})
\end{aligned}$$
where m(D) is the marginal distribution of the data and is free of θ. Hence, the optimization problem in (1.2) is equivalent to minimizing the negative evidence lower bound (ELBO), which is defined as
$$\mathcal{L} = -E_{q(\theta)}[\log p(\mathcal{D}|\theta)] + d_{KL}(q(\theta), p(\theta)),$$
where the first term is the data-dependent cost widely known as the negative log-likelihood (NLL), and the second term is prior-dependent and serves as regularization. Since the direct optimization of this objective is computationally prohibitive, gradient descent methods are used (Kingma and Welling, 2014).
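To illustrate the two terms of the negative ELBO, here is a small NumPy sketch for a toy conjugate Gaussian model rather than a neural network; the data, the N(0, 1) prior, and the mean-field Gaussian variational family are assumptions made for illustration. In practice this objective would be minimized with stochastic gradient descent using the reparameterization trick (Kingma and Welling, 2014).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: prior theta ~ N(0, 1), likelihood y_i ~ N(theta, 1).
y = rng.normal(loc=2.0, scale=1.0, size=50)   # illustrative data

def neg_elbo(mu, sigma, y, n_mc=1000):
    """Monte Carlo estimate of L = -E_q[log p(D|theta)] + KL(q, prior)
    for the Gaussian variational family q(theta) = N(mu, sigma^2)."""
    # Reparameterized draws theta = mu + sigma * eps, eps ~ N(0, 1)
    theta = mu + sigma * rng.normal(size=n_mc)
    # Expected negative log-likelihood, averaged over the Monte Carlo draws
    log_lik = np.array([np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - t) ** 2) for t in theta])
    nll_term = -log_lik.mean()
    # Closed-form KL between N(mu, sigma^2) and the N(0, 1) prior
    kl_term = np.log(1.0 / sigma) + (sigma**2 + mu**2) / 2.0 - 0.5
    return nll_term + kl_term

# A variational mean near the data mean gives a smaller negative ELBO
print(neg_elbo(mu=np.mean(y), sigma=0.15, y=y), neg_elbo(mu=0.0, sigma=0.15, y=y))
```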
1.4 Posterior Consistency

In Bayesian analysis, one starts with a prior distribution (either informative or non-informative) on the model parameters and updates the knowledge of the model as the number of data observations grows, reflected in the posterior distribution. It is therefore important to know whether the posterior distribution concentrates on neighborhoods of the true data generating distribution as the data is collected indefinitely. This is known as the Bayesian consistency of the posterior distribution. Although it is an asymptotic property, consistency is one of the benchmarks, since the violation of consistency is clearly undesirable and one may have serious doubts about inferences based on an inconsistent posterior distribution.

Let P denote the prior distribution and {P_n(.|D_n)} denote a sequence of posterior distributions, where P_n(.|D_n) is the posterior distribution based on the n-th data sample. Then we define posterior consistency as follows.

Definition 1.4.1 (Posterior consistency). The sequence of posteriors is consistent at θ_0 if P_n(U|D_n) → 1 a.s. $P_{\theta_0}^{\infty}$ for all neighborhoods U of θ_0, where $P_{\theta_0}^{\infty}$ is the joint distribution of $\{D_i\}_{i=1}^{\infty}$ when θ_0 is the true value of θ.

Alternatively, let f_0(x) be the underlying density of X. Let E(Y|X = x) = ν_0(x) be the true regression function of Y given X, and let ν̂_n(x) be the estimated regression function.

Definition 1.4.2. ν̂_n(x) is asymptotically consistent for ν_0(x) if
$$\int \left(\hat{\nu}_n(x) - \nu_0(x)\right)^2 f_0(x)\, dx \xrightarrow{p} 0$$
where p above the arrow denotes convergence in probability.

In this frequentist sense, Funahashi (1989) and Hornik et al. (1989) have established that neural networks are asymptotically consistent by showing the existence of some neural network, ν̂_n(x), whose mean squared error with the true function, ν_0(x), converges to 0 in probability.

Lee (2000) showed that the posterior distribution for feedforward (single-layer) sigmoidal neural networks is consistent. Their proof of consistency embedded the problem in density function estimation, which uses bounds on the bracketing entropy to show that the posterior is consistent over Hellinger neighborhoods. Mathematically, let f_0(x, y) denote the true joint density function of the X and Y random variables, and let f(x, y) be the corresponding density function under the neural network model. The Hellinger distance between these two density functions is given by
$$d_H(f, f_0) = \sqrt{\int\int \left(\sqrt{f(x, y)} - \sqrt{f_0(x, y)}\right)^2 dx\, dy}$$

Definition 1.4.3. The posterior is asymptotically consistent for f_0 over ϵ-Hellinger neighborhoods ∀ϵ > 0 if
$$P(\{f : d_H(f, f_0) \le \epsilon\} \mid (X_1, Y_1), \ldots, (X_n, Y_n)) \xrightarrow{p} 1$$

We use this framework to establish the Bayesian posterior consistency in our BQRNN model in Chapter 2, where we combine the ideas provided by Lee (2000) and Sriram et al. (2013). Specifically, Sriram et al. (2013) established the posterior consistency of Bayesian quantile regression under a misspecified ALD model.

In variational Bayesian inference, Zhang and Gao (2020) studied contraction rates of the variational posterior distributions for nonparametric and high-dimensional inference. They provided the conditions on the prior, the likelihood, and the variational family that characterize the contraction rates. Similar to the "prior mass and testing" conditions considered in the past literature (Schwartz, 1965), they found the contraction rate to be the sum of two terms. The first term stands for the contraction rate of the true Bayesian posterior distribution, and the second term is contributed by the variational approximation error. Bhattacharya and Maiti (2021) used the framework provided by Zhang and Gao (2020) and established the conditions needed for the variational posterior consistency of single-layer Bayesian neural networks. They establish that a simple Gaussian mean-field approximation is good enough to achieve the variational posterior consistency. In this direction, they show that an ϵ-Hellinger neighborhood of the true density function receives close to 1 probability under the variational posterior. We use a similar framework in our SS-IG model to establish the consistency of the variational posterior in Chapter 3.

1.5 Dissertation Outline

The main theme of this dissertation is the development of asymptotically consistent Bayesian neural network models tailored for different scenarios. Each chapter forms a separate manuscript which is either published or under review. For ease of readability, we separate the appendices of each chapter, providing them after the main content in each chapter.

Chapter 2 is based on our published paper titled "Quantile Regression Neural Networks: A Bayesian Approach"¹. In this chapter, we present the Bayesian quantile regression neural network (BQRNN) and provide a Metropolis-Hastings coupled with Gibbs sampling algorithm for posterior inference. We establish the posterior consistency in our proposed model and present a set of simulation examples as well as real-data applications demonstrating the efficacy of our method.
The proofs of the requisite lemmas and posterior consistency theorems are discussed in the chapter appendices.

Chapter 3 is based on our manuscript titled "Layer Adaptive Node Selection in Bayesian Neural Networks: Statistical Guarantees and Implementation Details"². In this chapter, we develop a spike-and-slab Gaussian node selection model and provide a variational algorithm for posterior inference. We derive the variational posterior consistency and its contraction rate for any generic shaped network structure. We measure the computational gains achieved by our approach using layer-wise node sparsities for shallow models and floating point operations in larger models. We also discuss the memory efficiency and computational speedup trade-off between the edge selection and node selection approaches during test time. The proofs of the lemmas required to establish the variational posterior consistency as well as additional numerical experiment details are presented in the chapter appendices.

Chapter 4 is based on our manuscript titled "Compact Bayesian Neural Networks with Structured Sparsity"³. In this chapter, we propose structurally sparse Bayesian neural networks using two distinct spike-and-slab prior setups, where the slab component uses hierarchical priors on the group of incoming weights on the neurons: (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group HorseShoe (SS-GHS). The chapter appendix discusses additional numerical experiment details.

Chapter 5 is based on our manuscript titled "Sequential Bayesian Neural Subnetwork Ensembles"⁴. In this chapter, we propose a sequential ensembling strategy for Bayesian neural networks (BNNs) which learns multiple subnetworks in a single forward pass. We combine the strengths of the automated sparsity-inducing spike-and-slab prior that allows dynamic pruning during training, which produces structurally sparse BNNs, and the proposed sequential ensembling strategy to efficiently generate diverse and sparse Bayesian neural networks. Reproducibility considerations and additional numerical experiment details are presented in the chapter appendices.

In Chapter 6, we summarize the work we have done in this dissertation and discuss likely future methodological and theoretical extensions of our current work. We also provide fully documented Python codes that reproduce all the results in this dissertation and can be easily modified and used by practitioners in chapter-specific public repositories⁵.

¹ Adapted by permission from Springer Nature: Journal of Statistical Theory and Practice (Jantre et al., 2021b), License No. 5352561465807.
² The revision is currently under review (Jantre et al., 2021a).
³ The manuscript is currently under preparation.
⁴ The manuscript is currently under revision (Jantre et al., 2022). This work is a collaborative work with my Argonne mentors, Dr. Madireddy and Dr. Balaprakash.
⁵ https://github.com/jsanket123

CHAPTER 2

BAYESIAN QUANTILE REGRESSION NEURAL NETWORKS

2.1 Introduction

Quantile regression (QR), proposed by Koenker and Basset (1978), models conditional quantiles of the dependent variable as a function of the covariates. The method supplements the least squares regression and provides a more comprehensive picture of the entire conditional distribution. This is particularly useful when the relationships in the lower and upper tail areas are of greater interest.
Quantile regression has been extensively used in a wide array of fields such as economics, finance, climatology, and medical sciences, among others (Koenker, 2005). Quantile regression estimation requires specialized algorithms and reliable estimation techniques, which are available in both the frequentist and Bayesian literature. Frequentist techniques include the simplex algorithm (Dantzig, 1963) and the interior point algorithm (Karmarkar, 1984), whereas a Bayesian technique using Markov chain Monte Carlo (MCMC) sampling was first proposed by Yu and Moyeed (2001). Their approach employed the asymmetric Laplace distribution (ALD) for the response variable, which connects to the frequentist quantile estimate, since its maximum likelihood estimates are equivalent to quantile regression using the check-loss function (Koenker and Machado, 1999). Recently, Kozumi and Kobayashi (2011) proposed a Gibbs sampling algorithm, where they exploit the normal-exponential mixture representation of the asymmetric Laplace distribution, which considerably simplified the computation for Bayesian quantile regression models.

Artificial neural networks are helpful in estimating possibly non-linear models without specifying an exact functional form. The neural networks which are most widely used in engineering applications are the single hidden-layer feedforward neural networks. These networks consist of a set of inputs X, which are connected to each of k hidden nodes, which, in turn, are connected to an output layer (O). In a typical single layer feedforward neural network, the outputs are computed as
$$O_i = b_0 + \sum_{j=1}^{k} b_j\, \psi\Big(c_{j0} + \sum_{h=1}^{p} X_{ih} c_{jh}\Big)$$
where c_{jh} is the weight from input X_{ih} to the hidden node j. Similarly, b_j is the weight associated with the hidden unit j. The c_{j0} and b_0 are the biases for the hidden nodes and the output unit. The function ψ(.) is a nonlinear activation function. Some common choices of ψ(.) are the sigmoid and the hyperbolic tangent functions.

The interest in neural networks is motivated by the universal approximation capability of feedforward neural networks (FNNs) (Cybenko, 1989; Funahashi, 1989; Hornik et al., 1989). According to these authors, standard feedforward neural networks with as few as one hidden layer, whose output functions are sigmoid functions, are capable of approximating any continuous function to a desired level of accuracy if the number of hidden layer nodes is sufficiently large. Taylor (2000) introduced a practical implementation of quantile regression neural networks (QRNN) to combine the approximation ability of neural networks with the robust nature of quantile regression. Several variants of QRNN have been developed, such as composite QRNN, where neural networks are extended to linear composite quantile regression (Xu et al., 2017); later Cannon (2018) introduced monotone composite QRNN, which guaranteed the non-crossing of regression quantiles.

Bayesian neural network learning models find the predictive distributions for the target values in a new test case given the inputs for that case as well as inputs and targets for the training cases. The early work of Buntine and Weigend (1991) and Mackay (1992) has inspired widespread research in Bayesian neural network models. Their work implemented Bayesian learning using Gaussian approximations. Later, Neal (1996) applied Hamiltonian Monte Carlo in Bayesian statistical applications. Further, sequential Monte Carlo techniques applied to neural network architectures are described in De Freitas et al. (2001).
A detailed review of MCMC algorithms applied to neural networks is presented by Titterington (2004). Although Bayesian neural networks have been widely developed in the context of mean regression models, there has been limited or no work available on their development in connection to quantile regression, both from a theoretical and an implementation standpoint. We also note that the literature on MCMC methods applied to neural networks is somewhat limited due to several challenges, including lack of parameter identifiability, high computational costs, and convergence failures (Papamarkou et al., 2022).

Our contributions: In this work, we develop the statistical framework for the Bayesian quantile regression neural network and study its properties from both a theoretical and a numerical standpoint. The natural advantage of our Bayesian procedure over the frequentist models is that we have a posterior variance for our conditional quantile estimates, which can be used for uncertainty quantification. The proposed Bayesian quantile regression neural network (BQRNN) uses a single hidden layer FNN with a sigmoid activation function and a linear output unit. On the numerical front, we have implemented the Bayesian procedure using Gibbs sampling combined with a random walk Metropolis-Hastings algorithm. We have shown that our model outperforms Bayesian quantile regression (BQR) in the nonlinear data setup while performing comparably in the linear data setup. We use mean squared error to provide empirical justification for the use of our proposed BQRNN model on any given data.

Our theoretical development includes the establishment of posterior consistency, an essential property in nonparametric Bayesian statistics, which in turn provides confidence in the use of Bayesian quantile regression neural network models across all disciplines. The posterior consistency of our method makes use of techniques from the works of Lee (2000) and Sriram et al. (2013). The former has shown posterior consistency in the context of Bayesian neural networks for mean models, while the latter has shown it in the case of Bayesian quantile regression. Following the framework of Lee (2000), we prove consistency of the posterior by using universal approximation properties of neural networks as discussed in Funahashi (1989), Hornik et al. (1989), and others. Analogous to the work of Lee (2000), our current work borrows several ideas for establishing consistency in the context of density estimation from Barron et al. (1999). Finally, to handle the case of ALD responses, we use the framework of Sriram et al. (2013), which provides a method for handling the ALD in the BQR scenario.

The rest of this chapter is organized as follows. Section 2.2 introduces quantile regression and its Bayesian formulation by establishing the relationship between quantile regression and the asymmetric Laplace distribution. In Section 2.3, we propose the Bayesian quantile regression neural network (BQRNN) model and the prior used in this study. Further, we detail our hierarchical BQRNN model and provide the MCMC procedure which couples Gibbs sampling with a random walk Metropolis-Hastings algorithm. Section 2.4 provides an overview of the posterior consistency results for our model. Section 2.5 presents simulation studies and real world applications. A brief discussion and conclusion is provided in Section 2.6. The proofs of the requisite lemmas and posterior consistency theorems are presented in Appendix A and Appendix B.
2.2 Bayesian Quantile Regression

Quantile regression (Koenker and Basset, 1978) offers a practically important alternative to mean regression by allowing inference about the conditional distribution of the response variable through modeling of its conditional quantiles. Let Y and X denote the response and the predictors respectively, let τ ∈ (0, 1) be the quantile level of the conditional distribution of Y, and let F(.) be the cumulative distribution function of Y; then a linear conditional quantile function of Y is denoted as follows
$$Q_\tau(y_i|X_i = x_i) \equiv F^{-1}(\tau) = x_i^T \beta(\tau), \quad i = 1, \ldots, n,$$
where β(τ) ∈ R^p is a vector of quantile-specific regression coefficients of length p. The aim of quantile regression is to estimate the conditional quantile function Q(.).

Let us consider the following linear model in order to formally define the quantile regression problem,
$$Y = X^T\beta(\tau) + \varepsilon, \qquad (2.1)$$
where ε is the error vector restricted to have its τth quantile equal to zero, i.e. $\int_{-\infty}^{0} f(\varepsilon_i)\, d\varepsilon_i = \tau$. The probability density of this error is often left unspecified in the classical literature. The estimation through quantile regression proceeds by minimizing the following objective function
$$\min_{\beta(\tau)\, \in\, \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau\big(y_i - x_i^T \beta(\tau)\big) \qquad (2.2)$$
where ρ_τ(.) is the check function or quantile loss function with the following form:
$$\rho_\tau(u) = u \cdot \{\tau - I(u < 0)\}, \qquad (2.3)$$
where I(.) is an indicator function, which is either 0 or 1 depending on whether it satisfies the given condition or not. This check function is not differentiable at zero; see Figure 2.1.

Figure 2.1 Quantile loss function.

Classical methods employ linear programming techniques such as the simplex algorithm, the interior point algorithm, or the smoothing algorithm to obtain quantile regression estimates for β(τ) (Madsen and Nielsen, 1993; Chen, 2007). The statistical programming language R makes use of the quantreg package (Koenker, 2017) to implement quantile regression techniques, whilst confidence intervals are obtained via the bootstrap (Koenker, 2005).

Median regression in the Bayesian setting has been considered by Walker and Mallick (1999) and Kottas and Gelfand (2001). In quantile regression, a link between maximum-likelihood theory and minimization of the sum of check functions, (2.2), is provided by the asymmetric Laplace distribution (ALD) (Koenker and Machado, 1999; Yu and Moyeed, 2001). This distribution has location parameter µ, scale parameter σ, and skewness parameter τ. Further details regarding the properties of this distribution are specified in Yu and Zhang (2005). If Y ∼ ALD(µ, σ, τ), then its probability density function is given by
$$f(y|\mu, \sigma, \tau) = \frac{\tau(1-\tau)}{\sigma} \exp\left\{-\rho_\tau\left(\frac{y - \mu}{\sigma}\right)\right\}$$
As discussed in Yu and Moyeed (2001), using the above skewed distribution for the errors provides a way to implement Bayesian quantile regression effectively. According to them, any reasonable choice of prior, even an improper prior, generates a posterior distribution for β(τ). Subsequently, they made use of a random walk Metropolis-Hastings algorithm with a Gaussian proposal density centered at the current parameter value to generate samples from the analytically intractable posterior distribution of β(τ). In the aforementioned approach, the acceptance probability depends on the choice of the value of τ, hence fine tuning of parameters like the proposal step size is necessary to obtain appropriate acceptance rates for each τ.
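The connection between the check loss (2.3) and the ALD working likelihood can be checked numerically. The short sketch below is an illustration of my own (the function names are mine, and the data vector is arbitrary): minimizing the summed check loss and maximizing the ALD log-likelihood pick out the same location, which for τ = 0.5 is the sample median.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - I(u < 0)) from (2.3)."""
    return u * (tau - (u < 0).astype(float))

def ald_pdf(y, mu, sigma, tau):
    """ALD density f(y|mu, sigma, tau) = tau(1-tau)/sigma * exp{-rho_tau((y-mu)/sigma)}."""
    return tau * (1.0 - tau) / sigma * np.exp(-check_loss((y - mu) / sigma, tau))

# The ALD likelihood is maximized where the summed check loss is minimized:
# for tau = 0.5 both criteria point to the sample median.
y = np.array([1.0, 2.0, 2.5, 4.0, 10.0])
grid = np.linspace(0.0, 12.0, 1201)
loss = np.array([check_loss(y - m, 0.5).sum() for m in grid])
loglik = np.array([np.log(ald_pdf(y, m, 1.0, 0.5)).sum() for m in grid])
print(grid[loss.argmin()], grid[loglik.argmax()], np.median(y))  # all approximately 2.5
```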
Kozumi and Kobayashi (2011) overcame this tuning limitation and showed that Gibbs sampling can be incorporated with the AL density represented as a mixture of normal and exponential distributions. Consider the linear model from (2.1), where ε_i ∼ ALD(0, σ, τ); then this model can be written as
$$y_i = x_i^T \beta(\tau) + \theta v_i + \kappa \sqrt{\sigma v_i}\, u_i, \quad i = 1, \ldots, n, \qquad (2.4)$$
where u_i and v_i are mutually independent, with u_i ∼ N(0, 1), v_i ∼ E(1/σ), and E(1/σ) is the exponential distribution with mean σ. The θ and κ constants in (2.4) are given by
$$\theta = \frac{1 - 2\tau}{\tau(1-\tau)} \quad \text{and} \quad \kappa = \sqrt{\frac{2}{\tau(1-\tau)}}$$
Consequently, a Gibbs sampling algorithm based on the normal distribution can be implemented effectively. Currently, the Brq (Alhamzawi, 2018) and bayesQR (Benoit et al., 2017) packages in R provide Gibbs samplers for Bayesian quantile regression. We employ the same technique to derive Gibbs sampling steps for all except the hidden layer node weight parameters in our Bayesian quantile regression neural network model.

2.3 Bayesian Quantile Regression Neural Networks

2.3.1 Model

In this work, we focus on feedforward neural networks with a single hidden layer of units with logistic activation functions and a linear output unit. Consider the univariate response variable Y_i and the covariate vector X_i (i = 1, 2, ..., n). Further, denote the number of covariates by p and the number of hidden nodes by k, which is allowed to vary as a function of n. Denote the input weights by γ_{jh} and the output weights by β_j. Let τ ∈ (0, 1) be the quantile level of the conditional distribution of Y_i given X_i and keep it fixed. Then, the resulting conditional quantile function is denoted as follows
$$Q_\tau(y_i|X_i = x_i) = \beta_0 + \sum_{j=1}^{k} \beta_j \frac{1}{1 + \exp\big(-\gamma_{j0} - \sum_{h=1}^{p} \gamma_{jh} x_{ih}\big)} = \beta_0 + \sum_{j=1}^{k} \beta_j \psi(x_i^T \gamma_j) = \beta^T \eta_i(\gamma) = L_i \beta \qquad (2.5)$$
where β = (β_0, ..., β_k)^T, x_i = (1, x_{i1}, ..., x_{ip})^T, η_i(γ) = (1, ψ(x_i^T γ_1), ..., ψ(x_i^T γ_k))^T, and L = (η_1(γ), ..., η_n(γ))^T, i = 1, ..., n. ψ(.) is the logistic activation function.

The specified model for Y_i conditional on X_i = x_i is given by Y_i ∼ ALD(L_i β, σ, τ) with a likelihood proportional to
$$\sigma^{-n} \exp\left\{ -\sum_{i=1}^{n} \frac{|\varepsilon_i| + (2\tau - 1)\varepsilon_i}{2\sigma} \right\} \qquad (2.6)$$
where ε_i = y_i − L_i β. The above ALD-based likelihood can be represented as a location-scale mixture of normals (Kozumi and Kobayashi, 2011). For any a, b > 0, we have the following equality (Andrews and Mallows, 1974)
$$\exp\{-|ab|\} = \int_0^\infty \frac{a}{\sqrt{2\pi v}} \exp\left\{ -\frac{1}{2}\big(a^2 v + b^2 v^{-1}\big) \right\} dv$$
Letting $a = 1/\sqrt{2\sigma}$, $b = \varepsilon/\sqrt{2\sigma}$, and multiplying by $\exp\{-(2\tau - 1)\varepsilon/2\sigma\}$, (2.6) becomes
$$\sigma^{-n} \exp\left\{ -\sum_{i=1}^{n} \frac{|\varepsilon_i| + (2\tau - 1)\varepsilon_i}{2\sigma} \right\} = \prod_{i=1}^{n} \int_0^\infty \frac{1}{\sigma}\frac{1}{\sqrt{4\pi\sigma v_i}} \exp\left\{ -\frac{(\varepsilon_i - \xi v_i)^2}{4\sigma v_i} - \zeta v_i \right\} dv_i \qquad (2.7)$$
where ξ = (1 − 2τ) and ζ = τ(1 − τ)/σ. (2.7) is beneficial in the sense that there is no need to worry about the prior distribution of v_i, as it is extracted in the same equation. The prior of v_i in (2.7) is an exponential distribution with mean ζ^{-1}, and it depends on the value of τ.

Further, we observe that the output of the aforementioned neural network remains unchanged under a set of transformations, like certain weight permutations and sign flips, which renders the neural network non-identifiable. For example, in the above model (2.5), take p, k = 2 and β_0, γ_{j0} = 0. Then,
$$\sum_{j=1}^{2} \beta_j \psi(x_i^T \gamma_j) = \beta_1 [1 + \exp(-\gamma_{11} x_{i1} - \gamma_{12} x_{i2})]^{-1} + \beta_2 [1 + \exp(-\gamma_{21} x_{i1} - \gamma_{22} x_{i2})]^{-1}$$
In the foregoing equation we can notice that when β_1 = β_2, two different sets of values of (γ_{11}, γ_{12}, γ_{21}, γ_{22}) obtained by flipping the signs, namely (1, 2, -1, -2) and (-1, -2, 1, 2), result in the same value for $\sum_{j=1}^{2} \beta_j \psi(x_i^T \gamma_j)$. However, as a special case of Lemma 1 of Ghosh et al. (2000), the joint posterior of the parameters is proper if the joint prior is proper, even in the case of posterior invariance under the parameter transformations. Note that, as long as the interest is in prediction rather than parameter estimation, this property is sufficient for predictive model building. In this work, we focus only on proper priors, hence the non-identifiability of the parameters in (2.5) does not cause any problem.
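The normal-exponential mixture representation (2.4) used above can be sanity-checked by simulation. The following sketch, which is my own illustration and not the dissertation's code, draws errors via the mixture and verifies that their τ-th quantile is approximately zero, which is exactly the restriction placed on ε in the quantile regression model.

```python
import numpy as np

def ald_errors_via_mixture(n, sigma, tau, seed=0):
    """Draw ALD(0, sigma, tau) errors using the normal-exponential mixture (2.4):
    eps = theta * v + kappa * sqrt(sigma * v) * u,  u ~ N(0,1),  v ~ Exp(mean sigma)."""
    rng = np.random.default_rng(seed)
    theta = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))
    kappa = np.sqrt(2.0 / (tau * (1.0 - tau)))
    v = rng.exponential(scale=sigma, size=n)   # exponential with mean sigma
    u = rng.normal(size=n)
    return theta * v + kappa * np.sqrt(sigma * v) * u

# The tau-th empirical quantile of the generated errors should be close to zero.
for tau in (0.25, 0.5, 0.9):
    eps = ald_errors_via_mixture(200_000, sigma=1.0, tau=tau, seed=1)
    print(tau, np.quantile(eps, tau))
```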
2.3.2 Algorithm

We take mutually independent priors for β, γ_1, ..., γ_k with β ∼ N(β_0, σ_0² I_{k+1}) and γ_j ∼ N(γ_{j0}, σ_1² I_{p+1}), j = 1, ..., k. Further, we take an inverse gamma prior for σ such that σ ∼ IG(a/2, b/2). Prior selection is problem specific, and it is useful to elicit the chosen prior from historical knowledge. However, for most practical applications, such information is not readily available. Furthermore, neural networks are commonly applied to big data for which a priori knowledge regarding the data as well as about the neural network parameters is not typically known. Hence, prior elicitation from experts in the area is not applicable to neural networks in practice. As a consequence, it seems reasonable to use near-diffuse priors for the parameters of the given model.

Now, the joint posterior for β, γ, σ, v given y is
$$\begin{aligned}
f(\beta, \gamma, \sigma, v|y) &\propto l(y|\beta, \gamma, \sigma, v)\, \pi(\beta)\, \pi(\gamma)\, \pi(\sigma) \\
&\propto \left(\frac{1}{\sigma}\right)^{\frac{3n}{2}} \left(\prod_{i=1}^{n} v_i\right)^{-\frac{1}{2}} \exp\left\{ -\frac{(y - L\beta - \xi v)^T V (y - L\beta - \xi v)}{4\sigma} - \frac{\tau(1-\tau)}{\sigma}\sum_{i=1}^{n} v_i \right\} \\
&\quad \times \exp\left\{ -\frac{1}{2\sigma_0^2}(\beta - \beta_0)^T(\beta - \beta_0) \right\} \times \exp\left\{ -\frac{1}{2\sigma_1^2}\sum_{j=1}^{k}(\gamma_j - \gamma_{j0})^T(\gamma_j - \gamma_{j0}) \right\} \\
&\quad \times \left(\frac{1}{\sigma}\right)^{\frac{a}{2}+1} \exp\left\{ -\frac{b}{2\sigma} \right\}
\end{aligned}$$
where V = diag(v_1^{-1}, v_2^{-1}, ..., v_n^{-1}). A Gibbs sampling algorithm is used to generate samples from the analytically intractable posterior distribution f(β, γ|y). Some of the full conditionals required in this procedure are available only up to unknown normalizing constants, and we use a random walk Metropolis-Hastings algorithm to sample from these full conditional distributions. The full conditional distributions are as follows:

(a) π(β|γ, σ, v, y)
$$\sim N\left[ \left(\frac{L^T V L}{2\sigma} + \frac{I}{\sigma_0^2}\right)^{-1} \left(\frac{L^T V (y - \xi v)}{2\sigma} + \frac{\beta_0}{\sigma_0^2}\right),\ \left(\frac{L^T V L}{2\sigma} + \frac{I}{\sigma_0^2}\right)^{-1} \right]$$

(b) π(γ_j|β, σ, v, y)
$$\propto \exp\left\{ -\frac{1}{4\sigma}(y - L\beta - \xi v)^T V (y - L\beta - \xi v) \right\} \times \exp\left\{ -\frac{1}{2\sigma_1^2}(\gamma_j - \gamma_{j0})^T(\gamma_j - \gamma_{j0}) \right\}$$

(c) π(σ|γ, β, v, y)
$$\sim IG\left( \frac{3n + a}{2},\ \frac{1}{4}(y - L\beta - \xi v)^T V (y - L\beta - \xi v) + \tau(1-\tau)\sum_{i=1}^{n} v_i + \frac{b}{2} \right)$$

(d) π(v_i|γ, β, σ, y)
$$\sim GIG(\nu, \rho_1, \rho_2), \quad \text{where } \nu = \frac{1}{2},\ \rho_1^2 = \frac{(y_i - L_i\beta)^2}{2\sigma},\ \text{and } \rho_2^2 = \frac{1}{2\sigma}.$$

The generalized inverse Gaussian distribution is defined as follows: if x ∼ GIG(ν, ρ_1, ρ_2), then the probability density function of x is given by
$$f(x|\nu, \rho_1, \rho_2) = \frac{(\rho_2/\rho_1)^{\nu}}{2K_{\nu}(\rho_1\rho_2)}\, x^{\nu-1} \exp\left\{ -\frac{1}{2}\big(x^{-1}\rho_1^2 + x\rho_2^2\big) \right\},$$
where x > 0, −∞ < ν < ∞, ρ_1, ρ_2 ≥ 0, and K_ν(.) is a modified Bessel function of the third kind (see Barndorff-Nielsen and Shephard (2001)).
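As a concrete illustration of one of these updates, the following NumPy sketch draws β from its Gaussian full conditional in step (a). It is a minimal sketch of my own, not the dissertation's implementation, and the inputs below (L, y, v, and the prior settings) are randomly generated placeholders rather than real MCMC state.

```python
import numpy as np

def sample_beta(L, y, v, sigma, tau, beta0_prior, sigma0_sq, rng):
    """One Gibbs draw of beta from its Gaussian full conditional (step (a)).

    L           : (n, k+1) hidden-layer output matrix (leading column of ones)
    y, v        : (n,) responses and latent exponential variables
    sigma       : ALD scale parameter
    beta0_prior : prior mean vector of beta; sigma0_sq : prior variance of beta
    """
    xi = 1.0 - 2.0 * tau
    V = np.diag(1.0 / v)                                   # V = diag(1/v_i)
    prec = L.T @ V @ L / (2.0 * sigma) + np.eye(L.shape[1]) / sigma0_sq
    cov = np.linalg.inv(prec)
    mean = cov @ (L.T @ V @ (y - xi * v) / (2.0 * sigma) + beta0_prior / sigma0_sq)
    return rng.multivariate_normal(mean, cov)

# Illustrative call with placeholder quantities
rng = np.random.default_rng(0)
n, k = 100, 5
L = np.column_stack([np.ones(n), rng.uniform(size=(n, k))])
y, v = rng.normal(size=n), rng.exponential(size=n)
print(sample_beta(L, y, v, sigma=1.0, tau=0.5,
                  beta0_prior=np.zeros(k + 1), sigma0_sq=100.0, rng=rng))
```

Within the full sampler, this draw would alternate with the random walk MH updates of the γ_j in step (b) and the conjugate draws of σ and the v_i in steps (c) and (d).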
Unlike parsimonious parametric models, Bayesian nonparametric models require additional statistical justification for their theoretical validity. For that reason, we are going to establish the asymptotic consistency of the posterior distribution derived in our proposed neural network model.

2.4 Theoretical Results

Let (x_1, y_1), ..., (x_n, y_n) be the given data and let f_0(x) be the underlying density of X. Let Q_τ(y|X = x) = µ_0(x) be the true conditional quantile function of Y given X, and let µ̂_n(x) be the estimated conditional quantile function.

Definition 2.4.1. µ̂_n(x) is asymptotically consistent for µ_0(x) if
$$\int |\hat{\mu}_n(x) - \mu_0(x)|\, f_0(x)\, dx \xrightarrow{p} 0$$
We are essentially making use of Markov's inequality to ultimately show that $\hat{\mu}_n(X) \xrightarrow{p} \mu_0(X)$. In a similar frequentist sense, Funahashi (1989) and Hornik et al. (1989) have shown the asymptotic consistency of neural networks for mean-regression models by showing the existence of some neural network, µ̂_n(x), whose mean squared error with the true function, µ_0(x), converges to 0 in probability.

We consider the notion of posterior consistency for Bayesian non-parametric problems, which is quantified by concentration around the true density function (see Wasserman (1998), Barron et al. (1999)). This boils down to the above definition of consistency on the conditional quantile functions. The main idea is that the density functions deal with the joint distribution of X and Y, while the conditional quantile function deals with the conditional distribution of Y given X. This conditional distribution can then be used to construct the joint distribution by assuming certain regularity conditions on the distribution of X. This allows the use of some techniques developed in the density estimation field. Some of the ideas presented here can be found in Lee (2000), which developed the consistency results for non-parametric regression using single hidden-layer feedforward neural networks.

Let P(.|(X_1, Y_1), ..., (X_n, Y_n)) denote the posterior distribution of the parameters. Let f(x, y) and f_0(x, y) denote the joint density function of x and y under the model and the truth respectively. Indeed, one can construct the joint density f(x, y) from the conditional density f(y|x) by taking f(x, y) = f(y|x)f(x), where f(x) denotes the underlying density of X. Since one is only interested in f(y|x) and X is ancillary to the estimation of f(y|x), one can use some convenient distribution for f(x). Similar to Lee (2000), we define Hellinger neighborhoods of the true density function f_0(x, y) = f_0(y|x)f_0(x), which allows us to quantify the consistency of the posterior. The Hellinger distance between f_0 and any joint density function f of x and y is defined as follows.
$$D_H(f, f_0) = \sqrt{\int\int \left(\sqrt{f(x, y)} - \sqrt{f_0(x, y)}\right)^2 dx\, dy} \qquad (2.8)$$
Based on (2.8), an ϵ-sized Hellinger neighborhood of the true density function f_0 is given by
$$A_\epsilon = \{f : D_H(f, f_0) \le \epsilon\} \qquad (2.9)$$

Definition 2.4.2 (Posterior Consistency). Suppose (X_i, Y_i) ∼ f_0. The posterior is asymptotically consistent for f_0 over Hellinger neighborhoods if ∀ϵ > 0,
$$P(A_\epsilon|(X_1, Y_1), \ldots, (X_n, Y_n)) \xrightarrow{p} 1$$
i.e. the posterior probability of any Hellinger neighborhood of f_0 converges to 1 in probability.

Similar to Lee (2000), we prove the asymptotic consistency of the posterior for neural networks with the number of hidden nodes, k, being a function of the sample size, n. This sequence of models indexed by increasing sample size is called a sieve. We take a sequence of priors, {π_n}, where each π_n is defined for a neural network with k_n hidden nodes in it. The predictive density (Bayes estimate of f) is then given by
$$\hat{f}_n(\cdot) = \int f(\cdot)\, dP(f|(X_1, Y_1), \ldots, (X_n, Y_n)) \qquad (2.10)$$
, (Xn , Yn )) (2.10) 23 Let µ0 (x) = Qτ,f0 (Y |X = x) be the true conditional quantile function and let µ̂n (x) = Qτ,fˆn (Y |X = x) be the posterior predictive conditional quantile function using a neural network. For notational convenience we are going to drop x and denote these functions as µ0 and µ̂n occasionally. The following is the key result in this case. Theorem 2.4.3. Let the prior for the regression parameters, πn , be an independent normal with mean 0 and variance σ02 (fixed) for each of the parameters in the neural network. Suppose that the true conditional quantile function is either continuous or square integrable. Let kn be the number of hidden nodes in the neural network, and let kn → ∞. If there exists a R p constant a such that 0 < a < 1 and kn ≤ na , then |µ̂n (x) − µ0 (x)| dx → 0 as n → ∞ In order to prove Theorem 2.4.3, we assume that Xi ∼ U (0, 1), i.e. density function of x is identically equal to 1. This implies joint densities f (x, y) and f0 (x, y) are equal to the conditional density functions, f (y|x) and f0 (y|x) respectively. Next, we define Kullback- Leibler distance to the true density f0 (x, y) as follows   f0 (X, Y ) DK (f0 , f ) = Ef0 log (2.11) f (X, Y ) Based on (2.11), a δ− sized neighborhood of the true density f0 is given by Kδ = {f : DK (f0 , f ) ≤ δ} (2.12) Further towards the proof of Theorem 2.4.3, we define the sieve Fn as the set of all neural networks with each parameter less than Cn in absolute value, |γjh | ≤ Cn , |βj | ≤ Cn , j = 0, . . . , kn , h = 0, . . . , p (2.13) where Cn grows with n such that Cn ≤ exp(nb−a ) for any constant b where 0 < a < b < 1, and a is same as in Theorem 2.4.3. For the above choice of sieve, we next provide a set of conditions on the prior πn which guarantee the posterior consistency of f0 over the Hellinger neighborhoods. At the end of 24 this section, we demonstrate that the following theorem and corollary serve as an important tool towards the proof of Theorem 2.4.3. Theorem 2.4.4. Suppose a prior πn satisfies i ∃ r > 0 and N1 s.t. πn (Fnc ) < exp(−nr), ∀n ≥ N1 ii ∀δ, ν > 0, ∃ N2 s.t. πn (Kδ ) ≥ exp(−nν), ∀n ≥ N2 . Then ∀ϵ > 0, p P (Aϵ |(X1 , Y1 ), . . . , (Xn , Yn )) → 1 where Aϵ is the Hellinger neighborhood of f0 as in (2.9). Corollary 2.4.5. Under the conditions of Theorem 2.4.4, µ̂n is asymptotically consistent for µ0 , i.e. Z p |µ̂n (x) − µ0 (x)| dx → 0 We present the proofs of Theorem 2.4.4 and Corollary 2.4.5 in The main idea behind the proof of Theorem 2.4.4 is to consider the complement of P (Aϵ |(X1 , Y1 ), .., (Xn , Yn )) as a ratio of integrals. Hence let Y n f (xi , yi ) i=1 Rn (f ) = n (2.14) Y f0 (xi , yi ) i=1 Then Z Y n Z f (xi , yi )dπn (f ) Rn (f )dπn (f ) Acϵ i=1 Acϵ P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) = Z Y n = Z f (xi , yi )dπn (f ) Rn (f )dπn (f ) i=1 Z Z Rn (f )dπn (f ) + Rn (f )dπn (f ) Acϵ ∩Fn Acϵ ∩Fnc = Z Rn (f )dπn (f ) 25 In the proof, we have shown that the numerator is small as compared to the denominator, p thereby ensuring P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) → 0. The convergence of the second term in the numerator uses assumption i) of the Theorem 2.4.4. It systematically shows that R Fc Rn (f )dπn (f ) < exp(−nr/2) except on a set with probability tending to zero (see Lemma n A.1.3 in A for further details). The denominator is bounded using assumption ii) of Theorem R 2.4.4. First the KL distance between f0 and f dπn (f ) is bounded and subsequently used p to prove that P (Rn (f ) ≤ e−nς ) → 0, where ς depends on δ defined earlier. 
This leads to a conclusion that for all ς > 0 and for sufficiently large n, Rn (f )dπn (f ) > e−nς except on a R set of probability going to zero. The result in this case has been condensed in Lemma A.1.4 presented in A. Lastly, the first term in the numerator is bounded using the Hellinger bracketing entropy defined below Definition 2.4.6 (Bracketing Entropy). For any two functions l and u, define the bracket [l, u] as the set of all functions f such that l ≤ f ≤ u. Let ∥.∥ be a metric. Define an ϵ- bracket as a bracket with ∥u − l∥ < ϵ. Define the bracketing number of a set of functions F ∗ as the minimum number of ϵ-brackets needed to cover the F ∗ , and denote it by N[] (ϵ, F ∗ , ∥.∥). Finally, the bracketing entropy, denoted by H[] () , is the natural logarithm of the bracketing number. (Pollard, 1991) Wong and Shen (1995, Theorem 1, pp.348-349) gives the conditions on the rate of growth Z p of the Hellinger bracketing entropy in order to ensure Rn (f )dπn (f ) → 0. We next Acϵ ∩Fn outline the steps to bound the bracketing entropy induced by the sieve structure in (2.13). In this direction, we first compute the covering number and use it as an upper bound in order to find the bracketing entropy for a neural network. Let’s consider k, number of hidden nodes to be fixed for now and restrict the parameter space to Fn then Fn ⊂ Rd where d = (p + 2)k + 1. Further let the covering number be N (ϵ, Fn , ∥.∥) and use L∞ as a metric to cover the Fn with balls of radius ϵ. Then, one does not require more than ((Cn + 1)/ϵ)d 26 such balls which implies  d  d  d 2Cn Cn + ϵ Cn + 1 N (ϵ, Fn , L∞ ) ≤ +1 = ≤ (2.15) 2ϵ ϵ ϵ Together with (2.15), we use results from to bound the bracketing number of F ∗ (the space of all functions on x and y with parameter vectors lying in Fn ) as follows:  2 d ∗ dCn N[] (ϵ, F , ∥.∥2 ) ≤ (2.16) ϵ This allows us to determine the rate of growth of Hellinger bracketing entropy which is nothing but the log of the quantity in (2.16). For further details, we refer to Lemmas A.1.2 and A.1.3 in A. Going back to the proof of Theorem 2.4.3, we show that the πn in Theorem 2.4.3 satisfies the conditions of Theorem 2.4.4 for Fn as in (2.13). Then, the result of Theorem 2.4.3 follows from the Corollary 2.4.5 which is derived from Theorem 2.4.4. Further details of the proof of Theorem 2.4.3 are presented in B. Although Theorem 2.4.3 uses a fixed prior, the results can be extended to a more general class of prior distributions as long as the assumptions of Theorem 2.4.4 hold. 2.5 Numerical Experiments 2.5.1 Simulation Studies We investigate the performance of the proposed BQRNN method using two simulated examples and compare the estimated conditional quantiles of the response variable against frequentist quantile regression (QR), Bayesian quantile regression (BQR), and quantile re- gression neural network (QRNN) models. We implement QR from quantreg package, BQR from bayesQR package and QRNN from qrnn (Cannon, 2011) package available in R. We choose two simulation scenarios, (i) a linear additive model, (ii) a nonlinear polynomial model. In both the cases we consider a heteroscedastic behavior of y given x. Scenario I: Linear heteroscedastic; Data are generated from Y = X T β1 + X T β2 ε, 27 Scenario II: Non-linear heteroscedastic; Data are generated from Y = (X T β1 )4 + (X T β2 )2 ε, where, X = (X1 , X2 , X3 ) and Xi ’s are independent and follow U (0, 5). The parameters β1 and β2 are set at (2, 4, 6) and (0.1, 0.3, 0.5) respectively. 
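For concreteness, the two data-generating mechanisms can be reproduced with the short sketch below. It assumes NumPy, and the function and variable names (e.g., simulate) are illustrative rather than part of the reported implementation.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen for illustration

n = 200                                   # observations per replicate
beta1 = np.array([2.0, 4.0, 6.0])
beta2 = np.array([0.1, 0.3, 0.5])

def simulate(scenario, noise, n=n):
    """Generate (X, y) for the two heteroscedastic simulation scenarios."""
    X = rng.uniform(0.0, 5.0, size=(n, 3))        # X1, X2, X3 ~ U(0, 5), independent
    if noise == "normal":
        eps = rng.standard_normal(n)              # N(0, 1) errors
    elif noise == "uniform":
        eps = rng.uniform(0.0, 1.0, size=n)       # U(0, 1) errors
    else:
        eps = rng.exponential(1.0, size=n)        # E(1) errors (mean 1)
    loc, scale = X @ beta1, X @ beta2
    if scenario == 1:                             # Scenario I: linear heteroscedastic
        y = loc + scale * eps
    else:                                         # Scenario II: nonlinear heteroscedastic
        y = loc**4 + scale**2 * eps
    return X, y

X, y = simulate(scenario=2, noise="normal")
```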
We chose these scenarios to make our simulation studies comparable with the past literature in the quantile regression neural network domain (Xu et al., 2017; Cannon, 2018). The robustness of our method is illustrated using three different types of random error components (ε): N (0, 1), U (0, 1), and E(1), where E(ζ) is the exponential distribution with mean 1/ζ. For each scenario, we generate 200 independent observations.

We work with a single hidden layer feedforward neural network with a fixed number of nodes k. We tried several values of k in the range 2 to 8 and settled on k = 4, which yielded better results than the other choices while bearing a reasonable computational cost. We generated 100,000 MCMC samples and then discarded the first half of the sampled chain as the burn-in period. Using 50% of the samples as burn-in in MCMC simulations is not unusual and has been suggested by Gelman and Rubin (1992). We also retain every 10th sampled value of the estimated parameters to diminish the effect of autocorrelation between consecutive draws. Convergence of the MCMC was checked using standard diagnostic tools (Gelman et al., 2013).

We tried several different values of the hyperparameters. For brevity, we report the results for only one choice of hyperparameters, given by β0 = 0, σ0^2 = 100, γj0 = 0, σ1^2 = 100, a = 3, and b = 0.1. This particular choice of hyperparameters reflects our preference for near-diffuse priors, since in many real applications of neural networks we do not have information about the relationship between the input and output variables. Therefore, we wanted to test our model performance in the absence of specific prior elicitation. We also tried different starting values for the β and γ chains and found that the model output is robust to different starting values of β but varies noticeably for different starting values of γ. Further, we observed that our model yields optimal results when we use the QRNN estimates of γ as its starting values. We also had to fine-tune the step size of the random walk Metropolis-Hastings (MH) updates in the γ generation process and settled on a random walk variance of 0.01^2 for Scenario I and 0.001^2 for Scenario II. These step sizes lead to reasonable rejection rates for the MH sampling of γ values. However, they also indicate slow traversal of the parameter space for γ.

To compare the model performance of QR, BQR, QRNN, and BQRNN, we calculated the theoretical conditional quantiles (Cond Q) and contrasted them with the estimated conditional quantiles from the given simulated models. Additionally, we compute standard deviation (SD) and root mean squared error (RMSE) values for the predicted conditional quantiles of each observation in the BQR and BQRNN models using the sampled chains. For Scenarios I and II, Table 2.1 and Table 2.2, respectively, present these results at quantile levels τ = (0.05, 0.50, 0.95) for 3 observations. Table 2.1 indicates that the neural network models perform comparably with the linear models. This justifies the use of neural network models even when the underlying relationship is linear. In Table 2.2, we observe that BQRNN outperforms the other models in the tail areas, i.e., τ = 0.05, 0.95, whereas its performance is comparable to QRNN at the median. Furthermore, we notice that our relatively complex BQRNN model has lower bias but higher variance (SD^2) compared to the BQR model. However, BQRNN outperforms BQR, as evidenced by the lower RMSE values, which combine the squared bias and variance terms.
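As an implementation aside, the random-walk Metropolis-Hastings update for γ together with the burn-in and thinning schedule described above can be sketched as follows. This is a simplified illustration: log_post is a placeholder for the unnormalized full conditional π(γj | β, σ, v, y) given in Section 2.3.2, and the step size corresponds to the Scenario I setting.

```python
import numpy as np

rng = np.random.default_rng(1)

def rw_mh_step(gamma, log_post, step_sd):
    """One random-walk Metropolis-Hastings update for the hidden-layer weights."""
    proposal = gamma + step_sd * rng.standard_normal(gamma.shape)
    log_ratio = log_post(proposal) - log_post(gamma)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True            # accept the proposed move
    return gamma, False                  # reject and keep the current state

# Placeholder target: the actual full conditional combines the ALD-mixture
# likelihood term with the N(gamma_j0, sigma_1^2 I) prior.
log_post = lambda g: -0.5 * np.sum(g**2)

gamma = np.zeros((4, 4))                 # k = 4 hidden nodes, p + 1 = 4 weights each
n_iter, burn_in, thin = 100_000, 50_000, 10
draws, accepted = [], 0
for t in range(n_iter):
    gamma, acc = rw_mh_step(gamma, log_post, step_sd=0.01)   # variance 0.01^2
    accepted += acc
    if t >= burn_in and (t - burn_in) % thin == 0:           # keep every 10th draw
        draws.append(gamma.copy())
print(f"acceptance rate: {accepted / n_iter:.2f}")
```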
We notice one exception when RMSE value in BQR is lower than BQRNN for 20th observation in 95th quantile of all 3 error models. The occurrence of this exception is random and not because BQR performed systematically better than our model. In summary, we observe the tradeoff between bias and variance in our model which overall performs better than a linear BQR model in a nonlinear setup. The natural advantage of our Bayesian procedure over the frequentist QRNN is that we have posterior variance for our conditional quantile estimates which can be used as an uncertainty quantification. 29 Table 2.1 Simulation study I results. Simulated Conditional Quantiles for QR, BQR, QRNN and BQRNN Noise Quantile Obs Theo QR BQR QRNN BQRNN ε τ No Cond Q Cond Q Cond Q SD RMSE Cond Q Cond Q SD RMSE N (0, 1) 0.05 20 17.56 17.19 17.19 0.60 0.70 16.98 17.15 0.34 0.53 50 37.38 37.69 37.67 0.74 0.80 38.80 39.03 0.68 1.79 100 42.53 42.45 42.56 0.92 0.92 41.43 41.31 0.69 1.40 0.50 20 20.23 20.23 20.25 0.37 0.37 20.23 20.23 0.51 0.51 50 42.62 42.87 42.65 0.31 0.31 42.62 42.68 0.77 0.77 100 48.78 48.81 48.67 0.39 0.41 48.78 48.01 0.97 1.24 0.95 20 22.90 21.81 22.38 0.68 0.86 21.42 21.85 0.46 1.15 50 47.86 47.92 47.70 0.69 0.71 47.52 47.19 0.80 1.04 100 55.02 54.28 54.20 0.92 1.24 53.80 53.47 1.05 1.88 U (0, 1) 0.05 20 20.31 20.39 20.25 0.40 0.41 20.55 20.44 0.30 0.33 50 42.78 42.68 42.56 0.47 0.52 42.89 42.92 0.61 0.63 100 48.97 48.93 48.82 0.56 0.58 48.34 48.82 0.70 0.72 0.50 20 21.04 21.28 21.25 0.19 0.29 20.98 20.97 0.33 0.34 50 44.21 44.30 44.22 0.22 0.22 43.95 43.98 0.66 0.70 100 50.68 50.88 50.79 0.25 0.28 50.39 50.82 0.76 0.78 0.95 20 21.77 21.87 22.10 0.46 0.56 21.72 21.82 0.33 0.34 50 45.65 45.58 45.69 0.46 0.47 45.43 45.38 0.65 0.70 100 52.39 52.36 52.45 0.58 0.58 52.49 52.52 0.75 0.76 E(1) 0.05 20 20.31 20.24 19.94 0.45 0.58 19.94 20.27 0.31 0.31 50 42.78 42.92 42.93 0.51 0.53 42.98 43.17 0.63 0.74 100 48.97 49.06 49.05 0.61 0.62 47.24 49.60 0.75 0.98 0.50 20 21.36 20.95 21.04 0.23 0.40 21.11 20.91 0.39 0.59 50 44.83 45.54 45.47 0.27 0.70 46.30 45.66 0.74 1.11 100 51.41 51.93 51.91 0.41 0.65 51.68 51.94 0.89 1.04 0.95 20 25.09 25.08 25.17 0.80 0.81 23.73 24.27 0.79 1.14 50 52.17 53.15 52.61 0.90 1.00 54.98 54.43 1.24 2.58 100 60.15 61.24 60.75 1.06 1.22 59.87 62.96 1.46 3.17 Obs No: Observation Number; Theo: Theoretical; Cond Q: Conditional Quantile; SD: Standard Deviation; RMSE: Root Mean Squared Error. 30 Table 2.2 Simulation study II results. 
Simulated Conditional Quantiles for QR, BQR, QRNN and BQRNN Noise Quantile Obs Theo QR BQR QRNN BQRNN ε τ No Cond Q Cond Q Cond Q SD RMSE Cond Q Cond Q SD RMSE N (0, 1) 0.05 20 167491.82 -195687.34 10207.80 33.03 157284.03 30400.87 166112.85 8460.05 8570.87 50 3298781.46 2107072.28 24948.89 73.02 3273832.57 2626269.69 3289014.06 48028.83 49007.23 100 5660909.58 2741102.73 25680.38 75.52 5635229.20 3312208.28 5311964.68 88478.24 359985.24 0.50 20 167496.16 83421.40 92937.30 101.97 74558.93 183185.47 169748.75 2615.13 3451.33 50 3298798.17 2661086.40 228321.19 282.04 3070476.99 3292073.45 3290858.99 46577.95 47245.13 100 5660933.30 3550782.83 233862.83 258.14 5427070.47 5652427.31 5662655.46 80356.32 80366.74 0.95 20 167500.49 1184601.83 171358.16 180.90 3861.91 193187.39 174073.37 2751.37 7125.39 50 3298814.88 4731674.95 423324.96 481.51 2875489.96 3309795.50 3298328.52 46685.76 46683.63 100 5660957.02 5575413.13 429129.54 458.70 5231827.50 5772437.35 5676164.49 80825.89 82236.16 U (0, 1) 0.05 20 167496.29 -195833.01 10207.87 33.11 157288.42 -27470.71 154598.77 7923.98 15136.80 50 3298798.68 2107278.20 24949.04 73.11 3273849.64 2500915.26 3295376.39 47739.91 47857.66 100 5660934.02 2741400.69 25680.55 75.79 5635253.47 3100036.18 5285593.84 98238.89 387980.93 0.50 20 167497.47 83435.89 92937.55 101.16 74559.99 172793.77 168331.89 2583.07 2714.25 50 3298803.25 2661086.35 228322.57 278.93 3070480.69 3314585.80 3296332.61 46650.01 46710.73 100 5660940.51 3550796.98 233863.94 256.32 5427076.57 5607084.86 5660871.21 80101.17 80093.19 0.95 20 167498.66 1184587.54 171358.60 163.87 3863.42 196767.97 174714.72 5874.89 9304.78 50 3298807.82 4731690.36 423325.69 464.88 2875482.17 3316486.10 3303092.85 47168.25 47357.80 100 5660947.00 5575416.01 429130.38 430.23 5231816.64 5757769.91 5661073.43 80290.13 80282.20 E(1) 0.05 20 167496.29 -195784.51 10207.99 32.90 157288.30 -172912.92 161521.55 5217.48 7931.84 50 3298798.69 2107220.10 24949.33 72.48 3273849.36 2380587.55 3295994.41 47385.22 47463.40 100 5660934.04 2741316.61 25680.81 75.25 5635253.23 2825010.54 5194410.73 82821.54 473816.46 0.50 20 167497.98 83442.50 92937.24 102.27 74560.81 175113.10 169606.69 2568.74 3323.21 50 3298805.21 2661099.63 228321.18 282.92 3070484.04 3309978.81 3297248.31 46648.10 46669.41 100 5660943.29 3550827.77 233862.58 258.35 5427080.71 5603733.56 5660975.15 80088.73 80080.73 0.95 20 167504.05 1184595.04 171358.64 163.27 3858.05 182663.10 172254.32 2917.90 5574.72 50 3298828.60 4731683.78 423325.49 458.35 2875503.15 3337299.83 3298567.13 46680.30 46676.36 100 5660976.50 5575416.49 429130.11 422.79 5231846.41 5838226.18 5687421.49 80742.36 84955.07 Obs No: Observation Number; Theo: Theoretical; Cond Q: Conditional Quantile; SD: Standard Deviation; RMSE: Root Mean Squared Error. 31 2.5.2 Real Data Examples In this section, we apply our proposed method to three real world datasets which are publicly available. The first dataset is the Boston Housing dataset which is available in R package MASS (Venables and Ripley, 2002). It contains 506 census tracts of Boston Standard Metropolitan Statistical Area in 1970. There are 13 predictor variables and one response variable, corrected median value of owner-occupied homes (in USD 1000s). 
Predictor variables include per capita crime rate by town, proportion of residential land zoned for lots over 25,000 sq.ft., nitrogen oxide concentration, proportion of owner-occupied units built prior to 1940, full- value property-tax rate per $10,000, and lower status of the population in percent, among others. There is high correlation among some of these predictor variables and the goal here is to determine the best fitting functional form to improve the housing value forecasts. The second dataset is the Gilgais dataset available in R package MASS. This data was collected on a line transect survey in gilgai territory in New South Wales, Australia. Gilgais are repeated mounds and depressions formed on flat land, and many-a-times are regularly distributed. The data collection with 365 sampling locations on a linear grid of 4 meters spacing aims to check if the gilgai patterns are reflected in the soil properties as well. At each of the sampling location, samples were taken at depths 0-10 cm, 30-40 cm and 80-90 cm below the surface. The input variables included pH, electrical conductivity and chloride content and were measured on a 1:5 soil:water extract from each sample. Here, the response variable is e80 (electrical conductivity in mS/cm: 80–90 cm) and we focus on finding the true functional relationship present in the dataset. The third dataset is concrete data which is compiled by Yeh (1998) and is available on UCI machine learning repository. It consists of 1030 records, each containing 8 input features and compressive strength of concrete as an output variable. The input features include the amounts of ingredients in high performance concrete (HPC) mixture which are cement, fly ash, blast furnace slag, water, superplasticizer, coarse aggregate, and fine aggregate. More- 32 over, age of the mixture in days is also included as one of the predictor variable. According to Yeh (1998), the compressive strength of concrete is a highly non-linear function of the given inputs. The central purpose of the study is to predict the compressive strength of HPC using the input variables. In our experiments, we compare the performance of QR, BQR, QRNN, and BQRNN estimates for f (x), the true functional form of the data, in both training and testing data using mean check function (or, mean tilted absolute loss function). The mean check function (MCF) is given as N 1 X MCF = ρτ (yi − fˆ(xi )) N i=1 where, ρτ (.) is defined in (2.3) and fˆ(x) is an estimate of f (x). We resort to this comparison criterion since we don’t have the theoretical conditional quantiles for the data at our disposal. For each dataset, we randomly choose 80% of data points for training the model and then remaining 20% is used to test the prediction ability of the fitted model. Our single hidden- layer neural network has k = 4 hidden layer nodes and the random walk variance is chosen to be 0.012 . These particular choices of the number of hidden layer nodes and random walk step size are based on their optimal performance among several different choices while providing reasonable computational complexity. We perform these analyses for quantiles, τ = (0.05, 0.25, 0.50, 0.75, 0.95), and present the model comparison results for both training and testing data in Table 2.3. It can be seen that our model performs comparably well with QRNN model while out- performing linear models (QR and BQR) in all the datasets. We can see that both QRNN and BQRNN have lower mean check function values for training data than their testing counterpart. 
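For reference, the mean check function used as the comparison criterion above is straightforward to compute; the sketch below assumes NumPy and uses placeholder predictions purely for illustration.

```python
import numpy as np

def check_loss(u, tau):
    """Tilted absolute (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def mean_check_function(y, y_pred, tau):
    """Mean check function (MCF) used to compare QR, BQR, QRNN, and BQRNN fits."""
    return np.mean(check_loss(np.asarray(y) - np.asarray(y_pred), tau))

# illustrative values only
y_test = np.array([21.0, 34.5, 19.2])
y_hat = np.array([20.1, 36.0, 18.7])
print(mean_check_function(y_test, y_hat, tau=0.95))
```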
This suggests that neural networks may be overfitting the data while trying to find the true underlying functional form. The model performance of QR and BQR models is inferior compared to neural network models, particularly when the regression relationship is non-linear. Furthermore, our BQRNN model provides uncertainty estimation as a natural byproduct which is not available in the frequentist QRNN model. 33 Table 2.3 Real data applications results. MCF values are reported Noise Quantile Sample QR BQR QRNN BQRNN Boston τ = 0.05 Train 0.3009 0.3102 0.2084 0.1832 Test 0.3733 0.3428 0.3356 0.5842 τ = 0.25 Train 1.0403 1.0431 0.6340 0.6521 Test 1.2639 1.2431 1.0205 1.2780 τ = 0.50 Train 1.4682 1.4711 0.8444 0.8864 Test 1.8804 1.8680 1.4638 1.5882 τ = 0.75 Train 1.3856 1.3919 0.6814 0.7562 Test 1.8426 1.8053 1.3773 1.4452 τ = 0.95 Train 0.5758 0.6009 0.2276 0.2206 Test 0.7882 0.6174 0.8093 0.6880 Gilgais τ = 0.05 Train 3.6610 3.7156 3.0613 2.7001 Test 3.4976 3.2105 2.9137 3.9163 τ = 0.25 Train 13.9794 14.5734 8.6406 8.4565 Test 11.8406 11.4832 9.3298 10.1386 τ = 0.50 Train 18.1627 21.3587 10.4667 10.7845 Test 15.8210 17.2037 13.5297 14.3699 τ = 0.75 Train 13.6598 18.8357 7.9679 7.9905 Test 12.3926 18.2477 9.4711 10.6414 τ = 0.95 Train 3.8703 6.4137 2.3289 2.2508 Test 4.1266 6.4300 3.0280 2.5586 Concrete τ = 0.05 Train 2.9130 4.4500 2.0874 2.0514 Test 2.9891 4.2076 2.2021 2.6793 τ = 0.25 Train 10.0127 14.7174 7.0063 7.0537 Test 9.6451 14.3567 6.9179 7.4069 τ = 0.50 Train 13.1031 19.8559 9.3638 9.3728 Test 12.7387 18.2309 9.8936 10.9172 τ = 0.75 Train 11.5179 17.7680 7.6789 7.3932 Test 10.8299 16.3257 8.5755 9.4147 τ = 0.95 Train 3.9493 6.8747 2.4403 2.5262 Test 3.6489 6.8435 2.7768 4.1369 34 2.6 Conclusion and Discussion This chapter introduces the Bayesian neural network model for quantile estimation in a systematic way. The practical implementation of Gibbs sampling coupled with Metropolis- Hastings updates method have been discussed in detail. The method exploits the location- scale mixture representation of the asymmetric Laplace distribution which makes its imple- mentation easier. The model can be thought as a hierarchical Bayesian model which makes use of independent normal priors for the neural network weight parameters. A future work in this area could be sparsity induced priors to allow for node and layer selection in multi-layer neural network architecture. Further, we have developed asymptotic consistency of the posterior distribution of the neural network parameters. The presented result can be extended to a more general class of prior distributions if they satisfy the Theorem 2.4.4 assumptions. Following the theory developed here, we bridge the gap between asymptotic justifications separately available for Bayesian quantile regression and Bayesian neural network regression. The theoretical argu- ments developed here justify using neural networks for quantile estimation in nonparametric regression problems using Bayesian methods. The proposed MCMC procedure has been shown to work when the number of parameters are relatively low compared to the number of observations. We noticed that the conver- gence of the posterior chains take long time and there is noticeable autocorrelation left in the sampled chains even after burn-in period. We also acknowledge that our random-walk Metropolis-Hastings algorithm has small step size which might lead to slow traversal of the parameter space ultimately raising the computational cost of our algorithm. 
The computa- tional complexity in machine learning methods are well-known. Further research is required in these aspects of model implementation. 35 APPENDICES 36 APPENDIX A LEMMAS FOR POSTERIOR CONSISTENCY PROOF For all the proofs in Appendix A and Appendix B, we assume Xp×1 to be uniformly distributed on [0, 1]p and keep them fixed. Thus, f0 (x) = f (x) = 1. Conditional on X, the univariate response variable Y has asymmetric Laplace distribution with location parameter determined by the neural network. We are going to fix its scale parameter, σ, to be 1 for the posterior consistency derivations. Thus, k ! X 1 Y |X = x ∼ ALD β0 + βj Pp , 1, τ (A.1) j=1 1 + exp(−γ j0 − h=1 γjh x h ) The number of input variables, p, is taken to be fixed while the number of hidden nodes, k, is allowed to grow with the sample size, n. A.1 Requisite Lemmas All the lemmas described below are taken from Lee (2000). Lemma A.1.1. Suppose H[] (u) ≤ log[(Cn2 dn /u)dn ], dn = (p + 2)kn + 1, kn ≤ na and Cn ≤ exp(nb−a ) for 0 < a < b < 1. Then for any fixed constants c, ϵ > 0, and for all sufficiently Rϵp √ large n, 0 H[] (u) ≤ c nϵ2 . Proof. The proof follows from the proof of Lemma 1 from (Lee, 2000, p. 634-635). For Lemmas A.1.2, A.1.3 and A.1.4, we make use of the following notations. From (2.14), recall n Y f (xi , yi ) Rn (f ) = f (x , y ) i=1 0 i i is the ratio of likelihoods under neural network density f and the true density f0 . Fn is the sieve as defined in (2.13) and Aϵ is the Hellinger neighborhood of the true density f0 as in (2.9). 37 Lemma A.1.2. sup Rn (f ) ≤ 4exp(−c2 nϵ2 ) a.s. for sufficiently large n. f ∈Acϵ ∩Fn Proof. Using the outline of the proof of Lemma 2 from (Lee, 2000, p. 635), first we have to bound the Hellinger bracketing entropy using Van Der Vaart and Wellner (1996, Theo- rem 2.7.11 on p.164). Next we use Lemma A.1.1 to show that the conditions of Wong and Shen (1995, Theorem 1 on p.348-349) hold and finally we apply that theorem to get the result presented in the Lemma 2. In our case of BQRNN, we only need to derive first step using ALD density mentioned in (A.1). And rest of the steps follow from the proof given in Lee (2000). As we are looking for the Hellinger bracketing entropy for neural networks, we use L2 norm on the square root of the density functions, f . The L∞ covering number was computed above in (2.15), so here d∗ = L∞ . The version of Van Der Vaart and Wellner (1996, Theorem 2.7.11) that we are interested in is p p If ft (x, y) − fs (x, y) ≤ d∗ (s, t)F (x, y) for some F, then, N[] (2ϵ ∥F ∥2 , F ∗ , ∥.∥2 ) ≤ N (ϵ, Fn , d∗ ) Now let’s start by defining some notations,  ft (x, y) = τ (1 − τ )exp −(y − µt (x))(τ − I(y≤µt (x)) ) , k p X βjt t X t where, µt (x) = β0t + and Aj (x) = γj0 + γjh xh (A.2) j=1 1 + exp(−Aj (x)) h=1  fs (x, y) = τ (1 − τ )exp −(y − µs (x))(τ − I(y≤µs (x)) ) , k p X βjs X where, µs (x) = β0s + s and Bj (x) = γj0 + s γjh xh (A.3) j=1 1 + exp(−Bj (x)) h=1 For notational convenience, we drop x and y from fs (x, y), ft (x, y), µs (x), µt (x), Bj (x), and Aj (x) and denote them as fs , ft , µs , µt , Bj , and Aj . p p ft − fs     p 1 1 = τ (1 − τ ) exp − (y − µt )(τ − I(y≤µt ) ) − exp − (y − µs )(τ − I(y≤µs ) ) 2 2 38 As, τ ∈ (0, 1) is fixed.     1 1 1 ≤ exp − (y − µt )(τ − I(y≤µt ) ) − exp − (y − µs )(τ − I(y≤µs ) ) (A.4) 2 2 2 Now let’s separate above term into two cases when: (a) µs ≤ µt and (b) µs > µt . Further let’s consider case-a and break it into three subcases when: (i) y ≤ µs ≤ µt , (ii) µs < y ≤ µt , and (iii) µs ≤ µt < y. 
Case-a (i) y ≤ µs ≤ µt The (A.4) simplifies to     1 1 1 exp − (y − µt )(τ − 1) − exp − (y − µs )(τ − 1) 2 2 2     1 1 1 = exp − (y − µs )(τ − 1) exp − (µs − µt )(τ − 1) − 1 2 2 2 As first term in modulus is ≤ 1   1 1 ≤ 1 − exp − (µt − µs )(1 − τ ) 2 2 Note: 1 − exp(−z) ≤ z ∀z ∈ R =⇒ |1 − exp(−z)| ≤ |z| ∀z ≥ 0 (A.5) 1 ≤ |µt − µs | (1 − τ ) 4 1 ≤ |µt − µs | 4 1 ≤ |µt − µs | 2 Case-a (ii) µs < y ≤ µt The (A.4) simplifies to     1 1 1 exp − (y − µt )(τ − 1) − exp − (y − µs )τ 2 2 2     1 1 1 = exp − (y − µs )(τ − 1) − 1 + 1 − exp − (y − µs )τ 2 2 2     1 1 1 1 ≤ 1 − exp − (y − µt )(τ − 1) + 1 − exp − (y − µs )τ 2 2 2 2 Let’s use calculus inequality mentioned in (A.5) 1 1 ≤ |(y − µt )(τ − 1)| + |(y − µs )τ | 4 4 39 Both terms are positive so we combine them in one modulus 1 = |(y − µt )(τ − 1) + (y − µt + µt − µs )τ | 4 1 = |(y − µt )(2τ − 1) + (µt − µs )τ | 4 1 ≤ [|(y − µt )| |2τ − 1| + |µt − µs | τ ] 4 Here, |y − µt | ≤ |µt − µs | and |2τ − 1| ≤ 1 1 ≤ |µt − µs | 2 Case-a (iii) µs ≤ µt < y The (A.4) simplifies to     1 1 1 exp − (y − µt )τ − exp − (y − µs )τ 2 2 2     1 1 1 = exp − (y − µt )τ 1 − exp − (µt − µs )τ 2 2 2 As first term in modulus is ≤ 1   1 1 ≤ 1 − exp − (µt − µs )τ 2 2 Using the calculus inequality mentioned in (A.5) 1 ≤ |µt − µs | τ 4 1 ≤ |µt − µs | 4 1 ≤ |µt − µs | 2 We can similarly bound the (A.4) in case-(b) where µs > µt by |µt − µs | /2. Now, p p ft − fs 1 ≤ |µt − µs | 2 Now, let’s substitute µt and µs from A.2 and A.3 k k 1 t X βjt s X βjs = β0 + − β0 − 2 j=1 1 + exp(−Aj ) j=1 1 + exp(−Bj ) 40 " k # 1 X βjt βjs ≤ β0t − β0s + − 2 j=1 1 + exp(−Aj ) 1 + exp(−Bj ) " k # 1 X βjt − βjs + βjs βjs = β0t − β0s + − 2 j=1 1 + exp(−Aj ) 1 + exp(−Bj ) " k k # 1 X βjt − βjs X 1 1 = β0t − β0s + + βjs − 2 j=1 1 + exp(−Aj ) j=1 1 + exp(−Aj ) 1 + exp(−Bj ) Recall that βjs ≤ Cn " k k # 1 X X exp(−Bj ) − exp(−Aj ) ≤ β0t − β0s + βjt − βjs + Cn (A.6) 2 j=1 j=1 (1 + exp(−Aj ))(1 + exp(−Bj ))  exp(−Aj )(1 − exp(−(Bj − Aj ))),  when Bj − Aj ≥ 0 Note: |exp(−Bj ) − exp(−Aj )| = exp(−B )(1 − exp(−(A − B ))),  j j j when Aj − Bj ≥ 0 Using the calculus inequality mentioned in (A.5)  exp(−Aj )(Bj − Aj ), when Bj − Aj ≥ 0  ≤ exp(−Bj )(Aj − Bj ), when Aj − Bj ≥ 0   exp(−A )(B −A ) j j j , when Bj − Aj ≥ 0  exp(−Bj ) − exp(−Aj )  (1+exp(−Aj ))(1+exp(−Bj )) So, ≤ (1 + exp(−Aj ))(1 + exp(−Bj ))   exp(−Bj )(Aj −Bj ) , when Aj − Bj ≥ 0 (1+exp(−Aj ))(1+exp(−Bj )) ≤ |Aj − Bj | Hence we can bound the (A.6) as follows " k k # p p 1 X X ft − fs ≤ β0t − β0s + βjt − βjs + Cn |Aj − Bj | 2 j=1 j=1 Now, let’s substitute Aj and Bj from A.2 and A.3 p p " k k # 1 X X X X ≤ β0t − β0s + βjt − βjs + Cn γj0 t + γjht s xh − γj0 − s γjh xh 2 j=1 j=1 h=1 h=1 p " k k !# 1 X X X ≤ β0t − β0s + βjt − βjs + Cn γj0 t s − γj0 + t |xh | γjh s − γjh 2 j=1 j=1 h=1 41 Recall that |xh | ≤ 1 and w.l.o.g assume Cn > 1 p " k k !# Cn X X X ≤ β0t − β0s + βjt − βjs + t γj0 − γj0s + t γjh s − γjh 2 j=1 j=1 h=1 Cn d ≤ ∥t − s∥∞ 2 Now rest of the steps follow from the proof of Lemma 2 in Lee (2000, p. 635-636). Lemma A.1.3. If there exists a constant r > 0 and N , such that Fn satisfies πn (Fnc ) < R exp(−nr), ∀n ≥ N , then there exists a constant c2 such that Ac Rn (f )dπn (f ) < exp(−nr/2)+ ϵ 2 exp(−nc2 ϵ ) except on a set of probability tending to zero. Proof. The proof is same as the proof of Lemma 3 from (Lee, 2000, p. 636). Lemma A.1.4. Let Kδ be the KL-neighborhood as in (2.12. Suppose that for all δ, ν > 0, ∃ N s.t. πn (Kδ ) ≥ exp(−nν), ∀n ≥ N . 
Then for all ς > 0 and sufficiently large n, Rn (f )dπn (f ) > e−nς except on a set of probability going to zero. R Proof. The proof is same as the proof of Lemma 5 from (Lee, 2000, p. 637). Lemma A.1.5. Suppose that µ is a neural network regression with parameters (θ1 , . . . θd ), and let µ̃ be another neural network with parameters (θ̃1 , . . . θ̃d˜n ). Define θi = 0 for i > d and θ̃j = 0 for j > d˜n . Suppose that the number of nodes of µ is k, and that the number of nodes of µ̃ is k̃n = O(na ) for some a, 0 < a < 1. Let Mς = {µ̃ θi − θ̃i ≤ ς, i = 1, 2, . . . } (A.7) Then for any µ̃ ∈ Mς and for sufficiently large n, sup(µ̃(x) − µ(x))2 ≤ (5na )2 ς 2 x∈X Proof. The proof is same as the proof of Lemma 6 from (Lee, 2000, p. 638-639). 42 APPENDIX B POSTERIOR CONSISTENCY THEOREM PROOFS B.1 Theorem 2.4.4 Proof For the proof of Theorem 2.4.4 and Corollary 2.4.5, we use the following notations. From (2.14), recall that n Y f (xi , yi ) Rn (f ) = i=1 0 f (xi , yi ) is the ratio of likelihoods under neural network density f and the true density f0 . Also, Fn is the sieve as defined in (2.13). Finally, Aϵ is the Hellinger neighborhood of the true density f0 as in (2.9). R By Lemma A.1.3, there exists a constant c2 such that Acϵ Rn (f )dπn (f ) < exp(−nr/2) + R exp(−nc2 ϵ2 ) for sufficiently large n. Next, from Lemma A.1.4, Rn (f )dπn (f ) ≥ exp(−nς) for sufficiently large n. Z Rn (f )dπn (f ) Acϵ P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) = Z Rn (f )dπn (f ) exp − nr  2 + exp(−nc2 ϵ2 ) < exp(−nς)  hr i − ς + exp −nϵ2 [c2 − ς]  = exp −n 2 r Now we pick ς such that for φ > 0, both 2 − ς > φ and c2 − ς > φ. Thus, P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) ≤ exp(−nφ) + exp(−nϵ2 φ) p Hence, P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) → 0. B.2 Corollary 2.4.5 Proof p Theorem 2.4.4 implies that DH (f0 , f ) → 0 where DH (f0 , f ) is the Hellinger distance between f0 and f as in (2.8) and f is a random draw from the posterior. Recall from (2.10), 43 the predictive density function Z fˆn (.) = f (.) dP (f |(X1 , Y1 ), . . . , (Xn , Yn )) gives rise to the predictive conditional quantile function, µ̂n (x) = Qτ,fˆn (y|X = x). We next p show that DH (f0 , fˆn ) → 0, which in turn implies µ̂n (x) converges in L1 -norm to the true conditional quantile function, k X 1 µ0 (x) = Qτ,f0 (y|X = x) = β0 + βj 1 + exp (−γj0 − ph=1 γjh xih ) P j=1 p First we show that DH (f0 , fˆn ) → 0. Let X n = ((X1 , Y1 ), . . . , (Xn , Yn )). For any ϵ > 0: Z ˆ DH (f0 , fn ) ≤ DH (f0 , f ) dπn (f |X n ) By Jensen’s Inequality Z Z n ≤ DH (f0 , f ) dπn (f |X ) + DH (f0 , f ) dπn (f |X n ) A Acϵ Z ϵ Z ≤ ϵ dπn (f |X n ) + DH (f0 , f ) dπn (f |X n ) Aϵ c Aϵ Z ≤ ϵ+ DH (f0 , f ) dπn (f |X n ) Acϵ The second term goes to zero in probability by Theorem 2.4.4 and ϵ is arbitrary, therefore p DH (f0 , fˆn ) → 0. In the remaining part of the proof, for notational simplicity, we take µ̂n (x) and µ0 (x) to be µ̂ and µ̂0 respectively. The Hellinger distance between f0 and fˆn is DH (f0 , fˆn ) Z Z q 2 !1/2 fˆn (x, y) − f0 (x, y) dy dx p = Z Z    1 = τ (1 − τ ) exp − (y − µ̂n )(τ − I(y≤µ̂n ) ) 2  2 !1/2 1 −exp − (y − µ0 )(τ − I(y≤µ0 ) ) dy dx 2  ZZ   1/2 1 1 = 2−2 τ (1 − τ )exp − (y − µ̂n )(τ − I(y≤µ̂n ) ) − (y − µ0 )(τ − I(y≤µ0 ) ) dy dx 2 2 44 1 1 let, T = − (y − µ̂n )(τ − I(y≤µ̂n ) ) − (y − µ0 )(τ − I(y≤µ0 ) ) 2 2  ZZ 1/2 = 2−2 τ (1 − τ )exp (T ) dy dx (B.1) Now let’s break T into two cases: (a) µ̂n ≤ µ0 , and (b) µ̂n > µ0 . 
Case-(a) µ̂n ≤ µ0  µ̂n +µ0       − y− 2 τ, µ̂n ≤ µ0 < y     − y − µ̂n +µ0  (y−µ0 ) µ̂n +µ0  2 τ+ 2 , µ̂n ≤ 2 < y ≤ µ0 T =  µ̂n +µ0  (y−µ̂n ) µ̂n +µ0     − y− 2 (τ − 1) − 2 , µ̂n < y ≤ 2 ≤ µ0    µ̂n +µ0   − y −  (τ − 1), y ≤ µ̂n ≤ µ0 2 Case-(b) µ̂n > µ0  µ̂n +µ0   − y −  τ, µ0 ≤ µ̂n < y   2     µ̂n +µ0  (y−µ̂n ) µ̂n +µ0 − y −  2 τ+ 2 , µ0 ≤ 2 < y ≤ µ̂n T = µ̂n +µ0 (y−µ0 ) µ̂n +µ0       − y− 2 (τ − 1) − 2 , µ0 < y ≤ 2 ≤ µ̂n    µ̂n +µ0   − y −  (τ − 1), y ≤ µ0 ≤ µ̂n 2 Hence now, Z τ (1 − τ )exp (T ) dy Z   = I(µ̂n ≤µ0 ) + I(µ̂n >µ0 ) τ (1 − τ )exp (T ) dy Z ∞     µ̂n + µ0 = I(µ̂n ≤µ0 ) τ (1 − τ ) × exp − y − τ dy µ0 2 Z µ0     µ̂n + µ0 (y − µ0 ) + exp − y − τ+ dy µ̂n +µ0 2 2 2 Z µ̂n +µ0     2 µ̂n + µ0 (y − µ̂n ) + exp − y − (τ − 1) − dy µ̂n 2 2 Z µ̂n      µ̂n + µ0 + exp − y − (τ − 1) dy −∞ 2 45 Z ∞     µ̂n + µ0 + I(µ̂n >µ0 ) τ (1 − τ ) × exp − y − τ dy µ̂n 2 Z µ̂n     µ̂n + µ0 (y − µ̂n ) + exp − y − τ+ dy µ̂n +µ0 2 2 2 Z µ̂n +µ0     2 µ̂n + µ0 (y − µ0 ) + exp − y − (τ − 1) − dy µ0 2 2 Z µ0      µ̂n + µ0 + exp − y − (τ − 1) dy −∞ 2     1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | = exp − τ − exp − (1 − τ ) 1 − 2τ 2 1 − 2τ 2 Substituting the above expression in Equation B.1 we get DH (f0 , fˆn ) equal to,  Z      1/2 1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | 2−2 exp − τ − exp − (1 − τ ) dx 1 − 2τ 2 1 − 2τ 2 p Since DH (f0 , fˆn ) → 0, Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | p exp − τ − exp − (1 − τ ) dx → 1 1 − 2τ 2 1 − 2τ 2 Our next step is to show that above expression implies that |µ̂n − µ0 | → 0 a.s. on a set Ω, R p with probability tending to 1, and hence |µ̂n − µ0 | dx → 0. We are going to prove this using contradiction technique. Suppose that, |µ̂n − µ0 | ↛ 0 a.s. on Ω. Then, there exists an ϵ > 0 and a subsequence µ̂ni such that |µ̂ni − µ0 | > ϵ on a set A with P (A) > 0. Now decompose the integral as Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | exp − τ − exp − (1 − τ ) dx 1 − 2τ 2 1 − 2τ 2 Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | = exp − τ − exp − (1 − τ ) dx A 1 − 2τ 2 1 − 2τ 2 Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | + exp − τ − exp − (1 − τ ) dx Ac 1 − 2τ 2 1 − 2τ 2   (1 − τ )exp(−ϵτ /2) − τ exp(−ϵ(1 − τ )/2) ≤ P (A) + P (Ac ) < 1 | {z } 1 − 2τ | {z } >0 | {z } <1 <1 (max = 1 for ϵ = 0) and strictly ↓ for ϵ∈(0,∞) So we have a contradiction since the integral converges in probability to 1. Thus |µ̂n − µ0 | → R 0 a.s. on Ω. Once we apply Scheffe’s theorem we get |µ̂n − µ0 | dx → 0 a.s. on Ω and hence R p |µ̂n − µ0 | dx → 0. 46 Below we prove the Theorem 2.4.3 and for that we make use of Theorem 2.4.4 and Corollary 2.4.5. B.3 Theorem 2.4.3 Proof We proceed by showing that with Fn as in (2.13), the prior πn of Theorem 2.4.3 satisfies the condition (i) and (ii) of Theorem 2.4.4. The proof of Theorem 2.4.4 condition-(i) presented in Lee (2000, proof of Theorem 1 on p. 639) holds in BQRNN case without any change. Next we need to show that condition-(ii) holds in BQRNN model. Let Kδ be the KL-neighborhood of the true density f0 as in (2.12) and µ0 the corresponding conditional quantile function. We first fix a closely approximating neural network µ∗ of µ0 . We then find a neighborhood Mς of µ∗ as in (A.7) and show that this neighborhood has sufficiently large prior probability. Suppose that µ0 is continuous. For any δ > 0, choose ϵ = δ/2 in theorem from Funahashi (1989, Theorem 1 on p.184) and √ p let µ∗ be a neural network such that sup |µ∗ − µ0 | < ϵ. Let ς = ( ϵ/5na ) = (δ/50)n−a in x∈X Lemma A.1.5. Then following derivation shows us that for any µ̃ ∈ Mς , DK (f0 , f˜) ≤ δ i.e. Mς ⊂ Kδ . 
ZZ f0 (x, y) DK (f0 , f˜) = f0 (x, y) log dy dx f˜(x, y) ZZ   = (y − µ̃)(τ − I(y≤µ̃) ) − (y − µ0 )(τ − I(y≤µ0 ) ) f0 (y|x) f0 (x) dy dx let, T = (y − µ̃)(τ − I(y≤µ̃) ) − (y − µ0 )(τ − I(y≤µ0 ) ) Z Z  = T f0 (y|x) dy f0 (x) dx Now let’s break T into two cases: (a) µ̃ ≥ µ0 , and (b) µ̃ < µ0 . Case-(a) µ̃ ≥ µ0       (µ0 − µ̃)τ, µ0 ≤ µ̃ < y   T = (µ0 − µ̃)τ − (y − µ̃), µ0 < y ≤ µ̃      (µ0 − µ̃)(τ − 1),  y ≤ µ0 ≤ µ̃ 47 Case-(b) µ̃ ≤ µ0   (µ0 − µ̃)τ,   µ̃ ≤ µ0 < y    T = (µ0 − µ̃)(τ − 1) + (y − µ̃), µ̃ < y ≤ µ0     (µ0 − µ̃)(τ − 1),  y ≤ µ̃ ≤ µ0 So now, Z T f0 (y|x) dy Z   = I(µ̃−µ0 ≥0) × (µ̃ − µ0 )(1 − τ )I(y≤µ0 ) − (y − µ̃)I(µ0 µ0 )   +I(µ̃−µ0 <0) × (µ̃ − µ0 )(1 − τ )I(y≤µ0 ) + (y − µ̃)I(µ̃µ0 ) f0 (y|x) dy Z  = (µ̃ − µ0 )(1 − τ )I(y≤µ0 ) − (µ̃ − µ0 )τ I(y>µ0 )  −(y − µ0 + µ0 − µ̃)I(µ0 µ0 |x) = 1 − τ.   = E −(z − b)I(0 0, ∃Nν s.t. πn (Kδ ) ≥ exp(−nν) ∀n ≥ Nν , πn (Kδ ) ≥ πn (Mς ) d˜n Z θi +ς   Y 1 1 2 = p exp − 2 u du i=1 θi −ς 2πσ02 2σ0 d˜n   Y 1 1 2 ≥ 2ς inf p exp − 2 u i=1 u∈[θi −1,θi +1] 2πσ 2 0 2σ0 d˜n s   Y 2 1 = ς 2 exp − 2 ϑi i=1 πσ 0 2σ0 ϑi = max((θi − 1)2 , (θi + 1)2 ) s !d˜n   2 1 ˜ ≥ ς exp − 2 ϑdn where, ϑ = max(ϑ1 , . . . , ϑd˜n ) πσ02 2σ0 i " s # ! δ 1 = exp −d˜n a log n − log − 2 ϑd˜n 25πσ02 2σ0 r δ −a ς= n 50     ϑ ˜ ≥ exp − 2a log n + 2 dn for large n 2σ0     ϑ a ≥ exp − 2a log n + 2 (p + 3)n 2σ0 d˜n = (p + 2)k̃n + 1 ≤ (p + 3)na ≥ exp(−nν) for any ν and ∀n ≥ Nν for some Nν Hence, we have proved that both the conditions of Theorem 2.4.4 hold. The result of Theorem 2.4.3 thereby follows from the Corollary 2.4.5 which is derived from Theorem 2.4.4. Further, we can use similar argument to show that a neural network can approximate 49 any L2 function arbitrarily closely. Note for any L2 function h, ∥h∥2 ≥ ∥h∥1 , so Z  12 Z 2 (µ − µ0 ) dλ(x) < ϵ =⇒ |µ − µ0 | dλ(x) < ϵ Hence, Z DK (f0 , f˜) ≤ |µ̃ − µ0 | f0 (x) dx Z   ≤ sup |µ̃ − µ| + sup |µ − µ0 | f0 (x) dx x∈X x∈X Use Hornik et al. (1989, Theorem 2.4 on p.362) and Lemma A.1.5 Z ≤ [ϵ + ϵ] f0 (x) dx = 2ϵ = δ 50 CHAPTER 3 LAYER ADAPTIVE NODE SELECTION IN BAYESIAN NEURAL NETWORKS 3.1 Introduction Deep learning profoundly impacts science and society due to its impressive empirical success driven primarily by copious amounts of datasets, ever increasing computational resources, and deep neural network’s (DNN) ability to learn task-specific representations (LeCun et al., 2015). The key characteristic of deep learning is that accuracy empirically scales with the size of the model and the amount of training data. As such, large neural network models such as OpenAI GPT-3 (175 Billion) now typify the state-of-the-art across multiple domains such as natural language processing, computer vision, speech recognition etc. Nevertheless deep neural networks do have some drawbacks despite their wide ranging applications. First, this form of model scaling is exorbitantly prohibitive in terms of compu- tational requirements, financial commitment, energy requirements etc. Second, DNNs tend to overfit leading to poor generalization in practice (Zhang et al., 2017). Finally, there are numerous scenarios where training and deploying such huge models is practically infeasible. Examples of such scenarios include federated learning, autonomous vehicles, robotics, recom- mendation systems where models have to be refreshed daily/hourly or in an online manner for optimal performance. A promising direction for addressing these issues while improving the efficiency of DNNs is exploiting sparsity. 
From a practical perspective, it has been well known that neural networks can be sparsified without significant loss in performance (Mozer and Smolensky, 1988; LeCun et al., 1990; Hassibi and Stork, 1993), and there is growing evidence that this is even more so for modern DNNs (Han et al., 2015). Recently, Frankle and Carbin (2019) proposed the lottery ticket (LT) hypothesis, namely that there exist sparse, trainable sub-networks within the larger network which can match the performance of their dense counterpart. To this end, sparsity in DNNs provides a promising way to reduce network complexity by eliminating nonessential connections from a neural network, thereby improving its calibration (Hoefler et al., 2021).

A number of approaches to neural network compression via sparsity have been proposed in the literature (Cheng et al., 2018; Gale et al., 2019). Recent approaches to magnitude-based pruning of neural network weights (Guo et al., 2016; Molchanov et al., 2017; Zhu and Gupta, 2018) provide high model compression rates with minimal accuracy loss, whereas sparse evolutionary training learns sparse neural networks with a fixed parameter budget throughout training based on adaptive sparse connectivity (Mocanu et al., 2018).

A key feature of sparsity in neural networks is its structure on the topology of the neural network weights. Weight pruning approaches achieve high model compression, leading to significant storage cost reduction at test time (Han et al., 2015, 2016; Molchanov et al., 2017; Zhu and Gupta, 2018; Frankle and Carbin, 2019). However, they result in unstructured sparsity in deep neural architectures, which leads to inefficient computational gains in practical setups (Wen et al., 2016). Instead, inducing group sparsity on the collection of incoming weights into a given node (i.e., node selection) reduces the dimensions of the weight matrices per layer, allowing for significant computational savings. To that effect, edge selection and node selection approaches are complementary, with the former leading to storage reduction and the latter leading to computational speedup at the inference stage. Although one may argue that node selection arises as a byproduct of edge selection, we clearly demonstrate that an approach which targets node selection directly leads to lower latency models (a smaller number of nodes per layer) compared to an approach which achieves node selection through edge selection.

Node selection through group sparsity in deep neural networks has been explored in the frequentist setting by Murray and Chiang (2015), Alvarez and Salzmann (2016), Ochiai et al. (2017), Liu et al. (2017), Luo et al. (2017), and Louizos et al. (2018), among others. On the other hand, Louizos et al. (2017), Neklyudov et al. (2017), and Ghosh et al. (2019) incorporate group sparsity via shrinkage priors in the Bayesian paradigm. These group sparsity approaches specifically applied to node selection have shown significant computational speedup and a lower memory footprint at the inference stage. However, all of the proposed methods of neuron selection perform ad-hoc pruning requiring fine-tuned thresholding rules. Moreover, the posterior inference of network weights in Bayesian neural networks (BNN) through standard MCMC methods, e.g., Hamiltonian Monte Carlo (Neal, 1992), does not scale well to modern neural network architectures and large datasets used in practice.
Instead, computationally efficient variational inference, as an alternative to MCMC (Jordan et al., 1999; Blei et al., 2017), has been explored in the context of edge selection both theoretically and numerically by Blundell et al. (2015), Chérief-Abdellatif (2020), and Bai et al. (2020). On the other hand, Louizos et al. (2017) and Ghosh et al. (2019) have explored variational inference for the node selection problem.

In this work, we propose a Gaussian spike-and-slab prior for automatic node selection in Bayesian neural networks, thereby alleviating the need for an ad-hoc thresholding rule for pruning. Further, for scalability, we develop a variational Bayes algorithm for posterior inference of the BNN model parameters in our proposed model and demonstrate its numerical performance on simulated and real regression and classification datasets. Finally, we provide theoretical guarantees for our node selection method under mild restrictions on the network topology.

Related Work. A closely related work to our proposed model is Bai et al. (2020)'s automated edge selection model using a spike-and-slab prior. There, the slab distribution controls the magnitude of the weights and the spike allows for the exact setting of weights to 0. We introduce the spike-and-slab framework for node selection in BNNs and show the key resource efficiency trade-off between node and edge selection at test time. There are two main advantages of node selection over edge selection: (1) fewer parameters to train during optimization, and (2) a structurally compact network leading to computational speedup at test time.

On the theoretical front, sparse BNNs have been studied in the works of Polson and Ročková (2018) and Sun et al. (2021). In the context of variational inference, sparse BNNs have been studied in the recent works of Chérief-Abdellatif (2020) and Bai et al. (2020). All these works concentrate on the problem of edge selection facilitated through the use of Gaussian spike-and-slab priors. In the context of node selection, Ghosh et al. (2019) make use of a regularized horseshoe prior. The main limitations of their approach include (1) the need for fine-tuning of the thresholding rule for node selection, and (2) the lack of a theoretical justification.

Figure 3.1 Sparse neural network with node selection. Sparse deep BNN using spike-and-slab priors achieves node selection in the given dense network on the left, leading to a sparse network on the right.

The only two works which have provided theoretical guarantees for their proposed sparse DNN methods under variational inference are those of Chérief-Abdellatif (2020) and Bai et al. (2020). Since they focus on the problem of edge selection, their theoretical developments are related to the results of Schmidt-Hieber (2020) (see the sieve construction in relation (4) in Schmidt-Hieber (2020)) and are not directly extendable to our setup. Additionally, they assume certain restrictions on the network topology, namely (i) an equal number of nodes in each layer, (ii) a known uniform bound B on all network weights, and (iii) a global sparsity parameter which may not lead to a structurally compact network. Although from a numerical standpoint one may implicitly extend the problem of edge selection to node selection, the theoretical guarantees of node selection consistency in sparse DNNs are not immediate.

Detailed Contributions. 1.
We propose a Gaussian spike-and-slab node selection model and develop a variational Bayes approach for posterior inference of the model parameters. We call our approach SS-IG (Spike-and-Slab Independent Gaussian) model. 2. We derive the variational posterior consistency using a functional space of neural net- works which takes two layer dependent bounds, one which upper bounds the number of neurons in each layer and the other which upper bounds the L1 norm of the weights incident onto each node of a layer. These layer dependent bounds allow the general- ization of the theoretical results presented to guarantee the consistency of any generic shaped network structure. Further, it also guides the calculation of layer-wise prior inclusion probabilities which allow for optimal node recovery per layer in the compu- tational experiments. 3. We measure the computational gains achieved by our approach using layer-wise node sparsities for shallow models and floating point operations in larger models. Our nu- merical results validate the proposed theoretical framework for the node selection in DNN models. These empirical experiments further justify the use of layer-wise node inclusion probabilities to facilitate the optimal node recovery. 3.2 Nonparametric Modeling: Deep Learning Approach Non-parametric modeling assumes an arbitrary relationship between the response and the variables. The term non-parametric does not mean that the value lack inherent parameters, but rather that the parameters are flexible and can vary. In particular, we would like to find a function η0 (·) : Rp → R such that η0 (x) is a good approximation or a representation of y. A standard neural network is, technically speaking, parametric since it has a fixed number of parameters. However, most deep neural networks (DNNs) have thousands or millions of 55 parameters that they could be interpreted as nonparametric. In fact, it has been proven that in the limit of infinite width, a deep neural network can be seen as a Gaussian process, which is a nonparametric model (Lee et al., 2018). Mathematically, let Y ∈ R, X ∈ X be two random variables with the following condi- tional distribution f0 (y|x) = exp [h1 (η0 (x))y + h2 (η0 (x)) + h3 (y)] (3.1) where η0 (·) : X → R is a continuous function satisfying certain regularity assumptions and X is usually a compact subspace of Rp . Note, the functions h1 , h2 , h3 are pre-determined and different choices give rise to different families of generalized linear models. For h1 (u) = u, h2 (u) = − log(1+eu ), h3 (y) = 1, we get the classification model. For h1 (u) = u, h2 (u) = −u2 , h3 (y) = −y 2 /2−log(2π)/2, we get the regression model with σ 2 = 1. Usually X = Rp . Note, x is a feature vector from a marginal distribution PX and y is the corresponding output from Y |X = x in (3.1). Let PX,Y be the joint distribution of (X, Y ). R Let g : X → R be a measurable function, the risk of g is R(g) = Y×X L(Y, g(X))dPX,Y for some loss function L. The Bayes estimator minimizes this risk (Friedman et al., 2009). For regression with squared error loss and classification with 0-1 loss, the optimal Bayes estimators are g ∗ (x) = η0 (x) and g ∗ (x) = 1{η0 (x) ≥ 0} respectively. In practice, Bayes estimator is not useful since the function η0 (x) is unknown. Thus, an estimator is obtained based on the training observations, D = {(x1 , y1 ), ..., (xn , yn )}. A good estimator enjoys universal consistency properties, i.e., its risk approaches Bayes risk as n → ∞ irrespective of PX . 
To find this optimal class, we use Bayesian neural networks, ηθ (x) with θ denoting the network weights, as an approximation to η0 (x). Mathematical Framework For x ∈ Rp , consider a BNN with L hidden layers with k1 , · · · , kL the number of nodes in the hidden layers with k0 = p, kL+1 = 1 (in regression). kL+1 > 1 allows the generalization 56 to Y ∈ Rd , d > 1, thereby providing a handle on multi class classification problems. The total number of parameters is K = Ll=0 kl+1 (kl + 1). With Wl = [wl0 , Wl1 ], let Q ηθ (x) = wL0 + WL1 ψ(wL−1 0 + WL−1 1 ψ(· · · ψ(w10 + W11 ψ(w00 + W01 x)))), (3.2) where ψ is a nonlinear activation function, wl0 are kl+1 × 1 vectors and Wl1 are kl+1 × kl matrices. Using the BNN in (3.2) to approximate the true function η0 (x), conditional probabilities of Y |X = x are fθ (y|x) = exp [h1 (ηθ (x))y + h2 (ηθ (x)) + h3 (y)] . (3.3) Thus, the likelihood function for the data D under the model and the truth is Yn Yn Pθn = fθ (yi |xi ), P0n = f0 (yi |xi ). (3.4) i=1 i=1 3.3 Spike-and-Slab Independent Gaussian Node Selection 3.3.1 Model To allow for automatic node selection, we consider a spike-and-slab prior consisting of a Dirac spike (δ0 ) at 0 and a slab distribution (Mitchell and Beauchamp, 1988). The spike part is represented by an indicator variable which is set to 0 if a node is not present in the network. The slab part comes from a Gaussian distributed random variable. To allow for the layer-wise node selection, we assume that the prior inclusion probability λl varies as a function of the layer index l. The symbol i.d. is used to denote independently distributed random variables. Prior: We assume a spike-and-slab prior of the following form with zlj as the indicator for the presence of j th node in the lth layer i.d.  i.d. wlj |zlj ∼ (1 − zlj )δ0 + zlj N (0, σ02 I) , zlj ∼ Ber(λl )  where l = 0, . . . , L, j = 1, . . . , kl+1 . Also, wlj = (wlj1 , . . . , wljkl +1 ) is a vector of edges incident on the j th node in the lth layer. In the above formula, note δ0 is a Dirac spike 57 vector of dimension kl + 1 with all entries zero and I is the identity matrix of dimension kl + 1 × kl + 1. Furthermore, zlj with j = (1, . . . , kl+1 ) all follow Bernoulli(λl ) to allow for common prior inclusion probability, λl , for each node from a given layer l. We set λL = 1 to ensure no node selection occurs in the output layer. Posterior: With zl = (zl1 , · · · , zlkl+1 ), let z = (z1 , · · · , zL ) denote the vector of all indicator variables. The posterior distribution of (θ, z) given D is given by P n π(θ|z)π(z) P n π(θ|z)π(z) π(θ, z|D) = P R θ n = θ (3.5) z Pθ π(θ|z)π(z)dθ m(D) Qn where Pθn = i=1 fθ (yi |xi ) is the likelihood function as in (3.4), π(z) is the probability mass function of z with respect to the counting measure and π(θ|z) is the conditional probability density function with respect to the Lebesgue measure of θ given z . Further, m(D) is the marginal density of the data and is free of (θ, z). P Let π e(θ) = z π(θ, z) be the marginal prior of θ. We shall use the notation Z Π(A) = e π e(θ)dθ (3.6) A to denote the probability distribution function corresponding to the density function π e. 
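To make the node-wise structure of the prior explicit, the following sketch draws a network from the SS-IG prior and evaluates ηθ(x) as in (3.2). It assumes NumPy; the tanh activation, the layer sizes, and the helper names are illustrative choices and not fixed by the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ssig_layer(k_in, k_out, lam, sigma0):
    """Draw one layer from the SS-IG prior: each row (all weights plus the bias
    feeding into one node) is kept with probability lam and zeroed out otherwise."""
    z = rng.binomial(1, lam, size=k_out)                   # node indicators z_lj ~ Ber(lam)
    W = rng.normal(0.0, sigma0, size=(k_out, k_in + 1))    # slab: N(0, sigma0^2 I)
    return z[:, None] * W                                  # spike sets whole rows to zero

def forward(x, layers, act=np.tanh):
    """Evaluate eta_theta(x) for a network stored as a list of [bias | weights] blocks."""
    h = x
    for W in layers[:-1]:
        h = act(W[:, 0] + h @ W[:, 1:].T)
    W = layers[-1]
    return W[:, 0] + h @ W[:, 1:].T

# toy architecture: p = 4 inputs, hidden layers with (8, 6) nodes, scalar output
sizes = [4, 8, 6, 1]
lams = [0.5, 0.5, 1.0]          # lambda_L = 1, so no node selection in the output layer
layers = [sample_ssig_layer(k_in, k_out, lam, sigma0=1.0)
          for k_in, k_out, lam in zip(sizes[:-1], sizes[1:], lams)]
x = rng.normal(size=(3, 4))     # three input points
print(forward(x, layers).shape)  # -> (3, 1)
```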
The marginal posterior of θ expressed as a function of the marginal prior for θ is X Pθn π e(θ) Pθn π e(θ) π e(θ|D) = π(θ, z|D) = R = z Pθn π e(θ)dθ m(D) Thus, the probability distribution function corresponding to the density function π e(|D) is then given by Z Π(A|D) e = π e(θ|D)dθ (3.7) A Variational family: We posit the following mean field variational family (QMF ) on network weights as n o i.d.  i.d. QMF = wlj |zlj ∼ (1 − zlj )δ0 + zlj N (µlj , diag(σlj2 )) , zlj ∼ Ber(γlj )  for l = 0, . . . , L, j = 1, . . . , kl+1 . This ensures that weight distributions follow spike-and-slab structure which allows for node sparsity through variational approximation. Further, the 58 weight distributions conditioned on the node indicator variables are all independent of each other (hence use of the term mean field family). The variational distribution of parameters obtained post optimization will then inherently prune away redundant nodes from each layer. Also, Gaussian distribution for slab component is widely popular for approximating neural network weight distributions (Blundell et al., 2015; Louizos et al., 2017; Bai et al., 2020). Additionally, µlj = (µlj1 , . . . , µljkl +1 ) and σlj2 = (σlj1 2 2 , . . . , σljkl +1 ) denote the vectors of variational mean and standard deviation parameters of the edges incident on the j th node in the lth layer. Similarly, γlj denotes the variational inclusion probability of the j th node in the lth layer. We set γLj = 1 to ensure no node selection occurs in the output layer. Variational posterior: Variational posterior aims to reduce the Kullback-Leibler (KL) distance between a variational family and the true posterior (Blei and Lafferty, 2007; Hinton and Van Camp, 1993) as π ∗ = argmin dKL (q, π(|D)) (3.8) q∈QMF where dKL (q, π(|D)) denotes the KL-distance between q and π(|D). Note, the variational member q can be written as q(θ, z) = q(θ|z)q(z) where q(z) is the probability mass function of z with respect to the counting measure and q(θ|z) is the con- ditional density function given with respect to the Lebesgue measure of θ given z. Further, XZ ∗ π = argmin [log q(θ, z) − log π(θ, z|D)]q(θ, z)dθ q∈QMF z ! XZ = argmin [log q(θ, z) − log π(θ, z, D)]q(θ, z)dθ + log m(D) q∈QMF z = argmin [−ELBO(q, π(|D))] + log m(D) = argmax ELBO(q, π(|D)) (3.9) q∈QMF q∈QMF Since log m(D) is free from q, it suffices to maximize the evidence lower bound (ELBO) above. e∗ (θ) = π ∗ (θ|z)π ∗ (z) then π e∗ denotes the marginal variational posterior for θ. P Let π z 59 We shall use the notation Z e∗ Π (A) = e∗ (θ)dθ π (3.10) A to denote the probability distribution function corresponding to the density function π e∗ . 3.3.2 Algorithm Evidence Lower Bound. The ELBO presented in (3.9) is given by L = −Eq [log Pθn ] + dKL (q, π) which is further simplified as − Eq [log Pθn ] + dKL (q, π)   = −Eq(θ|z)q(z) [log Pθn ] + dKL q(θ|z)q(z), π(θ|z)π(z) X = −Eq(θ|z)q(z) [log Pθn ] + dKL (q(zlj )||π(zlj )) l,j Xh + q(zlj = 1)dKL (q(wlj |zlj = 1)||π(wlj |zlj = 1)) l,j i + q(zlj = 0)dKL (q(wlj |zlj = 0)||π(wlj |zlj = 0)) X = −Eq(θ|z)q(z) [log Pθn ] + dKL (q(zlj )||π(zlj )) l,j X + q(zlj = 1)dKL (q(wlj |zlj = 1)||π(wlj |zlj = 1)) l,j X = −Eq(θ|z)q(z) [log Pθn ] + dKL (q(zlj )||π(zlj )) l,j X + q(zlj = 1)dKL (N (µlj , diag(σlj2 ))||N (0, σ02 I)) l,j The KL of discrete variables appearing in the above expression creates a challenge in practical implementation. Jang et al. (2017) and Maddison et al. (2017) proposed to re- place discrete random variable with its continuous relaxation. 
Specifically, the continuous relaxation is achieved through the Gumbel-softmax (GS) distribution; that is, $q(z_{lj}) \sim \text{Ber}(\gamma_{lj})$ is approximated by $q(\tilde{z}_{lj}) \sim \text{GS}(\gamma_{lj}, \tau)$, where

$$\tilde{z}_{lj} = \big(1 + \exp(-\eta_{lj}/\tau)\big)^{-1}, \qquad \eta_{lj} = \log\big(\gamma_{lj}/(1 - \gamma_{lj})\big) + \log\big(u_{lj}/(1 - u_{lj})\big), \qquad u_{lj} \sim U(0, 1),$$

and $\tau$ is the temperature. We set $\tau = 0.5$ for this work (also see Section 5 in Bai et al. (2020)). $\tilde{z}_{lj}$ is used in the backward pass for easier gradient calculation, while $z_{lj}$ is used for selecting nodes in the forward pass. We use a non-centered parameterization for the Gaussian slab variational approximation, where $N(\mu_{lj}, \mathrm{diag}(\sigma_{lj}^2))$ is reparameterized as $\mu_{lj} + \sigma_{lj} \odot \zeta_{lj}$ for $\zeta_{lj} \sim N(0, I)$, where $\odot$ denotes the entry-wise (Hadamard) product.

Algorithm 3.1 Variational inference in SS-IG Bayesian neural networks
1: Inputs: training dataset, network architecture, and optimizer tuning parameters.
2: Model inputs: prior parameters for θ, z.
3: Variational inputs: number of Monte Carlo samples S.
4: Output: variational parameter estimates of network weights and sparsity.
5: Method: set initial values of the variational parameters.
6: repeat
7:   Generate S samples from ζ_lj ∼ N(0, I) and u_lj ∼ U(0, 1)
8:   Generate S samples of (z_lj, z̃_lj) using u_lj
9:   Use μ_lj, σ_lj, ζ_lj and z_lj to compute the loss (ELBO) in the forward pass
10:  Use μ_lj, σ_lj, ζ_lj and z̃_lj to compute the gradient of the loss in the backward pass
11:  Update the variational parameters with the gradient of the loss using a stochastic gradient descent algorithm (e.g. Adam (Kingma and Ba, 2015))
12: until change in ELBO < ϵ

3.4 Theoretical Results

In this section, we develop the theoretical consistency of the variational posterior in (3.10) in the context of node selection. Previous works that establish the statistical consistency of sparse deep neural networks do so only in the context of edge selection. To this end, the works of Polson and Ročková (2018), Chérief-Abdellatif (2020), and Bai et al. (2020) use several results from the pioneering work of Schmidt-Hieber (2020). In addition to node selection consistency, we also relax certain network restrictions considered in the previous works. These restrictions include (1) an equal number of nodes in each layer, which prevents one from using any prior information on the number of nodes in the deep neural architecture; (2) a known bound B on all the neural network weights, since these works essentially rely on the sieve construction in equation 3 of Schmidt-Hieber (2020), which assumes that the L∞ norm of all entries of θ is smaller than 1; and (3) a global sparsity parameter s, which does not always yield structurally sparse networks.

Towards the proof, firstly, our sieve construction allows the number of nodes of the neural network to vary as a function of the layer. Secondly, instead of a global sparsity parameter s (see the sieve construction in relation (4) of Schmidt-Hieber (2020)), we allow for a layer-wise sparsity vector s to account for the number of nodes in each layer. Finally, we relax the assumption of a known bound B by considering a sieve with a layer-wise constraint (denoted by the vector B) on the L1 norm of the incoming edges of a node. Thus, our work extends the current literature along three directions: (1) it theoretically quantifies the predictive performance of Bayesian neural networks with node-based pruning; (2) it establishes that, even without a fixed bound on the network weights, one can recover the true solution by an appropriate choice of the prior; and (3) it provides layer-wise node inclusion probabilities to allow for structurally sparse solutions.
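Before developing the theory, the following sketch ties the pieces of Sections 3.3.1 and 3.3.2 together by showing how one optimization step of Algorithm 3.1 might be organized in PyTorch. It is a simplified illustration under stated assumptions: a single linear output layer, one Monte Carlo sample, a squared-error term standing in for $-\log P_\theta^n$, the relaxed indicator used in both passes (unlike Algorithm 3.1's hard/soft split), and illustrative shapes and learning rate.

```python
import torch

def kl_bernoulli(gamma, lam, eps=1e-8):
    # d_KL(Ber(gamma) || Ber(lam)) summed over nodes
    return (gamma * torch.log((gamma + eps) / (lam + eps))
            + (1 - gamma) * torch.log((1 - gamma + eps) / (1 - lam + eps))).sum()

def kl_gaussian(mu, sigma, sigma0=1.0):
    # d_KL(N(mu, diag(sigma^2)) || N(0, sigma0^2 I)), one value per node
    return 0.5 * ((sigma**2 + mu**2) / sigma0**2 - 1
                  - 2 * torch.log(sigma / sigma0)).sum(dim=1)

def one_step(x, y, mu, rho, phi, lam, opt, tau=0.5):
    """One negative-ELBO step for a single layer with node selection.

    mu, rho: variational mean / pre-softplus scale, shape (k_out, k_in)
    phi:     logit of the variational inclusion probability, shape (k_out, 1)
    """
    opt.zero_grad()
    sigma = torch.nn.functional.softplus(rho)
    gamma = torch.sigmoid(phi)
    # Gumbel-softmax relaxed node indicators
    u = torch.rand_like(gamma)
    z_soft = torch.sigmoid((torch.log(gamma / (1 - gamma))
                            + torch.log(u / (1 - u))) / tau)
    # non-centered reparameterization of the slab
    w = mu + sigma * torch.randn_like(mu)
    pred = x @ (z_soft * w).t()                 # rows of pruned nodes are (softly) zeroed
    nll = 0.5 * ((y - pred) ** 2).sum()         # stands in for -log P_theta^n
    kl = kl_bernoulli(gamma, lam) + (gamma.squeeze(1) * kl_gaussian(mu, sigma)).sum()
    loss = nll + kl
    loss.backward()
    opt.step()
    return loss.item()

# toy usage with hypothetical shapes
k_in, k_out, n = 5, 3, 32
mu  = torch.zeros(k_out, k_in, requires_grad=True)
rho = torch.full((k_out, k_in), -6.0, requires_grad=True)
phi = torch.full((k_out, 1), 2.0, requires_grad=True)
opt = torch.optim.Adam([mu, rho, phi], lr=1e-3)
x, y = torch.randn(n, k_in), torch.randn(n, k_out)
print(one_step(x, y, mu, rho, phi, lam=torch.tensor(0.5), opt=opt))
```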
The relaxation of these network structure assumptions requires us to provide the framework for node selection including appropriate sieve construction together with the derivation of the results in Schmidt-Hieber (2020) customized to our problem. To establish the posterior contraction rates, we show that the variational posterior in (3.8) concentrates in shrinking Hellinger neighborhoods of the true density function P0 with overwhelming probability. Since X ∼ U [0, 1]p , thus f0 (x) = fθ (x) = 1. This further implies P0 = f0 (y|x)f0 (x) = f0 (y|x) and similarly Pθ = fθ (y|x). We next define the Hellinger neighborhood of the true density P0 as Hε = {θ : dH (P0 , Pθ ) < ε} where the Hellinger distance between the true density function P0 and the model density Pθ is Z p 1 p 2 d2H (P0 , Pθ ) = fθ (y|x) − f0 (y|x) dydx 2 We also define the KL neighborhood of the true density P0 as Nε = {θ : dKL (P0 , Pθ ) < ε} where the KL distance dKL between the true density function P0 and the model density Pθ 62 is Z f0 (y|x) dKL (P0 , Pθ ) = log f0 (y|x)dydx fθ (y|x) Let k = (k0 , · · · , kL+1 ) be the node vector, W l = (wl1 ⊤ ⊤ , · · · , wlk l+1 )⊤ be the row represen- tation of W l and w el = (||wl1 ||1 , · · · , ||wlkl+1 ||1 ) be the vector of L1 norms of the rows of W l . Next we consider layer-wise sparsity, s = (s1 , · · · , sL ) for node selection. Similarly, we consider layer-wise norm constraints, B = (B1 , · · · , BL ) on L1 norms of weights including bias incident onto any given node in each layer. Based on s and B, we define the following sieve of neural networks (check definition A.1.1). F(L, k, s, B) = {ηθ ∈ (3.2) : ||w el ||0 ≤ sl , ||w el ||∞ ≤ Bl } . (3.11) The construction of a sieve is one of the most important tools towards the proof of consistency in infinite-dimensional spaces. In the works of Schmidt-Hieber (2020), Polson and Ročková (2018), Chérief-Abdellatif (2020) and Bai et al. (2020), the sieve in the context of edge selection is given by F(L, k, s) = {ηθ ∈ (3.2) : ||θ||0 ≤ s, ||θ||∞ ≤ 1} . which works with an overall sparsity parameter s. In addition, note the L∞ norm of all the entries in θ is assumed to be known constant equal to 1 (see relation (4) in Schmidt-Hieber (2020) and section 4 in Polson and Ročková (2018)). Section 3 in Bai et al. (2020) does not explicitly mention the dependence of their sieve on some fixed bound B on the edges in a network, however, their derivations on covering numbers (see proof of Lemma 1.2 in the supplement of Bai et al. (2020)) borrow results from Schmidt-Hieber (2020) which is based on sieve with B = 1. Consider any sequence ϵn . For Lemmas 3.4.1 and 3.4.2, we use the sieve F(L, k, s, B) in (3.11) with s = s◦ and B = B ◦ where s◦l + 1 = nϵ2n /( Lj=0 uj ) and log Bl◦ = (nϵ2n )/((L + P 1) Lj=0 (s◦j + 1)) with ul = (L + 1)2 (log n + log(L + 1) + log kl+1 + log(kl + 1)). Note, s◦l and P Bl◦ do not depend on l. 63 Lemma 3.4.1 below holds when the covering number (check definition A.1.2) of the func- tions which belong to the sieve F(L, k, s◦ , B ◦ ) is well under control. Lemma 3.4.2 below states that for the same choice of the sieve, the prior gives sufficiently small probabilities on the complement space F(L, k, s◦ , B ◦ )c (see the discussion under Theorem 3.4.4 for more details). For the subsequent results, the symbol Ac is used to denote complement of a set A. Lemma 3.4.1 (Existence of Test Functions). Let ϵn → 0 and nϵ2n → ∞. 
There exists a testing function ϕ ∈ [0, 1] and constants C1 , C2 > 0, EP0 (ϕ) ≤ exp{−C1 nϵ2n } sup EPθ (1 − ϕ) ≤ exp{−C2 nd2H (P0 , Pθ )} θ∈Hϵcn ,ηθ ∈F (L,k,s◦ ,B ◦ ) where Hϵn = {θ : dH (P0 , Pθ ) ≤ ϵn } is the Hellinger neighborhood of radius ϵn . PL Lemma 3.4.2 (Prior mass condition.). Let ϵn → 0, nϵ2n → ∞ and nϵ2n / l=0 ul → ∞, then for Πe as in (3.6) and some constant C3 > 0, XL Π(F(L, e k, s◦ , B ◦ )c ) ≤ exp(−C3 nϵ2n / ul ) l=0 Whereas Lemmas 3.4.1 and 3.4.2 work with a specific choice of the sieve, the following Lemma 3.4.3 is developed for any generic choice of sieve indexed by s and B. The final piece of the theory developed next tries to addresses two main questions (1) Can we get a sparse network solution whose layer-wise sparsity levels and L1 norms of incident edges (including the bias) of the nodes are controlled at levels s and B respectively? (2) Does this sparse network retain the same predictive performance as the original network? In this direction, let ξ = minηθ ∈F (L,k,s,B) ||ηθ − η0 ||2∞ Based on the values s and B, we also define X L L X 2 ϑl = Bl /(kl + 1) + log Bm + L + log kl+1 + log(kl + 1) + log n + log( um ) m=0,m̸=l m=0 64 rl = sl (kl + 1)ϑl /n (3.12) Lemma 3.4.3 has two sub conditions. Condition 1. requires that shrinking KL neigh- borhood of the true density function P0 gets sufficiently large probability. This along with Lemma 3.4.1 and 3.4.2 is an essential condition to guarantee the convergence of the true posterior in (3.5). Condition 2. is the assumption needed to control the KL distance be- tween true posterior and variational posterior and thereby guarantees the convergence of the variational posterior in (3.8) (see the discussion under Theorem 3.4.4 for more details). PL → 0 and n( Ll=0 rl +ξ) → P Lemma 3.4.3 (Kullback-Leibler conditions). Suppose l=0 rl +ξ ∞ and the following two conditions hold for the prior Π e in (3.6) and some q ∈ QMF   X L 1. Π N Ll=0 rl +ξ ≥ exp(−C4 n( e P rl + ξ)) l=0 XZ XL 2. dKL (q, π) + n dKL (P0 , Pθ )q(θ, z)dθ ≤ C5 n( rl + ξ) z l=0 where π is the joint prior of (θ, z), q is the joint variational distribution of (θ, z) and NPLl=0 rl +ξ is the KL neighborhood of radius Ll=0 rl + ξ. P The following result shows that the variational posterior is consistent as long as Lemma 3.4.1, Lemma 3.4.2 and Lemma 3.4.3 hold. The proof of Theorem 3.4.4 demonstrates how the validity of these three lemmas imply variational posterior consistency. Theorem 3.4.4. Suppose Lemma 3.4.3 holds and Lemmas 3.4.1 and 3.4.2 hold for ϵn = qP ( Ll=0 rl + ξ) Ll=0 ul . Then for some slowly increasing sequence Mn → ∞, Mn ϵn → 0 P and Πe ∗ as in (3.10), e ∗ (Hc Π Mn ϵn ) → 0, n→∞ in P0n probability where HM c n ϵn = {θ : dH (P0 , Pθ ) ≤ Mn ϵn } is the Hellinger neighborhood of radius Mn ϵn . Note, the above contraction rate depends mainly on two quantities rl and ξ. Note rl controls the number of nodes in the neural network. If the network is not sparse, then rl is 65 kl+1 (kl + 1)ϑl /n instead of sl (kl + 1)ϑl /n which can in turn make the convergence of ϵn → 0 difficult. On the other hand, if sl and Bl are too small, it will cause ξ to explode since a good approximation to the true function may not exist in a very sparse space. Remark (Rates as a function of n). Let L ∼ O(log n), Bl2 ∼ O(kl + 1) and sl (kl + 1) = O(n1−2ϱ ), for some ϱ > 0, then one can work with ϵn = n−ϱ log3 (n) as long as ξ = O(n−2ϱ log2 (n)). The exact expression of ϱ is determined by the degree of smoothness of the function η0 . Proof of Theorem 3.4.4 Discussion. 
To further enunciate Lemmas 3.4.1 and 3.4.2 con- R sider the quantity E1n = Hc (Pθn /P0n )e π (θ)dθ as used in the following proof. Here, E1n can Mn ϵn be split into two parts Z Z E1n = (Pθn /P0n )e π (θ)dθ + (Pθn /P0n )e π (θ)dθ HMc ∩F (L,k,s◦ ,B ◦ ) c HM ∩F (L,k,s◦ ,B ◦ )c n ϵn n ϵn Whereas Lemma 3.4.1 provides a handle on the first term by controlling the covering number of the sieve F(L, k, s◦ , B ◦ ), Lemma 3.4.2 gives a handle on the second term by controlling Π(F(L, e k, s◦ , B ◦ )c ) (for more details we refer to Lemma A.2.6 in the Appendix A). R Next, consider the quantity E2n = log (Pθn /P0n )e π (θ)dθ in the following proof. Lemma 3.4.3 part 1. provides a control on this term (see Lemma A.2.7 in the the Appendix A for P R more details). Finally, consider the quantity E3n = dKL (q, π) + z log(P0n /Pθn )q(θ, z)dθ in the following proof. Indeed Lemma 3.4.3 part 2. provides a control on this term (see Lemma A.2.8 in the Appendix A for further details). Proof. Let Π e and Π e ∗ be as in (3.7) and (3.10) respectively. Now, e∗ (θ) e∗ (θ) Z Z π π dKL (e ∗ π ,πe(|D)) = π ∗ e (θ) log dθ + e∗ (θ) log π dθ A π e(θ|D) Ac π e(θ|D) e∗ (θ) e∗ (θ) Z Z ∗ π π e(θ|D) ∗ c π π e(θ|D) = −Π (A) e log ∗ dθ − Π (A ) e log ∗ dθ A Π e ∗ (A) π e (θ) Ac Π e ∗ (Ac ) π e (θ) e∗ e∗ c ≥Π e ∗ (A) log Π (A) + Π e ∗ (Ac ) log Π (A ) , Jensen’s inequality Π(A|D) e e c |D) Π(A 66 where the above lines hold for any set A. Since Π(A|D) e ≤ 1, ≥Π e ∗ (A) log Π e ∗ (A) + Π e ∗ (Ac ) log Πe ∗ (Ac ) − Π e ∗ (Ac ) log Π(A e c |D) ≥ −Π e ∗ (Ac ) log Π(A e c |D) − log 2, (∵ x log x + (1 − x) log(1 − x) ≥ − log 2) Z Z ! = −Π e ∗ (Ac ) log (Pθn /P0n )eπ (θ)dθ − log (Pθn /P0n )e π (θ)dθ − log 2 A c | {z } | {z } E1n E2n The above representation is similar to the proof of Theorems 3.1 and 3.2 in Bhattacharya and Maiti (2021). For any q ∈ QMF , −Πe ∗ (Ac )E1n ≤ dKL (e π∗, πe(|D)) − Π e ∗ (Ac )E2n + log 2 ≤ dKL (π ∗ , π(|D)) − Π e ∗ (Ac )E2n + log 2 by Lemma A.2.3 ≤ dKL (q, π(|D)) − Π e ∗ (Ac )E2n + log 2 π ∗ is the KL minimizer XZ Pn e ∗ (Ac ))E2n + log 2 ≤ dKL (q, π) + log 0n q(θ, z)dθ +(1 − Π z P θ | {z } E3n = E3n + (1 − Π e ∗ (Ac ))E2n + log 2 (3.13) where the fourth inequality in the above equation follows since dKL (q, π(|D)) XZ = (log q(θ, z) − log Pθn − log π(θ, z) + log m(D))q(θ, z)dθ z XZ XZ = (log q(θ, z) − log π(θ, z))q(θ, z)dθ + (log P0n − log Pθn )q(θ, z)dθ |z {z } z dKL (q,π) + log m(D) − log P0n | {z } E2n where m(D) is the marginal distribution of data as in (3.5). c Take A = HM n ϵn = {θ : dH (P0 , Pθ ) > Mn ϵn } If Lemma 3.4.1 and Lemma 3.4.2 hold, then by Lemma A.2.6, E1n ≤ −nCMn2 ϵ2n / P ul for any Mn → ∞ with high probability. 67 If Lemma 3.4.3 condition 1 holds, then by Lemma A.2.7, E2n ≤ nMn ( Ll=0 rl + ξ) for any P Mn → ∞ with high probability. If Lemma 3.4.3 condition 2 holds, then by Lemma A.2.8, E3n ≤ nMn ( Ll=0 rl + ξ) for any P Mn → ∞ with high probability. Therefore, by (3.13), we get L L nCMn2 ϵ2n e ∗ c  X X P Π HMn ϵn ≤ nMn ( rl + ξ) + nMn ( rl + ξ) + log 2 ul l=0 l=0 XL X L X L ≤ nMn ( rl + ξ) + nMn ( rl + ξ) + Mn ( rl + ξ) l=0 l=0 l=0 PL P l=0 rl + ξ)  3Mn ( ul =⇒ Π e ∗ HM c ≤ n ϵn C1 Mn2 ϵ2n qP L P Taking ϵn = l=0 (rl + ξ) ul and noting Mn → ∞, the proof follows. We next give conditions on the prior probabilities λl and σ0 to guarantee that Lemmas 3.4.1, 3.4.2 and 3.4.3 hold. This in turn implies the conditions of Theorem 3.4.4 hold and variational posterior is consistent. Corollary 3.4.5. 
Let $\sigma_0^2 = 1$ and $-\log \lambda_l = \log(k_{l+1}) + C_l (k_l + 1)\vartheta_l$; then the conditions of Theorem 3.4.4 hold and $\widetilde{\Pi}^*$ as in (3.10) satisfies

$$\widetilde{\Pi}^*\big(\mathcal{H}_{M_n\epsilon_n}^c\big) \to 0, \qquad n \to \infty,$$

in $P_0^n$ probability, where $\mathcal{H}_{M_n\epsilon_n} = \{\theta : d_H(P_0, P_\theta) \leq M_n\epsilon_n\}$ is the Hellinger neighborhood of radius $M_n\epsilon_n$.

The proof of the corollary is provided in Appendix A. In this corollary, note that our expression for the prior inclusion probability varies as a function of $l$, thereby providing a handle on layer-wise sparsity. Indeed, using these expressions in the numerical studies further substantiates the theoretical framework developed in this section.

Remark (Optimal Contraction). For a fixed choice of $k$, the optimal contraction rate is achieved at $s^\star, B^\star = \mathrm{argmin}_{s,B}\, (\sum_l r_l + \xi)$. Thus, $s^\star$ and $B^\star$ are the optimal values of $s$ and $B$ which give the best sparse network with minimal loss in the true accuracy. The corresponding probability expressions in Corollary 3.4.5 can be accordingly modified by setting $s = s^\star$ and $B = B^\star$ in the expressions of $\vartheta_l$ and $r_l$ in (3.12).

3.5 Numerical Experiments

In this section, we present several numerical experiments to demonstrate the performance of our spike-and-slab independent Gaussian (SS-IG) Bayesian neural networks, which we implement in PyTorch (Paszke et al., 2019). Further, to evaluate the efficacy of the variational inference, we benchmark our model on synthetic as well as real datasets. Our numerical investigation justifies the proposed choices of prior hyperparameters, specifically the layer-wise prior inclusion probabilities, which in turn substantiates the significance of our theoretical developments. With a fully Bayesian treatment, we are also able to quantify the uncertainties in the parameter estimates, and variational inference helps scale our model to large network architectures as well as complex datasets.

We compare our sparse model with a node selection technique, horseshoe BNN (HS-BNN) (Ghosh et al., 2019), and an edge selection technique, spike-and-slab BNN (SV-BNN) (Bai et al., 2020), in the second simulation study and the UCI regression dataset examples. We use the optimal choices of prior parameters and fine-tuning parameters provided by the authors of HS-BNN and SV-BNN for their respective models. Further, we compare our model against the dense variational BNN model (VBNN) (Blundell et al., 2015) in all of the experiments. Since it has no sparse structure, it serves as a baseline allowing us to check whether sparsity compromises accuracy.

In all the experiments, we fix $\sigma_0^2 = 1$ and $\sigma_e^2 = 1$. For our model, the choices of layer-wise $\lambda_l$ follow from Corollary 3.4.5: $\lambda_l = (1/k_{l+1})\exp(-C_l(k_l+1)\vartheta_l)$. We take $C_l$ values equal to negative powers of 10, chosen so that the prior inclusion probabilities do not fall below $10^{-50}$; otherwise, $\lambda_l$ values close to 0 might prune away all the nodes from a layer (see Appendix B for more discussion). The remaining tuning parameter details, such as the learning rate, minibatch size, and initial parameter choices, are provided in Appendix B. The prediction accuracy is calculated using the variational Bayes posterior mean estimator with 30 Monte Carlo samples in the testing phase.

Node sparsity estimates. In our experiments, we provide node sparsity estimates for each hidden layer separately. For all models, the node sparsity in a given hidden layer is the ratio of the number of neurons with at least one nonzero incoming edge to the original number of neurons present in that layer before training.
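As a concrete illustration of this metric, the following small helper computes the layer-wise node sparsity as defined above for a PyTorch linear layer. It is a sketch under stated assumptions: the bias is counted among the incoming edges of a node, and the `tol` threshold for "nonzero" is an assumption of the illustration.

```python
import torch

def node_sparsity(weight, bias=None, tol=0.0):
    """Fraction of output neurons with at least one nonzero incoming edge.

    weight: (k_out, k_in) tensor of a linear layer; bias: optional (k_out,).
    A neuron counts as active if any incoming weight (or its bias)
    has absolute value greater than tol.
    """
    active = (weight.abs() > tol).any(dim=1)
    if bias is not None:
        active |= bias.abs() > tol
    return active.float().mean().item()

# toy usage: two of four neurons fully pruned
w = torch.tensor([[0.0, 0.0], [1.2, -0.3], [0.0, 0.0], [0.5, 0.0]])
print(node_sparsity(w))   # 0.5
```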
The layer-wise node sparsity estimates give a clear picture of the structural compactness of the trained model at test time. A structurally compact trained model has lower latency at the inference stage.

3.5.1 Simulation Study - I

We consider a two-dimensional regression problem where the true response $y_0$ is generated by sampling $X$ from $U([-1, 1]^2)$ and feeding it to a deep neural network with known parameters. We add random Gaussian noise with $\sigma = 5\%\,\sqrt{\mathrm{Var}(y_0)}$ to $y_0$ to obtain the noisy outputs $y$. We create the dataset using a shallow neural network consisting of 2 inputs, one hidden layer with 2 nodes, and 1 output (a 2-2-1 network). We train our SS-IG model and the VBNN model using a single hidden layer network with 20 neurons in the hidden layer and a sigmoid activation. Each model is trained until convergence. We found that both models give competitive predictive performance while fitting the given data.

In Figure 3.2 we plot the magnitudes of the incoming weights into the hidden layer nodes using boxplots. Our model, with the help of the spike-and-slab prior, is able to prune away redundant nodes not required for fitting the model. Since VBNN is densely connected, all the nodes remain active in its final model. From this experiment, it is clear that neural networks can be pruned into more compact models at the inference stage without compromising accuracy. We also performed the same experiment with a wider neural network consisting of 100 nodes in the single hidden layer and provide the results in Appendix B. There again we show that our model can easily recover the sparse solution with competitive performance.

Figure 3.2 Simulation study I results. Node-wise weight magnitudes recovered by VBNN (panel a) and the proposed SS-IG model (panel b) on the synthetic regression data generated using the 2-2-1 network. The boxplots show the distribution of incoming weights into a given hidden layer node.

3.5.2 Simulation Study - II

We consider a nonlinear regression example and generate the data from the following model:

$$y = \frac{7x_2}{1 + x_1^2} + \sin(x_3 x_4) + 2x_5 + \varepsilon,$$

where $\varepsilon \sim N(0, 1)$. Further, all the covariates are i.i.d. $N(0, 1)$ and independent of $\varepsilon$. We generated 3000 observations to create the training data for the experiment. An additional 1000 observations were generated for testing. We modeled this data using a 2-hidden-layer neural network with 20 neurons per hidden layer. The sigmoid activation function is used for each model in the comparative analysis.

Table 3.1 provides the RMSEs on the train and test datasets as well as the layer-wise node sparsity estimates for the SS-IG, SV-BNN, HS-BNN, and VBNN models. Our model is extremely effective at pruning redundant nodes, which leads to the most compact model compared to the other sparse models, SV-BNN and HS-BNN. Moreover, it exhibits lower root mean squared error (RMSE) values on test data among the sparse models while showing similar predictive performance compared to the densely connected VBNN. This experiment further underscores the major benefit of our proposed approach: generating very compact models, which can reduce computation time and memory usage at the inference stage.

Table 3.1 Simulation study II results. Performance of the proposed SS-IG, SV-BNN, HS-BNN, and VBNN models, where each model was trained for 10k epochs with learning rate $5 \times 10^{-3}$. Means and standard deviations of the RMSE values and the median sparsity estimates were calculated from the last 1000 epochs (with a jump of 10, giving a sample of 100). The sparsity estimates are given as a tuple of 2 values representing the layer-1 and layer-2 node sparsities.
Model     Train RMSE        Test RMSE         Sparsity Estimate
SS-IG     1.2087 ± 0.0490   1.1947 ± 0.0587   (0.35, 0.05)
SV-BNN    1.2897 ± 0.0323   1.2760 ± 0.0363   (0.45, 0.35)
HS-BNN    1.2580 ± 0.0305   1.2436 ± 0.0394   (1.00, 1.00)
VBNN      1.1661 ± 0.0335   1.1614 ± 0.0349   NA

3.5.3 UCI Regression Datasets

We apply our model to traditional UCI regression datasets (Dua and Graff, 2017) and contrast our performance against the SV-BNN, HS-BNN, and VBNN models. We follow the protocol proposed by Hernandez-Lobato and Adams (2015) and train a single hidden layer neural network with sigmoid activations. For the smaller datasets - Concrete, Wine, Power Plant, and Kin8nm - we take 50 nodes in the hidden layer, while for the larger datasets - Protein and Year - we take 100 nodes in the hidden layer. We split the data randomly while maintaining a 9:1 train-test ratio in each case, and for the smaller datasets we repeat this procedure 20 times. For the Protein data we perform 5 repetitions, while for the Year data we use a single random split (more details in Appendix B). For the comparative analysis, we benchmark against SV-BNN, HS-BNN, and VBNN. Moreover, the VBNN test RMSEs serve as the baseline for each dataset.

Table 3.2 summarises our results, including the sparsity estimate representing the hidden layer-1 node sparsity (since there is only one hidden layer in the networks considered). We achieve lower RMSEs compared to SV-BNN and HS-BNN on the Power Plant, Kin8nm, and Year datasets, and in the other cases we achieve comparable RMSE values. In all the datasets, our predictive performance is close to the dense baseline of VBNN.

Table 3.2 UCI regression datasets results.

                             Test RMSE                                        Sparsity Estimate
Dataset      n (k0)        SS-IG       SV-BNN      HS-BNN      VBNN          SS-IG       SV-BNN
Concrete     1030 (8)      7.92±0.68   8.22±0.70   5.34±0.53   7.34±0.62     0.42±0.06   0.98±0.02
Wine         1599 (11)     0.66±0.05   0.65±0.05   0.66±0.05   0.64±0.05     0.18±0.05   0.87±0.04
Power Plant  9568 (4)      4.28±0.20   4.32±0.19   4.34±0.18   4.27±0.17     0.18±0.03   0.24±0.03
Kin8nm       8192 (8)      0.09±0.00   0.11±0.01   0.10±0.00   0.09±0.00     0.43±0.04   0.47±0.04
Protein      45730 (9)     4.85±0.05   4.93±0.06   4.59±0.02   4.78±0.06     0.81±0.03   0.93±0.03
Year         515345 (90)   8.68±NA     8.78±NA     9.33±NA     8.67±NA       0.71±NA     0.78±NA

We provide node sparsity estimates for our SS-IG model and for SV-BNN. HS-BNN was not able to achieve a sparse structure, which is consistent with the results provided in the appendix of Ghosh et al. (2019). In contrast to HS-BNN, our model sparsifies the network during training without requiring an ad-hoc pruning rule. Table 3.2 demonstrates that our approach uniformly achieves better sparsity than SV-BNN. In particular, the Concrete and Wine datasets show the high compressive ability of our model over SV-BNN, leading to very compact models for inference.

3.5.4 Image Classification Datasets

Here, we benchmark the empirical performance of our proposed SS-IG method on network architectures and image classification datasets used in practice.

Baselines. We compare our model against the VBNN model, which serves as a dense baseline to gauge the trade-off between predictive performance and sparsity. Moreover, to highlight the complementary behavior in memory and computational efficiency of node selection compared to edge selection achieved via the Bayesian spike-and-slab prior framework, we compare our model against the edge selection model, SV-BNN.

Network architectures. We consider 2 neural network model architectures: (i) a multi-layer perceptron (MLP), and (ii) LeNet-5-Caffe.
In the MLP model, we take 2 hidden layers with 400 neurons in each layer. The output layer has 10 neurons since there are 10 classes in both datasets. Next, the LeNet-5-Caffe model has 2 convolutional layers with 20 and 50 feature maps, respectively, with filter size 5×5 for both layers. In the SS-IG model, for convolutional layers, we prune output channels (analogous to neurons in linear layers) using our spike-and-slab prior, where each output channel is assigned a Bernoulli variable to collectively prune the parameters incident on that channel. We apply a 2×2 max pooling layer after each convolutional layer. The flattened feature layer after the second convolutional layer has size 4 × 4 × 50 = 800 and serves as input to the fully connected block, where there are 2 hidden layers with 800 and 500 neurons, respectively. The output layer has 10 neurons.

Datasets. We apply each network architecture to 2 image classification datasets: (i) MNIST, a dataset of 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9, and (ii) Fashion-MNIST (Xiao et al., 2017), a dataset of 60,000 small square 28×28 pixel grayscale images of items of 10 types of clothing. We preprocess the images in the MNIST data by dividing their pixel values by 126. In the Fashion-MNIST data, we horizontally flip images at random with probability 0.5.

Metrics. We quantify the predictive performance using the accuracy on the test data (MNIST and Fashion-MNIST). Besides the test accuracy, we evaluate our model against SV-BNN using metrics that relate to model compression and computational complexity. First, the compression ratio is the ratio of the number of nonzero weights in the compressed network versus the dense model and is an indicator of the storage cost at test time. Next, we present layer-wise node sparsities in the MLP experiments to highlight the computational speedups at test time. In the LeNet-5-Caffe experiments, we provide the floating point operations (FLOPs) ratio, which is the ratio of the number of FLOPs required to predict y from x at test time in the compressed network versus its dense counterpart. We have detailed the FLOPs calculation for neural networks in Appendix B.

Nonlinear activation. We use swish activations (Elfwing et al., 2018; Ramachandran et al., 2017) instead of ReLUs in our proposed SS-IG model to avoid the dying neuron problem (Lu et al., 2020). Specifically, in large-scale datasets, turning off a node with more than 100 incoming edges adversely impacts the training process of ReLU networks. Smoother activation functions such as sigmoid, tanh, and swish help alleviate this problem. We choose swish since it has the best performance. For VBNN and SV-BNN, we use ReLU activations as recommended by their authors.

MLP Experiments

The results of the MLP network experiments on MNIST and Fashion-MNIST are presented in Figure 3.3. We provide the test data accuracy, model compression ratio, and layer-wise node sparsities for each experiment. In the MLP/MNIST experiment (Figures 3.3a - 3.3d), we observe that the VBNN and SS-IG models only require ∼400 epochs to achieve stable predictive performance (Figure 3.3a). In contrast, SV-BNN degrades slightly after 600 epochs and takes longer to achieve convergence in layer-wise node sparsities compared to our approach (Figures 3.3c and 3.3d).
Moreover, for the SS-IG model, we observe that as we start to learn the sparse network, the model shows peak test accuracy when most of the nodes are present; the accuracy starts to drop as we learn a sparser network and ultimately stabilizes when the node sparsities converge. Furthermore, SV-BNN has a better model compression ratio (Figure 3.3b) in this experiment at the expense of lower predictive performance. Our method prunes off ∼80% of the first hidden layer nodes and ∼90% of the second hidden layer nodes at the expense of ∼2% accuracy loss due to sparsification compared to the dense VBNN.

In the MLP/Fashion-MNIST experiment (Figures 3.3e - 3.3h), we observe that the VBNN model takes ∼200 epochs and our model takes ∼600 epochs for convergence. The SV-BNN model takes longer to achieve convergence in layer-wise node sparsities (Figures 3.3g and 3.3h). We also observe the complementary behavior of our model and SV-BNN in memory and computational efficiency, where our model achieves better layer-wise node sparsities and SV-BNN has a better model compression ratio (Figure 3.3f), with both models having similar predictive performance (Figure 3.3e). Furthermore, our method prunes off ∼90% of the first hidden layer nodes and ∼92% of the second hidden layer nodes at the expense of ∼3% accuracy loss due to sparsification compared to the densely connected VBNN.

Figure 3.3 MLP/MNIST and MLP/Fashion-MNIST experiments results. The first two rows, panels (a)-(d), show the MLP on MNIST results (test accuracy, compression ratio, layer-1 node sparsity, layer-2 node sparsity); the bottom two rows, panels (e)-(h), show the corresponding MLP on Fashion-MNIST results.

LeNet-5-Caffe Experiments

The results of the more complex LeNet-5-Caffe network experiments on MNIST and Fashion-MNIST are presented in Figure 3.4. We provide the test data accuracy, model compression ratio, and FLOPs ratio for each experiment over 1200 epochs. Here, the FLOPs ratio serves as a collective indicator of the layer-wise node sparsities since FLOPs are directly related to how many neurons or channels remain in the linear or convolutional layers, respectively.

In the LeNet-5-Caffe/MNIST experiment (Figures 3.4a - 3.4c), we observe that our model has better predictive accuracy than SV-BNN (Figure 3.4a). Moreover, we achieve 10% more reduction in FLOPs (Figure 3.4c) compared to SV-BNN, whereas SV-BNN achieves better model compression than our approach (Figure 3.4b). Lastly, our method is able to reduce the FLOPs of the model during inference at test time by 90% at the expense of ∼0.5% accuracy loss due to sparsification compared to the densely connected VBNN.

Figure 3.4 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results. The top row, panels (a)-(c), shows the LeNet-5-Caffe/MNIST results (test accuracy, compression ratio, FLOPs ratio); the bottom row, panels (d)-(f), shows the corresponding LeNet-5-Caffe/Fashion-MNIST results.

In the LeNet-5-Caffe/Fashion-MNIST experiment (Figures 3.4d - 3.4f), we observe that both SS-IG and SV-BNN have similar test accuracies at convergence (Figure 3.4d). However, our model has 40% fewer FLOPs (Figure 3.4f) at the inference stage compared to SV-BNN, which again achieves better model compression (Figure 3.4e).
This highlights the complementary nature of our method of node selection that leads to a structurally sparse model with sig- nificantly lower (almost 5 times) FLOPs compared to weight pruning approach, SV-BNN, which induces unstructured sparsity in the pruned network leading to significant model com- pression with low storage cost. Lastly, our method leads to a sparse model with only 8% of the FLOPs as compared to VBNN at the expense of ∼ 3% accuracy loss underscoring the trade-off between predictive accuracy and sparsity. 3.6 Conclusion and Discussion In this chapter, we have proposed sparse deep Bayesian neural networks using spike-and- slab priors for optimal node recovery. Our method incorporates layer-wise prior inclusion probabilities and recovers underlying structurally sparse model effectively. Our theoretical developments highlight the conditions required for the posterior consistency of the variational posterior to hold. With layer-wise characterisation of prior inclusion probabilities we show that the proposed sparse BNN approximations can achieve predictive performance compa- rable to dense networks. Our results relax the constraints of equal number of nodes and uniform bounds on weights thereby achieving optimal node recovery on more generic neural network structure. The closeness of a true function to the topology induced by layer-wise node distribution depends on the degree of smoothness of the true underlying function. In this work, this has not been studied in depth and forms a future direction for work. 78 Note, in contrast to previous works, our work assumes a spike-and-slab prior on the entire vector of incoming weights and bias onto a node. We underscore the fact that node selection has complementary behavior with edge selection approaches as established by our empirical experiments. Node selection offers significant computational speedup whereas edge selection achieves significant model compression at test-time. The demonstration of the efficacy of our node selection approach opens the avenue for exploration of sophisticated group sparsity priors for node selection. Our detailed experiments show the subnetwork selection ability of our method which underscores the notion that deep neural networks can be heavily pruned without losing predictive performance. The experiment with convolution neural network (LeNet-5-Caffe) highlights the generalizability of our approach from mere multi layer perceptron to complex deep learning models. Although our method performs model reduction while maintaining predictive power, some further improvements may be obtained by choosing the number of layers in a data-driven fashion and can be a part of future work. 79 APPENDICES 80 APPENDIX A PROOFS OF SS-IG THEORETICAL RESULTS A.1 Definitions Definition A.1.1 (Sieve). Consider a sequence of function classes F1 ⊆ F2 ⊆ · · · ⊆ Fn ⊆ Fn+1 ⊆ · · · ⊆ F, where ∀f ∈ F, ∃ fn ∈ Fn s.t. d(f, fn ) → 0 as n → ∞ where d(., .) is some pseudo-metric on F. More precisely, ∪∞ n=1 Fn is dense in F. Then, Fn is called a sieve space of F with respect to the pseudo-metric d(., .), and the sequence {fn } is called a sieve (Grenander, 1981). Definition A.1.2 (Covering number). Let (V, ||.||) be a normed space, and F ⊂ V . Then, {V1 , · · · , VN } is an ε−covering of F if F ⊂ ∪N i=1 B(Vi , ε), or equivalently, ∀ ϱ ∈ F, ∃ i such that ||ϱ − Vi || < ε. The covering number of F denoted by N (ε, F, ||.||) = min{n : ∃ ε − covering over F of size n} (Pollard, 1991). A.2 General Lemmas Lemma A.2.1. 
Let g1 and g2 be any two density functions. Then Eg1 (|log(g1 /g2 )|) ≤ dKL (g1 , g2 ) + 2/e Proof. Refer to Lemma 4 in Lee (2000). Lemma A.2.2. For any K > 0, let a, a0 ∈ [0, 1]K such that K P PK 0 k=1 ak = k=1 ak = 1, then the KL divergence between mixture densities K P PK 0 0 k=1 ak gk and k=1 ak gk is bounded as K K ! K X X X dKL a0k gk0 , ak gk ≤ dKL (a0 , a) + a0k dKL (gk0 , gk ) k=1 k=1 k=1 Proof. Refer to Lemma 6.1 in Chérief-Abdellatif and Alquier (2018). 81 Lemma A.2.3. dKL (eπ∗, πe(|D)) ≤ dKL (π ∗ , π(|D)) Proof. Using Lemma A.2.2 with a0 = π ∗ (z), a = π(z|D), g 0 = π ∗ (θ|z) and g = π(θ|z, D), we get X X π∗, π dKL (e e(|D)) = dKL ( π ∗ (θ|z)π ∗ (z), π(θ|z, D)π(z|D)) z z X ≤ dKL (π ∗ (z), π(z|D)) + dKL (π ∗ (θ|z), π(θ|z, D))π ∗ (z) z = dKL (π ∗ (θ, z), π(θ, z|D)) = dKL (π ∗ , π(|D)) Lemma A.2.4. For any 1-Lipschitz continuous activation function ψ such that ψ(x) ≤ x ∀x ≥ 0, "L !s l # X X Y Bl N (δ, F(L, k, s, B), ||.||∞ ) ≤ ··· QL kl+1 s∗L ≤sL s∗0 ≤s0 l=0 δBl /(2(L + 1)( j=0 Bj )) where N denotes the covering number. Proof. Given a neural network η(x) = vL + WL ψ(vL−1 + WL−1 ψ(vL−2 + WL−2 ψ(· · · ψ(v1 + W1 ψ(v0 + W0 x))) for l ∈ {1, · · · , L}, we define A+ p l η : [0, 1] → R , kl A+l η(x) = ψ(vl−1 + Wl−1 ψ(vl−2 + Wl−2 ψ(· · · ψ(v1 + W1 ψ(v0 + W0 x))) and A− l η : R kl−1 → RkL+1 , A−l η(y) = vL + WL ψ(vL−1 + WL−1 ψ(· · · ψ(vl + Wl ψ(vl−1 + Wl−1 y))) The above framework is also used in the proof of lemma 5 in Schmidt-Hieber (2020). Next, − Ql−1 set A+0 η(x) = AL+2 η(x) = x and further note that for η ∈ F(L, k), |Al η(x)|∞ ≤ + j=0 Bj 82 where k = (p, k1 , · · · , kL , kL+1 ) and kL+1 = 1. Next, we derive upper bound on Lipschitz constant of A− l η. − − |WL A+ + + L η(x1 ) − WL AL η(x2 )|∞ = |Al η(Al−1 η(x1 )) − Al η(Al−1 η(x2 ))|∞ + (A.1) QL l.h.s. is bounded above by j=0 Bj and r.h.s consists of composition of Lipschitz functions A− + l η and Al−1 η with C1 and C2 being corresponding Lipschitz constants. So we can bound r.h.s. by, |A− + − + l η(Al−1 η(x1 )) − Al η(Al−1 η(x2 ))|∞ ≤ C1 C2 ||x1 − x2 ||∞ ∀x1 , x2 ∈ Rp If we choose x1 = x ∈ [0, 1]p and x2 = 0 then, |A− + − + l η(Al−1 η(x)) − Al η(Al−1 η(0))|∞ ≤ C1 C2 ∀x ∈ [0, 1]p Ql−2 Since C2 is Lipschitz constant for A+ l−1 η and we know that |Al−1 η|∞ ≤ + j=0 Bj . So we Ql−2 get C2 ≤ 2 j=0 Bj . We use this in above expression, l−2 Y |A− + − + l η(Al−1 η(x)) − Al η(Al−1 η(0))|∞ ≤ 2C1 Bj ∀x ∈ [0, 1]p (A.2) j=0 QL Next we know that l.h.s. of (A.2) can be bounded above by 2 j=0 Bj because of (A.1). So we get bound on Lipschitz constant of A− l η, Yl−2 YL Y L 2C1 Bj ≤ 2 Bj =⇒ C1 ≤ Bj j=0 j=0 j=l−1 ∗ Let η, η ∗ ∈ F(L, k, s, B) be two neural networks with W l = (vl , Wl ) and W l = (vl∗ , Wl∗ ) ∗ respectively. Here, we define δ l using the L1 norms of the rows of D l = W l − W l as follows ⊤ ⊤ D l = (dl1 , · · · , dlkl+1 )⊤ δ l = (||dl1 ||1 , · · · , ||dlkl+1 ||1 ) We choose η, η ∗ such that ||δ l ||∞ ≤ ζBl . This also means that all parameters in each layer of these two networks are at most ζBl distance away from each other. Then, we can bound the absolute difference between these two neural networks by, |η(x) − η ∗ (x)| 83 L+1 X ≤ |A− + ∗ − ∗ l+1 η(ψ(vl−1 + Wl−1 Al−1 η (x))) − Al+1 η(ψ(vl−1 + Wl−1 Al−1 η (x)))| ∗ + ∗ l=1 L+1 L ! X Y ∗ ∗ ∗ ∗ ≤ Bj ||ψ(vl−1 + Wl−1 A+ + l−1 η (x)) − ψ(vl−1 + Wl−1 Al−1 η (x))||∞ l=1 j=l L+1 L ! X Y ∗ ∗ ∗ ≤ Bj ||vl−1 − vl−1 + (Wl−1 − Wl−1 )A+l−1 η (x))||∞ l=1 j=l L+1 Y L ! X ∗ ≤ Bj ||δ l−1 ||∞ ||A+ l−1 η (x))||∞ l=1 j=l L+1 L ! l−2 L ! 
X Y Y Y ≤ Bj ζBl−1 Bj = ζ(L + 1) Bj (A.3) l=1 j=l j=0 j=0 kl+1 sl  Recall that we have at most kl number of nodes in each layer and there are sl ≤ kl+1 combinations of nodes to choose sl active nodes in the given layer. Since supremum norm of L1 norms of the rows of Wl is bounded above by Bl in our family of neural networks F(L, k, s, B) so we can discretize these L1 norms with grid size δBl /(2(L + 1)( Lj=0 Bj )) Q and obtain upper bound on covering number as follows "L !s l # X X Y Bl N (δ, F(L, k, s, B), ||.||∞ ) ≤ ··· QL kl+1 ∗ sL ≤sL ∗ s0 ≤s0 l=0 δB l /(2(L + 1)( j=0 B j )) L L ! !(sl +1) Y Y ≤ 2δ −1 (L + 1) Bj kl+1 (A.4) l=0 j=0 Lemma A.2.5. Let θ ∗ = arg min |ηθ − η0 |2∞ and W fl = supi ||wli − w∗li ||1 , then for any θ∈F (L,k,s,B) density q = Lj=0 q(θj ), Q Z X L Z L Y Z ||ηθ − ηθ∗ ||22 q(θ)dθ ≤ c2j−1 fj2 qj (θj )dθj W fm + Bm )2 q(θ)dθ (W j=0 m=j+1 L Xj−1 Z L Z X Y +2 cj−1 c j ′ −1 W fj (W fj + Bj )qj (θj )dθj fm + Bm )2 q(θ)dθ (W j=0 j ′ =0 m=j+1 Z j−1 Y Z × W fj ′ qj ′ (θj ′ )dθj ′ (W fm + Bm )q(θ)dθ (A.5) m=j ′ +1 Qj−1 where cj−1 ≤ m=0 Bm . 84 Proof. Let ηθl be the partial networks defined as  ηθ0 (x) := ψ(W0 x + v0 ),         ηθl (x) := ψ(Wl ηθl−1 (x) + vl ),    ηθL (x) := WL ηθL−1 (x) + vL .   Similar to the proof of theorem 2 in Chérief-Abdellatif (2020), define φl (θ) = sup sup |ηθl (x)i − ηθl ∗ (x)i |. x∈[0,1]p 1≤i≤kl+1 We next show by induction X l φl (θ) ≤ fj cj−1 Rl W j+1 j=0 Ql where we define cl = max(supx∈[0,1]p sup1≤i≤kl+1 |ηθl ∗ (x)i |, 1), c0 = 1, Rj+1 l = m=j+1 (Wm f + Bm ). Claim: cl ≤ Bl cl−1 . Note cl ≤ sup sup (|wli∗ ⊤ ηθl−1 ∗ (x)| + |vli |) x∈[0,1]p 1≤i≤kl+1 Xkl ∗ ≤ sup sup ( |wlij ||ηθl−1 ∗ (x)j | + |vli |) x∈[0,1]p 1≤i≤kl+1 j=1 Xkl ∗ ≤ sup (cl−1 |wlij | + cl−1 |vli |) 1≤i≤kl+1 j=1 ≤ cl−1 sup ||w∗li ||1 = Bl cl−1 1≤i≤kl+1 where the above result holds since supi ||w∗li ||1 ≤ Bl . Next, Xkl φl (θ) ≤ sup sup ( |wlij ηθl−1 (x)j − wlij ηθ∗ (x)j | + |vli − vli∗ |) ∗ l−1 x∈[0,1]p 1≤i≤kl+1 j=1 Xkl ∗ l−1 ≤ sup sup ( |wlij ηθl−1 (x)j − wlij ηθ (x)j | x∈[0,1]p 1≤i≤kl+1 j=1 + |wlij ∗ l−1 ηθ (x)j − wlij ηθ∗ (x)j | + |vli − vli∗ |) ∗ l−1 Xkl ∗ ≤ sup sup ( |wlij − wlij ||ηθl−1 (x)j | x∈[0,1]p 1≤i≤kl+1 j=1 85 X kl ∗ ∗ + |wlij ||ηθl−1 (x)j − ηθl−1 ∗ (x)j | + |vli − vli |) j=1 Xkl ∗ ≤ sup sup ( |wlij − wlij ||ηθl−1 (x)j − ηθl−1 ∗ (x)j | x∈[0,1]p 1≤i≤kl+1 j=1 X kl ∗ ∗ + |wlij − wlij ||ηθl−1 ∗ (x)j | + |vli − vli |) + φl−1 (θ)Bl j=1 ≤W fl (φl−1 (θ) + cl−1 ) + φl−1 (θ)Bl = φl−1 (θ)(W fl + Bl ) + cl−1 Wfl Now applying recursion we get φl (θ) ≤ (φl−2 (θ)(W fl−1 + Bl−1 ) + cl−2 W fl−1 )(W fl + Bl ) + cl−1 W fl = φl−2 (θ)(W fl + Bl )(W fl−1 + Bl−1 ) + cl−2 W fl−1 (W fl + Bl ) + cl−1 W fl Repeating this we get Y l X l Yl φl (θ) ≤ φ0 (θ) (Wj + Bj ) + f cj−1 Wj f (Wfj + Bj ) j=1 j=1 u=j+1 Y l X l Yl =W f0 (W fj + Bj ) + B1 · · · Bj−1 W fj (Wfj + Bj ) j=1 j=1 u=j+1 X l Y l Xl = B1 · · · Bj−1 W fj (W fj + Bj ) = fj cj−1 Rl W j+1 j=0 u=j+1 j=0 Z Z Z ||ηθ − ηθ∗ ||22 q(θ)dθ ≤ ||ηθ − ηθ∗ ||2∞ q(θ)dθ = φ2L (θ)q(θ)dθ Z X L = ( fj cj−1 RL )2 q(θ)dθ W j+1 j=0 L Z L X j−1 Z X X = c2j−1 fj2 (Rj+1 W L )2 q(θ)dθ +2 cj−1 cj ′ −1 W fj W L fj ′ Rj+1 RjL′ +1 q(θ)dθ j=0 j=0 j ′ =0 L Z L !2 X Y = c2j−1 fj2 W (Wfm + Bm ) q(θ)dθ j=0 m=j+1 L X j−1 Z L L X Y Y +2 cj−1 cj ′ −1 W fj W fj ′ (W fm + Bm ) (W fm + Bm )q(θ)dθ j=0 j ′ =0 m=j+1 m=j ′ +1 QL The proof follows by noting q(θ) = j=0 q(θj ). 86 Lemma A.2.6. Suppose Lemma 3.4.1 and Lemma 3.4.2 in the Section 3.4 hold, with dom- inating probability Pθn Cnϵ2n Z log π(θ)dθ ≤ − P0n P Hϵcn ul PL PL Proof. 
Let Fn = F(L, k, s◦ , B ◦ ), s◦l + 1 = nϵ2n / j=0 uj , log Bl◦ = nϵ2n /((L + 1) ◦ j=0 (sj + 1)) and Hϵn = {θ : dH (P0 , Pθ ) < ϵn } is the Hellinger neighborhood of size ϵn Pθn Pθn Pθn Z Z Z e(θ)dθ ≤ π π e(θ)dθ + π e(θ)dθ Hϵcn P0n Hϵcn ∩Fn P0n Fnc P0 n Pθn (C0 /2)nϵ2n Z   ≤ e(θ)dθ + exp − P π Hϵcn ∩Fn P0n ul where the last inequality follows from Lemma 3.4.2 because by Markov’s inequality Pθn (C0 /2)nϵ2n Z   PP0n n e(θ)dθ > exp − P π Fnc P0 ul 2 n   Z  (C0 /2)nϵn Pθ ≤ exp P EP0n n π e(θ)dθ ul Fnc P0 (C0 /2)nϵ2n e c (C0 /2)nϵ2n     ≤ exp P Π(Fn ) = exp − P →0 ul ul Further, Pθn Pθn Pn Z Z Z e(θ)dθ ≤ π ϕ nπ e(θ)dθ + (1 − ϕ) θn π e(θ)dθ Hϵcn ∩Fn P0n Hϵcn ∩Fn P0 Hϵcn ∩Fn P0 | {z } | {z } T1 T2 Next, borrowing steps from proof of theorem 3.1 in Pati et al. (2018), we have EP0n (ϕ) ≤ exp(−C1 nϵ2n ), thus for any C1′ < C1 , ϕ ≤ exp(−C1′ nϵ2n ) with probability at least 1 − exp(−(C1 − C1′ )nϵ2n ). Thus, T1 ≤ exp(−C1′ nϵ2n )T1 + T2 which implies with dominating probability T1 ≤ T2 . Thus, it only remains to show T2 ≤ exp(−C2′ (nϵ2n )/( ul )) for some C2′ > 0. This is true since P C2 nϵ2 nϵ2 C2 nϵ2 Z − P n C2 P un P n PP0n (T2 > e ul )≤e l EP0n (T2 ) ≤ e ul EPθ (1 − ϕ)e π (θ)dθ Hϵcn ∩Fn 87 C2 nϵ2 Z P n 2 ≤e ul e−C2 ndH (P0 ,Pθ ) π e(θ)dθ Hϵcn ∩Fn C2 nϵ2 Z P n −C2 nϵ2n X ≤e ul e e(θ)dθ ≤ exp(−C2′ nϵ2n / π ul ) Hϵcn ∩Fn Therefore, for sufficiently large n and C = min(C0 /2, C2′ )/2 Pθn Z X X X ′ 2 2 2 n π e (θ)dθ ≤ 2exp(−C 2 nϵn / u l ) + exp(−(C 0 /2)nϵ n / u l ) ≤ exp(−Cnϵ n / ul ) Hϵcn P0 Lemma A.2.7. Suppose Lemma 3.4.3 part 1. in the Section 3.4 holds, then for any Mn → ∞ , with dominating probability, P0n Z X log π (θ)dθ ≤ nM n ( rl + ξ) Pθn e Proof. By Markov’s inequality,  Z n  Z n P0 X 1 Pθ PP0n log n e(θ) ≥ nMn ( π rl + ξ) ≤ EP0n log π e(θ)dθ P0n P Pθ nMn ( rl + ξ) Z Z n 1 Pθ = P log n e(θ)dθ P0n dµ π nMn ( rl + ξ) P0   1 n ∗ 2 ≤ P dKL (P0 , L ) + nMn ( rl + ξ) e where L∗ = Pθn π R e(θ)dθ and the last inequality follows from Lemma A.2.1. ! P0n P0n   n ∗ dKL (P0 , L ) = EP0n log R n ≤ EP0n log R Pθ π e(θ)dθ NP rl +ξ θ P nπe(θ)dθ Z Z ≤ π e(θ)dθ + dKL (P0n , Pθn )eπ (θ)dθ Jensen’s inequality NP rl +ξ NP rl +ξ P X X ≤ − log e−nC( rl +ξ) + n( rl + ξ) = n(C + 1)( rl + ξ) where the last inequality follows from Lemma 3.4.3 part 1. in the Section 3.4. The proof follows by noting C/Mn → 0. Lemma A.2.8. Suppose Lemma 3.4.3 part 2. in the Section 3.4 holds, then for any Mn → ∞ , with dominating probability, XZ P0n X dKL (q, π) + log q(θ, z)dθ ≤ nM n ( rl + ξ) z Pθn 88 Proof. By Markov’s inequality we have ! XZ P0n X PP0n dKL (q, π) + q(θ, z) log n dθ > nMn ( rl + ξ) z Pθ ! 1 XZ P0n ≤ P dKL (q, π) + EP0n q(θ, z) log n dθ nMn ( rl + ξ) z Pθ !! 1 XZ Pθn ≤ P dKL (q, π) + EP0n q(θ, z) log n dθ nMn ( rl + ξ) z P0 ! XZ P0n n Z 1 = P dKL (q, π) + q(θ, z) log n P0 dµdθ nMn ( rl + ξ) z Pθ By Lemma A.2.1, we get XZ   ! 1 2 ≤ P dKL (q, π) + q(θ, z) dKL (P0n , Pθn ) + dθ nMn ( rl + ξ) z e ! 1 XZ 2 = P dKL (q, π) + n q(θ, z)dKL (P0 , Pθ )dθ + nMn ( rl + ξ) z e C  X  = P n( rl + ξ) + (2/e) → 0 nMn ( rl + ξ) where the last line in the above holds due to Lemma 4.3 part 2. in the Section 3.4. A.3 Proof of Lemmas and Corollary in the Section 3.4 Proof of Lemma 3.4.1 PL PL Take s◦l + 1 = (nϵ2n )/( j=0 uj ) and log Bl◦ = (nϵ2n )/((L + 1) ◦ j=0 (sj + 1)). We know from Lemma 2 of Ghosal and Van Der Vaart (2007) that, there exists a function φ ∈ [0, 1], such that EP0 (φ) ≤ exp{−nd2H (Pθ1 , P0 )/2} EPθ (1 − φ) ≤ exp{−nd2H (Pθ1 , P0 )/2} for all Pθ ∈ F(L, k, s◦ , B ◦ ) satisfying dH (Pθ , Pθ1 ) ≤ dH (P0 , Pθ1 )/18. 
Let H = N (ϵn /19, F(L, k, s◦ , B ◦ ), dH (., .)) denote the covering number of F(L, k, s◦ , B ◦ ), i.e., there exist H Hellinger balls of radius ϵn /19, that entirely cover F(L, k, s◦ , B ◦ ). For 89 any θ ∈ F(L, k, s◦ , B ◦ ) w.l.o.g we assume Pθ belongs to the Hellinger ball centered at Pθh and if dH (Pθ , P0 ) > ϵn , then we must have that dH (P0 , Pθh ) > (18/19)ϵn and there exists a testing function φh , such that EP0 (φh ) ≤ exp{−nd2H (Pθh , P0 )/2} ≤ exp{−((182 /192 )/2)nϵ2n } EPθ (1 − φh ) ≤ exp{−nd2H (Pθh , P0 )/2} ≤ exp{−n(dH (P0 , Pθ ) − ϵn /19)2 /2} ≤ exp{−((182 /192 )/2)nd2H (P0 , Pθ )}. Next we define ϕ = maxh=1,··· ,H φh . Then we must have X EP0 (ϕ) ≤ EP0 (φh ) ≤ Hexp{−((182 /192 )/2)nϵ2n } h ≤ exp{−((182 /192 )/2)nϵ2n − log H} Using Lemma A.2.4 with s = s◦ and B = B ◦ , we get log H = log N (ϵn /19, F(L, k, s◦ , B ◦ ), dH (., .)) √ ≤ log N ( 8σe2 ϵn /19, F(L, k, s◦ , B ◦ ), ||.||∞ )  L L ! !(s◦l +1)  Y 38 Y ≤ log  √ (L + 1) Bj◦ kl+1  8σ 2ϵ l=0 e n j=0 L L ! ! X 38 Y = (s◦l + 1) log √ (L + 1) Bj◦ kl+1 8σ 2ϵ l=0 e n j=0 " L L !# X 1 X ≤C (s◦l + 1) log + log(L + 1) + log Bj◦ + log kl+1 l=0 ϵn j=0 X L XL ◦ ≤C (sl + 1)(log n + log(L + 1) + log Bj◦ + log kl+1 ) l=0 j=0 X L XL ≤C (s◦l + 1)(log n + log(L + 1) + log Bj◦ + log kl+1 + log(kl + 1)) ≤ Cnϵ2n l=0 j=0 90 where, C in each step is different which tends to absorb the extra constants in it. First inequality holds due to the following   1 2 d2H (Pθ , P0 ) ≤ 1 − exp − 2 ||η0 − ηθ ||∞ 8σe and ϵn = o(1), the second inequality is because of (A.4), and fourth inequality is because of s◦l log(1/ϵn ) ≍ s◦l log n. Therefore, X EP0 (ϕ) ≤ EP0 (φh ) = exp{−C1 nϵ2n } h for some C1 = (182 /192 )/2 − 1/4. On the other hand, for any θ, such that dH (Pθ , P0 ) ≥ ϵn , say Pθ belongs to the hth Hellinger ball, then we have EPθ (1 − ϕ) ≤ EPθ (1 − φh ) ≤ exp{−C2 nd2H (P0 , Pθ )} where C2 = (182 /192 )/2. This concludes the proof. Proof of Lemma 3.4.2 X L X Assumption: s◦l +1= (nϵ2n )/( uj ), λl kl+1 /s◦l → 0, ul log L = o(nϵ2n ) (A.6) j=0 L ! L ! [ [ Π(F(L, e k, s◦ , B ◦ )c ) ≤ Π e {||w el ||0 > s◦l } +Π e {||wel ||∞ > Bl◦ } l=0 l=0 XL X L ≤ Π(|| e w el ||0 > s◦l ) + Π(|| e w el ||∞ > Bl◦ ) l=0 l=0 XL X X L X = Π(||w el ||0 > s◦l |z)π(z) + Π(||wel ||∞ > Bl◦ |z)π(z) l=0 z l=0 z kl+1 L ! L ! X X X ≤ P zli > s◦l + P sup ||wli ||1 > Bl◦ z i=1 i=1,··· ,kl+1 l=0 l=0 where w el = (||wl1 ||1 , · · · , ||wlkl+1 ||1 )T and the last inequality holds since Π(||w el ||0 > s◦l |z) ≤ el ||0 > s◦l |z) = 1 iff zli > s◦l and π(z) ≤ 1. We now break the proof in two parts P 1, Π(||w as follows. 91 Part 1. kl+1 kl+1 L ! L ! X X X X P zli > s◦l = P zli − kl+1 λl > s◦l − kl+1 λl l=0 i=1 l=0 i=1 By Bernstein inequality L L −1/2(s◦l − kl+1 λl )2 −1/2(s◦l − kl+1 λl )2 X   X   ≤ exp ≤ exp l=0 kl+1 λl (1 − λl ) + 1/3(s◦l − kl+1 λl ) l=0 kl+1 λl + 1/3(s◦l − kl+1 λl ) L L −sl /2(1 − kl+1 λl /s◦l )2  ◦ 3s◦l    X X kl+1 λl = exp ◦ → exp − since ◦ → 0 by (A.6) 1/3(1 + 2k l+1 λl /sl ) 2 s l l=0 l=0 L 2 nϵ2n nϵ2n       X 3nϵn 3 = exp − P + ≤ 5(L + 1)exp − P ≤ exp − P l=0 4 u l 2 2 u l 4 ul ul log L = o(nϵ2n ) by (A.6). P P since ul log(5(L + 1)) ∼ Part 2. L ! 
X P sup ||wli ||1 > Bl◦ z i=1,··· ,kl+1 l=0 L X kl+1   X ≤ P ||wli ||1 > Bl◦ z l=0 i=1 L kl+1 XX  Bl◦  ≤ P ||wli ||∞ > z l=0 i=1 kl + 1 L X kl+1 kl +1 Bl◦ X X   ≤ P |wlij | > z l=0 i=1 j=1 kl + 1 L X kl+1 kl +1 Bl◦ 2 X X   ≤2 exp − By concentration inequality l=0 i=1 j=1 (kl + 1)2 L X kl+1 kl +1 X X  2nϵ2n  =2 exp − exp( PL ◦ − 2 log(kl + 1)) l=0 i=1 j=1 ((L + 1) j ′ =0 (sj ′ + 1) L X kl+1 kl +1 X X 1 ≤ exp(−nϵ2n ) = exp(−nϵ2n ) l=0 i=1 j=1 (L + 1)kl+1 (kl + 1) where the third inequality holds since |wlij | given z is bound above by a |N (0, σ02 )| random variable. The above proof holds as long as ! 2nϵ2n exp PL ◦ − 2 log(kl + 1) ≥ nϵ2n +log(L+1)+log kl+1 +log(kl +1)+log 2 (L + 1) j ′ =0 (sj ′ + 1) 92 Taking log on both sides we get ! nϵ2n 1 PL ◦ − log(kl + 1) ≥ log(nϵ2n +log(L+1)+log kl+1 +log(kl +1)+log 2) (L + 1) j ′ =0 (sj ′ + 1) 2 PL ◦ + 1) = (L + 1)nϵ2n / P This is true since j ′ =0 (sj ′ ul is bounded above by nϵ2n (L + 1)(log(kl + 1) + 12 log(nϵ2n + log(L + 1) + log kl+1 + log(kl + 1) + log 2) Proof of Lemma 3.4.3 part 1. Assumption: − log λl = O{(kl + 1)ϑl }, − log(1 − λl ) = O{(sl /kl+1 )(kl + 1)ϑl } (A.7) Z Z   P0 (y, x) dKL (P0 , Pθ ) = log P0 (y, x)dydx x∈[0,1]p y∈R Pθ (y, x) (y − η0 (x))2 (y − ηθ (x))2     1 1 P0 (y, x) = p exp − Pθ (y, x) = p exp − 2πσe2 2σe2 2πσe2 2σe2 So we get, (y − η0 (x))2 (y − ηθ (x))2 Z Z    dKL (P0 , Pθ ) = log exp − + P0 (y, x)dydx x∈[0,1]p y∈R 2σe2 2σe2 2y(η0 (x) − ηθ (x)) − (η02 (x) − ηθ2 (x)) Z Z = P0 (y, x)dydx x∈[0,1]p y∈R 2σe2 2η02 (x) − 2η0 (x)ηθ (x) − η02 (x) + ηθ2 (x) Z = dx x∈[0,1]p 2σe2 (η0 (x) − ηθ (x))2 Z 1 = dx = ||η0 − ηθ ||22 (A.8) x∈[0,1]p 2 2 where, σe2 = 1 can be chosen w.l.o.g. Next, let ηθ∗ (x) be θ ∗ satisfying arg minηθ ∈F (L,k,s,B) |ηθ − η0 |2∞ . Then, p ||ηθ∗ − η0 ||1 ≤ ||ηθ∗ − η0 ||∞ = ξ (A.9) ∗ Here, we redefine δ l by considering the L1 norms of the rows of D l = W l − W l as follows ⊤ ⊤ D l = (dl1 , · · · , dlkl+1 )⊤ δ l = (||dl1 ||1 , · · · , ||dlkl+1 ||1 ) 93 Next we define a neighborhood M√P r as follows: l ( pP ) rl Bl M√P r = θ : ||dli ||1 ≤ QL , i ∈ Sl , ||dli ||1 = 0, i ∈ Slc , l = 0, · · · , L l (L + 1)( j=0 Bj ) where Slc is the set where ||w∗li ||1 = 0, l = 0, · · · , L. Then, for every θ ∈ M√P r using l (A.3), we have qX ||ηθ − ηθ∗ ||1 ≤ rl (A.10) √ Combining (A.9) and (A.10), we get for θ ∈ M√P r , ||ηθ − η0 ||1 ≤ pP rl + ξ. So we get, l pP √ ( rl + ξ)2 X dKL (P0 , Pθ ) ≤ ≤ rl + ξ 2 Since θ ∈ NP rl +ξ for every θ ∈ M√P r ; therefore, l Z Z e(θ)dθ ≥ π π e(θ)dθ θ∈NP rl +ξ θ∈M√P r l rl Bl )/((L + 1)( Lj=0 Bj )) and A = {wli : ||wli − w∗li ||1 ≤ δn } pP Q Let δn = (   X   e M√ P Π = Π M √P z π(z) r l rl z   Π M√P r z π(z) X ≥ l {z:zli =1,i∈Sl ,zli =0,i∈Slc ,l=0,··· ,L} L E(1{wli ∈A} |zli = 1) Y Y = (1 − λl )kl+1 −sl λsl l l=0 i∈Sl L  kl2+1 kY l +1 w2lij YZ    Y kl+1 −sl sl 1 ≥ (1 − λl ) λl exp − dwlij l=0 i∈Sl wli ∈A 2π j=1 2 L  kl2+1 kY l +1 Z w ∗ + δn Y w2lij   Y 1 lij kl +1 ≥ (1 − λl )kl+1 −sl λsl l exp − dwlij l=0 i∈Sl 2π j=1 w ∗ − δn lij k +1 2 l L  kl2+1 kY l +1 2  Y w  Y 1 2δn blij = (1 − λl )kl+1 −sl λsl l exp − l=0 i∈Sl 2π j=1 kl + 1 2 where the third equality follows since E(1{wli ∈A} |zli = 0) = 1 since ||w∗li ||1 = 0, for i ∈ Slc . The last equality is by mean value theorem, w blij ∈ [w∗lij − δn /(kl + 1), w∗lij + δn /(kl + 1)], thus kX ! 
L l +1 2 Y kl+1 −sl sl Y kl + 1 1 2δn wblij = (1 − λl ) λl exp log + (kl + 1) log − l=0 i∈S 2 2π kl + 1 j=1 2 l 94 " L n     X 1 1 = exp − sl log + (kl+1 − sl ) log l=0 λl 1 − λl kX !)# l +1 2 X kl + 1 1 2δn w blij + − log − (kl + 1) log + i∈Sl 2 2π kl + 1 j=1 2 " L (     X 1 1 = exp − sl log + (kl+1 − sl ) log l=0 λl 1 − λl )# sl (kl + 1) 1 2δn X kX l +1 w 2 blij − log − sl (kl + 1) log + (A.11) 2 2π kl + 1 i∈S j=1 2 l Now, L X kX l +1 2 L k +1 l X w blij 1 XXX ≤ max((w∗lij − δn /(kl + 1))2 , (w∗lij + δn /(kl + 1))2 ) l=0 i∈Sl j=1 2 2 l=0 i∈S j=1 l XL X kX l +1 XL X XL X ≤ (w∗2lij + δn2 /(kl 2 + 1) ) ≤ ||w∗li ||21 + δn2 /(kl + 1) l=0 i∈Sl j=1 l=0 i∈Sl l=0 i∈Sl XL X X  ≤ sl (Bl2 + 1) ≤ n rl ≤ n rl + ξ (A.12) l=0 where the above line uses δn → 0. Finally L       X 1 1 sl (kl + 1) 1 2δn sl log + (kl+1 − sl ) log − log − sl (kl + 1) log l=0 λ l 1 − λ l 2 2π kl + 1 L ( L )! X sl (kl + 1) X X ≤ Cnrl + 2 log(kl + 1) + 2 log(L + 1) + 2 log Bm − log rl l=0 2 m=0,m̸=l X X  ≤ Cn rl ≤ Cn rl + ξ (A.13) where the first inequality follows from (A.7) and expanding δn . The last inequality follows P P since n rl → ∞ which implies − log rl = O(log n). Combining (A.12) and (A.13) and replacing (A.11), the proof follows. Proof of Lemma 3.4.3 part 2. Assumption: − log λl = O{(kl + 1)ϑl }, − log(1 − λl ) = O{(sl /kl+1 )(kl + 1)ϑl } 95 Suppose there exists q ∈ QMF such that X dKL (q, π) ≤ C1 n rl , XZ X |ηθ − ηθ∗ |22 q(θ, z)dθ ≤ rl . (A.14) z Θ Recall θ ∗ = arg minθ∈θ(L,p,s,B) |ηθ − η0 |2∞ . By relation (A.8), XZ XnZ ndKL (P0 , Pθ )q(θ, z)dθ = ||η0 − ηθ ||22 q(θ, z)dθ z z 2 Z nX n ≤ ||ηθ∗ − ηθ ||22 q(θ, z)dθ + ||ηθ∗ − η0 ||2∞ 2 z 2 X ≤ Cn( rl + ξ) where the above relation is due to (A.14) which completes the proof. We next construct q ∈ QMF as wlij |zli ∼ zli N (w∗lij , σl2 ) + (1 − zli )δ0 , zli ∼ Bern(γli∗ ) γli∗ = 1(||wli∗ ||1 ̸= 0) sl QL 2 −1 where σl2 = 8n(L+1) (4L−l (kl + 1) log(kl+1 2kl +1 ) m=0,m̸=l Bm ) . We next consider the relation (A.5) in Lemma A.2.5. We upper bound the expectation of the supremum of L1 norm of multivariate Gaussian variables: Z Z Z Wl q(θ, z)dθ ≤ sup ||wli − wli ||1 q(θ|z)dθ ≤ sup ||wli − w∗li ||1 q(θ|z = 1)dθ f ∗ i i since q(z) ≤ 1. If zli = 1, then ||wli − w∗li ||1 = 0, thus the above integral is maximized at z = 1 where z = 1 indicates all neurons are present in the network. In this case, all wlij are nothing but independent Gaussian random variables. In this direction we make use of concentration inequalities similar to the proof of theorem 2 in Chérief-Abdellatif (2020). Let, Y = supi ||wli − w∗li ||1 . exp(tEY ) ≤ E(exp(tY )) = E[sup exp(t||wli − w∗li ||1 )] i kl+1 kXl +1 kl+1 kl +1 X X Y ≤ E[exp(t |wlij − w∗lij |)] = E[exp(t|wlij − w∗lij |)] i=1 j=1 i=1 j=1 96 kl+1 kl +1 σl2 t2 σl2 t2 X Y     kl +1 = 2exp Φ(σl t) ≤ kl+1 2 exp (kl + 1) i=1 j=1 2 2 p Thus, EY ≤ (log(kl+1 2kl +1 ) + (kl + 1)σl2 t2 /2)/t. Let t = (1/σl ) (2/(kl + 1)) log(kl+1 2kl +1 ), r kl + 1 hp p i EY ≤ σl log(kl+1 2kl +1 ) + log(kl+1 2kl +1 ) 2 q q = 2σl2 (kl + 1) log(kl+1 2kl +1 ) ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) Similarly, Z Z Z f 2 q(θ, z)dθ = W l sup(||wli − w∗li ||1 )2 q(θ, z)dθ ≤ sup(||wli − w∗li ||1 )2 q(θ|z = 1) i i Let, Y ′ = supi (||wli − w∗li ||1 )2 . 
exp(tEY ′ ) ≤ E(exp(tY ′ )) = E[sup exp(t(||wli − w∗li ||1 )2 )] i kl+1 kXl +1 kl+1 kXl +1 X X ≤ E[exp(t( |wlij − w∗lij |)2 )] ≤ E[exp(t(kl + 1) (wlij − w∗lij )2 )] i=1 j=1 i=1 j=1 kl+1 kl +1 kl+1 kl +1   21 X Y X Y 1 = E[exp(t(kl + 1)(wlij − w∗lij )2 )] = i=1 j=1 i=1 j=1 1 − 2t(kl + 1)σl2   kl2+1 1 ≤ kl+1 1 − 2t(kl + 1)σl2 Thus, EY ′ ≤ (log kl+1 − ((kl + 1)/2) log(1 − 2t(kl + 1)σl2 ))/t. Let t = 1/(4σl2 (kl + 1)),     ′ kl + 1 kl +1 EY ≤ 4σl2 (kl + 1) log kl+1 + log 2 = 4σl2 (kl + 1) log(kl+1 2 2 ) 2 ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) Next we also get, Z Z q (W fl + Bl )q(θ, z)dθ = fl q(θ, z)dθ + Bl ≤ W 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl ≤ 2Bl Z Z Z 2 f 2 q(θ, z)dθ fl q(θ, z)dθ + B 2 (W fl + Bl ) q(θ, z)dθ = W l + 2Bl W l q ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) + 2Bl 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl2 ≤ 4Bl2 97 Z Z Z 2 Wfl (Wfl + Bl )q(θ, z)dθ = Wl q(θ, z)dθ + Bl W f fl q(θ, z)dθ q ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl 4σl2 (kl + 1) log(kl+1 2kl +1 ) q q  ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl q ≤ 2Bl 4σl2 (kl + 1) log(kl+1 2kl +1 ) p since 4σl2 (kl + 1) log(kl+1 2kl +1 ) is bounded above by v !−1 u L u 4s l Y t 4L−l (kl + 1) log(kl+1 2kl +1 ) Bm 2 (kl + 1) log(kl+1 2kl +1 ) 8n(L + 1) m=0,m̸=l v u L ! −1 u sl L−l Y 2 = Bl t 4 Bm ≤ Bl , The quantity in square root < 1 for large n. 2n(L + 1) m=0 Let bj = (kj + 1) log(kj+1 2kj +1 ). From relation (A.5), we get Z L L ! X Y ||ηθ − ηθ∗ ||22 q(θ, z)dθ ≤ c2j−1 (4σj2 bj ) 4Bm 2 j=0 m=j+1 j−1 j−1 L X L ! ! X q Y q Y 2 +2 cj−1 cj ′ −1 2Bj 4σj2 bj 4Bm 4σj2′ bj ′ 2Bm j=0 j ′ =0 m=j+1 m=j ′ +1 j−1 L ! L ! X Y Y =4 4L−j σj2 bj Bm2 Bm2 j=0 m=0 m=j+1 j−1 j−1 j−1 j−1 L X ! ! L ! ! X Y Y Y Y q q 2 +8 Bm Bm 2Bj 4Bm 2Bm σj2 bj σj2′ bj ′ j=0 j ′ =0 m=0 m=0 m=j+1 m=j ′ +1 XL Y L 2L−2j =4 2 σj2 bj Bm 2 j=0 m=0,m̸=j j−1 j−1 j−1 L X ! ! L ! L ! q q j−j ′ X Y Y Y Y +8 4L−j 2 Bm Bm Bm Bm σj2 bj σj2′ bj ′ j=0 j ′ =0 m=0 m=0 m=j+1 m=j ′ +1 L L ! X Y 2L−2j =4 2 σj2 bj Bm 2 j=0 m=0,m̸=j j−1 L X L ! L ! q q L−j L−j ′ X Y Y +8 2 2 Bm Bm σj2 bj σj2′ bj ′ j=0 j ′ =0 m=0,m̸=j m=0,m̸=j ′ L L !!2 L r !2 X L−j q Y X sj =4 2 σj2 bj Bm =4 j=0 m=0,m̸=j j=0 8n(L + 1) 98 L !2 PL L 1 X √ j=0 sj X = sj ≤ ≤ rl 2n(L + 1) j=0 2n j=0 This concludes the proof of (A.14). Next, 1 Y kY n L−1 l+1 kl +1 n o + 1(z = γ )dKL Y dKL (q, π) ≤ log ∗ γli∗ N (w∗lij , σl2 ) + (1 − γli∗ )δ0 π(z) l=0 i=1 j=1 Y kY l+1 kl +1 n ! kY L +1 o n L−1 Y o kYL +1 o N (w∗Lj , σL2 ) , zli N (0, σ02 ) + (1 − zli )δ0 N (0, σ02 ) j=1 l=0 i=1 j=1 j=1 L−1 X kl+1 kl +1 1 X X  = log QL−1 + dKL γli∗ N (w∗lij , σl2 ) + (1 − γli∗ )δ0 , l=0 λsl l (1 − λl )kl+1 −sl l=0 i=1 j=1  kX L +1   γli∗ N (0, σ02 ) + (1 − γli∗ )δ0 + dKL N (w∗Lj , σL2 ), N (0, σ02 ) j=1 L−1 ! L−1 kl+1 k +1 ( ) l 2 2 ∗ 2 X 1 1 XXX 1 σ σ l + w lij 1 = sl log + (kl+1 − sl ) log + γli∗ log 02 + 2 − l=0 λ l 1 − λ l l=0 i=1 j=1 2 σ l 2σ 0 2 kX ( ) L +1 2 1 σ02 σL2 + w∗Lj 1 + log 2 + − j=1 2 σL 2σ02 2 L−1 L−1 " # X X sl kl + sl σl2 Bl2 σ02 ≤ Cnrl + 2 + 2 − 1 + log 2 l=0 l=0 2 σ0 σ 0 (kl + 1) σl " # kL + 1 σL2 BL2 σ02 + − 1 + log 2 2 σ02 σ02 (kL + 1) σL where the first inequality follows from Lemma A.2.2. The inequality in the above line uses Pkl +1 ∗ 2 2 j=1 w lij ≤ Bl and similar to the proof of Lemma 4.1 in Bai et al. (2020) uses (A.7). Let σ02 = 1 and it could be easily derived that σl2 ≤ 1. 
L−1 L−1 " # " # X X sl Bl2 (k L + 1) BL 2 dKL (q, π) ≤ Cnrl + (kl + 1) − log σl2 + − log σL2 l=0 l=0 2 k l + 1 2 kL + 1 L−1 L−1 " " L #−1 !# X X sl Bl2 sl Y = Cnrl + (kl + 1) − log 4L−l bl Bm2 l=0 l=0 2 k l +1 8n(L + 1) m=0,m̸=l " " L # −1 !# (kL + 1) BL2 1 Y 2 + − log bL Bm 2 kL + 1 8n(L + 1) m=0,m̸=L L−1 L " " L #−1 !# X X sl Bl2 sl L−l Y 2 = Cnrl + (kl + 1) − log 4 bl Bm l=0 l=0 2 kl + 1 8n(L + 1) m=0,m̸=l 99 L−1 L L ! X X sl X sl 8n(L + 1) = Cnrl + Bl2 + (kl + 1) log l=0 l=0 2 l=0 2 sl L L X X sl + sl (kl + 1)(L − l) log 2 + (kl + 1) log(kl + 1) l=0 l=0 2 L L L ! X sl   X X + (kl + 1) log log(kl+1 2kl +1 ) + sl (kl + 1) log Bm l=0 2 l=0 m=0,m̸=l L−1 L L ! L X X sl 2 X sl 8n(L + 1) X ≤ Cnrl + B + (kl + 1) log +L sl (kl + 1) l=0 l=0 2 l l=0 2 sl l=0 L L L ! X sl X X + (kl + 1)(log(kl + 1) + log(kl+1 + kl + 1)) + sl (kl + 1) log Bm l=0 2 l=0 m=0,m̸=l L−1 L L ! L X X sl X sl 8n(L + 1) X ≤ Cnrl + Bl2 + (kl + 1) log +L sl (kl + 1) l=0 l=0 2 l=0 2 sl l=0 L L L ! X X X + sl (kl + 1) log(kl+1 + kl + 1) + sl (kl + 1) log Bm l=0 l=0 m=0,m̸=l L−1 L " L ! X X Bl2 X ≤ Cnrl + sl (kl + 1) + log Bm + L + log(kl+1 + kl + 1) l=0 l=0 2(kl + 1) m=0,m̸=l !# 1 8n(L + 1) + log 2 sl L−1 X ≤ (C + C ′ )nrl + C ′ nrL l=0 L " L ! !# X Bl2 X n + sl (kl + 1) + log Bm + L + log(kl+1 + kl + 1) + log l=0 kl + 1 m=0,m̸=l sl L−1 X XL X L ′ ′ ≤ (C + C )nrl + C nrL + sl (kl + 1)ϑl ≤ C1 n rl l=0 l=0 l=0 This concludes the proof of (A.14). Proof of Corollary 3.4.5 The proof is a direct consequence of Theorem 3.4.4 in the Section 3.4 as long as assumptions of Lemma 3.4.2 and Lemma 3.4.3 parts 1 and 2 hold when σ02 = 1, − log λl = log(kl+1 ) + qP Cl (kl + 1)ϑl and ϵn = ( Ll=0 rl + ξ) Ll=0 ul . This what we show next. P 100 ul = O(ϵ2n ), thus P Verifying assumption (A.6) under Proof of Lemma 3.4.2: Note, X X ul log L = o(nϵ2n ) ⇐⇒ log L = o(n( rl + ξ)) which is indeed true since log L = o(L2 ) and L2 ≤ n rl . We show that (kl+1 λl )/s◦l → 0. P With λl = (1/kl+1 )exp(−Cl (kl + 1)ϑl ), P P kl+1 λl ul exp(−C(kl + 1)ϑl ) exp(−C(kl + 1)ϑl + log ul ) ≤ = s◦l nϵ2n nϵ2n exp(−C(kl + 1)ϑl + ϑl ) ≤ →0 nϵ2n ul ≤ ϑl , ϑl → ∞, kl → ∞ and nϵ2n → ∞. P where the above relation holds since log Verifying assumption (A.7) under Proof of Lemma 3.4.3 part 1. and part 2. Note, − log λl = log(kl+1 ) + Cl (kl + 1)ϑl ≤ ϑl + Cl (kl + 1)ϑl = O{(kl + 1)ϑl } And then, 1 − λl = 1 − exp(−Cl ϑl (kl + 1))/kl+1 − log(1 − λl ) ∼ exp(−Cl ϑl (kl + 1))/kl+1 = O{(kl + 1)sl ϑl /kl+1 } since exp(−Cl ϑl (kl + 1)) → 0 and (kl + 1)sl ϑl → ∞. 101 APPENDIX B ADDITIONAL NUMERICAL EXPERIMENTS DETAILS B.1 FLOPs Calculation We only count multiply operation for floating point operations (FLOPs) similar to Zhao et al. (2019). In 2D convolution layer, we assume convolution is implemented as a sliding window and that the nonlinearity function is computed for free. Then, for a 2D convolutional layer (given bias is present) we get FLOPs as: FLOPs = (Cin,pruned Kw Kh + 1)Ow Oh Cout,pruned where, Cin,pruned , Cout,pruned are the number of input channels and output channels after prun- ing. Channels are pruned if all the parameters associated with that channel in convolution mapping are zero. Kw and Kh are the kernel width and height respectively. Finally, Ow , Oh are output width and height where Ow = (Iw + 2 × Pw − Dw × (Kw − 1) − 1)/Sw + 1 and Oh = (Ih + 2 × Ph − Dh × (Kh − 1) − 1)/Sh + 1. Here, Iw , Ih are input, Pw , Ph are padding, Dw , Dh are dilation, Sw , Sh are stride widths and heights respectively. 
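The FLOPs count above is straightforward to implement. The following minimal Python sketch (function and argument names are illustrative, not taken from the dissertation's code) computes the multiply count for a pruned 2D convolution layer under exactly these assumptions.

```python
# Minimal sketch of the 2D-convolution FLOPs count described above; argument
# names (in_ch_pruned, out_ch_pruned, etc.) are illustrative placeholders.

def conv2d_flops(in_ch_pruned, out_ch_pruned, k_w, k_h,
                 i_w, i_h, pad_w=0, pad_h=0, dil_w=1, dil_h=1,
                 stride_w=1, stride_h=1, bias=True):
    """Count multiply operations for a pruned 2D convolution layer."""
    # Output spatial dimensions (integer division mirrors PyTorch's Conv2d).
    o_w = (i_w + 2 * pad_w - dil_w * (k_w - 1) - 1) // stride_w + 1
    o_h = (i_h + 2 * pad_h - dil_h * (k_h - 1) - 1) // stride_h + 1
    # (C_in * K_w * K_h + 1) multiplies per output element, times all outputs.
    per_output = in_ch_pruned * k_w * k_h + (1 if bias else 0)
    return per_output * o_w * o_h * out_ch_pruned

# Example: a pruned 3x3 convolution mapping 12 surviving input channels to
# 40 surviving output channels on a 32x32 input with padding 1.
print(conv2d_flops(12, 40, 3, 3, 32, 32, pad_w=1, pad_h=1))
```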
For fully connected (linear) layers (with bias) we get the FLOPs as:

FLOPs = (Ipruned + 1) Opruned

where Ipruned and Opruned are the numbers of input and output neurons remaining after pruning, respectively.

B.2 Variational parameters initialization

We initialize the γlj's at a value close to 1 for all of our experiments. This ensures that at epoch 0 we have a fully connected deep neural network. It also warrants that most of the weights do not get pruned off at a very early stage of training, which might lead to bad performance. The variational parameters µljj′ are initialized using U(−0.6, 0.6) for the simulation and UCI regression examples, whereas Kaiming uniform initialization (He et al., 2015) is used for classification. Moreover, the σljj′ are reparameterized using the softplus function, σljj′ = log(1 + exp(ρljj′)), and the ρljj′ are initialized at a constant value of −6. This keeps the initial values of σljj′ close to 0, ensuring that the initial network weights stay close to the Kaiming uniform initialization.

B.3 Hyperparameters for training

We keep the MC sample size (S) at 1 during training. We choose a learning rate of 3 × 10−3, a batch size of 400, and 10000 epochs in the 20-neuron case of simulation study-I. We use a learning rate of 10−3, a batch size of 400, and 20000 epochs in the 100-neuron case of simulation study-I. Next, we use a learning rate of 5 × 10−3, full batch, and 10000 epochs for simulation study-II. For the UCI regression datasets, we choose a batch size of 128 and run 500 epochs for Concrete, Wine, and Power Plant, and 800 epochs for Kin8nm. For the Protein and Year datasets, we choose a batch size of 256 and run 100 epochs. For all the UCI regression datasets we keep a learning rate of 10−3. The Adam algorithm is chosen for the optimization of the model parameters. In the image classification datasets, for the SS-IG model we use a 10−3 learning rate and a minibatch size of 1024 in all experiments except the LeNet-5-Caffe on Fashion-MNIST experiment, where we use a 2 × 10−3 learning rate and a 1024 minibatch size. For the SV-BNN model, we take a 10−3 learning rate and a 1024 minibatch size in all experiments after an extensive hyperparameter search. For the VBNN model, we take a learning rate of 10−4 and a minibatch size of 128 following Blundell et al. (2015). We train each model for 1200 epochs using the Adam optimizer in all the image classification experiments provided in Section 3.5.

B.4 Fine-tuning of the constant in the prior inclusion probability expression

Recall the layer-wise prior inclusion probabilities λl = (1/kl+1) exp(−Cl(kl + 1)ϑl) from Corollary 3.4.5. In our numerical experiments, we use this expression to choose an optimal value of λl in each layer of a given network. The value of λl varies as we vary the constant Cl, and we next describe how Cl is chosen. The influence of Cl is mainly due to the kl + 1 term and the Bl²/(kl + 1) term inside ϑl. We ensure that each incoming weight and bias onto a node from layer l + 1 is bounded by 1, which leads us to choose Bl to be kl + 1. Hence the leading term of (kl + 1)ϑl is (kl + 1), and Cl has to be chosen so that the exponential term in the λl expression does not approach 0. In our experiments, we choose Cl values of the order of negative powers of 10 so that the prior inclusion probabilities do not fall below 10−50; a small sketch of this computation is given below. If we instead choose a λl value very close to 0, we might prune off all the nodes in each layer or make the training unstable, which is not ideal.
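As an illustration of the guideline above, the following hedged sketch evaluates λl for a hypothetical layer and flags underflow. The function name, the example values of Cl and ϑl, and the 10−50 floor check are illustrative assumptions, not the dissertation's actual implementation.

```python
import math

# Sketch of the prior-inclusion-probability expression from Corollary 3.4.5,
# lambda_l = (1 / k_{l+1}) * exp(-C_l * (k_l + 1) * vartheta_l).
# vartheta_l is supplied directly rather than recomputed from its definition.

def prior_inclusion_prob(C_l, k_l, k_lplus1, vartheta_l, floor=1e-50):
    """Layer-wise prior inclusion probability with an underflow check."""
    lam = (1.0 / k_lplus1) * math.exp(-C_l * (k_l + 1) * vartheta_l)
    if lam < floor:
        # Mirrors the guideline above: avoid lambda_l dropping below ~1e-50,
        # which can prune every node in a layer or destabilize training.
        raise ValueError(f"lambda_l = {lam:.3e} is below the {floor:.0e} floor; "
                         "choose a smaller C_l.")
    return lam

# Example with hypothetical values for a 400-node layer feeding a 400-node layer.
print(prior_inclusion_prob(C_l=1e-3, k_l=400, k_lplus1=400, vartheta_l=100.0))
```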
Overall, the aforementioned strategy of choosing the constants Cl ensures reasonable values of λl in each layer.

B.5 Simulation study I: extra details

First we provide the network parameters used to generate the data for this simulation experiment. The edge weights in the underlying 2-2-1 network are as follows: W0 = {w011 = 10, w012 = 15, w021 = −15, w022 = 10}; W1 = {w111 = −3, w121 = 3} and v0 = {v01 = −5, v02 = 5}; v1 = {v11 = 4}.

Figure B.1 Simulation study I: additional experiment results. Node-wise weight magnitudes recovered by (a) VBNN and (b) the proposed SS-IG model in the synthetic regression data generated using the 2-2-1 network. The boxplots show the distribution of incoming weights into a given hidden layer node. Only the 20 nodes with the largest edge weights are displayed.

In Figure B.1, we provide additional results demonstrating the model selection ability of our SS-IG approach in a wider network consisting of 100 nodes in the single hidden layer structure considered in simulation study-I from Section 3.5.

B.5.1 Effect of Hidden Layer Widths

Here, we explore 2-hidden-layer neural networks with varying widths. For our SS-IG model we use a 10−3 learning rate and a minibatch size of 1024, while for the VBNN model we take a learning rate of 10−4 and a minibatch size of 128 following Blundell et al. (2015). We train both models for 400 epochs using the Adam optimizer. Figure B.2 summarizes the results. We provide results for 3 different architectures which have 400, 800, and 1200 nodes in each of their 2 hidden layers. In Figure B.2a, we find that across the architectures both the SS-IG and VBNN models have similar predictive performance. Further, our method is able to prune off more than 88% of the first hidden layer nodes and more than 92% of the second hidden layer nodes (Figure B.2b) at the expense of a 2% accuracy loss due to sparsification compared to the densely connected VBNN. We also observe that as the model capacity increases the sparsity percentage per layer decreases. This suggests that each architecture is trying to reach a sparse network of comparable size.

Figure B.2 MNIST experiment results for varying hidden layer widths: (a) prediction accuracy per architecture and (b) layer-wise sparsity per architecture.

CHAPTER 4

COMPACT BAYESIAN NEURAL NETWORKS WITH STRUCTURED SPARSITY

4.1 Introduction

In high-dimensional modeling, predictor selection and sparse signal recovery are routine statistical and machine learning practices. Sparse parameter estimation via high-dimensional regularization penalizing model dimensionality is well studied in the literature. Two of the most popular regularization techniques are the lasso and horseshoe regularizers (Bhadra et al., 2019). The lasso estimator (Tibshirani, 1996) induces sparsity by constraining the L1 norm of the parameters in the model. The horseshoe estimator (Carvalho et al., 2010) places absolutely continuous shrinkage priors on the entire parameter vector that selectively shrink the small signals, since the horseshoe prior has heavy tails supporting both zero and large values. Both the lasso and horseshoe procedures come with strong theoretical guarantees for estimation and prediction. In this work, we propose a spike-and-slab prior framework similar to the SS-IG model proposed in Chapter 3 for dynamic node pruning, with the slab component using either group lasso or group horseshoe priors.
This combination of spike-and-slab and group shrinkage priors first ensures that the unnecessary collection of weights incident on a node is shrunk to zero, and the spike-and-slab setup then allows for automated pruning of such shrunken weights. In Figure 4.1, we provide an image classification experiment where our proposed spike-and-slab Group Lasso (SS-GL) and spike-and-slab Group Horseshoe (SS-GHS) approaches demonstrate the improvement over a simple Gaussian prior in the slab part. For posterior approximation, variational inference (VI) in sparse BNNs with the spike-and-slab Gaussian prior framework for edge selection was introduced by Chérief-Abdellatif (2020), and Jantre et al. (2021a) later extended it to node pruning. In this work, we adopt variational Bayesian inference, leading to tractable model training, in conjunction with a continuous relaxation of the discrete Bernoulli variables associated with the spike part (Maddison et al., 2017; Jang et al., 2017), similar to the SS-IG model.

Figure 4.1 MNIST experiment results: motivation for group shrinkage priors over a Gaussian prior. Here, we demonstrate the performance of our SS-GL and SS-GHS models in a 2-layer perceptron network used to classify MNIST, the hand-written digits dataset. (a) The classification accuracy on the test data for our models and for the SS-Gauss model of Jantre et al. (2021a). (b) and (c) The proportion of active nodes (node sparsity) in layer-1 and layer-2 of the network, respectively. We observe that our SS-GHS yields the most compact network with the best classification accuracy.

4.1.1 Proposed Methods

Firstly, there does not exist any cohesive literature establishing the numerical efficiency of shrinkage priors over Gaussian slabs in the context of training structurally sparse networks. Secondly, the numerical properties of the corresponding variational implementation remain unexplored. To address these issues, we consider a spike-and-slab framework with group shrinkage priors, (i) group lasso and (ii) group horseshoe, in which the slab component first shrinks the redundant model weights and the spike component then prunes out the nodes whose weights have been shrunk close to zero. Accordingly, our detailed contribution is as follows.

• We propose structurally sparse Bayesian neural networks using two distinct spike-and-slab prior setups, where the slab component uses hierarchical priors on the group of incoming weights (including the bias) of each neuron: (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group Horseshoe (SS-GHS).

4.2 Structured Sparsity: Spike-and-Slab Hierarchical Priors

In order to carry out automatic node selection to induce structured sparsity in BNNs, we consider spike-and-slab priors. A zero-mean Gaussian distribution is the commonly used slab distribution in spike-and-slab priors (Jantre et al., 2021a). However, its use can lead to inflated predictive uncertainties, especially when used in conjunction with fully factorized variational inference (Ghosh et al., 2019). Instead, if we consider a slab distribution that is zero-mean Gaussian with its scale being a random variable, then the slab part of the marginal prior distribution has heavier tails and higher mass at zero. Such hierarchical distributions in the slab part further improve the sparsity as well as circumvent the inflated predictive uncertainties.
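The heavier-tail claim can be checked with a quick Monte Carlo comparison. The sketch below is not from the dissertation, and the Gamma shape and scale are arbitrary; it simply contrasts a fixed-variance Gaussian with a Gaussian whose variance is itself Gamma-distributed.

```python
import numpy as np

# Illustration of the claim above: a Gaussian slab whose variance is random has
# heavier tails (and more mass near zero) than a fixed-variance Gaussian.
rng = np.random.default_rng(0)
n = 1_000_000

tau2 = rng.gamma(shape=1.0, scale=1.0, size=n)         # random per-draw variance
mixture = rng.normal(0.0, np.sqrt(tau2))               # hierarchical (scale-mixture) slab
gaussian = rng.normal(0.0, np.sqrt(tau2.mean()), size=n)  # fixed-variance comparison

def excess_kurtosis(x):
    x = x - x.mean()
    return (x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0

for name, draws in [("fixed-variance Gaussian", gaussian), ("Gamma scale mixture", mixture)]:
    print(f"{name:>24}: P(|w| > 4) = {np.mean(np.abs(draws) > 4):.2e}, "
          f"excess kurtosis = {excess_kurtosis(draws):+.2f}")
```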
Below, we describe the two hierarchical spike-and-slab priors and the corresponding fully factorized variational families that we use in each of our proposed approaches.

4.2.1 Spike-and-Slab Group Lasso (SS-GL):

To facilitate optimal layer-wise node selection, we allow the prior inclusion probability λl to vary as a function of the layer index l.

Prior: We assume a spike-and-slab prior of the following form, with zlj being the indicator for the presence of the j-th node in the l-th layer:

π(wlj | zlj) = (1 − zlj) δ0 + zlj N(0, σ0² τlj² I),  π(zlj) = Ber(λl),  π(τlj²) = G(kl/2 + 1, ς²/2)

where l = 0, . . . , L and j = 1, . . . , kl+1. N(., .), Ber(.), and G(., .) represent the Gaussian, Bernoulli, and Gamma distributions. wlj = (wlj1, . . . , wlj(kl+1)) is the vector of edges incident on the j-th node in the l-th layer. δ0 is a Dirac spike vector of dimension kl + 1 with all zero entries, and I is the identity matrix of dimension (kl + 1) × (kl + 1). The zlj for j = 1, . . . , kl+1 all follow Ber(λl), allowing a common prior inclusion probability λl for all nodes of a given layer l. We set λL = 1 to ensure that no node selection occurs in the output layer. σ0 and τlj are the constant global and the variable local (per-node) scale mixture components of the Gaussian slab distribution. ς²/2 is the constant rate hyperparameter of the Gamma distribution.

Variational family: We consider the following fully factorized variational family:

q(wlj | zlj) = (1 − zlj) δ0 + zlj N(µlj, diag(σlj²)),  q(zlj) = Ber(γlj),  q(τlj²) = LN(µlj^{τ}, σlj^{τ}²)

for l = 0, . . . , L and j = 1, . . . , kl+1. LN(., .) denotes the Log-Normal distribution, so that q(log τlj²) ∼ N(µlj^{τ}, σlj^{τ}²). The spike-and-slab structure of the variational family ensures that the variational weight distributions follow a spike-and-slab structure, allowing exact node sparsity through the variational approximation. Further, the weight distributions conditioned on the node indicator variables are all independent of each other. The variational distribution of the parameters obtained after optimization will then inherently prune away redundant nodes from each layer. Moreover, we use the Log-Normal family instead of the Gamma family to approximate the Gamma-distributed τlj², since we then obtain closed-form expressions for dKL(q(τlj²), π(τlj² | ς²)). Additionally, µlj = (µlj1, . . . , µlj(kl+1)) and σlj² = (σlj1², . . . , σlj(kl+1)²) denote the vectors of variational mean and standard deviation parameters of the slab component of q(wlj | zlj), and diag(σlj²) is the diagonal matrix with σljj′² as the j′-th diagonal entry. Similarly, γlj denotes the variational inclusion probability parameter of q(zlj). We set γLj = 1 to ensure that no node selection occurs in the output layer. µlj^{τ} and σlj^{τ} denote the variational mean and standard deviation parameters of the Gaussian distribution associated with q(log τlj²).

ELBO: Let θ be the network weights and ϖ = (z, τ²) be the remaining parameters.
We minimize the loss function: L = −ELBO(q(θ, ϖ), π(θ, ϖ|D)), X Z L = −Eq(θ,ϖ) [log L(θ)] + q(zlj = 1) dKL (q(wlj |zlj = 1), π(wlj |τlj2 , zlj = 1))q(τlj2 )dτlj2 l,j X X + dKL (q(zlj ), π(zlj )) + dKL (q(τlj2 ), π(τlj2 |ς 2 )) l,j l,j X Z = −Eq(θ,ϖ) [log L(θ)] + q(zlj = 1) dKL (N (µlj , diag(σlj2 )), N (0, σ02 τlj2 I))q(τlj2 )dτlj2 l,j 109   {τ } 2{τ } X X 2 + dKL (Ber(γlj ), Ber(λl )) + dKL LN (µlj , σlj ), G(kl /2 + 1, ς /2) l,j l,j 4.2.2 Spike-and-Slab Group Horseshoe (SS-GHS): In this model, we consider spike-and-slab prior with group horseshoe distribution in the slab part. Prior: We consider regularized version of group horseshoe (Piironen and Vehtari, 2017) in the slab part to circumvent the numerical stability issues associated with the unregularized group horseshoe. We define our prior similar to SS-GL earlier. π(wlj |zlj ) = (1 − zlj )δ0 + zlj N (0, σ02 τelj2 s2 I) , τelj2 = c2 τlj2 /(c2 + τlj2 s2 )   π(zlj ) = Ber(λl ), π(τlj ) = C + (0, 1), π(s) = C + (0, s0 ) where l = 0, . . . , L, j = 1, . . . , kl+1 . C + (., .) denotes half Cauchy distribution. τelj2 is varying local (per node) scale parameter, s2 is the varying global scale parameter, and σ02 is the constant global scale parameter. Note that, when weights are strongly shrinking towards 0 then τlj2 s2 ≪ c2 and τelj2 → τlj2 s2 which leads to the unregularized version of the group horseshoe. Whereas, when weights are away from 0 then corresponding τlj2 s2 will be large, i.e, τlj2 s2 ≫ c2 and τelj2 → c2 , where c2 is constant. For these weights corresponding version of regularized group horseshoe in the slab follows N (0, σ02 c2 I). This helps in thinning out the heavy tails associated with the horseshoe prior. Next, the prior inclusion probabilities, λl are common for for all nodes from a given layer similar to SS-GL. Additionally, s0 is the scale parameter of half Cauchy prior on s that can be tuned for specific situations. Instead of directly working with the half-Cauchy distributions, we employ a decomposi- tion of the half-Cauchy that relies upon gamma and inverse gamma distributions (Louizos et al., 2017) as this allows us to compute the negative KL-divergence from the scale distribu- tion π(τ ) to an approximate log-normal scale posterior q(τ ) in closed form. More specifically, we have a half-Cauchy distribution that can be expressed in a non-centered parametrization 110 as: β̃ ∼ IG(1/2, 1), α̃ ∼ G(1/2, k 2 ), τ 2 = β̃ α̃ where IG(., .), G(., .) correspond to the inverse Gamma and Gamma distributions in the scale parametrization, and τ follows a half-Cauchy distribution with scale k. Therefore we re-express the whole SS-GHS prior hierarchy as: π(wlj |zlj ) = (1 − zlj )δ0 + zlj N (0, σ02 τelj2 s2 I) , π(zlj ) = Ber(λl )   π(βlj ) = IG (1/2, 1) , π(αlj ) = G (1/2, 1) , π(sb ) = IG (1/2, 1) , π(sa ) = G 1/2, s20  Variational family: We consider the following fully factorized variational family q(wlj |zlj ) = (1 − zlj )δ0 + zlj N (µlj , diag(σlj2 )) , q(zlj ) = Ber(γlj )   {β} {β} 2 {α} {α} 2 q(βlj ) = LN (µlj , σlj ), q(αlj ) = LN (µlj , σlj ), 2 2 q(sb ) = LN (µ{sb } , σ {sb } ), q(sa ) = LN (µ{sa } , σ {sa } ) for l = 0, . . . , L, j = 1, . . . , kl+1 . Similar to SS-GL variational family, we use Log-Normal family instead of Gamma and Inverse-Gamma families to approximate Gamma and Inverse- Gamma distributed variables to obtain closed form expression for the KL divergence between {β} {β} 2 {α} {α} 2 2 prior and variational distributions. 
Moreover, (µlj , σlj ), (µlj , σlj ), (µ{sb } , σ {sb } ), and 2 (µ{sa } , σ {sa } ) denote the variational mean and standard deviation parameters of the Gaus- sian distribution associated with q(log βlj ), q(log αlj ), q(sb ) and q(sa ). ELBO: Let θ be the network weights and ϖ = (z, τ 2 , s2 ) be the remaining parameters. Similar to SS-GL, We minimize the loss function: L = −ELBO(q(θ, ϖ), π(θ, ϖ|D)), L = −ELBO(q(θ, ϖ), π(θ, ϖ|D)) X = −Eq(θ,ϖ) [log L(θ)] + dKL (q(zlj ), π(zlj )) l,i Z Z "X Z Z + q(zlj = 1) dKL (q(wlj |zlj = 1), π(wlj |βlj , αlj , sb , sa , zlj = 1)) l,j # ×q(βlj ) q(αlj ) dβlj dαlj q(sb ) q(sa ) dsb dsa 111 Xh i + dKL (q(βlj ), π(βlj )) + dKL (q(αlj ), π(αlj )) + dKL (q(sb ), π(sb )) + dKL (q(sa ), π(sa )) l,j X = −Eq(θ,ϖ) [log L(θ)] + dKL (Ber(γlj ), Ber(λl )) l,i Z Z "X Z Z + q(zlj = 1) dKL (N (µlj , diag(σlj2 )), N (0, σ02 βlj αlj sb sa I)) l,j # ×q(βlj ) q(αlj ) dβlj dαlj q(sb ) q(sa ) dsb dsa {β} 2 {α} 2 Xh {β} {α} i + dKL (LN (µlj , σlj ), IG (1/2, 1)) + dKL (LN (µlj , σlj ), G (1/2, 1)) l,j 2 2 + dKL (LN (µ{sb } , σ {sb } ), IG (1/2, 1)) + dKL (LN (µ{sa } , σ {sa } ), G 1/2, s20 )  4.2.3 Algorithm and Computational Details We minimise the loss L for both SS-GL and SS-GHS models by recursively sampling their corresponding variational posterior, allowing us to propagate the information through the network. The Gaussian variational approximations, N (µlj , diag(σlj2 )), are reparameterized as µlj + σlj ⊙ ζlj for ζlj ∼ N (0, I), where ⊙ denotes the entry-wise (Hadamard) product. Continuous Relaxation. The discrete spike variables (z) are replaced with their continu- ous relaxation to circumvent the nondifferentiablility in L making practical implementation easier (Jang et al., 2017; Maddison et al., 2017). Specifically, the Gumbel-softmax (GS) distribution is used for continuous relaxation, that is q(zlj ) ∼ Ber(γlj ) is approximated by q(z̃lj ) ∼ GS(γlj , τ ), where z̃lj = (1 + exp(−ηlj /τ ))−1 , ηlj = log(γlj /(1 − γlj )) + log(ulj /(1 − ulj )), ulj ∼ U (0, 1) where τ is the temperature. We keep τ = 0.5 for all the experiments similar to (Jantre et al., 2021a). The use of z̃lj in the backward pass eases gradient calculation, while zlj is used in the forward pass for exact node sparsity. 112 Algorithm 4.1 Variational inference in SS-GL and SS-GHS Bayesian neural networks Inputs: training dataset, network architecture, and optimizer tuning parameters. Model inputs: prior parameters for T = (θ, z, τ 2 ) in SS-GL and T = (θ, z, τ 2 , s2 ) in SS-GHS. Variational inputs: number of Monte Carlo samples S. Output: Variational parameter estimates of network weights, scales, and sparsity. Method: Set initial values of variational parameters. repeat Generate S samples of βlj , zlj , z̃lj , τlj2 , s2 (for SS-GHS) Use βlj , τlj2 , s2 and zlj to compute L in forward pass Use βlj , τlj2 , s2 and z̃lj to compute gradient of L in backward pass Update the variational parameters with gradient of loss using stochastic gradient descent algorithm (e.g. SGD with momentum (Sutskever et al., 2013)) until change in ELBO < ϵ 4.3 Numerical Experiments In this section, we demonstrate the performance of our proposed SS-GL and SS-GHS approaches on network architectures and techniques used in practice. We consider multilayer perceptron (MLP), LeNet-5-Caffe, and ResNet architectures which we implement in PyTorch (Paszke et al., 2019). We perform image classification using aforementioned neural networks in widely used MNIST, Fashion-MNIST, and CIFAR-10 datasets. 
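For concreteness, the continuous relaxation described in Section 4.2.3 can be sketched in PyTorch as below. The function name is illustrative, and the straight-through pairing of the hard gate zlj with the soft gate z̃lj is one common way to realize the forward/backward split described there, not necessarily the exact implementation used in this work.

```python
import torch

# Sketch of the Gumbel-softmax relaxation of the Bernoulli node-inclusion gates.
# The soft gate z_tilde is differentiable in gamma; the hard 0/1 gate used in the
# forward pass is attached via a straight-through trick (an assumption here).

def relaxed_bernoulli_gate(gamma, temperature=0.5, eps=1e-8):
    """Sample a node-inclusion gate from the relaxation of Ber(gamma)."""
    u = torch.rand_like(gamma)
    eta = torch.log(gamma + eps) - torch.log(1.0 - gamma + eps) \
        + torch.log(u + eps) - torch.log(1.0 - u + eps)
    z_tilde = torch.sigmoid(eta / temperature)   # soft gate, differentiable in gamma
    z_hard = (z_tilde > 0.5).float()             # exact 0/1 gate (equals a Ber(gamma) draw)
    # Forward value is z_hard; the gradient flows through z_tilde.
    return z_hard + (z_tilde - z_tilde.detach())

gamma = torch.full((400,), 0.9, requires_grad=True)  # variational inclusion probabilities
z = relaxed_bernoulli_gate(gamma)
z.sum().backward()                                   # gradients reach gamma
print(z[:5], gamma.grad[:5])
```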
In all the experiments, we fix σ0² = 1 and σe² = 1. The remaining tuning-parameter details, such as the learning rate, minibatch size, and initial parameter choices, are provided in the Appendix. The prediction accuracy is calculated using the variational Bayes posterior mean estimator with 10 Monte Carlo samples at test time. We use swish (SiLU) activations (Elfwing et al., 2018; Ramachandran et al., 2017) instead of ReLU in our proposed SS-GL and SS-GHS models, similar to Jantre et al. (2021a)'s spike-and-slab Gaussian node selection model (SS-IG), to avoid the dying-neuron problem (Lu et al., 2020). Smoother activation functions such as sigmoid and tanh also help alleviate this problem; we choose swish since it has the best performance.

We provide node sparsity estimates for each linear hidden layer separately. For all models, the node sparsity in a given linear layer is the ratio of the number of neurons with at least one nonzero incoming edge to the original number of neurons present in that layer before training. For a convolution layer we provide the channel sparsity estimate, which is the ratio of the number of output channels with at least one nonzero incoming connection to the total number of output channels present in the dense counterpart. The layer-wise node or channel sparsity estimates provide a granular illustration of the structural compactness of the trained model. The structural sparsity in the trained model leads to lower computational complexity at test time, which is vital for resource-constrained devices.

4.3.1 MLP MNIST Classification

In this experiment, we use an MLP model with 2 hidden layers having 400 nodes per layer to fit the MNIST data, which consists of 60,000 small square 28×28-pixel grayscale images of handwritten single digits between 0 and 9. We preprocess the images in the MNIST data by dividing their pixel values by 126. The output layer has 10 neurons, since there are 10 classes in the MNIST data. We provide the graph of the prediction accuracy on the i.i.d. test data over the training period of 1200 epochs. We provide layer-wise node sparsity plots for both layers to highlight the dynamic structural compactness of the model during training. In what follows, we discuss the choice of ς² in the SS-GL model as well as the choice of creg values in the SS-GHS model.

SS-GL penalty parameter choice. In the SS-GL model, the value of ς² needs to be carefully tuned in numerical experiments (Xu and Ghosh, 2015). A very large value of ς² will overshrink the network weights, leading to biased estimates, whereas ς² → 0 will lead to a very diffuse distribution for the slab part. Instead, we place a conjugate gamma prior on the penalty parameter, ς² ∼ Γ(c, d), and estimate it through our variational inference framework via the approximating family q(ς²) := LN(µς, σς²).

Figure 4.2 summarizes the results of the MLP-MNIST experiment using the SS-GL model with fixed ς² = 1 and variable ς² ∼ Γ(c = 4, d = 2). The values of the shape (c = 4) and rate (d = 2) parameters were chosen based on a hyperparameter search and past literature.

Figure 4.2 SS-GL penalty parameter choice experiment results. Here, we demonstrate the performance of SS-GL with fixed ς² = 1 and variable ς² ∼ Γ(c = 4, d = 2). (a) The classification accuracy on the test data. (b) and (c) The node sparsity in layer-1 and layer-2 of the network, respectively. We observe that placing a prior on ς² yields better classification accuracy.
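The layer-wise node and channel sparsity summaries reported in the following figures can be computed directly from the pruned weight tensors; a minimal sketch is given below, with illustrative function names.

```python
import torch

# Sketch of the structural sparsity summaries described at the start of Section 4.3.

def node_sparsity(weight):
    """Fraction of output neurons of a linear layer with at least one nonzero incoming edge.

    `weight` is the (out_features, in_features) matrix of a pruned linear layer.
    """
    active = (weight != 0).any(dim=1)
    return active.float().mean().item()

def channel_sparsity(weight):
    """Fraction of output channels of a Conv2d layer with at least one nonzero incoming connection.

    `weight` has shape (out_channels, in_channels, k_h, k_w).
    """
    active = (weight.flatten(start_dim=1) != 0).any(dim=1)
    return active.float().mean().item()

# Example on a hypothetical pruned 400 x 785 weight matrix with half the rows zeroed out.
w = torch.randn(400, 785)
w[200:] = 0.0
print(node_sparsity(w))   # ~0.5
```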
We observe that inferring the value of ς² through Bayesian estimation significantly improves the predictive accuracy compared to fixed ς² (Figure 4.2a). The fixed-ς² model has better node sparsities in both layers of the MLP model (Figures 4.2b and 4.2c). This suggests that ς² = 1 might be overshrinking the weights, which assists in pruning them via the spike-and-slab prior but also hampers the predictive performance of the model. In the rest of the experiments involving SS-GL, we place the gamma prior ς² ∼ Γ(c = 4, d = 2) on the penalty parameter.

SS-GHS regularization constant choice. In what follows, we provide the MLP-MNIST experiment using the SS-GHS model with regularization constant values creg = 1 and creg = kl + 1. In the MLP, kl + 1 = 400 + 1 = 401 is a large constant and essentially acts as an unregularized model. We ran the unregularized version of the model and verified this claim, but do not provide the results for brevity. Figure 4.3 summarizes the results of the MLP-MNIST experiment using the SS-GHS model with creg = 1 and creg = kl + 1. We observe that both values of creg lead to the same predictive accuracy on the test data. However, in the creg = 1 scenario, the SS-GHS model has better layer-1 node sparsity (Figure 4.3b); the layer-2 node sparsity is the same for both creg values (Figure 4.3c). In the rest of the experiments involving SS-GHS, we choose creg = 1.

Figure 4.3 SS-GHS regularization constant choice experiment results. We demonstrate the performance of SS-GHS with regularization constants creg = 1 and creg = kl + 1 = 401. (a) The classification accuracy on the test data. (b) and (c) The node sparsity in layer-1 and layer-2 of the network, respectively. We observe that both creg choices lead to similar classification accuracies, with creg = 1 having better layer-1 node sparsity.
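The role of creg can be seen directly from the regularized-horseshoe scale τ̃lj² = c²τlj²/(c² + τlj²s²) of Section 4.2.2. The short sketch below (arbitrary values, with creg playing the role of c) shows how creg = 1 caps the slab scale of strongly expressed nodes, while creg = kl + 1 = 401 leaves it essentially unregularized.

```python
import torch

# Sketch of the regularized-horseshoe local scale; all numerical values are arbitrary.

def regularized_scale(tau2, s2, c_reg):
    c2 = c_reg ** 2
    return c2 * tau2 / (c2 + tau2 * s2)

tau2 = torch.tensor([1e-4, 1.0, 1e4])   # weakly to strongly expressed nodes
s2 = torch.tensor(1.0)
for c_reg in (1.0, 401.0):               # c_reg = 1 vs c_reg = k_l + 1 = 401
    print(c_reg, regularized_scale(tau2, s2, c_reg))
```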
MLP-MNIST comparison with SS-IG

We provide the MLP-MNIST experiment in which we compare our proposed models with the SS-IG model. The results are presented in Figure 4.4. We provide the test data accuracy, model compression ratio, FLOPs ratio, and layer-wise node sparsities in each experiment.

Additional metrics. We provide two additional metrics that relate to model compression and computational complexity. (i) Compression ratio: the ratio of the number of nonzero weights in the compressed network to that of the dense model; it is an indicator of the storage cost at test time. (ii) Floating point operations (FLOPs) ratio: the ratio of the number of FLOPs required to predict the output from the input at test time in the compressed network to that of its dense counterpart. We have detailed the FLOPs calculation for neural networks in Chapter 3, Appendix B. Layer-wise node and channel sparsities are directly related to the FLOPs ratio, hence we only provide the FLOPs ratio for the LeNet-5-Caffe and ResNet models.

Figure 4.4 MLP/MNIST experiment results. Here, we demonstrate the performance of our SS-GL (ς² ∼ Γ(c = 4, d = 2)) and SS-GHS (creg = 1) models compared against the SS-IG model. (a) The classification accuracy on the test data. (b) and (c) The compression ratio and FLOPs ratio. (d) and (e) The node sparsity in layer-1 and layer-2 of the network, respectively. We observe that our SS-GHS yields the most compact network with the best classification accuracy.

In Figure 4.4a, we observe that SS-GHS has better predictive accuracy than the SS-GL and SS-IG models. Moreover, the SS-GHS model not only has the lowest storage cost among the node selection models compared (Figure 4.4b) but also requires the fewest FLOPs for inference at test time (Figure 4.4c). In Figures 4.4d and 4.4e, we observe that SS-GHS has pruned away the largest number of nodes, in contrast to the SS-GL and SS-IG models, and this also leads to the maximum reduction in FLOPs, evident from Figure 4.4c. Lastly, the SS-GL and SS-IG models have similar predictive accuracies; however, SS-GL has lower layer-wise node sparsities in both layers, hence a lower FLOPs ratio, and it also has a lower storage cost at test time compared to SS-IG.

4.3.2 LeNet-5-Caffe Experiments

The results of the more complex LeNet-5-Caffe network experiments on MNIST and Fashion-MNIST are presented in Figure 4.5. We provide the test data accuracy, model compression ratio, and FLOPs ratio in each experiment over 1200 epochs. Here, the FLOPs ratio serves as a collective indicator of the layer-wise node sparsities, since FLOPs are directly related to how many neurons or channels remain in the linear or convolution layers, respectively.

In the LeNet-5-Caffe/MNIST experiment (Figures 4.5a-4.5c), we observe that our SS-GHS and SS-GL models have better predictive accuracy than SS-IG (Figure 4.5a). We observe that both the SS-GHS and SS-GL models have a better model compression ratio (Figure 4.5b). Moreover, all three models compared achieve a similar reduction in FLOPs (Figure 4.5c). In contrast with the MLP-MNIST experiment (Figure 4.4), our SS-GHS and SS-GL have the same performance on all metrics in the LeNet-5-Caffe/MNIST experiment.

In the LeNet-5-Caffe/Fashion-MNIST experiment (Figures 4.5d-4.5f), we observe that SS-GHS has better predictive accuracy than the SS-GL and SS-IG models. The storage cost reduction in the SS-GHS model is similar to SS-IG but better than SS-GL (Figure 4.5e). Next, SS-IG achieves the best reduction in FLOPs compared to both of our approaches, and SS-GHS has lower FLOPs than SS-GL. Lastly, the SS-GL and SS-IG models have similar predictive accuracies; however, SS-IG has lower FLOPs and storage cost at test time.

Figure 4.5 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results. Top row (a)-(c): the test accuracy, compression ratio, and FLOPs ratio for LeNet-5-Caffe on MNIST. Bottom row (d)-(f): the corresponding results for LeNet-5-Caffe on Fashion-MNIST.

4.3.3 Residual Network Experiments

This section presents an example demonstrating the trade-off between the computational complexity and memory cost at test time among our structured pruning methods and recent unstructured pruning methods in Residual Networks (ResNets) applied to the CIFAR-10 dataset (He et al., 2016). The CIFAR-10 dataset (Krizhevsky, 2009) consists of 60000 32×32 colour images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. We use the ResNet-20 and ResNet-32 architectures and follow the experimental setting provided by Sun et al. (2021). We compare our proposed SS-GL and SS-GHS methods with SS-IG (Jantre et al., 2021a), consistent sparse deep learning (BNNcs) (Sun et al., 2021), and the variational BNN with a mixture Gaussian prior (VBNN) (Blundell et al., 2015). In all of our experiments, we follow the same training setup as used in Sun et al.
(2021); that is, each model considered in Table 4.1 was trained using SGD with momentum for 300 epochs with a mini-batch size of 128, the momentum parameter was set to 0.9, and the initial learning rate was set to 0.1. The training data was preprocessed using the random erasing data augmentation strategy proposed by Zhong et al. (2020). We use a step-wise constant learning rate schedule and decrease the learning rate by a factor of 10 at epochs 150 and 225. We refine the parameters associated with the sparse subnetwork obtained after the first 300 epochs for an additional 100 epochs, during which the sparsity parameters in SS-IG and our models are not trained and only the parameters associated with the slab part are learned. We use swish activations in our SS-GL and SS-GHS models as well as in SS-IG. For the BNNcs and VBNN models, we use ReLU activations as recommended by their authors. In all the models, we set σe² = 1 and σ0² = 0.04, the same as Sun et al. (2021). In SS-IG, SS-GL, and SS-GHS we use a common λl = 10−4 after a hyperparameter search, since our theory does not cover Bayesian CNNs. However, establishing posterior consistency for Bayesian CNNs is an interesting future direction of work.

We quantify the predictive performance using the accuracy on the test data. Besides the test accuracy, we use compression (%) and pruned FLOPs (%), which are the compression ratio and FLOPs ratio discussed earlier converted to percentages. In this experiment, we only count parameters and FLOPs over the convolutional layers and the last fully connected layer, because our proposed methods focus on channel and node pruning of convolutional and linear layers, respectively. For the ResNet architectures, our proposed methods under the centered parameterization used in the previous experiments (Sections 4.3.1 and 4.3.2) have unstable performance. Instead, we incorporate a non-centered parameterization (Ghosh et al., 2019) to stabilize the training. Below we detail the non-centered parameterization strategy.

Non-Centered Parameterization

We adopt a non-centered parameterization for the Gaussian slab component in both prior setups to circumvent the pathological funnel-shaped geometries associated with the coupled posterior (Ingraham and Marks, 2017; Ghosh et al., 2019). Accordingly, the coupling between the weights wlj and the scales τ*lj² = τlj² (for SS-GL) or τ*lj² = τ̃lj² s² (for SS-GHS) is reformulated as

βlj ∼ N(0, σ0² I),  wlj = τ*lj βlj.

This formulation leads to independent sampling of the weights and scales from their respective prior distributions, which are now marginally uncorrelated, leading to simpler posterior geometries (Betancourt and Girolami, 2015). The non-centered reparameterization thus yields efficient posterior inference without changing the functional form of the respective prior.

We summarize the ResNet experiment results in Table 4.1. The comparison with the BNNcs and VBNN models indicates that our SS-GL and SS-GHS methods have significantly better prediction accuracy in both the ResNet-20 and ResNet-32 setups. Moreover, we demonstrate that even though the BNNcs and VBNN models have predefined high levels of pruned parameters, our models require significantly fewer FLOPs during inference at test time. This highlights the trade-off between unstructured and structured sparsity methods, where the former leads to a significant reduction in storage cost and the latter has significantly lower computational complexity at test time.

Table 4.1 ResNet-20/CIFAR-10 and ResNet-32/CIFAR-10 experiments results.
The results of each method are calculated by averaging over 3 independent runs, with the standard deviation reported in parentheses. For the BNNcs and VBNN models, we show the predefined percentages of pruned parameters used for magnitude pruning given in Sun et al. (2021).

Model | Method | Test Accuracy | Compression (%) | Pruned FLOPs (%)
ResNet-20 | BNNcs (20%) | 92.23 (0.16) | 19.29 (0.12) | 98.94 (0.38)
ResNet-20 | BNNcs (10%) | 91.43 (0.11) | 9.18 (0.13) | 99.13 (0.37)
ResNet-20 | VBNN (20%) | 89.61 (0.04) | 19.55 (0.01) | 100.00 (0.00)
ResNet-20 | VBNN (10%) | 88.43 (0.13) | 9.50 (0.00) | 99.93 (0.00)
ResNet-20 | SS-IG | 92.94 (0.15) | 79.52 (0.98) | 88.39 (1.00)
ResNet-20 | SS-GL (ours) | 92.99 (0.11) | 76.10 (1.55) | 85.15 (1.76)
ResNet-20 | SS-GHS (ours) | 92.87 (0.23) | 78.70 (0.42) | 86.18 (1.02)
ResNet-32 | BNNcs (10%) | 92.65 (0.03) | 9.15 (0.03) | 94.53 (0.86)
ResNet-32 | BNNcs (5%) | 91.39 (0.08) | 4.49 (0.02) | 90.79 (1.35)
ResNet-32 | VBNN (10%) | 89.37 (0.04) | 9.61 (0.01) | 99.99 (0.02)
ResNet-32 | VBNN (5%) | 87.38 (0.22) | 4.59 (0.01) | 94.27 (0.54)
ResNet-32 | SS-IG | 93.08 (0.23) | 55.28 (2.96) | 67.59 (2.36)
ResNet-32 | SS-GL (ours) | 93.33 (0.11) | 54.27 (1.73) | 66.93 (2.98)
ResNet-32 | SS-GHS (ours) | 93.15 (0.23) | 53.72 (2.11) | 66.68 (2.75)

In comparison with the SS-IG node selection model, we observe that our SS-GL and SS-GHS models have lower storage cost and FLOPs at test time with comparable predictive accuracy in the ResNet-20 architecture. In the ResNet-32 case, SS-GL has better predictive accuracy than SS-IG, while the compression (%) and pruned FLOPs (%) of our models are comparable to SS-IG within one standard deviation. This comparison further highlights the advantage of using group shrinkage priors instead of a Gaussian in the slab part of the spike-and-slab framework to achieve better test accuracies with a lower computational and memory footprint.

4.4 Conclusion and Discussion

In this chapter, we introduced compact Bayesian neural network methods that handle model compression in a principled manner. Our proposed spike-and-slab models combine automated sparsity learning with hierarchical group shrinkage priors: group lasso and group horseshoe. We provide computationally efficient and scalable variational inference algorithms for both models. In the large-scale experiments involving ResNet architectures, we relied on the non-centered parameterization, which ensured the numerical stability of our models. We demonstrate the superior performance of the group shrinkage priors over the Gaussian prior in the slab component in several experiments, which further highlights our point that group shrinkage priors shrink the collection of weights incident on a node close to zero, which in turn helps remove that node through the spike-and-slab framework. An immediate future work would be to establish the variational posterior consistency and the corresponding contraction rate for both models. Moreover, the superiority of the SS-GL and SS-GHS models over the SS-IG model could be established through a faster posterior convergence rate for the former models compared to the latter.

APPENDIX

ADDITIONAL NUMERICAL EXPERIMENTS DETAILS

A.1 Variational parameters initialization

We initialize the γlj's at a value close to 1 for all of our experiments. This ensures that at epoch 0 we have a fully connected deep neural network. The variational parameters µljj′ are initialized using Kaiming uniform initialization (He et al., 2015). Moreover, the σljj′ are reparameterized using the softplus function, σljj′ = log(1 + exp(ρljj′)), and the ρljj′ are initialized at a constant value of −6.
This keeps the initial values of σljj′ close to 0, ensuring that the initial network weights stay close to the Kaiming uniform initialization.

In SS-GL, µlj^{τ} is initialized using U(−0.6, 0.6) and σlj^{τ} = log(1 + exp(ρlj^{τ})), where the ρlj^{τ} are initialized to −6. Moreover, µς is initialized to 1 and σς = log(1 + exp(ρς)), where ρς is initialized to −6.

In SS-GHS, µlj^{α} and µlj^{β} are initialized using U(−0.6, 0.6), and σlj^{α} = log(1 + exp(ρlj^{α})) and σlj^{β} = log(1 + exp(ρlj^{β})), where ρlj^{α} and ρlj^{β} are initialized to −6. Next, µ^{sa} and µ^{sb} are initialized to 1, and σ^{sa} = log(1 + exp(ρ^{sa})) and σ^{sb} = log(1 + exp(ρ^{sb})), where ρ^{sa} and ρ^{sb} are initialized to −6.

A.2 Hyperparameters for training

In the MLP-MNIST and LeNet-5-Caffe-MNIST experiments, we use a 10−3 learning rate and a 1024 minibatch size for all three models compared. In the LeNet-5-Caffe-Fashion-MNIST experiment, we train SS-GL with a 10−3 learning rate, whereas SS-GHS and SS-IG are trained with 2 × 10−3; the minibatch size is 1024 for all three models compared. We train each model for 1200 epochs using the Adam optimizer in all the MLP and LeNet-5-Caffe experiments.

CHAPTER 5

SEQUENTIAL BAYESIAN NEURAL SUBNETWORK ENSEMBLES

5.1 Introduction

Bayesian neural networks (BNNs) have pushed the envelope of probabilistic machine learning through the combination of deep neural network architectures and Bayesian inference. However, due to the enormous number of parameters, BNNs adopt approximate inference techniques such as variational inference with a fully factorized approximating family (Jordan et al., 1999). Although this approximation is crucial for computational tractability, it could lead to under-utilization of a BNN's true potential (Izmailov et al., 2021). Recently, ensembles of neural networks (Lakshminarayanan et al., 2017) have been proposed to account for parameter/model uncertainty, which has been shown to be analogous to Bayesian model averaging and to sampling from the parameter posteriors in the Bayesian context to estimate the posterior predictive distribution (Wilson and Izmailov, 2020). In this spirit, the diversity of the ensemble has been shown to be key to improving the predictions, uncertainty, and robustness of the model. To this end, diverse ensembles can mitigate some of the shortcomings introduced by approximate Bayesian inference techniques without compromising computational tractability. Several different diversity-inducing techniques have been explored in the literature. The approaches range from using a specific learning rate schedule (Huang et al., 2017), to introducing kernelized repulsion terms among the ensembles in the loss function at train time (D'Angelo and Fortuin, 2021), mixtures of approximate posteriors to capture multiple posterior modes (Dusenberry et al., 2020), appealing to sparsity (albeit ad hoc) as a mechanism for diversity (Havasi et al., 2021; Liu et al., 2022), and finally appealing to diversity in model architectures through neural architecture and hyperparameter searches (Egele et al., 2021; Wenzel et al., 2020).

However, most approaches prescribe parallel ensembles, with each individual model in an ensemble starting from a different initialization, which can be expensive in terms of computation, as each ensemble member has to train longer to reach a high-performing neighborhood of the parameter space.
Although the aspect of ensemble diversity has taken center stage, the cost of training these ensembles has not received much attention. However, given that model sizes only keep growing as we advance in deep learning, it is crucial to reduce the training cost of the multiple individual models forming an ensemble in addition to increasing their diversity. To this end, sequential ensembling techniques offer an elegant solution to reduce the cost of obtaining multiple ensembles; their origin can be traced all the way back to Swann and Allinson (1998) and Xie et al. (2013), wherein ensembles are created by combining epochs in the learning trajectory. Jean et al. (2015) and Sennrich et al. (2016) use intermediate stages of model training to obtain the ensembles, and Moghimi et al. (2016) used boosting to generate ensembles. In contrast, recent works by Huang et al. (2017); Garipov et al. (2018); Liu et al. (2022) force the model to visit multiple local minima by cyclic learning rate annealing and collect ensembles only when the model reaches a local minimum. Notably, the aforementioned sequential ensembling techniques in the literature have been proposed in the context of deterministic machine learning models. Extending the sequential ensembling technique to Bayesian neural networks is attractive because we can potentially get high-performing ensembles without the need to train from scratch, analogous to sampling with a Markov chain Monte Carlo sampler that extracts samples from the posterior distribution. Furthermore, sequential ensembling is complementary to the parallel ensembling strategy: if the models and computational resources permit, each parallel ensemble member can generate multiple sequential ensembles, leading to an overall increase in the total number of diverse models in an ensemble.

On the other hand, Chapters 3 and 4 make a case for (i) automatic data-driven sparsity learning in Bayesian neural networks through the use of spike-and-slab priors, and (ii) the use of group sparsity priors (Louizos et al., 2017; Ghosh et al., 2019; Jantre et al., 2021a) to provide structural sparsity in Bayesian neural networks, leading to significant computational gains. In this work, we leverage automated structural sparsity learning using spike-and-slab priors, similar to Jantre et al. (2021a), in our approach to sequentially generate multiple Bayesian neural subnetworks with varying sparse connectivities, which when combined yield a highly diverse ensemble. To this end, we propose Sequential Bayesian Neural Subnetwork Ensembles (SeBayS) with the following major contributions:

• We propose a sequential ensembling strategy for Bayesian neural networks (BNNs) which learns multiple subnetworks in a single forward pass. The approach involves a single exploration phase with a large (constant) learning rate to find high-performing sparse network connectivity, yielding a structurally compact network. This is followed by multiple exploitation phases with sequential perturbation of the variational mean parameters using the corresponding variational standard deviations, together with piecewise-constant cyclic learning rates (a schematic sketch of this schedule follows this list).

• We combine the strengths of the automated sparsity-inducing spike-and-slab prior, which allows dynamic pruning during training and produces structurally sparse BNNs, and the proposed sequential ensembling strategy to efficiently generate diverse and sparse Bayesian neural networks, which we refer to as Bayesian neural subnetworks.
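As a rough illustration of the exploration/exploitation schedule referenced in the first contribution (detailed in Section 5.2.2), the sketch below encodes a single long constant-rate exploration phase followed by repeated piecewise-constant cycles; all phase lengths and learning-rate values here are hypothetical placeholders, not the settings used in the experiments.

```python
# Schematic learning-rate schedule: exploration with a large constant rate for t0
# epochs, then exploitation cycles of length t_ex that use a moderate rate for the
# first half and a small rate for the second half.  All values are illustrative.

def sebays_learning_rate(epoch, t0=150, t_ex=100, lr_explore=0.1,
                         lr_high=0.01, lr_low=0.001):
    if epoch < t0:                        # exploration phase
        return lr_explore
    phase_epoch = (epoch - t0) % t_ex     # position inside the current exploitation cycle
    return lr_high if phase_epoch < t_ex // 2 else lr_low

# Example: the schedule at epochs spanning exploration and two exploitation cycles.
for e in (0, 149, 150, 199, 200, 249, 250, 349):
    print(e, sebays_learning_rate(e))
```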
5.1.1 Related Work

Ensembles of neural networks: Ensembling techniques in the context of neural networks are increasingly being adopted in the literature due to their potential to improve accuracy and robustness and to quantify uncertainty. The simplest and most widely used approach is Monte Carlo dropout, which is based on Bernoulli noise (Gal and Ghahramani, 2016) and deactivates certain units during training and testing. This, along with techniques such as DropConnect (Wan et al., 2013) and Swapout (Singh et al., 2016), is referred to as an "implicit" ensemble, as the model ensembling happens internally in a single model. Although such methods are efficient, the gain in accuracy and robustness is limited, and they are mainly used in the context of deterministic models. Although most recent approaches have targeted parallel ensembling techniques, a few approaches appeal to parameter efficiency, such as BatchEnsemble (Wen et al., 2020), which decomposes ensemble members into a product of a shared matrix and a rank-one matrix and uses the latter for ensembling, and MIMO (Havasi et al., 2021), which discovers subnetworks from a larger network via a multi-input multi-output configuration. In the context of Bayesian neural network ensembles, Dusenberry et al. (2020) proposed a rank-1 parameterization of BNNs, where each weight matrix involves only a distribution on a rank-1 subspace, and used mixture approximate posteriors to capture multiple modes.

Sequential ensembling techniques offer an elegant solution to ensemble training but have not received much attention recently due to the community's wider focus on the diversity of ensembles and less on the computational cost. Notable sequential ensembling techniques are Huang et al. (2017); Garipov et al. (2018); Liu et al. (2022), which enable the model to visit multiple local minima through cyclic learning rate annealing and collect ensembles only when the model reaches a local minimum. The difference is that Huang et al. (2017) adopts cyclic cosine annealing, while Garipov et al. (2018) uses a piecewise-linear cyclic learning rate schedule inspired by geometric insights. Finally, Liu et al. (2022) adopts a piecewise-constant cyclic learning rate schedule. We also note that all of these approaches have been proposed primarily in the context of deterministic neural networks.

Our proposed approach (i) introduces sequential ensembling into Bayesian neural networks, (ii) combines it with dynamic sparsity through sparsity-inducing Bayesian priors to generate Bayesian neural subnetworks, and subsequently (iii) produces diverse model ensembles efficiently. It is also complementary to other parallel ensembling as well as efficient ensembling techniques.

5.2 Sequential Bayesian Neural Subnetwork Ensembles

5.2.1 Base Learner

We call the individual models that are part of an ensemble the "base learners". Here we provide the prior and corresponding fully factorized variational family that we use in our proposed sequential ensembles.

Prior Choice. A zero-mean Gaussian distribution is a widely popular choice of prior for the model parameters θ (Izmailov et al., 2021; Louizos et al., 2017; Mackay, 1992; Neal, 1996; Blundell et al., 2015). In our sequential ensemble of dense BNNs, we adopt the zero-mean Gaussian prior, similar to Blundell et al. (2015), in each individual BNN model that is part of an ensemble.
The prior and corresponding fully factorized variational family are given as follows:

p(θljk) = N(0, σ0²),  q(θljk) = N(µljk, σljk²)

where θljk is the k-th weight incident onto the j-th node (in an MLP) or output channel (in a CNN) in the l-th layer. N(., .) represents the Gaussian distribution. σ0² is the constant prior Gaussian variance and is chosen through a hyperparameter search. µljk and σljk are the variational mean and standard deviation parameters of q(θljk).

Dynamic sparsity learning for our sequential ensemble of sparse BNNs is achieved via the spike-and-slab prior. We adopt the sparse BNN model SS-IG (Jantre et al., 2021a) as the base learner to achieve structural sparsity in Bayesian neural networks. The prior and corresponding variational family of the SS-IG model are given in Section 2.3.

5.2.2 Sequential Ensembling and Bayesian Neural Subnetworks

We propose an ensembling procedure to obtain the base learners {θ^1, θ^2, · · · , θ^M} sequentially in a single training run and construct the ensemble. The ensemble predictions are calculated using the uniform average of the predictions obtained from each base learner. Specifically, if y_new^m represents the prediction of the m-th base learner, then the ensemble prediction of the M base learners (for continuous outcomes) is y_new = (1/M) Σ_{m=1}^{M} y_new^m.

Sequential Perturbations. Our ensembling strategy produces a diverse set of base learners from a single end-to-end training process. It consists of an exploration phase followed by M exploitation phases. The exploration phase is carried out with a large constant learning rate for a time t0. This allows us to explore high-performing regions of the parameter space. At the conclusion of the exploration phase, the variational posterior approximation for the model parameters reaches a good region of the posterior density surface. Next, during each equally spaced exploitation phase (of length tex) of the ensemble training, we first use a moderately large learning rate for time tex/2, followed by a small learning rate for the remaining tex/2. After the first model convergence step (time = t0 + tex), we perturb the mean parameters of the variational posterior distributions of the model weights using their corresponding standard deviations. The initial values of these mean variational parameters at each subsequent exploitation phase become µ′ljk = µljk ± ρ · σljk, where ρ is a perturbation factor. This perturbation and subsequent model learning strategy is repeated a total of M − 1 times, generating the M base learners (either dense or sparse BNNs) that form our sequential ensemble.

Sequential Bayesian Neural Subnetwork Ensemble (SeBayS). In this ensembling procedure we use a large (and constant) learning rate (e.g., 0.1) in the exploration phase to find a high-performing sparse network connectivity, in addition to exploring a wide range of model parameter variations. The use of a large learning rate facilitates the pruning of excessive nodes or output channels, leading to a compact Bayesian neural subnetwork. This structural compactness of the Bayesian neural subnetwork further helps after each sequential perturbation step by quickly converging to different local minima, potentially corresponding to different modes of the true Bayesian posterior distribution of the model parameters.

Freeze vs No Freeze Sparsity.
In our SeBayS ensemble, we propose to evaluate two different strategies during the exploitation phases: (1) SeBayS-Freeze: freezing the sparse connectivity after the exploration phase, and (2) SeBayS-No Freeze: letting the sparsity parameters continue to learn after the exploration phase. The first approach fixes the sparse connectivity, leading to lower computational complexity during the exploitation-phase training. The diversity in the SeBayS-Freeze ensemble is achieved via sequential perturbations of the mean parameters of the variational distribution of the active model parameters in the subnetwork. The second approach lets the sparsity learn beyond the exploration phase, leading to highly diverse subnetworks at the expense of more computational complexity compared to the SeBayS-Freeze approach.

Algorithm 5.1 Sequential Bayesian neural subnetwork ensemble (SeBayS) algorithm
1: Inputs: training data D = {(xi, yi)}_{i=1}^{N}, network architecture ηθ, ensemble size M, perturbation factor ρ, exploration phase training time t0, training time of each exploitation phase tex. Model inputs: prior hyperparameters for θ, z (for sparse models).
2: Output: Variational parameter estimates of network weights, scales, and sparsity.
3: Method: Set initial values of variational parameters: µinit, σinit, γinit.
   # Exploration Phase
4: for t = 1, 2, . . . , t0 do
5:   Update µlj^0, σlj^0, and γlj^0 (for sparse models) ← SGD(L).
6: end for
7: Fix the sparsity variational parameters γlj for freeze sparse models.
   # M Sequential Exploitation Phases
8: for m = 1, 2, . . . , M do
9:   for t = 1, 2, . . . , tex do
10:    Update µlj^m, σlj^m, and γlj^m (for no freeze sparse models) ← SGD(L).
11:  end for
12:  Save the variational parameters of the converged base learner ηθ^m.
13:  Perturb the variational mean parameters using the standard deviations: µinit^{m+1} = µ^m ± ρ · σ^m.
14:  Set the variational standard deviations to a small value: σinit^{m+1} = 10^{−6}.
15: end for

We found that the use of sequential perturbations and dynamic sparsity leads to high-performing subnetworks with different sparse connectivities. Compared to parallel ensembles, we achieve higher ensemble diversity in a single forward pass. The use of a spike-and-slab prior allows us to dynamically learn the sparsity during training, while the Bayesian framework provides uncertainty estimates of the model and sparsity parameters associated with the network. Our approach is the first in the literature that performs sequential ensembling of dynamic sparse neural networks, and more so in the context of Bayesian neural networks.
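The perturbation step in lines 13-14 of Algorithm 5.1 amounts to a one-line update of the variational parameters. A minimal PyTorch sketch is given below, where the random choice of the ± sign per parameter is an assumption and the container and function names are illustrative.

```python
import torch

# Sketch of the between-phase perturbation of the variational means (Algorithm 5.1,
# steps 13-14): mu <- mu +/- rho * sigma, then restart with very small sigma.

@torch.no_grad()
def perturb_variational_means(mu, sigma, rho=3.0, sigma_reset=1e-6):
    """Jitter each variational mean by rho standard deviations and reset sigma."""
    signs = torch.randn_like(mu).sign()   # random +/- per parameter (an assumption)
    mu.add_(signs * rho * sigma)          # mu <- mu +/- rho * sigma
    sigma.fill_(sigma_reset)              # tight posteriors at the start of the next phase
    return mu, sigma

mu, sigma = torch.zeros(5), torch.full((5,), 0.1)
print(perturb_variational_means(mu, sigma, rho=3.0))
```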
5.3 Numerical Experiments

In this section, we demonstrate the performance of our proposed SeBayS approach on network architectures and techniques used in practice. We consider ResNet-32 on CIFAR10 (He et al., 2016) and ResNet-56 on CIFAR100. These networks are trained with batch normalization, stepwise (piecewise constant) decreasing learning rate schedules, and augmented training data. We provide the source code, the details ensuring fairness, uniformity, and consistency in the training and evaluation of these approaches, and the reproducibility considerations for SeBayS and the baseline models in Appendix A.

Baselines. Our baselines include a deterministic deep neural network trained with SGD (frequentist model), BNN (Blundell et al., 2015), the spike-and-slab BNN for node sparsity (Jantre et al., 2021a), single forward pass ensemble models including the rank-1 BNN Gaussian ensemble (Dusenberry et al., 2020), MIMO (Havasi et al., 2021), and the EDST ensemble (Liu et al., 2022), and multiple forward pass ensemble methods: the DST ensemble (Liu et al., 2022) and a dense ensemble of deterministic neural networks. For a fair comparison, we keep the training hardware, environment, data augmentation, and training schedules the same for all models. We adopted and modified the open source code provided by Liu et al. (2022) and Nado et al. (2021) to implement and train the baselines. Further details about model implementation and learning parameters are provided in Appendix A.

Metrics. We quantify predictive accuracy and robustness focusing on the accuracy and negative log-likelihood (NLL) of the i.i.d. test data (CIFAR-10 and CIFAR-100) and the corrupted test data (CIFAR-10-C and CIFAR-100-C) involving 19 types of corruption (e.g., added blur, compression artifacts, frost effects) (Hendrycks and Dietterich, 2019). More details on the evaluation metrics are given in Appendix A.

Results. The results for the CIFAR10 and CIFAR100 experiments are presented in Tables 5.1 and 5.2, respectively. For all ensemble baselines, we keep the number of base models M = 3, similar to Liu et al. (2022). We report the results for sparse models in the upper half and dense models in the lower half of Tables 5.1 and 5.2. In our models, we choose the perturbation factor (ρ) to be 3. See Appendix E for additional results on the effect of the perturbation factor.

We observe that our BNN sequential ensemble consistently outperforms the single sparse and dense models, as well as the sequential ensemble models, in both the CIFAR10 and CIFAR100 experiments. Compared to models with 3 parallel runs, our BNN sequential ensemble outperforms the DST ensemble while being comparable to the dense ensemble in the simpler CIFAR10 experiments. Next, our SeBayS-Freeze and SeBayS-No Freeze ensembles outperform single BNN, SSBNN, and MIMO while being comparable to the deterministic model and rank-1 BNN in the CIFAR10 case, whereas in CIFAR10-C they outperform SSBNN, MIMO, and rank-1 BNN. Additionally, the SeBayS-No Freeze ensemble has performance comparable to the deterministic model and the EDST ensemble, while the SeBayS-Freeze ensemble outperforms both in CIFAR10-C. In the ResNet-32/CIFAR10 case, the SeBayS approach dynamically pruned off close to 50% of the parameters.
Table 5.1 ResNet-32/CIFAR10 experiment results. We mark the best results out of single-pass sparse models in bold and single-pass dense models in blue.

Methods | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓) | # Forward passes (↓)
SSBNN | 91.2 | 0.320 | 67.5 | 1.479 | 1
MIMO (M = 3) | 88.9 | 0.333 | 65.9 | 1.102 | 1
EDST Ensemble (M = 3) | 93.1 | 0.214 | 69.8 | 1.236 | 1
SeBayS-Freeze Ensemble (M = 3) | 92.5 | 0.273 | 70.4 | 1.344 | 1
SeBayS-No Freeze Ensemble (M = 3) | 92.4 | 0.274 | 69.8 | 1.356 | 1
DST Ensemble (M = 3) | 93.3 | 0.206 | 71.9 | 1.018 | 3
Deterministic | 92.6 | 0.378 | 69.9 | 2.143 | 1
BNN | 91.9 | 0.353 | 71.3 | 1.422 | 1
Rank-1 BNN (M = 3) | 92.4 | 0.238 | 68.7 | 1.271 | 1
BNN Sequential Ensemble (M = 3) | 93.8 | 0.265 | 73.3 | 1.341 | 1
Dense Ensemble (M = 3) | 93.8 | 0.214 | 72.5 | 1.381 | 3

In the more complex ResNet-56/CIFAR100 experiment, our SeBayS-Freeze ensemble outperforms SSBNN and MIMO in both CIFAR100 and CIFAR100-C, while it outperforms the deterministic model in CIFAR100-C. Next, our SeBayS-No Freeze ensemble outperforms SSBNN in both CIFAR100 and CIFAR100-C, while it outperforms MIMO in CIFAR100 and the deterministic model in CIFAR100-C. Given the complexity of CIFAR100, our SeBayS approach was able to dynamically prune off close to 18% of the ResNet-56 model parameters.

Table 5.2 ResNet-56/CIFAR100 experiment results. We mark the best results out of single-pass sparse models in bold and single-pass dense models in blue.

Methods | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓) | # Forward passes (↓)
SSBNN | 67.9 | 1.511 | 38.9 | 4.527 | 1
MIMO (M = 3) | 65.8 | 1.528 | 42.3 | 2.522 | 1
EDST Ensemble (M = 3) | 71.9 | 0.997 | 44.3 | 2.787 | 1
SeBayS-Freeze Ensemble (M = 3) | 69.4 | 1.393 | 42.4 | 3.855 | 1
SeBayS-No Freeze Ensemble (M = 3) | 69.4 | 1.403 | 41.7 | 3.906 | 1
DST Ensemble (M = 3) | 74.0 | 0.914 | 46.7 | 2.529 | 3
Deterministic | 69.8 | 1.786 | 41.6 | 5.856 | 1
BNN | 70.4 | 1.335 | 43.2 | 3.774 | 1
Rank-1 BNN (M = 3) | 70.7 | 1.075 | 43.9 | 2.752 | 1
BNN Sequential Ensemble (M = 3) | 72.2 | 1.250 | 44.9 | 3.537 | 1
Dense Ensemble (M = 3) | 74.2 | 1.236 | 45.4 | 4.093 | 3

5.4 Sequential BNN Ensemble Analysis

5.4.1 Function Space Analysis

Quantitative Metrics. We measure the diversity of the base learners in our sequential ensembles by quantifying the pairwise dissimilarity of the base learners' predictions on the test data. The average pairwise dissimilarity is given by

$$D_d = \mathbb{E}\big[d\big(P_1(y \mid x_1, \cdots, x_N),\, P_2(y \mid x_1, \cdots, x_N)\big)\big]$$

where $d(\cdot, \cdot)$ is a distance metric between the predictive distributions and $\{(x_i, y_i)\}_{i=1,\cdots,N}$ are the test data. We consider two distance metrics: (1) Disagreement, the fraction of predictions on the test data on which the base learners disagree,

$$d_{\mathrm{dis}}(P_1, P_2) = \frac{1}{N} \sum_{i=1}^{N} I\big(\arg\max_{\hat{y}_i} P_1(\hat{y}_i) \neq \arg\max_{\hat{y}_i} P_2(\hat{y}_i)\big);$$

and (2) Kullback-Leibler (KL) divergence, $d_{\mathrm{KL}}(P_1, P_2) = \mathbb{E}[\log P_1(y) - \log P_2(y)]$. When two models have the same predictions for all the test data, both the disagreement and the KL divergence are zero.

Table 5.3 Diversity metrics in ResNet-32/CIFAR-10 and ResNet-56/CIFAR100 experiments. We mark the best results out of single-pass models in bold. (Columns 2–4: ResNet-32/CIFAR10; columns 5–7: ResNet-56/CIFAR100.)

Methods | ddis (↑) | dKL (↑) | Acc (↑) | ddis (↑) | dKL (↑) | Acc (↑)
EDST Ensemble | 0.058 | 0.106 | 93.1 | 0.209 | 0.335 | 71.9
BNN Sequential Ensemble | 0.061 | 0.201 | 93.8 | 0.208 | 0.493 | 72.2
SeBayS-Freeze Ensemble | 0.060 | 0.138 | 92.5 | 0.212 | 0.452 | 69.4
SeBayS-No Freeze Ensemble | 0.106 | 0.346 | 92.4 | 0.241 | 0.597 | 69.4
DST Ensemble | 0.085 | 0.205 | 93.3 | 0.292 | 0.729 | 74.0

We report the results of the diversity analysis of the base learners that make up our sequential ensembles in Table 5.3 and compare them with the DST and EDST ensembles. We observe that for the simpler CIFAR10 case, our sequential perturbation strategy helps in generating diverse base learners compared to the EDST ensemble.
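For concreteness, the two diversity metrics defined above (prediction disagreement and pairwise KL divergence) can be computed from the base learners' predicted class probabilities on the test set as in the following sketch; the array names and random example inputs are illustrative assumptions, and the KL expectation is taken under $P_1$ and averaged over test points.

```python
# Minimal sketch of the pairwise diversity metrics (disagreement and KL divergence)
# computed from per-learner predicted probabilities; inputs here are random stand-ins.
import numpy as np

def disagreement(probs_a, probs_b):
    """Fraction of test points on which the two learners' argmax predictions differ."""
    return float(np.mean(probs_a.argmax(axis=1) != probs_b.argmax(axis=1)))

def kl_divergence(probs_a, probs_b, eps=1e-12):
    """Average KL(P_a || P_b) over test points: E[log P_a(y) - log P_b(y)] under P_a."""
    pa = np.clip(probs_a, eps, 1.0)
    pb = np.clip(probs_b, eps, 1.0)
    return float(np.mean(np.sum(pa * (np.log(pa) - np.log(pb)), axis=1)))

# Example with random softmax outputs for two base learners (N = 1000 points, 10 classes).
rng = np.random.default_rng(0)
logits_a, logits_b = rng.normal(size=(1000, 10)), rng.normal(size=(1000, 10))
probs_a = np.exp(logits_a) / np.exp(logits_a).sum(axis=1, keepdims=True)
probs_b = np.exp(logits_b) / np.exp(logits_b).sum(axis=1, keepdims=True)
print(disagreement(probs_a, probs_b), kl_divergence(probs_a, probs_b))
```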
Specifically, the SeBayS-No Freeze ensembles have significantly higher prediction disagreement and KL divergence than all the other methods, even surpassing the DST ensembles, which involve multiple parallel runs. In the more complex CIFAR100 setup, we observe that the SeBayS-No Freeze ensemble has the highest diversity metrics among the single-pass ensemble learners. This highlights the importance of dynamic sparsity learning during each exploitation phase.

Training Trajectory. We use t-SNE (Van Der Maaten and Hinton, 2008) to visualize the training trajectories of the base learners obtained using our sequential ensembling strategy in function space. In the ResNet-32/CIFAR10 experiment, we periodically save checkpoints during each exploitation training phase and collect the predictions on the test dataset at each checkpoint. After training, we use t-SNE plots to project the collected predictions into 2D space. In Figure 5.1, the local optima reached by the individual base learners using sequential ensembling are fairly different in all three models. The distance between the optima can be explained by the fact that the perturbed variational parameters in each exploitation phase try to reach nearby local optima.

Figure 5.1 Training trajectories of base learners in the ResNet-32/CIFAR10 experiment. Training trajectories obtained by the BNN sequential ensemble, SeBayS-Freeze Ensemble, and SeBayS-No Freeze Ensemble; each panel, (a)–(c), plots Learner-1, Learner-2, and Learner-3 in the two t-SNE dimensions.

Figure 5.2 Dynamic sparsity and FLOPs curves. Panels: (a) pruned parameter ratio, (b) pruned FLOPs ratio. The curves show the ratio of remaining parameters and FLOPs for our SeBayS-Freeze and SeBayS-No Freeze ensembles in the ResNet-32/CIFAR10 experiment.

5.4.2 Dynamic Sparsity Learning

In this section, we highlight the dynamic sparsity training in our SeBayS ensemble methods. We focus on the ResNet-32/CIFAR10 experiment and consider M = 3 exploitation phases. In particular, we plot the ratios of remaining parameters and floating point operations (FLOPs) in the SeBayS sparse base learners. In Figure 5.2, we observe that during the exploration phase, SeBayS prunes off 50% of the network parameters and more than 35% of the FLOPs compared to its dense counterpart.

5.4.3 Effect of Ensemble Size

In this section, we explore the effect of the ensemble size M in the ResNet-32/CIFAR10 experiment. According to the ensembling literature (Hansen and Salamon, 1990; Ovadia et al., 2019), increasing the number of diverse base learners in the ensemble improves predictive performance, although with a diminishing impact. In our ensembles, we generate models and aggregate performance sequentially with increasing M, the number of base learners in the ensemble.

Figure 5.3 Predictive performance of the base learners and the sequential ensembles as the ensemble size M varies in the ResNet-32/CIFAR10 experiment. Panels: (a) BNN Sequential Ensemble, (b) SeBayS-Freeze, (c) SeBayS-No Freeze.

In Figure 5.3, we plot the performance of the individual base learners, as well as the sequential ensemble, as M varies. For individual learners, we provide the mean test accuracy with the corresponding one-standard-deviation spread. When M = 1, the ensemble and the individual model refer to a single base learner, and hence their performance is matched.
As M grows, we observe a significant increase in the performance of our ensemble models, with diminishing improvement for higher M. The high performance of our sequential ensembles compared to their individual base models further underscores the benefits of ensembling in a sequential manner.

5.5 Conclusion and Discussion

In this work, we propose the SeBayS ensemble, an approach that generates sequential Bayesian neural subnetwork ensembles by combining a novel sequential ensembling procedure for BNNs with dynamic sparsity driven by a sparsity-inducing Bayesian prior. It provides a simple and effective way to improve predictive performance and model robustness. The highly diverse Bayesian neural subnetworks converge to different optima in function space and, when combined, form an ensemble whose performance improves with increasing ensemble size. Our simple yet highly effective sequential perturbation strategy enables a dense BNN ensemble to outperform the deterministic dense ensemble. Meanwhile, the Bayesian neural subnetworks obtained using the spike-and-slab node pruning prior produce highly diverse ensembles; in particular, the SeBayS-No Freeze ensembles are more diverse than the EDST ensemble in both the CIFAR10 and CIFAR100 experiments and than the DST ensemble in our simpler CIFAR10 experiment.

Future work will explore combining parallel ensembling with our sequential ensembles, leading to a multilevel ensembling model. In particular, we will leverage the exploration phase to reach a highly sparse network, then perturb it more than once and learn each subnetwork in parallel while performing sequential exploitation phases on each subnetwork. We expect this strategy to yield highly diverse base learners with potentially significant improvements in model performance and robustness.

APPENDICES

APPENDIX A REPRODUCIBILITY CONSIDERATIONS

A.1 Hyperparameters

Hyperparameters for single and parallel ensemble models. For the ResNet/CIFAR models, we use a minibatch size of 128 uniformly across all the methods. We train each single model (Deterministic, BNN, SSBNN) as well as each member of the Dense and DST ensembles for 250 epochs with a learning rate of 0.1, which is decayed by a factor of 0.1 at 150 and 200 epochs. For frequentist methods, we use a weight decay of 5e-4, whereas for Bayesian models the weight decay is 0 (since the KL term in the loss acts as a regularizer). For the DST ensemble, we take the sparsity S = 0.8, the update interval ∆T = 1000, and the exploration rate p = 0.5, same as Liu et al. (2022).

Hyperparameters for sequential ensemble models. For the ResNet/CIFAR models, the minibatch size is 128 for all the methods compared. We train each sequential model with M = 3 for 450 epochs. In the BNN and SeBayS ensembles, the exploration phase is run for $t_0 = 150$ epochs and each exploitation phase is run for $t_{ex} = 100$ epochs. We fix the perturbation factor to be 3. During the exploration phase, we take a high learning rate of 0.1, whereas for each exploitation phase, we use a learning rate of 0.01 for the first $t_{ex}/2 = 50$ epochs and 0.001 for the remaining $t_{ex}/2 = 50$ epochs. For the EDST ensemble, we take an exploration time ($t_{ex}$) of 150 epochs, each refinement phase time ($t_{re}$) of 100 epochs, sparsity S = 0.8, and exploration rate q = 0.8, same as Liu et al. (2022).
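For reference, the stepwise schedule described above for the sequential ensembles can be written as a small helper; this is an illustrative sketch rather than the training code, with the phase lengths and learning rates taken from the values stated in this section.

```python
# Illustrative helper reproducing the stepwise learning-rate schedule of the
# sequential ensembles: 150 exploration epochs at 0.1, then exploitation phases
# of 100 epochs split into 50 epochs at 0.01 followed by 50 epochs at 0.001.
def sequential_lr(epoch, t0=150, tex=100, lr_explore=0.1, lr_high=0.01, lr_low=0.001):
    if epoch < t0:
        return lr_explore
    phase_epoch = (epoch - t0) % tex
    return lr_high if phase_epoch < tex // 2 else lr_low

# Example: learning rates at a few phase boundaries of the 450-epoch run.
print([sequential_lr(e) for e in (0, 149, 150, 199, 200, 250, 449)])
# -> [0.1, 0.1, 0.01, 0.01, 0.001, 0.01, 0.001]
```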
A.2 Data Augmentation

For the CIFAR10 and CIFAR100 training datasets, we first pad the training images with 4 pixels of value 0 on all borders and then crop the padded image at a random location, generating training images of the same size as the originals. Next, with a probability of 0.5, we horizontally flip a given cropped image. Finally, we normalize the images using mean = (0.4914, 0.4822, 0.4465) and standard deviation = (0.2470, 0.2435, 0.2616) for CIFAR10, and mean = (0.5071, 0.4865, 0.4409) and standard deviation = (0.2673, 0.2564, 0.2762) for CIFAR100. Next, we split the training data of 50000 images into a TRAIN/VALIDATION split of 45000/5000 transformed images. For the CIFAR10/100 test data, we normalize the 10000 test images in each case using the corresponding mean and standard deviation of their respective training data.

A.3 Evaluation Metrics

We quantify the predictive performance of each method using the accuracy on the test data (Acc). As a measure of robustness or predictive uncertainty, we use the negative log-likelihood (NLL) calculated on the test dataset. Moreover, we adopt {cAcc, cNLL} to denote the corresponding metrics on the corrupted test datasets. We also use the VALIDATION data to determine the best epoch of each model, which is later used for the TEST data evaluation. For the Deterministic model and each member of the Dense, MIMO, and DST ensembles, we use a single prediction for each test data element and calculate the corresponding evaluation metrics for each individual model. For all the Bayesian models, we use one Monte Carlo sample to generate the network parameters and correspondingly generate a single prediction for each single model, which is used to calculate the evaluation metrics for those individual models. For all the ensemble models, we generate a single prediction from each base learner present in the ensemble. Next, we evaluate the ensemble prediction using a simple average of the M predictions generated from the M base learners and use this averaged prediction to calculate the evaluation metrics mentioned above for the ensemble models.

A.4 Hardware and Software

The Deterministic, MIMO, Rank-1 BNN, and Dense Ensemble models are run using the Uncertainty Baselines (Nado et al., 2021) repository, but with the data, model, and hyperparameter settings described in Section 5.3. Moreover, we consistently run all the experiments on a single NVIDIA A100 GPU for all the approaches evaluated in this work.

APPENDIX B OUT-OF-DISTRIBUTION EXPERIMENT RESULTS

In Table B.1, we present the AUROC results for out-of-distribution (OoD) detection for the ResNet-32/CIFAR10 models. In this case, the out-of-distribution data was taken to be CIFAR100. The results show that our SeBayS-Freeze Ensemble performs better than the single SSBNN and MIMO models. On the other hand, the SeBayS-No Freeze Ensemble performs better than SSBNN. Next, our BNN sequential ensemble performs better than the deterministic and BNN models.

Table B.1 OoD detection results in ResNet-32/CIFAR10 experiment. We mark the best results out of single-pass sparse models in bold and single-pass dense models in blue.
Methods | AUROC (↑) | # Forward passes (↓)
SSBNN | 0.806 | 1
MIMO (M = 3) | 0.840 | 1
EDST Ensemble (M = 3) | 0.872 | 1
SeBayS-Freeze Ensemble (M = 3) | 0.864 | 1
SeBayS-No Freeze Ensemble (M = 3) | 0.842 | 1
DST Ensemble (M = 3) | 0.879 | 3
Deterministic | 0.854 | 1
BNN | 0.841 | 1
Rank-1 BNN (M = 3) | 0.866 | 1
BNN Sequential Ensemble (M = 3) | 0.863 | 1
Dense Ensemble (M = 3) | 0.879 | 3

APPENDIX C EFFECT OF THE ENSEMBLE SIZE

In Section 5.4.3, we explored the effect of the ensemble size M in the ResNet-32/CIFAR10 experiment by comparing the mean individual-learner test accuracies with the ensemble accuracies on the uncorrupted test dataset. In Table C.1, we provide the results on both the CIFAR10 and CIFAR10-C datasets for our sequential ensembles with an increasing number of base learners, M = 3, 5, 10. We also provide the BNN and SSBNN baselines to compare against the BNN sequential ensemble and the SeBayS ensembles, respectively. We observe that our BNN sequential ensemble and SeBayS ensembles of any size significantly outperform the single BNN and SSBNN models, respectively. With an increasing number of base learners (M = 3, 5, 10) within each of our sequential ensembles, we observe monotonically increasing predictive performance. The NLLs for the BNN sequential ensemble decrease as M increases. The NLLs for the SeBayS ensembles are either similar or increasing as M increases, which suggests the influence of the KL divergence term in the ELBO optimization in variational inference.

Table C.1 Ensemble size effect results in ResNet-32/CIFAR10 experiment. We mark the best results out of the sparse models in bold and the dense models in blue.

Methods | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓) | # Forward passes (↓)
SSBNN | 91.2 | 0.320 | 67.5 | 1.479 | 1
SeBayS-Freeze Ensemble (M = 3) | 92.5 | 0.273 | 70.4 | 1.344 | 1
SeBayS-Freeze Ensemble (M = 5) | 92.5 | 0.273 | 70.9 | 1.359 | 1
SeBayS-Freeze Ensemble (M = 10) | 92.7 | 0.275 | 71.0 | 1.386 | 1
SeBayS-No Freeze Ensemble (M = 3) | 92.4 | 0.274 | 69.8 | 1.356 | 1
SeBayS-No Freeze Ensemble (M = 5) | 92.5 | 0.271 | 70.2 | 1.375 | 1
SeBayS-No Freeze Ensemble (M = 10) | 92.7 | 0.272 | 70.8 | 1.375 | 1
BNN | 91.9 | 0.353 | 71.3 | 1.422 | 1
BNN Sequential Ensemble (M = 3) | 93.8 | 0.265 | 73.3 | 1.341 | 1
BNN Sequential Ensemble (M = 5) | 94.1 | 0.253 | 73.7 | 1.318 | 1
BNN Sequential Ensemble (M = 10) | 94.2 | 0.244 | 73.9 | 1.300 | 1

APPENDIX D EFFECT OF THE MONTE CARLO SAMPLE SIZE

In variational inference, the model prediction during the evaluation phase is calculated as the average of the predictions from an ensemble of networks whose weights each represent one sample from the posterior distribution of the weights. The number of such networks used to build the ensemble prediction is called the Monte Carlo (MC) sample size. In Table D.1, we present our sequential ensemble models as well as the BNN and SSBNN baselines in the ResNet-32/CIFAR10 experiment. Here, we take MC = 1, which is used in the Section 5.3 experiments, and compare it with MC = 5 for each method. In the single BNN and SSBNN models, we observe a significant improvement in model performance when using MC = 5 instead of 1. However, when we compare the SeBayS ensembles using MC = 1 or 5 with SSBNN using MC = 5, we observe that their performance is similar, indicating that MC = 1 is sufficient for our SeBayS ensembles. On the other hand, the sequential BNN ensemble using MC = 1 performs better than BNN with MC = 5, whereas the sequential BNN ensemble using MC = 1 and MC = 5 has similar performance. This highlights the importance of the sequential perturbation strategy, which leads to more diverse ensembles compared to mere Monte Carlo sampling.
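For concreteness, a minimal sketch of Monte Carlo prediction averaging under the variational posterior is given below, before Table D.1. The toy stochastic layer and all names are illustrative assumptions standing in for the reparameterized Bayesian layers used in this chapter.

```python
# Minimal sketch of MC prediction averaging under the variational posterior;
# NoisyLinear is a toy stand-in for a reparameterized Bayesian layer.
import torch

class NoisyLinear(torch.nn.Module):
    """Toy layer that samples its weights on every forward pass."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = torch.nn.Parameter(0.1 * torch.randn(d_in, d_out))
        self.sigma = 0.05
    def forward(self, x):
        w = self.mu + self.sigma * torch.randn_like(self.mu)
        return x @ w

def mc_predict(model, x, mc_samples=1):
    """Average softmax predictions over mc_samples stochastic forward passes."""
    with torch.no_grad():
        probs = [torch.softmax(model(x), dim=-1) for _ in range(mc_samples)]
    return torch.stack(probs).mean(dim=0)

def ensemble_predict(models, x, mc_samples=1):
    """Average the MC-averaged predictions across the M base learners."""
    return torch.stack([mc_predict(m, x, mc_samples) for m in models]).mean(dim=0)

models = [NoisyLinear(8, 3) for _ in range(3)]          # M = 3 base learners
x = torch.randn(4, 8)
print(ensemble_predict(models, x, mc_samples=5).shape)  # torch.Size([4, 3])
```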
Table D.1 Monte Carlo sample size effect results in ResNet-32/CIFAR10 experiment. We mark the best results out of the sparse models in bold and dense models in blue. MC is the Monte Carlo sample size.

Methods | MC | Acc (↑) | NLL (↓)
SSBNN | 1 | 91.2 | 0.320
SSBNN | 5 | 92.3 | 0.270
SeBayS-Freeze Ensemble (M = 3) | 1 | 92.5 | 0.273
SeBayS-Freeze Ensemble (M = 3) | 5 | 92.5 | 0.270
SeBayS-No Freeze Ensemble (M = 3) | 1 | 92.4 | 0.274
SeBayS-No Freeze Ensemble (M = 3) | 5 | 92.6 | 0.268
BNN | 1 | 91.9 | 0.353
BNN | 5 | 93.2 | 0.271
BNN Sequential Ensemble (M = 3) | 1 | 93.8 | 0.265
BNN Sequential Ensemble (M = 3) | 5 | 93.9 | 0.254

APPENDIX E EFFECT OF THE PERTURBATION FACTOR

In this Appendix, we explore the influence of the perturbation factor on our sequential ensemble models through the ResNet-32/CIFAR10 experiment. In Table E.1, we report the results for our three sequential approaches for three perturbation factors, ρ = 2, 3, 5. For our SeBayS-Freeze and No Freeze ensembles, the lower perturbation with ρ = 2 leads to higher test accuracies and lower NLLs than ρ = 3, 5 on both the CIFAR10 and CIFAR10-C test datasets. This suggests that the higher perturbations ρ = 3, 5 might need a higher number of epochs to reach convergence in each exploitation phase. However, in the BNN sequential ensemble, ρ = 3 has an overall higher performance compared to ρ = 2, 5. This points to the fact that the lower perturbation, ρ = 2, may not lead to the best ensemble model.

In Table E.2, we present the prediction disagreement and KL divergence metrics for the experiments described in this Appendix. In the BNN sequential ensemble, the ρ = 5 perturbation model has the best diversity metrics, whereas the ρ = 3 perturbation model has the best accuracy. In the SeBayS approach, the perturbation of ρ = 3 leads to the best diversity metrics, nonetheless at the expense of slightly lower predictive performance. This highlights the fact that the ρ = 3 SeBayS approaches lead to the best ensembles given the training budget constraint. Hence, we use ρ = 3 for our three sequential models in all the experiments presented in Section 5.3.

Table E.1 Perturbation factor effect results in ResNet-32/CIFAR10 experiment. We mark the best results out of different perturbation models under a given method in bold. Ensemble size is fixed at M = 3. ρ is the perturbation factor.

Methods | ρ | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓)
SeBayS-Freeze Ensemble | 2 | 92.7 | 0.264 | 70.6 | 1.303
SeBayS-Freeze Ensemble | 3 | 92.5 | 0.273 | 70.4 | 1.344
SeBayS-Freeze Ensemble | 5 | 92.5 | 0.267 | 70.6 | 1.314
SeBayS-No Freeze Ensemble | 2 | 92.7 | 0.268 | 70.4 | 1.331
SeBayS-No Freeze Ensemble | 3 | 92.4 | 0.274 | 69.8 | 1.356
SeBayS-No Freeze Ensemble | 5 | 92.4 | 0.272 | 70.1 | 1.353
BNN Sequential Ensemble | 2 | 93.6 | 0.269 | 73.3 | 1.361
BNN Sequential Ensemble | 3 | 93.8 | 0.265 | 73.3 | 1.341
BNN Sequential Ensemble | 5 | 93.6 | 0.262 | 73.0 | 1.366

Table E.2 Diversity metrics for models trained with different perturbation factors in ResNet-32/CIFAR-10 experiment. We mark the best results out of different perturbation models under a given method in bold. Ensemble size is fixed at M = 3. ρ is the perturbation factor.
Methods | ρ | ddis (↑) | dKL (↑) | Acc (↑)
BNN Sequential Ensemble | 2 | 0.062 | 0.205 | 93.6
BNN Sequential Ensemble | 3 | 0.061 | 0.201 | 93.8
BNN Sequential Ensemble | 5 | 0.063 | 0.211 | 93.6
SeBayS-Freeze Ensemble | 2 | 0.058 | 0.135 | 92.7
SeBayS-Freeze Ensemble | 3 | 0.060 | 0.138 | 92.5
SeBayS-Freeze Ensemble | 5 | 0.059 | 0.137 | 92.5
SeBayS-No Freeze Ensemble | 2 | 0.082 | 0.222 | 92.7
SeBayS-No Freeze Ensemble | 3 | 0.106 | 0.346 | 92.4
SeBayS-No Freeze Ensemble | 5 | 0.083 | 0.228 | 92.4

APPENDIX F EFFECT OF THE CYCLIC LEARNING RATE SCHEDULE

In this Appendix, we examine the effect of different cyclic learning rate strategies during the exploitation phases in our three sequential ensemble methods. We explore the stepwise (our approach), cosine (Huang et al., 2017), linear-fge (Garipov et al., 2018), and linear-1 cyclic learning rate schedules.

Cosine. The cyclic cosine learning rate schedule reduces the higher learning rate of 0.01 to a lower learning rate of 0.001 using the shifted cosine function (Huang et al., 2017) in each exploitation phase.

Linear-fge. In the cyclic linear-fge learning rate schedule, we first drop the high learning rate of 0.1 used in the exploration phase to 0.01 linearly over $t_{ex}/2$ epochs and then further drop the learning rate to 0.001 linearly over the remaining $t_{ex}/2$ epochs during the first exploitation phase. Afterwards, in each exploitation phase, we linearly increase the learning rate from 0.001 to 0.01 for $t_{ex}/2$ epochs and then linearly decrease it back to 0.001 for the next $t_{ex}/2$ epochs, similar to Garipov et al. (2018).

Linear-1. In the linear-1 cyclic learning rate schedule, we linearly decrease the learning rate from 0.01 to 0.001 over $t_{ex}$ epochs in each exploitation phase and then suddenly increase the learning rate back to 0.01 after each sequential perturbation step.

In Figure F.1, we present the plots of the cyclic learning rate schedules considered in this Appendix; a short code sketch of these schedules is also given before Table F.1.

Figure F.1 Cyclic learning rate schedules. The red dots represent the converged models after each exploitation phase used in our final sequential ensemble.

In Table F.1, we present the results for our three sequential ensemble methods under the four cyclic learning rate schedules mentioned above. We observe that, in all three sequential ensembles, the cyclic stepwise learning rate schedule yields the best performance on almost all criteria compared to the rest of the learning rate schedules. In Table F.2, we present the prediction disagreement and KL divergence metrics for the experiments described in this Appendix. We observe that, in the SeBayS-No Freeze ensemble, the cyclic stepwise schedule generates highly diverse subnetworks, which also leads to high predictive performance. In contrast, in the BNN sequential and SeBayS-Freeze ensembles, we observe lower diversity metrics for the cyclic stepwise learning rate schedule compared to the rest of the learning rate schedules.
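The following is an illustrative reading of the four exploitation-phase schedules described above; the exact endpoint conventions within a phase are assumptions, and the stepwise schedule is the one sketched in Appendix A.1.

```python
# Illustrative sketches of the cyclic exploitation-phase learning-rate schedules
# compared in this Appendix (assumed endpoint conventions; not the training code).
import math

def cosine_lr(phase_epoch, tex=100, lr_high=0.01, lr_low=0.001):
    # Shifted-cosine decay from lr_high to lr_low within one exploitation phase.
    return lr_low + 0.5 * (lr_high - lr_low) * (1 + math.cos(math.pi * phase_epoch / tex))

def linear_1_lr(phase_epoch, tex=100, lr_high=0.01, lr_low=0.001):
    # Linear decay from lr_high to lr_low within one phase; resets after each perturbation.
    return lr_high - (lr_high - lr_low) * phase_epoch / tex

def linear_fge_lr(phase, phase_epoch, tex=100):
    # First phase: 0.1 -> 0.01 over tex/2 epochs, then 0.01 -> 0.001 over the rest.
    # Later phases: 0.001 -> 0.01 over tex/2 epochs, then back down to 0.001.
    half = tex / 2
    if phase == 0:
        return (0.1 - (0.1 - 0.01) * phase_epoch / half if phase_epoch < half
                else 0.01 - (0.01 - 0.001) * (phase_epoch - half) / half)
    return (0.001 + (0.01 - 0.001) * phase_epoch / half if phase_epoch < half
            else 0.01 - (0.01 - 0.001) * (phase_epoch - half) / half)
```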
Table F.1 Cyclic learning rate schedules results in ResNet-32/CIFAR10 experiment. We mark the best results out of different learning rate (LR) schedules under a given method in bold. Ensemble size is fixed at M = 3.

Methods | LR Schedule | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓)
SeBayS-Freeze Ensemble | stepwise | 92.5 | 0.273 | 70.4 | 1.344
SeBayS-Freeze Ensemble | cosine | 92.3 | 0.301 | 69.8 | 1.462
SeBayS-Freeze Ensemble | linear-fge | 92.5 | 0.270 | 70.1 | 1.363
SeBayS-Freeze Ensemble | linear-1 | 92.1 | 0.310 | 69.8 | 1.454
SeBayS-No Freeze Ensemble | stepwise | 92.4 | 0.274 | 69.8 | 1.356
SeBayS-No Freeze Ensemble | cosine | 92.2 | 0.294 | 69.9 | 1.403
SeBayS-No Freeze Ensemble | linear-fge | 92.4 | 0.276 | 70.0 | 1.379
SeBayS-No Freeze Ensemble | linear-1 | 92.2 | 0.296 | 69.7 | 1.412
BNN Sequential Ensemble | stepwise | 93.8 | 0.265 | 73.3 | 1.341
BNN Sequential Ensemble | cosine | 93.7 | 0.279 | 72.7 | 1.440
BNN Sequential Ensemble | linear-fge | 93.5 | 0.270 | 73.1 | 1.342
BNN Sequential Ensemble | linear-1 | 93.4 | 0.287 | 72.2 | 1.430

Table F.2 Diversity metrics for models trained with different cyclic learning rate schedules in ResNet-32/CIFAR10 experiment. We mark the best results out of different learning rate (LR) schedules under a given method in bold. Ensemble size is fixed at M = 3.

Methods | LR Schedule | ddis (↑) | dKL (↑) | Acc (↑)
BNN Sequential Ensemble | stepwise | 0.061 | 0.201 | 93.8
BNN Sequential Ensemble | cosine | 0.068 | 0.256 | 93.7
BNN Sequential Ensemble | linear-fge | 0.070 | 0.249 | 93.5
BNN Sequential Ensemble | linear-1 | 0.071 | 0.275 | 93.4
SeBayS-Freeze Ensemble | stepwise | 0.060 | 0.138 | 92.5
SeBayS-Freeze Ensemble | cosine | 0.072 | 0.204 | 92.3
SeBayS-Freeze Ensemble | linear-fge | 0.076 | 0.215 | 92.5
SeBayS-Freeze Ensemble | linear-1 | 0.074 | 0.209 | 92.1
SeBayS-No Freeze Ensemble | stepwise | 0.106 | 0.346 | 92.4
SeBayS-No Freeze Ensemble | cosine | 0.078 | 0.222 | 92.2
SeBayS-No Freeze Ensemble | linear-fge | 0.074 | 0.199 | 92.4
SeBayS-No Freeze Ensemble | linear-1 | 0.077 | 0.217 | 92.2

CHAPTER 6 EPILOGUE

6.1 Summary

This dissertation focuses on the development of novel, theoretically consistent Bayesian neural network (BNN) models for a wide range of data scenarios. We have proposed the Bayesian quantile regression neural network (BQRNN) in Chapter 2, the spike-and-slab Gaussian node selection technique (SS-IG) in Chapter 3, and the spike-and-slab group lasso (SS-GL) and spike-and-slab group horseshoe (SS-GHS) in Chapter 4. For both the BQRNN and SS-IG methods, we provide rigorous theoretical justification via posterior consistency results and the optimal contraction rate. We also provide numerical evidence establishing the advantage of our proposed methods compared to recent competing techniques in the literature. In Chapter 5, we propose sequential Bayesian neural subnetwork ensembles (SeBayS), which use SS-IG models as the base models in the ensemble. We concluded that chapter with several experiments showcasing the effectiveness of our proposed approach, as well as studies exploring the effect of changing some of the model parameters.

6.2 Broader Impacts

Our BQRNN approach is particularly useful when the relationships in the lower and upper tail areas of the response variable distribution are of greater interest, such as in modeling extreme weather events, cascading failures in electric power grids, and other rare events. On the other hand, as deep learning has been harnessed by large industrial corporations in recent years to improve their products, the demand for models with both high predictive performance and reliable uncertainty estimation is rising. Its applications range from computer vision and pattern recognition to natural language processing.
However, as deep learning models are pushed into smaller and smaller embedded devices, such as smart cameras recognizing visitors at your front door, the design of resource-efficient neural networks is of great practical importance. These real-world applications demand real-time, on-device neural network inference. Our work on sparse BNNs addresses this computational bottleneck by compressing neural networks through sparsity induced during training. The Bayesian framework estimates the posterior of the model parameters, allowing for uncertainty quantification around the parameter estimates, which can be vital in medical diagnostics. For example, brain imaging data could be processed through our model to yield a decision on a certain medical condition with the added benefit of a quantified confidence associated with that decision.

6.3 Future Research

Bayesian neural networks still leave considerable room for investigation. Several promising research directions stem from our current work.

• The development of sparse deep Bayesian quantile networks, which can allow for extreme quantile inference with fewer data points. Such a model can be beneficial in cases where the event of interest is rarely manifested in the given data.

• Bayesian convolutional neural network approximation theory, consisting of derivations of posterior consistency, variable selection consistency, and asymptotically optimal generalization bounds.

• A theoretical framework for Bayesian ensembling, including not only posterior consistency but also nonasymptotic generalization error upper bounds. Such a bound might depend on the data size as well as the explicit number of base models in an ensemble. This theoretical development will also help in deciding the optimal number of base models in a given ensemble.

• Development of sparse Bayesian tensor-to-tensor convolutional neural networks involving structured sparsity learning, which would potentially benefit ill-posed as well as well-posed problems arising, for instance, in tomographic reconstruction.

BIBLIOGRAPHY

Alhamzawi, R. (2018). Brq: R package for bayesian quantile regression. https://cran.r-project.org/web/packages/Brq/Brq.pdf. Online; accessed 15 May 2020.

Alvarez, J. M. and Salzmann, M. (2016). Learning the number of neurons in deep networks. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS 2016).

Andrews, D. F. and Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society – Series B, 36(1):99–102.

Arangio, S. and Bontempi, F. (2015). Structural health monitoring of a cable-stayed bridge with bayesian neural networks. Structure and Infrastructure Engineering, 11(4):575–587.

Bai, J., Song, Q., and Cheng, G. (2020). Efficient variational inference for sparse deep learning with theoretical guarantee. In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS 2020).

Barndorff-Nielsen, O. and Shephard, N. (2001). Non-Gaussian OU based models and some of their uses in financial economics. Journal of the Royal Statistical Society – Series B, 63:167–241.

Barron, A., Schervish, M. J., and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics, 10:536–561.

Bateni, S. M., Jeng, D.-S., and Melville, B. W. (2007). Bayesian neural networks for prediction of equilibrium and time-dependent scour depth around bridge piers.
Advances in Engineering Software, 38(2):102–111. Beker, W., Wolos, A., Szymkuć, S., and Grzybowski, B. (2020). Minimal-uncertainty pre- diction of general drug-likeness based on bayesian neural networks. Nature Machine Intelligence, 2:457–465. Benoit, D. F., Alhamzawi, R., Yu, K., and den Poel, D. V. (2017). R package ‘bayesqr’. https://cran.r-project.org/web/packages/bayesQR/bayesQR.pdf. Online; accessed 15 May 2020. Betancourt, M. and Girolami, M. (2015). Hamiltonian monte carlo for hierarchical models. Current trends in Bayesian methodology with applications, 79(30). Bhadra, A., Datta, J., Polson, N., and Willard, B. (2019). Lasso meets horseshoe: A survey. Statistical Science, 34(3):405–427. 156 Bhattacharya, S. and Maiti, T. (2021). Statistical foundation of variational bayes neural networks. Neural Networks, 137:151–173. Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877. Blei, D. M. and Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1):17–35. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural network. In Proceedings of Machine Learning Research, volume 37, pages 1613– 1622. PMLR. Buntine, W. L. and Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5:603–643. Cannon, A. J. (2011). R package ‘qrnn’ for quantile regression neural network. https://cran.r- project.org/web/packages/qrnn/qrnn.pdf. Online; accessed 15 May 2020. Cannon, A. J. (2018). Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stoch Environ Res Risk Assess, 32:3207–3225. Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480. Chen, C. (2007). A finite smoothing algorithm for quantile regression. Journal of Computational and Graphical Statistics, 16:136–164. Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2018). Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136. Chérief-Abdellatif, B.-E. (2020). Convergence rates of variational inference in sparse deep learning. In Proceedings of the 37th International Conference on Machine Learning (ICML-2020). Chérief-Abdellatif, B.-E. and Alquier, P. (2018). Consistency of variational bayes infer- ence for estimation and model selection in mixtures. Electronic Journal of Statistics, 12(2):2995–3035. Cobb, A. D. et al. (2019). An ensemble of bayesian neural networks for exoplanetary atmo- spheric retrieval. The Astronomical Journal, 158(1). Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics 157 of Controls, Signals, and Systems, 2:303–314. D’Angelo, F. and Fortuin, V. (2021). Repulsive deep ensembles are bayesian. In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2021). Dantzig, G. B. (1963). Linear Programming and Extensions. Princeton University Press, Princeton. De Freitas, N., Andrieu, C., Højen-Sørensen, P., Niranjan, M., and Gee, A. (2001). Sequential monte carlo methods for neural networks. In Sequential Monte Carlo Methods in Practice, pages 359—-379. Springer, New York. Dua, D. and Graff, C. (2017). UCI machine learning repository. 
Dusenberry, M., Jerfel, G., Wen, Y., Ma, Y., Snoek, J., Heller, K., Lakshminarayanan, B., and Tran, D. (2020). Efficient and scalable bayesian neural nets with rank-1 factors. In Proceedings of the 37th International Conference on Machine Learning (ICML-2020). Egele, R., Maulik, R., Raghavan, K., Balaprakash, P., and Lusch, B. (2021). Autodeuq: Au- tomated deep ensemble with uncertainty quantification. arXiv preprint arXiv:2110.13511. Elfwing, S., Uchibe, E., and Doya, K. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11. Special issue on deep reinforcement learning. Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, train- able neural networks. In 7th International Conference on Learning Representations (ICLR-2019). Friedman, J., Hastie, T., and Tibshirani, R. (2009). The elements of statistical learning. Springer series in statistics. Springer, New York. Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183–192. Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-2016. Gale, T., Elsen, E., and Hooker, S. (2019). The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. In Proceedings of the 31st Advances in Neural Information Processing Systems (NeurIPS-2018). 158 Gelfand, E. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. CRC Press, Third edition. Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472. Geman, S. and Geman, D. (1984). Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741. Ghosal, S. and Van Der Vaart, A. W. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223. Ghosh, M., Ghosh, A., Chen, M. H., and Agresti, A. (2000). Noninformative priors for one-parameter item models. Journal of Statistical Planning and Inference, 88:99–115. Ghosh, S., Yao, J., and Doshi-Velez, F. (2019). Model selection in bayesian neural networks via horseshoe priors. Journal of Machine Learning Research, 20:1–46. Grenander, U. (1981). Abstract Inference. Wiley. Guo, Y., Yao, A., and Chen, Y. (2016). Dynamic network surgery for efficient dnns. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS 2016). Han, S., Mao, H., and Dally, W. J. (2016). Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations (ICLR-2016). Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In Proceedings of the 28th Advances in Neural Information Processing Systems (NIPS 2015). Hansen, L. and Salamon, P. (1990). Neural network ensembles. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001. Hassibi, B. and Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems. Hastings, W. K. (1970). Monte carlo sampling methods using markov chains and their applications. Biometrika, 57:97–109. Havasi, M., Jenatton, R., Fort, S., Liu, J. Z., Snoek, J., Lakshminarayanan, B., Dai, A. M., 159 and Tran, D. (2021). Training independent subnetworks for robust prediction. In 9th International Conference on Learning Representations (ICLR-2021). He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In IEEE International Conference on Computer Vision (ICCV-2015), pages 1026–1034. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2016), pages 770–778. Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations (ICLR-2019). Hernandez-Lobato, J. M. and Adams, R. (2015). Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-2015). Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational Learning Theory (COLT-1993), pages 5–13. Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., and Peste, A. (2021). Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124. Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free. In 5th International Conference on Learning Representations (ICLR-2017). Ingraham, J. B. and Marks, D. S. (2017). Variational inference for sparse and undi- rected models. In Proceedings of the 34th International Conference on Machine Learning (ICML-2017). Izmailov, P., Vikram, S., Hoffman, M. D., and Wilson, A. G. G. (2021). What are bayesian neural network posteriors really like? In Proceedings of the 38th International Conference on Machine Learning (ICML-2021). Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations (ICLR-2017). 160 Jantre, S., Bhattacharya, S., and Maiti, T. (2021a). Layer adaptive node selection in bayesian neural networks: Statistical guarantees and implementation details. arXiv preprint arXiv:2108.11000. Jantre, S., Bhattacharya, S., and Maiti, T. (2021b). Quantile regression neural networks: a bayesian approach. Journal of Statistical Theory and Practice, 15(3):1–34. Jantre, S., Madireddy, S., Bhattacharya, S., Maiti, T., and Balaprakash, P. (2022). Sequential bayesian neural subnetwork ensembles. arXiv preprint arXiv:2206.00794. Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On using very large target vocab- ulary for neural machine translation. 
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Sau, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183–233. Karmarkar, N. (1984). A new polynomial time algorithm for linear programming. Combinatorica, 4(4):373–395. Kendall, A. and Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS-2017). Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR-2015). Kingma, D. and Welling, M. (2014). Auto-encoding variational bayes. In 2nd International Conference on Learning Representations (ICLR-2014). Koenker, R. (2005). Quantile Regression. Cambridge University Press, Cambridge, First edition. Koenker, R. (2017). R package ‘quantreg’ for quantile regression. https://cran.r- project.org/web/packages/quantreg/quantreg.pdf. Online; accessed 15 May 2020. Koenker, R. and Basset, G. (1978). Regression quantiles. Econometrica, 46:33–50. Koenker, R. and Machado, J. (1999). Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association, 94:1296–1309. Kottas, A. and Gelfand, A. E. (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association, 96:1458–1468. 161 Kozumi, H. and Kobayashi, G. (2011). Gibbs sampling methods for bayesian quantile re- gression. Journal of Statistical Computation and Simulation, 81:1565–1578. Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. M.Sc. Thesis. Kwon, Y., Won, J.-H., Kim, B. J., and Paik, M. C. (2020). Uncertainty quantification using bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics and Data Analysis, 142. Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS 2017). LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521:436–444. LeCun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal brain damage. In Proceedings of the 3rd Advances in Neural Information Processing Systems (NIPS-1990). Lee, H. K. H. (2000). Consistency of posterior distributions for neural networks. Neural Networks, 13:629–642. Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep neural networks as gaussian processes. In 6th International Conference on Learning Representations (ICLR-2018). Liu, S., Chen, T., Atashgahi, Z., Chen, X., Sokar, G., Mocanu, E., Pechenizkiy, M., Wang, Z., and Mocanu, D. C. (2022). Deep ensembling with no overhead for either training or testing: The all-round blessings of dynamic sparsity. In 10th International Conference on Learning Representations (ICLR-2022). Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. (2017). Learning efficient con- volutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Louizos, C., Ullrich, K., and Welling, M. (2017). Bayesian compression for deep learning. 
In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS 2017). Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through l0 regularization. In 6th International Conference on Learning Representations (ICLR-2018). Lu, L., Shin, Y., Su, Y., and Em Karniadakis, G. (2020). Dying relu and initialization: Theory and numerical examples. Communications in Computational Physics, 28(5):1671– 1706. 162 Luo, J.-H., Wu, J., and Lin, W. (2017). Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Mackay, D. J. C. (1992). A practical bayesian framework for backpropagation networks. Neural Computation, 4:448–472. Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations (ICLR-2017). Madsen, K. and Nielsen, H. B. (1993). A finite smoothing algorithm for linear l1 estimation. SIAM Journal of Optimization, 3:223–235. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032. Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. (2018). Evolutionary training of sparse artificial neural networks: A network science perspective. Nature Communication, 9(2383). Moghimi, M., Belongie, S. J., Saberian, M. J., Yang, J., Vasconcelos, N., and Li, L.-J. (2016). Boosted convolutional neural networks. In BMVC, volume 5, page 6. Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML-2017). Mozer, M. C. and Smolensky, P. (1988). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Proceedings of the 1st Advances in Neural Information Processing Systems (NIPS-1988). Murray, K. and Chiang, D. (2015). Auto-sizing neural networks: With applications to n- gram language models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 908–916. Nado, Z. et al. (2021). Uncertainty Baselines: Benchmarks for uncertainty & robustness in deep learning. In Bayesian Deep Learning workshop, NeurIPS-2021. Neal, R. (1992). Bayesian learning via stochastic dynamics. In Proceedings of the 5th Advances in Neural Information Processing Systems (NIPS-1992). 163 Neal, R. M. (1996). Bayesian Learning for Neural Networks. New York: Springer Verlag. Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. P. (2017). Structured bayesian pruning via log-normal multiplicative noise. In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS-2020). Ochiai, T., Matsuda, S., Watanabe, H., and Katagiri, S. (2017). Automatic node selection for deep neural networks using group lasso regularization. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Dillon, J. V., Lakshminarayanan, B., and Snoek, J. (2019). Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. 
In Proceedings of the 32th Advances in Neural Information Processing Systems (NeurIPS-2019). Papamarkou, T., Hinkle, J., Young, M. T., and Womble, D. (2022). Challenges in markov chain monte carlo for bayesian neural networks. Statistical Science, 37(3):425–442. Paszke, A. et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 32nd Advances in Neural Information Processing Systems (NeurIPS-2019). Pati, D., Bhattacharya, A., and Yang, Y. (2018). On statistical optimality of variational bayes. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS-2018). Perreault Levasseur, L., Hezaveh, Y. D., and Wechsler, R. H. (2017). Uncertainties in parameters estimated with neural networks: Application to strong gravitational lensing. The Astrophysical Journal, 850(1). Piironen, J. and Vehtari, A. (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS-2017). Pollard, D. (1991). Bracketing methods in statistics and econometrics. In Nonparametric and semiparametric methods in econometrics and statistics: Proceedings of the Fifth International Symposium in Econometric Theory and Econometrics, pages 337–355. Cam- bridge, UK: Cambridge University Press. Polson, N. and Ročková, V. (2018). Posterior concentration for sparse deep learn- ing. In Proceedings of the 31st Advances in Neural Information Processing Systems (NeurIPS-2018). Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941. 164 Rosenblatt, F. (1957). The Perceptron - a perceiving and recognizing automaton (project para). Cornell Aeronautical Laboratory. Rosenblatt, F. (1961). Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical Report, Cornell Aeronautical Laboratory. Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 48(4):1875–1897. Schwartz, L. (1965). On bayes procedures. Z. Wahrsch. Verw. Gebiete, 4:10–26. Sennrich, R., Haddow, B., and Birch, A. (2016). Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. Singh, S., Hoiem, D., and Forsyth, D. (2016). Swapout: Learning an ensemble of deep architectures. Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS-2016). Sriram, K., Ramamoorthi, R. V., and Ghosh, P. (2013). Posterior consistency of bayesian quantile regression based on the misspecified asymmetric laplace density. Bayesian Analysis, 8(2):479–504. Sun, Y., Song, Q., and Liang, F. (2021). Consistent sparse deep learning: Theory and computation. Journal of the American Statistical Association, pages 1–15. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initializa- tion and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-2013). Swann, A. and Allinson, N. (1998). Fast committee learning: Preliminary results. Electronics Letters, 34(14):1408–1410. Taylor, J. W. (2000). A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299–311. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. 
Journal of the Royal Statistical Society – Series B, 58:267–288. Titterington, D. M. (2004). Bayesian methods for neural networks and related models. Statistical Science, 19:128–139. Van Der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605. 165 Van Der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes. Springer, First edition. Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, Fourth edition. Walker, S. G. and Mallick, B. K. (1999). A bayesian semiparametric accelerated failure time model. Biometrics, 55:477–483. Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-2013). Wasserman, L. (1998). Asymptotic properties of nonparametric bayesian procedures. In Practical nonparametric and semiparametric Bayesian statistics, pages 293–304. New York: Springer. Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. (2016). Learning structured sparsity in deep neural networks. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS-2016). Wen, Y., Tran, D., and Ba, J. (2020). Batchensemble: an alternative approach to efficient ensemble and lifelong learning. 8th International Conference on Learning Representations (ICLR-2020). Wenzel, F., Snoek, J., Tran, D., and Jenatton, R. (2020). Hyperparameter ensembles for robustness and uncertainty quantification. Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS-2020). Wilson, A. G. and Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS-2020). Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and conver- gence rates of sieve mles. Annals of Statistics, 23:339–362. Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Xie, J., Xu, B., and Chuang, Z. (2013). Horizontal and vertical ensemble with deep repre- sentation for classification. arXiv preprint arXiv:1306.2759. Xu, Q., Deng, K., Jiang, C., Sun, F., and Huang, X. (2017). Composite quantile regression neural network with applications. Expert Systems with Applications, 76:129–139. 166 Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936. Yeh, I.-C. (1998). Modeling of strength of high performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808. Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression. Statistics and Probability Letters, 54(4):437–447. Yu, K. and Zhang, J. (2005). A three-parameter asymmetric laplace distribution and its extensions. Communications in Statistics- Theory and Methods, 34:1867–1879. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations (ICLR-2017). Zhang, F. and Gao, C. (2020). Convergence rates of variational posterior distributions. The Annals of Statistics, 48(4):2180–2207. Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., and Tian, Q. (2019). Variational convolu- tional neural network pruning. 
In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2020). Random erasing data augmen- tation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7):13001–13008. Zhu, M. and Gupta, S. (2018). To prune, or not to prune: Exploring the efficacy of pruning for model compression. In 6th International Conference on Learning Representations (ICLR-2018), Workshop Track Proceedings. 167