CONSISTENT BAYESIAN LEARNING FOR NEURAL NETWORK MODELS: THEORY AND COMPUTATION

By

Sanket Rajendra Jantre

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2022

ABSTRACT

CONSISTENT BAYESIAN LEARNING FOR NEURAL NETWORK MODELS: THEORY AND COMPUTATION

By

Sanket Rajendra Jantre

The Bayesian framework adapted for neural network learning, Bayesian neural networks, has received widespread attention and has been successfully applied to various applications. Bayesian inference for neural networks promises improved predictions with reliable uncertainty estimates, robustness, principled model comparison, and decision-making under uncertainty. In this dissertation, we propose novel theoretically consistent Bayesian neural network models and provide their computationally efficient posterior inference algorithms.

In Chapter 2, we introduce a Bayesian quantile regression neural network assuming an asymmetric Laplace distribution for the response variable. The normal-exponential mixture representation of the asymmetric Laplace density is utilized to derive a Gibbs sampling algorithm coupled with Metropolis-Hastings steps for the posterior inference. We establish the posterior consistency under a misspecified asymmetric Laplace density model. We illustrate the proposed method with simulation studies and real data examples.

Traditional Bayesian learning methods are limited in their scalability to large data and feature spaces due to expensive inference approaches; however, recent developments in variational inference techniques and sparse learning have brought renewed interest to this area. Sparse deep neural networks have proven to be efficient for predictive model building in large-scale studies. Although several works have studied theoretical and numerical properties of sparse neural architectures, they have primarily focused on edge selection. In Chapter 3, we propose a sparse Bayesian technique using a spike-and-slab Gaussian prior to allow for automatic node selection. The spike-and-slab prior alleviates the need for an ad-hoc thresholding rule for pruning. In addition, we adopt a variational Bayes approach to circumvent the computational challenges of traditional Markov chain Monte Carlo implementation. In the context of node selection, we establish the variational posterior consistency together with the layer-wise characterization of prior inclusion probabilities. We empirically demonstrate that our proposed approach outperforms the edge selection method in computational complexity with similar or better predictive performance.

The structured sparsity (e.g., node sparsity) in deep neural networks provides low latency inference, higher data throughput, and reduced energy consumption. Alternatively, there is a vast and growing literature demonstrating the shrinkage efficiency and theoretical optimality in linear models of two sparse parameter estimation techniques: lasso and horseshoe. In Chapter 4, we propose structurally sparse Bayesian neural networks which systematically prune excessive nodes with (i) Spike-and-Slab Group Lasso, and (ii) Spike-and-Slab Group Horseshoe priors, and develop computationally tractable variational inference. We demonstrate the competitive performance of our proposed models compared to the Bayesian baseline models in prediction accuracy, model compression, and inference latency.
Deep neural network ensembles that appeal to model diversity have been used successfully to improve predictive performance and model robustness in several applications. However, most ensembling techniques require multiple parallel and costly evaluations and have been proposed primarily for deterministic models. In Chapter 5, we propose sequential ensembling of dynamic Bayesian neural subnetworks to generate a diverse ensemble in a single forward pass. The ensembling strategy consists of an exploration phase that finds high-performing regions of the parameter space and multiple exploitation phases that effectively exploit the compactness of the sparse model to quickly converge to different minima in the energy landscape corresponding to high-performing subnetworks, yielding diverse ensembles. We empirically demonstrate that our proposed approach surpasses the baselines of the dense frequentist and Bayesian ensemble models in prediction accuracy, uncertainty estimation, and out-of-distribution robustness. Furthermore, we found that our approach produced the most diverse ensembles compared to the approaches with a single forward pass and even compared to the approaches with multiple forward passes in some cases.

To Dr. Rahman for introducing me to the Bayesian way

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank the people who have extended their support and assistance throughout my Ph.D. journey. First, I would like to express my deepest gratitude to my advisors, Dr. Tapabrata Maiti and Dr. Shrijita Bhattacharya, for their support, guidance, and encouragement. Dr. Maiti’s exemplary commitment to research will continue to serve as a guide in my own academic pursuits. Thank you, Dr. Bhattacharya, for consistently providing me with hands-on help in my research.

I would also like to extend my sincere appreciation to the rest of my dissertation committee members, Dr. Yuehua Cui and Dr. Andrew Finley, for providing a careful review of my dissertation.

I would like to extend many thanks to my research collaborators, Dr. Sandeep Madireddy, Dr. Zichao Di, and Dr. Prasanna Balaprakash from Argonne National Laboratory for providing me the opportunity to work on developing practical models useful for scientific problems and allowing me to participate in exciting interdisciplinary projects at Argonne. Moreover, I want to thank Dr. Sandeep Madireddy and Dr. Prasanna Balaprakash for funding my research throughout the final year of my Ph.D. My sincere thanks to the Michigan State University Graduate School for awarding me the Dissertation Completion Fellowship during Summer 2022.

Lastly, I am grateful to my family. Thanks to my mother for her unequivocal support during all my studies. Thanks to my sisters, Smita and Sanchita, and brother-in-law, Sampanna, for always cheering me on throughout my dissertation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
  1.1 Neural Networks
    1.1.1 Feedforward Neural Networks
    1.1.2 Bayesian Neural Networks
  1.2 Markov Chain Monte Carlo Sampling
    1.2.1 Metropolis-Hastings Algorithm
    1.2.2 Gibbs Sampling Algorithm
  1.3 Variational Bayesian Inference
  1.4 Posterior Consistency
  1.5 Dissertation Outline
CHAPTER 2 BAYESIAN QUANTILE REGRESSION NEURAL NETWORKS
  2.1 Introduction
  2.2 Bayesian Quantile Regression
  2.3 Bayesian Quantile Regression Neural Networks
    2.3.1 Model
    2.3.2 Algorithm
  2.4 Theoretical Results
  2.5 Numerical Experiments
    2.5.1 Simulation Studies
    2.5.2 Real Data Examples
  2.6 Conclusion and Discussion
  APPENDICES
    APPENDIX A LEMMAS FOR POSTERIOR CONSISTENCY PROOF
    APPENDIX B POSTERIOR CONSISTENCY THEOREM PROOFS
CHAPTER 3 LAYER ADAPTIVE NODE SELECTION IN BAYESIAN NEURAL NETWORKS
  3.1 Introduction
  3.2 Nonparametric Modeling: Deep Learning Approach
  3.3 Spike-and-Slab Independent Gaussian Node Selection
    3.3.1 Model
    3.3.2 Algorithm
  3.4 Theoretical Results
  3.5 Numerical Experiments
    3.5.1 Simulation Study - I
    3.5.2 Simulation Study - II
    3.5.3 UCI Regression Datasets
    3.5.4 Image Classification Datasets
  3.6 Conclusion and Discussion
  APPENDICES
    APPENDIX A PROOFS OF SS-IG THEORETICAL RESULTS
    APPENDIX B ADDITIONAL NUMERICAL EXPERIMENTS DETAILS
CHAPTER 4 COMPACT BAYESIAN NEURAL NETWORKS WITH STRUCTURED SPARSITY
  4.1 Introduction
    4.1.1 Proposed Methods
  4.2 Structured Sparsity: Spike-and-Slab Hierarchical Priors
    4.2.1 Spike-and-Slab Group Lasso (SS-GL)
    4.2.2 Spike-and-Slab Group Horseshoe (SS-GHS)
    4.2.3 Algorithm and Computational Details
  4.3 Numerical Experiments
    4.3.1 MLP MNIST Classification
    4.3.2 LeNet-5-Caffe Experiments
    4.3.3 Residual Network Experiments
  4.4 Conclusion and Discussion
  APPENDIX
CHAPTER 5 SEQUENTIAL BAYESIAN NEURAL SUBNETWORK ENSEMBLES
  5.1 Introduction
    5.1.1 Related Work
  5.2 Sequential Bayesian Neural Subnetwork Ensembles
    5.2.1 Base Learner
    5.2.2 Sequential Ensembling and Bayesian Neural Subnetworks
  5.3 Numerical Experiments
  5.4 Sequential BNN Ensemble Analysis
    5.4.1 Function Space Analysis
    5.4.2 Dynamic Sparsity Learning
    5.4.3 Effect of Ensemble Size
  5.5 Conclusion and Discussion
  APPENDICES
    APPENDIX A REPRODUCIBILITY CONSIDERATIONS
    APPENDIX B OUT-OF-DISTRIBUTION EXPERIMENT RESULTS
    APPENDIX C EFFECT OF THE ENSEMBLE SIZE
    APPENDIX D EFFECT OF THE MONTE CARLO SAMPLE SIZE
    APPENDIX E EFFECT OF THE PERTURBATION FACTOR
    APPENDIX F EFFECT OF THE CYCLIC LEARNING RATE SCHEDULE
CHAPTER 6 EPILOGUE
  6.1 Summary
  6.2 Broader Impacts
  6.3 Future Research
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Simulation study I results
Table 2.2 Simulation study II results
Table 2.3 Real data applications results
Table 3.1 Simulation study II results
Table 3.2 UCI regression datasets results
Table 4.1 ResNet-20/CIFAR-10 and ResNet-32/CIFAR-10 experiments results
Table 5.1 ResNet-32/CIFAR10 experiment results
Table 5.2 ResNet-56/CIFAR100 experiment results
Table 5.3 Diversity metrics in ResNet-32/CIFAR-10 and ResNet-56/CIFAR100 experiments
Table B.1 OoD detection results in ResNet-32/CIFAR10 experiment
Table C.1 Ensemble size effect results in ResNet-32/CIFAR10 experiment
Table D.1 Monte Carlo sample size effect results in ResNet-32/CIFAR10 experiment
Table E.1 Perturbation factor effect results in ResNet-32/CIFAR10 experiment
Table E.2 Diversity metrics for models trained with different perturbation factors in ResNet-32/CIFAR-10 experiment
Table F.1 Cyclic learning rate schedules results in ResNet-32/CIFAR10 experiment
Table F.2 Diversity metrics for models trained with different cyclic learning rate schedules in ResNet-32/CIFAR10 experiment

LIST OF FIGURES

Figure 1.1 Single-layer neural network
Figure 1.2 ReLU and SiLU activations
Figure 2.1 Quantile loss function
Figure 3.1 Sparse neural network with node selection
Figure 3.2 Simulation study I results
Figure 3.3 MLP/MNIST and MLP/Fashion-MNIST experiments results
Figure 3.4 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results
Figure B.1 Simulation study I: additional experiment results
Figure B.2 MNIST experiment results for varying hidden layer widths
Figure 4.1 MNIST experiment results: motivation for group shrinkage priors over Gaussian prior
Figure 4.2 SS-GL penalty parameter choice experiment results
Figure 4.3 SS-GHS regularization constant choice experiment results
Figure 4.4 MLP/MNIST experiment results
Figure 4.5 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results
Figure 5.1 Training trajectories of base learners in ResNet32/CIFAR10 experiment
Figure 5.2 Dynamic sparsity and FLOPs curves
Figure 5.3 Predictive performance results of the base learners and the sequential ensembles as the ensemble size varies in ResNet32/CIFAR10 experiment
Figure F.1 Cyclic learning rate schedules

LIST OF ALGORITHMS

Algorithm 3.1 Variational inference in SS-IG Bayesian neural networks
Algorithm 4.1 Variational inference in SS-GL and SS-GHS Bayesian neural networks
Algorithm 5.1 Sequential Bayesian neural subnetwork ensemble (SeBayS) algorithm

CHAPTER 1

INTRODUCTION

Artificial neural networks (ANN) are biologically inspired predictive models involving computations and mathematics, which simulate the human-brain processes. Many of the recent successes of artificial intelligence, such as image and voice recognition and robotics, are powered by ANNs. However, ANNs still suffer from many fundamental issues from the perspective of statistical modeling. One of the major challenges is their ability to model the uncertainty, and hence build reliable and robust models, while capturing complex data dependencies and being computationally tractable.
Probabilistic approaches, and especially the systematic Bayesian framework, provide an exciting avenue to address this challenge. In this dissertation, we propose novel Bayesian neural network models which are theoretically consistent, along with their computationally efficient implementations for model inference.

In this chapter, we briefly introduce the main concepts which form a fundamental part of this dissertation: neural networks and their Bayesian counterpart (Section 1.1), Markov chain Monte Carlo sampling methods (Section 1.2), variational Bayesian inference (Section 1.3), and posterior consistency preliminaries (Section 1.4). In addition, we discuss some existing work that has had a significant impact on this field. Finally, we provide a brief outline of the rest of the chapters in Section 1.5.

1.1 Neural Networks

1.1.1 Feedforward Neural Networks

A feedforward neural network can approximate any continuous function f(.) : R^p → R arbitrarily well. A neural network tries to simulate the human brain, so it has many layers of "neurons" just like the neurons in our brain. Originating from the multi-layer perceptron (MLP) (Rosenblatt, 1961), which has multiple hidden layers compared to its single hidden layer counterpart, the perceptron (Rosenblatt, 1957), neural networks are getting deeper to be more accurate in approximating a continuous function.

Figure 1.1 Single-layer neural network (inputs I_1, ..., I_p, hidden nodes H_1, ..., H_k, and output O).

In Figure 1.1 we illustrate a single hidden layer neural network, or shallow neural network. The input to the hidden layer consists of a linear combination of the model inputs passed through a non-linear activation function. Let us consider the input vector x ∈ R^p. Then the output of the shallow neural network in Figure 1.1 is given by
$$\eta(x) = \beta_0 + \sum_{j=1}^{k} \beta_j \times \psi\Big(\gamma_{j0} + \sum_{h=1}^{p} \gamma_{jh} x_h\Big)$$
where γ_{jh} (j = 1, ..., k; h = 0, ..., p) are the input layer to hidden layer weights (including the intercept h = 0), β_j (j = 0, ..., k) are the hidden layer to output layer weights (including the intercept j = 0), and ψ(.) denotes the non-linear activation function.

The universal approximation theorem (Cybenko, 1989) states the approximation power of shallow neural networks.

Theorem 1.1.1 (Universal approximation theorem). Let ψ(.) be such that ψ(t) → 0 as t → −∞ and ψ(t) → 1 as t → ∞. Then for a continuous function f on [0, 1]^p and an arbitrary ϵ > 0, there exist k and parameters β_j (j = 0, ..., k) and γ_{jh} (j = 1, ..., k; h = 0, ..., p) such that
$$|f(x) - \eta(x)| < \epsilon, \quad \forall x \in [0, 1]^p.$$

The universal approximation capacity of neural networks, along with available computing power, explains the widespread use of deep learning nowadays. In this dissertation, we use the following non-linear activation functions.

• Sigmoid: ψ(x) = exp(x)/(1 + exp(x)).
• Rectified Linear Unit (ReLU): ψ(x) = x⁺ = max(x, 0).
• Sigmoid Linear Unit (SiLU or Swish): ψ(x) = x × Sigmoid(x).

ReLU is one of the most popular activation functions used in many deep neural architectures. However, it sometimes suffers from the dead ReLU problem, where ReLU neurons become inactive and only output 0 for any input (Lu et al., 2020). We encounter this problem ourselves in our spike-and-slab models, and there, instead of ReLU, we use the SiLU activation (Elfwing et al., 2018; Ramachandran et al., 2017), which unlike ReLU is smooth and nonmonotonic (Figure 1.2).

Figure 1.2 ReLU and SiLU activations.
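To make the shallow network η(x) and the activations above concrete, the following is a minimal NumPy sketch (not part of the dissertation's code base); the array shapes and the randomly drawn weights are purely illustrative assumptions.

```python
import numpy as np

# Activation functions listed above
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def silu(x):
    return x * sigmoid(x)

def shallow_net(x, beta0, beta, gamma0, gamma, psi=sigmoid):
    """Single-hidden-layer network: eta(x) = beta0 + sum_j beta_j * psi(gamma_j0 + gamma_j^T x).

    x      : (p,) input vector
    beta0  : scalar output bias
    beta   : (k,) hidden-to-output weights
    gamma0 : (k,) hidden-node biases
    gamma  : (k, p) input-to-hidden weights
    """
    hidden = psi(gamma0 + gamma @ x)   # (k,) hidden-node outputs
    return beta0 + beta @ hidden       # scalar network output

# Illustrative call with p = 3 inputs and k = 4 hidden nodes
rng = np.random.default_rng(0)
p, k = 3, 4
x = rng.uniform(size=p)
eta = shallow_net(x, beta0=0.1, beta=rng.normal(size=k),
                  gamma0=rng.normal(size=k), gamma=rng.normal(size=(k, p)))
print(eta)
```

Swapping `psi=relu` or `psi=silu` in the call changes only the hidden-layer nonlinearity, mirroring the activation choices discussed above.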
1.1.2 Bayesian Neural Networks

Bayesian neural networks (BNN) differ from deterministic neural networks in that their weights are assigned a probability distribution instead of a single value or point estimate. These probability distributions describe the uncertainty in the weights and can be used to estimate uncertainty in predictions. A neural network model can be viewed as a probabilistic model p(y|x, θ), where θ denotes the neural network weights. For classification, y is a set of classes and p(y|x, θ) is a categorical distribution. For regression, y is a continuous variable and p(y|x, θ) is a Gaussian distribution.

In the Bayesian framework, instead of optimizing over a single probabilistic model p(y|x, θ), we discover all likely models via posterior inference over the model parameters. First, we place a prior distribution p(θ) on the neural network weights θ. Bayes' rule provides the exact posterior distribution as follows,
$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{\int_{\theta} p(\mathcal{D}|\theta)\, p(\theta)\, d\theta} \qquad (1.1)$$
where p(D|θ) denotes the likelihood of D given the model parameters θ.

The main goal of the neural network is prediction on new inputs. Given the posterior in (1.1), we predict the label corresponding to a new example x_new by Bayesian model averaging:
$$p(y_{\mathrm{new}}|x_{\mathrm{new}}, \mathcal{D}) = \int p(y_{\mathrm{new}}|x_{\mathrm{new}}, \theta)\, p(\theta|\mathcal{D})\, d\theta$$
The key property distinguishing the Bayesian approach from the deterministic one is marginalization instead of optimization, where we represent solutions given by all settings of the parameters weighted by their posterior probabilities, rather than bet everything on a single setting of parameters (Wilson and Izmailov, 2020). Bayesian procedures adapted for deep learning have received widespread attention, and applications of BNNs are found in several fields, e.g., computer vision (Kendall and Gal, 2017), civil engineering (Bateni et al., 2007; Arangio and Bontempi, 2015), astronomy (Perreault Levasseur et al., 2017; Cobb et al., 2019), and medicine (Kwon et al., 2020; Beker et al., 2020).

1.2 Markov Chain Monte Carlo Sampling

Markov chain Monte Carlo (MCMC) methods have been used in several physics problems for many years and later have been widely applied to Bayesian statistical modeling (Neal, 1996). MCMC methods do not make any assumptions regarding the form of the distribution to be sampled, for instance whether a given distribution can be approximated by a Gaussian. Ideally, they are supposed to cover all the modes of a target distribution during sampling. However, the high computational complexity of MCMC methods is their major disadvantage in complex Bayesian models and large-scale datasets. In what follows, we describe two well-known MCMC sampling algorithms. The combination of these two algorithms is used in Chapter 2 for posterior inference in the BQRNN model.

1.2.1 Metropolis-Hastings Algorithm

One of the most common algorithms for sampling from the posterior p(θ|D) is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). In the Markov chain defined by the MH algorithm, the new state θ^(t+1) is generated given the current state θ^(t) by first sampling a candidate state θ* from a proposal density g_{θ^(t)} and subsequently accepting the proposed candidate state with probability
$$\begin{cases} \min\left\{ \dfrac{p(\theta^*|\mathcal{D})\, g_{\theta^*}(\theta^{(t)})}{p(\theta^{(t)}|\mathcal{D})\, g_{\theta^{(t)}}(\theta^*)},\ 1 \right\} & \text{if } p(\theta^{(t)}|\mathcal{D})\, g_{\theta^{(t)}}(\theta^*) > 0, \\ 1 & \text{otherwise.} \end{cases}$$
By taking a sufficient number of trial steps, all of the state space is explored and the MH algorithm ensures that the points are distributed according to the required target distribution. Typically, the proposal distribution is chosen to be symmetrical, satisfying the condition g_{θ'}(θ) = g_{θ}(θ'). Hence, the acceptance probability simplifies to min{p(θ*|D)/p(θ^(t)|D), 1}, which yields the so-called random walk Metropolis algorithm. A commonly used symmetrical proposal distribution is the Gaussian with density g_θ = N(θ, Σ), where Σ is a constant covariance matrix. In our BQRNN model, we use the normal proposal density for the parameters sampled using the MH algorithm.
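The random walk Metropolis step just described can be written down in a few lines; the following NumPy sketch is not taken from the dissertation's code, and the generic unnormalized log-target and the Gaussian proposal scale are illustrative assumptions.

```python
import numpy as np

def random_walk_metropolis(log_target, theta0, n_iter=5000, prop_scale=0.5, seed=0):
    """Random walk Metropolis sampler with a symmetric Gaussian proposal.

    log_target : function returning the log of the (unnormalized) target density
    theta0     : starting value (1-D array)
    prop_scale : standard deviation of the Gaussian proposal
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_iter, theta.size))
    log_p = log_target(theta)
    for t in range(n_iter):
        proposal = theta + prop_scale * rng.normal(size=theta.size)
        log_p_prop = log_target(proposal)
        # Symmetric proposal: accept with probability min{p(proposal)/p(theta), 1}
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
        samples[t] = theta
    return samples

# Example: sample from a standard bivariate normal target
draws = random_walk_metropolis(lambda th: -0.5 * np.sum(th**2), theta0=np.zeros(2))
print(draws.mean(axis=0), draws.std(axis=0))
```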
1.2.2 Gibbs Sampling Algorithm

The Gibbs algorithm was formally described by Geman and Geman (1984), and later Gelfand and Smith (1990) showed its potential in a wide variety of conventional statistical problems. In its basic version, Gibbs sampling is a special case of the MH algorithm. The point of Gibbs sampling is that, given a multivariate distribution over parameters θ = {θ_1, ..., θ_p}, it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution. This allows us to simulate a Markov chain in which θ^(t+1) is generated from θ^(t) as follows:
$$\begin{aligned}
&\text{Pick } \theta_1^{(t+1)} \text{ from the distribution of } \theta_1 \text{ given } \theta_2^{(t)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)} \\
&\text{Pick } \theta_2^{(t+1)} \text{ from the distribution of } \theta_2 \text{ given } \theta_1^{(t+1)}, \theta_3^{(t)}, \ldots, \theta_p^{(t)} \\
&\quad\vdots \\
&\text{Pick } \theta_j^{(t+1)} \text{ from the distribution of } \theta_j \text{ given } \theta_1^{(t+1)}, \ldots, \theta_{j-1}^{(t+1)}, \theta_{j+1}^{(t)}, \ldots, \theta_p^{(t)} \\
&\quad\vdots \\
&\text{Pick } \theta_p^{(t+1)} \text{ from the distribution of } \theta_p \text{ given } \theta_1^{(t+1)}, \theta_2^{(t+1)}, \ldots, \theta_{p-1}^{(t+1)}
\end{aligned}$$
The samples obtained using the above procedure for all the parameters together approximate their joint distribution. The marginal distribution of any subset of variables can be approximated by simply considering the samples for that subset of variables while ignoring the rest. The expected value of any variable can be approximated by averaging over all the samples obtained from the above procedure.

Although Gibbs sampling is easier to implement, it is only useful when the posterior distribution of one parameter conditional on given values of the other parameters has a known distributional form. For many conventional statistical problems, these conditional distributions are of standard forms, hence efficient Gibbs sampling procedures are implemented with ease. On the contrary, in neural networks, the conditional posterior distribution for the hidden layer weights in the network given values for the rest of the weights can be extremely messy, with multiple modes. This is the case we encounter for the hidden layer weights in our BQRNN model, where we implement a combination of the random walk MH and Gibbs sampling algorithms for posterior inference.
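The systematic-scan updates above are easiest to see on a toy target whose full conditionals are available in closed form. The sketch below, which is an illustration of my own and not a model from this dissertation, uses a bivariate normal with correlation ρ, for which each full conditional is N(ρ·other, 1 − ρ²).

```python
import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, seed=0):
    """Gibbs sampler for a bivariate normal with zero means, unit variances,
    and correlation rho. Each full conditional is N(rho * other, 1 - rho**2)."""
    rng = np.random.default_rng(seed)
    theta1, theta2 = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    cond_sd = np.sqrt(1.0 - rho**2)
    for t in range(n_iter):
        # Update theta1 given the current theta2, then theta2 given the new theta1
        theta1 = rng.normal(rho * theta2, cond_sd)
        theta2 = rng.normal(rho * theta1, cond_sd)
        samples[t] = (theta1, theta2)
    return samples

draws = gibbs_bivariate_normal()
print(np.corrcoef(draws.T)[0, 1])  # should be close to 0.8
```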
1.3 Variational Bayesian Inference

Although Markov chain Monte Carlo sampling is the gold standard for inference in Bayesian models, it is computationally inefficient (Izmailov et al., 2021). As an alternative, variational inference or variational approximation tends to be faster and scales well on complex Bayesian learning tasks with large datasets (Blei et al., 2017). Variational Bayesian (VB) learning recasts the sampling problem as an optimization problem minimizing the Kullback-Leibler (KL) divergence, which intuitively measures the dissimilarity between a surrogate distribution (called a variational distribution) q(θ) and the true posterior distribution p(θ|D) (Jordan et al., 1999).

Definition 1.3.1 (Kullback-Leibler (KL) divergence). For two probability measures P_1 and P_2 over a set X such that P_1 is absolutely continuous with respect to P_2, the KL divergence of P_1 with respect to P_2 is defined as
$$d_{KL}(P_1, P_2) = \int_{\mathcal{X}} \log\left(\frac{dP_1}{dP_2}\right) dP_1$$
where dP_1/dP_2 is the Radon-Nikodym derivative of P_1 with respect to P_2. If P_1 is not absolutely continuous with respect to P_2, then d_{KL}(P_1, P_2) = ∞.

As a first step in variational learning, we define a family of surrogate distributions, also called the variational family Q, which consists of distributions of simpler form than the true posterior (1.1) (e.g., a family of Gaussian distributions),
$$\mathcal{Q} = \{q(\theta, \nu) : q \text{ is a simple candidate distribution used for approximation}\},$$
where ν denotes the parameters of the variational distribution, also known as the variational parameters. For instance, if Q is a Gaussian family then ν includes the mean (µ) and standard deviation (σ) of a Gaussian candidate distribution. Once we select an appropriate variational family, VB infers the parameters of a distribution on the model parameters q(θ) that minimizes the Kullback-Leibler (KL) distance from the true posterior p(θ|D):
$$q^*(\theta) = \underset{q(\theta) \in \mathcal{Q}}{\operatorname{argmin}}\ d_{KL}(q(\theta), p(\theta|\mathcal{D})) \qquad (1.2)$$
We simplify d_{KL}(q(θ), p(θ|D)):
$$\begin{aligned}
d_{KL}(q(\theta), p(\theta|\mathcal{D})) &= E_{q(\theta)}[\log q(\theta) - \log p(\theta|\mathcal{D})] \\
&= -E_{q(\theta)}[\log p(\theta, \mathcal{D}) - \log q(\theta)] + \log m(\mathcal{D}) \\
&= -E_{q(\theta)}[\log p(\mathcal{D}|\theta)] + d_{KL}(q(\theta), p(\theta)) + \log m(\mathcal{D})
\end{aligned}$$
where m(D) is the marginal distribution of the data and is free of θ. Hence, the optimization problem in (1.2) is equivalent to minimizing the negative evidence lower bound (ELBO), which is defined as
$$\mathcal{L} = -E_{q(\theta)}[\log p(\mathcal{D}|\theta)] + d_{KL}(q(\theta), p(\theta)),$$
where the first term is the data-dependent cost widely known as the negative log-likelihood (NLL), and the second term is prior-dependent and serves as regularization. Since the direct optimization of this objective is computationally prohibitive, gradient descent methods are used (Kingma and Welling, 2014).
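To illustrate the two terms of the negative ELBO, here is a small NumPy sketch for a toy conjugate Gaussian model rather than a neural network; the data, the N(0, 1) prior, and the mean-field Gaussian variational family are assumptions made for illustration. In practice this objective would be minimized with stochastic gradient descent using the reparameterization trick (Kingma and Welling, 2014).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: prior theta ~ N(0, 1), likelihood y_i ~ N(theta, 1).
y = rng.normal(loc=2.0, scale=1.0, size=50)   # illustrative data

def neg_elbo(mu, sigma, y, n_mc=1000):
    """Monte Carlo estimate of L = -E_q[log p(D|theta)] + KL(q, prior)
    for the Gaussian variational family q(theta) = N(mu, sigma^2)."""
    # Reparameterized draws theta = mu + sigma * eps, eps ~ N(0, 1)
    theta = mu + sigma * rng.normal(size=n_mc)
    # Expected negative log-likelihood, averaged over the Monte Carlo draws
    log_lik = np.array([np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - t) ** 2) for t in theta])
    nll_term = -log_lik.mean()
    # Closed-form KL between N(mu, sigma^2) and the N(0, 1) prior
    kl_term = np.log(1.0 / sigma) + (sigma**2 + mu**2) / 2.0 - 0.5
    return nll_term + kl_term

# A variational mean near the data mean gives a smaller negative ELBO
print(neg_elbo(mu=np.mean(y), sigma=0.15, y=y), neg_elbo(mu=0.0, sigma=0.15, y=y))
```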
1.4 Posterior Consistency

In Bayesian analysis, one starts with a prior distribution (either informative or non-informative) on the model parameters and updates the knowledge of the model as the number of data observations grows, reflected in the posterior distribution. It is therefore important to know whether the posterior distribution concentrates on neighborhoods of the true data generating distribution as the data is collected indefinitely. This is known as the Bayesian consistency of the posterior distribution. Although it is an asymptotic property, consistency is one of the benchmarks, since the violation of consistency is clearly undesirable and one may have serious doubts about inferences based on an inconsistent posterior distribution.

Let P denote the prior distribution and {P_n(.|D_n)} denote a sequence of posterior distributions, where P_n(.|D_n) is the posterior distribution based on the n-th data sample. Then we define posterior consistency as follows.

Definition 1.4.1 (Posterior consistency). The sequence of posteriors is consistent at θ_0 if P_n(U|D_n) → 1 a.s. $P_{\theta_0}^{\infty}$ for all neighborhoods U of θ_0, where $P_{\theta_0}^{\infty}$ is the joint distribution of $\{D_i\}_{i=1}^{\infty}$ when θ_0 is the true value of θ.

Alternatively, let f_0(x) be the underlying density of X. Let E(Y|X = x) = ν_0(x) be the true regression function of Y given X, and let ν̂_n(x) be the estimated regression function.

Definition 1.4.2. ν̂_n(x) is asymptotically consistent for ν_0(x) if
$$\int \left(\hat{\nu}_n(x) - \nu_0(x)\right)^2 f_0(x)\, dx \xrightarrow{p} 0$$
where p above the arrow denotes convergence in probability.

In this frequentist sense, Funahashi (1989) and Hornik et al. (1989) have established that neural networks are asymptotically consistent by showing the existence of some neural network, ν̂_n(x), whose mean squared error with the true function, ν_0(x), converges to 0 in probability.

Lee (2000) showed that the posterior distribution for feedforward (single-layer) sigmoidal neural networks is consistent. Their proof of consistency embedded the problem in density function estimation, which uses bounds on the bracketing entropy to show that the posterior is consistent over Hellinger neighborhoods. Mathematically, let f_0(x, y) denote the true joint density function of the X and Y random variables, and let f(x, y) be the corresponding density function under the neural network model. The Hellinger distance between these two density functions is given by
$$d_H(f, f_0) = \sqrt{\int\int \left(\sqrt{f(x, y)} - \sqrt{f_0(x, y)}\right)^2 dx\, dy}$$

Definition 1.4.3. The posterior is asymptotically consistent for f_0 over ϵ-Hellinger neighborhoods ∀ϵ > 0 if
$$P(\{f : d_H(f, f_0) \le \epsilon\} \mid (X_1, Y_1), \ldots, (X_n, Y_n)) \xrightarrow{p} 1$$

We use this framework to establish the Bayesian posterior consistency in our BQRNN model in Chapter 2, where we combine the ideas provided by Lee (2000) and Sriram et al. (2013). Specifically, Sriram et al. (2013) established the posterior consistency of Bayesian quantile regression under a misspecified ALD model.

In variational Bayesian inference, Zhang and Gao (2020) studied contraction rates of the variational posterior distributions for nonparametric and high-dimensional inference. They provided the conditions on the prior, the likelihood, and the variational family that characterize the contraction rates. Similar to the "prior mass and testing" conditions considered in the past literature (Schwartz, 1965), they found the contraction rate to be the sum of two terms. The first term stands for the contraction rate of the true Bayesian posterior distribution, and the second term is contributed by the variational approximation error. Bhattacharya and Maiti (2021) used the framework provided by Zhang and Gao (2020) and established the conditions needed for the variational posterior consistency of single-layer Bayesian neural networks. They establish that a simple Gaussian mean-field approximation is good enough to achieve the variational posterior consistency. In this direction, they show that an ϵ-Hellinger neighborhood of the true density function receives close to 1 probability under the variational posterior. We use a similar framework in our SS-IG model to establish the consistency of the variational posterior in Chapter 3.

1.5 Dissertation Outline

The main theme of this dissertation is the development of asymptotically consistent Bayesian neural network models tailored for different scenarios. Each chapter forms a separate manuscript which is either published or under review. For ease of readability, we separate the appendices of each chapter, providing them after the main content in each chapter.

Chapter 2 is based on our published paper titled "Quantile Regression Neural Networks: A Bayesian Approach"¹. In this chapter, we present the Bayesian quantile regression neural network (BQRNN) and provide a Metropolis-Hastings coupled with Gibbs sampling algorithm for posterior inference. We establish the posterior consistency in our proposed model and present a set of simulation examples as well as real-data applications demonstrating the efficacy of our method.
The proofs of the requisite lemmas and posterior consistency theorems are discussed in the chapter appendices.

Chapter 3 is based on our manuscript titled "Layer Adaptive Node Selection in Bayesian Neural Networks: Statistical Guarantees and Implementation Details"². In this chapter, we develop a spike-and-slab Gaussian node selection model and provide a variational algorithm for posterior inference. We derive the variational posterior consistency and its contraction rate for any generic shaped network structure. We measure the computational gains achieved by our approach using layer-wise node sparsities for shallow models and floating point operations in larger models. We also discuss the memory efficiency and computational speedup trade-off between the edge selection and node selection approaches during test time. The proofs of the lemmas required to establish the variational posterior consistency as well as additional numerical experiment details are presented in the chapter appendices.

Chapter 4 is based on our manuscript titled "Compact Bayesian Neural Networks with Structured Sparsity"³. In this chapter, we propose structurally sparse Bayesian neural networks using two distinct spike-and-slab prior setups, where the slab component uses hierarchical priors on the group of incoming weights on the neurons: (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group HorseShoe (SS-GHS). The chapter appendix discusses additional numerical experiment details.

Chapter 5 is based on our manuscript titled "Sequential Bayesian Neural Subnetwork Ensembles"⁴. In this chapter, we propose a sequential ensembling strategy for Bayesian neural networks (BNNs) which learns multiple subnetworks in a single forward pass. We combine the strengths of the automated sparsity-inducing spike-and-slab prior that allows dynamic pruning during training, which produces structurally sparse BNNs, and the proposed sequential ensembling strategy to efficiently generate diverse and sparse Bayesian neural networks. Reproducibility considerations and additional numerical experiment details are presented in the chapter appendices.

In Chapter 6, we summarize the work we have done in this dissertation and discuss likely future methodological and theoretical extensions of our current work. We also provide fully documented Python codes that reproduce all the results in this dissertation and can be easily modified and used by practitioners in chapter-specific public repositories⁵.

¹ Adapted by permission from Springer Nature: Journal of Statistical Theory and Practice (Jantre et al., 2021b), License No. 5352561465807.
² The revision is currently under review (Jantre et al., 2021a).
³ The manuscript is currently under preparation.
⁴ The manuscript is currently under revision (Jantre et al., 2022). This work is a collaborative work with my Argonne mentors, Dr. Madireddy and Dr. Balaprakash.
⁵ https://github.com/jsanket123

CHAPTER 2

BAYESIAN QUANTILE REGRESSION NEURAL NETWORKS

2.1 Introduction

Quantile regression (QR), proposed by Koenker and Basset (1978), models conditional quantiles of the dependent variable as a function of the covariates. The method supplements the least squares regression and provides a more comprehensive picture of the entire conditional distribution. This is particularly useful when the relationships in the lower and upper tail areas are of greater interest.
Quantile regression has been extensively used in a wide array of fields such as economics, finance, climatology, and medical sciences, among others (Koenker, 2005). Quantile regression estimation requires specialized algorithms and reliable estimation techniques, which are available in both the frequentist and Bayesian literature. Frequentist techniques include the simplex algorithm (Dantzig, 1963) and the interior point algorithm (Karmarkar, 1984), whereas a Bayesian technique using Markov chain Monte Carlo (MCMC) sampling was first proposed by Yu and Moyeed (2001). Their approach employed the asymmetric Laplace distribution (ALD) for the response variable, which connects to the frequentist quantile estimate, since its maximum likelihood estimates are equivalent to quantile regression using the check-loss function (Koenker and Machado, 1999). Recently, Kozumi and Kobayashi (2011) proposed a Gibbs sampling algorithm, where they exploit the normal-exponential mixture representation of the asymmetric Laplace distribution, which considerably simplified the computation for Bayesian quantile regression models.

Artificial neural networks are helpful in estimating possibly non-linear models without specifying an exact functional form. The neural networks which are most widely used in engineering applications are the single hidden-layer feedforward neural networks. These networks consist of a set of inputs X, which are connected to each of k hidden nodes, which, in turn, are connected to an output layer (O). In a typical single layer feedforward neural network, the outputs are computed as
$$O_i = b_0 + \sum_{j=1}^{k} b_j\, \psi\Big(c_{j0} + \sum_{h=1}^{p} X_{ih} c_{jh}\Big)$$
where c_{jh} is the weight from input X_{ih} to the hidden node j. Similarly, b_j is the weight associated with the hidden unit j. The c_{j0} and b_0 are the biases for the hidden nodes and the output unit. The function ψ(.) is a nonlinear activation function. Some common choices of ψ(.) are the sigmoid and the hyperbolic tangent functions.

The interest in neural networks is motivated by the universal approximation capability of feedforward neural networks (FNNs) (Cybenko, 1989; Funahashi, 1989; Hornik et al., 1989). According to these authors, standard feedforward neural networks with as few as one hidden layer, whose output functions are sigmoid functions, are capable of approximating any continuous function to a desired level of accuracy if the number of hidden layer nodes is sufficiently large. Taylor (2000) introduced a practical implementation of quantile regression neural networks (QRNN) to combine the approximation ability of neural networks with the robust nature of quantile regression. Several variants of QRNN have been developed, such as composite QRNN, where neural networks are extended to linear composite quantile regression (Xu et al., 2017); later Cannon (2018) introduced monotone composite QRNN, which guaranteed the non-crossing of regression quantiles.

Bayesian neural network learning models find the predictive distributions for the target values in a new test case given the inputs for that case as well as inputs and targets for the training cases. The early work of Buntine and Weigend (1991) and Mackay (1992) has inspired widespread research in Bayesian neural network models. Their work implemented Bayesian learning using Gaussian approximations. Later, Neal (1996) applied Hamiltonian Monte Carlo in Bayesian statistical applications. Further, sequential Monte Carlo techniques applied to neural network architectures are described in De Freitas et al. (2001).
A detailed review of MCMC algorithms applied to neural networks is presented by Titterington (2004). Although Bayesian neural networks have been widely developed in the context of mean regression models, there has been limited or no work available on their development in connection to quantile regression, both from a theoretical and an implementation standpoint. We also note that the literature on MCMC methods applied to neural networks is somewhat limited due to several challenges, including lack of parameter identifiability, high computational costs, and convergence failures (Papamarkou et al., 2022).

Our contributions: In this work, we develop the statistical framework for the Bayesian quantile regression neural network and study its properties from both a theoretical and a numerical standpoint. The natural advantage of our Bayesian procedure over the frequentist models is that we have a posterior variance for our conditional quantile estimates, which can be used for uncertainty quantification. The proposed Bayesian quantile regression neural network (BQRNN) uses a single hidden layer FNN with a sigmoid activation function and a linear output unit. On the numerical front, we have implemented the Bayesian procedure using Gibbs sampling combined with a random walk Metropolis-Hastings algorithm. We have shown that our model outperforms Bayesian quantile regression (BQR) in the nonlinear data setup while performing comparably in the linear data setup. We use mean squared error to provide empirical justification for the use of our proposed BQRNN model on any given data.

Our theoretical development includes the establishment of posterior consistency, an essential property in nonparametric Bayesian statistics, which in turn provides confidence in the use of Bayesian quantile regression neural network models across all disciplines. The posterior consistency of our method makes use of techniques from the works of Lee (2000) and Sriram et al. (2013). The former has shown posterior consistency in the context of Bayesian neural networks for mean models, while the latter has shown it in the case of Bayesian quantile regression. Following the framework of Lee (2000), we prove consistency of the posterior by using universal approximation properties of neural networks as discussed in Funahashi (1989), Hornik et al. (1989), and others. Analogous to the work of Lee (2000), our current work borrows several ideas for establishing consistency in the context of density estimation from Barron et al. (1999). Finally, to handle the case of ALD responses, we use the framework of Sriram et al. (2013), which provides a method for handling the ALD in the BQR scenario.

The rest of this chapter is organized as follows. Section 2.2 introduces quantile regression and its Bayesian formulation by establishing the relationship between quantile regression and the asymmetric Laplace distribution. In Section 2.3, we propose the Bayesian quantile regression neural network (BQRNN) model and the prior used in this study. Further, we detail our hierarchical BQRNN model and provide the MCMC procedure which couples Gibbs sampling with a random walk Metropolis-Hastings algorithm. Section 2.4 provides an overview of the posterior consistency results for our model. Section 2.5 presents simulation studies and real world applications. A brief discussion and conclusion is provided in Section 2.6. The proofs of the requisite lemmas and posterior consistency theorems are presented in Appendix A and Appendix B.
2.2 Bayesian Quantile Regression

Quantile regression (Koenker and Basset, 1978) offers a practically important alternative to mean regression by allowing inference about the conditional distribution of the response variable through modeling of its conditional quantiles. Let Y and X denote the response and the predictors respectively, let τ ∈ (0, 1) be the quantile level of the conditional distribution of Y, and let F(.) be the cumulative distribution function of Y; then a linear conditional quantile function of Y is denoted as follows
$$Q_\tau(y_i|X_i = x_i) \equiv F^{-1}(\tau) = x_i^T \beta(\tau), \quad i = 1, \ldots, n,$$
where β(τ) ∈ R^p is a vector of quantile-specific regression coefficients of length p. The aim of quantile regression is to estimate the conditional quantile function Q(.).

Let us consider the following linear model in order to formally define the quantile regression problem,
$$Y = X^T\beta(\tau) + \varepsilon, \qquad (2.1)$$
where ε is the error vector restricted to have its τth quantile equal to zero, i.e. $\int_{-\infty}^{0} f(\varepsilon_i)\, d\varepsilon_i = \tau$. The probability density of this error is often left unspecified in the classical literature. The estimation through quantile regression proceeds by minimizing the following objective function
$$\min_{\beta(\tau)\, \in\, \mathbb{R}^p} \sum_{i=1}^{n} \rho_\tau\big(y_i - x_i^T \beta(\tau)\big) \qquad (2.2)$$
where ρ_τ(.) is the check function or quantile loss function with the following form:
$$\rho_\tau(u) = u \cdot \{\tau - I(u < 0)\}, \qquad (2.3)$$
where I(.) is an indicator function, which is either 0 or 1 depending on whether it satisfies the given condition or not. This check function is not differentiable at zero; see Figure 2.1.

Figure 2.1 Quantile loss function.

Classical methods employ linear programming techniques such as the simplex algorithm, the interior point algorithm, or the smoothing algorithm to obtain quantile regression estimates for β(τ) (Madsen and Nielsen, 1993; Chen, 2007). The statistical programming language R makes use of the quantreg package (Koenker, 2017) to implement quantile regression techniques, whilst confidence intervals are obtained via the bootstrap (Koenker, 2005).

Median regression in the Bayesian setting has been considered by Walker and Mallick (1999) and Kottas and Gelfand (2001). In quantile regression, a link between maximum-likelihood theory and minimization of the sum of check functions, (2.2), is provided by the asymmetric Laplace distribution (ALD) (Koenker and Machado, 1999; Yu and Moyeed, 2001). This distribution has location parameter µ, scale parameter σ, and skewness parameter τ. Further details regarding the properties of this distribution are specified in Yu and Zhang (2005). If Y ∼ ALD(µ, σ, τ), then its probability density function is given by
$$f(y|\mu, \sigma, \tau) = \frac{\tau(1-\tau)}{\sigma} \exp\left\{-\rho_\tau\left(\frac{y - \mu}{\sigma}\right)\right\}$$
As discussed in Yu and Moyeed (2001), using the above skewed distribution for the errors provides a way to implement Bayesian quantile regression effectively. According to them, any reasonable choice of prior, even an improper prior, generates a posterior distribution for β(τ). Subsequently, they made use of a random walk Metropolis-Hastings algorithm with a Gaussian proposal density centered at the current parameter value to generate samples from the analytically intractable posterior distribution of β(τ). In the aforementioned approach, the acceptance probability depends on the choice of the value of τ, hence fine tuning of parameters like the proposal step size is necessary to obtain appropriate acceptance rates for each τ.
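The connection between the check loss (2.3) and the ALD working likelihood can be checked numerically. The short sketch below is an illustration of my own (the function names are mine, and the data vector is arbitrary): minimizing the summed check loss and maximizing the ALD log-likelihood pick out the same location, which for τ = 0.5 is the sample median.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile (check) loss rho_tau(u) = u * (tau - I(u < 0)) from (2.3)."""
    return u * (tau - (u < 0).astype(float))

def ald_pdf(y, mu, sigma, tau):
    """ALD density f(y|mu, sigma, tau) = tau(1-tau)/sigma * exp{-rho_tau((y-mu)/sigma)}."""
    return tau * (1.0 - tau) / sigma * np.exp(-check_loss((y - mu) / sigma, tau))

# The ALD likelihood is maximized where the summed check loss is minimized:
# for tau = 0.5 both criteria point to the sample median.
y = np.array([1.0, 2.0, 2.5, 4.0, 10.0])
grid = np.linspace(0.0, 12.0, 1201)
loss = np.array([check_loss(y - m, 0.5).sum() for m in grid])
loglik = np.array([np.log(ald_pdf(y, m, 1.0, 0.5)).sum() for m in grid])
print(grid[loss.argmin()], grid[loglik.argmax()], np.median(y))  # all approximately 2.5
```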
Kozumi and Kobayashi (2011) overcame this tuning limitation and showed that Gibbs sampling can be incorporated with the AL density represented as a mixture of normal and exponential distributions. Consider the linear model from (2.1), where ε_i ∼ ALD(0, σ, τ); then this model can be written as
$$y_i = x_i^T \beta(\tau) + \theta v_i + \kappa \sqrt{\sigma v_i}\, u_i, \quad i = 1, \ldots, n, \qquad (2.4)$$
where u_i and v_i are mutually independent, with u_i ∼ N(0, 1), v_i ∼ E(1/σ), and E(1/σ) is the exponential distribution with mean σ. The θ and κ constants in (2.4) are given by
$$\theta = \frac{1 - 2\tau}{\tau(1-\tau)} \quad \text{and} \quad \kappa = \sqrt{\frac{2}{\tau(1-\tau)}}$$
Consequently, a Gibbs sampling algorithm based on the normal distribution can be implemented effectively. Currently, the Brq (Alhamzawi, 2018) and bayesQR (Benoit et al., 2017) packages in R provide Gibbs samplers for Bayesian quantile regression. We employ the same technique to derive Gibbs sampling steps for all except the hidden layer node weight parameters in our Bayesian quantile regression neural network model.

2.3 Bayesian Quantile Regression Neural Networks

2.3.1 Model

In this work, we focus on feedforward neural networks with a single hidden layer of units with logistic activation functions and a linear output unit. Consider the univariate response variable Y_i and the covariate vector X_i (i = 1, 2, ..., n). Further, denote the number of covariates by p and the number of hidden nodes by k, which is allowed to vary as a function of n. Denote the input weights by γ_{jh} and the output weights by β_j. Let τ ∈ (0, 1) be the quantile level of the conditional distribution of Y_i given X_i and keep it fixed. Then, the resulting conditional quantile function is denoted as follows
$$Q_\tau(y_i|X_i = x_i) = \beta_0 + \sum_{j=1}^{k} \beta_j \frac{1}{1 + \exp\big(-\gamma_{j0} - \sum_{h=1}^{p} \gamma_{jh} x_{ih}\big)} = \beta_0 + \sum_{j=1}^{k} \beta_j \psi(x_i^T \gamma_j) = \beta^T \eta_i(\gamma) = L_i \beta \qquad (2.5)$$
where β = (β_0, ..., β_k)^T, x_i = (1, x_{i1}, ..., x_{ip})^T, η_i(γ) = (1, ψ(x_i^T γ_1), ..., ψ(x_i^T γ_k))^T, and L = (η_1(γ), ..., η_n(γ))^T, i = 1, ..., n. ψ(.) is the logistic activation function.

The specified model for Y_i conditional on X_i = x_i is given by Y_i ∼ ALD(L_i β, σ, τ) with a likelihood proportional to
$$\sigma^{-n} \exp\left\{ -\sum_{i=1}^{n} \frac{|\varepsilon_i| + (2\tau - 1)\varepsilon_i}{2\sigma} \right\} \qquad (2.6)$$
where ε_i = y_i − L_i β. The above ALD-based likelihood can be represented as a location-scale mixture of normals (Kozumi and Kobayashi, 2011). For any a, b > 0, we have the following equality (Andrews and Mallows, 1974)
$$\exp\{-|ab|\} = \int_0^\infty \frac{a}{\sqrt{2\pi v}} \exp\left\{ -\frac{1}{2}\big(a^2 v + b^2 v^{-1}\big) \right\} dv$$
Letting $a = 1/\sqrt{2\sigma}$, $b = \varepsilon/\sqrt{2\sigma}$, and multiplying by $\exp\{-(2\tau - 1)\varepsilon/2\sigma\}$, (2.6) becomes
$$\sigma^{-n} \exp\left\{ -\sum_{i=1}^{n} \frac{|\varepsilon_i| + (2\tau - 1)\varepsilon_i}{2\sigma} \right\} = \prod_{i=1}^{n} \int_0^\infty \frac{1}{\sigma}\frac{1}{\sqrt{4\pi\sigma v_i}} \exp\left\{ -\frac{(\varepsilon_i - \xi v_i)^2}{4\sigma v_i} - \zeta v_i \right\} dv_i \qquad (2.7)$$
where ξ = (1 − 2τ) and ζ = τ(1 − τ)/σ. (2.7) is beneficial in the sense that there is no need to worry about the prior distribution of v_i, as it is extracted in the same equation. The prior of v_i in (2.7) is an exponential distribution with mean ζ^{-1}, and it depends on the value of τ.

Further, we observe that the output of the aforementioned neural network remains unchanged under a set of transformations, like certain weight permutations and sign flips, which renders the neural network non-identifiable. For example, in the above model (2.5), take p, k = 2 and β_0, γ_{j0} = 0. Then,
$$\sum_{j=1}^{2} \beta_j \psi(x_i^T \gamma_j) = \beta_1 [1 + \exp(-\gamma_{11} x_{i1} - \gamma_{12} x_{i2})]^{-1} + \beta_2 [1 + \exp(-\gamma_{21} x_{i1} - \gamma_{22} x_{i2})]^{-1}$$
In the foregoing equation we can notice that when β_1 = β_2, two different sets of values of (γ_{11}, γ_{12}, γ_{21}, γ_{22}) obtained by flipping the signs, namely (1, 2, -1, -2) and (-1, -2, 1, 2), result in the same value for $\sum_{j=1}^{2} \beta_j \psi(x_i^T \gamma_j)$. However, as a special case of Lemma 1 of Ghosh et al. (2000), the joint posterior of the parameters is proper if the joint prior is proper, even in the case of posterior invariance under the parameter transformations. Note that, as long as the interest is in prediction rather than parameter estimation, this property is sufficient for predictive model building. In this work, we focus only on proper priors, hence the non-identifiability of the parameters in (2.5) does not cause any problem.
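The normal-exponential mixture representation (2.4) used above can be sanity-checked by simulation. The following sketch, which is my own illustration and not the dissertation's code, draws errors via the mixture and verifies that their τ-th quantile is approximately zero, which is exactly the restriction placed on ε in the quantile regression model.

```python
import numpy as np

def ald_errors_via_mixture(n, sigma, tau, seed=0):
    """Draw ALD(0, sigma, tau) errors using the normal-exponential mixture (2.4):
    eps = theta * v + kappa * sqrt(sigma * v) * u,  u ~ N(0,1),  v ~ Exp(mean sigma)."""
    rng = np.random.default_rng(seed)
    theta = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))
    kappa = np.sqrt(2.0 / (tau * (1.0 - tau)))
    v = rng.exponential(scale=sigma, size=n)   # exponential with mean sigma
    u = rng.normal(size=n)
    return theta * v + kappa * np.sqrt(sigma * v) * u

# The tau-th empirical quantile of the generated errors should be close to zero.
for tau in (0.25, 0.5, 0.9):
    eps = ald_errors_via_mixture(200_000, sigma=1.0, tau=tau, seed=1)
    print(tau, np.quantile(eps, tau))
```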
2.3.2 Algorithm

We take mutually independent priors for β, γ_1, ..., γ_k with β ∼ N(β_0, σ_0² I_{k+1}) and γ_j ∼ N(γ_{j0}, σ_1² I_{p+1}), j = 1, ..., k. Further, we take an inverse gamma prior for σ such that σ ∼ IG(a/2, b/2). Prior selection is problem specific, and it is useful to elicit the chosen prior from historical knowledge. However, for most practical applications, such information is not readily available. Furthermore, neural networks are commonly applied to big data for which a priori knowledge regarding the data as well as about the neural network parameters is not typically known. Hence, prior elicitation from experts in the area is not applicable to neural networks in practice. As a consequence, it seems reasonable to use near-diffuse priors for the parameters of the given model.

Now, the joint posterior for β, γ, σ, v given y is
$$\begin{aligned}
f(\beta, \gamma, \sigma, v|y) &\propto l(y|\beta, \gamma, \sigma, v)\, \pi(\beta)\, \pi(\gamma)\, \pi(\sigma) \\
&\propto \left(\frac{1}{\sigma}\right)^{\frac{3n}{2}} \left(\prod_{i=1}^{n} v_i\right)^{-\frac{1}{2}} \exp\left\{ -\frac{(y - L\beta - \xi v)^T V (y - L\beta - \xi v)}{4\sigma} - \frac{\tau(1-\tau)}{\sigma}\sum_{i=1}^{n} v_i \right\} \\
&\quad \times \exp\left\{ -\frac{1}{2\sigma_0^2}(\beta - \beta_0)^T(\beta - \beta_0) \right\} \times \exp\left\{ -\frac{1}{2\sigma_1^2}\sum_{j=1}^{k}(\gamma_j - \gamma_{j0})^T(\gamma_j - \gamma_{j0}) \right\} \\
&\quad \times \left(\frac{1}{\sigma}\right)^{\frac{a}{2}+1} \exp\left\{ -\frac{b}{2\sigma} \right\}
\end{aligned}$$
where V = diag(v_1^{-1}, v_2^{-1}, ..., v_n^{-1}). A Gibbs sampling algorithm is used to generate samples from the analytically intractable posterior distribution f(β, γ|y). Some of the full conditionals required in this procedure are available only up to unknown normalizing constants, and we use a random walk Metropolis-Hastings algorithm to sample from these full conditional distributions. The full conditional distributions are as follows:

(a) π(β|γ, σ, v, y)
$$\sim N\left[ \left(\frac{L^T V L}{2\sigma} + \frac{I}{\sigma_0^2}\right)^{-1} \left(\frac{L^T V (y - \xi v)}{2\sigma} + \frac{\beta_0}{\sigma_0^2}\right),\ \left(\frac{L^T V L}{2\sigma} + \frac{I}{\sigma_0^2}\right)^{-1} \right]$$

(b) π(γ_j|β, σ, v, y)
$$\propto \exp\left\{ -\frac{1}{4\sigma}(y - L\beta - \xi v)^T V (y - L\beta - \xi v) \right\} \times \exp\left\{ -\frac{1}{2\sigma_1^2}(\gamma_j - \gamma_{j0})^T(\gamma_j - \gamma_{j0}) \right\}$$

(c) π(σ|γ, β, v, y)
$$\sim IG\left( \frac{3n + a}{2},\ \frac{1}{4}(y - L\beta - \xi v)^T V (y - L\beta - \xi v) + \tau(1-\tau)\sum_{i=1}^{n} v_i + \frac{b}{2} \right)$$

(d) π(v_i|γ, β, σ, y)
$$\sim GIG(\nu, \rho_1, \rho_2), \quad \text{where } \nu = \frac{1}{2},\ \rho_1^2 = \frac{(y_i - L_i\beta)^2}{2\sigma},\ \text{and } \rho_2^2 = \frac{1}{2\sigma}.$$

The generalized inverse Gaussian distribution is defined as follows: if x ∼ GIG(ν, ρ_1, ρ_2), then the probability density function of x is given by
$$f(x|\nu, \rho_1, \rho_2) = \frac{(\rho_2/\rho_1)^{\nu}}{2K_{\nu}(\rho_1\rho_2)}\, x^{\nu-1} \exp\left\{ -\frac{1}{2}\big(x^{-1}\rho_1^2 + x\rho_2^2\big) \right\},$$
where x > 0, −∞ < ν < ∞, ρ_1, ρ_2 ≥ 0, and K_ν(.) is a modified Bessel function of the third kind (see Barndorff-Nielsen and Shephard (2001)).
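As a concrete illustration of one of these updates, the following NumPy sketch draws β from its Gaussian full conditional in step (a). It is a minimal sketch of my own, not the dissertation's implementation, and the inputs below (L, y, v, and the prior settings) are randomly generated placeholders rather than real MCMC state.

```python
import numpy as np

def sample_beta(L, y, v, sigma, tau, beta0_prior, sigma0_sq, rng):
    """One Gibbs draw of beta from its Gaussian full conditional (step (a)).

    L           : (n, k+1) hidden-layer output matrix (leading column of ones)
    y, v        : (n,) responses and latent exponential variables
    sigma       : ALD scale parameter
    beta0_prior : prior mean vector of beta; sigma0_sq : prior variance of beta
    """
    xi = 1.0 - 2.0 * tau
    V = np.diag(1.0 / v)                                   # V = diag(1/v_i)
    prec = L.T @ V @ L / (2.0 * sigma) + np.eye(L.shape[1]) / sigma0_sq
    cov = np.linalg.inv(prec)
    mean = cov @ (L.T @ V @ (y - xi * v) / (2.0 * sigma) + beta0_prior / sigma0_sq)
    return rng.multivariate_normal(mean, cov)

# Illustrative call with placeholder quantities
rng = np.random.default_rng(0)
n, k = 100, 5
L = np.column_stack([np.ones(n), rng.uniform(size=(n, k))])
y, v = rng.normal(size=n), rng.exponential(size=n)
print(sample_beta(L, y, v, sigma=1.0, tau=0.5,
                  beta0_prior=np.zeros(k + 1), sigma0_sq=100.0, rng=rng))
```

Within the full sampler, this draw would alternate with the random walk MH updates of the γ_j in step (b) and the conjugate draws of σ and the v_i in steps (c) and (d).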
Unlike parsimonious parametric models, Bayesian nonparametric models require additional statistical justification for their theoretical validity. For that reason, we are going to establish the asymptotic consistency of the posterior distribution derived in our proposed neural network model.

2.4 Theoretical Results

Let (x_1, y_1), ..., (x_n, y_n) be the given data and let f_0(x) be the underlying density of X. Let Q_τ(y|X = x) = µ_0(x) be the true conditional quantile function of Y given X, and let µ̂_n(x) be the estimated conditional quantile function.

Definition 2.4.1. µ̂_n(x) is asymptotically consistent for µ_0(x) if
$$\int |\hat{\mu}_n(x) - \mu_0(x)|\, f_0(x)\, dx \xrightarrow{p} 0$$
We are essentially making use of Markov's inequality to ultimately show that $\hat{\mu}_n(X) \xrightarrow{p} \mu_0(X)$. In a similar frequentist sense, Funahashi (1989) and Hornik et al. (1989) have shown the asymptotic consistency of neural networks for mean-regression models by showing the existence of some neural network, µ̂_n(x), whose mean squared error with the true function, µ_0(x), converges to 0 in probability.

We consider the notion of posterior consistency for Bayesian non-parametric problems, which is quantified by concentration around the true density function (see Wasserman (1998), Barron et al. (1999)). This boils down to the above definition of consistency on the conditional quantile functions. The main idea is that the density functions deal with the joint distribution of X and Y, while the conditional quantile function deals with the conditional distribution of Y given X. This conditional distribution can then be used to construct the joint distribution by assuming certain regularity conditions on the distribution of X. This allows the use of some techniques developed in the density estimation field. Some of the ideas presented here can be found in Lee (2000), which developed the consistency results for non-parametric regression using single hidden-layer feedforward neural networks.

Let P(.|(X_1, Y_1), ..., (X_n, Y_n)) denote the posterior distribution of the parameters. Let f(x, y) and f_0(x, y) denote the joint density function of x and y under the model and the truth respectively. Indeed, one can construct the joint density f(x, y) from the conditional density f(y|x) by taking f(x, y) = f(y|x)f(x), where f(x) denotes the underlying density of X. Since one is only interested in f(y|x) and X is ancillary to the estimation of f(y|x), one can use some convenient distribution for f(x). Similar to Lee (2000), we define Hellinger neighborhoods of the true density function f_0(x, y) = f_0(y|x)f_0(x), which allows us to quantify the consistency of the posterior. The Hellinger distance between f_0 and any joint density function f of x and y is defined as follows.
$$D_H(f, f_0) = \sqrt{\int\int \left(\sqrt{f(x, y)} - \sqrt{f_0(x, y)}\right)^2 dx\, dy} \qquad (2.8)$$
Based on (2.8), an ϵ-sized Hellinger neighborhood of the true density function f_0 is given by
$$A_\epsilon = \{f : D_H(f, f_0) \le \epsilon\} \qquad (2.9)$$

Definition 2.4.2 (Posterior Consistency). Suppose (X_i, Y_i) ∼ f_0. The posterior is asymptotically consistent for f_0 over Hellinger neighborhoods if ∀ϵ > 0,
$$P(A_\epsilon|(X_1, Y_1), \ldots, (X_n, Y_n)) \xrightarrow{p} 1$$
i.e. the posterior probability of any Hellinger neighborhood of f_0 converges to 1 in probability.

Similar to Lee (2000), we prove the asymptotic consistency of the posterior for neural networks with the number of hidden nodes, k, being a function of the sample size, n. This sequence of models indexed by increasing sample size is called a sieve. We take a sequence of priors, {π_n}, where each π_n is defined for a neural network with k_n hidden nodes in it. The predictive density (Bayes estimate of f) is then given by
$$\hat{f}_n(\cdot) = \int f(\cdot)\, dP(f|(X_1, Y_1), \ldots, (X_n, Y_n)) \qquad (2.10)$$
, (Xn , Yn )) (2.10) 23 Let µ0 (x) = Qτ,f0 (Y |X = x) be the true conditional quantile function and let µ̂n (x) = Qτ,fˆn (Y |X = x) be the posterior predictive conditional quantile function using a neural network. For notational convenience we are going to drop x and denote these functions as µ0 and µ̂n occasionally. The following is the key result in this case. Theorem 2.4.3. Let the prior for the regression parameters, πn , be an independent normal with mean 0 and variance σ02 (fixed) for each of the parameters in the neural network. Suppose that the true conditional quantile function is either continuous or square integrable. Let kn be the number of hidden nodes in the neural network, and let kn → ∞. If there exists a R p constant a such that 0 < a < 1 and kn ≤ na , then |µ̂n (x) − µ0 (x)| dx → 0 as n → ∞ In order to prove Theorem 2.4.3, we assume that Xi ∼ U (0, 1), i.e. density function of x is identically equal to 1. This implies joint densities f (x, y) and f0 (x, y) are equal to the conditional density functions, f (y|x) and f0 (y|x) respectively. Next, we define Kullback- Leibler distance to the true density f0 (x, y) as follows   f0 (X, Y ) DK (f0 , f ) = Ef0 log (2.11) f (X, Y ) Based on (2.11), a δ− sized neighborhood of the true density f0 is given by Kδ = {f : DK (f0 , f ) ≤ δ} (2.12) Further towards the proof of Theorem 2.4.3, we define the sieve Fn as the set of all neural networks with each parameter less than Cn in absolute value, |γjh | ≤ Cn , |βj | ≤ Cn , j = 0, . . . , kn , h = 0, . . . , p (2.13) where Cn grows with n such that Cn ≤ exp(nb−a ) for any constant b where 0 < a < b < 1, and a is same as in Theorem 2.4.3. For the above choice of sieve, we next provide a set of conditions on the prior πn which guarantee the posterior consistency of f0 over the Hellinger neighborhoods. At the end of 24 this section, we demonstrate that the following theorem and corollary serve as an important tool towards the proof of Theorem 2.4.3. Theorem 2.4.4. Suppose a prior πn satisfies i ∃ r > 0 and N1 s.t. πn (Fnc ) < exp(−nr), ∀n ≥ N1 ii ∀δ, ν > 0, ∃ N2 s.t. πn (Kδ ) ≥ exp(−nν), ∀n ≥ N2 . Then ∀ϵ > 0, p P (Aϵ |(X1 , Y1 ), . . . , (Xn , Yn )) → 1 where Aϵ is the Hellinger neighborhood of f0 as in (2.9). Corollary 2.4.5. Under the conditions of Theorem 2.4.4, µ̂n is asymptotically consistent for µ0 , i.e. Z p |µ̂n (x) − µ0 (x)| dx → 0 We present the proofs of Theorem 2.4.4 and Corollary 2.4.5 in The main idea behind the proof of Theorem 2.4.4 is to consider the complement of P (Aϵ |(X1 , Y1 ), .., (Xn , Yn )) as a ratio of integrals. Hence let Y n f (xi , yi ) i=1 Rn (f ) = n (2.14) Y f0 (xi , yi ) i=1 Then Z Y n Z f (xi , yi )dπn (f ) Rn (f )dπn (f ) Acϵ i=1 Acϵ P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) = Z Y n = Z f (xi , yi )dπn (f ) Rn (f )dπn (f ) i=1 Z Z Rn (f )dπn (f ) + Rn (f )dπn (f ) Acϵ ∩Fn Acϵ ∩Fnc = Z Rn (f )dπn (f ) 25 In the proof, we have shown that the numerator is small as compared to the denominator, p thereby ensuring P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) → 0. The convergence of the second term in the numerator uses assumption i) of the Theorem 2.4.4. It systematically shows that R Fc Rn (f )dπn (f ) < exp(−nr/2) except on a set with probability tending to zero (see Lemma n A.1.3 in A for further details). The denominator is bounded using assumption ii) of Theorem R 2.4.4. First the KL distance between f0 and f dπn (f ) is bounded and subsequently used p to prove that P (Rn (f ) ≤ e−nς ) → 0, where ς depends on δ defined earlier. 
This leads to a conclusion that for all ς > 0 and for sufficiently large n, Rn (f )dπn (f ) > e−nς except on a R set of probability going to zero. The result in this case has been condensed in Lemma A.1.4 presented in A. Lastly, the first term in the numerator is bounded using the Hellinger bracketing entropy defined below Definition 2.4.6 (Bracketing Entropy). For any two functions l and u, define the bracket [l, u] as the set of all functions f such that l ≤ f ≤ u. Let ∥.∥ be a metric. Define an ϵ- bracket as a bracket with ∥u − l∥ < ϵ. Define the bracketing number of a set of functions F ∗ as the minimum number of ϵ-brackets needed to cover the F ∗ , and denote it by N[] (ϵ, F ∗ , ∥.∥). Finally, the bracketing entropy, denoted by H[] () , is the natural logarithm of the bracketing number. (Pollard, 1991) Wong and Shen (1995, Theorem 1, pp.348-349) gives the conditions on the rate of growth Z p of the Hellinger bracketing entropy in order to ensure Rn (f )dπn (f ) → 0. We next Acϵ ∩Fn outline the steps to bound the bracketing entropy induced by the sieve structure in (2.13). In this direction, we first compute the covering number and use it as an upper bound in order to find the bracketing entropy for a neural network. Let’s consider k, number of hidden nodes to be fixed for now and restrict the parameter space to Fn then Fn ⊂ Rd where d = (p + 2)k + 1. Further let the covering number be N (ϵ, Fn , ∥.∥) and use L∞ as a metric to cover the Fn with balls of radius ϵ. Then, one does not require more than ((Cn + 1)/ϵ)d 26 such balls which implies  d  d  d 2Cn Cn + ϵ Cn + 1 N (ϵ, Fn , L∞ ) ≤ +1 = ≤ (2.15) 2ϵ ϵ ϵ Together with (2.15), we use results from to bound the bracketing number of F ∗ (the space of all functions on x and y with parameter vectors lying in Fn ) as follows:  2 d ∗ dCn N[] (ϵ, F , ∥.∥2 ) ≤ (2.16) ϵ This allows us to determine the rate of growth of Hellinger bracketing entropy which is nothing but the log of the quantity in (2.16). For further details, we refer to Lemmas A.1.2 and A.1.3 in A. Going back to the proof of Theorem 2.4.3, we show that the πn in Theorem 2.4.3 satisfies the conditions of Theorem 2.4.4 for Fn as in (2.13). Then, the result of Theorem 2.4.3 follows from the Corollary 2.4.5 which is derived from Theorem 2.4.4. Further details of the proof of Theorem 2.4.3 are presented in B. Although Theorem 2.4.3 uses a fixed prior, the results can be extended to a more general class of prior distributions as long as the assumptions of Theorem 2.4.4 hold. 2.5 Numerical Experiments 2.5.1 Simulation Studies We investigate the performance of the proposed BQRNN method using two simulated examples and compare the estimated conditional quantiles of the response variable against frequentist quantile regression (QR), Bayesian quantile regression (BQR), and quantile re- gression neural network (QRNN) models. We implement QR from quantreg package, BQR from bayesQR package and QRNN from qrnn (Cannon, 2011) package available in R. We choose two simulation scenarios, (i) a linear additive model, (ii) a nonlinear polynomial model. In both the cases we consider a heteroscedastic behavior of y given x. Scenario I: Linear heteroscedastic; Data are generated from Y = X T β1 + X T β2 ε, 27 Scenario II: Non-linear heteroscedastic; Data are generated from Y = (X T β1 )4 + (X T β2 )2 ε, where, X = (X1 , X2 , X3 ) and Xi ’s are independent and follow U (0, 5). The parameters β1 and β2 are set at (2, 4, 6) and (0.1, 0.3, 0.5) respectively. 
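For concreteness, the two data-generating mechanisms can be reproduced with the short sketch below. It assumes NumPy, and the function and variable names (e.g., simulate) are illustrative rather than part of the reported implementation.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed chosen for illustration

n = 200                                   # observations per replicate
beta1 = np.array([2.0, 4.0, 6.0])
beta2 = np.array([0.1, 0.3, 0.5])

def simulate(scenario, noise, n=n):
    """Generate (X, y) for the two heteroscedastic simulation scenarios."""
    X = rng.uniform(0.0, 5.0, size=(n, 3))        # X1, X2, X3 ~ U(0, 5), independent
    if noise == "normal":
        eps = rng.standard_normal(n)              # N(0, 1) errors
    elif noise == "uniform":
        eps = rng.uniform(0.0, 1.0, size=n)       # U(0, 1) errors
    else:
        eps = rng.exponential(1.0, size=n)        # E(1) errors (mean 1)
    loc, scale = X @ beta1, X @ beta2
    if scenario == 1:                             # Scenario I: linear heteroscedastic
        y = loc + scale * eps
    else:                                         # Scenario II: nonlinear heteroscedastic
        y = loc**4 + scale**2 * eps
    return X, y

X, y = simulate(scenario=2, noise="normal")
```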
We chose these scenarios to make our simulation studies comparable with the past literature in the quantile regression neural network domain (Xu et al., 2017; Cannon, 2018). The robustness of our method is illustrated using three different types of random error components (ε): N (0, 1), U (0, 1), and E(1), where E(ζ) is the exponential distribution with mean 1/ζ. For each scenario, we generate 200 independent observations.

We work with a single hidden layer feedforward neural network with a fixed number of nodes k. We tried several values of k in the range 2 to 8 and settled on k = 4, which yielded better results than the other choices while bearing a reasonable computational cost. We generated 100,000 MCMC samples and then discarded the first half of the sampled chain as the burn-in period. Using 50% of the samples as burn-in in MCMC simulations is not unusual and has been suggested by Gelman and Rubin (1992). We also retain every 10th sampled value of the estimated parameters to diminish the effect of autocorrelation between consecutive draws. Convergence of the MCMC was checked using standard diagnostic tools (Gelman et al., 2013).

We tried several different values of the hyperparameters. For brevity, we report the results for only one choice of hyperparameters, given by β0 = 0, σ0^2 = 100, γj0 = 0, σ1^2 = 100, a = 3, and b = 0.1. This particular choice of hyperparameters reflects our preference for near-diffuse priors, since in many real applications of neural networks we do not have information about the relationship between the input and output variables. Therefore, we wanted to test our model performance in the absence of specific prior elicitation. We also tried different starting values for the β and γ chains and found that the model output is robust to different starting values of β but varies noticeably for different starting values of γ. Further, we observed that our model yields optimal results when we use the QRNN estimates of γ as its starting values. We also had to fine-tune the step size of the random walk Metropolis-Hastings (MH) updates in the γ generation process and settled on a random walk variance of 0.01^2 for Scenario I and 0.001^2 for Scenario II. These step sizes lead to reasonable rejection rates for the MH sampling of γ values. However, they also indicate slow traversal of the parameter space for γ.

To compare the model performance of QR, BQR, QRNN, and BQRNN, we calculated the theoretical conditional quantiles (Cond Q) and contrasted them with the estimated conditional quantiles from the given simulated models. Additionally, we compute standard deviation (SD) and root mean squared error (RMSE) values for the predicted conditional quantiles of each observation in the BQR and BQRNN models using the sampled chains. For Scenarios I and II, Table 2.1 and Table 2.2, respectively, present these results at quantile levels τ = (0.05, 0.50, 0.95) for 3 observations. Table 2.1 indicates that the neural network models perform comparably with the linear models. This justifies the use of neural network models even when the underlying relationship is linear. In Table 2.2, we observe that BQRNN outperforms the other models in the tail areas, i.e., τ = 0.05, 0.95, whereas its performance is comparable to QRNN at the median. Furthermore, we notice that our relatively complex BQRNN model has lower bias but higher variance (SD^2) compared to the BQR model. However, BQRNN outperforms BQR, as evidenced by the lower RMSE values, which combine the squared bias and variance terms.
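As an implementation aside, the random-walk Metropolis-Hastings update for γ together with the burn-in and thinning schedule described above can be sketched as follows. This is a simplified illustration: log_post is a placeholder for the unnormalized full conditional π(γj | β, σ, v, y) given in Section 2.3.2, and the step size corresponds to the Scenario I setting.

```python
import numpy as np

rng = np.random.default_rng(1)

def rw_mh_step(gamma, log_post, step_sd):
    """One random-walk Metropolis-Hastings update for the hidden-layer weights."""
    proposal = gamma + step_sd * rng.standard_normal(gamma.shape)
    log_ratio = log_post(proposal) - log_post(gamma)
    if np.log(rng.uniform()) < log_ratio:
        return proposal, True            # accept the proposed move
    return gamma, False                  # reject and keep the current state

# Placeholder target: the actual full conditional combines the ALD-mixture
# likelihood term with the N(gamma_j0, sigma_1^2 I) prior.
log_post = lambda g: -0.5 * np.sum(g**2)

gamma = np.zeros((4, 4))                 # k = 4 hidden nodes, p + 1 = 4 weights each
n_iter, burn_in, thin = 100_000, 50_000, 10
draws, accepted = [], 0
for t in range(n_iter):
    gamma, acc = rw_mh_step(gamma, log_post, step_sd=0.01)   # variance 0.01^2
    accepted += acc
    if t >= burn_in and (t - burn_in) % thin == 0:           # keep every 10th draw
        draws.append(gamma.copy())
print(f"acceptance rate: {accepted / n_iter:.2f}")
```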
We notice one exception when RMSE value in BQR is lower than BQRNN for 20th observation in 95th quantile of all 3 error models. The occurrence of this exception is random and not because BQR performed systematically better than our model. In summary, we observe the tradeoff between bias and variance in our model which overall performs better than a linear BQR model in a nonlinear setup. The natural advantage of our Bayesian procedure over the frequentist QRNN is that we have posterior variance for our conditional quantile estimates which can be used as an uncertainty quantification. 29 Table 2.1 Simulation study I results. Simulated Conditional Quantiles for QR, BQR, QRNN and BQRNN Noise Quantile Obs Theo QR BQR QRNN BQRNN ε τ No Cond Q Cond Q Cond Q SD RMSE Cond Q Cond Q SD RMSE N (0, 1) 0.05 20 17.56 17.19 17.19 0.60 0.70 16.98 17.15 0.34 0.53 50 37.38 37.69 37.67 0.74 0.80 38.80 39.03 0.68 1.79 100 42.53 42.45 42.56 0.92 0.92 41.43 41.31 0.69 1.40 0.50 20 20.23 20.23 20.25 0.37 0.37 20.23 20.23 0.51 0.51 50 42.62 42.87 42.65 0.31 0.31 42.62 42.68 0.77 0.77 100 48.78 48.81 48.67 0.39 0.41 48.78 48.01 0.97 1.24 0.95 20 22.90 21.81 22.38 0.68 0.86 21.42 21.85 0.46 1.15 50 47.86 47.92 47.70 0.69 0.71 47.52 47.19 0.80 1.04 100 55.02 54.28 54.20 0.92 1.24 53.80 53.47 1.05 1.88 U (0, 1) 0.05 20 20.31 20.39 20.25 0.40 0.41 20.55 20.44 0.30 0.33 50 42.78 42.68 42.56 0.47 0.52 42.89 42.92 0.61 0.63 100 48.97 48.93 48.82 0.56 0.58 48.34 48.82 0.70 0.72 0.50 20 21.04 21.28 21.25 0.19 0.29 20.98 20.97 0.33 0.34 50 44.21 44.30 44.22 0.22 0.22 43.95 43.98 0.66 0.70 100 50.68 50.88 50.79 0.25 0.28 50.39 50.82 0.76 0.78 0.95 20 21.77 21.87 22.10 0.46 0.56 21.72 21.82 0.33 0.34 50 45.65 45.58 45.69 0.46 0.47 45.43 45.38 0.65 0.70 100 52.39 52.36 52.45 0.58 0.58 52.49 52.52 0.75 0.76 E(1) 0.05 20 20.31 20.24 19.94 0.45 0.58 19.94 20.27 0.31 0.31 50 42.78 42.92 42.93 0.51 0.53 42.98 43.17 0.63 0.74 100 48.97 49.06 49.05 0.61 0.62 47.24 49.60 0.75 0.98 0.50 20 21.36 20.95 21.04 0.23 0.40 21.11 20.91 0.39 0.59 50 44.83 45.54 45.47 0.27 0.70 46.30 45.66 0.74 1.11 100 51.41 51.93 51.91 0.41 0.65 51.68 51.94 0.89 1.04 0.95 20 25.09 25.08 25.17 0.80 0.81 23.73 24.27 0.79 1.14 50 52.17 53.15 52.61 0.90 1.00 54.98 54.43 1.24 2.58 100 60.15 61.24 60.75 1.06 1.22 59.87 62.96 1.46 3.17 Obs No: Observation Number; Theo: Theoretical; Cond Q: Conditional Quantile; SD: Standard Deviation; RMSE: Root Mean Squared Error. 30 Table 2.2 Simulation study II results. 
Simulated Conditional Quantiles for QR, BQR, QRNN and BQRNN Noise Quantile Obs Theo QR BQR QRNN BQRNN ε τ No Cond Q Cond Q Cond Q SD RMSE Cond Q Cond Q SD RMSE N (0, 1) 0.05 20 167491.82 -195687.34 10207.80 33.03 157284.03 30400.87 166112.85 8460.05 8570.87 50 3298781.46 2107072.28 24948.89 73.02 3273832.57 2626269.69 3289014.06 48028.83 49007.23 100 5660909.58 2741102.73 25680.38 75.52 5635229.20 3312208.28 5311964.68 88478.24 359985.24 0.50 20 167496.16 83421.40 92937.30 101.97 74558.93 183185.47 169748.75 2615.13 3451.33 50 3298798.17 2661086.40 228321.19 282.04 3070476.99 3292073.45 3290858.99 46577.95 47245.13 100 5660933.30 3550782.83 233862.83 258.14 5427070.47 5652427.31 5662655.46 80356.32 80366.74 0.95 20 167500.49 1184601.83 171358.16 180.90 3861.91 193187.39 174073.37 2751.37 7125.39 50 3298814.88 4731674.95 423324.96 481.51 2875489.96 3309795.50 3298328.52 46685.76 46683.63 100 5660957.02 5575413.13 429129.54 458.70 5231827.50 5772437.35 5676164.49 80825.89 82236.16 U (0, 1) 0.05 20 167496.29 -195833.01 10207.87 33.11 157288.42 -27470.71 154598.77 7923.98 15136.80 50 3298798.68 2107278.20 24949.04 73.11 3273849.64 2500915.26 3295376.39 47739.91 47857.66 100 5660934.02 2741400.69 25680.55 75.79 5635253.47 3100036.18 5285593.84 98238.89 387980.93 0.50 20 167497.47 83435.89 92937.55 101.16 74559.99 172793.77 168331.89 2583.07 2714.25 50 3298803.25 2661086.35 228322.57 278.93 3070480.69 3314585.80 3296332.61 46650.01 46710.73 100 5660940.51 3550796.98 233863.94 256.32 5427076.57 5607084.86 5660871.21 80101.17 80093.19 0.95 20 167498.66 1184587.54 171358.60 163.87 3863.42 196767.97 174714.72 5874.89 9304.78 50 3298807.82 4731690.36 423325.69 464.88 2875482.17 3316486.10 3303092.85 47168.25 47357.80 100 5660947.00 5575416.01 429130.38 430.23 5231816.64 5757769.91 5661073.43 80290.13 80282.20 E(1) 0.05 20 167496.29 -195784.51 10207.99 32.90 157288.30 -172912.92 161521.55 5217.48 7931.84 50 3298798.69 2107220.10 24949.33 72.48 3273849.36 2380587.55 3295994.41 47385.22 47463.40 100 5660934.04 2741316.61 25680.81 75.25 5635253.23 2825010.54 5194410.73 82821.54 473816.46 0.50 20 167497.98 83442.50 92937.24 102.27 74560.81 175113.10 169606.69 2568.74 3323.21 50 3298805.21 2661099.63 228321.18 282.92 3070484.04 3309978.81 3297248.31 46648.10 46669.41 100 5660943.29 3550827.77 233862.58 258.35 5427080.71 5603733.56 5660975.15 80088.73 80080.73 0.95 20 167504.05 1184595.04 171358.64 163.27 3858.05 182663.10 172254.32 2917.90 5574.72 50 3298828.60 4731683.78 423325.49 458.35 2875503.15 3337299.83 3298567.13 46680.30 46676.36 100 5660976.50 5575416.49 429130.11 422.79 5231846.41 5838226.18 5687421.49 80742.36 84955.07 Obs No: Observation Number; Theo: Theoretical; Cond Q: Conditional Quantile; SD: Standard Deviation; RMSE: Root Mean Squared Error. 31 2.5.2 Real Data Examples In this section, we apply our proposed method to three real world datasets which are publicly available. The first dataset is the Boston Housing dataset which is available in R package MASS (Venables and Ripley, 2002). It contains 506 census tracts of Boston Standard Metropolitan Statistical Area in 1970. There are 13 predictor variables and one response variable, corrected median value of owner-occupied homes (in USD 1000s). 
Predictor variables include per capita crime rate by town, proportion of residential land zoned for lots over 25,000 sq.ft., nitrogen oxide concentration, proportion of owner-occupied units built prior to 1940, full- value property-tax rate per $10,000, and lower status of the population in percent, among others. There is high correlation among some of these predictor variables and the goal here is to determine the best fitting functional form to improve the housing value forecasts. The second dataset is the Gilgais dataset available in R package MASS. This data was collected on a line transect survey in gilgai territory in New South Wales, Australia. Gilgais are repeated mounds and depressions formed on flat land, and many-a-times are regularly distributed. The data collection with 365 sampling locations on a linear grid of 4 meters spacing aims to check if the gilgai patterns are reflected in the soil properties as well. At each of the sampling location, samples were taken at depths 0-10 cm, 30-40 cm and 80-90 cm below the surface. The input variables included pH, electrical conductivity and chloride content and were measured on a 1:5 soil:water extract from each sample. Here, the response variable is e80 (electrical conductivity in mS/cm: 80–90 cm) and we focus on finding the true functional relationship present in the dataset. The third dataset is concrete data which is compiled by Yeh (1998) and is available on UCI machine learning repository. It consists of 1030 records, each containing 8 input features and compressive strength of concrete as an output variable. The input features include the amounts of ingredients in high performance concrete (HPC) mixture which are cement, fly ash, blast furnace slag, water, superplasticizer, coarse aggregate, and fine aggregate. More- 32 over, age of the mixture in days is also included as one of the predictor variable. According to Yeh (1998), the compressive strength of concrete is a highly non-linear function of the given inputs. The central purpose of the study is to predict the compressive strength of HPC using the input variables. In our experiments, we compare the performance of QR, BQR, QRNN, and BQRNN estimates for f (x), the true functional form of the data, in both training and testing data using mean check function (or, mean tilted absolute loss function). The mean check function (MCF) is given as N 1 X MCF = ρτ (yi − fˆ(xi )) N i=1 where, ρτ (.) is defined in (2.3) and fˆ(x) is an estimate of f (x). We resort to this comparison criterion since we don’t have the theoretical conditional quantiles for the data at our disposal. For each dataset, we randomly choose 80% of data points for training the model and then remaining 20% is used to test the prediction ability of the fitted model. Our single hidden- layer neural network has k = 4 hidden layer nodes and the random walk variance is chosen to be 0.012 . These particular choices of the number of hidden layer nodes and random walk step size are based on their optimal performance among several different choices while providing reasonable computational complexity. We perform these analyses for quantiles, τ = (0.05, 0.25, 0.50, 0.75, 0.95), and present the model comparison results for both training and testing data in Table 2.3. It can be seen that our model performs comparably well with QRNN model while out- performing linear models (QR and BQR) in all the datasets. We can see that both QRNN and BQRNN have lower mean check function values for training data than their testing counterpart. 
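For reference, the mean check function used as the comparison criterion above is straightforward to compute; the sketch below assumes NumPy and uses placeholder predictions purely for illustration.

```python
import numpy as np

def check_loss(u, tau):
    """Tilted absolute (check) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def mean_check_function(y, y_pred, tau):
    """Mean check function (MCF) used to compare QR, BQR, QRNN, and BQRNN fits."""
    return np.mean(check_loss(np.asarray(y) - np.asarray(y_pred), tau))

# illustrative values only
y_test = np.array([21.0, 34.5, 19.2])
y_hat = np.array([20.1, 36.0, 18.7])
print(mean_check_function(y_test, y_hat, tau=0.95))
```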
This suggests that neural networks may be overfitting the data while trying to find the true underlying functional form. The model performance of QR and BQR models is inferior compared to neural network models, particularly when the regression relationship is non-linear. Furthermore, our BQRNN model provides uncertainty estimation as a natural byproduct which is not available in the frequentist QRNN model. 33 Table 2.3 Real data applications results. MCF values are reported Noise Quantile Sample QR BQR QRNN BQRNN Boston τ = 0.05 Train 0.3009 0.3102 0.2084 0.1832 Test 0.3733 0.3428 0.3356 0.5842 τ = 0.25 Train 1.0403 1.0431 0.6340 0.6521 Test 1.2639 1.2431 1.0205 1.2780 τ = 0.50 Train 1.4682 1.4711 0.8444 0.8864 Test 1.8804 1.8680 1.4638 1.5882 τ = 0.75 Train 1.3856 1.3919 0.6814 0.7562 Test 1.8426 1.8053 1.3773 1.4452 τ = 0.95 Train 0.5758 0.6009 0.2276 0.2206 Test 0.7882 0.6174 0.8093 0.6880 Gilgais τ = 0.05 Train 3.6610 3.7156 3.0613 2.7001 Test 3.4976 3.2105 2.9137 3.9163 τ = 0.25 Train 13.9794 14.5734 8.6406 8.4565 Test 11.8406 11.4832 9.3298 10.1386 τ = 0.50 Train 18.1627 21.3587 10.4667 10.7845 Test 15.8210 17.2037 13.5297 14.3699 τ = 0.75 Train 13.6598 18.8357 7.9679 7.9905 Test 12.3926 18.2477 9.4711 10.6414 τ = 0.95 Train 3.8703 6.4137 2.3289 2.2508 Test 4.1266 6.4300 3.0280 2.5586 Concrete τ = 0.05 Train 2.9130 4.4500 2.0874 2.0514 Test 2.9891 4.2076 2.2021 2.6793 τ = 0.25 Train 10.0127 14.7174 7.0063 7.0537 Test 9.6451 14.3567 6.9179 7.4069 τ = 0.50 Train 13.1031 19.8559 9.3638 9.3728 Test 12.7387 18.2309 9.8936 10.9172 τ = 0.75 Train 11.5179 17.7680 7.6789 7.3932 Test 10.8299 16.3257 8.5755 9.4147 τ = 0.95 Train 3.9493 6.8747 2.4403 2.5262 Test 3.6489 6.8435 2.7768 4.1369 34 2.6 Conclusion and Discussion This chapter introduces the Bayesian neural network model for quantile estimation in a systematic way. The practical implementation of Gibbs sampling coupled with Metropolis- Hastings updates method have been discussed in detail. The method exploits the location- scale mixture representation of the asymmetric Laplace distribution which makes its imple- mentation easier. The model can be thought as a hierarchical Bayesian model which makes use of independent normal priors for the neural network weight parameters. A future work in this area could be sparsity induced priors to allow for node and layer selection in multi-layer neural network architecture. Further, we have developed asymptotic consistency of the posterior distribution of the neural network parameters. The presented result can be extended to a more general class of prior distributions if they satisfy the Theorem 2.4.4 assumptions. Following the theory developed here, we bridge the gap between asymptotic justifications separately available for Bayesian quantile regression and Bayesian neural network regression. The theoretical argu- ments developed here justify using neural networks for quantile estimation in nonparametric regression problems using Bayesian methods. The proposed MCMC procedure has been shown to work when the number of parameters are relatively low compared to the number of observations. We noticed that the conver- gence of the posterior chains take long time and there is noticeable autocorrelation left in the sampled chains even after burn-in period. We also acknowledge that our random-walk Metropolis-Hastings algorithm has small step size which might lead to slow traversal of the parameter space ultimately raising the computational cost of our algorithm. 
The computa- tional complexity in machine learning methods are well-known. Further research is required in these aspects of model implementation. 35 APPENDICES 36 APPENDIX A LEMMAS FOR POSTERIOR CONSISTENCY PROOF For all the proofs in Appendix A and Appendix B, we assume Xp×1 to be uniformly distributed on [0, 1]p and keep them fixed. Thus, f0 (x) = f (x) = 1. Conditional on X, the univariate response variable Y has asymmetric Laplace distribution with location parameter determined by the neural network. We are going to fix its scale parameter, σ, to be 1 for the posterior consistency derivations. Thus, k ! X 1 Y |X = x ∼ ALD β0 + βj Pp , 1, τ (A.1) j=1 1 + exp(−γ j0 − h=1 γjh x h ) The number of input variables, p, is taken to be fixed while the number of hidden nodes, k, is allowed to grow with the sample size, n. A.1 Requisite Lemmas All the lemmas described below are taken from Lee (2000). Lemma A.1.1. Suppose H[] (u) ≤ log[(Cn2 dn /u)dn ], dn = (p + 2)kn + 1, kn ≤ na and Cn ≤ exp(nb−a ) for 0 < a < b < 1. Then for any fixed constants c, ϵ > 0, and for all sufficiently Rϵp √ large n, 0 H[] (u) ≤ c nϵ2 . Proof. The proof follows from the proof of Lemma 1 from (Lee, 2000, p. 634-635). For Lemmas A.1.2, A.1.3 and A.1.4, we make use of the following notations. From (2.14), recall n Y f (xi , yi ) Rn (f ) = f (x , y ) i=1 0 i i is the ratio of likelihoods under neural network density f and the true density f0 . Fn is the sieve as defined in (2.13) and Aϵ is the Hellinger neighborhood of the true density f0 as in (2.9). 37 Lemma A.1.2. sup Rn (f ) ≤ 4exp(−c2 nϵ2 ) a.s. for sufficiently large n. f ∈Acϵ ∩Fn Proof. Using the outline of the proof of Lemma 2 from (Lee, 2000, p. 635), first we have to bound the Hellinger bracketing entropy using Van Der Vaart and Wellner (1996, Theo- rem 2.7.11 on p.164). Next we use Lemma A.1.1 to show that the conditions of Wong and Shen (1995, Theorem 1 on p.348-349) hold and finally we apply that theorem to get the result presented in the Lemma 2. In our case of BQRNN, we only need to derive first step using ALD density mentioned in (A.1). And rest of the steps follow from the proof given in Lee (2000). As we are looking for the Hellinger bracketing entropy for neural networks, we use L2 norm on the square root of the density functions, f . The L∞ covering number was computed above in (2.15), so here d∗ = L∞ . The version of Van Der Vaart and Wellner (1996, Theorem 2.7.11) that we are interested in is p p If ft (x, y) − fs (x, y) ≤ d∗ (s, t)F (x, y) for some F, then, N[] (2ϵ ∥F ∥2 , F ∗ , ∥.∥2 ) ≤ N (ϵ, Fn , d∗ ) Now let’s start by defining some notations,  ft (x, y) = τ (1 − τ )exp −(y − µt (x))(τ − I(y≤µt (x)) ) , k p X βjt t X t where, µt (x) = β0t + and Aj (x) = γj0 + γjh xh (A.2) j=1 1 + exp(−Aj (x)) h=1  fs (x, y) = τ (1 − τ )exp −(y − µs (x))(τ − I(y≤µs (x)) ) , k p X βjs X where, µs (x) = β0s + s and Bj (x) = γj0 + s γjh xh (A.3) j=1 1 + exp(−Bj (x)) h=1 For notational convenience, we drop x and y from fs (x, y), ft (x, y), µs (x), µt (x), Bj (x), and Aj (x) and denote them as fs , ft , µs , µt , Bj , and Aj . p p ft − fs     p 1 1 = τ (1 − τ ) exp − (y − µt )(τ − I(y≤µt ) ) − exp − (y − µs )(τ − I(y≤µs ) ) 2 2 38 As, τ ∈ (0, 1) is fixed.     1 1 1 ≤ exp − (y − µt )(τ − I(y≤µt ) ) − exp − (y − µs )(τ − I(y≤µs ) ) (A.4) 2 2 2 Now let’s separate above term into two cases when: (a) µs ≤ µt and (b) µs > µt . Further let’s consider case-a and break it into three subcases when: (i) y ≤ µs ≤ µt , (ii) µs < y ≤ µt , and (iii) µs ≤ µt < y. 
Case-a (i) y ≤ µs ≤ µt The (A.4) simplifies to     1 1 1 exp − (y − µt )(τ − 1) − exp − (y − µs )(τ − 1) 2 2 2     1 1 1 = exp − (y − µs )(τ − 1) exp − (µs − µt )(τ − 1) − 1 2 2 2 As first term in modulus is ≤ 1   1 1 ≤ 1 − exp − (µt − µs )(1 − τ ) 2 2 Note: 1 − exp(−z) ≤ z ∀z ∈ R =⇒ |1 − exp(−z)| ≤ |z| ∀z ≥ 0 (A.5) 1 ≤ |µt − µs | (1 − τ ) 4 1 ≤ |µt − µs | 4 1 ≤ |µt − µs | 2 Case-a (ii) µs < y ≤ µt The (A.4) simplifies to     1 1 1 exp − (y − µt )(τ − 1) − exp − (y − µs )τ 2 2 2     1 1 1 = exp − (y − µs )(τ − 1) − 1 + 1 − exp − (y − µs )τ 2 2 2     1 1 1 1 ≤ 1 − exp − (y − µt )(τ − 1) + 1 − exp − (y − µs )τ 2 2 2 2 Let’s use calculus inequality mentioned in (A.5) 1 1 ≤ |(y − µt )(τ − 1)| + |(y − µs )τ | 4 4 39 Both terms are positive so we combine them in one modulus 1 = |(y − µt )(τ − 1) + (y − µt + µt − µs )τ | 4 1 = |(y − µt )(2τ − 1) + (µt − µs )τ | 4 1 ≤ [|(y − µt )| |2τ − 1| + |µt − µs | τ ] 4 Here, |y − µt | ≤ |µt − µs | and |2τ − 1| ≤ 1 1 ≤ |µt − µs | 2 Case-a (iii) µs ≤ µt < y The (A.4) simplifies to     1 1 1 exp − (y − µt )τ − exp − (y − µs )τ 2 2 2     1 1 1 = exp − (y − µt )τ 1 − exp − (µt − µs )τ 2 2 2 As first term in modulus is ≤ 1   1 1 ≤ 1 − exp − (µt − µs )τ 2 2 Using the calculus inequality mentioned in (A.5) 1 ≤ |µt − µs | τ 4 1 ≤ |µt − µs | 4 1 ≤ |µt − µs | 2 We can similarly bound the (A.4) in case-(b) where µs > µt by |µt − µs | /2. Now, p p ft − fs 1 ≤ |µt − µs | 2 Now, let’s substitute µt and µs from A.2 and A.3 k k 1 t X βjt s X βjs = β0 + − β0 − 2 j=1 1 + exp(−Aj ) j=1 1 + exp(−Bj ) 40 " k # 1 X βjt βjs ≤ β0t − β0s + − 2 j=1 1 + exp(−Aj ) 1 + exp(−Bj ) " k # 1 X βjt − βjs + βjs βjs = β0t − β0s + − 2 j=1 1 + exp(−Aj ) 1 + exp(−Bj ) " k k # 1 X βjt − βjs X 1 1 = β0t − β0s + + βjs − 2 j=1 1 + exp(−Aj ) j=1 1 + exp(−Aj ) 1 + exp(−Bj ) Recall that βjs ≤ Cn " k k # 1 X X exp(−Bj ) − exp(−Aj ) ≤ β0t − β0s + βjt − βjs + Cn (A.6) 2 j=1 j=1 (1 + exp(−Aj ))(1 + exp(−Bj ))  exp(−Aj )(1 − exp(−(Bj − Aj ))),  when Bj − Aj ≥ 0 Note: |exp(−Bj ) − exp(−Aj )| = exp(−B )(1 − exp(−(A − B ))),  j j j when Aj − Bj ≥ 0 Using the calculus inequality mentioned in (A.5)  exp(−Aj )(Bj − Aj ), when Bj − Aj ≥ 0  ≤ exp(−Bj )(Aj − Bj ), when Aj − Bj ≥ 0   exp(−A )(B −A ) j j j , when Bj − Aj ≥ 0  exp(−Bj ) − exp(−Aj )  (1+exp(−Aj ))(1+exp(−Bj )) So, ≤ (1 + exp(−Aj ))(1 + exp(−Bj ))   exp(−Bj )(Aj −Bj ) , when Aj − Bj ≥ 0 (1+exp(−Aj ))(1+exp(−Bj )) ≤ |Aj − Bj | Hence we can bound the (A.6) as follows " k k # p p 1 X X ft − fs ≤ β0t − β0s + βjt − βjs + Cn |Aj − Bj | 2 j=1 j=1 Now, let’s substitute Aj and Bj from A.2 and A.3 p p " k k # 1 X X X X ≤ β0t − β0s + βjt − βjs + Cn γj0 t + γjht s xh − γj0 − s γjh xh 2 j=1 j=1 h=1 h=1 p " k k !# 1 X X X ≤ β0t − β0s + βjt − βjs + Cn γj0 t s − γj0 + t |xh | γjh s − γjh 2 j=1 j=1 h=1 41 Recall that |xh | ≤ 1 and w.l.o.g assume Cn > 1 p " k k !# Cn X X X ≤ β0t − β0s + βjt − βjs + t γj0 − γj0s + t γjh s − γjh 2 j=1 j=1 h=1 Cn d ≤ ∥t − s∥∞ 2 Now rest of the steps follow from the proof of Lemma 2 in Lee (2000, p. 635-636). Lemma A.1.3. If there exists a constant r > 0 and N , such that Fn satisfies πn (Fnc ) < R exp(−nr), ∀n ≥ N , then there exists a constant c2 such that Ac Rn (f )dπn (f ) < exp(−nr/2)+ ϵ 2 exp(−nc2 ϵ ) except on a set of probability tending to zero. Proof. The proof is same as the proof of Lemma 3 from (Lee, 2000, p. 636). Lemma A.1.4. Let Kδ be the KL-neighborhood as in (2.12. Suppose that for all δ, ν > 0, ∃ N s.t. πn (Kδ ) ≥ exp(−nν), ∀n ≥ N . 
Then for all ς > 0 and sufficiently large n, Rn (f )dπn (f ) > e−nς except on a set of probability going to zero. R Proof. The proof is same as the proof of Lemma 5 from (Lee, 2000, p. 637). Lemma A.1.5. Suppose that µ is a neural network regression with parameters (θ1 , . . . θd ), and let µ̃ be another neural network with parameters (θ̃1 , . . . θ̃d˜n ). Define θi = 0 for i > d and θ̃j = 0 for j > d˜n . Suppose that the number of nodes of µ is k, and that the number of nodes of µ̃ is k̃n = O(na ) for some a, 0 < a < 1. Let Mς = {µ̃ θi − θ̃i ≤ ς, i = 1, 2, . . . } (A.7) Then for any µ̃ ∈ Mς and for sufficiently large n, sup(µ̃(x) − µ(x))2 ≤ (5na )2 ς 2 x∈X Proof. The proof is same as the proof of Lemma 6 from (Lee, 2000, p. 638-639). 42 APPENDIX B POSTERIOR CONSISTENCY THEOREM PROOFS B.1 Theorem 2.4.4 Proof For the proof of Theorem 2.4.4 and Corollary 2.4.5, we use the following notations. From (2.14), recall that n Y f (xi , yi ) Rn (f ) = i=1 0 f (xi , yi ) is the ratio of likelihoods under neural network density f and the true density f0 . Also, Fn is the sieve as defined in (2.13). Finally, Aϵ is the Hellinger neighborhood of the true density f0 as in (2.9). R By Lemma A.1.3, there exists a constant c2 such that Acϵ Rn (f )dπn (f ) < exp(−nr/2) + R exp(−nc2 ϵ2 ) for sufficiently large n. Next, from Lemma A.1.4, Rn (f )dπn (f ) ≥ exp(−nς) for sufficiently large n. Z Rn (f )dπn (f ) Acϵ P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) = Z Rn (f )dπn (f ) exp − nr  2 + exp(−nc2 ϵ2 ) < exp(−nς)  hr i − ς + exp −nϵ2 [c2 − ς]  = exp −n 2 r Now we pick ς such that for φ > 0, both 2 − ς > φ and c2 − ς > φ. Thus, P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) ≤ exp(−nφ) + exp(−nϵ2 φ) p Hence, P (Acϵ |(X1 , Y1 ), . . . , (Xn , Yn )) → 0. B.2 Corollary 2.4.5 Proof p Theorem 2.4.4 implies that DH (f0 , f ) → 0 where DH (f0 , f ) is the Hellinger distance between f0 and f as in (2.8) and f is a random draw from the posterior. Recall from (2.10), 43 the predictive density function Z fˆn (.) = f (.) dP (f |(X1 , Y1 ), . . . , (Xn , Yn )) gives rise to the predictive conditional quantile function, µ̂n (x) = Qτ,fˆn (y|X = x). We next p show that DH (f0 , fˆn ) → 0, which in turn implies µ̂n (x) converges in L1 -norm to the true conditional quantile function, k X 1 µ0 (x) = Qτ,f0 (y|X = x) = β0 + βj 1 + exp (−γj0 − ph=1 γjh xih ) P j=1 p First we show that DH (f0 , fˆn ) → 0. Let X n = ((X1 , Y1 ), . . . , (Xn , Yn )). For any ϵ > 0: Z ˆ DH (f0 , fn ) ≤ DH (f0 , f ) dπn (f |X n ) By Jensen’s Inequality Z Z n ≤ DH (f0 , f ) dπn (f |X ) + DH (f0 , f ) dπn (f |X n ) A Acϵ Z ϵ Z ≤ ϵ dπn (f |X n ) + DH (f0 , f ) dπn (f |X n ) Aϵ c Aϵ Z ≤ ϵ+ DH (f0 , f ) dπn (f |X n ) Acϵ The second term goes to zero in probability by Theorem 2.4.4 and ϵ is arbitrary, therefore p DH (f0 , fˆn ) → 0. In the remaining part of the proof, for notational simplicity, we take µ̂n (x) and µ0 (x) to be µ̂ and µ̂0 respectively. The Hellinger distance between f0 and fˆn is DH (f0 , fˆn ) Z Z q 2 !1/2 fˆn (x, y) − f0 (x, y) dy dx p = Z Z    1 = τ (1 − τ ) exp − (y − µ̂n )(τ − I(y≤µ̂n ) ) 2  2 !1/2 1 −exp − (y − µ0 )(τ − I(y≤µ0 ) ) dy dx 2  ZZ   1/2 1 1 = 2−2 τ (1 − τ )exp − (y − µ̂n )(τ − I(y≤µ̂n ) ) − (y − µ0 )(τ − I(y≤µ0 ) ) dy dx 2 2 44 1 1 let, T = − (y − µ̂n )(τ − I(y≤µ̂n ) ) − (y − µ0 )(τ − I(y≤µ0 ) ) 2 2  ZZ 1/2 = 2−2 τ (1 − τ )exp (T ) dy dx (B.1) Now let’s break T into two cases: (a) µ̂n ≤ µ0 , and (b) µ̂n > µ0 . 
Case-(a) µ̂n ≤ µ0  µ̂n +µ0       − y− 2 τ, µ̂n ≤ µ0 < y     − y − µ̂n +µ0  (y−µ0 ) µ̂n +µ0  2 τ+ 2 , µ̂n ≤ 2 < y ≤ µ0 T =  µ̂n +µ0  (y−µ̂n ) µ̂n +µ0     − y− 2 (τ − 1) − 2 , µ̂n < y ≤ 2 ≤ µ0    µ̂n +µ0   − y −  (τ − 1), y ≤ µ̂n ≤ µ0 2 Case-(b) µ̂n > µ0  µ̂n +µ0   − y −  τ, µ0 ≤ µ̂n < y   2     µ̂n +µ0  (y−µ̂n ) µ̂n +µ0 − y −  2 τ+ 2 , µ0 ≤ 2 < y ≤ µ̂n T = µ̂n +µ0 (y−µ0 ) µ̂n +µ0       − y− 2 (τ − 1) − 2 , µ0 < y ≤ 2 ≤ µ̂n    µ̂n +µ0   − y −  (τ − 1), y ≤ µ0 ≤ µ̂n 2 Hence now, Z τ (1 − τ )exp (T ) dy Z   = I(µ̂n ≤µ0 ) + I(µ̂n >µ0 ) τ (1 − τ )exp (T ) dy Z ∞     µ̂n + µ0 = I(µ̂n ≤µ0 ) τ (1 − τ ) × exp − y − τ dy µ0 2 Z µ0     µ̂n + µ0 (y − µ0 ) + exp − y − τ+ dy µ̂n +µ0 2 2 2 Z µ̂n +µ0     2 µ̂n + µ0 (y − µ̂n ) + exp − y − (τ − 1) − dy µ̂n 2 2 Z µ̂n      µ̂n + µ0 + exp − y − (τ − 1) dy −∞ 2 45 Z ∞     µ̂n + µ0 + I(µ̂n >µ0 ) τ (1 − τ ) × exp − y − τ dy µ̂n 2 Z µ̂n     µ̂n + µ0 (y − µ̂n ) + exp − y − τ+ dy µ̂n +µ0 2 2 2 Z µ̂n +µ0     2 µ̂n + µ0 (y − µ0 ) + exp − y − (τ − 1) − dy µ0 2 2 Z µ0      µ̂n + µ0 + exp − y − (τ − 1) dy −∞ 2     1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | = exp − τ − exp − (1 − τ ) 1 − 2τ 2 1 − 2τ 2 Substituting the above expression in Equation B.1 we get DH (f0 , fˆn ) equal to,  Z      1/2 1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | 2−2 exp − τ − exp − (1 − τ ) dx 1 − 2τ 2 1 − 2τ 2 p Since DH (f0 , fˆn ) → 0, Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | p exp − τ − exp − (1 − τ ) dx → 1 1 − 2τ 2 1 − 2τ 2 Our next step is to show that above expression implies that |µ̂n − µ0 | → 0 a.s. on a set Ω, R p with probability tending to 1, and hence |µ̂n − µ0 | dx → 0. We are going to prove this using contradiction technique. Suppose that, |µ̂n − µ0 | ↛ 0 a.s. on Ω. Then, there exists an ϵ > 0 and a subsequence µ̂ni such that |µ̂ni − µ0 | > ϵ on a set A with P (A) > 0. Now decompose the integral as Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | exp − τ − exp − (1 − τ ) dx 1 − 2τ 2 1 − 2τ 2 Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | = exp − τ − exp − (1 − τ ) dx A 1 − 2τ 2 1 − 2τ 2 Z      1−τ |µ̂n − µ0 | τ |µ̂n − µ0 | + exp − τ − exp − (1 − τ ) dx Ac 1 − 2τ 2 1 − 2τ 2   (1 − τ )exp(−ϵτ /2) − τ exp(−ϵ(1 − τ )/2) ≤ P (A) + P (Ac ) < 1 | {z } 1 − 2τ | {z } >0 | {z } <1 <1 (max = 1 for ϵ = 0) and strictly ↓ for ϵ∈(0,∞) So we have a contradiction since the integral converges in probability to 1. Thus |µ̂n − µ0 | → R 0 a.s. on Ω. Once we apply Scheffe’s theorem we get |µ̂n − µ0 | dx → 0 a.s. on Ω and hence R p |µ̂n − µ0 | dx → 0. 46 Below we prove the Theorem 2.4.3 and for that we make use of Theorem 2.4.4 and Corollary 2.4.5. B.3 Theorem 2.4.3 Proof We proceed by showing that with Fn as in (2.13), the prior πn of Theorem 2.4.3 satisfies the condition (i) and (ii) of Theorem 2.4.4. The proof of Theorem 2.4.4 condition-(i) presented in Lee (2000, proof of Theorem 1 on p. 639) holds in BQRNN case without any change. Next we need to show that condition-(ii) holds in BQRNN model. Let Kδ be the KL-neighborhood of the true density f0 as in (2.12) and µ0 the corresponding conditional quantile function. We first fix a closely approximating neural network µ∗ of µ0 . We then find a neighborhood Mς of µ∗ as in (A.7) and show that this neighborhood has sufficiently large prior probability. Suppose that µ0 is continuous. For any δ > 0, choose ϵ = δ/2 in theorem from Funahashi (1989, Theorem 1 on p.184) and √ p let µ∗ be a neural network such that sup |µ∗ − µ0 | < ϵ. Let ς = ( ϵ/5na ) = (δ/50)n−a in x∈X Lemma A.1.5. Then following derivation shows us that for any µ̃ ∈ Mς , DK (f0 , f˜) ≤ δ i.e. Mς ⊂ Kδ . 
ZZ f0 (x, y) DK (f0 , f˜) = f0 (x, y) log dy dx f˜(x, y) ZZ   = (y − µ̃)(τ − I(y≤µ̃) ) − (y − µ0 )(τ − I(y≤µ0 ) ) f0 (y|x) f0 (x) dy dx let, T = (y − µ̃)(τ − I(y≤µ̃) ) − (y − µ0 )(τ − I(y≤µ0 ) ) Z Z  = T f0 (y|x) dy f0 (x) dx Now let’s break T into two cases: (a) µ̃ ≥ µ0 , and (b) µ̃ < µ0 . Case-(a) µ̃ ≥ µ0       (µ0 − µ̃)τ, µ0 ≤ µ̃ < y   T = (µ0 − µ̃)τ − (y − µ̃), µ0 < y ≤ µ̃      (µ0 − µ̃)(τ − 1),  y ≤ µ0 ≤ µ̃ 47 Case-(b) µ̃ ≤ µ0   (µ0 − µ̃)τ,   µ̃ ≤ µ0 < y    T = (µ0 − µ̃)(τ − 1) + (y − µ̃), µ̃ < y ≤ µ0     (µ0 − µ̃)(τ − 1),  y ≤ µ̃ ≤ µ0 So now, Z T f0 (y|x) dy Z   = I(µ̃−µ0 ≥0) × (µ̃ − µ0 )(1 − τ )I(y≤µ0 ) − (y − µ̃)I(µ0 µ0 )   +I(µ̃−µ0 <0) × (µ̃ − µ0 )(1 − τ )I(y≤µ0 ) + (y − µ̃)I(µ̃µ0 ) f0 (y|x) dy Z  = (µ̃ − µ0 )(1 − τ )I(y≤µ0 ) − (µ̃ − µ0 )τ I(y>µ0 )  −(y − µ0 + µ0 − µ̃)I(µ0 µ0 |x) = 1 − τ.   = E −(z − b)I(0 0, ∃Nν s.t. πn (Kδ ) ≥ exp(−nν) ∀n ≥ Nν , πn (Kδ ) ≥ πn (Mς ) d˜n Z θi +ς   Y 1 1 2 = p exp − 2 u du i=1 θi −ς 2πσ02 2σ0 d˜n   Y 1 1 2 ≥ 2ς inf p exp − 2 u i=1 u∈[θi −1,θi +1] 2πσ 2 0 2σ0 d˜n s   Y 2 1 = ς 2 exp − 2 ϑi i=1 πσ 0 2σ0 ϑi = max((θi − 1)2 , (θi + 1)2 ) s !d˜n   2 1 ˜ ≥ ς exp − 2 ϑdn where, ϑ = max(ϑ1 , . . . , ϑd˜n ) πσ02 2σ0 i " s # ! δ 1 = exp −d˜n a log n − log − 2 ϑd˜n 25πσ02 2σ0 r δ −a ς= n 50     ϑ ˜ ≥ exp − 2a log n + 2 dn for large n 2σ0     ϑ a ≥ exp − 2a log n + 2 (p + 3)n 2σ0 d˜n = (p + 2)k̃n + 1 ≤ (p + 3)na ≥ exp(−nν) for any ν and ∀n ≥ Nν for some Nν Hence, we have proved that both the conditions of Theorem 2.4.4 hold. The result of Theorem 2.4.3 thereby follows from the Corollary 2.4.5 which is derived from Theorem 2.4.4. Further, we can use similar argument to show that a neural network can approximate 49 any L2 function arbitrarily closely. Note for any L2 function h, ∥h∥2 ≥ ∥h∥1 , so Z  12 Z 2 (µ − µ0 ) dλ(x) < ϵ =⇒ |µ − µ0 | dλ(x) < ϵ Hence, Z DK (f0 , f˜) ≤ |µ̃ − µ0 | f0 (x) dx Z   ≤ sup |µ̃ − µ| + sup |µ − µ0 | f0 (x) dx x∈X x∈X Use Hornik et al. (1989, Theorem 2.4 on p.362) and Lemma A.1.5 Z ≤ [ϵ + ϵ] f0 (x) dx = 2ϵ = δ 50 CHAPTER 3 LAYER ADAPTIVE NODE SELECTION IN BAYESIAN NEURAL NETWORKS 3.1 Introduction Deep learning profoundly impacts science and society due to its impressive empirical success driven primarily by copious amounts of datasets, ever increasing computational resources, and deep neural network’s (DNN) ability to learn task-specific representations (LeCun et al., 2015). The key characteristic of deep learning is that accuracy empirically scales with the size of the model and the amount of training data. As such, large neural network models such as OpenAI GPT-3 (175 Billion) now typify the state-of-the-art across multiple domains such as natural language processing, computer vision, speech recognition etc. Nevertheless deep neural networks do have some drawbacks despite their wide ranging applications. First, this form of model scaling is exorbitantly prohibitive in terms of compu- tational requirements, financial commitment, energy requirements etc. Second, DNNs tend to overfit leading to poor generalization in practice (Zhang et al., 2017). Finally, there are numerous scenarios where training and deploying such huge models is practically infeasible. Examples of such scenarios include federated learning, autonomous vehicles, robotics, recom- mendation systems where models have to be refreshed daily/hourly or in an online manner for optimal performance. A promising direction for addressing these issues while improving the efficiency of DNNs is exploiting sparsity. 
From a practical perspective, it has been well known that neural networks can be sparsified without significant loss in performance (Mozer and Smolensky, 1988; LeCun et al., 1990; Hassibi and Stork, 1993), and there is growing evidence that this is even more so for modern DNNs (Han et al., 2015). Recently, Frankle and Carbin (2019) proposed the lottery ticket (LT) hypothesis, namely that there exist sparse, trainable sub-networks within the larger network which can match the performance of their dense counterpart. To this end, sparsity in DNNs provides a promising way to reduce network complexity by eliminating nonessential connections from a neural network, thereby improving its calibration (Hoefler et al., 2021).

A number of approaches to neural network compression via sparsity have been proposed in the literature (Cheng et al., 2018; Gale et al., 2019). Recent approaches to magnitude-based pruning of neural network weights (Guo et al., 2016; Molchanov et al., 2017; Zhu and Gupta, 2018) provide high model compression rates with minimal accuracy loss, whereas sparse evolutionary training learns sparse neural networks with a fixed parameter budget throughout training based on adaptive sparse connectivity (Mocanu et al., 2018).

A key feature of sparsity in neural networks is its structure on the topology of the neural network weights. Weight pruning approaches achieve high model compression, leading to significant storage cost reduction at test time (Han et al., 2015, 2016; Molchanov et al., 2017; Zhu and Gupta, 2018; Frankle and Carbin, 2019). However, they result in unstructured sparsity in deep neural architectures, which leads to inefficient computational gains in practical setups (Wen et al., 2016). Instead, inducing group sparsity on the collection of incoming weights into a given node (i.e., node selection) reduces the dimensions of the weight matrices per layer, allowing for significant computational savings. To that effect, edge selection and node selection approaches are complementary, with the former leading to storage reduction and the latter leading to computational speedup at the inference stage. Although one may argue that node selection arises as a byproduct of edge selection, we clearly demonstrate that an approach which targets node selection directly leads to lower latency models (a smaller number of nodes per layer) compared to an approach which achieves node selection through edge selection.

Node selection through group sparsity in deep neural networks has been explored in the frequentist setting by Murray and Chiang (2015), Alvarez and Salzmann (2016), Ochiai et al. (2017), Liu et al. (2017), Luo et al. (2017), and Louizos et al. (2018), among others. On the other hand, Louizos et al. (2017), Neklyudov et al. (2017), and Ghosh et al. (2019) incorporate group sparsity via shrinkage priors in the Bayesian paradigm. These group sparsity approaches specifically applied to node selection have shown significant computational speedup and a lower memory footprint at the inference stage. However, all of the proposed methods of neuron selection perform ad-hoc pruning requiring fine-tuned thresholding rules. Moreover, the posterior inference of network weights in Bayesian neural networks (BNN) through standard MCMC methods, e.g., Hamiltonian Monte Carlo (Neal, 1992), does not scale well to modern neural network architectures and large datasets used in practice.
Instead, computationally efficient variational inference, as an alternative to MCMC (Jordan et al., 1999; Blei et al., 2017), has been explored in the context of edge selection both theoretically and numerically by Blundell et al. (2015), Chérief-Abdellatif (2020), and Bai et al. (2020). On the other hand, Louizos et al. (2017) and Ghosh et al. (2019) have explored variational inference for the node selection problem.

In this work, we propose a Gaussian spike-and-slab prior for automatic node selection in Bayesian neural networks, thereby alleviating the need for an ad-hoc thresholding rule for pruning. Further, for scalability, we develop a variational Bayes algorithm for posterior inference of the BNN model parameters in our proposed model and demonstrate its numerical performance on simulated and real regression and classification datasets. Finally, we provide theoretical guarantees for our node selection method under mild restrictions on the network topology.

Related Work. A closely related work to our proposed model is Bai et al. (2020)'s automated edge selection model using a spike-and-slab prior. There, the slab distribution controls the magnitude of the weights and the spike allows for the exact setting of weights to 0. We introduce the spike-and-slab framework for node selection in BNNs and show the key resource efficiency trade-off between node and edge selection at test time. There are two main advantages of node selection over edge selection: (1) fewer parameters to train during optimization, and (2) a structurally compact network leading to computational speedup at test time.

On the theoretical front, sparse BNNs have been studied in the works of Polson and Ročková (2018) and Sun et al. (2021). In the context of variational inference, sparse BNNs have been studied in the recent works of Chérief-Abdellatif (2020) and Bai et al. (2020). All these works concentrate on the problem of edge selection facilitated through the use of Gaussian spike-and-slab priors. In the context of node selection, Ghosh et al. (2019) make use of a regularized horseshoe prior. The main limitations of their approach include (1) the need for fine-tuning of the thresholding rule for node selection, and (2) the lack of a theoretical justification.

Figure 3.1 Sparse neural network with node selection. Sparse deep BNN using spike-and-slab priors achieves node selection in the given dense network on the left, leading to a sparse network on the right.

The only two works which have provided theoretical guarantees for their proposed sparse DNN methods under variational inference are those of Chérief-Abdellatif (2020) and Bai et al. (2020). Since they focus on the problem of edge selection, their theoretical developments are related to the results of Schmidt-Hieber (2020) (see the sieve construction in relation (4) in Schmidt-Hieber (2020)) and are not directly extendable to our setup. Additionally, they assume certain restrictions on the network topology, namely (i) an equal number of nodes in each layer, (ii) a known uniform bound B on all network weights, and (iii) a global sparsity parameter which may not lead to a structurally compact network. Although from a numerical standpoint one may implicitly extend the problem of edge selection to node selection, the theoretical guarantees of node selection consistency in sparse DNNs are not immediate.

Detailed Contributions. 1.
We propose a Gaussian spike-and-slab node selection model and develop a variational Bayes approach for posterior inference of the model parameters. We call our approach SS-IG (Spike-and-Slab Independent Gaussian) model. 2. We derive the variational posterior consistency using a functional space of neural net- works which takes two layer dependent bounds, one which upper bounds the number of neurons in each layer and the other which upper bounds the L1 norm of the weights incident onto each node of a layer. These layer dependent bounds allow the general- ization of the theoretical results presented to guarantee the consistency of any generic shaped network structure. Further, it also guides the calculation of layer-wise prior inclusion probabilities which allow for optimal node recovery per layer in the compu- tational experiments. 3. We measure the computational gains achieved by our approach using layer-wise node sparsities for shallow models and floating point operations in larger models. Our nu- merical results validate the proposed theoretical framework for the node selection in DNN models. These empirical experiments further justify the use of layer-wise node inclusion probabilities to facilitate the optimal node recovery. 3.2 Nonparametric Modeling: Deep Learning Approach Non-parametric modeling assumes an arbitrary relationship between the response and the variables. The term non-parametric does not mean that the value lack inherent parameters, but rather that the parameters are flexible and can vary. In particular, we would like to find a function η0 (·) : Rp → R such that η0 (x) is a good approximation or a representation of y. A standard neural network is, technically speaking, parametric since it has a fixed number of parameters. However, most deep neural networks (DNNs) have thousands or millions of 55 parameters that they could be interpreted as nonparametric. In fact, it has been proven that in the limit of infinite width, a deep neural network can be seen as a Gaussian process, which is a nonparametric model (Lee et al., 2018). Mathematically, let Y ∈ R, X ∈ X be two random variables with the following condi- tional distribution f0 (y|x) = exp [h1 (η0 (x))y + h2 (η0 (x)) + h3 (y)] (3.1) where η0 (·) : X → R is a continuous function satisfying certain regularity assumptions and X is usually a compact subspace of Rp . Note, the functions h1 , h2 , h3 are pre-determined and different choices give rise to different families of generalized linear models. For h1 (u) = u, h2 (u) = − log(1+eu ), h3 (y) = 1, we get the classification model. For h1 (u) = u, h2 (u) = −u2 , h3 (y) = −y 2 /2−log(2π)/2, we get the regression model with σ 2 = 1. Usually X = Rp . Note, x is a feature vector from a marginal distribution PX and y is the corresponding output from Y |X = x in (3.1). Let PX,Y be the joint distribution of (X, Y ). R Let g : X → R be a measurable function, the risk of g is R(g) = Y×X L(Y, g(X))dPX,Y for some loss function L. The Bayes estimator minimizes this risk (Friedman et al., 2009). For regression with squared error loss and classification with 0-1 loss, the optimal Bayes estimators are g ∗ (x) = η0 (x) and g ∗ (x) = 1{η0 (x) ≥ 0} respectively. In practice, Bayes estimator is not useful since the function η0 (x) is unknown. Thus, an estimator is obtained based on the training observations, D = {(x1 , y1 ), ..., (xn , yn )}. A good estimator enjoys universal consistency properties, i.e., its risk approaches Bayes risk as n → ∞ irrespective of PX . 
To find this optimal class, we use Bayesian neural networks, ηθ (x) with θ denoting the network weights, as an approximation to η0 (x). Mathematical Framework For x ∈ Rp , consider a BNN with L hidden layers with k1 , · · · , kL the number of nodes in the hidden layers with k0 = p, kL+1 = 1 (in regression). kL+1 > 1 allows the generalization 56 to Y ∈ Rd , d > 1, thereby providing a handle on multi class classification problems. The total number of parameters is K = Ll=0 kl+1 (kl + 1). With Wl = [wl0 , Wl1 ], let Q ηθ (x) = wL0 + WL1 ψ(wL−1 0 + WL−1 1 ψ(· · · ψ(w10 + W11 ψ(w00 + W01 x)))), (3.2) where ψ is a nonlinear activation function, wl0 are kl+1 × 1 vectors and Wl1 are kl+1 × kl matrices. Using the BNN in (3.2) to approximate the true function η0 (x), conditional probabilities of Y |X = x are fθ (y|x) = exp [h1 (ηθ (x))y + h2 (ηθ (x)) + h3 (y)] . (3.3) Thus, the likelihood function for the data D under the model and the truth is Yn Yn Pθn = fθ (yi |xi ), P0n = f0 (yi |xi ). (3.4) i=1 i=1 3.3 Spike-and-Slab Independent Gaussian Node Selection 3.3.1 Model To allow for automatic node selection, we consider a spike-and-slab prior consisting of a Dirac spike (δ0 ) at 0 and a slab distribution (Mitchell and Beauchamp, 1988). The spike part is represented by an indicator variable which is set to 0 if a node is not present in the network. The slab part comes from a Gaussian distributed random variable. To allow for the layer-wise node selection, we assume that the prior inclusion probability λl varies as a function of the layer index l. The symbol i.d. is used to denote independently distributed random variables. Prior: We assume a spike-and-slab prior of the following form with zlj as the indicator for the presence of j th node in the lth layer i.d.  i.d. wlj |zlj ∼ (1 − zlj )δ0 + zlj N (0, σ02 I) , zlj ∼ Ber(λl )  where l = 0, . . . , L, j = 1, . . . , kl+1 . Also, wlj = (wlj1 , . . . , wljkl +1 ) is a vector of edges incident on the j th node in the lth layer. In the above formula, note δ0 is a Dirac spike 57 vector of dimension kl + 1 with all entries zero and I is the identity matrix of dimension kl + 1 × kl + 1. Furthermore, zlj with j = (1, . . . , kl+1 ) all follow Bernoulli(λl ) to allow for common prior inclusion probability, λl , for each node from a given layer l. We set λL = 1 to ensure no node selection occurs in the output layer. Posterior: With zl = (zl1 , · · · , zlkl+1 ), let z = (z1 , · · · , zL ) denote the vector of all indicator variables. The posterior distribution of (θ, z) given D is given by P n π(θ|z)π(z) P n π(θ|z)π(z) π(θ, z|D) = P R θ n = θ (3.5) z Pθ π(θ|z)π(z)dθ m(D) Qn where Pθn = i=1 fθ (yi |xi ) is the likelihood function as in (3.4), π(z) is the probability mass function of z with respect to the counting measure and π(θ|z) is the conditional probability density function with respect to the Lebesgue measure of θ given z . Further, m(D) is the marginal density of the data and is free of (θ, z). P Let π e(θ) = z π(θ, z) be the marginal prior of θ. We shall use the notation Z Π(A) = e π e(θ)dθ (3.6) A to denote the probability distribution function corresponding to the density function π e. 
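To make the node-wise structure of the prior explicit, the following sketch draws a network from the SS-IG prior and evaluates ηθ(x) as in (3.2). It assumes NumPy; the tanh activation, the layer sizes, and the helper names are illustrative choices and not fixed by the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ssig_layer(k_in, k_out, lam, sigma0):
    """Draw one layer from the SS-IG prior: each row (all weights plus the bias
    feeding into one node) is kept with probability lam and zeroed out otherwise."""
    z = rng.binomial(1, lam, size=k_out)                   # node indicators z_lj ~ Ber(lam)
    W = rng.normal(0.0, sigma0, size=(k_out, k_in + 1))    # slab: N(0, sigma0^2 I)
    return z[:, None] * W                                  # spike sets whole rows to zero

def forward(x, layers, act=np.tanh):
    """Evaluate eta_theta(x) for a network stored as a list of [bias | weights] blocks."""
    h = x
    for W in layers[:-1]:
        h = act(W[:, 0] + h @ W[:, 1:].T)
    W = layers[-1]
    return W[:, 0] + h @ W[:, 1:].T

# toy architecture: p = 4 inputs, hidden layers with (8, 6) nodes, scalar output
sizes = [4, 8, 6, 1]
lams = [0.5, 0.5, 1.0]          # lambda_L = 1, so no node selection in the output layer
layers = [sample_ssig_layer(k_in, k_out, lam, sigma0=1.0)
          for k_in, k_out, lam in zip(sizes[:-1], sizes[1:], lams)]
x = rng.normal(size=(3, 4))     # three input points
print(forward(x, layers).shape)  # -> (3, 1)
```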
The marginal posterior of θ expressed as a function of the marginal prior for θ is X Pθn π e(θ) Pθn π e(θ) π e(θ|D) = π(θ, z|D) = R = z Pθn π e(θ)dθ m(D) Thus, the probability distribution function corresponding to the density function π e(|D) is then given by Z Π(A|D) e = π e(θ|D)dθ (3.7) A Variational family: We posit the following mean field variational family (QMF ) on network weights as n o i.d.  i.d. QMF = wlj |zlj ∼ (1 − zlj )δ0 + zlj N (µlj , diag(σlj2 )) , zlj ∼ Ber(γlj )  for l = 0, . . . , L, j = 1, . . . , kl+1 . This ensures that weight distributions follow spike-and-slab structure which allows for node sparsity through variational approximation. Further, the 58 weight distributions conditioned on the node indicator variables are all independent of each other (hence use of the term mean field family). The variational distribution of parameters obtained post optimization will then inherently prune away redundant nodes from each layer. Also, Gaussian distribution for slab component is widely popular for approximating neural network weight distributions (Blundell et al., 2015; Louizos et al., 2017; Bai et al., 2020). Additionally, µlj = (µlj1 , . . . , µljkl +1 ) and σlj2 = (σlj1 2 2 , . . . , σljkl +1 ) denote the vectors of variational mean and standard deviation parameters of the edges incident on the j th node in the lth layer. Similarly, γlj denotes the variational inclusion probability of the j th node in the lth layer. We set γLj = 1 to ensure no node selection occurs in the output layer. Variational posterior: Variational posterior aims to reduce the Kullback-Leibler (KL) distance between a variational family and the true posterior (Blei and Lafferty, 2007; Hinton and Van Camp, 1993) as π ∗ = argmin dKL (q, π(|D)) (3.8) q∈QMF where dKL (q, π(|D)) denotes the KL-distance between q and π(|D). Note, the variational member q can be written as q(θ, z) = q(θ|z)q(z) where q(z) is the probability mass function of z with respect to the counting measure and q(θ|z) is the con- ditional density function given with respect to the Lebesgue measure of θ given z. Further, XZ ∗ π = argmin [log q(θ, z) − log π(θ, z|D)]q(θ, z)dθ q∈QMF z ! XZ = argmin [log q(θ, z) − log π(θ, z, D)]q(θ, z)dθ + log m(D) q∈QMF z = argmin [−ELBO(q, π(|D))] + log m(D) = argmax ELBO(q, π(|D)) (3.9) q∈QMF q∈QMF Since log m(D) is free from q, it suffices to maximize the evidence lower bound (ELBO) above. e∗ (θ) = π ∗ (θ|z)π ∗ (z) then π e∗ denotes the marginal variational posterior for θ. P Let π z 59 We shall use the notation Z e∗ Π (A) = e∗ (θ)dθ π (3.10) A to denote the probability distribution function corresponding to the density function π e∗ . 3.3.2 Algorithm Evidence Lower Bound. The ELBO presented in (3.9) is given by L = −Eq [log Pθn ] + dKL (q, π) which is further simplified as − Eq [log Pθn ] + dKL (q, π)   = −Eq(θ|z)q(z) [log Pθn ] + dKL q(θ|z)q(z), π(θ|z)π(z) X = −Eq(θ|z)q(z) [log Pθn ] + dKL (q(zlj )||π(zlj )) l,j Xh + q(zlj = 1)dKL (q(wlj |zlj = 1)||π(wlj |zlj = 1)) l,j i + q(zlj = 0)dKL (q(wlj |zlj = 0)||π(wlj |zlj = 0)) X = −Eq(θ|z)q(z) [log Pθn ] + dKL (q(zlj )||π(zlj )) l,j X + q(zlj = 1)dKL (q(wlj |zlj = 1)||π(wlj |zlj = 1)) l,j X = −Eq(θ|z)q(z) [log Pθn ] + dKL (q(zlj )||π(zlj )) l,j X + q(zlj = 1)dKL (N (µlj , diag(σlj2 ))||N (0, σ02 I)) l,j The KL of discrete variables appearing in the above expression creates a challenge in practical implementation. Jang et al. (2017) and Maddison et al. (2017) proposed to re- place discrete random variable with its continuous relaxation. 
Specifically, the continuous relaxation is achieved through the Gumbel-softmax (GS) distribution; that is, $q(z_{lj}) \sim \text{Ber}(\gamma_{lj})$ is approximated by $q(\tilde{z}_{lj}) \sim \text{GS}(\gamma_{lj}, \tau)$, where

$$\tilde{z}_{lj} = \big(1 + \exp(-\eta_{lj}/\tau)\big)^{-1}, \qquad \eta_{lj} = \log\big(\gamma_{lj}/(1 - \gamma_{lj})\big) + \log\big(u_{lj}/(1 - u_{lj})\big), \qquad u_{lj} \sim U(0, 1),$$

and $\tau$ is the temperature. We set $\tau = 0.5$ for this work (also see Section 5 in Bai et al. (2020)). $\tilde{z}_{lj}$ is used in the backward pass for easier gradient calculation, while $z_{lj}$ is used for selecting nodes in the forward pass. We use a non-centered parameterization for the Gaussian slab variational approximation, where $N(\mu_{lj}, \mathrm{diag}(\sigma_{lj}^2))$ is reparameterized as $\mu_{lj} + \sigma_{lj} \odot \zeta_{lj}$ for $\zeta_{lj} \sim N(0, I)$, where $\odot$ denotes the entry-wise (Hadamard) product.

Algorithm 3.1 Variational inference in SS-IG Bayesian neural networks
1: Inputs: training dataset, network architecture, and optimizer tuning parameters.
2: Model inputs: prior parameters for θ, z.
3: Variational inputs: number of Monte Carlo samples S.
4: Output: variational parameter estimates of network weights and sparsity.
5: Method: set initial values of the variational parameters.
6: repeat
7:   Generate S samples from ζ_lj ∼ N(0, I) and u_lj ∼ U(0, 1)
8:   Generate S samples of (z_lj, z̃_lj) using u_lj
9:   Use μ_lj, σ_lj, ζ_lj and z_lj to compute the loss (ELBO) in the forward pass
10:  Use μ_lj, σ_lj, ζ_lj and z̃_lj to compute the gradient of the loss in the backward pass
11:  Update the variational parameters with the gradient of the loss using a stochastic gradient descent algorithm (e.g. Adam (Kingma and Ba, 2015))
12: until change in ELBO < ϵ

3.4 Theoretical Results

In this section, we develop the theoretical consistency of the variational posterior in (3.10) in the context of node selection. Previous works that establish the statistical consistency of sparse deep neural networks do so only in the context of edge selection. To this end, the works of Polson and Ročková (2018), Chérief-Abdellatif (2020), and Bai et al. (2020) use several results from the pioneering work of Schmidt-Hieber (2020). In addition to node selection consistency, we also relax certain network restrictions considered in the previous works. These restrictions include (1) an equal number of nodes in each layer, which prevents one from using any prior information on the number of nodes in the deep neural architecture; (2) a known bound B on all the neural network weights, since these works essentially rely on the sieve construction in equation 3 of Schmidt-Hieber (2020), which assumes that the L∞ norm of all entries of θ is smaller than 1; and (3) a global sparsity parameter s, which does not always yield structurally sparse networks.

Towards the proof, firstly, our sieve construction allows the number of nodes of the neural network to vary as a function of the layer. Secondly, instead of a global sparsity parameter s (see the sieve construction in relation (4) of Schmidt-Hieber (2020)), we allow for a layer-wise sparsity vector s to account for the number of nodes in each layer. Finally, we relax the assumption of a known bound B by considering a sieve with a layer-wise constraint (denoted by the vector B) on the L1 norm of the incoming edges of a node. Thus, our work extends the current literature along three directions: (1) it theoretically quantifies the predictive performance of Bayesian neural networks with node-based pruning; (2) it establishes that, even without a fixed bound on the network weights, one can recover the true solution by an appropriate choice of the prior; and (3) it provides layer-wise node inclusion probabilities to allow for structurally sparse solutions.
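Before developing the theory, the following sketch ties the pieces of Sections 3.3.1 and 3.3.2 together by showing how one optimization step of Algorithm 3.1 might be organized in PyTorch. It is a simplified illustration under stated assumptions: a single linear output layer, one Monte Carlo sample, a squared-error term standing in for $-\log P_\theta^n$, the relaxed indicator used in both passes (unlike Algorithm 3.1's hard/soft split), and illustrative shapes and learning rate.

```python
import torch

def kl_bernoulli(gamma, lam, eps=1e-8):
    # d_KL(Ber(gamma) || Ber(lam)) summed over nodes
    return (gamma * torch.log((gamma + eps) / (lam + eps))
            + (1 - gamma) * torch.log((1 - gamma + eps) / (1 - lam + eps))).sum()

def kl_gaussian(mu, sigma, sigma0=1.0):
    # d_KL(N(mu, diag(sigma^2)) || N(0, sigma0^2 I)), one value per node
    return 0.5 * ((sigma**2 + mu**2) / sigma0**2 - 1
                  - 2 * torch.log(sigma / sigma0)).sum(dim=1)

def one_step(x, y, mu, rho, phi, lam, opt, tau=0.5):
    """One negative-ELBO step for a single layer with node selection.

    mu, rho: variational mean / pre-softplus scale, shape (k_out, k_in)
    phi:     logit of the variational inclusion probability, shape (k_out, 1)
    """
    opt.zero_grad()
    sigma = torch.nn.functional.softplus(rho)
    gamma = torch.sigmoid(phi)
    # Gumbel-softmax relaxed node indicators
    u = torch.rand_like(gamma)
    z_soft = torch.sigmoid((torch.log(gamma / (1 - gamma))
                            + torch.log(u / (1 - u))) / tau)
    # non-centered reparameterization of the slab
    w = mu + sigma * torch.randn_like(mu)
    pred = x @ (z_soft * w).t()                 # rows of pruned nodes are (softly) zeroed
    nll = 0.5 * ((y - pred) ** 2).sum()         # stands in for -log P_theta^n
    kl = kl_bernoulli(gamma, lam) + (gamma.squeeze(1) * kl_gaussian(mu, sigma)).sum()
    loss = nll + kl
    loss.backward()
    opt.step()
    return loss.item()

# toy usage with hypothetical shapes
k_in, k_out, n = 5, 3, 32
mu  = torch.zeros(k_out, k_in, requires_grad=True)
rho = torch.full((k_out, k_in), -6.0, requires_grad=True)
phi = torch.full((k_out, 1), 2.0, requires_grad=True)
opt = torch.optim.Adam([mu, rho, phi], lr=1e-3)
x, y = torch.randn(n, k_in), torch.randn(n, k_out)
print(one_step(x, y, mu, rho, phi, lam=torch.tensor(0.5), opt=opt))
```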
The relaxation of these network structure assumptions requires us to provide the framework for node selection including appropriate sieve construction together with the derivation of the results in Schmidt-Hieber (2020) customized to our problem. To establish the posterior contraction rates, we show that the variational posterior in (3.8) concentrates in shrinking Hellinger neighborhoods of the true density function P0 with overwhelming probability. Since X ∼ U [0, 1]p , thus f0 (x) = fθ (x) = 1. This further implies P0 = f0 (y|x)f0 (x) = f0 (y|x) and similarly Pθ = fθ (y|x). We next define the Hellinger neighborhood of the true density P0 as Hε = {θ : dH (P0 , Pθ ) < ε} where the Hellinger distance between the true density function P0 and the model density Pθ is Z p 1 p 2 d2H (P0 , Pθ ) = fθ (y|x) − f0 (y|x) dydx 2 We also define the KL neighborhood of the true density P0 as Nε = {θ : dKL (P0 , Pθ ) < ε} where the KL distance dKL between the true density function P0 and the model density Pθ 62 is Z f0 (y|x) dKL (P0 , Pθ ) = log f0 (y|x)dydx fθ (y|x) Let k = (k0 , · · · , kL+1 ) be the node vector, W l = (wl1 ⊤ ⊤ , · · · , wlk l+1 )⊤ be the row represen- tation of W l and w el = (||wl1 ||1 , · · · , ||wlkl+1 ||1 ) be the vector of L1 norms of the rows of W l . Next we consider layer-wise sparsity, s = (s1 , · · · , sL ) for node selection. Similarly, we consider layer-wise norm constraints, B = (B1 , · · · , BL ) on L1 norms of weights including bias incident onto any given node in each layer. Based on s and B, we define the following sieve of neural networks (check definition A.1.1). F(L, k, s, B) = {ηθ ∈ (3.2) : ||w el ||0 ≤ sl , ||w el ||∞ ≤ Bl } . (3.11) The construction of a sieve is one of the most important tools towards the proof of consistency in infinite-dimensional spaces. In the works of Schmidt-Hieber (2020), Polson and Ročková (2018), Chérief-Abdellatif (2020) and Bai et al. (2020), the sieve in the context of edge selection is given by F(L, k, s) = {ηθ ∈ (3.2) : ||θ||0 ≤ s, ||θ||∞ ≤ 1} . which works with an overall sparsity parameter s. In addition, note the L∞ norm of all the entries in θ is assumed to be known constant equal to 1 (see relation (4) in Schmidt-Hieber (2020) and section 4 in Polson and Ročková (2018)). Section 3 in Bai et al. (2020) does not explicitly mention the dependence of their sieve on some fixed bound B on the edges in a network, however, their derivations on covering numbers (see proof of Lemma 1.2 in the supplement of Bai et al. (2020)) borrow results from Schmidt-Hieber (2020) which is based on sieve with B = 1. Consider any sequence ϵn . For Lemmas 3.4.1 and 3.4.2, we use the sieve F(L, k, s, B) in (3.11) with s = s◦ and B = B ◦ where s◦l + 1 = nϵ2n /( Lj=0 uj ) and log Bl◦ = (nϵ2n )/((L + P 1) Lj=0 (s◦j + 1)) with ul = (L + 1)2 (log n + log(L + 1) + log kl+1 + log(kl + 1)). Note, s◦l and P Bl◦ do not depend on l. 63 Lemma 3.4.1 below holds when the covering number (check definition A.1.2) of the func- tions which belong to the sieve F(L, k, s◦ , B ◦ ) is well under control. Lemma 3.4.2 below states that for the same choice of the sieve, the prior gives sufficiently small probabilities on the complement space F(L, k, s◦ , B ◦ )c (see the discussion under Theorem 3.4.4 for more details). For the subsequent results, the symbol Ac is used to denote complement of a set A. Lemma 3.4.1 (Existence of Test Functions). Let ϵn → 0 and nϵ2n → ∞. 
There exists a testing function ϕ ∈ [0, 1] and constants C1 , C2 > 0, EP0 (ϕ) ≤ exp{−C1 nϵ2n } sup EPθ (1 − ϕ) ≤ exp{−C2 nd2H (P0 , Pθ )} θ∈Hϵcn ,ηθ ∈F (L,k,s◦ ,B ◦ ) where Hϵn = {θ : dH (P0 , Pθ ) ≤ ϵn } is the Hellinger neighborhood of radius ϵn . PL Lemma 3.4.2 (Prior mass condition.). Let ϵn → 0, nϵ2n → ∞ and nϵ2n / l=0 ul → ∞, then for Πe as in (3.6) and some constant C3 > 0, XL Π(F(L, e k, s◦ , B ◦ )c ) ≤ exp(−C3 nϵ2n / ul ) l=0 Whereas Lemmas 3.4.1 and 3.4.2 work with a specific choice of the sieve, the following Lemma 3.4.3 is developed for any generic choice of sieve indexed by s and B. The final piece of the theory developed next tries to addresses two main questions (1) Can we get a sparse network solution whose layer-wise sparsity levels and L1 norms of incident edges (including the bias) of the nodes are controlled at levels s and B respectively? (2) Does this sparse network retain the same predictive performance as the original network? In this direction, let ξ = minηθ ∈F (L,k,s,B) ||ηθ − η0 ||2∞ Based on the values s and B, we also define X L L X 2 ϑl = Bl /(kl + 1) + log Bm + L + log kl+1 + log(kl + 1) + log n + log( um ) m=0,m̸=l m=0 64 rl = sl (kl + 1)ϑl /n (3.12) Lemma 3.4.3 has two sub conditions. Condition 1. requires that shrinking KL neigh- borhood of the true density function P0 gets sufficiently large probability. This along with Lemma 3.4.1 and 3.4.2 is an essential condition to guarantee the convergence of the true posterior in (3.5). Condition 2. is the assumption needed to control the KL distance be- tween true posterior and variational posterior and thereby guarantees the convergence of the variational posterior in (3.8) (see the discussion under Theorem 3.4.4 for more details). PL → 0 and n( Ll=0 rl +ξ) → P Lemma 3.4.3 (Kullback-Leibler conditions). Suppose l=0 rl +ξ ∞ and the following two conditions hold for the prior Π e in (3.6) and some q ∈ QMF   X L 1. Π N Ll=0 rl +ξ ≥ exp(−C4 n( e P rl + ξ)) l=0 XZ XL 2. dKL (q, π) + n dKL (P0 , Pθ )q(θ, z)dθ ≤ C5 n( rl + ξ) z l=0 where π is the joint prior of (θ, z), q is the joint variational distribution of (θ, z) and NPLl=0 rl +ξ is the KL neighborhood of radius Ll=0 rl + ξ. P The following result shows that the variational posterior is consistent as long as Lemma 3.4.1, Lemma 3.4.2 and Lemma 3.4.3 hold. The proof of Theorem 3.4.4 demonstrates how the validity of these three lemmas imply variational posterior consistency. Theorem 3.4.4. Suppose Lemma 3.4.3 holds and Lemmas 3.4.1 and 3.4.2 hold for ϵn = qP ( Ll=0 rl + ξ) Ll=0 ul . Then for some slowly increasing sequence Mn → ∞, Mn ϵn → 0 P and Πe ∗ as in (3.10), e ∗ (Hc Π Mn ϵn ) → 0, n→∞ in P0n probability where HM c n ϵn = {θ : dH (P0 , Pθ ) ≤ Mn ϵn } is the Hellinger neighborhood of radius Mn ϵn . Note, the above contraction rate depends mainly on two quantities rl and ξ. Note rl controls the number of nodes in the neural network. If the network is not sparse, then rl is 65 kl+1 (kl + 1)ϑl /n instead of sl (kl + 1)ϑl /n which can in turn make the convergence of ϵn → 0 difficult. On the other hand, if sl and Bl are too small, it will cause ξ to explode since a good approximation to the true function may not exist in a very sparse space. Remark (Rates as a function of n). Let L ∼ O(log n), Bl2 ∼ O(kl + 1) and sl (kl + 1) = O(n1−2ϱ ), for some ϱ > 0, then one can work with ϵn = n−ϱ log3 (n) as long as ξ = O(n−2ϱ log2 (n)). The exact expression of ϱ is determined by the degree of smoothness of the function η0 . Proof of Theorem 3.4.4 Discussion. 
To further enunciate Lemmas 3.4.1 and 3.4.2 con- R sider the quantity E1n = Hc (Pθn /P0n )e π (θ)dθ as used in the following proof. Here, E1n can Mn ϵn be split into two parts Z Z E1n = (Pθn /P0n )e π (θ)dθ + (Pθn /P0n )e π (θ)dθ HMc ∩F (L,k,s◦ ,B ◦ ) c HM ∩F (L,k,s◦ ,B ◦ )c n ϵn n ϵn Whereas Lemma 3.4.1 provides a handle on the first term by controlling the covering number of the sieve F(L, k, s◦ , B ◦ ), Lemma 3.4.2 gives a handle on the second term by controlling Π(F(L, e k, s◦ , B ◦ )c ) (for more details we refer to Lemma A.2.6 in the Appendix A). R Next, consider the quantity E2n = log (Pθn /P0n )e π (θ)dθ in the following proof. Lemma 3.4.3 part 1. provides a control on this term (see Lemma A.2.7 in the the Appendix A for P R more details). Finally, consider the quantity E3n = dKL (q, π) + z log(P0n /Pθn )q(θ, z)dθ in the following proof. Indeed Lemma 3.4.3 part 2. provides a control on this term (see Lemma A.2.8 in the Appendix A for further details). Proof. Let Π e and Π e ∗ be as in (3.7) and (3.10) respectively. Now, e∗ (θ) e∗ (θ) Z Z π π dKL (e ∗ π ,πe(|D)) = π ∗ e (θ) log dθ + e∗ (θ) log π dθ A π e(θ|D) Ac π e(θ|D) e∗ (θ) e∗ (θ) Z Z ∗ π π e(θ|D) ∗ c π π e(θ|D) = −Π (A) e log ∗ dθ − Π (A ) e log ∗ dθ A Π e ∗ (A) π e (θ) Ac Π e ∗ (Ac ) π e (θ) e∗ e∗ c ≥Π e ∗ (A) log Π (A) + Π e ∗ (Ac ) log Π (A ) , Jensen’s inequality Π(A|D) e e c |D) Π(A 66 where the above lines hold for any set A. Since Π(A|D) e ≤ 1, ≥Π e ∗ (A) log Π e ∗ (A) + Π e ∗ (Ac ) log Πe ∗ (Ac ) − Π e ∗ (Ac ) log Π(A e c |D) ≥ −Π e ∗ (Ac ) log Π(A e c |D) − log 2, (∵ x log x + (1 − x) log(1 − x) ≥ − log 2) Z Z ! = −Π e ∗ (Ac ) log (Pθn /P0n )eπ (θ)dθ − log (Pθn /P0n )e π (θ)dθ − log 2 A c | {z } | {z } E1n E2n The above representation is similar to the proof of Theorems 3.1 and 3.2 in Bhattacharya and Maiti (2021). For any q ∈ QMF , −Πe ∗ (Ac )E1n ≤ dKL (e π∗, πe(|D)) − Π e ∗ (Ac )E2n + log 2 ≤ dKL (π ∗ , π(|D)) − Π e ∗ (Ac )E2n + log 2 by Lemma A.2.3 ≤ dKL (q, π(|D)) − Π e ∗ (Ac )E2n + log 2 π ∗ is the KL minimizer XZ Pn e ∗ (Ac ))E2n + log 2 ≤ dKL (q, π) + log 0n q(θ, z)dθ +(1 − Π z P θ | {z } E3n = E3n + (1 − Π e ∗ (Ac ))E2n + log 2 (3.13) where the fourth inequality in the above equation follows since dKL (q, π(|D)) XZ = (log q(θ, z) − log Pθn − log π(θ, z) + log m(D))q(θ, z)dθ z XZ XZ = (log q(θ, z) − log π(θ, z))q(θ, z)dθ + (log P0n − log Pθn )q(θ, z)dθ |z {z } z dKL (q,π) + log m(D) − log P0n | {z } E2n where m(D) is the marginal distribution of data as in (3.5). c Take A = HM n ϵn = {θ : dH (P0 , Pθ ) > Mn ϵn } If Lemma 3.4.1 and Lemma 3.4.2 hold, then by Lemma A.2.6, E1n ≤ −nCMn2 ϵ2n / P ul for any Mn → ∞ with high probability. 67 If Lemma 3.4.3 condition 1 holds, then by Lemma A.2.7, E2n ≤ nMn ( Ll=0 rl + ξ) for any P Mn → ∞ with high probability. If Lemma 3.4.3 condition 2 holds, then by Lemma A.2.8, E3n ≤ nMn ( Ll=0 rl + ξ) for any P Mn → ∞ with high probability. Therefore, by (3.13), we get L L nCMn2 ϵ2n e ∗ c  X X P Π HMn ϵn ≤ nMn ( rl + ξ) + nMn ( rl + ξ) + log 2 ul l=0 l=0 XL X L X L ≤ nMn ( rl + ξ) + nMn ( rl + ξ) + Mn ( rl + ξ) l=0 l=0 l=0 PL P l=0 rl + ξ)  3Mn ( ul =⇒ Π e ∗ HM c ≤ n ϵn C1 Mn2 ϵ2n qP L P Taking ϵn = l=0 (rl + ξ) ul and noting Mn → ∞, the proof follows. We next give conditions on the prior probabilities λl and σ0 to guarantee that Lemmas 3.4.1, 3.4.2 and 3.4.3 hold. This in turn implies the conditions of Theorem 3.4.4 hold and variational posterior is consistent. Corollary 3.4.5. 
Let $\sigma_0^2 = 1$ and $-\log \lambda_l = \log(k_{l+1}) + C_l (k_l + 1)\vartheta_l$; then the conditions of Theorem 3.4.4 hold and $\widetilde{\Pi}^*$ as in (3.10) satisfies

$$\widetilde{\Pi}^*\big(\mathcal{H}_{M_n\epsilon_n}^c\big) \to 0, \qquad n \to \infty,$$

in $P_0^n$ probability, where $\mathcal{H}_{M_n\epsilon_n} = \{\theta : d_H(P_0, P_\theta) \leq M_n\epsilon_n\}$ is the Hellinger neighborhood of radius $M_n\epsilon_n$.

The proof of the corollary is provided in Appendix A. In this corollary, note that our expression for the prior inclusion probability varies as a function of $l$, thereby providing a handle on layer-wise sparsity. Indeed, using these expressions in the numerical studies further substantiates the theoretical framework developed in this section.

Remark (Optimal Contraction). For a fixed choice of $k$, the optimal contraction rate is achieved at $s^\star, B^\star = \mathrm{argmin}_{s,B}\, (\sum_l r_l + \xi)$. Thus, $s^\star$ and $B^\star$ are the optimal values of $s$ and $B$ which give the best sparse network with minimal loss in the true accuracy. The corresponding probability expressions in Corollary 3.4.5 can be accordingly modified by setting $s = s^\star$ and $B = B^\star$ in the expressions of $\vartheta_l$ and $r_l$ in (3.12).

3.5 Numerical Experiments

In this section, we present several numerical experiments to demonstrate the performance of our spike-and-slab independent Gaussian (SS-IG) Bayesian neural networks, which we implement in PyTorch (Paszke et al., 2019). Further, to evaluate the efficacy of the variational inference, we benchmark our model on synthetic as well as real datasets. Our numerical investigation justifies the proposed choices of prior hyperparameters, specifically the layer-wise prior inclusion probabilities, which in turn substantiates the significance of our theoretical developments. With a fully Bayesian treatment, we are also able to quantify the uncertainties in the parameter estimates, and variational inference helps scale our model to large network architectures as well as complex datasets.

We compare our sparse model with a node selection technique, horseshoe BNN (HS-BNN) (Ghosh et al., 2019), and an edge selection technique, spike-and-slab BNN (SV-BNN) (Bai et al., 2020), in the second simulation study and the UCI regression dataset examples. We use the optimal choices of prior parameters and fine-tuning parameters provided by the authors of HS-BNN and SV-BNN for their respective models. Further, we compare our model against the dense variational BNN model (VBNN) (Blundell et al., 2015) in all of the experiments. Since it has no sparse structure, it serves as a baseline allowing us to check whether sparsity compromises accuracy.

In all the experiments, we fix $\sigma_0^2 = 1$ and $\sigma_e^2 = 1$. For our model, the choices of layer-wise $\lambda_l$ follow from Corollary 3.4.5: $\lambda_l = (1/k_{l+1})\exp(-C_l(k_l+1)\vartheta_l)$. We take $C_l$ values equal to negative powers of 10, chosen so that the prior inclusion probabilities do not fall below $10^{-50}$; otherwise, $\lambda_l$ values close to 0 might prune away all the nodes from a layer (see Appendix B for more discussion). The remaining tuning parameter details, such as the learning rate, minibatch size, and initial parameter choices, are provided in Appendix B. The prediction accuracy is calculated using the variational Bayes posterior mean estimator with 30 Monte Carlo samples in the testing phase.

Node sparsity estimates. In our experiments, we provide node sparsity estimates for each hidden layer separately. For all models, the node sparsity in a given hidden layer is the ratio of the number of neurons with at least one nonzero incoming edge to the original number of neurons present in that layer before training.
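As a concrete illustration of this metric, the following small helper computes the layer-wise node sparsity as defined above for a PyTorch linear layer. It is a sketch under stated assumptions: the bias is counted among the incoming edges of a node, and the `tol` threshold for "nonzero" is an assumption of the illustration.

```python
import torch

def node_sparsity(weight, bias=None, tol=0.0):
    """Fraction of output neurons with at least one nonzero incoming edge.

    weight: (k_out, k_in) tensor of a linear layer; bias: optional (k_out,).
    A neuron counts as active if any incoming weight (or its bias)
    has absolute value greater than tol.
    """
    active = (weight.abs() > tol).any(dim=1)
    if bias is not None:
        active |= bias.abs() > tol
    return active.float().mean().item()

# toy usage: two of four neurons fully pruned
w = torch.tensor([[0.0, 0.0], [1.2, -0.3], [0.0, 0.0], [0.5, 0.0]])
print(node_sparsity(w))   # 0.5
```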
The layer-wise node sparsity estimates give a clear picture of the structural compactness of the trained model at test time. A structurally compact trained model has lower latency at the inference stage.

3.5.1 Simulation Study - I

We consider a two-dimensional regression problem where the true response $y_0$ is generated by sampling $X$ from $U([-1, 1]^2)$ and feeding it to a deep neural network with known parameters. We add random Gaussian noise with $\sigma = 5\%\,\sqrt{\mathrm{Var}(y_0)}$ to $y_0$ to obtain the noisy outputs $y$. We create the dataset using a shallow neural network consisting of 2 inputs, one hidden layer with 2 nodes, and 1 output (a 2-2-1 network). We train our SS-IG model and the VBNN model using a single hidden layer network with 20 neurons in the hidden layer and a sigmoid activation. Each model is trained until convergence. We found that both models give competitive predictive performance while fitting the given data.

In Figure 3.2 we plot the magnitudes of the incoming weights into the hidden layer nodes using boxplots. Our model, with the help of the spike-and-slab prior, is able to prune away redundant nodes not required for fitting the model. Since VBNN is densely connected, all the nodes remain active in its final model. From this experiment, it is clear that neural networks can be pruned into more compact models at the inference stage without compromising accuracy. We also performed the same experiment with a wider neural network consisting of 100 nodes in the single hidden layer and provide the results in Appendix B. There again we show that our model can easily recover the sparse solution with competitive performance.

Figure 3.2 Simulation study I results. Node-wise weight magnitudes recovered by VBNN (panel a) and the proposed SS-IG model (panel b) on the synthetic regression data generated using the 2-2-1 network. The boxplots show the distribution of incoming weights into a given hidden layer node.

3.5.2 Simulation Study - II

We consider a nonlinear regression example and generate the data from the following model:

$$y = \frac{7x_2}{1 + x_1^2} + \sin(x_3 x_4) + 2x_5 + \varepsilon,$$

where $\varepsilon \sim N(0, 1)$. Further, all the covariates are i.i.d. $N(0, 1)$ and independent of $\varepsilon$. We generated 3000 observations to create the training data for the experiment. An additional 1000 observations were generated for testing. We modeled this data using a 2-hidden-layer neural network with 20 neurons per hidden layer. The sigmoid activation function is used for each model in the comparative analysis.

Table 3.1 provides the RMSEs on the train and test datasets as well as the layer-wise node sparsity estimates for the SS-IG, SV-BNN, HS-BNN, and VBNN models. Our model is extremely effective at pruning redundant nodes, which leads to the most compact model compared to the other sparse models, SV-BNN and HS-BNN. Moreover, it exhibits lower root mean squared error (RMSE) values on test data among the sparse models while showing similar predictive performance compared to the densely connected VBNN. This experiment further underscores the major benefit of our proposed approach: generating very compact models, which can reduce computation time and memory usage at the inference stage.

Table 3.1 Simulation study II results. Performance of the proposed SS-IG, SV-BNN, HS-BNN, and VBNN models, where each model was trained for 10k epochs with learning rate $5 \times 10^{-3}$. Means and standard deviations of the RMSE values and the median sparsity estimates were calculated from the last 1000 epochs (with a jump of 10, giving a sample of 100). The sparsity estimates are given as a tuple of 2 values representing the layer-1 and layer-2 node sparsities.
Model     Train RMSE        Test RMSE         Sparsity Estimate
SS-IG     1.2087 ± 0.0490   1.1947 ± 0.0587   (0.35, 0.05)
SV-BNN    1.2897 ± 0.0323   1.2760 ± 0.0363   (0.45, 0.35)
HS-BNN    1.2580 ± 0.0305   1.2436 ± 0.0394   (1.00, 1.00)
VBNN      1.1661 ± 0.0335   1.1614 ± 0.0349   NA

3.5.3 UCI Regression Datasets

We apply our model to traditional UCI regression datasets (Dua and Graff, 2017) and contrast our performance against the SV-BNN, HS-BNN, and VBNN models. We follow the protocol proposed by Hernandez-Lobato and Adams (2015) and train a single hidden layer neural network with sigmoid activations. For the smaller datasets - Concrete, Wine, Power Plant, and Kin8nm - we take 50 nodes in the hidden layer, while for the larger datasets - Protein and Year - we take 100 nodes in the hidden layer. We split the data randomly while maintaining a 9:1 train-test ratio in each case, and for the smaller datasets we repeat this procedure 20 times. For the Protein data we perform 5 repetitions, while for the Year data we use a single random split (more details in Appendix B). For the comparative analysis, we benchmark against SV-BNN, HS-BNN, and VBNN. Moreover, the VBNN test RMSEs serve as the baseline for each dataset.

Table 3.2 summarises our results, including the sparsity estimate representing the hidden layer-1 node sparsity (since there is only one hidden layer in the networks considered). We achieve lower RMSEs compared to SV-BNN and HS-BNN on the Power Plant, Kin8nm, and Year datasets, and in the other cases we achieve comparable RMSE values. In all the datasets, our predictive performance is close to the dense baseline of VBNN.

Table 3.2 UCI regression datasets results.

                             Test RMSE                                        Sparsity Estimate
Dataset      n (k0)        SS-IG       SV-BNN      HS-BNN      VBNN          SS-IG       SV-BNN
Concrete     1030 (8)      7.92±0.68   8.22±0.70   5.34±0.53   7.34±0.62     0.42±0.06   0.98±0.02
Wine         1599 (11)     0.66±0.05   0.65±0.05   0.66±0.05   0.64±0.05     0.18±0.05   0.87±0.04
Power Plant  9568 (4)      4.28±0.20   4.32±0.19   4.34±0.18   4.27±0.17     0.18±0.03   0.24±0.03
Kin8nm       8192 (8)      0.09±0.00   0.11±0.01   0.10±0.00   0.09±0.00     0.43±0.04   0.47±0.04
Protein      45730 (9)     4.85±0.05   4.93±0.06   4.59±0.02   4.78±0.06     0.81±0.03   0.93±0.03
Year         515345 (90)   8.68±NA     8.78±NA     9.33±NA     8.67±NA       0.71±NA     0.78±NA

We provide node sparsity estimates for our SS-IG model and for SV-BNN. HS-BNN was not able to achieve a sparse structure, which is consistent with the results provided in the appendix of Ghosh et al. (2019). In contrast to HS-BNN, our model sparsifies the network during training without requiring an ad-hoc pruning rule. Table 3.2 demonstrates that our approach uniformly achieves better sparsity than SV-BNN. In particular, the Concrete and Wine datasets show the high compressive ability of our model over SV-BNN, leading to very compact models for inference.

3.5.4 Image Classification Datasets

Here, we benchmark the empirical performance of our proposed SS-IG method on network architectures and image classification datasets used in practice.

Baselines. We compare our model against the VBNN model, which serves as a dense baseline to gauge the trade-off between predictive performance and sparsity. Moreover, to highlight the complementary behavior in memory and computational efficiency of node selection compared to edge selection achieved via the Bayesian spike-and-slab prior framework, we compare our model against the edge selection model, SV-BNN.

Network architectures. We consider 2 neural network model architectures: (i) a multi-layer perceptron (MLP), and (ii) LeNet-5-Caffe.
In the MLP model, we take 2 hidden layers with 400 neurons in each layer. The output layer has 10 neurons since there are 10 classes in both datasets. Next, the LeNet-5-Caffe model has 2 convolutional layers with 20 and 50 feature maps, respectively, with filter size 5×5 for both layers. In the SS-IG model, for convolutional layers, we prune output channels (analogous to neurons in linear layers) using our spike-and-slab prior, where each output channel is assigned a Bernoulli variable to collectively prune the parameters incident on that channel. We apply a 2×2 max pooling layer after each convolutional layer. The flattened feature layer after the second convolutional layer has size 4 × 4 × 50 = 800 and serves as input to the fully connected block, where there are 2 hidden layers with 800 and 500 neurons, respectively. The output layer has 10 neurons.

Datasets. We apply each network architecture to 2 image classification datasets: (i) MNIST, a dataset of 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9, and (ii) Fashion-MNIST (Xiao et al., 2017), a dataset of 60,000 small square 28×28 pixel grayscale images of items of 10 types of clothing. We preprocess the images in the MNIST data by dividing their pixel values by 126. In the Fashion-MNIST data, we horizontally flip images at random with probability 0.5.

Metrics. We quantify the predictive performance using the accuracy on the test data (MNIST and Fashion-MNIST). Besides the test accuracy, we evaluate our model against SV-BNN using metrics that relate to model compression and computational complexity. First, the compression ratio is the ratio of the number of nonzero weights in the compressed network versus the dense model and is an indicator of the storage cost at test time. Next, we present layer-wise node sparsities in the MLP experiments to highlight the computational speedups at test time. In the LeNet-5-Caffe experiments, we provide the floating point operations (FLOPs) ratio, which is the ratio of the number of FLOPs required to predict y from x at test time in the compressed network versus its dense counterpart. We have detailed the FLOPs calculation for neural networks in Appendix B.

Nonlinear activation. We use swish activations (Elfwing et al., 2018; Ramachandran et al., 2017) instead of ReLUs in our proposed SS-IG model to avoid the dying neuron problem (Lu et al., 2020). Specifically, in large-scale datasets, turning off a node with more than 100 incoming edges adversely impacts the training process of ReLU networks. Smoother activation functions such as sigmoid, tanh, and swish help alleviate this problem. We choose swish since it has the best performance. For VBNN and SV-BNN, we use ReLU activations as recommended by their authors.

MLP Experiments

The results of the MLP network experiments on MNIST and Fashion-MNIST are presented in Figure 3.3. We provide the test data accuracy, model compression ratio, and layer-wise node sparsities for each experiment. In the MLP/MNIST experiment (Figures 3.3a - 3.3d), we observe that the VBNN and SS-IG models only require ∼400 epochs to achieve stable predictive performance (Figure 3.3a). In contrast, SV-BNN degrades slightly after 600 epochs and takes longer to achieve convergence in layer-wise node sparsities compared to our approach (Figures 3.3c and 3.3d).
Moreover, for the SS-IG model, we observe that as we start to learn the sparse network, the model shows peak test accuracy when most of the nodes are present; the accuracy starts to drop as we learn a sparser network and ultimately stabilizes when the node sparsities converge. Furthermore, SV-BNN has a better model compression ratio (Figure 3.3b) in this experiment at the expense of lower predictive performance. Our method prunes off ∼80% of the first hidden layer nodes and ∼90% of the second hidden layer nodes at the expense of ∼2% accuracy loss due to sparsification compared to the dense VBNN.

In the MLP/Fashion-MNIST experiment (Figures 3.3e - 3.3h), we observe that the VBNN model takes ∼200 epochs and our model takes ∼600 epochs for convergence. The SV-BNN model takes longer to achieve convergence in layer-wise node sparsities (Figures 3.3g and 3.3h). We also observe the complementary behavior of our model and SV-BNN in memory and computational efficiency, where our model achieves better layer-wise node sparsities and SV-BNN has a better model compression ratio (Figure 3.3f), with both models having similar predictive performance (Figure 3.3e). Furthermore, our method prunes off ∼90% of the first hidden layer nodes and ∼92% of the second hidden layer nodes at the expense of ∼3% accuracy loss due to sparsification compared to the densely connected VBNN.

Figure 3.3 MLP/MNIST and MLP/Fashion-MNIST experiments results. The first two rows, panels (a)-(d), show the MLP on MNIST results (test accuracy, compression ratio, layer-1 node sparsity, layer-2 node sparsity); the bottom two rows, panels (e)-(h), show the corresponding MLP on Fashion-MNIST results.

LeNet-5-Caffe Experiments

The results of the more complex LeNet-5-Caffe network experiments on MNIST and Fashion-MNIST are presented in Figure 3.4. We provide the test data accuracy, model compression ratio, and FLOPs ratio for each experiment over 1200 epochs. Here, the FLOPs ratio serves as a collective indicator of the layer-wise node sparsities since FLOPs are directly related to how many neurons or channels remain in the linear or convolutional layers, respectively.

In the LeNet-5-Caffe/MNIST experiment (Figures 3.4a - 3.4c), we observe that our model has better predictive accuracy than SV-BNN (Figure 3.4a). Moreover, we achieve 10% more reduction in FLOPs (Figure 3.4c) compared to SV-BNN, whereas SV-BNN achieves better model compression than our approach (Figure 3.4b). Lastly, our method is able to reduce the FLOPs of the model during inference at test time by 90% at the expense of ∼0.5% accuracy loss due to sparsification compared to the densely connected VBNN.

Figure 3.4 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results. The top row, panels (a)-(c), shows the LeNet-5-Caffe/MNIST results (test accuracy, compression ratio, FLOPs ratio); the bottom row, panels (d)-(f), shows the corresponding LeNet-5-Caffe/Fashion-MNIST results.

In the LeNet-5-Caffe/Fashion-MNIST experiment (Figures 3.4d - 3.4f), we observe that both SS-IG and SV-BNN have similar test accuracies at convergence (Figure 3.4d). However, our model has 40% fewer FLOPs (Figure 3.4f) at the inference stage compared to SV-BNN, which again achieves better model compression (Figure 3.4e).
This highlights the complementary nature of our method of node selection that leads to a structurally sparse model with sig- nificantly lower (almost 5 times) FLOPs compared to weight pruning approach, SV-BNN, which induces unstructured sparsity in the pruned network leading to significant model com- pression with low storage cost. Lastly, our method leads to a sparse model with only 8% of the FLOPs as compared to VBNN at the expense of ∼ 3% accuracy loss underscoring the trade-off between predictive accuracy and sparsity. 3.6 Conclusion and Discussion In this chapter, we have proposed sparse deep Bayesian neural networks using spike-and- slab priors for optimal node recovery. Our method incorporates layer-wise prior inclusion probabilities and recovers underlying structurally sparse model effectively. Our theoretical developments highlight the conditions required for the posterior consistency of the variational posterior to hold. With layer-wise characterisation of prior inclusion probabilities we show that the proposed sparse BNN approximations can achieve predictive performance compa- rable to dense networks. Our results relax the constraints of equal number of nodes and uniform bounds on weights thereby achieving optimal node recovery on more generic neural network structure. The closeness of a true function to the topology induced by layer-wise node distribution depends on the degree of smoothness of the true underlying function. In this work, this has not been studied in depth and forms a future direction for work. 78 Note, in contrast to previous works, our work assumes a spike-and-slab prior on the entire vector of incoming weights and bias onto a node. We underscore the fact that node selection has complementary behavior with edge selection approaches as established by our empirical experiments. Node selection offers significant computational speedup whereas edge selection achieves significant model compression at test-time. The demonstration of the efficacy of our node selection approach opens the avenue for exploration of sophisticated group sparsity priors for node selection. Our detailed experiments show the subnetwork selection ability of our method which underscores the notion that deep neural networks can be heavily pruned without losing predictive performance. The experiment with convolution neural network (LeNet-5-Caffe) highlights the generalizability of our approach from mere multi layer perceptron to complex deep learning models. Although our method performs model reduction while maintaining predictive power, some further improvements may be obtained by choosing the number of layers in a data-driven fashion and can be a part of future work. 79 APPENDICES 80 APPENDIX A PROOFS OF SS-IG THEORETICAL RESULTS A.1 Definitions Definition A.1.1 (Sieve). Consider a sequence of function classes F1 ⊆ F2 ⊆ · · · ⊆ Fn ⊆ Fn+1 ⊆ · · · ⊆ F, where ∀f ∈ F, ∃ fn ∈ Fn s.t. d(f, fn ) → 0 as n → ∞ where d(., .) is some pseudo-metric on F. More precisely, ∪∞ n=1 Fn is dense in F. Then, Fn is called a sieve space of F with respect to the pseudo-metric d(., .), and the sequence {fn } is called a sieve (Grenander, 1981). Definition A.1.2 (Covering number). Let (V, ||.||) be a normed space, and F ⊂ V . Then, {V1 , · · · , VN } is an ε−covering of F if F ⊂ ∪N i=1 B(Vi , ε), or equivalently, ∀ ϱ ∈ F, ∃ i such that ||ϱ − Vi || < ε. The covering number of F denoted by N (ε, F, ||.||) = min{n : ∃ ε − covering over F of size n} (Pollard, 1991). A.2 General Lemmas Lemma A.2.1. 
Let g1 and g2 be any two density functions. Then Eg1 (|log(g1 /g2 )|) ≤ dKL (g1 , g2 ) + 2/e Proof. Refer to Lemma 4 in Lee (2000). Lemma A.2.2. For any K > 0, let a, a0 ∈ [0, 1]K such that K P PK 0 k=1 ak = k=1 ak = 1, then the KL divergence between mixture densities K P PK 0 0 k=1 ak gk and k=1 ak gk is bounded as K K ! K X X X dKL a0k gk0 , ak gk ≤ dKL (a0 , a) + a0k dKL (gk0 , gk ) k=1 k=1 k=1 Proof. Refer to Lemma 6.1 in Chérief-Abdellatif and Alquier (2018). 81 Lemma A.2.3. dKL (eπ∗, πe(|D)) ≤ dKL (π ∗ , π(|D)) Proof. Using Lemma A.2.2 with a0 = π ∗ (z), a = π(z|D), g 0 = π ∗ (θ|z) and g = π(θ|z, D), we get X X π∗, π dKL (e e(|D)) = dKL ( π ∗ (θ|z)π ∗ (z), π(θ|z, D)π(z|D)) z z X ≤ dKL (π ∗ (z), π(z|D)) + dKL (π ∗ (θ|z), π(θ|z, D))π ∗ (z) z = dKL (π ∗ (θ, z), π(θ, z|D)) = dKL (π ∗ , π(|D)) Lemma A.2.4. For any 1-Lipschitz continuous activation function ψ such that ψ(x) ≤ x ∀x ≥ 0, "L !s l # X X Y Bl N (δ, F(L, k, s, B), ||.||∞ ) ≤ ··· QL kl+1 s∗L ≤sL s∗0 ≤s0 l=0 δBl /(2(L + 1)( j=0 Bj )) where N denotes the covering number. Proof. Given a neural network η(x) = vL + WL ψ(vL−1 + WL−1 ψ(vL−2 + WL−2 ψ(· · · ψ(v1 + W1 ψ(v0 + W0 x))) for l ∈ {1, · · · , L}, we define A+ p l η : [0, 1] → R , kl A+l η(x) = ψ(vl−1 + Wl−1 ψ(vl−2 + Wl−2 ψ(· · · ψ(v1 + W1 ψ(v0 + W0 x))) and A− l η : R kl−1 → RkL+1 , A−l η(y) = vL + WL ψ(vL−1 + WL−1 ψ(· · · ψ(vl + Wl ψ(vl−1 + Wl−1 y))) The above framework is also used in the proof of lemma 5 in Schmidt-Hieber (2020). Next, − Ql−1 set A+0 η(x) = AL+2 η(x) = x and further note that for η ∈ F(L, k), |Al η(x)|∞ ≤ + j=0 Bj 82 where k = (p, k1 , · · · , kL , kL+1 ) and kL+1 = 1. Next, we derive upper bound on Lipschitz constant of A− l η. − − |WL A+ + + L η(x1 ) − WL AL η(x2 )|∞ = |Al η(Al−1 η(x1 )) − Al η(Al−1 η(x2 ))|∞ + (A.1) QL l.h.s. is bounded above by j=0 Bj and r.h.s consists of composition of Lipschitz functions A− + l η and Al−1 η with C1 and C2 being corresponding Lipschitz constants. So we can bound r.h.s. by, |A− + − + l η(Al−1 η(x1 )) − Al η(Al−1 η(x2 ))|∞ ≤ C1 C2 ||x1 − x2 ||∞ ∀x1 , x2 ∈ Rp If we choose x1 = x ∈ [0, 1]p and x2 = 0 then, |A− + − + l η(Al−1 η(x)) − Al η(Al−1 η(0))|∞ ≤ C1 C2 ∀x ∈ [0, 1]p Ql−2 Since C2 is Lipschitz constant for A+ l−1 η and we know that |Al−1 η|∞ ≤ + j=0 Bj . So we Ql−2 get C2 ≤ 2 j=0 Bj . We use this in above expression, l−2 Y |A− + − + l η(Al−1 η(x)) − Al η(Al−1 η(0))|∞ ≤ 2C1 Bj ∀x ∈ [0, 1]p (A.2) j=0 QL Next we know that l.h.s. of (A.2) can be bounded above by 2 j=0 Bj because of (A.1). So we get bound on Lipschitz constant of A− l η, Yl−2 YL Y L 2C1 Bj ≤ 2 Bj =⇒ C1 ≤ Bj j=0 j=0 j=l−1 ∗ Let η, η ∗ ∈ F(L, k, s, B) be two neural networks with W l = (vl , Wl ) and W l = (vl∗ , Wl∗ ) ∗ respectively. Here, we define δ l using the L1 norms of the rows of D l = W l − W l as follows ⊤ ⊤ D l = (dl1 , · · · , dlkl+1 )⊤ δ l = (||dl1 ||1 , · · · , ||dlkl+1 ||1 ) We choose η, η ∗ such that ||δ l ||∞ ≤ ζBl . This also means that all parameters in each layer of these two networks are at most ζBl distance away from each other. Then, we can bound the absolute difference between these two neural networks by, |η(x) − η ∗ (x)| 83 L+1 X ≤ |A− + ∗ − ∗ l+1 η(ψ(vl−1 + Wl−1 Al−1 η (x))) − Al+1 η(ψ(vl−1 + Wl−1 Al−1 η (x)))| ∗ + ∗ l=1 L+1 L ! X Y ∗ ∗ ∗ ∗ ≤ Bj ||ψ(vl−1 + Wl−1 A+ + l−1 η (x)) − ψ(vl−1 + Wl−1 Al−1 η (x))||∞ l=1 j=l L+1 L ! X Y ∗ ∗ ∗ ≤ Bj ||vl−1 − vl−1 + (Wl−1 − Wl−1 )A+l−1 η (x))||∞ l=1 j=l L+1 Y L ! X ∗ ≤ Bj ||δ l−1 ||∞ ||A+ l−1 η (x))||∞ l=1 j=l L+1 L ! l−2 L ! 
X Y Y Y ≤ Bj ζBl−1 Bj = ζ(L + 1) Bj (A.3) l=1 j=l j=0 j=0 kl+1 sl  Recall that we have at most kl number of nodes in each layer and there are sl ≤ kl+1 combinations of nodes to choose sl active nodes in the given layer. Since supremum norm of L1 norms of the rows of Wl is bounded above by Bl in our family of neural networks F(L, k, s, B) so we can discretize these L1 norms with grid size δBl /(2(L + 1)( Lj=0 Bj )) Q and obtain upper bound on covering number as follows "L !s l # X X Y Bl N (δ, F(L, k, s, B), ||.||∞ ) ≤ ··· QL kl+1 ∗ sL ≤sL ∗ s0 ≤s0 l=0 δB l /(2(L + 1)( j=0 B j )) L L ! !(sl +1) Y Y ≤ 2δ −1 (L + 1) Bj kl+1 (A.4) l=0 j=0 Lemma A.2.5. Let θ ∗ = arg min |ηθ − η0 |2∞ and W fl = supi ||wli − w∗li ||1 , then for any θ∈F (L,k,s,B) density q = Lj=0 q(θj ), Q Z X L Z L Y Z ||ηθ − ηθ∗ ||22 q(θ)dθ ≤ c2j−1 fj2 qj (θj )dθj W fm + Bm )2 q(θ)dθ (W j=0 m=j+1 L Xj−1 Z L Z X Y +2 cj−1 c j ′ −1 W fj (W fj + Bj )qj (θj )dθj fm + Bm )2 q(θ)dθ (W j=0 j ′ =0 m=j+1 Z j−1 Y Z × W fj ′ qj ′ (θj ′ )dθj ′ (W fm + Bm )q(θ)dθ (A.5) m=j ′ +1 Qj−1 where cj−1 ≤ m=0 Bm . 84 Proof. Let ηθl be the partial networks defined as  ηθ0 (x) := ψ(W0 x + v0 ),         ηθl (x) := ψ(Wl ηθl−1 (x) + vl ),    ηθL (x) := WL ηθL−1 (x) + vL .   Similar to the proof of theorem 2 in Chérief-Abdellatif (2020), define φl (θ) = sup sup |ηθl (x)i − ηθl ∗ (x)i |. x∈[0,1]p 1≤i≤kl+1 We next show by induction X l φl (θ) ≤ fj cj−1 Rl W j+1 j=0 Ql where we define cl = max(supx∈[0,1]p sup1≤i≤kl+1 |ηθl ∗ (x)i |, 1), c0 = 1, Rj+1 l = m=j+1 (Wm f + Bm ). Claim: cl ≤ Bl cl−1 . Note cl ≤ sup sup (|wli∗ ⊤ ηθl−1 ∗ (x)| + |vli |) x∈[0,1]p 1≤i≤kl+1 Xkl ∗ ≤ sup sup ( |wlij ||ηθl−1 ∗ (x)j | + |vli |) x∈[0,1]p 1≤i≤kl+1 j=1 Xkl ∗ ≤ sup (cl−1 |wlij | + cl−1 |vli |) 1≤i≤kl+1 j=1 ≤ cl−1 sup ||w∗li ||1 = Bl cl−1 1≤i≤kl+1 where the above result holds since supi ||w∗li ||1 ≤ Bl . Next, Xkl φl (θ) ≤ sup sup ( |wlij ηθl−1 (x)j − wlij ηθ∗ (x)j | + |vli − vli∗ |) ∗ l−1 x∈[0,1]p 1≤i≤kl+1 j=1 Xkl ∗ l−1 ≤ sup sup ( |wlij ηθl−1 (x)j − wlij ηθ (x)j | x∈[0,1]p 1≤i≤kl+1 j=1 + |wlij ∗ l−1 ηθ (x)j − wlij ηθ∗ (x)j | + |vli − vli∗ |) ∗ l−1 Xkl ∗ ≤ sup sup ( |wlij − wlij ||ηθl−1 (x)j | x∈[0,1]p 1≤i≤kl+1 j=1 85 X kl ∗ ∗ + |wlij ||ηθl−1 (x)j − ηθl−1 ∗ (x)j | + |vli − vli |) j=1 Xkl ∗ ≤ sup sup ( |wlij − wlij ||ηθl−1 (x)j − ηθl−1 ∗ (x)j | x∈[0,1]p 1≤i≤kl+1 j=1 X kl ∗ ∗ + |wlij − wlij ||ηθl−1 ∗ (x)j | + |vli − vli |) + φl−1 (θ)Bl j=1 ≤W fl (φl−1 (θ) + cl−1 ) + φl−1 (θ)Bl = φl−1 (θ)(W fl + Bl ) + cl−1 Wfl Now applying recursion we get φl (θ) ≤ (φl−2 (θ)(W fl−1 + Bl−1 ) + cl−2 W fl−1 )(W fl + Bl ) + cl−1 W fl = φl−2 (θ)(W fl + Bl )(W fl−1 + Bl−1 ) + cl−2 W fl−1 (W fl + Bl ) + cl−1 W fl Repeating this we get Y l X l Yl φl (θ) ≤ φ0 (θ) (Wj + Bj ) + f cj−1 Wj f (Wfj + Bj ) j=1 j=1 u=j+1 Y l X l Yl =W f0 (W fj + Bj ) + B1 · · · Bj−1 W fj (Wfj + Bj ) j=1 j=1 u=j+1 X l Y l Xl = B1 · · · Bj−1 W fj (W fj + Bj ) = fj cj−1 Rl W j+1 j=0 u=j+1 j=0 Z Z Z ||ηθ − ηθ∗ ||22 q(θ)dθ ≤ ||ηθ − ηθ∗ ||2∞ q(θ)dθ = φ2L (θ)q(θ)dθ Z X L = ( fj cj−1 RL )2 q(θ)dθ W j+1 j=0 L Z L X j−1 Z X X = c2j−1 fj2 (Rj+1 W L )2 q(θ)dθ +2 cj−1 cj ′ −1 W fj W L fj ′ Rj+1 RjL′ +1 q(θ)dθ j=0 j=0 j ′ =0 L Z L !2 X Y = c2j−1 fj2 W (Wfm + Bm ) q(θ)dθ j=0 m=j+1 L X j−1 Z L L X Y Y +2 cj−1 cj ′ −1 W fj W fj ′ (W fm + Bm ) (W fm + Bm )q(θ)dθ j=0 j ′ =0 m=j+1 m=j ′ +1 QL The proof follows by noting q(θ) = j=0 q(θj ). 86 Lemma A.2.6. Suppose Lemma 3.4.1 and Lemma 3.4.2 in the Section 3.4 hold, with dom- inating probability Pθn Cnϵ2n Z log π(θ)dθ ≤ − P0n P Hϵcn ul PL PL Proof. 
Let Fn = F(L, k, s◦ , B ◦ ), s◦l + 1 = nϵ2n / j=0 uj , log Bl◦ = nϵ2n /((L + 1) ◦ j=0 (sj + 1)) and Hϵn = {θ : dH (P0 , Pθ ) < ϵn } is the Hellinger neighborhood of size ϵn Pθn Pθn Pθn Z Z Z e(θ)dθ ≤ π π e(θ)dθ + π e(θ)dθ Hϵcn P0n Hϵcn ∩Fn P0n Fnc P0 n Pθn (C0 /2)nϵ2n Z   ≤ e(θ)dθ + exp − P π Hϵcn ∩Fn P0n ul where the last inequality follows from Lemma 3.4.2 because by Markov’s inequality Pθn (C0 /2)nϵ2n Z   PP0n n e(θ)dθ > exp − P π Fnc P0 ul 2 n   Z  (C0 /2)nϵn Pθ ≤ exp P EP0n n π e(θ)dθ ul Fnc P0 (C0 /2)nϵ2n e c (C0 /2)nϵ2n     ≤ exp P Π(Fn ) = exp − P →0 ul ul Further, Pθn Pθn Pn Z Z Z e(θ)dθ ≤ π ϕ nπ e(θ)dθ + (1 − ϕ) θn π e(θ)dθ Hϵcn ∩Fn P0n Hϵcn ∩Fn P0 Hϵcn ∩Fn P0 | {z } | {z } T1 T2 Next, borrowing steps from proof of theorem 3.1 in Pati et al. (2018), we have EP0n (ϕ) ≤ exp(−C1 nϵ2n ), thus for any C1′ < C1 , ϕ ≤ exp(−C1′ nϵ2n ) with probability at least 1 − exp(−(C1 − C1′ )nϵ2n ). Thus, T1 ≤ exp(−C1′ nϵ2n )T1 + T2 which implies with dominating probability T1 ≤ T2 . Thus, it only remains to show T2 ≤ exp(−C2′ (nϵ2n )/( ul )) for some C2′ > 0. This is true since P C2 nϵ2 nϵ2 C2 nϵ2 Z − P n C2 P un P n PP0n (T2 > e ul )≤e l EP0n (T2 ) ≤ e ul EPθ (1 − ϕ)e π (θ)dθ Hϵcn ∩Fn 87 C2 nϵ2 Z P n 2 ≤e ul e−C2 ndH (P0 ,Pθ ) π e(θ)dθ Hϵcn ∩Fn C2 nϵ2 Z P n −C2 nϵ2n X ≤e ul e e(θ)dθ ≤ exp(−C2′ nϵ2n / π ul ) Hϵcn ∩Fn Therefore, for sufficiently large n and C = min(C0 /2, C2′ )/2 Pθn Z X X X ′ 2 2 2 n π e (θ)dθ ≤ 2exp(−C 2 nϵn / u l ) + exp(−(C 0 /2)nϵ n / u l ) ≤ exp(−Cnϵ n / ul ) Hϵcn P0 Lemma A.2.7. Suppose Lemma 3.4.3 part 1. in the Section 3.4 holds, then for any Mn → ∞ , with dominating probability, P0n Z X log π (θ)dθ ≤ nM n ( rl + ξ) Pθn e Proof. By Markov’s inequality,  Z n  Z n P0 X 1 Pθ PP0n log n e(θ) ≥ nMn ( π rl + ξ) ≤ EP0n log π e(θ)dθ P0n P Pθ nMn ( rl + ξ) Z Z n 1 Pθ = P log n e(θ)dθ P0n dµ π nMn ( rl + ξ) P0   1 n ∗ 2 ≤ P dKL (P0 , L ) + nMn ( rl + ξ) e where L∗ = Pθn π R e(θ)dθ and the last inequality follows from Lemma A.2.1. ! P0n P0n   n ∗ dKL (P0 , L ) = EP0n log R n ≤ EP0n log R Pθ π e(θ)dθ NP rl +ξ θ P nπe(θ)dθ Z Z ≤ π e(θ)dθ + dKL (P0n , Pθn )eπ (θ)dθ Jensen’s inequality NP rl +ξ NP rl +ξ P X X ≤ − log e−nC( rl +ξ) + n( rl + ξ) = n(C + 1)( rl + ξ) where the last inequality follows from Lemma 3.4.3 part 1. in the Section 3.4. The proof follows by noting C/Mn → 0. Lemma A.2.8. Suppose Lemma 3.4.3 part 2. in the Section 3.4 holds, then for any Mn → ∞ , with dominating probability, XZ P0n X dKL (q, π) + log q(θ, z)dθ ≤ nM n ( rl + ξ) z Pθn 88 Proof. By Markov’s inequality we have ! XZ P0n X PP0n dKL (q, π) + q(θ, z) log n dθ > nMn ( rl + ξ) z Pθ ! 1 XZ P0n ≤ P dKL (q, π) + EP0n q(θ, z) log n dθ nMn ( rl + ξ) z Pθ !! 1 XZ Pθn ≤ P dKL (q, π) + EP0n q(θ, z) log n dθ nMn ( rl + ξ) z P0 ! XZ P0n n Z 1 = P dKL (q, π) + q(θ, z) log n P0 dµdθ nMn ( rl + ξ) z Pθ By Lemma A.2.1, we get XZ   ! 1 2 ≤ P dKL (q, π) + q(θ, z) dKL (P0n , Pθn ) + dθ nMn ( rl + ξ) z e ! 1 XZ 2 = P dKL (q, π) + n q(θ, z)dKL (P0 , Pθ )dθ + nMn ( rl + ξ) z e C  X  = P n( rl + ξ) + (2/e) → 0 nMn ( rl + ξ) where the last line in the above holds due to Lemma 4.3 part 2. in the Section 3.4. A.3 Proof of Lemmas and Corollary in the Section 3.4 Proof of Lemma 3.4.1 PL PL Take s◦l + 1 = (nϵ2n )/( j=0 uj ) and log Bl◦ = (nϵ2n )/((L + 1) ◦ j=0 (sj + 1)). We know from Lemma 2 of Ghosal and Van Der Vaart (2007) that, there exists a function φ ∈ [0, 1], such that EP0 (φ) ≤ exp{−nd2H (Pθ1 , P0 )/2} EPθ (1 − φ) ≤ exp{−nd2H (Pθ1 , P0 )/2} for all Pθ ∈ F(L, k, s◦ , B ◦ ) satisfying dH (Pθ , Pθ1 ) ≤ dH (P0 , Pθ1 )/18. 
Let H = N (ϵn /19, F(L, k, s◦ , B ◦ ), dH (., .)) denote the covering number of F(L, k, s◦ , B ◦ ), i.e., there exist H Hellinger balls of radius ϵn /19, that entirely cover F(L, k, s◦ , B ◦ ). For 89 any θ ∈ F(L, k, s◦ , B ◦ ) w.l.o.g we assume Pθ belongs to the Hellinger ball centered at Pθh and if dH (Pθ , P0 ) > ϵn , then we must have that dH (P0 , Pθh ) > (18/19)ϵn and there exists a testing function φh , such that EP0 (φh ) ≤ exp{−nd2H (Pθh , P0 )/2} ≤ exp{−((182 /192 )/2)nϵ2n } EPθ (1 − φh ) ≤ exp{−nd2H (Pθh , P0 )/2} ≤ exp{−n(dH (P0 , Pθ ) − ϵn /19)2 /2} ≤ exp{−((182 /192 )/2)nd2H (P0 , Pθ )}. Next we define ϕ = maxh=1,··· ,H φh . Then we must have X EP0 (ϕ) ≤ EP0 (φh ) ≤ Hexp{−((182 /192 )/2)nϵ2n } h ≤ exp{−((182 /192 )/2)nϵ2n − log H} Using Lemma A.2.4 with s = s◦ and B = B ◦ , we get log H = log N (ϵn /19, F(L, k, s◦ , B ◦ ), dH (., .)) √ ≤ log N ( 8σe2 ϵn /19, F(L, k, s◦ , B ◦ ), ||.||∞ )  L L ! !(s◦l +1)  Y 38 Y ≤ log  √ (L + 1) Bj◦ kl+1  8σ 2ϵ l=0 e n j=0 L L ! ! X 38 Y = (s◦l + 1) log √ (L + 1) Bj◦ kl+1 8σ 2ϵ l=0 e n j=0 " L L !# X 1 X ≤C (s◦l + 1) log + log(L + 1) + log Bj◦ + log kl+1 l=0 ϵn j=0 X L XL ◦ ≤C (sl + 1)(log n + log(L + 1) + log Bj◦ + log kl+1 ) l=0 j=0 X L XL ≤C (s◦l + 1)(log n + log(L + 1) + log Bj◦ + log kl+1 + log(kl + 1)) ≤ Cnϵ2n l=0 j=0 90 where, C in each step is different which tends to absorb the extra constants in it. First inequality holds due to the following   1 2 d2H (Pθ , P0 ) ≤ 1 − exp − 2 ||η0 − ηθ ||∞ 8σe and ϵn = o(1), the second inequality is because of (A.4), and fourth inequality is because of s◦l log(1/ϵn ) ≍ s◦l log n. Therefore, X EP0 (ϕ) ≤ EP0 (φh ) = exp{−C1 nϵ2n } h for some C1 = (182 /192 )/2 − 1/4. On the other hand, for any θ, such that dH (Pθ , P0 ) ≥ ϵn , say Pθ belongs to the hth Hellinger ball, then we have EPθ (1 − ϕ) ≤ EPθ (1 − φh ) ≤ exp{−C2 nd2H (P0 , Pθ )} where C2 = (182 /192 )/2. This concludes the proof. Proof of Lemma 3.4.2 X L X Assumption: s◦l +1= (nϵ2n )/( uj ), λl kl+1 /s◦l → 0, ul log L = o(nϵ2n ) (A.6) j=0 L ! L ! [ [ Π(F(L, e k, s◦ , B ◦ )c ) ≤ Π e {||w el ||0 > s◦l } +Π e {||wel ||∞ > Bl◦ } l=0 l=0 XL X L ≤ Π(|| e w el ||0 > s◦l ) + Π(|| e w el ||∞ > Bl◦ ) l=0 l=0 XL X X L X = Π(||w el ||0 > s◦l |z)π(z) + Π(||wel ||∞ > Bl◦ |z)π(z) l=0 z l=0 z kl+1 L ! L ! X X X ≤ P zli > s◦l + P sup ||wli ||1 > Bl◦ z i=1 i=1,··· ,kl+1 l=0 l=0 where w el = (||wl1 ||1 , · · · , ||wlkl+1 ||1 )T and the last inequality holds since Π(||w el ||0 > s◦l |z) ≤ el ||0 > s◦l |z) = 1 iff zli > s◦l and π(z) ≤ 1. We now break the proof in two parts P 1, Π(||w as follows. 91 Part 1. kl+1 kl+1 L ! L ! X X X X P zli > s◦l = P zli − kl+1 λl > s◦l − kl+1 λl l=0 i=1 l=0 i=1 By Bernstein inequality L L −1/2(s◦l − kl+1 λl )2 −1/2(s◦l − kl+1 λl )2 X   X   ≤ exp ≤ exp l=0 kl+1 λl (1 − λl ) + 1/3(s◦l − kl+1 λl ) l=0 kl+1 λl + 1/3(s◦l − kl+1 λl ) L L −sl /2(1 − kl+1 λl /s◦l )2  ◦ 3s◦l    X X kl+1 λl = exp ◦ → exp − since ◦ → 0 by (A.6) 1/3(1 + 2k l+1 λl /sl ) 2 s l l=0 l=0 L 2 nϵ2n nϵ2n       X 3nϵn 3 = exp − P + ≤ 5(L + 1)exp − P ≤ exp − P l=0 4 u l 2 2 u l 4 ul ul log L = o(nϵ2n ) by (A.6). P P since ul log(5(L + 1)) ∼ Part 2. L ! 
X P sup ||wli ||1 > Bl◦ z i=1,··· ,kl+1 l=0 L X kl+1   X ≤ P ||wli ||1 > Bl◦ z l=0 i=1 L kl+1 XX  Bl◦  ≤ P ||wli ||∞ > z l=0 i=1 kl + 1 L X kl+1 kl +1 Bl◦ X X   ≤ P |wlij | > z l=0 i=1 j=1 kl + 1 L X kl+1 kl +1 Bl◦ 2 X X   ≤2 exp − By concentration inequality l=0 i=1 j=1 (kl + 1)2 L X kl+1 kl +1 X X  2nϵ2n  =2 exp − exp( PL ◦ − 2 log(kl + 1)) l=0 i=1 j=1 ((L + 1) j ′ =0 (sj ′ + 1) L X kl+1 kl +1 X X 1 ≤ exp(−nϵ2n ) = exp(−nϵ2n ) l=0 i=1 j=1 (L + 1)kl+1 (kl + 1) where the third inequality holds since |wlij | given z is bound above by a |N (0, σ02 )| random variable. The above proof holds as long as ! 2nϵ2n exp PL ◦ − 2 log(kl + 1) ≥ nϵ2n +log(L+1)+log kl+1 +log(kl +1)+log 2 (L + 1) j ′ =0 (sj ′ + 1) 92 Taking log on both sides we get ! nϵ2n 1 PL ◦ − log(kl + 1) ≥ log(nϵ2n +log(L+1)+log kl+1 +log(kl +1)+log 2) (L + 1) j ′ =0 (sj ′ + 1) 2 PL ◦ + 1) = (L + 1)nϵ2n / P This is true since j ′ =0 (sj ′ ul is bounded above by nϵ2n (L + 1)(log(kl + 1) + 12 log(nϵ2n + log(L + 1) + log kl+1 + log(kl + 1) + log 2) Proof of Lemma 3.4.3 part 1. Assumption: − log λl = O{(kl + 1)ϑl }, − log(1 − λl ) = O{(sl /kl+1 )(kl + 1)ϑl } (A.7) Z Z   P0 (y, x) dKL (P0 , Pθ ) = log P0 (y, x)dydx x∈[0,1]p y∈R Pθ (y, x) (y − η0 (x))2 (y − ηθ (x))2     1 1 P0 (y, x) = p exp − Pθ (y, x) = p exp − 2πσe2 2σe2 2πσe2 2σe2 So we get, (y − η0 (x))2 (y − ηθ (x))2 Z Z    dKL (P0 , Pθ ) = log exp − + P0 (y, x)dydx x∈[0,1]p y∈R 2σe2 2σe2 2y(η0 (x) − ηθ (x)) − (η02 (x) − ηθ2 (x)) Z Z = P0 (y, x)dydx x∈[0,1]p y∈R 2σe2 2η02 (x) − 2η0 (x)ηθ (x) − η02 (x) + ηθ2 (x) Z = dx x∈[0,1]p 2σe2 (η0 (x) − ηθ (x))2 Z 1 = dx = ||η0 − ηθ ||22 (A.8) x∈[0,1]p 2 2 where, σe2 = 1 can be chosen w.l.o.g. Next, let ηθ∗ (x) be θ ∗ satisfying arg minηθ ∈F (L,k,s,B) |ηθ − η0 |2∞ . Then, p ||ηθ∗ − η0 ||1 ≤ ||ηθ∗ − η0 ||∞ = ξ (A.9) ∗ Here, we redefine δ l by considering the L1 norms of the rows of D l = W l − W l as follows ⊤ ⊤ D l = (dl1 , · · · , dlkl+1 )⊤ δ l = (||dl1 ||1 , · · · , ||dlkl+1 ||1 ) 93 Next we define a neighborhood M√P r as follows: l ( pP ) rl Bl M√P r = θ : ||dli ||1 ≤ QL , i ∈ Sl , ||dli ||1 = 0, i ∈ Slc , l = 0, · · · , L l (L + 1)( j=0 Bj ) where Slc is the set where ||w∗li ||1 = 0, l = 0, · · · , L. Then, for every θ ∈ M√P r using l (A.3), we have qX ||ηθ − ηθ∗ ||1 ≤ rl (A.10) √ Combining (A.9) and (A.10), we get for θ ∈ M√P r , ||ηθ − η0 ||1 ≤ pP rl + ξ. So we get, l pP √ ( rl + ξ)2 X dKL (P0 , Pθ ) ≤ ≤ rl + ξ 2 Since θ ∈ NP rl +ξ for every θ ∈ M√P r ; therefore, l Z Z e(θ)dθ ≥ π π e(θ)dθ θ∈NP rl +ξ θ∈M√P r l rl Bl )/((L + 1)( Lj=0 Bj )) and A = {wli : ||wli − w∗li ||1 ≤ δn } pP Q Let δn = (   X   e M√ P Π = Π M √P z π(z) r l rl z   Π M√P r z π(z) X ≥ l {z:zli =1,i∈Sl ,zli =0,i∈Slc ,l=0,··· ,L} L E(1{wli ∈A} |zli = 1) Y Y = (1 − λl )kl+1 −sl λsl l l=0 i∈Sl L  kl2+1 kY l +1 w2lij YZ    Y kl+1 −sl sl 1 ≥ (1 − λl ) λl exp − dwlij l=0 i∈Sl wli ∈A 2π j=1 2 L  kl2+1 kY l +1 Z w ∗ + δn Y w2lij   Y 1 lij kl +1 ≥ (1 − λl )kl+1 −sl λsl l exp − dwlij l=0 i∈Sl 2π j=1 w ∗ − δn lij k +1 2 l L  kl2+1 kY l +1 2  Y w  Y 1 2δn blij = (1 − λl )kl+1 −sl λsl l exp − l=0 i∈Sl 2π j=1 kl + 1 2 where the third equality follows since E(1{wli ∈A} |zli = 0) = 1 since ||w∗li ||1 = 0, for i ∈ Slc . The last equality is by mean value theorem, w blij ∈ [w∗lij − δn /(kl + 1), w∗lij + δn /(kl + 1)], thus kX ! 
L l +1 2 Y kl+1 −sl sl Y kl + 1 1 2δn wblij = (1 − λl ) λl exp log + (kl + 1) log − l=0 i∈S 2 2π kl + 1 j=1 2 l 94 " L n     X 1 1 = exp − sl log + (kl+1 − sl ) log l=0 λl 1 − λl kX !)# l +1 2 X kl + 1 1 2δn w blij + − log − (kl + 1) log + i∈Sl 2 2π kl + 1 j=1 2 " L (     X 1 1 = exp − sl log + (kl+1 − sl ) log l=0 λl 1 − λl )# sl (kl + 1) 1 2δn X kX l +1 w 2 blij − log − sl (kl + 1) log + (A.11) 2 2π kl + 1 i∈S j=1 2 l Now, L X kX l +1 2 L k +1 l X w blij 1 XXX ≤ max((w∗lij − δn /(kl + 1))2 , (w∗lij + δn /(kl + 1))2 ) l=0 i∈Sl j=1 2 2 l=0 i∈S j=1 l XL X kX l +1 XL X XL X ≤ (w∗2lij + δn2 /(kl 2 + 1) ) ≤ ||w∗li ||21 + δn2 /(kl + 1) l=0 i∈Sl j=1 l=0 i∈Sl l=0 i∈Sl XL X X  ≤ sl (Bl2 + 1) ≤ n rl ≤ n rl + ξ (A.12) l=0 where the above line uses δn → 0. Finally L       X 1 1 sl (kl + 1) 1 2δn sl log + (kl+1 − sl ) log − log − sl (kl + 1) log l=0 λ l 1 − λ l 2 2π kl + 1 L ( L )! X sl (kl + 1) X X ≤ Cnrl + 2 log(kl + 1) + 2 log(L + 1) + 2 log Bm − log rl l=0 2 m=0,m̸=l X X  ≤ Cn rl ≤ Cn rl + ξ (A.13) where the first inequality follows from (A.7) and expanding δn . The last inequality follows P P since n rl → ∞ which implies − log rl = O(log n). Combining (A.12) and (A.13) and replacing (A.11), the proof follows. Proof of Lemma 3.4.3 part 2. Assumption: − log λl = O{(kl + 1)ϑl }, − log(1 − λl ) = O{(sl /kl+1 )(kl + 1)ϑl } 95 Suppose there exists q ∈ QMF such that X dKL (q, π) ≤ C1 n rl , XZ X |ηθ − ηθ∗ |22 q(θ, z)dθ ≤ rl . (A.14) z Θ Recall θ ∗ = arg minθ∈θ(L,p,s,B) |ηθ − η0 |2∞ . By relation (A.8), XZ XnZ ndKL (P0 , Pθ )q(θ, z)dθ = ||η0 − ηθ ||22 q(θ, z)dθ z z 2 Z nX n ≤ ||ηθ∗ − ηθ ||22 q(θ, z)dθ + ||ηθ∗ − η0 ||2∞ 2 z 2 X ≤ Cn( rl + ξ) where the above relation is due to (A.14) which completes the proof. We next construct q ∈ QMF as wlij |zli ∼ zli N (w∗lij , σl2 ) + (1 − zli )δ0 , zli ∼ Bern(γli∗ ) γli∗ = 1(||wli∗ ||1 ̸= 0) sl QL 2 −1 where σl2 = 8n(L+1) (4L−l (kl + 1) log(kl+1 2kl +1 ) m=0,m̸=l Bm ) . We next consider the relation (A.5) in Lemma A.2.5. We upper bound the expectation of the supremum of L1 norm of multivariate Gaussian variables: Z Z Z Wl q(θ, z)dθ ≤ sup ||wli − wli ||1 q(θ|z)dθ ≤ sup ||wli − w∗li ||1 q(θ|z = 1)dθ f ∗ i i since q(z) ≤ 1. If zli = 1, then ||wli − w∗li ||1 = 0, thus the above integral is maximized at z = 1 where z = 1 indicates all neurons are present in the network. In this case, all wlij are nothing but independent Gaussian random variables. In this direction we make use of concentration inequalities similar to the proof of theorem 2 in Chérief-Abdellatif (2020). Let, Y = supi ||wli − w∗li ||1 . exp(tEY ) ≤ E(exp(tY )) = E[sup exp(t||wli − w∗li ||1 )] i kl+1 kXl +1 kl+1 kl +1 X X Y ≤ E[exp(t |wlij − w∗lij |)] = E[exp(t|wlij − w∗lij |)] i=1 j=1 i=1 j=1 96 kl+1 kl +1 σl2 t2 σl2 t2 X Y     kl +1 = 2exp Φ(σl t) ≤ kl+1 2 exp (kl + 1) i=1 j=1 2 2 p Thus, EY ≤ (log(kl+1 2kl +1 ) + (kl + 1)σl2 t2 /2)/t. Let t = (1/σl ) (2/(kl + 1)) log(kl+1 2kl +1 ), r kl + 1 hp p i EY ≤ σl log(kl+1 2kl +1 ) + log(kl+1 2kl +1 ) 2 q q = 2σl2 (kl + 1) log(kl+1 2kl +1 ) ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) Similarly, Z Z Z f 2 q(θ, z)dθ = W l sup(||wli − w∗li ||1 )2 q(θ, z)dθ ≤ sup(||wli − w∗li ||1 )2 q(θ|z = 1) i i Let, Y ′ = supi (||wli − w∗li ||1 )2 . 
exp(tEY ′ ) ≤ E(exp(tY ′ )) = E[sup exp(t(||wli − w∗li ||1 )2 )] i kl+1 kXl +1 kl+1 kXl +1 X X ≤ E[exp(t( |wlij − w∗lij |)2 )] ≤ E[exp(t(kl + 1) (wlij − w∗lij )2 )] i=1 j=1 i=1 j=1 kl+1 kl +1 kl+1 kl +1   21 X Y X Y 1 = E[exp(t(kl + 1)(wlij − w∗lij )2 )] = i=1 j=1 i=1 j=1 1 − 2t(kl + 1)σl2   kl2+1 1 ≤ kl+1 1 − 2t(kl + 1)σl2 Thus, EY ′ ≤ (log kl+1 − ((kl + 1)/2) log(1 − 2t(kl + 1)σl2 ))/t. Let t = 1/(4σl2 (kl + 1)),     ′ kl + 1 kl +1 EY ≤ 4σl2 (kl + 1) log kl+1 + log 2 = 4σl2 (kl + 1) log(kl+1 2 2 ) 2 ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) Next we also get, Z Z q (W fl + Bl )q(θ, z)dθ = fl q(θ, z)dθ + Bl ≤ W 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl ≤ 2Bl Z Z Z 2 f 2 q(θ, z)dθ fl q(θ, z)dθ + B 2 (W fl + Bl ) q(θ, z)dθ = W l + 2Bl W l q ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) + 2Bl 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl2 ≤ 4Bl2 97 Z Z Z 2 Wfl (Wfl + Bl )q(θ, z)dθ = Wl q(θ, z)dθ + Bl W f fl q(θ, z)dθ q ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl 4σl2 (kl + 1) log(kl+1 2kl +1 ) q q  ≤ 4σl2 (kl + 1) log(kl+1 2kl +1 ) 4σl2 (kl + 1) log(kl+1 2kl +1 ) + Bl q ≤ 2Bl 4σl2 (kl + 1) log(kl+1 2kl +1 ) p since 4σl2 (kl + 1) log(kl+1 2kl +1 ) is bounded above by v !−1 u L u 4s l Y t 4L−l (kl + 1) log(kl+1 2kl +1 ) Bm 2 (kl + 1) log(kl+1 2kl +1 ) 8n(L + 1) m=0,m̸=l v u L ! −1 u sl L−l Y 2 = Bl t 4 Bm ≤ Bl , The quantity in square root < 1 for large n. 2n(L + 1) m=0 Let bj = (kj + 1) log(kj+1 2kj +1 ). From relation (A.5), we get Z L L ! X Y ||ηθ − ηθ∗ ||22 q(θ, z)dθ ≤ c2j−1 (4σj2 bj ) 4Bm 2 j=0 m=j+1 j−1 j−1 L X L ! ! X q Y q Y 2 +2 cj−1 cj ′ −1 2Bj 4σj2 bj 4Bm 4σj2′ bj ′ 2Bm j=0 j ′ =0 m=j+1 m=j ′ +1 j−1 L ! L ! X Y Y =4 4L−j σj2 bj Bm2 Bm2 j=0 m=0 m=j+1 j−1 j−1 j−1 j−1 L X ! ! L ! ! X Y Y Y Y q q 2 +8 Bm Bm 2Bj 4Bm 2Bm σj2 bj σj2′ bj ′ j=0 j ′ =0 m=0 m=0 m=j+1 m=j ′ +1 XL Y L 2L−2j =4 2 σj2 bj Bm 2 j=0 m=0,m̸=j j−1 j−1 j−1 L X ! ! L ! L ! q q j−j ′ X Y Y Y Y +8 4L−j 2 Bm Bm Bm Bm σj2 bj σj2′ bj ′ j=0 j ′ =0 m=0 m=0 m=j+1 m=j ′ +1 L L ! X Y 2L−2j =4 2 σj2 bj Bm 2 j=0 m=0,m̸=j j−1 L X L ! L ! q q L−j L−j ′ X Y Y +8 2 2 Bm Bm σj2 bj σj2′ bj ′ j=0 j ′ =0 m=0,m̸=j m=0,m̸=j ′ L L !!2 L r !2 X L−j q Y X sj =4 2 σj2 bj Bm =4 j=0 m=0,m̸=j j=0 8n(L + 1) 98 L !2 PL L 1 X √ j=0 sj X = sj ≤ ≤ rl 2n(L + 1) j=0 2n j=0 This concludes the proof of (A.14). Next, 1 Y kY n L−1 l+1 kl +1 n o + 1(z = γ )dKL Y dKL (q, π) ≤ log ∗ γli∗ N (w∗lij , σl2 ) + (1 − γli∗ )δ0 π(z) l=0 i=1 j=1 Y kY l+1 kl +1 n ! kY L +1 o n L−1 Y o kYL +1 o N (w∗Lj , σL2 ) , zli N (0, σ02 ) + (1 − zli )δ0 N (0, σ02 ) j=1 l=0 i=1 j=1 j=1 L−1 X kl+1 kl +1 1 X X  = log QL−1 + dKL γli∗ N (w∗lij , σl2 ) + (1 − γli∗ )δ0 , l=0 λsl l (1 − λl )kl+1 −sl l=0 i=1 j=1  kX L +1   γli∗ N (0, σ02 ) + (1 − γli∗ )δ0 + dKL N (w∗Lj , σL2 ), N (0, σ02 ) j=1 L−1 ! L−1 kl+1 k +1 ( ) l 2 2 ∗ 2 X 1 1 XXX 1 σ σ l + w lij 1 = sl log + (kl+1 − sl ) log + γli∗ log 02 + 2 − l=0 λ l 1 − λ l l=0 i=1 j=1 2 σ l 2σ 0 2 kX ( ) L +1 2 1 σ02 σL2 + w∗Lj 1 + log 2 + − j=1 2 σL 2σ02 2 L−1 L−1 " # X X sl kl + sl σl2 Bl2 σ02 ≤ Cnrl + 2 + 2 − 1 + log 2 l=0 l=0 2 σ0 σ 0 (kl + 1) σl " # kL + 1 σL2 BL2 σ02 + − 1 + log 2 2 σ02 σ02 (kL + 1) σL where the first inequality follows from Lemma A.2.2. The inequality in the above line uses Pkl +1 ∗ 2 2 j=1 w lij ≤ Bl and similar to the proof of Lemma 4.1 in Bai et al. (2020) uses (A.7). Let σ02 = 1 and it could be easily derived that σl2 ≤ 1. 
L−1 L−1 " # " # X X sl Bl2 (k L + 1) BL 2 dKL (q, π) ≤ Cnrl + (kl + 1) − log σl2 + − log σL2 l=0 l=0 2 k l + 1 2 kL + 1 L−1 L−1 " " L #−1 !# X X sl Bl2 sl Y = Cnrl + (kl + 1) − log 4L−l bl Bm2 l=0 l=0 2 k l +1 8n(L + 1) m=0,m̸=l " " L # −1 !# (kL + 1) BL2 1 Y 2 + − log bL Bm 2 kL + 1 8n(L + 1) m=0,m̸=L L−1 L " " L #−1 !# X X sl Bl2 sl L−l Y 2 = Cnrl + (kl + 1) − log 4 bl Bm l=0 l=0 2 kl + 1 8n(L + 1) m=0,m̸=l 99 L−1 L L ! X X sl X sl 8n(L + 1) = Cnrl + Bl2 + (kl + 1) log l=0 l=0 2 l=0 2 sl L L X X sl + sl (kl + 1)(L − l) log 2 + (kl + 1) log(kl + 1) l=0 l=0 2 L L L ! X sl   X X + (kl + 1) log log(kl+1 2kl +1 ) + sl (kl + 1) log Bm l=0 2 l=0 m=0,m̸=l L−1 L L ! L X X sl 2 X sl 8n(L + 1) X ≤ Cnrl + B + (kl + 1) log +L sl (kl + 1) l=0 l=0 2 l l=0 2 sl l=0 L L L ! X sl X X + (kl + 1)(log(kl + 1) + log(kl+1 + kl + 1)) + sl (kl + 1) log Bm l=0 2 l=0 m=0,m̸=l L−1 L L ! L X X sl X sl 8n(L + 1) X ≤ Cnrl + Bl2 + (kl + 1) log +L sl (kl + 1) l=0 l=0 2 l=0 2 sl l=0 L L L ! X X X + sl (kl + 1) log(kl+1 + kl + 1) + sl (kl + 1) log Bm l=0 l=0 m=0,m̸=l L−1 L " L ! X X Bl2 X ≤ Cnrl + sl (kl + 1) + log Bm + L + log(kl+1 + kl + 1) l=0 l=0 2(kl + 1) m=0,m̸=l !# 1 8n(L + 1) + log 2 sl L−1 X ≤ (C + C ′ )nrl + C ′ nrL l=0 L " L ! !# X Bl2 X n + sl (kl + 1) + log Bm + L + log(kl+1 + kl + 1) + log l=0 kl + 1 m=0,m̸=l sl L−1 X XL X L ′ ′ ≤ (C + C )nrl + C nrL + sl (kl + 1)ϑl ≤ C1 n rl l=0 l=0 l=0 This concludes the proof of (A.14). Proof of Corollary 3.4.5 The proof is a direct consequence of Theorem 3.4.4 in the Section 3.4 as long as assumptions of Lemma 3.4.2 and Lemma 3.4.3 parts 1 and 2 hold when σ02 = 1, − log λl = log(kl+1 ) + qP Cl (kl + 1)ϑl and ϵn = ( Ll=0 rl + ξ) Ll=0 ul . This what we show next. P 100 ul = O(ϵ2n ), thus P Verifying assumption (A.6) under Proof of Lemma 3.4.2: Note, X X ul log L = o(nϵ2n ) ⇐⇒ log L = o(n( rl + ξ)) which is indeed true since log L = o(L2 ) and L2 ≤ n rl . We show that (kl+1 λl )/s◦l → 0. P With λl = (1/kl+1 )exp(−Cl (kl + 1)ϑl ), P P kl+1 λl ul exp(−C(kl + 1)ϑl ) exp(−C(kl + 1)ϑl + log ul ) ≤ = s◦l nϵ2n nϵ2n exp(−C(kl + 1)ϑl + ϑl ) ≤ →0 nϵ2n ul ≤ ϑl , ϑl → ∞, kl → ∞ and nϵ2n → ∞. P where the above relation holds since log Verifying assumption (A.7) under Proof of Lemma 3.4.3 part 1. and part 2. Note, − log λl = log(kl+1 ) + Cl (kl + 1)ϑl ≤ ϑl + Cl (kl + 1)ϑl = O{(kl + 1)ϑl } And then, 1 − λl = 1 − exp(−Cl ϑl (kl + 1))/kl+1 − log(1 − λl ) ∼ exp(−Cl ϑl (kl + 1))/kl+1 = O{(kl + 1)sl ϑl /kl+1 } since exp(−Cl ϑl (kl + 1)) → 0 and (kl + 1)sl ϑl → ∞. 101 APPENDIX B ADDITIONAL NUMERICAL EXPERIMENTS DETAILS B.1 FLOPs Calculation We only count multiply operation for floating point operations (FLOPs) similar to Zhao et al. (2019). In 2D convolution layer, we assume convolution is implemented as a sliding window and that the nonlinearity function is computed for free. Then, for a 2D convolutional layer (given bias is present) we get FLOPs as: FLOPs = (Cin,pruned Kw Kh + 1)Ow Oh Cout,pruned where, Cin,pruned , Cout,pruned are the number of input channels and output channels after prun- ing. Channels are pruned if all the parameters associated with that channel in convolution mapping are zero. Kw and Kh are the kernel width and height respectively. Finally, Ow , Oh are output width and height where Ow = (Iw + 2 × Pw − Dw × (Kw − 1) − 1)/Sw + 1 and Oh = (Ih + 2 × Ph − Dh × (Kh − 1) − 1)/Sh + 1. Here, Iw , Ih are input, Pw , Ph are padding, Dw , Dh are dilation, Sw , Sh are stride widths and heights respectively. 
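The FLOPs count above is straightforward to implement. The following minimal Python sketch (function and argument names are illustrative, not taken from the dissertation's code) computes the multiply count for a pruned 2D convolution layer under exactly these assumptions.

```python
# Minimal sketch of the 2D-convolution FLOPs count described above; argument
# names (in_ch_pruned, out_ch_pruned, etc.) are illustrative placeholders.

def conv2d_flops(in_ch_pruned, out_ch_pruned, k_w, k_h,
                 i_w, i_h, pad_w=0, pad_h=0, dil_w=1, dil_h=1,
                 stride_w=1, stride_h=1, bias=True):
    """Count multiply operations for a pruned 2D convolution layer."""
    # Output spatial dimensions (integer division mirrors PyTorch's Conv2d).
    o_w = (i_w + 2 * pad_w - dil_w * (k_w - 1) - 1) // stride_w + 1
    o_h = (i_h + 2 * pad_h - dil_h * (k_h - 1) - 1) // stride_h + 1
    # (C_in * K_w * K_h + 1) multiplies per output element, times all outputs.
    per_output = in_ch_pruned * k_w * k_h + (1 if bias else 0)
    return per_output * o_w * o_h * out_ch_pruned

# Example: a pruned 3x3 convolution mapping 12 surviving input channels to
# 40 surviving output channels on a 32x32 input with padding 1.
print(conv2d_flops(12, 40, 3, 3, 32, 32, pad_w=1, pad_h=1))
```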
For fully connected (linear) layers (with bias) we get the FLOPs as:

FLOPs = (Ipruned + 1) Opruned

where Ipruned and Opruned are the numbers of input and output neurons remaining after pruning, respectively.

B.2 Variational parameters initialization

We initialize the γlj's at a value close to 1 for all of our experiments. This ensures that at epoch 0 we have a fully connected deep neural network. It also warrants that most of the weights do not get pruned off at a very early stage of training, which might lead to bad performance. The variational parameters µljj′ are initialized using U(−0.6, 0.6) for the simulation and UCI regression examples, whereas Kaiming uniform initialization (He et al., 2015) is used for classification. Moreover, the σljj′ are reparameterized using the softplus function, σljj′ = log(1 + exp(ρljj′)), and the ρljj′ are initialized at a constant value of −6. This keeps the initial values of σljj′ close to 0, ensuring that the initial network weights stay close to the Kaiming uniform initialization.

B.3 Hyperparameters for training

We keep the MC sample size (S) at 1 during training. We choose a learning rate of 3 × 10−3, a batch size of 400, and 10000 epochs in the 20-neuron case of simulation study-I. We use a learning rate of 10−3, a batch size of 400, and 20000 epochs in the 100-neuron case of simulation study-I. Next, we use a learning rate of 5 × 10−3, full batch, and 10000 epochs for simulation study-II. For the UCI regression datasets, we choose a batch size of 128 and run 500 epochs for Concrete, Wine, and Power Plant, and 800 epochs for Kin8nm. For the Protein and Year datasets, we choose a batch size of 256 and run 100 epochs. For all the UCI regression datasets we keep a learning rate of 10−3. The Adam algorithm is chosen for the optimization of the model parameters. In the image classification datasets, for the SS-IG model we use a 10−3 learning rate and a minibatch size of 1024 in all experiments except the LeNet-5-Caffe on Fashion-MNIST experiment, where we use a 2 × 10−3 learning rate and a 1024 minibatch size. For the SV-BNN model, we take a 10−3 learning rate and a 1024 minibatch size in all experiments after an extensive hyperparameter search. For the VBNN model, we take a learning rate of 10−4 and a minibatch size of 128 following Blundell et al. (2015). We train each model for 1200 epochs using the Adam optimizer in all the image classification experiments provided in Section 3.5.

B.4 Fine-tuning of the constant in the prior inclusion probability expression

Recall the layer-wise prior inclusion probabilities λl = (1/kl+1) exp(−Cl(kl + 1)ϑl) from Corollary 3.4.5. In our numerical experiments, we use this expression to choose an optimal value of λl in each layer of a given network. The value of λl varies as we vary the constant Cl, and we next describe how Cl is chosen. The influence of Cl is mainly due to the kl + 1 term and the Bl²/(kl + 1) term inside ϑl. We ensure that each incoming weight and bias onto a node from layer l + 1 is bounded by 1, which leads us to choose Bl to be kl + 1. Hence the leading term of (kl + 1)ϑl is (kl + 1), and Cl has to be chosen so that the exponential term in the λl expression does not approach 0. In our experiments, we choose Cl values of the order of negative powers of 10 so that the prior inclusion probabilities do not fall below 10−50; a small sketch of this computation is given below. If we instead choose a λl value very close to 0, we might prune off all the nodes in each layer or make the training unstable, which is not ideal.
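As an illustration of the guideline above, the following hedged sketch evaluates λl for a hypothetical layer and flags underflow. The function name, the example values of Cl and ϑl, and the 10−50 floor check are illustrative assumptions, not the dissertation's actual implementation.

```python
import math

# Sketch of the prior-inclusion-probability expression from Corollary 3.4.5,
# lambda_l = (1 / k_{l+1}) * exp(-C_l * (k_l + 1) * vartheta_l).
# vartheta_l is supplied directly rather than recomputed from its definition.

def prior_inclusion_prob(C_l, k_l, k_lplus1, vartheta_l, floor=1e-50):
    """Layer-wise prior inclusion probability with an underflow check."""
    lam = (1.0 / k_lplus1) * math.exp(-C_l * (k_l + 1) * vartheta_l)
    if lam < floor:
        # Mirrors the guideline above: avoid lambda_l dropping below ~1e-50,
        # which can prune every node in a layer or destabilize training.
        raise ValueError(f"lambda_l = {lam:.3e} is below the {floor:.0e} floor; "
                         "choose a smaller C_l.")
    return lam

# Example with hypothetical values for a 400-node layer feeding a 400-node layer.
print(prior_inclusion_prob(C_l=1e-3, k_l=400, k_lplus1=400, vartheta_l=100.0))
```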
Overall, the aforementioned strategy of choosing the constants Cl ensures reasonable values of λl in each layer.

B.5 Simulation study I: extra details

First we provide the network parameters used to generate the data for this simulation experiment. The edge weights in the underlying 2-2-1 network are as follows: W0 = {w011 = 10, w012 = 15, w021 = −15, w022 = 10}; W1 = {w111 = −3, w121 = 3} and v0 = {v01 = −5, v02 = 5}; v1 = {v11 = 4}.

Figure B.1 Simulation study I: additional experiment results. Node-wise weight magnitudes recovered by (a) VBNN and (b) the proposed SS-IG model in the synthetic regression data generated using the 2-2-1 network. The boxplots show the distribution of incoming weights into a given hidden layer node. Only the 20 nodes with the largest edge weights are displayed.

In Figure B.1, we provide additional results demonstrating the model selection ability of our SS-IG approach in a wider network consisting of 100 nodes in the single hidden layer structure considered in simulation study-I from Section 3.5.

B.5.1 Effect of Hidden Layer Widths

Here, we explore 2-hidden-layer neural networks with varying widths. For our SS-IG model we use a 10−3 learning rate and a minibatch size of 1024, while for the VBNN model we take a learning rate of 10−4 and a minibatch size of 128 following Blundell et al. (2015). We train both models for 400 epochs using the Adam optimizer. Figure B.2 summarizes the results. We provide results for 3 different architectures which have 400, 800, and 1200 nodes in each of their 2 hidden layers. In Figure B.2a, we find that across the architectures both the SS-IG and VBNN models have similar predictive performance. Further, our method is able to prune off more than 88% of the first hidden layer nodes and more than 92% of the second hidden layer nodes (Figure B.2b) at the expense of a 2% accuracy loss due to sparsification compared to the densely connected VBNN. We also observe that as the model capacity increases the sparsity percentage per layer decreases. This suggests that each architecture is trying to reach a sparse network of comparable size.

Figure B.2 MNIST experiment results for varying hidden layer widths: (a) prediction accuracy per architecture and (b) layer-wise sparsity per architecture.

CHAPTER 4

COMPACT BAYESIAN NEURAL NETWORKS WITH STRUCTURED SPARSITY

4.1 Introduction

In high-dimensional modeling, predictor selection and sparse signal recovery are routine statistical and machine learning practices. Sparse parameter estimation via high-dimensional regularization penalizing model dimensionality is well studied in the literature. Two of the most popular regularization techniques are the lasso and horseshoe regularizers (Bhadra et al., 2019). The lasso estimator (Tibshirani, 1996) induces sparsity by constraining the L1 norm of the parameters in the model. The horseshoe estimator (Carvalho et al., 2010) places absolutely continuous shrinkage priors on the entire parameter vector that selectively shrink the small signals, since the horseshoe prior has heavy tails supporting both zero and large values. Both the lasso and horseshoe procedures come with strong theoretical guarantees for estimation and prediction. In this work, we propose a spike-and-slab prior framework similar to the SS-IG model proposed in Chapter 3 for dynamic node pruning, with the slab component using either group lasso or group horseshoe priors.
This combination of spike-and-slab and group shrinkage priors first ensures that the unnecessary collection of weights incident on a node is shrunk to zero, and the spike-and-slab setup then allows for automated pruning of such shrunken weights. In Figure 4.1, we provide an image classification experiment where our proposed spike-and-slab Group Lasso (SS-GL) and spike-and-slab Group Horseshoe (SS-GHS) approaches demonstrate the improvement over a simple Gaussian prior in the slab part. For posterior approximation, variational inference (VI) in sparse BNNs with the spike-and-slab Gaussian prior framework for edge selection was introduced by Chérief-Abdellatif (2020), and Jantre et al. (2021a) later extended it to node pruning. In this work, we adopt variational Bayesian inference, leading to tractable model training, in conjunction with a continuous relaxation of the discrete Bernoulli variables associated with the spike part (Maddison et al., 2017; Jang et al., 2017), similar to the SS-IG model.

Figure 4.1 MNIST experiment results: motivation for group shrinkage priors over a Gaussian prior. Here, we demonstrate the performance of our SS-GL and SS-GHS models in a 2-layer perceptron network used to classify MNIST, the hand-written digits dataset. (a) The classification accuracy on the test data for our models and for the SS-Gauss model of Jantre et al. (2021a). (b) and (c) The proportion of active nodes (node sparsity) in layer-1 and layer-2 of the network, respectively. We observe that our SS-GHS yields the most compact network with the best classification accuracy.

4.1.1 Proposed Methods

Firstly, there does not exist any cohesive literature establishing the numerical efficiency of shrinkage priors over Gaussian slabs in the context of training structurally sparse networks. Secondly, the numerical properties of the corresponding variational implementation remain unexplored. To address these issues, we consider a spike-and-slab framework with group shrinkage priors, (i) group lasso and (ii) group horseshoe, in which the slab component first shrinks the redundant model weights and the spike component then prunes out the nodes whose weights have been shrunk close to zero. Accordingly, our detailed contribution is as follows.

• We propose structurally sparse Bayesian neural networks using two distinct spike-and-slab prior setups, where the slab component uses hierarchical priors on the group of incoming weights (including the bias) of each neuron: (i) Spike-and-Slab Group Lasso (SS-GL), and (ii) Spike-and-Slab Group Horseshoe (SS-GHS).

4.2 Structured Sparsity: Spike-and-Slab Hierarchical Priors

In order to carry out automatic node selection to induce structured sparsity in BNNs, we consider spike-and-slab priors. A zero-mean Gaussian distribution is the commonly used slab distribution in spike-and-slab priors (Jantre et al., 2021a). However, its use can lead to inflated predictive uncertainties, especially when used in conjunction with fully factorized variational inference (Ghosh et al., 2019). Instead, if we consider a slab distribution that is zero-mean Gaussian with its scale being a random variable, then the slab part of the marginal prior distribution has heavier tails and higher mass at zero. Such hierarchical distributions in the slab part further improve the sparsity as well as circumvent the inflated predictive uncertainties.
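The heavier-tail claim can be checked with a quick Monte Carlo comparison. The sketch below is not from the dissertation, and the Gamma shape and scale are arbitrary; it simply contrasts a fixed-variance Gaussian with a Gaussian whose variance is itself Gamma-distributed.

```python
import numpy as np

# Illustration of the claim above: a Gaussian slab whose variance is random has
# heavier tails (and more mass near zero) than a fixed-variance Gaussian.
rng = np.random.default_rng(0)
n = 1_000_000

tau2 = rng.gamma(shape=1.0, scale=1.0, size=n)         # random per-draw variance
mixture = rng.normal(0.0, np.sqrt(tau2))               # hierarchical (scale-mixture) slab
gaussian = rng.normal(0.0, np.sqrt(tau2.mean()), size=n)  # fixed-variance comparison

def excess_kurtosis(x):
    x = x - x.mean()
    return (x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0

for name, draws in [("fixed-variance Gaussian", gaussian), ("Gamma scale mixture", mixture)]:
    print(f"{name:>24}: P(|w| > 4) = {np.mean(np.abs(draws) > 4):.2e}, "
          f"excess kurtosis = {excess_kurtosis(draws):+.2f}")
```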
Below, we describe the two hierarchical spike-and-slab priors and the corresponding fully factorized variational families that we use in each of our proposed approaches.

4.2.1 Spike-and-Slab Group Lasso (SS-GL):

To facilitate optimal layer-wise node selection, we allow the prior inclusion probability λl to vary as a function of the layer index l.

Prior: We assume a spike-and-slab prior of the following form, with zlj being the indicator for the presence of the j-th node in the l-th layer:

π(wlj | zlj) = (1 − zlj) δ0 + zlj N(0, σ0² τlj² I),  π(zlj) = Ber(λl),  π(τlj²) = G(kl/2 + 1, ς²/2)

where l = 0, . . . , L and j = 1, . . . , kl+1. N(., .), Ber(.), and G(., .) represent the Gaussian, Bernoulli, and Gamma distributions. wlj = (wlj1, . . . , wlj(kl+1)) is the vector of edges incident on the j-th node in the l-th layer. δ0 is a Dirac spike vector of dimension kl + 1 with all zero entries, and I is the identity matrix of dimension (kl + 1) × (kl + 1). The zlj for j = 1, . . . , kl+1 all follow Ber(λl), allowing a common prior inclusion probability λl for all nodes of a given layer l. We set λL = 1 to ensure that no node selection occurs in the output layer. σ0 and τlj are the constant global and the variable local (per-node) scale mixture components of the Gaussian slab distribution. ς²/2 is the constant rate hyperparameter of the Gamma distribution.

Variational family: We consider the following fully factorized variational family:

q(wlj | zlj) = (1 − zlj) δ0 + zlj N(µlj, diag(σlj²)),  q(zlj) = Ber(γlj),  q(τlj²) = LN(µlj^{τ}, σlj^{τ}²)

for l = 0, . . . , L and j = 1, . . . , kl+1. LN(., .) denotes the Log-Normal distribution, so that q(log τlj²) ∼ N(µlj^{τ}, σlj^{τ}²). The spike-and-slab structure of the variational family ensures that the variational weight distributions follow a spike-and-slab structure, allowing exact node sparsity through the variational approximation. Further, the weight distributions conditioned on the node indicator variables are all independent of each other. The variational distribution of the parameters obtained after optimization will then inherently prune away redundant nodes from each layer. Moreover, we use the Log-Normal family instead of the Gamma family to approximate the Gamma-distributed τlj², since we then obtain closed-form expressions for dKL(q(τlj²), π(τlj² | ς²)). Additionally, µlj = (µlj1, . . . , µlj(kl+1)) and σlj² = (σlj1², . . . , σlj(kl+1)²) denote the vectors of variational mean and standard deviation parameters of the slab component of q(wlj | zlj), and diag(σlj²) is the diagonal matrix with σljj′² as the j′-th diagonal entry. Similarly, γlj denotes the variational inclusion probability parameter of q(zlj). We set γLj = 1 to ensure that no node selection occurs in the output layer. µlj^{τ} and σlj^{τ} denote the variational mean and standard deviation parameters of the Gaussian distribution associated with q(log τlj²).

ELBO: Let θ be the network weights and ϖ = (z, τ²) be the remaining parameters.
We minimize the loss function: L = −ELBO(q(θ, ϖ), π(θ, ϖ|D)), X Z L = −Eq(θ,ϖ) [log L(θ)] + q(zlj = 1) dKL (q(wlj |zlj = 1), π(wlj |τlj2 , zlj = 1))q(τlj2 )dτlj2 l,j X X + dKL (q(zlj ), π(zlj )) + dKL (q(τlj2 ), π(τlj2 |ς 2 )) l,j l,j X Z = −Eq(θ,ϖ) [log L(θ)] + q(zlj = 1) dKL (N (µlj , diag(σlj2 )), N (0, σ02 τlj2 I))q(τlj2 )dτlj2 l,j 109   {τ } 2{τ } X X 2 + dKL (Ber(γlj ), Ber(λl )) + dKL LN (µlj , σlj ), G(kl /2 + 1, ς /2) l,j l,j 4.2.2 Spike-and-Slab Group Horseshoe (SS-GHS): In this model, we consider spike-and-slab prior with group horseshoe distribution in the slab part. Prior: We consider regularized version of group horseshoe (Piironen and Vehtari, 2017) in the slab part to circumvent the numerical stability issues associated with the unregularized group horseshoe. We define our prior similar to SS-GL earlier. π(wlj |zlj ) = (1 − zlj )δ0 + zlj N (0, σ02 τelj2 s2 I) , τelj2 = c2 τlj2 /(c2 + τlj2 s2 )   π(zlj ) = Ber(λl ), π(τlj ) = C + (0, 1), π(s) = C + (0, s0 ) where l = 0, . . . , L, j = 1, . . . , kl+1 . C + (., .) denotes half Cauchy distribution. τelj2 is varying local (per node) scale parameter, s2 is the varying global scale parameter, and σ02 is the constant global scale parameter. Note that, when weights are strongly shrinking towards 0 then τlj2 s2 ≪ c2 and τelj2 → τlj2 s2 which leads to the unregularized version of the group horseshoe. Whereas, when weights are away from 0 then corresponding τlj2 s2 will be large, i.e, τlj2 s2 ≫ c2 and τelj2 → c2 , where c2 is constant. For these weights corresponding version of regularized group horseshoe in the slab follows N (0, σ02 c2 I). This helps in thinning out the heavy tails associated with the horseshoe prior. Next, the prior inclusion probabilities, λl are common for for all nodes from a given layer similar to SS-GL. Additionally, s0 is the scale parameter of half Cauchy prior on s that can be tuned for specific situations. Instead of directly working with the half-Cauchy distributions, we employ a decomposi- tion of the half-Cauchy that relies upon gamma and inverse gamma distributions (Louizos et al., 2017) as this allows us to compute the negative KL-divergence from the scale distribu- tion π(τ ) to an approximate log-normal scale posterior q(τ ) in closed form. More specifically, we have a half-Cauchy distribution that can be expressed in a non-centered parametrization 110 as: β̃ ∼ IG(1/2, 1), α̃ ∼ G(1/2, k 2 ), τ 2 = β̃ α̃ where IG(., .), G(., .) correspond to the inverse Gamma and Gamma distributions in the scale parametrization, and τ follows a half-Cauchy distribution with scale k. Therefore we re-express the whole SS-GHS prior hierarchy as: π(wlj |zlj ) = (1 − zlj )δ0 + zlj N (0, σ02 τelj2 s2 I) , π(zlj ) = Ber(λl )   π(βlj ) = IG (1/2, 1) , π(αlj ) = G (1/2, 1) , π(sb ) = IG (1/2, 1) , π(sa ) = G 1/2, s20  Variational family: We consider the following fully factorized variational family q(wlj |zlj ) = (1 − zlj )δ0 + zlj N (µlj , diag(σlj2 )) , q(zlj ) = Ber(γlj )   {β} {β} 2 {α} {α} 2 q(βlj ) = LN (µlj , σlj ), q(αlj ) = LN (µlj , σlj ), 2 2 q(sb ) = LN (µ{sb } , σ {sb } ), q(sa ) = LN (µ{sa } , σ {sa } ) for l = 0, . . . , L, j = 1, . . . , kl+1 . Similar to SS-GL variational family, we use Log-Normal family instead of Gamma and Inverse-Gamma families to approximate Gamma and Inverse- Gamma distributed variables to obtain closed form expression for the KL divergence between {β} {β} 2 {α} {α} 2 2 prior and variational distributions. 
Moreover, (µlj , σlj ), (µlj , σlj ), (µ{sb } , σ {sb } ), and 2 (µ{sa } , σ {sa } ) denote the variational mean and standard deviation parameters of the Gaus- sian distribution associated with q(log βlj ), q(log αlj ), q(sb ) and q(sa ). ELBO: Let θ be the network weights and ϖ = (z, τ 2 , s2 ) be the remaining parameters. Similar to SS-GL, We minimize the loss function: L = −ELBO(q(θ, ϖ), π(θ, ϖ|D)), L = −ELBO(q(θ, ϖ), π(θ, ϖ|D)) X = −Eq(θ,ϖ) [log L(θ)] + dKL (q(zlj ), π(zlj )) l,i Z Z "X Z Z + q(zlj = 1) dKL (q(wlj |zlj = 1), π(wlj |βlj , αlj , sb , sa , zlj = 1)) l,j # ×q(βlj ) q(αlj ) dβlj dαlj q(sb ) q(sa ) dsb dsa 111 Xh i + dKL (q(βlj ), π(βlj )) + dKL (q(αlj ), π(αlj )) + dKL (q(sb ), π(sb )) + dKL (q(sa ), π(sa )) l,j X = −Eq(θ,ϖ) [log L(θ)] + dKL (Ber(γlj ), Ber(λl )) l,i Z Z "X Z Z + q(zlj = 1) dKL (N (µlj , diag(σlj2 )), N (0, σ02 βlj αlj sb sa I)) l,j # ×q(βlj ) q(αlj ) dβlj dαlj q(sb ) q(sa ) dsb dsa {β} 2 {α} 2 Xh {β} {α} i + dKL (LN (µlj , σlj ), IG (1/2, 1)) + dKL (LN (µlj , σlj ), G (1/2, 1)) l,j 2 2 + dKL (LN (µ{sb } , σ {sb } ), IG (1/2, 1)) + dKL (LN (µ{sa } , σ {sa } ), G 1/2, s20 )  4.2.3 Algorithm and Computational Details We minimise the loss L for both SS-GL and SS-GHS models by recursively sampling their corresponding variational posterior, allowing us to propagate the information through the network. The Gaussian variational approximations, N (µlj , diag(σlj2 )), are reparameterized as µlj + σlj ⊙ ζlj for ζlj ∼ N (0, I), where ⊙ denotes the entry-wise (Hadamard) product. Continuous Relaxation. The discrete spike variables (z) are replaced with their continu- ous relaxation to circumvent the nondifferentiablility in L making practical implementation easier (Jang et al., 2017; Maddison et al., 2017). Specifically, the Gumbel-softmax (GS) distribution is used for continuous relaxation, that is q(zlj ) ∼ Ber(γlj ) is approximated by q(z̃lj ) ∼ GS(γlj , τ ), where z̃lj = (1 + exp(−ηlj /τ ))−1 , ηlj = log(γlj /(1 − γlj )) + log(ulj /(1 − ulj )), ulj ∼ U (0, 1) where τ is the temperature. We keep τ = 0.5 for all the experiments similar to (Jantre et al., 2021a). The use of z̃lj in the backward pass eases gradient calculation, while zlj is used in the forward pass for exact node sparsity. 112 Algorithm 4.1 Variational inference in SS-GL and SS-GHS Bayesian neural networks Inputs: training dataset, network architecture, and optimizer tuning parameters. Model inputs: prior parameters for T = (θ, z, τ 2 ) in SS-GL and T = (θ, z, τ 2 , s2 ) in SS-GHS. Variational inputs: number of Monte Carlo samples S. Output: Variational parameter estimates of network weights, scales, and sparsity. Method: Set initial values of variational parameters. repeat Generate S samples of βlj , zlj , z̃lj , τlj2 , s2 (for SS-GHS) Use βlj , τlj2 , s2 and zlj to compute L in forward pass Use βlj , τlj2 , s2 and z̃lj to compute gradient of L in backward pass Update the variational parameters with gradient of loss using stochastic gradient descent algorithm (e.g. SGD with momentum (Sutskever et al., 2013)) until change in ELBO < ϵ 4.3 Numerical Experiments In this section, we demonstrate the performance of our proposed SS-GL and SS-GHS approaches on network architectures and techniques used in practice. We consider multilayer perceptron (MLP), LeNet-5-Caffe, and ResNet architectures which we implement in PyTorch (Paszke et al., 2019). We perform image classification using aforementioned neural networks in widely used MNIST, Fashion-MNIST, and CIFAR-10 datasets. 
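For concreteness, the continuous relaxation described in Section 4.2.3 can be sketched in PyTorch as below. The function name is illustrative, and the straight-through pairing of the hard gate zlj with the soft gate z̃lj is one common way to realize the forward/backward split described there, not necessarily the exact implementation used in this work.

```python
import torch

# Sketch of the Gumbel-softmax relaxation of the Bernoulli node-inclusion gates.
# The soft gate z_tilde is differentiable in gamma; the hard 0/1 gate used in the
# forward pass is attached via a straight-through trick (an assumption here).

def relaxed_bernoulli_gate(gamma, temperature=0.5, eps=1e-8):
    """Sample a node-inclusion gate from the relaxation of Ber(gamma)."""
    u = torch.rand_like(gamma)
    eta = torch.log(gamma + eps) - torch.log(1.0 - gamma + eps) \
        + torch.log(u + eps) - torch.log(1.0 - u + eps)
    z_tilde = torch.sigmoid(eta / temperature)   # soft gate, differentiable in gamma
    z_hard = (z_tilde > 0.5).float()             # exact 0/1 gate (equals a Ber(gamma) draw)
    # Forward value is z_hard; the gradient flows through z_tilde.
    return z_hard + (z_tilde - z_tilde.detach())

gamma = torch.full((400,), 0.9, requires_grad=True)  # variational inclusion probabilities
z = relaxed_bernoulli_gate(gamma)
z.sum().backward()                                   # gradients reach gamma
print(z[:5], gamma.grad[:5])
```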
In all the experiments, we fix σ0² = 1 and σe² = 1. The remaining tuning-parameter details, such as the learning rate, minibatch size, and initial parameter choices, are provided in the Appendix. The prediction accuracy is calculated using the variational Bayes posterior mean estimator with 10 Monte Carlo samples at test time. We use swish (SiLU) activations (Elfwing et al., 2018; Ramachandran et al., 2017) instead of ReLU in our proposed SS-GL and SS-GHS models, similar to Jantre et al. (2021a)'s spike-and-slab Gaussian node selection model (SS-IG), to avoid the dying-neuron problem (Lu et al., 2020). Smoother activation functions such as sigmoid and tanh also help alleviate this problem; we choose swish since it has the best performance.

We provide node sparsity estimates for each linear hidden layer separately. For all models, the node sparsity in a given linear layer is the ratio of the number of neurons with at least one nonzero incoming edge to the original number of neurons present in that layer before training. For a convolution layer we provide the channel sparsity estimate, which is the ratio of the number of output channels with at least one nonzero incoming connection to the total number of output channels present in the dense counterpart. The layer-wise node or channel sparsity estimates provide a granular illustration of the structural compactness of the trained model. The structural sparsity in the trained model leads to lower computational complexity at test time, which is vital for resource-constrained devices.

4.3.1 MLP MNIST Classification

In this experiment, we use an MLP model with 2 hidden layers having 400 nodes per layer to fit the MNIST data, which consists of 60,000 small square 28×28-pixel grayscale images of handwritten single digits between 0 and 9. We preprocess the images in the MNIST data by dividing their pixel values by 126. The output layer has 10 neurons, since there are 10 classes in the MNIST data. We provide the graph of the prediction accuracy on the i.i.d. test data over the training period of 1200 epochs. We provide layer-wise node sparsity plots for both layers to highlight the dynamic structural compactness of the model during training. In what follows, we discuss the choice of ς² in the SS-GL model as well as the choice of creg values in the SS-GHS model.

SS-GL penalty parameter choice. In the SS-GL model, the value of ς² needs to be carefully tuned in numerical experiments (Xu and Ghosh, 2015). A very large value of ς² will overshrink the network weights, leading to biased estimates, whereas ς² → 0 will lead to a very diffuse distribution for the slab part. Instead, we place a conjugate gamma prior on the penalty parameter, ς² ∼ Γ(c, d), and estimate it through our variational inference framework via the approximating family q(ς²) := LN(µς, σς²).

Figure 4.2 summarizes the results of the MLP-MNIST experiment using the SS-GL model with fixed ς² = 1 and variable ς² ∼ Γ(c = 4, d = 2). The values of the shape (c = 4) and rate (d = 2) parameters were chosen based on a hyperparameter search and past literature.

Figure 4.2 SS-GL penalty parameter choice experiment results. Here, we demonstrate the performance of SS-GL with fixed ς² = 1 and variable ς² ∼ Γ(c = 4, d = 2). (a) The classification accuracy on the test data. (b) and (c) The node sparsity in layer-1 and layer-2 of the network, respectively. We observe that placing a prior on ς² yields better classification accuracy.
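The layer-wise node and channel sparsity summaries reported in the following figures can be computed directly from the pruned weight tensors; a minimal sketch is given below, with illustrative function names.

```python
import torch

# Sketch of the structural sparsity summaries described at the start of Section 4.3.

def node_sparsity(weight):
    """Fraction of output neurons of a linear layer with at least one nonzero incoming edge.

    `weight` is the (out_features, in_features) matrix of a pruned linear layer.
    """
    active = (weight != 0).any(dim=1)
    return active.float().mean().item()

def channel_sparsity(weight):
    """Fraction of output channels of a Conv2d layer with at least one nonzero incoming connection.

    `weight` has shape (out_channels, in_channels, k_h, k_w).
    """
    active = (weight.flatten(start_dim=1) != 0).any(dim=1)
    return active.float().mean().item()

# Example on a hypothetical pruned 400 x 785 weight matrix with half the rows zeroed out.
w = torch.randn(400, 785)
w[200:] = 0.0
print(node_sparsity(w))   # ~0.5
```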
We observe that inferring the value of ς² through Bayesian estimation significantly improves the predictive accuracy compared to fixed ς² (Figure 4.2a). The fixed-ς² model has better node sparsities in both layers of the MLP model (Figures 4.2b and 4.2c). This suggests that ς² = 1 might be overshrinking the weights, which assists in pruning them via the spike-and-slab prior but also hampers the predictive performance of the model. In the rest of the experiments involving SS-GL, we place the gamma prior ς² ∼ Γ(c = 4, d = 2) on the penalty parameter.

SS-GHS regularization constant choice. In what follows, we provide the MLP-MNIST experiment using the SS-GHS model with regularization constant values creg = 1 and creg = kl + 1. In the MLP, kl + 1 = 400 + 1 = 401 is a large constant and essentially acts as an unregularized model. We ran the unregularized version of the model and verified this claim, but do not provide the results for brevity. Figure 4.3 summarizes the results of the MLP-MNIST experiment using the SS-GHS model with creg = 1 and creg = kl + 1. We observe that both values of creg lead to the same predictive accuracy on the test data. However, in the creg = 1 scenario, the SS-GHS model has better layer-1 node sparsity (Figure 4.3b); the layer-2 node sparsity is the same for both creg values (Figure 4.3c). In the rest of the experiments involving SS-GHS, we choose creg = 1.

Figure 4.3 SS-GHS regularization constant choice experiment results. We demonstrate the performance of SS-GHS with regularization constants creg = 1 and creg = kl + 1 = 401. (a) The classification accuracy on the test data. (b) and (c) The node sparsity in layer-1 and layer-2 of the network, respectively. We observe that both creg choices lead to similar classification accuracies, with creg = 1 having better layer-1 node sparsity.
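The role of creg can be seen directly from the regularized-horseshoe scale τ̃lj² = c²τlj²/(c² + τlj²s²) of Section 4.2.2. The short sketch below (arbitrary values, with creg playing the role of c) shows how creg = 1 caps the slab scale of strongly expressed nodes, while creg = kl + 1 = 401 leaves it essentially unregularized.

```python
import torch

# Sketch of the regularized-horseshoe local scale; all numerical values are arbitrary.

def regularized_scale(tau2, s2, c_reg):
    c2 = c_reg ** 2
    return c2 * tau2 / (c2 + tau2 * s2)

tau2 = torch.tensor([1e-4, 1.0, 1e4])   # weakly to strongly expressed nodes
s2 = torch.tensor(1.0)
for c_reg in (1.0, 401.0):               # c_reg = 1 vs c_reg = k_l + 1 = 401
    print(c_reg, regularized_scale(tau2, s2, c_reg))
```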
MLP-MNIST comparison with SS-IG

We provide the MLP-MNIST experiment in which we compare our proposed models with the SS-IG model. The results are presented in Figure 4.4. We provide the test data accuracy, model compression ratio, FLOPs ratio, and layer-wise node sparsities in each experiment.

Additional metrics. We provide two additional metrics that relate to model compression and computational complexity. (i) Compression ratio: the ratio of the number of nonzero weights in the compressed network to that of the dense model; it is an indicator of the storage cost at test time. (ii) Floating point operations (FLOPs) ratio: the ratio of the number of FLOPs required to predict the output from the input at test time in the compressed network to that of its dense counterpart. We have detailed the FLOPs calculation for neural networks in Chapter 3, Appendix B. Layer-wise node and channel sparsities are directly related to the FLOPs ratio, hence we only provide the FLOPs ratio for the LeNet-5-Caffe and ResNet models.

Figure 4.4 MLP/MNIST experiment results. Here, we demonstrate the performance of our SS-GL (ς² ∼ Γ(c = 4, d = 2)) and SS-GHS (creg = 1) models compared against the SS-IG model. (a) The classification accuracy on the test data. (b) and (c) The compression ratio and FLOPs ratio. (d) and (e) The node sparsity in layer-1 and layer-2 of the network, respectively. We observe that our SS-GHS yields the most compact network with the best classification accuracy.

In Figure 4.4a, we observe that SS-GHS has better predictive accuracy than the SS-GL and SS-IG models. Moreover, the SS-GHS model not only has the lowest storage cost among the node selection models compared (Figure 4.4b) but also requires the fewest FLOPs for inference at test time (Figure 4.4c). In Figures 4.4d and 4.4e, we observe that SS-GHS has pruned away the largest number of nodes, in contrast to the SS-GL and SS-IG models, and this also leads to the maximum reduction in FLOPs, evident from Figure 4.4c. Lastly, the SS-GL and SS-IG models have similar predictive accuracies; however, SS-GL has lower layer-wise node sparsities in both layers, hence a lower FLOPs ratio, and it also has a lower storage cost at test time compared to SS-IG.

4.3.2 LeNet-5-Caffe Experiments

The results of the more complex LeNet-5-Caffe network experiments on MNIST and Fashion-MNIST are presented in Figure 4.5. We provide the test data accuracy, model compression ratio, and FLOPs ratio in each experiment over 1200 epochs. Here, the FLOPs ratio serves as a collective indicator of the layer-wise node sparsities, since FLOPs are directly related to how many neurons or channels remain in the linear or convolution layers, respectively.

In the LeNet-5-Caffe/MNIST experiment (Figures 4.5a-4.5c), we observe that our SS-GHS and SS-GL models have better predictive accuracy than SS-IG (Figure 4.5a). We observe that both the SS-GHS and SS-GL models have a better model compression ratio (Figure 4.5b). Moreover, all three models compared achieve a similar reduction in FLOPs (Figure 4.5c). In contrast with the MLP-MNIST experiment (Figure 4.4), our SS-GHS and SS-GL have the same performance on all metrics in the LeNet-5-Caffe/MNIST experiment.

In the LeNet-5-Caffe/Fashion-MNIST experiment (Figures 4.5d-4.5f), we observe that SS-GHS has better predictive accuracy than the SS-GL and SS-IG models. The storage cost reduction in the SS-GHS model is similar to SS-IG but better than SS-GL (Figure 4.5e). Next, SS-IG achieves the best reduction in FLOPs compared to both of our approaches, and SS-GHS has lower FLOPs than SS-GL. Lastly, the SS-GL and SS-IG models have similar predictive accuracies; however, SS-IG has lower FLOPs and storage cost at test time.

Figure 4.5 LeNet-5-Caffe/MNIST and LeNet-5-Caffe/Fashion-MNIST experiments results. Top row (a)-(c): the test accuracy, compression ratio, and FLOPs ratio for LeNet-5-Caffe on MNIST. Bottom row (d)-(f): the corresponding results for LeNet-5-Caffe on Fashion-MNIST.

4.3.3 Residual Network Experiments

This section presents an example demonstrating the trade-off between the computational complexity and memory cost at test time among our structured pruning methods and recent unstructured pruning methods in Residual Networks (ResNets) applied to the CIFAR-10 dataset (He et al., 2016). The CIFAR-10 dataset (Krizhevsky, 2009) consists of 60000 32×32 colour images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. We use the ResNet-20 and ResNet-32 architectures and follow the experimental setting provided by Sun et al. (2021). We compare our proposed SS-GL and SS-GHS methods with SS-IG (Jantre et al., 2021a), consistent sparse deep learning (BNNcs) (Sun et al., 2021), and the variational BNN with a mixture Gaussian prior (VBNN) (Blundell et al., 2015). In all of our experiments, we follow the same training setup as used in Sun et al.
(2021); that is, each model considered in Table 4.1 was trained using SGD with momentum for 300 epochs with a mini-batch size of 128, the momentum parameter was set to 0.9, and the initial learning rate was set to 0.1. The training data was preprocessed using the random erasing data augmentation strategy proposed by Zhong et al. (2020). We use a step-wise constant learning rate schedule and decrease the learning rate by a factor of 10 at epochs 150 and 225. We refine the parameters associated with the sparse subnetwork obtained after the first 300 epochs for an additional 100 epochs, during which the sparsity parameters in SS-IG and our models are not trained and only the parameters associated with the slab part are learned. We use swish activations in our SS-GL and SS-GHS models as well as in SS-IG. For the BNNcs and VBNN models, we use ReLU activations as recommended by their authors. In all the models, we set σe² = 1 and σ0² = 0.04, the same as Sun et al. (2021). In SS-IG, SS-GL, and SS-GHS we use a common λl = 10−4 after a hyperparameter search, since our theory does not cover Bayesian CNNs. However, establishing posterior consistency for Bayesian CNNs is an interesting future direction of work.

We quantify the predictive performance using the accuracy on the test data. Besides the test accuracy, we use compression (%) and pruned FLOPs (%), which are the compression ratio and FLOPs ratio discussed earlier converted to percentages. In this experiment, we only count parameters and FLOPs over the convolutional layers and the last fully connected layer, because our proposed methods focus on channel and node pruning of convolutional and linear layers, respectively. For the ResNet architectures, our proposed methods under the centered parameterization used in the previous experiments (Sections 4.3.1 and 4.3.2) have unstable performance. Instead, we incorporate a non-centered parameterization (Ghosh et al., 2019) to stabilize the training. Below we detail the non-centered parameterization strategy.

Non-Centered Parameterization

We adopt a non-centered parameterization for the Gaussian slab component in both prior setups to circumvent the pathological funnel-shaped geometries associated with the coupled posterior (Ingraham and Marks, 2017; Ghosh et al., 2019). Accordingly, the coupling between the weights wlj and the scales τ*lj² = τlj² (for SS-GL) or τ*lj² = τ̃lj² s² (for SS-GHS) is reformulated as

βlj ∼ N(0, σ0² I),  wlj = τ*lj βlj.

This formulation leads to independent sampling of the weights and scales from their respective prior distributions, which are now marginally uncorrelated, leading to simpler posterior geometries (Betancourt and Girolami, 2015). The non-centered reparameterization thus yields efficient posterior inference without changing the functional form of the respective prior.

We summarize the ResNet experiment results in Table 4.1. The comparison with the BNNcs and VBNN models indicates that our SS-GL and SS-GHS methods have significantly better prediction accuracy in both the ResNet-20 and ResNet-32 setups. Moreover, we demonstrate that even though the BNNcs and VBNN models have predefined high levels of pruned parameters, our models require significantly fewer FLOPs during inference at test time. This highlights the trade-off between unstructured and structured sparsity methods, where the former leads to a significant reduction in storage cost and the latter has significantly lower computational complexity at test time.

Table 4.1 ResNet-20/CIFAR-10 and ResNet-32/CIFAR-10 experiments results.
The results of each method are calculated by averaging over 3 independent runs, with the standard deviation reported in parentheses. For the BNNcs and VBNN models, we show the predefined percentages of pruned parameters used for magnitude pruning given in Sun et al. (2021).

Model | Method | Test Accuracy | Compression (%) | Pruned FLOPs (%)
ResNet-20 | BNNcs (20%) | 92.23 (0.16) | 19.29 (0.12) | 98.94 (0.38)
ResNet-20 | BNNcs (10%) | 91.43 (0.11) | 9.18 (0.13) | 99.13 (0.37)
ResNet-20 | VBNN (20%) | 89.61 (0.04) | 19.55 (0.01) | 100.00 (0.00)
ResNet-20 | VBNN (10%) | 88.43 (0.13) | 9.50 (0.00) | 99.93 (0.00)
ResNet-20 | SS-IG | 92.94 (0.15) | 79.52 (0.98) | 88.39 (1.00)
ResNet-20 | SS-GL (ours) | 92.99 (0.11) | 76.10 (1.55) | 85.15 (1.76)
ResNet-20 | SS-GHS (ours) | 92.87 (0.23) | 78.70 (0.42) | 86.18 (1.02)
ResNet-32 | BNNcs (10%) | 92.65 (0.03) | 9.15 (0.03) | 94.53 (0.86)
ResNet-32 | BNNcs (5%) | 91.39 (0.08) | 4.49 (0.02) | 90.79 (1.35)
ResNet-32 | VBNN (10%) | 89.37 (0.04) | 9.61 (0.01) | 99.99 (0.02)
ResNet-32 | VBNN (5%) | 87.38 (0.22) | 4.59 (0.01) | 94.27 (0.54)
ResNet-32 | SS-IG | 93.08 (0.23) | 55.28 (2.96) | 67.59 (2.36)
ResNet-32 | SS-GL (ours) | 93.33 (0.11) | 54.27 (1.73) | 66.93 (2.98)
ResNet-32 | SS-GHS (ours) | 93.15 (0.23) | 53.72 (2.11) | 66.68 (2.75)

In comparison with the SS-IG node selection model, we observe that our SS-GL and SS-GHS models have lower storage cost and FLOPs at test time with comparable predictive accuracy in the ResNet-20 architecture. In the ResNet-32 case, SS-GL has better predictive accuracy than SS-IG, while the compression (%) and pruned FLOPs (%) of our models are comparable to SS-IG within one standard deviation. This comparison further highlights the advantage of using group shrinkage priors instead of a Gaussian in the slab part of the spike-and-slab framework to achieve better test accuracies with a lower computational and memory footprint.

4.4 Conclusion and Discussion

In this chapter, we introduced compact Bayesian neural network methods that handle model compression in a principled manner. Our proposed spike-and-slab models combine automated sparsity learning with hierarchical group shrinkage priors: group lasso and group horseshoe. We provide computationally efficient and scalable variational inference algorithms for both models. In the large-scale experiments involving ResNet architectures, we relied on the non-centered parameterization, which ensured the numerical stability of our models. We demonstrate the superior performance of the group shrinkage priors over the Gaussian prior in the slab component in several experiments, which further highlights our point that group shrinkage priors shrink the collection of weights incident on a node close to zero, which in turn helps remove that node through the spike-and-slab framework. An immediate future work would be to establish the variational posterior consistency and the corresponding contraction rate for both models. Moreover, the superiority of the SS-GL and SS-GHS models over the SS-IG model could be established through a faster posterior convergence rate for the former models compared to the latter.

APPENDIX

ADDITIONAL NUMERICAL EXPERIMENTS DETAILS

A.1 Variational parameters initialization

We initialize the γlj's at a value close to 1 for all of our experiments. This ensures that at epoch 0 we have a fully connected deep neural network. The variational parameters µljj′ are initialized using Kaiming uniform initialization (He et al., 2015). Moreover, the σljj′ are reparameterized using the softplus function, σljj′ = log(1 + exp(ρljj′)), and the ρljj′ are initialized at a constant value of −6.
This keeps the initial values of σljj′ close to 0, ensuring that the initial network weights stay close to the Kaiming uniform initialization.

In SS-GL, µlj^{τ} is initialized using U(−0.6, 0.6) and σlj^{τ} = log(1 + exp(ρlj^{τ})), where the ρlj^{τ} are initialized to −6. Moreover, µς is initialized to 1 and σς = log(1 + exp(ρς)), where ρς is initialized to −6.

In SS-GHS, µlj^{α} and µlj^{β} are initialized using U(−0.6, 0.6), and σlj^{α} = log(1 + exp(ρlj^{α})) and σlj^{β} = log(1 + exp(ρlj^{β})), where ρlj^{α} and ρlj^{β} are initialized to −6. Next, µ^{sa} and µ^{sb} are initialized to 1, and σ^{sa} = log(1 + exp(ρ^{sa})) and σ^{sb} = log(1 + exp(ρ^{sb})), where ρ^{sa} and ρ^{sb} are initialized to −6.

A.2 Hyperparameters for training

In the MLP-MNIST and LeNet-5-Caffe-MNIST experiments, we use a 10−3 learning rate and a 1024 minibatch size for all three models compared. In the LeNet-5-Caffe-Fashion-MNIST experiment, we train SS-GL with a 10−3 learning rate, whereas SS-GHS and SS-IG are trained with 2 × 10−3; the minibatch size is 1024 for all three models compared. We train each model for 1200 epochs using the Adam optimizer in all the MLP and LeNet-5-Caffe experiments.

CHAPTER 5

SEQUENTIAL BAYESIAN NEURAL SUBNETWORK ENSEMBLES

5.1 Introduction

Bayesian neural networks (BNNs) have pushed the envelope of probabilistic machine learning through the combination of deep neural network architectures and Bayesian inference. However, due to the enormous number of parameters, BNNs adopt approximate inference techniques such as variational inference with a fully factorized approximating family (Jordan et al., 1999). Although this approximation is crucial for computational tractability, it could lead to under-utilization of a BNN's true potential (Izmailov et al., 2021). Recently, ensembles of neural networks (Lakshminarayanan et al., 2017) have been proposed to account for parameter/model uncertainty, which has been shown to be analogous to Bayesian model averaging and to sampling from the parameter posteriors in the Bayesian context to estimate the posterior predictive distribution (Wilson and Izmailov, 2020). In this spirit, the diversity of the ensemble has been shown to be key to improving the predictions, uncertainty, and robustness of the model. To this end, diverse ensembles can mitigate some of the shortcomings introduced by approximate Bayesian inference techniques without compromising computational tractability. Several different diversity-inducing techniques have been explored in the literature. The approaches range from using a specific learning rate schedule (Huang et al., 2017), to introducing kernelized repulsion terms among the ensembles in the loss function at train time (D'Angelo and Fortuin, 2021), mixtures of approximate posteriors to capture multiple posterior modes (Dusenberry et al., 2020), appealing to sparsity (albeit ad hoc) as a mechanism for diversity (Havasi et al., 2021; Liu et al., 2022), and finally appealing to diversity in model architectures through neural architecture and hyperparameter searches (Egele et al., 2021; Wenzel et al., 2020).

However, most approaches prescribe parallel ensembles, with each individual model in an ensemble starting from a different initialization, which can be expensive in terms of computation, as each ensemble member has to train longer to reach a high-performing neighborhood of the parameter space.
Although the aspect of ensemble diversity has taken center stage, the cost of training these ensembles has not received much attention. However, given that model sizes only keep growing as we advance in deep learning, it is crucial to reduce the training cost of the multiple individual models forming an ensemble in addition to increasing their diversity. To this end, sequential ensembling techniques offer an elegant solution to reduce the cost of obtaining multiple ensembles; their origin can be traced all the way back to Swann and Allinson (1998) and Xie et al. (2013), wherein ensembles are created by combining epochs in the learning trajectory. Jean et al. (2015) and Sennrich et al. (2016) use intermediate stages of model training to obtain the ensembles, and Moghimi et al. (2016) used boosting to generate ensembles. In contrast, recent works by Huang et al. (2017); Garipov et al. (2018); Liu et al. (2022) force the model to visit multiple local minima by cyclic learning rate annealing and collect ensembles only when the model reaches a local minimum. Notably, the aforementioned sequential ensembling techniques in the literature have been proposed in the context of deterministic machine learning models. Extending the sequential ensembling technique to Bayesian neural networks is attractive because we can potentially get high-performing ensembles without the need to train from scratch, analogous to sampling with a Markov chain Monte Carlo sampler that extracts samples from the posterior distribution. Furthermore, sequential ensembling is complementary to the parallel ensembling strategy: if the models and computational resources permit, each parallel ensemble member can generate multiple sequential ensembles, leading to an overall increase in the total number of diverse models in an ensemble.

On the other hand, Chapters 3 and 4 make a case for (i) automatic data-driven sparsity learning in Bayesian neural networks through the use of spike-and-slab priors, and (ii) the use of group sparsity priors (Louizos et al., 2017; Ghosh et al., 2019; Jantre et al., 2021a) to provide structural sparsity in Bayesian neural networks, leading to significant computational gains. In this work, we leverage automated structural sparsity learning using spike-and-slab priors, similar to Jantre et al. (2021a), in our approach to sequentially generate multiple Bayesian neural subnetworks with varying sparse connectivities, which when combined yield a highly diverse ensemble. To this end, we propose Sequential Bayesian Neural Subnetwork Ensembles (SeBayS) with the following major contributions:

• We propose a sequential ensembling strategy for Bayesian neural networks (BNNs) which learns multiple subnetworks in a single forward pass. The approach involves a single exploration phase with a large (constant) learning rate to find high-performing sparse network connectivity, yielding a structurally compact network. This is followed by multiple exploitation phases with sequential perturbation of the variational mean parameters using the corresponding variational standard deviations, together with piecewise-constant cyclic learning rates (a schematic sketch of this schedule follows this list).

• We combine the strengths of the automated sparsity-inducing spike-and-slab prior, which allows dynamic pruning during training and produces structurally sparse BNNs, and the proposed sequential ensembling strategy to efficiently generate diverse and sparse Bayesian neural networks, which we refer to as Bayesian neural subnetworks.
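As a rough illustration of the exploration/exploitation schedule referenced in the first contribution (detailed in Section 5.2.2), the sketch below encodes a single long constant-rate exploration phase followed by repeated piecewise-constant cycles; all phase lengths and learning-rate values here are hypothetical placeholders, not the settings used in the experiments.

```python
# Schematic learning-rate schedule: exploration with a large constant rate for t0
# epochs, then exploitation cycles of length t_ex that use a moderate rate for the
# first half and a small rate for the second half.  All values are illustrative.

def sebays_learning_rate(epoch, t0=150, t_ex=100, lr_explore=0.1,
                         lr_high=0.01, lr_low=0.001):
    if epoch < t0:                        # exploration phase
        return lr_explore
    phase_epoch = (epoch - t0) % t_ex     # position inside the current exploitation cycle
    return lr_high if phase_epoch < t_ex // 2 else lr_low

# Example: the schedule at epochs spanning exploration and two exploitation cycles.
for e in (0, 149, 150, 199, 200, 249, 250, 349):
    print(e, sebays_learning_rate(e))
```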
5.1.1 Related Work

Ensembles of neural networks: Ensembling techniques in the context of neural networks are increasingly being adopted in the literature due to their potential to improve accuracy and robustness and to quantify uncertainty. The simplest and most widely used approach is Monte Carlo dropout, which is based on Bernoulli noise (Gal and Ghahramani, 2016) and deactivates certain units during training and testing. This, along with techniques such as DropConnect (Wan et al., 2013) and Swapout (Singh et al., 2016), is referred to as an "implicit" ensemble, as the model ensembling happens internally in a single model. Although such methods are efficient, the gain in accuracy and robustness is limited, and they are mainly used in the context of deterministic models. Although most recent approaches have targeted parallel ensembling techniques, a few approaches appeal to parameter efficiency, such as BatchEnsemble (Wen et al., 2020), which decomposes ensemble members into a product of a shared matrix and a rank-one matrix and uses the latter for ensembling, and MIMO (Havasi et al., 2021), which discovers subnetworks from a larger network via a multi-input multi-output configuration. In the context of Bayesian neural network ensembles, Dusenberry et al. (2020) proposed a rank-1 parameterization of BNNs, where each weight matrix involves only a distribution on a rank-1 subspace, and used mixture approximate posteriors to capture multiple modes.

Sequential ensembling techniques offer an elegant solution to ensemble training but have not received much attention recently due to the community's wider focus on the diversity of ensembles and less on the computational cost. Notable sequential ensembling techniques are Huang et al. (2017); Garipov et al. (2018); Liu et al. (2022), which enable the model to visit multiple local minima through cyclic learning rate annealing and collect ensembles only when the model reaches a local minimum. The difference is that Huang et al. (2017) adopts cyclic cosine annealing, while Garipov et al. (2018) uses a piecewise-linear cyclic learning rate schedule inspired by geometric insights. Finally, Liu et al. (2022) adopts a piecewise-constant cyclic learning rate schedule. We also note that all of these approaches have been proposed primarily in the context of deterministic neural networks.

Our proposed approach (i) introduces sequential ensembling into Bayesian neural networks, (ii) combines it with dynamic sparsity through sparsity-inducing Bayesian priors to generate Bayesian neural subnetworks, and subsequently (iii) produces diverse model ensembles efficiently. It is also complementary to other parallel ensembling as well as efficient ensembling techniques.

5.2 Sequential Bayesian Neural Subnetwork Ensembles

5.2.1 Base Learner

We call the individual models that are part of an ensemble the "base learners". Here we provide the prior and corresponding fully factorized variational family that we use in our proposed sequential ensembles.

Prior Choice. A zero-mean Gaussian distribution is a widely popular choice of prior for the model parameters θ (Izmailov et al., 2021; Louizos et al., 2017; Mackay, 1992; Neal, 1996; Blundell et al., 2015). In our sequential ensemble of dense BNNs, we adopt the zero-mean Gaussian prior, similar to Blundell et al. (2015), in each individual BNN model that is part of an ensemble.
The prior and corresponding fully factorized variational family are given as follows:

p(θljk) = N(0, σ0²),  q(θljk) = N(µljk, σljk²)

where θljk is the k-th weight incident onto the j-th node (in an MLP) or output channel (in a CNN) in the l-th layer. N(., .) represents the Gaussian distribution. σ0² is the constant prior Gaussian variance and is chosen through a hyperparameter search. µljk and σljk are the variational mean and standard deviation parameters of q(θljk).

Dynamic sparsity learning for our sequential ensemble of sparse BNNs is achieved via the spike-and-slab prior. We adopt the sparse BNN model SS-IG (Jantre et al., 2021a) as the base learner to achieve structural sparsity in Bayesian neural networks. The prior and corresponding variational family of the SS-IG model are given in Section 2.3.

5.2.2 Sequential Ensembling and Bayesian Neural Subnetworks

We propose an ensembling procedure to obtain the base learners {θ^1, θ^2, · · · , θ^M} sequentially in a single training run and construct the ensemble. The ensemble predictions are calculated using the uniform average of the predictions obtained from each base learner. Specifically, if y_new^m represents the prediction of the m-th base learner, then the ensemble prediction of the M base learners (for continuous outcomes) is y_new = (1/M) Σ_{m=1}^{M} y_new^m.

Sequential Perturbations. Our ensembling strategy produces a diverse set of base learners from a single end-to-end training process. It consists of an exploration phase followed by M exploitation phases. The exploration phase is carried out with a large constant learning rate for a time t0. This allows us to explore high-performing regions of the parameter space. At the conclusion of the exploration phase, the variational posterior approximation for the model parameters reaches a good region of the posterior density surface. Next, during each equally spaced exploitation phase (of length tex) of the ensemble training, we first use a moderately large learning rate for time tex/2, followed by a small learning rate for the remaining tex/2. After the first model convergence step (time = t0 + tex), we perturb the mean parameters of the variational posterior distributions of the model weights using their corresponding standard deviations. The initial values of these mean variational parameters at each subsequent exploitation phase become µ′ljk = µljk ± ρ · σljk, where ρ is a perturbation factor. This perturbation and subsequent model learning strategy is repeated a total of M − 1 times, generating the M base learners (either dense or sparse BNNs) that form our sequential ensemble.

Sequential Bayesian Neural Subnetwork Ensemble (SeBayS). In this ensembling procedure we use a large (and constant) learning rate (e.g., 0.1) in the exploration phase to find a high-performing sparse network connectivity, in addition to exploring a wide range of model parameter variations. The use of a large learning rate facilitates the pruning of excessive nodes or output channels, leading to a compact Bayesian neural subnetwork. This structural compactness of the Bayesian neural subnetwork further helps after each sequential perturbation step by quickly converging to different local minima, potentially corresponding to different modes of the true Bayesian posterior distribution of the model parameters.

Freeze vs No Freeze Sparsity.
In our SeBayS ensemble, we propose to evaluate two different strategies during the exploitation phases: (1) SeBayS-Freeze: freezing the sparse connectivity after the exploration phase, and (2) SeBayS-No Freeze: letting the sparsity parameters continue to learn after the exploration phase. The first approach fixes the sparse connectivity, leading to lower computational complexity during the exploitation-phase training. The diversity in the SeBayS-Freeze ensemble is achieved via sequential perturbations of the mean parameters of the variational distribution of the active model parameters in the subnetwork. The second approach lets the sparsity learn beyond the exploration phase, leading to highly diverse subnetworks at the expense of more computational complexity compared to the SeBayS-Freeze approach.

Algorithm 5.1 Sequential Bayesian neural subnetwork ensemble (SeBayS) algorithm
1: Inputs: training data D = {(xi, yi)}_{i=1}^{N}, network architecture ηθ, ensemble size M, perturbation factor ρ, exploration phase training time t0, training time of each exploitation phase tex. Model inputs: prior hyperparameters for θ, z (for sparse models).
2: Output: Variational parameter estimates of network weights, scales, and sparsity.
3: Method: Set initial values of variational parameters: µinit, σinit, γinit.
   # Exploration Phase
4: for t = 1, 2, . . . , t0 do
5:   Update µlj^0, σlj^0, and γlj^0 (for sparse models) ← SGD(L).
6: end for
7: Fix the sparsity variational parameters γlj for freeze sparse models.
   # M Sequential Exploitation Phases
8: for m = 1, 2, . . . , M do
9:   for t = 1, 2, . . . , tex do
10:    Update µlj^m, σlj^m, and γlj^m (for no freeze sparse models) ← SGD(L).
11:  end for
12:  Save the variational parameters of the converged base learner ηθ^m.
13:  Perturb the variational mean parameters using the standard deviations: µinit^{m+1} = µ^m ± ρ · σ^m.
14:  Set the variational standard deviations to a small value: σinit^{m+1} = 10^{−6}.
15: end for

We found that the use of sequential perturbations and dynamic sparsity leads to high-performing subnetworks with different sparse connectivities. Compared to parallel ensembles, we achieve higher ensemble diversity in a single forward pass. The use of a spike-and-slab prior allows us to dynamically learn the sparsity during training, while the Bayesian framework provides uncertainty estimates of the model and sparsity parameters associated with the network. Our approach is the first in the literature that performs sequential ensembling of dynamic sparse neural networks, and more so in the context of Bayesian neural networks.
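The perturbation step in lines 13-14 of Algorithm 5.1 amounts to a one-line update of the variational parameters. A minimal PyTorch sketch is given below, where the random choice of the ± sign per parameter is an assumption and the container and function names are illustrative.

```python
import torch

# Sketch of the between-phase perturbation of the variational means (Algorithm 5.1,
# steps 13-14): mu <- mu +/- rho * sigma, then restart with very small sigma.

@torch.no_grad()
def perturb_variational_means(mu, sigma, rho=3.0, sigma_reset=1e-6):
    """Jitter each variational mean by rho standard deviations and reset sigma."""
    signs = torch.randn_like(mu).sign()   # random +/- per parameter (an assumption)
    mu.add_(signs * rho * sigma)          # mu <- mu +/- rho * sigma
    sigma.fill_(sigma_reset)              # tight posteriors at the start of the next phase
    return mu, sigma

mu, sigma = torch.zeros(5), torch.full((5,), 0.1)
print(perturb_variational_means(mu, sigma, rho=3.0))
```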
5.3 Numerical Experiments

In this section, we demonstrate the performance of our proposed SeBayS approach on network architectures and techniques used in practice. We consider ResNet-32 on CIFAR10 (He et al., 2016) and ResNet-56 on CIFAR100. These networks are trained with batch normalization, stepwise (piecewise constant) decreasing learning rate schedules, and augmented training data. We provide the source code, the details ensuring fairness, uniformity, and consistency in the training and evaluation of these approaches, and the reproducibility considerations for SeBayS and the baseline models in Appendix A.

Baselines. Our baselines include a deterministic deep neural network trained with SGD (frequentist model), BNN (Blundell et al., 2015), the spike-and-slab BNN for node sparsity (Jantre et al., 2021a), single forward pass ensemble models including the rank-1 BNN Gaussian ensemble (Dusenberry et al., 2020), MIMO (Havasi et al., 2021), and the EDST ensemble (Liu et al., 2022), and multiple forward pass ensemble methods: the DST ensemble (Liu et al., 2022) and a dense ensemble of deterministic neural networks. For a fair comparison, we keep the training hardware, environment, data augmentation, and training schedules the same for all models. We adopted and modified the open source code provided by Liu et al. (2022) and Nado et al. (2021) to implement and train the baselines. Further details about model implementation and learning parameters are provided in Appendix A.

Metrics. We quantify predictive accuracy and robustness focusing on the accuracy and negative log-likelihood (NLL) of the i.i.d. test data (CIFAR-10 and CIFAR-100) and the corrupted test data (CIFAR-10-C and CIFAR-100-C) involving 19 types of corruption (e.g., added blur, compression artifacts, frost effects) (Hendrycks and Dietterich, 2019). More details on the evaluation metrics are given in Appendix A.

Results. The results for the CIFAR10 and CIFAR100 experiments are presented in Tables 5.1 and 5.2, respectively. For all ensemble baselines, we keep the number of base models M = 3, similar to Liu et al. (2022). We report the results for sparse models in the upper half and dense models in the lower half of Tables 5.1 and 5.2. In our models, we choose the perturbation factor (ρ) to be 3. See Appendix E for additional results on the effect of the perturbation factor.

We observe that our BNN sequential ensemble consistently outperforms the single sparse and dense models, as well as the sequential ensemble models, in both the CIFAR10 and CIFAR100 experiments. Compared to models with 3 parallel runs, our BNN sequential ensemble outperforms the DST ensemble while being comparable to the dense ensemble in the simpler CIFAR10 experiments. Next, our SeBayS-Freeze and SeBayS-No Freeze ensembles outperform single BNN, SSBNN, and MIMO while being comparable to the deterministic model and rank-1 BNN in the CIFAR10 case, whereas in CIFAR10-C they outperform SSBNN, MIMO, and rank-1 BNN. Additionally, the SeBayS-No Freeze ensemble has performance comparable to the deterministic model and the EDST ensemble, while the SeBayS-Freeze ensemble outperforms both in CIFAR10-C. In the ResNet-32/CIFAR10 case, the SeBayS approach dynamically pruned off close to 50% of the parameters.
Table 5.1 ResNet-32/CIFAR10 experiment results. We mark the best results out of single-pass sparse models in bold and single-pass dense models in blue.

Methods | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓) | # Forward passes (↓)
SSBNN | 91.2 | 0.320 | 67.5 | 1.479 | 1
MIMO (M = 3) | 88.9 | 0.333 | 65.9 | 1.102 | 1
EDST Ensemble (M = 3) | 93.1 | 0.214 | 69.8 | 1.236 | 1
SeBayS-Freeze Ensemble (M = 3) | 92.5 | 0.273 | 70.4 | 1.344 | 1
SeBayS-No Freeze Ensemble (M = 3) | 92.4 | 0.274 | 69.8 | 1.356 | 1
DST Ensemble (M = 3) | 93.3 | 0.206 | 71.9 | 1.018 | 3
Deterministic | 92.6 | 0.378 | 69.9 | 2.143 | 1
BNN | 91.9 | 0.353 | 71.3 | 1.422 | 1
Rank-1 BNN (M = 3) | 92.4 | 0.238 | 68.7 | 1.271 | 1
BNN Sequential Ensemble (M = 3) | 93.8 | 0.265 | 73.3 | 1.341 | 1
Dense Ensemble (M = 3) | 93.8 | 0.214 | 72.5 | 1.381 | 3

In the more complex ResNet-56/CIFAR100 experiment, our SeBayS-Freeze ensemble outperforms SSBNN and MIMO in both CIFAR100 and CIFAR100-C, while it outperforms the deterministic model in CIFAR100-C. Next, our SeBayS-No Freeze ensemble outperforms SSBNN in both CIFAR100 and CIFAR100-C, while it outperforms MIMO in CIFAR100 and the deterministic model in CIFAR100-C. Given the complexity of CIFAR100, our SeBayS approach was able to dynamically prune off close to 18% of the ResNet-56 model parameters.

Table 5.2 ResNet-56/CIFAR100 experiment results. We mark the best results out of single-pass sparse models in bold and single-pass dense models in blue.

Methods | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓) | # Forward passes (↓)
SSBNN | 67.9 | 1.511 | 38.9 | 4.527 | 1
MIMO (M = 3) | 65.8 | 1.528 | 42.3 | 2.522 | 1
EDST Ensemble (M = 3) | 71.9 | 0.997 | 44.3 | 2.787 | 1
SeBayS-Freeze Ensemble (M = 3) | 69.4 | 1.393 | 42.4 | 3.855 | 1
SeBayS-No Freeze Ensemble (M = 3) | 69.4 | 1.403 | 41.7 | 3.906 | 1
DST Ensemble (M = 3) | 74.0 | 0.914 | 46.7 | 2.529 | 3
Deterministic | 69.8 | 1.786 | 41.6 | 5.856 | 1
BNN | 70.4 | 1.335 | 43.2 | 3.774 | 1
Rank-1 BNN (M = 3) | 70.7 | 1.075 | 43.9 | 2.752 | 1
BNN Sequential Ensemble (M = 3) | 72.2 | 1.250 | 44.9 | 3.537 | 1
Dense Ensemble (M = 3) | 74.2 | 1.236 | 45.4 | 4.093 | 3

5.4 Sequential BNN Ensemble Analysis

5.4.1 Function Space Analysis

Quantitative Metrics. We measure the diversity of the base learners in our sequential ensembles by quantifying the pairwise dissimilarity of the base learners' predictions on the test data. The average pairwise dissimilarity is given by

$$D_d = \mathbb{E}\big[d\big(P_1(y \mid x_1, \cdots, x_N),\, P_2(y \mid x_1, \cdots, x_N)\big)\big]$$

where $d(\cdot, \cdot)$ is a distance metric between the predictive distributions and $\{(x_i, y_i)\}_{i=1,\cdots,N}$ are the test data. We consider two distance metrics: (1) Disagreement, the fraction of predictions on the test data on which the base learners disagree,

$$d_{\mathrm{dis}}(P_1, P_2) = \frac{1}{N} \sum_{i=1}^{N} I\big(\arg\max_{\hat{y}_i} P_1(\hat{y}_i) \neq \arg\max_{\hat{y}_i} P_2(\hat{y}_i)\big);$$

and (2) Kullback-Leibler (KL) divergence, $d_{\mathrm{KL}}(P_1, P_2) = \mathbb{E}[\log P_1(y) - \log P_2(y)]$. When two models have the same predictions for all the test data, both the disagreement and the KL divergence are zero.

Table 5.3 Diversity metrics in ResNet-32/CIFAR-10 and ResNet-56/CIFAR100 experiments. We mark the best results out of single-pass models in bold. (Columns 2–4: ResNet-32/CIFAR10; columns 5–7: ResNet-56/CIFAR100.)

Methods | ddis (↑) | dKL (↑) | Acc (↑) | ddis (↑) | dKL (↑) | Acc (↑)
EDST Ensemble | 0.058 | 0.106 | 93.1 | 0.209 | 0.335 | 71.9
BNN Sequential Ensemble | 0.061 | 0.201 | 93.8 | 0.208 | 0.493 | 72.2
SeBayS-Freeze Ensemble | 0.060 | 0.138 | 92.5 | 0.212 | 0.452 | 69.4
SeBayS-No Freeze Ensemble | 0.106 | 0.346 | 92.4 | 0.241 | 0.597 | 69.4
DST Ensemble | 0.085 | 0.205 | 93.3 | 0.292 | 0.729 | 74.0

We report the results of the diversity analysis of the base learners that make up our sequential ensembles in Table 5.3 and compare them with the DST and EDST ensembles. We observe that for the simpler CIFAR10 case, our sequential perturbation strategy helps in generating diverse base learners compared to the EDST ensemble.
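For concreteness, the two diversity metrics defined above (prediction disagreement and pairwise KL divergence) can be computed from the base learners' predicted class probabilities on the test set as in the following sketch; the array names and random example inputs are illustrative assumptions, and the KL expectation is taken under $P_1$ and averaged over test points.

```python
# Minimal sketch of the pairwise diversity metrics (disagreement and KL divergence)
# computed from per-learner predicted probabilities; inputs here are random stand-ins.
import numpy as np

def disagreement(probs_a, probs_b):
    """Fraction of test points on which the two learners' argmax predictions differ."""
    return float(np.mean(probs_a.argmax(axis=1) != probs_b.argmax(axis=1)))

def kl_divergence(probs_a, probs_b, eps=1e-12):
    """Average KL(P_a || P_b) over test points: E[log P_a(y) - log P_b(y)] under P_a."""
    pa = np.clip(probs_a, eps, 1.0)
    pb = np.clip(probs_b, eps, 1.0)
    return float(np.mean(np.sum(pa * (np.log(pa) - np.log(pb)), axis=1)))

# Example with random softmax outputs for two base learners (N = 1000 points, 10 classes).
rng = np.random.default_rng(0)
logits_a, logits_b = rng.normal(size=(1000, 10)), rng.normal(size=(1000, 10))
probs_a = np.exp(logits_a) / np.exp(logits_a).sum(axis=1, keepdims=True)
probs_b = np.exp(logits_b) / np.exp(logits_b).sum(axis=1, keepdims=True)
print(disagreement(probs_a, probs_b), kl_divergence(probs_a, probs_b))
```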
Specifically, the SeBayS-No Freeze ensembles have significantly higher prediction disagreement and KL divergence than all the other methods, even surpassing the DST ensembles, which involve multiple parallel runs. In the more complex CIFAR100 setup, we observe that the SeBayS-No Freeze ensemble has the highest diversity metrics among the single-pass ensemble learners. This highlights the importance of dynamic sparsity learning during each exploitation phase.

Training Trajectory. We use t-SNE (Van Der Maaten and Hinton, 2008) to visualize the training trajectories of the base learners obtained using our sequential ensembling strategy in function space. In the ResNet-32/CIFAR10 experiment, we periodically save checkpoints during each exploitation training phase and collect the predictions on the test dataset at each checkpoint. After training, we use t-SNE plots to project the collected predictions into 2D space. In Figure 5.1, the local optima reached by the individual base learners using sequential ensembling are fairly different in all three models. The distance between the optima can be explained by the fact that the perturbed variational parameters in each exploitation phase try to reach nearby local optima.

Figure 5.1 Training trajectories of base learners in the ResNet-32/CIFAR10 experiment. Training trajectories obtained by the BNN sequential ensemble, SeBayS-Freeze Ensemble, and SeBayS-No Freeze Ensemble; each panel, (a)–(c), plots Learner-1, Learner-2, and Learner-3 in the two t-SNE dimensions.

Figure 5.2 Dynamic sparsity and FLOPs curves. Panels: (a) pruned parameter ratio, (b) pruned FLOPs ratio. The curves show the ratio of remaining parameters and FLOPs for our SeBayS-Freeze and SeBayS-No Freeze ensembles in the ResNet-32/CIFAR10 experiment.

5.4.2 Dynamic Sparsity Learning

In this section, we highlight the dynamic sparsity training in our SeBayS ensemble methods. We focus on the ResNet-32/CIFAR10 experiment and consider M = 3 exploitation phases. In particular, we plot the ratios of remaining parameters and floating point operations (FLOPs) in the SeBayS sparse base learners. In Figure 5.2, we observe that during the exploration phase, SeBayS prunes off 50% of the network parameters and more than 35% of the FLOPs compared to its dense counterpart.

5.4.3 Effect of Ensemble Size

In this section, we explore the effect of the ensemble size M in the ResNet-32/CIFAR10 experiment. According to the ensembling literature (Hansen and Salamon, 1990; Ovadia et al., 2019), increasing the number of diverse base learners in the ensemble improves predictive performance, although with a diminishing impact. In our ensembles, we generate models and aggregate performance sequentially with increasing M, the number of base learners in the ensemble.

Figure 5.3 Predictive performance of the base learners and the sequential ensembles as the ensemble size M varies in the ResNet-32/CIFAR10 experiment. Panels: (a) BNN Sequential Ensemble, (b) SeBayS-Freeze, (c) SeBayS-No Freeze.

In Figure 5.3, we plot the performance of the individual base learners, as well as the sequential ensemble, as M varies. For individual learners, we provide the mean test accuracy with the corresponding one-standard-deviation spread. When M = 1, the ensemble and the individual model refer to a single base learner, and hence their performance is matched.
As M grows, we observe a significant increase in the performance of our ensemble models, with diminishing improvement for higher M. The high performance of our sequential ensembles compared to their individual base models further underscores the benefits of ensembling in a sequential manner.

5.5 Conclusion and Discussion

In this work, we propose the SeBayS ensemble, an approach that generates sequential Bayesian neural subnetwork ensembles by combining a novel sequential ensembling procedure for BNNs with dynamic sparsity driven by a sparsity-inducing Bayesian prior. It provides a simple and effective way to improve predictive performance and model robustness. The highly diverse Bayesian neural subnetworks converge to different optima in function space and, when combined, form an ensemble whose performance improves with increasing ensemble size. Our simple yet highly effective sequential perturbation strategy enables a dense BNN ensemble to outperform the deterministic dense ensemble. Meanwhile, the Bayesian neural subnetworks obtained using the spike-and-slab node pruning prior produce highly diverse ensembles; in particular, the SeBayS-No Freeze ensembles are more diverse than the EDST ensemble in both the CIFAR10 and CIFAR100 experiments and than the DST ensemble in our simpler CIFAR10 experiment.

Future work will explore combining parallel ensembling with our sequential ensembles, leading to a multilevel ensembling model. In particular, we will leverage the exploration phase to reach a highly sparse network, then perturb it more than once and learn each subnetwork in parallel while performing sequential exploitation phases on each subnetwork. We expect this strategy to yield highly diverse base learners with potentially significant improvements in model performance and robustness.

APPENDICES

APPENDIX A REPRODUCIBILITY CONSIDERATIONS

A.1 Hyperparameters

Hyperparameters for single and parallel ensemble models. For the ResNet/CIFAR models, we use a minibatch size of 128 uniformly across all the methods. We train each single model (Deterministic, BNN, SSBNN) as well as each member of the Dense and DST ensembles for 250 epochs with a learning rate of 0.1, which is decayed by a factor of 0.1 at 150 and 200 epochs. For frequentist methods, we use a weight decay of 5e-4, whereas for Bayesian models the weight decay is 0 (since the KL term in the loss acts as a regularizer). For the DST ensemble, we take the sparsity S = 0.8, the update interval ∆T = 1000, and the exploration rate p = 0.5, same as Liu et al. (2022).

Hyperparameters for sequential ensemble models. For the ResNet/CIFAR models, the minibatch size is 128 for all the methods compared. We train each sequential model with M = 3 for 450 epochs. In the BNN and SeBayS ensembles, the exploration phase is run for $t_0 = 150$ epochs and each exploitation phase is run for $t_{ex} = 100$ epochs. We fix the perturbation factor to be 3. During the exploration phase, we take a high learning rate of 0.1, whereas for each exploitation phase, we use a learning rate of 0.01 for the first $t_{ex}/2 = 50$ epochs and 0.001 for the remaining $t_{ex}/2 = 50$ epochs. For the EDST ensemble, we take an exploration time ($t_{ex}$) of 150 epochs, each refinement phase time ($t_{re}$) of 100 epochs, sparsity S = 0.8, and exploration rate q = 0.8, same as Liu et al. (2022).
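For reference, the stepwise schedule described above for the sequential ensembles can be written as a small helper; this is an illustrative sketch rather than the training code, with the phase lengths and learning rates taken from the values stated in this section.

```python
# Illustrative helper reproducing the stepwise learning-rate schedule of the
# sequential ensembles: 150 exploration epochs at 0.1, then exploitation phases
# of 100 epochs split into 50 epochs at 0.01 followed by 50 epochs at 0.001.
def sequential_lr(epoch, t0=150, tex=100, lr_explore=0.1, lr_high=0.01, lr_low=0.001):
    if epoch < t0:
        return lr_explore
    phase_epoch = (epoch - t0) % tex
    return lr_high if phase_epoch < tex // 2 else lr_low

# Example: learning rates at a few phase boundaries of the 450-epoch run.
print([sequential_lr(e) for e in (0, 149, 150, 199, 200, 250, 449)])
# -> [0.1, 0.1, 0.01, 0.01, 0.001, 0.01, 0.001]
```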
A.2 Data Augmentation

For the CIFAR10 and CIFAR100 training datasets, we first pad the training images with 4 pixels of value 0 on all borders and then crop the padded image at a random location, generating training images of the same size as the originals. Next, with a probability of 0.5, we horizontally flip a given cropped image. Finally, we normalize the images using mean = (0.4914, 0.4822, 0.4465) and standard deviation = (0.2470, 0.2435, 0.2616) for CIFAR10, and mean = (0.5071, 0.4865, 0.4409) and standard deviation = (0.2673, 0.2564, 0.2762) for CIFAR100. Next, we split the training data of 50000 images into a TRAIN/VALIDATION split of 45000/5000 transformed images. For the CIFAR10/100 test data, we normalize the 10000 test images in each case using the corresponding mean and standard deviation of their respective training data.

A.3 Evaluation Metrics

We quantify the predictive performance of each method using the accuracy on the test data (Acc). As a measure of robustness or predictive uncertainty, we use the negative log-likelihood (NLL) calculated on the test dataset. Moreover, we adopt {cAcc, cNLL} to denote the corresponding metrics on the corrupted test datasets. We also use the VALIDATION data to determine the best epoch of each model, which is later used for the TEST data evaluation. For the Deterministic model and each member of the Dense, MIMO, and DST ensembles, we use a single prediction for each test data element and calculate the corresponding evaluation metrics for each individual model. For all the Bayesian models, we use one Monte Carlo sample to generate the network parameters and correspondingly generate a single prediction for each single model, which is used to calculate the evaluation metrics for those individual models. For all the ensemble models, we generate a single prediction from each base learner present in the ensemble. Next, we evaluate the ensemble prediction using a simple average of the M predictions generated from the M base learners and use this averaged prediction to calculate the evaluation metrics mentioned above for the ensemble models.

A.4 Hardware and Software

The Deterministic, MIMO, Rank-1 BNN, and Dense Ensemble models are run using the Uncertainty Baselines (Nado et al., 2021) repository, but with the data, model, and hyperparameter settings described in Section 5.3. Moreover, we consistently run all the experiments on a single NVIDIA A100 GPU for all the approaches evaluated in this work.

APPENDIX B OUT-OF-DISTRIBUTION EXPERIMENT RESULTS

In Table B.1, we present the AUROC results for out-of-distribution (OoD) detection for the ResNet-32/CIFAR10 models. In this case, the out-of-distribution data was taken to be CIFAR100. The results show that our SeBayS-Freeze Ensemble performs better than the single SSBNN and MIMO models. On the other hand, the SeBayS-No Freeze Ensemble performs better than SSBNN. Next, our BNN sequential ensemble performs better than the deterministic and BNN models.

Table B.1 OoD detection results in ResNet-32/CIFAR10 experiment. We mark the best results out of single-pass sparse models in bold and single-pass dense models in blue.
Methods | AUROC (↑) | # Forward passes (↓)
SSBNN | 0.806 | 1
MIMO (M = 3) | 0.840 | 1
EDST Ensemble (M = 3) | 0.872 | 1
SeBayS-Freeze Ensemble (M = 3) | 0.864 | 1
SeBayS-No Freeze Ensemble (M = 3) | 0.842 | 1
DST Ensemble (M = 3) | 0.879 | 3
Deterministic | 0.854 | 1
BNN | 0.841 | 1
Rank-1 BNN (M = 3) | 0.866 | 1
BNN Sequential Ensemble (M = 3) | 0.863 | 1
Dense Ensemble (M = 3) | 0.879 | 3

APPENDIX C EFFECT OF THE ENSEMBLE SIZE

In Section 5.4.3, we explored the effect of the ensemble size M in the ResNet-32/CIFAR10 experiment by comparing the mean individual-learner test accuracies with the ensemble accuracies on the uncorrupted test dataset. In Table C.1, we provide the results on both the CIFAR10 and CIFAR10-C datasets for our sequential ensembles with an increasing number of base learners, M = 3, 5, 10. We also provide the BNN and SSBNN baselines to compare against the BNN sequential ensemble and the SeBayS ensembles, respectively. We observe that our BNN sequential ensemble and SeBayS ensembles of any size significantly outperform the single BNN and SSBNN models, respectively. With an increasing number of base learners (M = 3, 5, 10) within each of our sequential ensembles, we observe monotonically increasing predictive performance. The NLLs for the BNN sequential ensemble decrease as M increases. The NLLs for the SeBayS ensembles are either similar or increasing as M increases, which suggests the influence of the KL divergence term in the ELBO optimization in variational inference.

Table C.1 Ensemble size effect results in ResNet-32/CIFAR10 experiment. We mark the best results out of the sparse models in bold and the dense models in blue.

Methods | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓) | # Forward passes (↓)
SSBNN | 91.2 | 0.320 | 67.5 | 1.479 | 1
SeBayS-Freeze Ensemble (M = 3) | 92.5 | 0.273 | 70.4 | 1.344 | 1
SeBayS-Freeze Ensemble (M = 5) | 92.5 | 0.273 | 70.9 | 1.359 | 1
SeBayS-Freeze Ensemble (M = 10) | 92.7 | 0.275 | 71.0 | 1.386 | 1
SeBayS-No Freeze Ensemble (M = 3) | 92.4 | 0.274 | 69.8 | 1.356 | 1
SeBayS-No Freeze Ensemble (M = 5) | 92.5 | 0.271 | 70.2 | 1.375 | 1
SeBayS-No Freeze Ensemble (M = 10) | 92.7 | 0.272 | 70.8 | 1.375 | 1
BNN | 91.9 | 0.353 | 71.3 | 1.422 | 1
BNN Sequential Ensemble (M = 3) | 93.8 | 0.265 | 73.3 | 1.341 | 1
BNN Sequential Ensemble (M = 5) | 94.1 | 0.253 | 73.7 | 1.318 | 1
BNN Sequential Ensemble (M = 10) | 94.2 | 0.244 | 73.9 | 1.300 | 1

APPENDIX D EFFECT OF THE MONTE CARLO SAMPLE SIZE

In variational inference, the model prediction during the evaluation phase is calculated as the average of the predictions from an ensemble of networks whose weights each represent one sample from the posterior distribution of the weights. The number of such networks used to build the ensemble prediction is called the Monte Carlo (MC) sample size. In Table D.1, we present our sequential ensemble models as well as the BNN and SSBNN baselines in the ResNet-32/CIFAR10 experiment. Here, we take MC = 1, which is used in the Section 5.3 experiments, and compare it with MC = 5 for each method. In the single BNN and SSBNN models, we observe a significant improvement in model performance when using MC = 5 instead of 1. However, when we compare the SeBayS ensembles using MC = 1 or 5 with SSBNN using MC = 5, we observe that their performance is similar, indicating that MC = 1 is sufficient for our SeBayS ensembles. On the other hand, the sequential BNN ensemble using MC = 1 performs better than BNN with MC = 5, whereas the sequential BNN ensemble using MC = 1 and MC = 5 has similar performance. This highlights the importance of the sequential perturbation strategy, which leads to more diverse ensembles compared to mere Monte Carlo sampling.
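For concreteness, a minimal sketch of Monte Carlo prediction averaging under the variational posterior is given below, before Table D.1. The toy stochastic layer and all names are illustrative assumptions standing in for the reparameterized Bayesian layers used in this chapter.

```python
# Minimal sketch of MC prediction averaging under the variational posterior;
# NoisyLinear is a toy stand-in for a reparameterized Bayesian layer.
import torch

class NoisyLinear(torch.nn.Module):
    """Toy layer that samples its weights on every forward pass."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = torch.nn.Parameter(0.1 * torch.randn(d_in, d_out))
        self.sigma = 0.05
    def forward(self, x):
        w = self.mu + self.sigma * torch.randn_like(self.mu)
        return x @ w

def mc_predict(model, x, mc_samples=1):
    """Average softmax predictions over mc_samples stochastic forward passes."""
    with torch.no_grad():
        probs = [torch.softmax(model(x), dim=-1) for _ in range(mc_samples)]
    return torch.stack(probs).mean(dim=0)

def ensemble_predict(models, x, mc_samples=1):
    """Average the MC-averaged predictions across the M base learners."""
    return torch.stack([mc_predict(m, x, mc_samples) for m in models]).mean(dim=0)

models = [NoisyLinear(8, 3) for _ in range(3)]          # M = 3 base learners
x = torch.randn(4, 8)
print(ensemble_predict(models, x, mc_samples=5).shape)  # torch.Size([4, 3])
```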
Table D.1 Monte Carlo sample size effect results in ResNet-32/CIFAR10 experiment. We mark the best results out of the sparse models in bold and dense models in blue. MC is the Monte Carlo sample size.

Methods | MC | Acc (↑) | NLL (↓)
SSBNN | 1 | 91.2 | 0.320
SSBNN | 5 | 92.3 | 0.270
SeBayS-Freeze Ensemble (M = 3) | 1 | 92.5 | 0.273
SeBayS-Freeze Ensemble (M = 3) | 5 | 92.5 | 0.270
SeBayS-No Freeze Ensemble (M = 3) | 1 | 92.4 | 0.274
SeBayS-No Freeze Ensemble (M = 3) | 5 | 92.6 | 0.268
BNN | 1 | 91.9 | 0.353
BNN | 5 | 93.2 | 0.271
BNN Sequential Ensemble (M = 3) | 1 | 93.8 | 0.265
BNN Sequential Ensemble (M = 3) | 5 | 93.9 | 0.254

APPENDIX E EFFECT OF THE PERTURBATION FACTOR

In this Appendix, we explore the influence of the perturbation factor on our sequential ensemble models through the ResNet-32/CIFAR10 experiment. In Table E.1, we report the results for our three sequential approaches for three perturbation factors, ρ = 2, 3, 5. For our SeBayS-Freeze and No Freeze ensembles, the lower perturbation with ρ = 2 leads to higher test accuracies and lower NLLs than ρ = 3, 5 on both the CIFAR10 and CIFAR10-C test datasets. This suggests that the higher perturbations ρ = 3, 5 might need a higher number of epochs to reach convergence in each exploitation phase. However, in the BNN sequential ensemble, ρ = 3 has an overall higher performance compared to ρ = 2, 5. This points to the fact that the lower perturbation, ρ = 2, may not lead to the best ensemble model.

In Table E.2, we present the prediction disagreement and KL divergence metrics for the experiments described in this Appendix. In the BNN sequential ensemble, the ρ = 5 perturbation model has the best diversity metrics, whereas the ρ = 3 perturbation model has the best accuracy. In the SeBayS approach, the perturbation of ρ = 3 leads to the best diversity metrics, nonetheless at the expense of slightly lower predictive performance. This highlights the fact that the ρ = 3 SeBayS approaches lead to the best ensembles given the training budget constraint. Hence, we use ρ = 3 for our three sequential models in all the experiments presented in Section 5.3.

Table E.1 Perturbation factor effect results in ResNet-32/CIFAR10 experiment. We mark the best results out of different perturbation models under a given method in bold. Ensemble size is fixed at M = 3. ρ is the perturbation factor.

Methods | ρ | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓)
SeBayS-Freeze Ensemble | 2 | 92.7 | 0.264 | 70.6 | 1.303
SeBayS-Freeze Ensemble | 3 | 92.5 | 0.273 | 70.4 | 1.344
SeBayS-Freeze Ensemble | 5 | 92.5 | 0.267 | 70.6 | 1.314
SeBayS-No Freeze Ensemble | 2 | 92.7 | 0.268 | 70.4 | 1.331
SeBayS-No Freeze Ensemble | 3 | 92.4 | 0.274 | 69.8 | 1.356
SeBayS-No Freeze Ensemble | 5 | 92.4 | 0.272 | 70.1 | 1.353
BNN Sequential Ensemble | 2 | 93.6 | 0.269 | 73.3 | 1.361
BNN Sequential Ensemble | 3 | 93.8 | 0.265 | 73.3 | 1.341
BNN Sequential Ensemble | 5 | 93.6 | 0.262 | 73.0 | 1.366

Table E.2 Diversity metrics for models trained with different perturbation factors in ResNet-32/CIFAR-10 experiment. We mark the best results out of different perturbation models under a given method in bold. Ensemble size is fixed at M = 3. ρ is the perturbation factor.
Methods | ρ | ddis (↑) | dKL (↑) | Acc (↑)
BNN Sequential Ensemble | 2 | 0.062 | 0.205 | 93.6
BNN Sequential Ensemble | 3 | 0.061 | 0.201 | 93.8
BNN Sequential Ensemble | 5 | 0.063 | 0.211 | 93.6
SeBayS-Freeze Ensemble | 2 | 0.058 | 0.135 | 92.7
SeBayS-Freeze Ensemble | 3 | 0.060 | 0.138 | 92.5
SeBayS-Freeze Ensemble | 5 | 0.059 | 0.137 | 92.5
SeBayS-No Freeze Ensemble | 2 | 0.082 | 0.222 | 92.7
SeBayS-No Freeze Ensemble | 3 | 0.106 | 0.346 | 92.4
SeBayS-No Freeze Ensemble | 5 | 0.083 | 0.228 | 92.4

APPENDIX F EFFECT OF THE CYCLIC LEARNING RATE SCHEDULE

In this Appendix, we examine the effect of different cyclic learning rate strategies during the exploitation phases in our three sequential ensemble methods. We explore the stepwise (our approach), cosine (Huang et al., 2017), linear-fge (Garipov et al., 2018), and linear-1 cyclic learning rate schedules.

Cosine. The cyclic cosine learning rate schedule reduces the higher learning rate of 0.01 to a lower learning rate of 0.001 using the shifted cosine function (Huang et al., 2017) in each exploitation phase.

Linear-fge. In the cyclic linear-fge learning rate schedule, we first drop the high learning rate of 0.1 used in the exploration phase to 0.01 linearly over $t_{ex}/2$ epochs and then further drop the learning rate to 0.001 linearly over the remaining $t_{ex}/2$ epochs during the first exploitation phase. Afterwards, in each exploitation phase, we linearly increase the learning rate from 0.001 to 0.01 for $t_{ex}/2$ epochs and then linearly decrease it back to 0.001 for the next $t_{ex}/2$ epochs, similar to Garipov et al. (2018).

Linear-1. In the linear-1 cyclic learning rate schedule, we linearly decrease the learning rate from 0.01 to 0.001 over $t_{ex}$ epochs in each exploitation phase and then suddenly increase the learning rate back to 0.01 after each sequential perturbation step.

In Figure F.1, we present the plots of the cyclic learning rate schedules considered in this Appendix; a short code sketch of these schedules is also given before Table F.1.

Figure F.1 Cyclic learning rate schedules. The red dots represent the converged models after each exploitation phase used in our final sequential ensemble.

In Table F.1, we present the results for our three sequential ensemble methods under the four cyclic learning rate schedules mentioned above. We observe that, in all three sequential ensembles, the cyclic stepwise learning rate schedule yields the best performance on almost all criteria compared to the rest of the learning rate schedules. In Table F.2, we present the prediction disagreement and KL divergence metrics for the experiments described in this Appendix. We observe that, in the SeBayS-No Freeze ensemble, the cyclic stepwise schedule generates highly diverse subnetworks, which also leads to high predictive performance. In contrast, in the BNN sequential and SeBayS-Freeze ensembles, we observe lower diversity metrics for the cyclic stepwise learning rate schedule compared to the rest of the learning rate schedules.
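The following is an illustrative reading of the four exploitation-phase schedules described above; the exact endpoint conventions within a phase are assumptions, and the stepwise schedule is the one sketched in Appendix A.1.

```python
# Illustrative sketches of the cyclic exploitation-phase learning-rate schedules
# compared in this Appendix (assumed endpoint conventions; not the training code).
import math

def cosine_lr(phase_epoch, tex=100, lr_high=0.01, lr_low=0.001):
    # Shifted-cosine decay from lr_high to lr_low within one exploitation phase.
    return lr_low + 0.5 * (lr_high - lr_low) * (1 + math.cos(math.pi * phase_epoch / tex))

def linear_1_lr(phase_epoch, tex=100, lr_high=0.01, lr_low=0.001):
    # Linear decay from lr_high to lr_low within one phase; resets after each perturbation.
    return lr_high - (lr_high - lr_low) * phase_epoch / tex

def linear_fge_lr(phase, phase_epoch, tex=100):
    # First phase: 0.1 -> 0.01 over tex/2 epochs, then 0.01 -> 0.001 over the rest.
    # Later phases: 0.001 -> 0.01 over tex/2 epochs, then back down to 0.001.
    half = tex / 2
    if phase == 0:
        return (0.1 - (0.1 - 0.01) * phase_epoch / half if phase_epoch < half
                else 0.01 - (0.01 - 0.001) * (phase_epoch - half) / half)
    return (0.001 + (0.01 - 0.001) * phase_epoch / half if phase_epoch < half
            else 0.01 - (0.01 - 0.001) * (phase_epoch - half) / half)
```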
Table F.1 Cyclic learning rate schedules results in ResNet-32/CIFAR10 experiment. We mark the best results out of different learning rate (LR) schedules under a given method in bold. Ensemble size is fixed at M = 3.

Methods | LR Schedule | Acc (↑) | NLL (↓) | cAcc (↑) | cNLL (↓)
SeBayS-Freeze Ensemble | stepwise | 92.5 | 0.273 | 70.4 | 1.344
SeBayS-Freeze Ensemble | cosine | 92.3 | 0.301 | 69.8 | 1.462
SeBayS-Freeze Ensemble | linear-fge | 92.5 | 0.270 | 70.1 | 1.363
SeBayS-Freeze Ensemble | linear-1 | 92.1 | 0.310 | 69.8 | 1.454
SeBayS-No Freeze Ensemble | stepwise | 92.4 | 0.274 | 69.8 | 1.356
SeBayS-No Freeze Ensemble | cosine | 92.2 | 0.294 | 69.9 | 1.403
SeBayS-No Freeze Ensemble | linear-fge | 92.4 | 0.276 | 70.0 | 1.379
SeBayS-No Freeze Ensemble | linear-1 | 92.2 | 0.296 | 69.7 | 1.412
BNN Sequential Ensemble | stepwise | 93.8 | 0.265 | 73.3 | 1.341
BNN Sequential Ensemble | cosine | 93.7 | 0.279 | 72.7 | 1.440
BNN Sequential Ensemble | linear-fge | 93.5 | 0.270 | 73.1 | 1.342
BNN Sequential Ensemble | linear-1 | 93.4 | 0.287 | 72.2 | 1.430

Table F.2 Diversity metrics for models trained with different cyclic learning rate schedules in ResNet-32/CIFAR10 experiment. We mark the best results out of different learning rate (LR) schedules under a given method in bold. Ensemble size is fixed at M = 3.

Methods | LR Schedule | ddis (↑) | dKL (↑) | Acc (↑)
BNN Sequential Ensemble | stepwise | 0.061 | 0.201 | 93.8
BNN Sequential Ensemble | cosine | 0.068 | 0.256 | 93.7
BNN Sequential Ensemble | linear-fge | 0.070 | 0.249 | 93.5
BNN Sequential Ensemble | linear-1 | 0.071 | 0.275 | 93.4
SeBayS-Freeze Ensemble | stepwise | 0.060 | 0.138 | 92.5
SeBayS-Freeze Ensemble | cosine | 0.072 | 0.204 | 92.3
SeBayS-Freeze Ensemble | linear-fge | 0.076 | 0.215 | 92.5
SeBayS-Freeze Ensemble | linear-1 | 0.074 | 0.209 | 92.1
SeBayS-No Freeze Ensemble | stepwise | 0.106 | 0.346 | 92.4
SeBayS-No Freeze Ensemble | cosine | 0.078 | 0.222 | 92.2
SeBayS-No Freeze Ensemble | linear-fge | 0.074 | 0.199 | 92.4
SeBayS-No Freeze Ensemble | linear-1 | 0.077 | 0.217 | 92.2

CHAPTER 6 EPILOGUE

6.1 Summary

This dissertation focuses on the development of novel, theoretically consistent Bayesian neural network (BNN) models for a wide range of data scenarios. We have proposed the Bayesian quantile regression neural network (BQRNN) in Chapter 2, the spike-and-slab Gaussian node selection technique (SS-IG) in Chapter 3, and the spike-and-slab group lasso (SS-GL) and spike-and-slab group horseshoe (SS-GHS) in Chapter 4. For both the BQRNN and SS-IG methods, we provide rigorous theoretical justification via posterior consistency results and the optimal contraction rate. We also provide numerical evidence establishing the advantage of our proposed methods compared to recent competing techniques in the literature. In Chapter 5, we propose sequential Bayesian neural subnetwork ensembles (SeBayS), which use SS-IG models as the base models in the ensemble. We concluded that chapter with several experiments showcasing the effectiveness of our proposed approach, as well as studies exploring the effect of changing some of the model parameters.

6.2 Broader Impacts

Our BQRNN approach is particularly useful when the relationships in the lower and upper tail areas of the response variable distribution are of greater interest, such as in modeling extreme weather events, cascading failures in electric power grids, and other rare events. On the other hand, as deep learning has been harnessed by large industrial corporations in recent years to improve their products, the demand for models with both high predictive performance and reliable uncertainty estimation is rising. Its applications range from computer vision and pattern recognition to natural language processing.
However, as deep learning models are pushed into smaller and smaller embedded devices, such as smart cameras recognizing visitors at your front door, the design of resource-efficient neural networks is of great practical importance. These real-world applications demand real-time, on-device neural network inference. Our work on sparse BNNs addresses this computational bottleneck by compressing neural networks through sparsity induced during training. The Bayesian framework estimates the posterior of the model parameters, allowing for uncertainty quantification around the parameter estimates, which can be vital in medical diagnostics. For example, brain imaging data could be processed through our model to yield a decision on a certain medical condition with the added benefit of a quantified confidence associated with that decision.

6.3 Future Research

Bayesian neural networks still leave considerable room for investigation. Several promising research directions stem from our current work.

• The development of sparse deep Bayesian quantile networks, which can allow for extreme quantile inference with fewer data points. Such a model can be beneficial in cases where the event of interest is rarely manifested in the given data.

• Bayesian convolutional neural network approximation theory, consisting of derivations of posterior consistency, variable selection consistency, and asymptotically optimal generalization bounds.

• A theoretical framework for Bayesian ensembling, including not only posterior consistency but also nonasymptotic generalization error upper bounds. Such a bound might depend on the data size as well as the explicit number of base models in an ensemble. This theoretical development will also help in deciding the optimal number of base models in a given ensemble.

• Development of sparse Bayesian tensor-to-tensor convolutional neural networks involving structured sparsity learning, which would potentially benefit ill-posed as well as well-posed problems arising, for instance, in tomographic reconstruction.

BIBLIOGRAPHY

Alhamzawi, R. (2018). Brq: R package for bayesian quantile regression. https://cran.r-project.org/web/packages/Brq/Brq.pdf. Online; accessed 15 May 2020.

Alvarez, J. M. and Salzmann, M. (2016). Learning the number of neurons in deep networks. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS 2016).

Andrews, D. F. and Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society – Series B, 36(1):99–102.

Arangio, S. and Bontempi, F. (2015). Structural health monitoring of a cable-stayed bridge with bayesian neural networks. Structure and Infrastructure Engineering, 11(4):575–587.

Bai, J., Song, Q., and Cheng, G. (2020). Efficient variational inference for sparse deep learning with theoretical guarantee. In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS 2020).

Barndorff-Nielsen, O. and Shephard, N. (2001). Non-Gaussian OU based models and some of their uses in financial economics. Journal of the Royal Statistical Society – Series B, 63:167–241.

Barron, A., Schervish, M. J., and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics, 10:536–561.

Bateni, S. M., Jeng, D.-S., and Melville, B. W. (2007). Bayesian neural networks for prediction of equilibrium and time-dependent scour depth around bridge piers.
Advances in Engineering Software, 38(2):102–111. Beker, W., Wolos, A., Szymkuć, S., and Grzybowski, B. (2020). Minimal-uncertainty pre- diction of general drug-likeness based on bayesian neural networks. Nature Machine Intelligence, 2:457–465. Benoit, D. F., Alhamzawi, R., Yu, K., and den Poel, D. V. (2017). R package ‘bayesqr’. https://cran.r-project.org/web/packages/bayesQR/bayesQR.pdf. Online; accessed 15 May 2020. Betancourt, M. and Girolami, M. (2015). Hamiltonian monte carlo for hierarchical models. Current trends in Bayesian methodology with applications, 79(30). Bhadra, A., Datta, J., Polson, N., and Willard, B. (2019). Lasso meets horseshoe: A survey. Statistical Science, 34(3):405–427. 156 Bhattacharya, S. and Maiti, T. (2021). Statistical foundation of variational bayes neural networks. Neural Networks, 137:151–173. Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877. Blei, D. M. and Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1):17–35. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural network. In Proceedings of Machine Learning Research, volume 37, pages 1613– 1622. PMLR. Buntine, W. L. and Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5:603–643. Cannon, A. J. (2011). R package ‘qrnn’ for quantile regression neural network. https://cran.r- project.org/web/packages/qrnn/qrnn.pdf. Online; accessed 15 May 2020. Cannon, A. J. (2018). Non-crossing nonlinear regression quantiles by monotone composite quantile regression neural network, with application to rainfall extremes. Stoch Environ Res Risk Assess, 32:3207–3225. Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480. Chen, C. (2007). A finite smoothing algorithm for quantile regression. Journal of Computational and Graphical Statistics, 16:136–164. Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2018). Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136. Chérief-Abdellatif, B.-E. (2020). Convergence rates of variational inference in sparse deep learning. In Proceedings of the 37th International Conference on Machine Learning (ICML-2020). Chérief-Abdellatif, B.-E. and Alquier, P. (2018). Consistency of variational bayes infer- ence for estimation and model selection in mixtures. Electronic Journal of Statistics, 12(2):2995–3035. Cobb, A. D. et al. (2019). An ensemble of bayesian neural networks for exoplanetary atmo- spheric retrieval. The Astronomical Journal, 158(1). Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics 157 of Controls, Signals, and Systems, 2:303–314. D’Angelo, F. and Fortuin, V. (2021). Repulsive deep ensembles are bayesian. In Proceedings of the 34th Advances in Neural Information Processing Systems (NeurIPS 2021). Dantzig, G. B. (1963). Linear Programming and Extensions. Princeton University Press, Princeton. De Freitas, N., Andrieu, C., Højen-Sørensen, P., Niranjan, M., and Gee, A. (2001). Sequential monte carlo methods for neural networks. In Sequential Monte Carlo Methods in Practice, pages 359—-379. Springer, New York. Dua, D. and Graff, C. (2017). UCI machine learning repository. 
Dusenberry, M., Jerfel, G., Wen, Y., Ma, Y., Snoek, J., Heller, K., Lakshminarayanan, B., and Tran, D. (2020). Efficient and scalable bayesian neural nets with rank-1 factors. In Proceedings of the 37th International Conference on Machine Learning (ICML-2020). Egele, R., Maulik, R., Raghavan, K., Balaprakash, P., and Lusch, B. (2021). Autodeuq: Au- tomated deep ensemble with uncertainty quantification. arXiv preprint arXiv:2110.13511. Elfwing, S., Uchibe, E., and Doya, K. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11. Special issue on deep reinforcement learning. Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, train- able neural networks. In 7th International Conference on Learning Representations (ICLR-2019). Friedman, J., Hastie, T., and Tibshirani, R. (2009). The elements of statistical learning. Springer series in statistics. Springer, New York. Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183–192. Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML-2016. Gale, T., Elsen, E., and Hooker, S. (2019). The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574. Garipov, T., Izmailov, P., Podoprikhin, D., Vetrov, D. P., and Wilson, A. G. (2018). Loss surfaces, mode connectivity, and fast ensembling of dnns. In Proceedings of the 31st Advances in Neural Information Processing Systems (NeurIPS-2018). 158 Gelfand, E. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. CRC Press, Third edition. Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472. Geman, S. and Geman, D. (1984). Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741. Ghosal, S. and Van Der Vaart, A. W. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223. Ghosh, M., Ghosh, A., Chen, M. H., and Agresti, A. (2000). Noninformative priors for one-parameter item models. Journal of Statistical Planning and Inference, 88:99–115. Ghosh, S., Yao, J., and Doshi-Velez, F. (2019). Model selection in bayesian neural networks via horseshoe priors. Journal of Machine Learning Research, 20:1–46. Grenander, U. (1981). Abstract Inference. Wiley. Guo, Y., Yao, A., and Chen, Y. (2016). Dynamic network surgery for efficient dnns. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS 2016). Han, S., Mao, H., and Dally, W. J. (2016). Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In 4th International Conference on Learning Representations (ICLR-2016). Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for efficient neural network. In Proceedings of the 28th Advances in Neural Information Processing Systems (NIPS 2015). Hansen, L. and Salamon, P. (1990). Neural network ensembles. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993–1001. Hassibi, B. and Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems. Hastings, W. K. (1970). Monte carlo sampling methods using markov chains and their applications. Biometrika, 57:97–109. Havasi, M., Jenatton, R., Fort, S., Liu, J. Z., Snoek, J., Lakshminarayanan, B., Dai, A. M., 159 and Tran, D. (2021). Training independent subnetworks for robust prediction. In 9th International Conference on Learning Representations (ICLR-2021). He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In IEEE International Conference on Computer Vision (ICCV-2015), pages 1026–1034. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2016), pages 770–778. Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations (ICLR-2019). Hernandez-Lobato, J. M. and Adams, R. (2015). Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-2015). Hinton, G. E. and Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational Learning Theory (COLT-1993), pages 5–13. Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., and Peste, A. (2021). Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241):1–124. Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free. In 5th International Conference on Learning Representations (ICLR-2017). Ingraham, J. B. and Marks, D. S. (2017). Variational inference for sparse and undi- rected models. In Proceedings of the 34th International Conference on Machine Learning (ICML-2017). Izmailov, P., Vikram, S., Hoffman, M. D., and Wilson, A. G. G. (2021). What are bayesian neural network posteriors really like? In Proceedings of the 38th International Conference on Machine Learning (ICML-2021). Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In 5th International Conference on Learning Representations (ICLR-2017). 160 Jantre, S., Bhattacharya, S., and Maiti, T. (2021a). Layer adaptive node selection in bayesian neural networks: Statistical guarantees and implementation details. arXiv preprint arXiv:2108.11000. Jantre, S., Bhattacharya, S., and Maiti, T. (2021b). Quantile regression neural networks: a bayesian approach. Journal of Statistical Theory and Practice, 15(3):1–34. Jantre, S., Madireddy, S., Bhattacharya, S., Maiti, T., and Balaprakash, P. (2022). Sequential bayesian neural subnetwork ensembles. arXiv preprint arXiv:2206.00794. Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On using very large target vocab- ulary for neural machine translation. 
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Sau, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183–233. Karmarkar, N. (1984). A new polynomial time algorithm for linear programming. Combinatorica, 4(4):373–395. Kendall, A. and Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? In Proceedings of the 31st Advances in Neural Information Processing Systems (NIPS-2017). Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR-2015). Kingma, D. and Welling, M. (2014). Auto-encoding variational bayes. In 2nd International Conference on Learning Representations (ICLR-2014). Koenker, R. (2005). Quantile Regression. Cambridge University Press, Cambridge, First edition. Koenker, R. (2017). R package ‘quantreg’ for quantile regression. https://cran.r- project.org/web/packages/quantreg/quantreg.pdf. Online; accessed 15 May 2020. Koenker, R. and Basset, G. (1978). Regression quantiles. Econometrica, 46:33–50. Koenker, R. and Machado, J. (1999). Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association, 94:1296–1309. Kottas, A. and Gelfand, A. E. (2001). Bayesian semiparametric median regression modeling. Journal of the American Statistical Association, 96:1458–1468. 161 Kozumi, H. and Kobayashi, G. (2011). Gibbs sampling methods for bayesian quantile re- gression. Journal of Statistical Computation and Simulation, 81:1565–1578. Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. M.Sc. Thesis. Kwon, Y., Won, J.-H., Kim, B. J., and Paik, M. C. (2020). Uncertainty quantification using bayesian neural networks in classification: Application to biomedical image segmentation. Computational Statistics and Data Analysis, 142. Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS 2017). LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521:436–444. LeCun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal brain damage. In Proceedings of the 3rd Advances in Neural Information Processing Systems (NIPS-1990). Lee, H. K. H. (2000). Consistency of posterior distributions for neural networks. Neural Networks, 13:629–642. Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. (2018). Deep neural networks as gaussian processes. In 6th International Conference on Learning Representations (ICLR-2018). Liu, S., Chen, T., Atashgahi, Z., Chen, X., Sokar, G., Mocanu, E., Pechenizkiy, M., Wang, Z., and Mocanu, D. C. (2022). Deep ensembling with no overhead for either training or testing: The all-round blessings of dynamic sparsity. In 10th International Conference on Learning Representations (ICLR-2022). Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. (2017). Learning efficient con- volutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Louizos, C., Ullrich, K., and Welling, M. (2017). Bayesian compression for deep learning. 
In Proceedings of the 30th Advances in Neural Information Processing Systems (NIPS 2017). Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through l0 regularization. In 6th International Conference on Learning Representations (ICLR-2018). Lu, L., Shin, Y., Su, Y., and Em Karniadakis, G. (2020). Dying relu and initialization: Theory and numerical examples. Communications in Computational Physics, 28(5):1671– 1706. 162 Luo, J.-H., Wu, J., and Lin, W. (2017). Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). Mackay, D. J. C. (1992). A practical bayesian framework for backpropagation networks. Neural Computation, 4:448–472. Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations (ICLR-2017). Madsen, K. and Nielsen, H. B. (1993). A finite smoothing algorithm for linear l1 estimation. SIAM Journal of Optimization, 3:223–235. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032. Mocanu, D. C., Mocanu, E., Stone, P., Nguyen, P. H., Gibescu, M., and Liotta, A. (2018). Evolutionary training of sparse artificial neural networks: A network science perspective. Nature Communication, 9(2383). Moghimi, M., Belongie, S. J., Saberian, M. J., Yang, J., Vasconcelos, N., and Li, L.-J. (2016). Boosted convolutional neural networks. In BMVC, volume 5, page 6. Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML-2017). Mozer, M. C. and Smolensky, P. (1988). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Proceedings of the 1st Advances in Neural Information Processing Systems (NIPS-1988). Murray, K. and Chiang, D. (2015). Auto-sizing neural networks: With applications to n- gram language models. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 908–916. Nado, Z. et al. (2021). Uncertainty Baselines: Benchmarks for uncertainty & robustness in deep learning. In Bayesian Deep Learning workshop, NeurIPS-2021. Neal, R. (1992). Bayesian learning via stochastic dynamics. In Proceedings of the 5th Advances in Neural Information Processing Systems (NIPS-1992). 163 Neal, R. M. (1996). Bayesian Learning for Neural Networks. New York: Springer Verlag. Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. P. (2017). Structured bayesian pruning via log-normal multiplicative noise. In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS-2020). Ochiai, T., Matsuda, S., Watanabe, H., and Katagiri, S. (2017). Automatic node selection for deep neural networks using group lasso regularization. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Dillon, J. V., Lakshminarayanan, B., and Snoek, J. (2019). Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. 
In Proceedings of the 32th Advances in Neural Information Processing Systems (NeurIPS-2019). Papamarkou, T., Hinkle, J., Young, M. T., and Womble, D. (2022). Challenges in markov chain monte carlo for bayesian neural networks. Statistical Science, 37(3):425–442. Paszke, A. et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 32nd Advances in Neural Information Processing Systems (NeurIPS-2019). Pati, D., Bhattacharya, A., and Yang, Y. (2018). On statistical optimality of variational bayes. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS-2018). Perreault Levasseur, L., Hezaveh, Y. D., and Wechsler, R. H. (2017). Uncertainties in parameters estimated with neural networks: Application to strong gravitational lensing. The Astrophysical Journal, 850(1). Piironen, J. and Vehtari, A. (2017). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS-2017). Pollard, D. (1991). Bracketing methods in statistics and econometrics. In Nonparametric and semiparametric methods in econometrics and statistics: Proceedings of the Fifth International Symposium in Econometric Theory and Econometrics, pages 337–355. Cam- bridge, UK: Cambridge University Press. Polson, N. and Ročková, V. (2018). Posterior concentration for sparse deep learn- ing. In Proceedings of the 31st Advances in Neural Information Processing Systems (NeurIPS-2018). Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941. 164 Rosenblatt, F. (1957). The Perceptron - a perceiving and recognizing automaton (project para). Cornell Aeronautical Laboratory. Rosenblatt, F. (1961). Principles of neurodynamics. perceptrons and the theory of brain mechanisms. Technical Report, Cornell Aeronautical Laboratory. Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 48(4):1875–1897. Schwartz, L. (1965). On bayes procedures. Z. Wahrsch. Verw. Gebiete, 4:10–26. Sennrich, R., Haddow, B., and Birch, A. (2016). Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. Singh, S., Hoiem, D., and Forsyth, D. (2016). Swapout: Learning an ensemble of deep architectures. Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS-2016). Sriram, K., Ramamoorthi, R. V., and Ghosh, P. (2013). Posterior consistency of bayesian quantile regression based on the misspecified asymmetric laplace density. Bayesian Analysis, 8(2):479–504. Sun, Y., Song, Q., and Liang, F. (2021). Consistent sparse deep learning: Theory and computation. Journal of the American Statistical Association, pages 1–15. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initializa- tion and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-2013). Swann, A. and Allinson, N. (1998). Fast committee learning: Preliminary results. Electronics Letters, 34(14):1408–1410. Taylor, J. W. (2000). A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting, 19(4):299–311. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. 
Journal of the Royal Statistical Society – Series B, 58:267–288. Titterington, D. M. (2004). Bayesian methods for neural networks and related models. Statistical Science, 19:128–139. Van Der Maaten, L. and Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605. 165 Van Der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes. Springer, First edition. Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Springer, Fourth edition. Walker, S. G. and Mallick, B. K. (1999). A bayesian semiparametric accelerated failure time model. Biometrics, 55:477–483. Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-2013). Wasserman, L. (1998). Asymptotic properties of nonparametric bayesian procedures. In Practical nonparametric and semiparametric Bayesian statistics, pages 293–304. New York: Springer. Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. (2016). Learning structured sparsity in deep neural networks. In Proceedings of the 29th Advances in Neural Information Processing Systems (NIPS-2016). Wen, Y., Tran, D., and Ba, J. (2020). Batchensemble: an alternative approach to efficient ensemble and lifelong learning. 8th International Conference on Learning Representations (ICLR-2020). Wenzel, F., Snoek, J., Tran, D., and Jenatton, R. (2020). Hyperparameter ensembles for robustness and uncertainty quantification. Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS-2020). Wilson, A. G. and Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS-2020). Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and conver- gence rates of sieve mles. Annals of Statistics, 23:339–362. Xiao, H., Rasul, K., and Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Xie, J., Xu, B., and Chuang, Z. (2013). Horizontal and vertical ensemble with deep repre- sentation for classification. arXiv preprint arXiv:1306.2759. Xu, Q., Deng, K., Jiang, C., Sun, F., and Huang, X. (2017). Composite quantile regression neural network with applications. Expert Systems with Applications, 76:129–139. 166 Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936. Yeh, I.-C. (1998). Modeling of strength of high performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808. Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression. Statistics and Probability Letters, 54(4):437–447. Yu, K. and Zhang, J. (2005). A three-parameter asymmetric laplace distribution and its extensions. Communications in Statistics- Theory and Methods, 34:1867–1879. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations (ICLR-2017). Zhang, F. and Gao, C. (2020). Convergence rates of variational posterior distributions. The Annals of Statistics, 48(4):2180–2207. Zhao, C., Ni, B., Zhang, J., Zhao, Q., Zhang, W., and Tian, Q. (2019). Variational convolu- tional neural network pruning. 
In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2020). Random erasing data augmen- tation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7):13001–13008. Zhu, M. and Gupta, S. (2018). To prune, or not to prune: Exploring the efficacy of pruning for model compression. In 6th International Conference on Learning Representations (ICLR-2018), Workshop Track Proceedings. 167