VARIATIONAL BAYES DEEP NEURAL NETWORK: THEORY, METHODS AND APPLICATIONS

By Zihuan Liu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2021

ABSTRACT

Bayesian neural networks (BNNs) have achieved state-of-the-art results in a wide range of tasks, especially in high-dimensional data analysis, including image recognition, biomedical diagnosis and others. My thesis mainly focuses on high-dimensional data, including simulated data and brain images of Alzheimer's disease. We develop the variational Bayesian deep neural network (VBDNN) and the Bayesian compressed neural network (BCNN), and discuss the related statistical theory and algorithmic implementations for predicting MCI-to-dementia conversion in multi-modal data from ADNI. The transition from mild cognitive impairment (MCI) to dementia is of great interest to clinical research on Alzheimer's disease (AD) and related dementias. This phenomenon also serves as a valuable data source for quantitative methodological researchers developing new approaches for classification. The development of the VBDNN is motivated by an important biomedical engineering application, namely, building predictive tools for the transition from MCI to dementia. The predictors are multi-modal and may involve complex interactive relations. In Chapter 2, we numerically compare the classification accuracy of logistic regression (LR) with that of the support vector machine (SVM) in classifying MCI-to-dementia conversion. The results show that although SVM and other machine learning techniques are capable of relatively accurate classification, similar or higher accuracy can often be achieved by LR, calling into question the necessity or added value of SVM for many clinical researchers.
Further, when faced with many potential features that could be used for classifying the transition, clinical researchers are often unaware of the relative value of different approaches for variable selection. Other than algorithmic feature selection techniques, manually trimming the list of potential predictor variables can also protect against over-fitting and offers possible insight into why the selected features are important to the model. We demonstrate how similar performance can be achieved using user-guided, clinically informed pre-selection versus algorithmic feature selection techniques. Besides LR and SVM, Bayesian deep neural networks (BDNNs) have quickly become among the most popular machine learning classifiers for prediction and classification with ADNI data. However, their Markov chain Monte Carlo (MCMC) based implementation suffers from high computational cost, limiting the use of this powerful technique in large-scale studies. Variational Bayes (VB) has emerged as a competitive alternative that overcomes some of these computational issues. Although VB is popular in machine learning, neither its computational nor its statistical properties are well understood for complex models such as neural networks. First, we develop the VBDNN estimation methodology and characterize the prior distributions and the variational family needed for consistent Bayesian estimation (Chapter 3). The thesis compares and contrasts the consistency and contraction rates of the true posterior for deep neural network-based classification with those of the corresponding variational posterior. Based on the complexity of the deep neural network (DNN), the thesis assesses the loss in classification accuracy due to the use of VB and provides guidelines on the characterization of the prior distributions and the variational family. The difficulty of the optimization associated with the variational Bayes solution is quantified as a function of the complexity of the DNN. Chapter 4 proposes a BCNN that takes care of the large p, small n problem by projecting the feature space onto a smaller-dimensional space using a random projection matrix. In particular, for dimension reduction, we propose a randomly compressed feature space instead of other popular dimension reduction techniques. We adopt a model averaging approach to pool information across multiple projections. As the main contribution, we propose a variational Bayes approach to simultaneously estimate both the model weights and the model-specific parameters. By avoiding standard Markov chain Monte Carlo and parallelizing across multiple compressions, we dramatically reduce both computation and storage requirements with minimal loss in prediction accuracy. We provide theoretical and empirical justifications of the proposed methodology.

ACKNOWLEDGEMENTS

Throughout the writing of this dissertation I have received a great deal of support and assistance. Firstly, I would like to express my sincere gratitude to my advisors, Prof. Maiti and Prof. Bhattacharya, for their continuous support of my Ph.D. study and related research, and for their patience, motivation, and immense knowledge. Their guidance helped me throughout the research and writing of this thesis. I could not have imagined having better advisors and mentors for my Ph.D. study. Besides my advisors, I would like to thank the rest of my thesis committee, Prof. Bender, Prof. Hong, and Prof. Zhu, for their insightful comments and encouragement, and also for the hard questions that prompted me to widen my research from various perspectives. My sincere thanks also go to Prof. Wang from Western Connecticut State University and Dr. Zhao from Cleveland Clinic, who provided me an opportunity to join their teams as a researcher, and who gave me access to the laboratory and research facilities. Without their precious support it would not have been possible to conduct this research.
Thanks to the Michigan State University Graduate School for awarding me a Dissertation Completion Fellowship, providing me with the financial means to complete this thesis. In addition, I would like to thank my parents for their wise counsel and sympathetic ear. You are always there for me. Finally, I could not have completed this dissertation without the support of my friends, Liangliang Zhang and Cheukyin Lee, who provided stimulating discussions as well as happy distractions to rest my mind outside of my research.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
  1.1 MCI-to-dementia conversion
  1.2 Bayesian Deep Neural Networks
  1.3 Variational inference
  1.4 Posterior Consistency
  1.5 Bayesian Compressed Neural Networks
CHAPTER 2 A ROLE FOR PRIOR KNOWLEDGE IN STATISTICAL CLASSIFICATION OF THE TRANSITION FROM MCI TO ALZHEIMER'S DISEASE
  2.1 Introduction
  2.2 Transition from MCI to dementia
  2.3 Materials and Data
    2.3.1 Data Used in Classification
    2.3.2 Clinical Cognitive Assessment and Genetic data
    2.3.3 MRI data
  2.4 Method and Algorithm
    2.4.1 Logistic Regression
    2.4.2 Support Vector Machine
    2.4.3 Experimental Design
  2.5 Results and Analysis
    2.5.1 Comparison with different modalities
    2.5.2 Comparison of Pre-selection and ℓ1 norm
    2.5.3 Comparison of Groups One and Two
    2.5.4 Comparison of LR and SVM
  2.6 Discussion and Conclusion
CHAPTER 3 CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS
  3.1 Introduction
  3.2 The Neural Networks Classifier and Likelihoods
  3.3 Bayesian Inference with Variational Algorithm
    3.3.1 Prior Choice
    3.3.2 Variational Inference
    3.3.3 Black Box Variational Algorithm using score function estimator
    3.3.4 Control Variate: Stabilizing the stochastic gradient
    3.3.5 RMSprop Learning Rate: Stabilizing the learning rate
    3.3.6 Classification using variational posterior
  3.4 Posterior and Classification Consistency
    3.4.1 Posterior consistency and its implication in practice
    3.4.2 Discussion of the proof
    3.4.3 Classification consistency
  3.5 Simulation Studies
    3.5.1 Simulation Scenarios
    3.5.2 Parameter choice for statistical and computational models
    3.5.3 Gradient stabilization parameters
    3.5.4 Testing accuracy and convergence
    3.5.5 Large number of layers and challenges
  3.6 Numerical Properties and Alzheimer's Disease Study
    3.6.1 Parameter choice for statistical and computational models
    3.6.2 Gradient stabilization parameters
    3.6.3 Testing accuracy and convergence
    3.6.4 Numerical comparison with popular models
  3.7 Conclusion and Discussion
CHAPTER 4 LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS
  4.1 Introduction
  4.2 Bayesian neural network for random projection based compressed feature space
    4.2.1 Bayesian neural network model
    4.2.2 Compression in the feature space with random projections
    4.2.3 Prior choice
  4.3 Variational Bayes model averaging for pooling multiple instances of random projection
    4.3.1 Bayesian model averaging
    4.3.2 ELBO derivation
  4.4 Intrinsic dimensionality and prediction
    4.4.1 Optimal dimension neighborhood selection
    4.4.2 Classification based on optimal neighborhood choice
  4.5 Algorithm and its implementation
  4.6 Theoretical results
  4.7 Numerical Study
    4.7.1 Problem setup
    4.7.2 Datasets
    4.7.3 Simulated data
    4.7.4 ADNI Data
    4.7.5 MNIST Data
  4.8 Results
    4.8.1 Optimal dimensional region
    4.8.2 Comparative Baselines
    4.8.3 Experimental Results
  4.9 Conclusion
CHAPTER 5 CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH
  5.1 Conclusions and discussion
  5.2 Directions for future research
APPENDICES
  APPENDIX A SUPPLEMENT FOR CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS
  APPENDIX B SUPPLEMENT FOR LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Sample Sizes by Timing and Diagnosis: Groups One and Two
Table 2.2: Clinical Features and Cognitive Assessment Scores of Group One
Table 2.3: Pre-selected MRI Features of Group One
Table 2.4: Modalities
Table 2.5: Top 10 features of Group One obtained by ℓ1 regularization
Table 2.6: LR and SVM performance of Group One (Time = 3 years) for models on single- and multi-modal feature sets
Table 2.7: LR and SVM performance of Group Two (Time = 2 years) for single- and multi-modal data
Table 3.1: Performance of algorithms 1, 2, 4, 5 for scenario 1
Table 3.2: Performance of algorithms 1, 2, 4, 5 for scenario 2
Table 3.3: Performance of algorithms 1, 2, 4, 5 for scenarios 1 and 2 with 3 layers
Table 3.4: Performance of algorithms 1, 2, 4, 5 for ADNI
Table 3.5: Performance for different classifiers. LR: logistic regression. SVM: support vector machine. ANN: frequentist artificial neural network. SG-MCMC: stochastic gradient MCMC Bayesian neural network
Table 4.1: Summary of data, where n, p and c denote the numbers of samples, features and classes
Table 4.2: RPVBNN settings for evaluation
Table 4.3: Small simulated data performance
Table 4.4: Large simulated data performance
Table 4.5: ADNI data performance
Table 4.6: MNIST data performance in terms of testing accuracy and time (based on 500 epochs)

LIST OF FIGURES

Figure 2.1: Comparison of distributions for baseline predictor variables between MCI-S and MCI-C groups. (a) The mean MMSE score in MCI-S is higher than in MCI-C. (b) Mean Learning scores of the MCI-C and MCI-S groups are 2.5 and 5.
Figure 2.2: Comparisons between MCI-S and MCI-C groups on baseline predictor variables. The y-axis of panels (a) through (d) represents the number of participants developing AD. Blue and red bars represent non-converters and converters, respectively. Panel (a) shows a greater number of converters than non-converters for both men and women. Panel (b) shows that more than half of MCI-C subjects are APOE4 carriers and approximately 70% of MCI-S subjects are non-APOE4 carriers. Panel (c) shows that MCI-S subjects have relatively lower CDR scores and MCI-C subjects have higher CDR scores; the number of people in the MCI-C group trends downward as the CDR score increases. Panel (d) shows that MCI-C subjects have relatively higher ADASQ4 scores; the average ADASQ4 scores of MCI-S and MCI-C subjects are approximately 5 and 8, respectively.
Figure 2.3: Flowchart of the LR and SVM methods. A) ROI-P: ROI-level data with pre-selection; B) ROI-NP: ROI-level data with no pre-selection; C) CCAR: clinical, cognitive assessment score, APOE4 and ROI-level data.
Figure 2.4: Model performance on the ROI feature set by number of features for LR and SVM. Panel (a) shows dramatic growth in AUC with LR as the number of features increases from 1 to 30, then a plateau at approximately 74% as the number of features increases from 30 to 40, followed by a significant drop when the number of features reaches 41. Panel (b) shows that the AUC increases dramatically as the number of features grows from 1 to 28, but fluctuates after 29. The optimal numbers of ROI features for the two methods are 29 and 28, with corresponding optimized AUCs of approximately 74.0% and 78.0%.
Figure 2.5: Model performance on the CCA feature set by number of features for LR and SVM. Panel (a) shows a significant increase in AUC with LR as the number of features increases from 1 to 5, then a slight decrease in testing accuracy when the number of features exceeds 5. Panel (b) shows that the AUC rises dramatically as the number of features increases from 1 to 4. The optimal numbers of CCA features obtained by LR and SVM are 5 and 4, with corresponding optimized AUCs of approximately 84.0% and 83.0%.
Figure 3.1: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 1 layer.
Figure 3.2: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 2 layers.
Figure 3.3: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 1 layer.
Figure 3.4: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 2 layers.
Figure 3.5: ELBO convergence of algorithms 1, 2, 4, 5 for ADNI.
Figure 4.1: Small simulated data: m = 32
Figure 4.2: Large simulated data: m = 128
Figure 4.3: ADNI data: m = 128
Figure 4.4: MNIST data: m = 128

LIST OF ALGORITHMS

Algorithm 1: BBVI
Algorithm 2: BBVI-CV
Algorithm 3: RPVBNN
Algorithm 4: BBVI-RMS
Algorithm 5: BBVI-CV-RMS

CHAPTER 1 INTRODUCTION

In this thesis, we develop the variational Bayesian deep neural network (VBDNN) and the Bayesian compressed neural network (BCNN), and discuss the related statistical theory and algorithmic implementations in the context of classification, such as classifying MCI-to-dementia conversion. Chapter 1 reviews the background, research questions and development of Bayesian neural networks (BNNs). Chapter 2 introduces the prediction of the transition from mild cognitive impairment (MCI) to dementia from brain images of Alzheimer's disease using traditional machine learning models (logistic regression and support vector machine). Chapter 3 introduces the VBDNN estimation methodology and the choice of the prior distributions and the variational family. In particular, we discuss the statistical framework for the neural network-based classification problem and provide posterior consistency and classification consistency. Chapter 4 introduces a variational Bayes neural network predictive model for addressing the curse of dimensionality (large p, small n)
by compressing the feature space using random projection matrices. Finally, Chapter 5 presents conclusions, a discussion, and directions for future research.

1.1 MCI-to-dementia conversion

The transition from mild cognitive impairment (MCI) to dementia is of great interest to clinical research on Alzheimer's disease (AD) and related dementias. Alzheimer's disease is a progressive, age-related, neurodegenerative disease and the most common cause of dementia [147, 148, 67]. Behaviorally, Alzheimer's dementia is commonly preceded by mild cognitive impairment, a syndrome characterized by declines in memory and other cognitive domains that exceed the cognitive decrements associated with normal aging [148, 103]. However, the prodromal symptoms of MCI are not prognostically deterministic: individuals with MCI tend to progress to diagnoses of probable AD at a rate of 8%–15% per year, and many conversions are detectable within 3 years of initial presentation [24, 44, 2]. Research efforts to provide new insights into the incidence of MCI-to-AD conversion have focused largely on clinically or biologically relevant features (i.e., neuroimaging markers, clinical exam data, neuropsychological test scores) and on different methods for statistical classification [145].

1.2 Bayesian Deep Neural Networks

Due to the universal approximation theory of stochastic functions and greater access to computational power, Bayesian deep neural networks (BDNNs) are fashionable in machine learning and statistics for classification and prediction from big data. BDNN-based prediction has several advantages over standard parametric statistical models. BDNNs implicitly consider the interactions or dependence among predictor variables and model the unknown functional relationship between the predictors and responses. For example, we consider classifying Alzheimer's disease status from brain imaging, an important biomedical problem.
The image features are segmented into voxels or regions of interest (ROIs). Due to their physical adjacency and biological proximities, simple parametric or semi-parametric models, such as logistic regression or generalized additive models, may not be appropriate. Besides the (spatial) dependence among the predictors, there may be network structure in the feature space when modeling the brain images. BDNNs can take these data features into account without any explicit assumptions about their dependence structure. Further, these studies often include additional features in different modes, such as genetic and demographic information, which brings additional complexity when modeling dependence among the features. Thus, machine learning-based approaches, such as deep neural networks, become useful in this type of application. Bayesian deep neural networks have been comprehensively studied by [7], [95], [71], and many others. More recent developments that establish the efficacy of BDNNs can be found in [120], [93], [61], [80], [64] and the references therein. The estimation of the posterior distribution is a key part of Bayesian inference and represents the information about the uncertainties for both data and parameters. However, an exact analytical solution for the posterior distribution is intractable, as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration (see [11]). Several approaches have been proposed for approximating the posterior distribution of the weights of BDNNs, based both on optimization-based techniques, such as variational inference (VI), and on sampling-based approaches, such as Markov chain Monte Carlo (MCMC).

1.3 Variational inference

Markov chain Monte Carlo (MCMC) techniques are typically used to obtain sampling-based estimates of the posterior distribution.
Indeed, BDNNs with MCMC have not seen widespread adoption, due to the computational cost, in terms of both time and storage, on large datasets [66, 94, 132, 139]. In contrast to MCMC, VI tends to converge faster, and it has been applied to many popular Bayesian models, such as factorial models and topic models [79, 9, 8]. We take a variational approximation approach for posterior estimation in the context of deep neural network classification models. The basic idea of VI is to first define a family of variational distributions and then minimize, over that family, the Kullback-Leibler (KL) divergence to the posterior. Many recent works have discussed the application of variational inference to Bayesian deep neural networks, e.g., [50], [11], [121]. Although there is a plethora of literature on variational inference for neural networks, the theoretical properties of the variational posterior in BDNNs remain relatively unexplored, and this limits the use of this powerful computational tool beyond the machine learning community. Previous works that focused on theoretical properties of the variational posterior include the frequentist consistency of variational inference in parametric models in the presence of latent variables (see [135]). Optimal risk bounds for mean-field variational Bayes for Gaussian mixture (GM) and latent Dirichlet allocation (LDA) models have been discussed in [99]. The work of [142] proposed α-variational inference and derived Bayes risk bounds for GM and LDA models. The work of [149] discusses variational posterior contraction rates in Gaussian sequence models, infinite exponential families and piece-wise constant models. The works of [105] and [122] study the posterior contraction rates for Bayesian sparse deep neural network models under spike-and-slab priors.
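The KL-minimization idea behind VI can be illustrated on a toy model. The sketch below (an illustration under stated assumptions, not the thesis's BDNN implementation; all variable names are made up) fits a mean-field Gaussian variational family to the posterior of a conjugate normal-normal model, where the exact posterior is known, so the KL-minimizing solution can be checked directly.

```python
# Minimal sketch of variational inference: choose a family q and minimize
# KL(q || posterior).  The model is the conjugate normal-normal case, so the
# exact posterior N(post_mu, post_var) is available in closed form and the
# KL between two Gaussians has a simple gradient.
import numpy as np

rng = np.random.default_rng(0)
prior_mu, prior_var = 0.0, 10.0      # prior:  theta ~ N(0, 10)
noise_var = 1.0                      # data:   y_i ~ N(theta, 1)
y = rng.normal(2.0, 1.0, size=50)

# Exact posterior from conjugacy.
post_var = 1.0 / (1.0 / prior_var + len(y) / noise_var)
post_mu = post_var * (prior_mu / prior_var + y.sum() / noise_var)

# Variational family q(theta) = N(m, s^2); gradient descent on KL(q || post).
m, log_s = 0.0, 0.0
lr = 0.01
for _ in range(5000):
    s2 = np.exp(2 * log_s)
    grad_m = (m - post_mu) / post_var        # d KL / d m
    grad_log_s = s2 / post_var - 1.0         # d KL / d log_s
    m -= lr * grad_m
    log_s -= lr * grad_log_s

print(m, post_mu)                    # variational mean -> exact posterior mean
print(np.exp(2 * log_s), post_var)   # variational var  -> exact posterior var
```

In the conjugate case the KL-optimal Gaussian recovers the posterior exactly; for a BDNN the posterior is intractable, which is what motivates the stochastic (black-box) gradient estimators discussed later in the thesis.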
Three more closely related works that study the variational posterior are: (1) [3], which discusses the contraction rates of VB in sparse BDNN models with spike and Gaussian slab priors and a mean-field spike and Gaussian slab variational family; (2) [22], which discusses the contraction rates of a tempered VB solution with spike and Gaussian slab priors and a mean-field spike and Gaussian slab variational family; and (3) [6], which discusses the consistency of VB for single-layer neural network models with Gaussian priors and a mean-field Gaussian variational family. All three works focus on a regression setting, unlike our classification setup, which in turn allows for the generalization of VBDNNs to generalized linear models. Further, none of these works discuss computational details or theoretical guidelines for a BDNN to achieve a desired level of accuracy. The work of [6] does not establish contraction rates or deal with deep networks. The works of [22] and [3] establish contraction rates; however, their notions of convergence do not agree with the classical definition of posterior contraction as established in Theorem 2.1 of [47], wherein one needs to find the rates at which the variational posterior gives probability to shrinking Hellinger neighborhoods of the true density. The notion of contraction used in [22] considers the contraction rate of the quantity ||f_0 − f_θ||_1 instead of Hellinger neighborhoods of the true density. The work of [3], on the other hand, considers the posterior expectation of the square of the Hellinger distance instead of the posterior probability of shrinking Hellinger neighborhoods. Note that, in terms of the notion of consistency, our work is similar to those of [105] (Theorem 5.1) and [122] (Theorem 2.1), but in the context of the variational posterior instead of the true posterior.
The derivation of the posterior contraction rates in the classical sense provides the additional advantage of quantifying the loss in classification accuracy of BDNNs incurred by using the VI approach instead of the MCMC approach, a result which, to the best of our knowledge, does not exist in the literature. Additionally, our current work does not assume a sparsity constant s* that controls the overall complexity of the model. We instead start with a dense network and break down the complexity of a deep neural network into three components: (1) the number of layers, (2) the number of nodes and (3) the strength of interactions between active nodes. Then, we study the impact of each of these components on the consistency, contraction rates and classification accuracy of the variational posterior based on the BDNN. Finally, this thesis adapts the control variates and adaptive learning rate approach proposed in [107] to BDNNs. This allows us to analyze the stability of the numerical optimization used for obtaining a variational Bayes solution as a function of the complexity of the model. We would like to emphasize that, unlike in the high-dimensional regression model, the sparsity constant s* is not well defined in a DNN, as the layers can be thought of as a sequence and there should not be any gap between layers.

1.4 Posterior Consistency

To evaluate the validity of a posterior in non-parametric models, one must establish its consistency and contraction rates. Unlike any of the previous works, we establish the posterior consistency and contraction rates of the variational posterior in the classical sense; see Theorems 3.4.1 and 3.4.2. For a simple consistency result, one needs to show that the posterior concentrates around a Hellinger neighborhood of the true density function with overwhelming probability.
A deep neural network model for which the input feature space and the number of layers are fixed enjoys consistency properties irrespective of the true function under study, as long as the total number of parameters grows at a rate smaller than the sample size n. In this direction, we establish that the true posterior probability of an ε-Hellinger neighborhood grows at the rate 1 − 2e^{−nε²/2}, in contrast to the slower rate 1 − a, a → 0 as n → ∞, for the variational posterior. For establishing the rates of contraction, one needs to show that the posterior concentrates around shrinking Hellinger neighborhoods of the true density with overwhelming probability. To determine the rates of contraction, one needs assumptions on the neural network solution that approximates the true function, together with the total number of parameters being less than n. Treating the input feature space as the number of nodes in the 0th layer, we find that the approximating neural network solution must satisfy three main properties: (1) the number of layers grows at a rate smaller than log n, (2) the number of nodes in each layer is well controlled and (3) the number of connections between active nodes is well controlled. In this direction, we establish that the true posterior probability of a shrinking ε_n-Hellinger neighborhood grows at the rate 1 − 2e^{−nε_n²/2}, in contrast to the slower 1 − a rate for the variational posterior. For the BDNN, we next establish the connection between the posterior contraction rate and classification accuracy. In this direction, we first show that the classification accuracy of a consistent posterior asymptotically approaches the Bayes classifier's classification accuracy. With no assumptions on the true function, we show that, for a deep neural network model in which the number of input features and the number of layers are fixed, the convergence rates of the classification accuracy are the same for both the variational approximation and the true posterior.
However, under suitable assumptions on the approximating neural network solution as described in the above paragraph, we establish that the classification accuracy of the variational posterior approaches the classification accuracy of the Bayes classifier at the rate ε_n^{2/3}, in contrast to the faster rate ε_n for the true posterior. This theoretical discovery quantifies the loss due to using the variational posterior instead of the true posterior density. We provide prior elicitation for Bayesian estimation. Our detailed mathematical treatment provides theoretical guidelines for selecting the prior distributions that might affect prediction accuracy. For example, even if one works with fairly vague priors, there is a limit on the hyper-parameter values one can choose while still achieving a desired level of consistency. We also discuss how the choice of variational distribution, along with the prior distribution, affects posterior consistency. Besides the theoretical validation, the challenges of implementing a VI-based approach are twofold: (1) the choice of the variational family and (2) the optimization of the KL-divergence. For the first issue, we show that a simple mean-field Gaussian variational family suffices for posterior consistency along with good numerical performance. For the second issue, this thesis discusses the associated computational challenges of using a VI approach and provides statistically principled guidelines to overcome them. We first adapted the black-box variational inference (BBVI) algorithm in [107] to classification based on BDNNs and used Monte Carlo estimates of the gradient of the evidence lower bound (ELBO) for stochastic optimization of the variational parameters. We then adapted the control variates approach of [107] to allow for faster convergence to the solution. We found that control variates offer a great deal of improvement in run time when using one or two layers.
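The BBVI ingredients just described (a score-function Monte Carlo estimate of the ELBO gradient plus a control variate) can be sketched for a single variational parameter. The toy target below is an illustrative assumption, not the thesis code: q = N(μ, 1) approximating p = N(2, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumption): approximate p = N(2, 1) with q = N(mu, 1).
def log_p(z):
    return -0.5 * (z - 2.0) ** 2

def log_q(z, mu):
    return -0.5 * (z - mu) ** 2

def score(z, mu):            # d/dmu log q(z; mu)
    return z - mu

def grad_elbo(mu, n_samples=500, use_cv=True):
    """Score-function (BBVI) estimate of the ELBO gradient w.r.t. mu."""
    z = rng.normal(mu, 1.0, size=n_samples)
    f = score(z, mu) * (log_p(z) - log_q(z, mu))   # naive estimator terms
    if not use_cv:
        return f.mean()
    h = score(z, mu)                               # control variate with E[h] = 0
    a = np.cov(f, h)[0, 1] / h.var()               # estimated optimal scaling
    return (f - a * h).mean()                      # same mean, smaller variance

# Repeat the estimate many times: the control variate leaves the mean
# unchanged but shrinks the Monte Carlo variance.
naive = [grad_elbo(0.0, use_cv=False) for _ in range(200)]
cv = [grad_elbo(0.0, use_cv=True) for _ in range(200)]
print(np.var(naive) > np.var(cv))
```

Both estimators are unbiased for the true gradient (here 2.0 at μ = 0); the control variate only changes the spread, which is what makes larger networks trainable with fewer Monte Carlo samples.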
With an increase in the number of layers, we observed that adaptive learning rates such as Adagrad, as in [107], offer a huge advantage by stabilizing the optimization. We, however, propose the use of RMSprop due to its superior performance over Adagrad. Finally, we discuss in detail the selection of the learning rate, the number of Monte Carlo samples and other tuning controls in the context of the variational Bayes implementation.

1.5 Bayesian Compressed Neural Networks

Bayesian neural networks (BNNs) have achieved state-of-the-art results in a wide range of tasks, especially in high-dimensional data analysis, including image recognition, biomedical diagnosis and others. One of the major disadvantages of neural networks and deep networks is that they require a large amount of training data due to the large number of inherent parameters [140, 45]. For example, high-dimensional neural networks have been widely applied with regularization, dropout techniques or early stopping to prevent overfitting [118, 143]. Furthermore, the most commonly used dimension reduction techniques include the Lasso [17], Ridge [58], Elastic net [152], Sparse group lasso [116], Bayesian Lasso [98], Horseshoe prior [16] and principal component analysis [115]. Even though the ℓ1 and ℓ2 norms can force the weights to become zero or small, they do not have the regularizing effect of making the computed function simpler [70]. Additionally, all these methods rely on the use of the whole data, which severely increases the cost of both computation and memory storage. In this thesis, we propose the use of a BNN on a compressed feature space to take care of the large p, small n problem by projecting the feature space onto a smaller dimensional space using a random projection matrix. Random projection (RP) is a powerful dimension reduction technique which uses RP matrices to map data into low-dimensional spaces.
The use of RP in high-dimensional statistics is motivated by the Johnson–Lindenstrauss lemma [27], which states that for x_1, ..., x_n ∈ ℝ^p, ε ∈ (0, 1) and d > 8 log n / ε², there exists a linear map f : ℝ^p → ℝ^d such that (1 − ε)||x_i − x_j||₂² ≤ ||f(x_i) − f(x_j)||₂² ≤ (1 + ε)||x_i − x_j||₂² for i, j = 1, ..., n. The properties of RPs and their applications to statistical problems were further explored in [33, 13], among others. To reduce the sensitivity to the choice of random matrices, one must pool information obtained from multiple projections. In this thesis, we adopt a Bayesian model averaging approach for combining information across multiple instances of RP-based neural networks. There are two main challenges in implementing Bayesian model averaging: (1) due to the convoluted structure of the neural network likelihood, closed-form expressions do not exist for the posterior distribution under each model; (2) the posterior distribution of the model weights is completely intractable and no closed-form solutions exist. Thereby, the implementation of standard Markov Chain Monte Carlo (MCMC) is next to impossible. Further, the computation and storage costs associated with an MCMC implementation are enormous, since each posterior model weight depends on the remaining models' posterior model weights. To address the challenges of MCMC implementation, we use the variational inference (VI) [63, 9] approach to provide an approximate solution for Bayesian model averaging (BMA), allowing us to combine BNNs across multiple instances of compression on the feature space. There is a plethora of literature implementing variational inference in neural networks [10]. However, these implementations make use of the entire feature space, thereby putting a great burden on computational stability and memory storage.
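The Johnson–Lindenstrauss bound above is easy to check empirically. The sketch below (the dimensions and the 1/√d scaling of the Gaussian RP matrix are illustrative assumptions) projects a small point cloud and reports how many pairwise distances fall within the (1 ± ε) distortion band.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions (assumptions, not the thesis data): n points in R^p.
n, p, eps = 50, 5000, 0.5
d = int(np.ceil(8 * np.log(n) / eps**2))       # d > 8 log n / eps^2, as in the lemma

X = rng.normal(size=(n, p))
R = rng.normal(size=(p, d)) / np.sqrt(d)       # Gaussian RP: E||x R||^2 = ||x||^2
Y = X @ R                                      # compressed features in R^d

def sq_dists(A):
    """Squared Euclidean distances between all pairs of rows."""
    G = A @ A.T
    s = np.diag(G)
    return s[:, None] + s[None, :] - 2.0 * G

iu = np.triu_indices(n, k=1)
ratio = sq_dists(Y)[iu] / sq_dists(X)[iu]
ok = np.mean((ratio >= 1 - eps) & (ratio <= 1 + eps))
print(f"d = {d}, fraction of pairs within (1 +/- eps): {ok:.3f}")
```

Since the guarantee is probabilistic, a few pairs may exceed the band for any single draw of R; pooling across several draws, as the BMA approach described above does, reduces this sensitivity to the particular random matrix.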
We address two main challenges in this thesis: (1) developing a variational Bayes (VB) solution for BNNs with a compressed feature space and (2) providing a VB solution for performing BMA across multiple instances of RP. Further, we establish the posterior contraction rates for the variational posterior for classification (the theory is extendable to the regression setup with minor modifications). In this direction, we provide a characterization of the prior, the variational posterior and the RP matrix which guarantees the convergence of the variational Bayes neural network (VBNN) under the compressed feature space to the true density of the observations. The main advantage of implementing a BMA approach is that it gives the posterior model weights under each compression of the feature space. The posterior model weights so obtained in turn induce a probability distribution on the projected dimension of the feature space. The mode of this probability distribution concentrates around the intrinsic dimensionality of the feature space. The BMA approach is then applied to a pool of RP matrices whose projected dimension lies in a neighborhood of the intrinsic dimensionality to improve the prediction performance. Finally, we study the numerical behavior of the proposed procedure in the light of simulation and real data sets. To the best of our knowledge, no literature provides theoretical guarantees and computational algorithms for VBNNs with a compressed feature space.

CHAPTER 2

A ROLE FOR PRIOR KNOWLEDGE IN STATISTICAL CLASSIFICATION OF THE TRANSITION FROM MCI TO ALZHEIMER'S DISEASE

2.1 Introduction

The transition from mild cognitive impairment (MCI) to dementia is of great interest to clinical research on Alzheimer's disease (AD) and related dementias. This phenomenon also serves as a valuable data source for quantitative methodological researchers developing new approaches for classification.
However, the growth of machine learning (ML) approaches for classification may falsely lead many clinical researchers to underestimate the value of logistic regression (LR), which often demonstrates classification accuracy equivalent or superior to other ML methods. Further, when faced with many potential features that could be used for classifying the transition, clinical researchers are often unaware of the relative value of different approaches for variable selection. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), the present study investigated automated and theoretically guided feature selection techniques in the context of LR and support vector machine (SVM) classification methods for predicting conversion from MCI to dementia. The present findings demonstrate how similar performance can be achieved using user-guided, clinically informed pre-selection versus algorithmic feature selection techniques. These results show that although SVM and other ML techniques are capable of relatively accurate classification, similar or higher accuracy can often be achieved by LR, mitigating SVM's necessity or value for many clinical researchers.

2.2 Transition from MCI to dementia

Alzheimer's disease (AD) is a progressive, age-related, neurodegenerative disease and the most common cause of dementia [147, 148, 67]. Behaviorally, Alzheimer's dementia is commonly preceded by mild cognitive impairment (MCI), a syndrome characterized by declines in memory and other cognitive domains that exceed the cognitive decrements associated with normal aging [148, 103]. However, the prodromal symptoms of MCI are not prognostically deterministic: individuals with MCI tend to progress to diagnoses of probable AD at a rate of 8%–15% per year, and many conversions are detectable within 3 years of initial presentation [24, 44, 2].
Research efforts to provide new insights into the incidence of MCI-to-AD conversion have focused largely on clinically or biologically relevant features (i.e., neuroimaging markers, clinical exam data, neuropsychological test scores) and on different methods for statistical classification [145]. For clinical researchers, however, there may be a tendency to conflate more sophisticated, novel analytic approaches with the value of multimodal information from neuroimaging and clinical assessment. Moreover, whereas statisticians may inherently understand the comparability of different quantitative approaches, the novelty of both big data and data-driven approaches for studying MCI-to-AD conversion may lead clinical researchers to assume that such data-driven methods are inherently superior to more theoretically grounded approaches. Thus, the value of using extant findings and domain expertise to help guide and constrain the application of newer data-driven approaches capable of capitalizing on emergent big data may be a particularly important consideration for clinical researchers. Statistical classification in clinical research has traditionally utilized binary logistic regression (LR). However, key attributes of modern clinical and neuroimaging data, including high dimensionality and the presence of ground-truth estimates of pathology and diagnosis, provide new opportunities for quantitative research. This has led to a substantial expansion in the use of data from the Alzheimer's Disease Neuroimaging Initiative (ADNI; http://adni.loni.usc.edu) for quantitative research and methodological development, particularly by researchers utilizing and developing prediction and classification methods in machine learning (ML). Besides LR, the support vector machine (SVM) has quickly become the most common type of ML classifier for diagnostic prediction and classification with ADNI data.
In general, LR works well when the data are linearly separable and the number of observations is greater than the number of features. Moreover, SVM and LR have similar misclassification rates (MCRs) when used to diagnose malignant tumors from imaging data [19, 30]. Indeed, before the rapid expansion of ML research and applied work over the past decade, many clinical researchers and those outside of engineering and mathematically intensive disciplines had little exposure to classification approaches other than LR. Despite its growing popularity, the relative benefits of SVM or other forms of ML [101, 87] over LR for such classification are not always apparent. Although this may be of little surprise to statisticians and quantitative researchers, such perspectives are often lost on clinical researchers, whose implicit belief in the superiority of ML is driven by the volume of publications, rather than by training or empirical demonstration. Most efforts to develop new classification methods for prediction of MCI-to-AD conversion are well suited to integrate measures from multiple sources such as demographics, clinical rating scores, neuropsychological testing, neuroimaging, genetic markers, etc. However, identifying which combination of features most accurately classifies conversion from MCI to AD is a key challenge for ADNI, and may vary by method. The ℓ1 norm regularization method (i.e., the lasso) is a widely used feature selection technique for LR and SVM. ℓ1 regularization is popular for addressing circumstances in which the number of features is quite large or even larger than the sample size. At some risk of abusing statistical terminology, this problem is often generically referred to as the "small n, large p" or high-dimensional problem. The ℓ1 technique has a dual impact: the algorithm can (i) optimize a higher number of parameters relative to the sample size, and (ii) reduce the effective number of parameters (i.e., perform variable selection).
This powerful technique has been implemented in ADNI data with LR [144]. However, ℓ1 and other algorithmic feature selection methods used in ML suffer from one key limitation: they are agnostic to theoretical considerations, and as such, they cannot explain why selected features are meaningful and important to the model. When sampling from a large pool of features, algorithmic approaches fail to incorporate prior knowledge of features and their associations with the relevant systems into variable selection. Therefore, domain expertise and prior knowledge may afford additive or differential value for choosing features and interpreting model results over algorithmic feature selection methods alone. Most real-world problems occur in the context of additional information about each potential feature and its conceptual relationship with the phenomenon being classified. Other than using ℓ1 feature selection, manually trimming the list of potential predictor variables can also protect against over-fitting, and offers potential insight into why selected features are important to the model. When guided by prior knowledge, user-guided or 'manual' feature selection may be a valuable additional step to help minimize potentially spurious effects. This perspective is frequently lost on applied researchers, as most commonly used variable selection algorithms are context-free; that is, they only look at relationships within the data set and cannot factor in the wider meanings of variables. Furthermore, this also means that automated algorithms may identify relationships among a large number of predictor variables that are spurious and unlikely to generalize outside the data set.
Although there are a vast number of potential neuroimaging features in ADNI data, the present study focused only on regional brain volumes segmented from structural magnetic resonance imaging (MRI) data, the most commonly used neuroimaging data type for classifying MCI-to-dementia conversion. In contrast to prior studies that used a limited set of volumetric brain features, the present study utilized data generated by modern multi-atlas segmentation methods, and analyses included up to 259 features: anatomically specific gray and white matter volumes. The large pool of extant findings from studies evaluating regional brain MRI volumetry in prediction and classification of MCI-to-dementia conversion, using both limited and expansive feature sets, also provides a valuable set of priors for relevant brain regions [18, 43, 123, 91, 42, 108]. Thus, applied researchers are often left with the conundrum of choosing between more confirmatory approaches that use few regions in classification and more exploratory methods in which prior findings have little value. The present study addressed two questions regarding commonly used classification approaches for predicting MCI-to-dementia conversion in multi-modal data from ADNI. First, we compared the performance accuracy of binary LR with SVM in classifying MCI-to-dementia conversion. Second, we asked whether applying prior knowledge in feature selection outperforms algorithmic variable selection alone. We hypothesized that (1) LR would perform comparably to SVM, and (2) user-guided variable selection would outperform algorithmic variable selection alone. This work is intended to demonstrate to clinical researchers the benefit of using ML in an informed fashion, rather than as a 'black box' that obscures clear interpretation.
Moreover, we wish to emphasize that this study is not meant to highlight a novel innovation in quantitative methods, but rather to provide an important example to applied researchers regarding the comparable value of ML methods and the importance of domain expertise in classification with ADNI data.

2.3 Materials and Data

The data used in the preparation of this study were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI). ADNI is an ongoing joint public-private effort to utilize neuroimaging, other biological markers, and clinical and neuropsychological assessment to measure the incidence and progression of MCI to early dementia. Determination of sensitive and specific markers of preclinical AD and MCI is intended to help researchers and clinicians develop new treatments and monitor their effectiveness, as well as reduce the time and cost of clinical trials. Data in the present study came from all sites across the U.S. and Canada. All ADNI study participants included in the present analyses were between 55 and 90 years old, spoke English or Spanish as their native language, and had a study partner who provided an independent assessment of functioning. This study used a subset of the 819 participants from ADNI-1 diagnosed with MCI at baseline and for whom demographic, clinical cognitive assessment, APOE4 genotyping, and MRI measurements were also available. To evaluate differences in classification performance due to participant inclusion and dropout, we subdivided the sample into two overlapping groups. After applying other criteria for inclusion, Group One included all patients whose follow-up period was at least 36 months (n = 265); Group Two consisted of all patients with follow-up assessments at 24 months (n = 308).
Although the ADNI study protocol includes additional follow-up visits at 6-month intervals, the present study only evaluated baseline data for features (i.e., clinical, neuropsychological, brain volumetric) in the classification analyses. In addition, identification of stable vs. converting clinical outcomes only considered longer-term outcomes based on assessments at 2 and 3 years after baseline. The final samples included 265 and 308 study participants in Groups One and Two, respectively, who met the criteria for inclusion. Both groups included participants who were stable in their diagnosis (MCI-S) and those who converted to a diagnosis of dementia over the 2 or 3 years (MCI-C). Table 2.1 shows the participant characteristics. Diagnostic criteria for MCI included an MMSE score at baseline between 24 and 30, a CDR score of 0.5, and a subjective memory complaint, in addition to objective memory loss measured by education-adjusted scores on the Logical Memory II subscale of the Wechsler Memory Scale, generally preserved activities of daily living, and no dementia. The diagnostic criteria for dementia were an MMSE score between 20 and 26, and a CDR score between 0.5 and 1.0. The clinical status of each participant diagnosed with MCI was re-assessed at each follow-up visit and updated to reflect one of several outcomes (e.g., MCI or dementia subtypes). The MCI-C and MCI-S group designations were based on this follow-up clinical diagnosis and coded as 1 for MCI-C or 0 for MCI-S in the classification study.

Table 2.1: Sample Sizes by Timing and Diagnosis: Group One and Two

Group   Time        # MCI-S (y=0)   # MCI-C (y=1)   # Total patients
One     36 months   101             164             265
Two     24 months   122             186             308

The table shows the number of MCI-C, MCI-S and total subjects in Groups One and Two. The number of MCI-C patients is higher than that of MCI-S patients in both groups.
2.3.1 Data Used in Classification

Evaluation of extant reports of common predictors of conversion from MCI to dementia focused on dimensions of neuropsychological test performance, clinical assessment, genetic data, and regional brain volumes. In the present study, we first divided these variables into two sets of features, with all non-brain volumetric variables in one set and all variables representing regional brain volumes in a second set. In addition, we created a third set of features from the volumetry feature set that only included 26 of the 259 brain volumes. Henceforth, we refer to models that only include one of these three feature sets as 'single-modality,' whereas models that combine brain and non-brain feature sets are referred to as 'multi-modal.'

Figure 2.1: Comparison of distributions for baseline predictor variables between the MCI-S and MCI-C groups. (a) MMSE scores: the mean MMSE score in MCI-S is higher than in MCI-C. (b) Learning scores: the mean Learning scores of the MCI-C and MCI-S groups are 2.5 and 5, respectively.

2.3.2 Clinical Cognitive Assessment and Genetic data

We considered a total of 19 clinical features as potential predictors of MCI-to-AD progression in our classification analyses. These included the following assessment scores: the Mini-Mental State Examination (MMSE), Clinical Dementia Rating Sum of Boxes (CDR-SB), Alzheimer's Disease Assessment Scale-cognitive subscale (ADAS-cog), Functional Activities Questionnaire (FAQ) measures of activities of daily living, Trail Making Test-B (TRABSCOR), the immediate and delayed recall components of the Rey Auditory Verbal Learning Test (RAVLT), the Digit-Symbol Coding test (DIGT) and the Digit Symbol Substitution Test from the Preclinical Alzheimer Cognitive Composite (mPACCdigit). We also considered carrier status for the epsilon-4 allele of the apolipoprotein E (APOE) gene [145] as a genetic predictor in this study.
Table 2.2 summarizes all 19 clinical, demographic and genetic features used in this study. Preliminary comparison of six clinical and genetic predictors by MCI-C and MCI-S subgroups showed that five of them (APOE4, ADAS4, CDR, MMSE and RAVLT.learning) significantly differ between the groups, whereas one (SEX) does not. Figs. 2.1 and 2.2 illustrate the distributions of these predictors for both groups. Overall, in comparison to MCI-S participants, those in the MCI-C group were more cognitively and functionally impaired at baseline, exhibited greater verbal memory impairments, and included a greater proportion of APOE4 carriers.

Figure 2.2: Comparisons between MCI-S and MCI-C groups on baseline predictor variables: (a) sex distributions, (b) APOE4 genotype distributions, (c) CDR distributions, (d) ADAS distributions. The y-axis of panels (a) through (d) represents the number of participants developing AD. Blue and red bars represent non-converters and converters, respectively. Panel (a) shows a greater number of converters than non-converters for both men and women. Panel (b) shows that more than half of MCI-C subjects are APOE4 carriers and approximately 70% of MCI-S subjects are non-carriers. Panel (c) shows that MCI-S subjects have relatively lower CDR scores and MCI-C subjects have higher CDR scores; the number of people in the MCI-C group trends downward as the CDR score increases. Panel (d) shows that MCI-C subjects have relatively higher ADASQ4 scores; the average ADASQ4 scores of MCI-S and MCI-C subjects are approximately 5 and 8, respectively.
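The group comparisons reported in Table 2.2 combine two standard tests: t-tests for continuous baseline measures and a chi-square test for categorical ones such as APOE4 carrier status. A minimal sketch with made-up illustrative numbers (not the actual ADNI values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical baseline MMSE scores for the two groups (illustrative only).
mmse_s = rng.normal(27.6, 1.7, size=101)   # MCI-S
mmse_c = rng.normal(26.8, 1.7, size=164)   # MCI-C
t, p_t = stats.ttest_ind(mmse_s, mmse_c)   # two-sample t-test

# Hypothetical APOE4 carrier counts by group (columns: carrier, non-carrier).
table = np.array([[35, 66],                # MCI-S
                  [102, 62]])              # MCI-C
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

print(f"t = {t:.2f} (p = {p_t:.4f}); chi2 = {chi2:.2f} (p = {p_chi:.4f}), dof = {dof}")
```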
Table 2.2: Clinical Features and Cognitive Assessment Scores of Group One

Characteristics          MCI-S             MCI-C             Test statistic   P-value
Age (years)              74.34 ± 7.78      74.84 ± 6.83      -0.528           > 0.50
Education (years)        15.57 ± 2.94      15.73 ± 2.91      -0.527           > 0.51
Sex, % female            33.67%            34.14%            0                11
APOE4 carriers, %        34.65%            62.19%            17.900           < 0.001
CDRSB                    1.23 ± 0.61       1.72 ± 0.92       -5.237           < 0.001
MMSE score               27.61 ± 1.74      26.82 ± 1.71      3.645            < 0.001
ADAS11                   8.89 ± 3.79       12.29 ± 4.16      -6.823           < 0.001
ADAS13                   14.48 ± 5.50      20.01 ± 5.79      -7.795           < 0.001
ADASQ4                   4.76 ± 2.19       6.77 ± 2.21       -7.339           < 0.001
RAVLT.immediate          36.21 ± 10.10     29.10 ± 7.98      6.021            < 0.001
RAVLT.learning           4.19 ± 2.47       2.91 ± 2.26       4.231            < 0.001
RAVLT.forgetting         4.31 ± 2.59       4.47 ± 2.15       -1.501           0.135
RAVLT.perc.forgetting    51.55 ± 31.04     72.85 ± 30.45     -5.464           < 0.001
LEDLTOTAL                4.96 ± 2.36       3.41 ± 2.66       4.931            < 0.001
DIGTSCOR                 40.75 ± 11.09     36.72 ± 10.96     2.883            < 0.005
TRABSCOR                 109.43 ± 62.94    132.09 ± 71.36    -2.704           0.007
FAQ                      1.50 ± 2.99       4.96 ± 4.79       -7.243           < 0.001
mPACCdigit               5.376 ± 2.96      8.06 ± 2.96       7.174            < 0.001
mPACCtrailsB             5.47 ± 3.06       8.22 ± 2.98       7.174            < 0.001

The table reports Group One only (265 patients; 36-month follow-up). Values are shown as mean ± standard deviation or percentage. Test statistics and P-values for differences between MCI-S and MCI-C are based on (a) t-tests or (b) chi-square tests. MCI-S = non-progressive MCI; MCI-C = progressive MCI; APOE = apolipoprotein E; MMSE = Mini-Mental State Examination.
RAVLT = Rey Auditory Verbal Learning Test (immediate: sum of trials 1–5; learning: trial 5 − trial 1; forgetting: trial 5 − delayed; perc.forgetting: percent forgetting); DIGT = Digit-Symbol Coding test; TRAB = Trail Making Test; CDRSB = Clinical Dementia Rating Sum of Boxes; FAQ = Activities of Daily Living score; ADAS = Alzheimer's Disease Assessment Scale–cognitive subscale; mPACCdigit = Digit Symbol Substitution Test from the Preclinical Alzheimer Cognitive Composite.

2.3.3 MRI data

Structural MRI data were collected according to the ADNI acquisition protocol using T1-weighted scans (GradWarp, B1 Correction, N3, Scaled) [36]. These data included baseline structural MRI scans of 840 ADNI participants, including 230 diagnosed as cognitively normal, 200 with diagnoses of dementia, and 410 diagnosed with MCI. Processing for the ROI-based volumetric data used in the present study included brain extraction [34] and a multi-atlas, consensus-based label fusion scheme for anatomical parcellation [35] to generate template-based ROIs deformed to individual subject space. MRI scans were automatically segmented into 145 anatomic regions of interest (ROIs) spanning the entire brain. An additional 114 derived ROIs were calculated by combining single ROIs within a tree hierarchy to obtain volumetric measurements from larger structures [36]. In total, 259 ROIs were measured and used as potential predictors of MCI-to-dementia progression in this study. One of the goals of this study is to investigate whether manually selecting predictors improves a model's performance. Based on the extant literature [68], we manually selected 26 of the 259 features as theoretically significant predictors of MCI-to-dementia progression (Table 2.3) [18, 43, 123, 91, 42, 108].
While many brain regions have been reported as showing some relationship to MCI-to-dementia progression, prior reports and reviews clearly implicate hippocampal and entorhinal cortical volumes as markers of such conversion. In addition, we manually selected additional regions based on their common occurrence across reports, including the cingulate gyrus, precuneus, amygdala, inferior frontal gyrus, superior parietal lobule, and lobar white matter volumes.

2.4 Method and Algorithm

In the following section, we utilize binary LR and SVM classification techniques to investigate which approach yields superior discrimination accuracy in the context of ADNI data. Prior comparisons of logistic regression and SVM have reported that SVM requires fewer variables than logistic regression to achieve an equivalent misclassification rate (MCR) [131, 30]. These reports also indicate that SVM performs better than LR with microarray expression data [30]. Furthermore, SVMs have a nice dual form, giving sparse solutions when using the kernel trick. In addition, both methods involve minimizing a cost associated with misclassification based on the likelihood ratio of a probabilistic model. Therefore, LR and SVM share common roots in statistical pattern recognition, which we utilize in the comparison of their performance on multi-modal ADNI data.

Table 2.3: Pre-selected MRI Features of Group One

Characteristics   MCI-S             MCI-C             Test statistic   P-value
HippoR            3684 ± 438        3366 ± 437        5.735            < 0.001
HippoL            3414 ± 418        3105 ± 388        5.994            < 0.001
flWMR             96720 ± 6218      96976 ± 5585      -0.338           0.73
flWML             93671 ± 5836      94238 ± 5160      -0.802           0.42
plWMR             47197 ± 3415      47141 ± 3098      0.135            0.89
plWML             50149 ± 3714      50038 ± 3467      0.242            0.81
tlWMR             56076 ± 3252      55934 ± 2931      0.359            0.72
tlWML             55412 ± 3396      55468 ± 3023      -0.136           0.89
ACgCR             3167 ± 756        3128 ± 641        0.438            0.66
ACgCL             4104 ± 787        4075 ± 689        0.312            0.76
EntR              2189 ± 365        1983 ± 373        4.412            < 0.001
EntL              2050 ± 399        1844 ± 356        4.240            < 0.001
MCgCR             4176 ± 547        4200 ± 541        -0.341           0.73
MCgCL             3988 ± 493        4002 ± 559        -0.213           0.83
MFCR              1581 ± 342        1505 ± 524        1.805            0.07
MFCL              1566 ± 285        1548 ± 291        0.487            0.62
OpIFGR            2575 ± 608        2425 ± 546        2.021            0.04
OpIFGL            2465 ± 550        2361 ± 579        1.466            0.14
OrIFGR            1252 ± 315        1196 ± 362        1.322            0.18
OrIFGL            1514 ± 335        1398 ± 356        2.658            < 0.001
PCgCR             3679 ± 466        3528 ± 415        2.657            < 0.001
PCgCL             3991 ± 442        3789 ± 424        3.676            < 0.001
PCuR              10129 ± 1193      9862 ± 1313       1.701            0.09
PCuL              10005 ± 1263      9759 ± 1299       1.522            0.13
SPLR              8867 ± 1140       8693 ± 1219       1.180            0.02
SPLL              8880 ± 1192       8662 ± 1313       1.390            0.17

Values are shown as mean ± standard deviation. Test statistics and P-values for differences between MCI-C and MCI-S are based on t-tests. MCI-S = non-progressive MCI; MCI-C = progressive MCI. HippoR/HippoL = right/left hippocampus; flWMR/flWML = right/left frontal lobe WM; plWMR/plWML = right/left parietal lobe WM; tlWMR/tlWML = right/left temporal lobe WM; ACgCR/ACgCL = right/left anterior cingulate gyrus; EntR/EntL = right/left entorhinal area; MCgCR/MCgCL = right/left middle cingulate gyrus; MFCR/MFCL = right/left medial frontal cortex; OpIFGR/OpIFGL = right/left opercular part of the inferior frontal gyrus; OrIFGR/OrIFGL = right/left orbital part of the inferior frontal gyrus; PCgCR/PCgCL = right/left posterior cingulate gyrus; PCuR/PCuL = right/left precuneus; SPLR/SPLL = right/left superior parietal lobule.

2.4.1 Logistic Regression

Logistic regression (LR) is the most commonly used machine learning approach for binary classification.
In the past decade, LR has been applied to the task of classifying MCI-to-dementia conversion [29, 144, 82]. In the present study, we consider a supervised learning task where we are given M training examples D = {(x_i, y_i), i = 1, ..., M}. Here each x_i ∈ ℝ^N is an N-dimensional feature vector, and y_i ∈ {0, 1} is a class label. The goal of LR is to model the probability p of a random variable y being 1 or 0 given the experimental data x. The logit, the natural logarithm of the odds, is the key concept underlying logistic regression:

logit(p) = log( p / (1 − p) )   (2.1)

The equation for LR is:

log[ P(y_i = 1 | x_i; β) / (1 − P(y_i = 1 | x_i; β)) ] = Σ_{j=1}^{N} β_j x_ij   (2.2)

where β = (β_1, ..., β_N)^T are the parameters or weights of the logistic regression model and x_i = (x_i1, ..., x_iN), i = 1, ..., M. Here, P(y_i = 1 | x_i; β) is the probability that the i-th MCI patient will develop dementia and P(y_i = 0 | x_i; β) is the probability that the i-th MCI patient will not. Denoting P(y_i = 1 | x_i; β) = h(x_i), we have

h(x_i) = 1 / ( 1 + exp(−Σ_{j=1}^{N} β_j x_ij) )   (2.3)

LR is usually trained by minimizing an error function; an appropriate choice for binary classification problems is the cross-entropy error:

e_i(β) = −y_i log(h(x_i)) − (1 − y_i) log(1 − h(x_i))   (2.4)

The total cost over the data D = {(x_i, y_i), i = 1, ..., M} is:

J(β) = (1/M) Σ_{i=1}^{M} [ −y_i log(h(x_i)) − (1 − y_i) log(1 − h(x_i)) ]   (2.5)

Consider the problem of finding the maximum likelihood estimate (MLE) of the parameters β for the unregularized logistic regression model. To find the optimal weights β, the total cost must be minimized:

β_optimal = argmin_β (1/M) Σ_{i=1}^{M} [ −y_i log(h(x_i)) − (1 − y_i) log(1 − h(x_i)) ]   (2.6)

Solving Eq. (2.6) yields the optimal weights β. However, the model-building challenge is to abstract the underlying distribution from the particular instance D of samples, because of the relatively small sample size compared to the number of features.
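The minimization in Eq. (2.6) has no closed form, so in practice the cross-entropy cost is minimized iteratively. A minimal gradient-descent sketch on synthetic data (the dimensions, true weights and learning rate below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: M examples, N features, known "true" weights (illustrative).
M, N = 200, 5
X = rng.normal(size=(M, N))
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
y = (rng.uniform(size=M) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def cost(beta):
    """Cross-entropy cost J(beta) of Eq. (2.5)."""
    h = 1 / (1 + np.exp(-X @ beta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

beta = np.zeros(N)
lr = 0.5
for _ in range(2000):
    h = 1 / (1 + np.exp(-X @ beta))
    beta -= lr * X.T @ (h - y) / M     # gradient of Eq. (2.5) w.r.t. beta

print(cost(np.zeros(N)) > cost(beta))  # training reduced the cost
```

The gradient step uses the well-known identity that the derivative of Eq. (2.5) with respect to β is (1/M) X^T (h − y), which follows from differentiating the cross-entropy through the sigmoid of Eq. (2.3).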
The problem of replicating the data set instead of identifying the underlying distribution is known as overfitting [37]. To avoid overfitting, it is often necessary to apply a dimension-reduction technique. The L_1 and L_2 norms are widely used for this purpose, especially when there is only a small number of training examples or a large number of features to be learned. The L_1 norm (lasso) is also often used for feature selection and has been shown to generalize well in the presence of many irrelevant features [76, 109]. L_1 regularization is implemented by adding the L_1 norm to the cost function; the cost function and the optimization problem become

    J(\beta) = \frac{1}{M} \sum_{i=1}^{M} \left[ -y_i \log(h(x_i)) - (1 - y_i) \log(1 - h(x_i)) \right] + \lambda \|\beta\|_1    (2.7)

and

    \beta_{\mathrm{optimal}} = \arg\min_{\beta} \left\{ \frac{1}{M} \sum_{i=1}^{M} \left[ -y_i \log(h(x_i)) - (1 - y_i) \log(1 - h(x_i)) \right] + \lambda \|\beta\|_1 \right\}    (2.8)

where \lambda is a positive tuning parameter. Eq. (2.8) is referred to as L_1-regularized logistic regression.

2.4.2 Support Vector Machine

The support vector machine (SVM) is another classification and regression method that can handle high-dimensional feature vectors. Algorithmically, SVMs build optimal boundaries between data sets by solving a constrained quadratic optimization problem [23, 112, 129, 128, 127]. The number of studies applying SVM to the classification of conversion from MCI to dementia has grown over the past decade [147, 148, 24, 145, 56, 68, 31, 59, 28, 136].

We briefly review the basic support vector machine with a linear kernel (SVM-linear) for classification problems. Let \beta^\top h(x) + \beta_0 = 0 denote a hyperplane (decision surface) equidistant from the closest point of each class in the new space. The goal of the SVM is to find \beta and \beta_0 such that |\beta^\top h(x) + \beta_0| = 1 for the points closest to the hyperplane.
In the following classifier construction, one assumes that

    \beta^\top h(x_i) + \beta_0 \;\; \begin{cases} \geq 1, & \text{if } y_i = 1 \\ \leq -1, & \text{if } y_i = 0 \end{cases}    (2.9)

such that the distance from the closest point of each class to the hyperplane is 1/\|\beta\| and the distance between the two groups is 2/\|\beta\|. To maximize the margin, the SVM requires the solution of the following primal optimization problem [151]:

    \min_{\beta, \beta_0} \; \sum_{i=1}^{M} \left\{ 1 - y_i \left[ \beta_0 + \sum_{j=1}^{N} \beta_j h_j(x_{ij}) \right] \right\}_{+}    (2.10)

where h_j is the kernel function, which is linear for SVM-linear; specifically, we choose h_j(x_j) = x_j for the j-th covariate. To make the algorithm work with highly correlated features and improve the fitted model's prediction accuracy, we reformulate the optimization by adding the L_1 norm of \beta, i.e., the lasso penalty:

    \min_{\beta, \beta_0} \; \sum_{i=1}^{M} \left\{ 1 - y_i \left[ \beta_0 + \sum_{j=1}^{N} \beta_j h_j(x_{ij}) \right] \right\}_{+} + \lambda \|\beta\|_1    (2.11)

where \lambda is the tuning parameter that controls the trade-off between loss and penalty. The lasso penalty shrinks the fitted coefficients \beta toward zero and hence benefits from the reduction in the variance of the fitted coefficients.

2.4.3 Experimental Design

We built four different classifiers, each designed to classify individual ADNI participants as belonging to either the MCI-C group or the MCI-S group: Classifier 1 is logistic regression (C-LR); Classifier 2 is logistic regression with the L_1 norm (C-LR-1); Classifier 3 is a support vector machine (C-SVM); and Classifier 4 is an SVM with the L_1 norm (C-SVM-1). To test the classifiers' performance, we constructed five different data sources (Table 2.4). The first three single-modality data sets included clinical cognitive assessment scores and APOE4 status (CCA), all MRI volumes (ROI-NP), and MRI volumes with pre-selection (ROI-P), respectively. Two additional multi-modal data sets were constructed by combining the CCA data separately with the ROI-NP and ROI-P data sets (i.e., brain volumes with and without pre-selection).
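A minimal subgradient-descent sketch of the lasso-penalized hinge objective in Eq. (2.11). It recodes the {0, 1} labels to {-1, +1}, the convention under which the margin constraints of Eq. (2.9) take the usual form y_i(\beta_0 + x_i^\top \beta) \geq 1; the synthetic data, learning rate, and iteration count are illustrative assumptions, not the solver used in the study.

```python
import numpy as np

def l1_svm(X, y01, lam, lr=0.01, iters=2000):
    """Subgradient descent on the lasso-penalized hinge loss of Eq. (2.11).

    Labels are recoded from {0,1} to {-1,+1} so that points with
    y_i * (b0 + x_i @ beta) < 1 are the margin violators."""
    y = 2.0 * y01 - 1.0
    M, N = X.shape
    beta, b0 = np.zeros(N), 0.0
    for _ in range(iters):
        margin = y * (b0 + X @ beta)
        active = margin < 1.0                        # points inside the margin
        g_beta = -(X[active] * y[active, None]).sum(axis=0) / M
        g_b0 = -y[active].sum() / M
        beta -= lr * (g_beta + lam * np.sign(beta))  # subgradient of the L1 term
        b0 -= lr * g_b0
    return beta, b0

# Nearly separable toy data: the label depends on x0 - x1 plus a little noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
y01 = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=400) > 0).astype(float)
beta_hat, b0_hat = l1_svm(X, y01, lam=0.01)
acc = np.mean(((b0_hat + X @ beta_hat > 0).astype(float)) == y01)
```

The recovered weights carry the expected signs on the two informative features, and the L1 term keeps the remaining coefficients small.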
Furthermore, it is interesting to note that the number of MCI-S subjects is 101 (38%) in Group One and 122 (39%) in Group Two, which makes the data rather imbalanced. Consequently, to report the results precisely, the present study also assessed additional model performance parameters, including AUC score, sensitivity, and specificity (the accuracy coefficient is unreliable for imbalanced data). The prediction procedure consisted of three processing stages for Group One (Time = 36 months) and Group Two (Time = 24 months): 1) split the data into training, validation, and testing sets; 2) train the classifiers on the training set, tune the hyper-parameters on the validation set, and assess the classifiers on the testing set, then train the classifiers again with the L_1 norm on the same training set; 3) report the testing accuracy, AUC score, sensitivity, and specificity of each classifier on single-modality data. Specifically, the first stage used 80% of the sample as a training set, while the remaining 20% of the data constituted the testing set. In the second stage, the optimal subsets of features for each data source are determined and chosen following application of the L_1 norm. We then list the top 10 features of each data set for each of the models. In the last stage, we report the AUC score, sensitivity (percent of MCI-C subjects correctly classified), and specificity (percent of MCI-S subjects correctly classified) as measures of classification accuracy. To protect against over-fitting and to avoid optimistically biased estimates of model performance, we report 20 replicate measures of predictive performance for each classifier (1-4); across these different partitions of the data, we report the mean and standard deviation of testing accuracy, AUC score, sensitivity, and specificity (Tables 2.6 and 2.7). We also investigate the relationship between the number of features and model performance.
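The performance measures reported throughout can be computed directly: sensitivity and specificity from the per-class correct-classification rates, and the AUC via the rank (Mann-Whitney) statistic, i.e., the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. The labels and scores below are made up for illustration.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sn = fraction of positives (MCI-C, label 1) correctly classified;
    Sp = fraction of negatives (MCI-S, label 0) correctly classified."""
    sn = np.mean(y_pred[y_true == 1] == 1)
    sp = np.mean(y_pred[y_true == 0] == 0)
    return sn, sp

def auc_score(y_true, scores):
    """AUC as the Mann-Whitney U statistic over all positive/negative pairs,
    counting ties as one half."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Toy example: 3 positive and 4 negative cases with predicted scores.
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])
y_pred = (scores >= 0.5).astype(int)
```

Here 11 of the 12 positive/negative pairs are correctly ordered, so the AUC is 11/12; thresholding at 0.5 gives Sn = 2/3 and Sp = 3/4.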
Finally, we compare the performance of LR with SVM based on their ability to handle problems with a large number of covariates. Figure 2.3 illustrates the prediction framework.

Table 2.4: Modalities

| Data sources | # features |
| Single-modality: Clinical Cognitive Assessment scores and APOE4 data (CCA) | 19 |
| Single-modality: ROI with no pre-selection data (ROI-NP) | 259 |
| Single-modality: ROI with pre-selection data (ROI-P) | 26 |
| Multi-modal: CCA and ROI with no pre-selection data (CCAR-NP) | 278 |
| Multi-modal: CCA and ROI with pre-selection data (CCAR-P) | 45 |

2.5 Results and Analysis

Cross-validation and choice of \lambda. We adopted 10-fold cross-validation to tune the hyper-parameters for each model, which included dividing the data into separate training and validation sets; the ratio of cases in training to validation was 8:2. Here, the training set was used to train the model and the validation set was used to select the hyper-parameters. The results of a 10-fold cross-validation run are summarized by the mean and standard deviation of the model skill scores on the testing data. We use \lambda to denote the hyper-parameter for both LR-L1 and SVM-L1. To select the optimal \lambda, we tried the values \lambda = 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, applied to Eqs. (2.8) and (2.11). Next, we selected the \lambda value with the best cross-validation score and used the selected \lambda with Classifiers 2 and 4 to select optimal features. For brevity, the model performance estimates are reported in Tables 2.6 and 2.7 for the different modalities, and the top 10 selected features are reported in Table 2.5. For example, the best \lambda for ROI-NP-L1 was 0.01, and the top 3 features selected by LR were the left amygdala, right accumbens area, and right middle temporal gyrus.

Figure 2.3: Flowchart of the LR and SVM method. A) ROI-P: ROI-level data with pre-selection; B) ROI-NP: ROI-level data with no pre-selection; C) CCAR: clinical, cognitive assessment scores, APOE4, and ROI-level data.

After the hyper-parameters were selected, we again adopted 10-fold cross-validation to avoid optimistically biased estimates of model performance. In each iteration, 212 of the 265 participants were selected by simple random sampling as training cases and the remaining 53 were used as test cases. The approximately 4:1 ratio of training to test cases is, of course, arbitrary.

Table 2.5: Top 10 features of Group One obtained by L1 regularization

| Rank | C2: CCA | C2: ROI-NP | C2: CCAR-NP | C4: CCA | C4: ROI-NP | C4: CCAR-NP |
| 1 | FAQ | AmyL | FAQ | FAQ | AmyL | FAQ |
| 2 | mPACCtrailsB | AccmR | AmyL | Yrs. Educ. | AccmR | AmyL |
| 3 | APOE4 | MTGR | ADASQ4 | APOE4 | AOrGL | AccmR |
| 4 | ADASQ4 | HippoL | HippoL | mPACCdigit | PCgGL | AOrGL |
| 5 | Learning | AOrGL | MTGR | ADASQ4 | HippoL | PTR |
| 6 | Yrs. Educ. | PrGR | APOE4 | Learning | PrGR | AnGR |
| 7 | Forgetting | PCgGL | AOrGL | ADAS11 | POrGR | APOE4 |
| 8 | mPACCdigit | InfR | Learning | mPACCtrailsB | PTR | PCgGL |
| 9 | ADAS13 | POR | mPACCtrailsB | DELTOTAL | LOrGL | Learning |
| 10 | ADAS11 | MOGL | mPACCdigit | Forgetting | MOrGL | POrGR |

C2 = LR-L1 (Classifier 2); C4 = SVM-L1 (Classifier 4). AccmR = Right Accumbens Area; AmyL = Left Amygdala; HippoL = Left Hippocampus; InfR = Right Inf Lat Vent; AOrGL = Left anterior orbital gyrus; AnGR = Right angular gyrus; LOrGL = Left lateral orbital gyrus; MOGL = Left middle occipital gyrus; MOrGL = Left medial orbital gyrus; MTGR = Right middle temporal gyrus; PCgGL = Left posterior cingulate gyrus; POR = Right parietal operculum; POrGR = Right posterior orbital gyrus; PrGR = Right precentral gyrus; PTR = Right planum temporale.

2.5.1 Comparison with different modalities

We compared the performance of each classifier (1-4) on the five feature sets (Table 2.4) based on estimates of AUC, sensitivity, and specificity. As shown in Table 2.6, LR with L1 regularization (Classifier 2) achieves a high AUC of 81.2% and sensitivity of 81.4% on the single-modality data (CCA), considerably better than the performance of LR on the other four modalities. Similarly, the best AUC and sensitivity achieved by SVM are 81.4% and 81.6%, based on the combination of CCA and SVM-L1. Furthermore, the highest accuracy achieved by both classifiers without regularization was on the single-modality data (CCA), indicating that both classifiers perform best on single-modality data.

2.5.2 Comparison of Pre-selection and the L1 Norm

We found that using prior knowledge to inform feature selection improves model performance and protects against over-fitting. As shown in Table 2.6, model performance (i.e., AUC) on ROI-P (64.3%) and CCAR-P (76.3%) outperformed ROI-NP (60.6%) and CCAR-NP (60.1%). The performance of Classifier 2 on the ROI-NP-L1 and CCAR-NP-L1 data sets yielded AUC scores of 64.1% and 64.0%, while ROI-P-L1 and CCAR-P-L1 had respective AUC scores of 64.3% and 77.9%; this suggests that user-guided pre-selection significantly improved model performance over the L1 norm alone. In addition, the SVM classifiers (3 and 4) had results comparable to the LR classifiers. First, as with the LR models, the observed AUC estimates for CCAR-P and ROI-P (69.2% and 64.1%, respectively) were superior to the AUCs from the CCAR-NP (59.1%) and ROI-NP (61.4%) analyses. Classifier 4 exhibited performance on CCAR-P-L1 similar to Classifier 2, with an AUC of 79.6%, higher than the model for CCAR-NP-L1 (74.0%). Therefore, manually selecting features improves model performance whether or not the L1 norm is applied. Second, these results show that pre-selection is necessary and important, because both the LR and SVM models on CCAR-P-L1, with respective AUC estimates of 77.9% and 78.5%, exhibited superior performance over the models without such pre-selection (i.e., LR and SVM on CCAR-NP-L1 had AUC estimates of 64.0% and 74.0%, respectively).

2.5.3 Comparison of Groups One and Two

In addition to the results from the models of Group One (i.e., MCI-to-AD conversion over 36 months), we also evaluated performance for Group Two (i.e., MCI-to-AD conversion over 24 months) in an effort to gain further insight into the possible benefits of shorter or longer assessment periods for classifying the progression of MCI to dementia. Table 2.7 summarizes the predictive performance of LR and SVM for Group Two. We again evaluated classifier performance on single- and multi-modality feature sets. The best result is obtained by the SVM-L1 model (Classifier 4) on CCAR-P, with a corresponding AUC of 76.2%, specificity of 60.1%, and sensitivity of 79.2%, which again verifies the assumption that manual feature selection improves model performance.
However, it warrants mention that all classifiers performed better on the Group One data than on the corresponding data sets in Group Two. For example, Classifier 2 of Group One on CCA achieved AUC and Sn values of 81.2% and 83.1%, considerably better than the same classifier of Group Two on CCA (76.3% and 79.8%). Similarly, Classifier 3 on ROI-NP had an AUC of 61.4% for Group One and 56.6% for Group Two. The experimental results indicate superior model performance on data obtained with longer rather than shorter follow-up periods. Given the uncertainty in conversion, a longer time window for assessing cognitive and functional change clearly yields more accurate classification.

2.5.4 Comparison of LR and SVM

In addition to comparing classification between different time windows of assessment, we also compared performance differences between LR and SVM. The results, including the models' ability to address the over-fitting problem with different modalities, are displayed in Tables 2.6 and 2.7 and Figures 2.4 and 2.5. First, it is worth noting that neither LR nor SVM works well if no L1 penalization is used, since Classifiers 2 and 4 outperform Classifiers 1 and 3 on the same data sets. Second, SVM performs better on MRI data when the L1 feature selection method is employed. Third, it was possible to obtain good performance accuracy with LR, which had model performance equivalent to SVM for "large p" data (ROI-P), as evidenced by respective AUC estimates for Classifiers 1 and 3 of 64.3% and 64.1%. Finally, as shown in Figures 2.4 and 2.5, the SVM method is more stable and robust than LR to a large number of features when n is small. To summarize, the best performance for Group One was achieved by Classifier 4 (SVM with the L1 norm) when using multi-modal data (CCAR-L1), with an AUC of 81.4%.
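The repeated-split protocol underlying Tables 2.6 and 2.7 (random partitions into 212 training and 53 test cases, with each metric summarized as mean ± standard deviation over 20 replicates) can be sketched as follows. The tiny gradient-descent logistic classifier and the synthetic data stand in for the real models and the ADNI sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.5, iters=300):
    """Plain gradient descent on the cross-entropy cost (stand-in classifier)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta -= lr * X.T @ (sigmoid(X @ beta) - y) / len(y)
    return beta

def repeated_splits(X, y, n_rep=20, seed=0):
    """Random 80/20 train/test partitions; per-split test accuracy is
    summarized as mean and standard deviation, as in Tables 2.6 and 2.7."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_rep):
        idx = rng.permutation(len(y))
        cut = int(0.8 * len(y))          # 212 of 265 cases for training
        tr, te = idx[:cut], idx[cut:]
        beta = fit_lr(X[tr], y[tr])
        pred = (sigmoid(X[te] @ beta) >= 0.5).astype(float)
        accs.append(np.mean(pred == y[te]))
    return np.mean(accs), np.std(accs)

rng = np.random.default_rng(6)
X = rng.normal(size=(265, 4))
y = (rng.uniform(size=265) < sigmoid(X @ np.array([2.0, -1.0, 0.0, 0.0]))).astype(float)
mean_acc, sd_acc = repeated_splits(X, y)
```

Reporting the spread across partitions, not just a single split, is what guards against optimistically biased performance estimates.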
Table 2.6: LR and SVM performance of Group One (Time = 3 years) for models on single- and multi-modal feature sets

| # Features (2) | Modality | LR: Test Acc % | LR: AUC % | LR: Sp % | LR: Sn % | SVM: Test Acc % | SVM: AUC % | SVM: Sp % | SVM: Sn % | # Features (1) |
| 19 | CCA | 74.3 ± 6.0 | 80.8 ± 7.0 | 62.3 ± 12.1 | 81.5 ± 6.2 | 72.4 ± 6.9 | 80.0 ± 7.3 | 53.6 ± 13.2 | 79.4 ± 7.7 | 19 |
| 259 | ROI-NP | 58.1 ± 7.0 | 60.6 ± 8.1 | 45.5 ± 13.4 | 65.3 ± 7.9 | 59.5 ± 7.3 | 61.4 ± 8.5 | 46.5 ± 11.9 | 67.3 ± 8.5 | 259 |
| 26 | ROI-P | 64.4 ± 6.5 | 64.3 ± 6.6 | 46.1 ± 10.4 | 75.0 ± 9.6 | 62.1 ± 5.9 | 64.1 ± 6.2 | 43.6 ± 9.5 | 78.4 ± 10.4 | 26 |
| 278 | CCAR-NP | 57.6 ± 7.2 | 60.1 ± 8.1 | 44.8 ± 12.9 | 65.1 ± 9.0 | 57.8 ± 6.8 | 59.1 ± 7.0 | 45.9 ± 10.4 | 65.1 ± 7.5 | 278 |
| 45 | CCAR-P | 72.7 ± 6.4 | 76.3 ± 6.5 | 60.5 ± 10.4 | 80.4 ± 8.2 | 66.9 ± 6.0 | 69.2 ± 6.4 | 53.6 ± 13.2 | 74.4 ± 10.5 | 45 |
| 3 | CCA-L1 | 74.9 ± 6.4 | 81.2 ± 6.7 | 61.3 ± 12.0 | 83.1 ± 6.6 | 74.2 ± 6.0 | 81.4 ± 6.9 | 61.6 ± 11.5 | 81.6 ± 5.9 | 4 |
| 27 | ROI-NP-L1 | 62.2 ± 6.6 | 64.1 ± 7.9 | 53.1 ± 13.1 | 68.1 ± 7.2 | 62.7 ± 5.8 | 67.0 ± 6.7 | 53.7 ± 11.6 | 67.7 ± 7.4 | 29 |
| 17 | ROI-P-L1 | 64.4 ± 6.5 | 64.3 ± 6.2 | 46.2 ± 11.0 | 74.9 ± 9.6 | 64.4 ± 5.7 | 64.7 ± 5.8 | 46.7 ± 11.1 | 75.4 ± 8.3 | 5 |
| 27 | CCAR-NP-L1 | 62.6 ± 7.2 | 64.0 ± 8.2 | 51.8 ± 12.7 | 69.5 ± 7.3 | 67.4 ± 6.4 | 74.0 ± 7.4 | 55.7 ± 12.1 | 74.1 ± 7.1 | 18 |
| 25 | CCAR-P-L1 | 73.1 ± 6.5 | 77.9 ± 5.9 | 61.6 ± 10.9 | 79.6 ± 7.7 | 73.5 ± 6.2 | 78.5 ± 6.4 | 61.6 ± 9.3 | 80.8 ± 7.5 | 14 |

Predictive performance of LR (Classifiers 1 and 2) and SVM (Classifiers 3 and 4), shown as mean ± standard deviation for all models. Performance estimates include testing accuracy (Test Acc %), area under the curve (AUC), specificity (Sp), and sensitivity (Sn). The number (#) of features was determined via (1): Classifier 2; (2): Classifier 4.
Table 2.7: LR and SVM performance of Group Two (Time = 2 years) for models on single- and multi-modal feature sets

| # Features (2) | Modality | LR: Test Acc % | LR: AUC % | LR: Sp % | LR: Sn % | SVM: Test Acc % | SVM: AUC % | SVM: Sp % | SVM: Sn % | # Features (1) |
| 19 | CCA | 69.9 ± 5.3 | 76.2 ± 5.5 | 56.7 ± 9.0 | 79.3 ± 7.3 | 69.4 ± 5.4 | 75.4 ± 5.5 | 56.7 ± 8.8 | 78.5 ± 7.1 | 19 |
| 259 | ROI-NP | 58.1 ± 4.2 | 58.8 ± 5.6 | 49.7 ± 7.1 | 64.4 ± 5.9 | 57.8 ± 5.0 | 56.6 ± 6.4 | 50.3 ± 7.1 | 62.9 ± 7.5 | 259 |
| 25 | ROI-P | 63.4 ± 4.7 | 65.8 ± 4.3 | 43.7 ± 10.2 | 77.8 ± 8.6 | 64.5 ± 4.7 | 66.2 ± 5.0 | 44.5 ± 8.5 | 79.1 ± 9.1 | 25 |
| 278 | CCAR-NP | 57.3 ± 4.0 | 58.8 ± 5.4 | 47.5 ± 8.3 | 64.3 ± 5.8 | 56.6 ± 5.5 | 56.4 ± 5.2 | 48.9 ± 7.9 | 62.3 ± 10.4 | 278 |
| 45 | CCAR-P | 70.2 ± 5.4 | 74.0 ± 5.0 | 56.7 ± 9.5 | 80.6 ± 7.0 | 69.5 ± 4.9 | 72.0 ± 5.3 | 58.1 ± 8.1 | 78.0 ± 8.2 | 45 |
| 4 | CCA-L1 | 70.1 ± 4.8 | 76.3 ± 5.3 | 56.8 ± 9.9 | 79.8 ± 7.6 | 70.4 ± 4.9 | 76.4 ± 7.7 | 56.8 ± 9.8 | 79.4 ± 7.7 | 4 |
| 31 | ROI-NP-L1 | 62.2 ± 6.0 | 64.7 ± 6.0 | 48.9 ± 9.2 | 72.0 ± 6.8 | 60.8 ± 4.5 | 65.9 ± 6.1 | 53.6 ± 7.5 | 64.3 ± 7.9 | 29 |
| 14 | ROI-P-L1 | 64.1 ± 4.6 | 66.8 ± 3.8 | 42.8 ± 11.3 | 79.8 ± 8.4 | 65.4 ± 4.0 | 67.8 ± 3.9 | 46.3 ± 9.4 | 81.1 ± 7.2 | 6 |
| 32 | CCAR-NP-L1 | 62.6 ± 6.3 | 64.8 ± 6.0 | 49.1 ± 9.1 | 72.1 ± 6.1 | 64.5 ± 5.1 | 71.7 ± 4.8 | 55.4 ± 7.8 | 71.4 ± 8.9 | 26 |
| 27 | CCAR-P-L1 | 70.0 ± 5.5 | 74.3 ± 5.5 | 57.8 ± 8.0 | 78.3 ± 8.8 | 71.3 ± 4.9 | 76.2 ± 4.7 | 60.1 ± 7.1 | 79.2 ± 8.5 | 14 |

For each modality, the predictive performance of LR and SVM is shown (mean ± standard deviation), including testing accuracy, AUC, specificity (Sp), and sensitivity (Sn). The number (#) of features was determined via (1): Classifier 2; (2): Classifier 4.

2.6 Discussion and Conclusion

In this thesis, we applied two machine learning methods under multiple conditions to test accuracy in classifying patients with MCI who progress to clinically defined dementia (MCI-C) versus those who remain stable (MCI-S).
Using multi-modal data from ADNI, we compared LR and SVM classification accuracy under two dimension-reduction strategies: feature selection informed by prior findings in clinical neuroscience (pre-selection) and feature selection by the L1 norm. Notably, the present results demonstrate important boundaries for applying feature selection techniques in statistical classification of MCI-to-dementia conversion. Specifically, we found that while using the L1 norm for feature selection can improve accuracy, it benefits further from a more limited, theoretically based set of feature inputs. In addition, we found that model performance benefited from a longer window of assessment. These results have implications for studies utilizing multi-modal data for such classification, including features from clinical neuropsychological assessment, demographic and genetic markers, MRI-based volumetric brain measures, and other modalities. Comparison of user-defined and L1 pre-selection for the LR and SVM classifiers yielded multiple noteworthy findings, consistent with previously published reports [147, 148, 24, 29, 145, 56, 144, 68]. First, the classification results showed that the model using multi-modal data with cognitive, clinical, and volumetric features (CCAR) achieved better classification accuracy than the methods based on a single modality (CCA, ROI). Moreover, the AUC of CCAR based on LR or SVM was statistically significantly, or at least numerically, greater than those based on the single-modality models. Based on AUC, the highest accuracy was observed for the CCAR data: 78.5% by L1-SVM and 77.9% by L1-LR. Second, SVM demonstrated several advantages over LR in discriminating MCI-C from MCI-S (Fig. 2.4). For one, SVM performance tended to be more stable than LR when the number of features was relatively large. In other words, the model performance of SVM on ROI data remained more stable than that of LR when using larger numbers of features without user-defined pre-selection.
In particular, SVM performance on ROI data improved as the number of features increased from 20 to 30, and the AUC values then remained fairly static as further features were added, whereas LR model performance decreased gradually after the number of ROI features reached 40. Third, the classification results clearly demonstrate that manually selecting features for the MRI data not only improved model performance and protected the classifier from overfitting, but also affords easier interpretation of each selected feature's contribution to the model. In addition, we show that pre-selection improves performance: Tables 2.6 and 2.7 suggest it is the best strategy for obtaining maximum model performance, compared to feature selection based on the L1 norm. The present findings can also be interpreted in the context of other reports over the past decade that investigated the prognostic capacity of brain volumetry data to predict the conversion of MCI to dementia, using either SVM or LR, and that combined volumetry data with other imaging and biomarker modalities, from functional MRI (fMRI) and positron emission tomography (PET) to cerebrospinal fluid (CSF) protein markers [147, 148, 24, 29, 145, 56, 144, 68, 78, 130, 69]. In addition, one can vary the degree of non-linearity and flexibility in the model by employing different kernel functions. For example, Young et al. (2013) [145] report results from both SVM and Gaussian process (GP) classification of MCI progression in ADNI data using MRI, PET, APOE4, and CSF biomarkers. In contrast to the present study and other published work that used the MCI-C and MCI-S groups as training and test data sets, they trained a classifier to distinguish cognitively normal older adults from those diagnosed as probable AD. They reported that the accuracy using GP, an AUC value of 79.5%, was substantially higher than that obtained using any individual modality or multi-kernel SVM.
Other studies of MCI-to-dementia classification reporting high accuracy have implemented approaches such as multiple kernel learning (pMKL) classification using clinical, MRI, and plasma biomarker data. One method using this approach first grouped the data set into five different data sources to identify the important features and then applied a filter-wrapper approach to feature selection in combination with the Joint Mutual Information (JMI) criterion, achieving an AUC of 82% [68]. We also found consistently superior classification performance for patients classified under a longer window of assessment. MCI-to-dementia conversion is a process that can take several years to reliably track an individual from the onset of amnestic MCI to early-stage dementia [145, 92, 75]. For the modeled features to be useful for classification, the classes must be well defined, if not orthogonal. However, MCI is not inherently prodromal to dementia: a large proportion of individuals with MCI never progress, either reverting to cognitively normal status or remaining rather stable. Furthermore, others may show early evidence of brain atrophy that precedes cognitive impairment by years. To account for this variable timing, others have employed methods such as supervised learning using time windows [102]; however, even those methods benefit strongly from longer follow-up periods. Thus, MCI is an inherently heterogeneous and poorly defined class, particularly in terms of the relationships between brain characteristics and the likelihood and timing of further cognitive decline. Most computational neuroimaging studies in the past few years have utilized multi-modal features [24, 31, 82, 59, 28, 114, 89, 90, 133, 136]. For example, when Ding et al. applied SVM with PET and MRI data to classify the transition from MCI to AD, they reported a sensitivity of 66.67% and a specificity of 64.52% [31].
In addition to PET and structural MRI data, CSF protein markers can be used to predict progression from MCI to AD, alongside proteomic, demographic, and cognitive data [28, 113, 21]. Applying LR with the L1 norm to CSF markers for classifying individual patients as belonging to either the MCI-C or the MCI-S group, one study reported a sensitivity of 80% and a specificity of 75% [82]. Furthermore, Varatharajah and colleagues (2020) showed that linear classifiers, including multiple kernel learning (MKL) with linear kernels, SVM with a linear kernel, and the generalized linear model (GLM), outperform other advanced classification methods in predicting the transition from MCI to AD [130]. In general, LR works well when the data are linearly separable and the number of observations is greater than the number of features, whereas SVM with a Gaussian kernel is mostly used when the data are not linearly separable. In addition to LR and SVM, deep neural network approaches also offer benefits [78, 119], but they have not seen the extent of application to ADNI data that SVM and LR have. Using LR, an artificial neural network (ANN) model, and a decision tree (DT) model to classify the progression of MCI to AD, Kuang (2021) reported that the ANN exhibited the highest sensitivity, at 82.1% [69]. In conclusion, models applying prior knowledge for classification and prediction of MCI-to-dementia conversion outperform those without pre-selection. This theoretically guided pre-selection of features from MRI-based regional brain volumes appears to protect the model against over-fitting. In addition, the present findings demonstrate that SVM classifier performance is more stable than LR for dealing with the "large p" problem.
Clinical researchers should note the value of evaluating different classification and pre-selection approaches in application to clinical or research questions, and be mindful that not all machine learning techniques are equally beneficial for modeling specific clinical outcomes.

Figure 2.4: Model performance on the ROI feature set by number of features for LR and SVM. Panel (a) shows dramatic growth in AUC with LR as the number of features increases from 1 to 30, becoming more static at approximately 74% as the number of features increases from 30 to 40, then dropping significantly when the number of features reaches 41. Panel (b) shows that the AUC increased dramatically as the number of features grew from 1 to 28 but fluctuated after 29. The optimal numbers of ROI features for the two methods are 29 and 28, with corresponding optimized AUCs of approximately 74.0% and 78.0%.

Figure 2.5: Model performance on the CCA feature set by number of features for LR and SVM. Panel (a) shows a significant increase in the AUC with LR as the number of features increases from 1 to 5, then a slight decrease in testing accuracy when the number of features exceeds 5. Panel (b) shows the AUC rising dramatically as the number of features increases from 1 to 4. The optimal numbers of CCA features obtained by LR and SVM are 5 and 4, with corresponding optimized AUCs of approximately 84.0% and 83.0%.

CHAPTER 3

CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS

3.1 Introduction

Bayesian deep neural network (BDNN) models are ubiquitous in classification problems; however, their Markov chain Monte Carlo (MCMC) based implementation suffers from high computational cost, limiting the use of this powerful technique in large-scale studies. Variational Bayes (VB) has emerged as a competitive alternative to overcome some of these computational issues.
This thesis focuses on the variational Bayesian deep neural network (VBDNN) estimation methodology and discusses the related statistical theory and algorithmic implementations in the context of classification. For a deep neural network-based classification, the thesis compares and contrasts the consistency and contraction rates of the true posterior and the corresponding variational posterior. Based on the complexity of the deep neural network (DNN), this thesis provides an assessment of the loss in classification accuracy due to the use of VB and guidelines on the characterization of the prior distributions and the variational family. The difficulty of the numerical optimization for obtaining the variational Bayes solution has also been quantified as a function of the complexity of the DNN. The development is motivated by an important biomedical engineering application, namely building predictive tools for the transition from mild cognitive impairment to Alzheimer's disease. The predictors are multi-modal and may involve complex interactive relations.

3.2 The Neural Network Classifier and Likelihoods

Let Y be a binary random variable taking values 0 or 1, representing the class labels, and let X \in \mathbb{R}^p be a feature vector drawn from a feature space with marginal distribution P_X. We consider the following binary classification problem:

    P(Y = 1 \mid X = x) = \sigma(\eta_0(x)), \qquad P(Y = 0 \mid X = x) = 1 - \sigma(\eta_0(x))    (3.1)

where \eta_0(\cdot): \mathbb{R}^p \to \mathbb{R} is some continuous function and \sigma(t) = e^t/(1 + e^t) is the sigmoid function. Thus P_{X,Y}, the joint distribution of (X, Y), is the product of the conditional distribution in (3.1) and the marginal distribution P_X. Borrowing notation from [14] and [141], a classifier C is a Borel measurable function C: \mathbb{R}^p \to \{0, 1\}, with the interpretation that we assign a point x \in \mathbb{R}^p to class C(x). The test error of a classifier C is given by

    R(C) = \int_{\mathbb{R}^p \times \{0,1\}} \mathbf{1}_{\{C(X) \neq Y\}} \, dP_{X,Y}    (3.2)

Based on (3.1), we define the Bayes classifier as

    C^{\mathrm{Bayes}}(x) = \begin{cases} 1, & \sigma(\eta_0(x)) \geq 1/2 \\ 0, & \text{otherwise} \end{cases}    (3.3)

The Bayes classifier is optimal [46] since it minimizes the misclassification risk in (3.2). However, the Bayes classifier is not useful in practice, since the function \eta_0(x) is unknown. Thus, a classifier is obtained from a set of training observations \{(x_1, y_1), \ldots, (x_n, y_n)\} drawn from P_{X,Y}. A good classifier based on the sample should have risk tending to the Bayes risk as the number of observations tends to infinity, without any requirement on the underlying probability distribution; this is so-called universal consistency. Multiple methods have been adopted to estimate \eta_0(x), including logistic regression (a linear approximation), generalized additive models (GAM, a nonparametric nonlinear approximation), and deep neural networks (a complicated structure that is dense in the space of continuous functions). The first two methods usually work well in practice and have a good theoretical foundation; however, they may fail to capture the complicated dependency on the feature vector x in a wide range of applications, including the problem considered in this thesis. On the other hand, the neural network structure, which can exploit the dependency implicitly without any specific parametric structure, has relatively few theoretical works establishing its statistical efficacy in Bayesian models. In this thesis, we therefore focus our attention on classification using deep neural networks.

Consider a single-layer neural network model with p predictor variables. The layer has k_n nodes, where k_n may be a diverging sequence depending on n. The validity of neural network approximations rests on the universal approximation results [25], which state that a single-layer neural network is able to approximate any continuous function with a small approximation error when k_n is large.
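The plug-in rule of Eq. (3.3) thresholds \sigma(\eta_0(x)) at 1/2, which is equivalent to thresholding \eta_0(x) at 0, since \sigma is monotone with \sigma(0) = 1/2. A minimal sketch follows; the particular \eta_0 below is a hypothetical choice for illustration only, since the true \eta_0 is unknown in practice.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bayes_classifier(X, eta0):
    """Eq. (3.3): predict 1 iff sigma(eta0(x)) >= 1/2, i.e. iff eta0(x) >= 0."""
    return (sigmoid(eta0(X)) >= 0.5).astype(int)

# A hypothetical true eta_0 (nonlinear in both coordinates), for illustration.
eta0 = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2 - 0.5

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))
labels = bayes_classifier(X, eta0)
```

The equivalence of the two thresholds is what lets estimation focus on \eta_0 itself: any estimator whose sign agrees with \eta_0 reproduces the Bayes rule.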
Assume a Fourier representation of [0 (x) of the form π 48! x ˜ (3!) ) 5 (x) = R? Ø and denote ⌫,⇠ = { 5 (·) : ⌫ k!k 2 | 5˜|(3!) < ⇠} for some bounded subset ⌫ of R ? containing zero for some constant ⇠ > 0. Then, for all functions [0 2 ⌫,⇠ , there exist a single layer neural p network output [(x) such that k[ [0 k 2 = $ (1/ : = ) [5]. This result ensures good approximation property of single layer neural network, and the convergence rate depends only on the number of nodes under mild conditions on [0 (x). [77] proved that as long as the activation function is not algebraic polynomials, the single layer neural network is dense in the continuous function space, thus can be used to approximate any given continuous function. We use ✓= to index the set of all the parameters. For ? = ⇥ 1 input vector x, consider a deep neural network with ! = hidden layers and : 1= , · · · , : ! = = being the number of nodes in the hidden layers. Let : 0= = ? = + 1 and : (! = +1)= = 1. It can be checked that the total number of parameters is Œ! = = = E=0 : (E+1)= (: E= + 1) due to the formulation below. [✓= (x) = b ! + A ! k(b ! 1 + A ! 1 k(b ! 2 + A ! 2 k(· · · k(b1 + A1 k(b0 + A0 x))) bE , E = 0, · · · , ! are vectors of dimension : E+1 ⇥ 1 and AE , E = 1, · · · , ! 1 are matrices each of dimension : E+1 ⇥ : E . We have suppressed the dependence on = for notation simplicity. For the purposes of this thesis, we use the activation function to be the sigmoid function, k(G) = 4 G /(1 + 4 G ), although the theoretical results are valid to a wider class of activation functions such as tan-hyperbolic, Gaussian etc.. Thus, using the neural network in (3.4) as an approximation to the true function [0 (x) in (4.1), the conditional probabilities of . given - = x is given by %(. = 1|- = x) = f([✓= (x)), %(. = 0|- = x) = 1 f([✓= (x)) (3.4) Assuming Bernoulli distribution, the conditional density of . 
given $X = \mathbf{x}$ under the model is
\[
\ell_{\boldsymbol{\theta}_n}(y, \mathbf{x}) = \exp\left( y\eta_{\boldsymbol{\theta}_n}(\mathbf{x}) - \log\left(1 + e^{\eta_{\boldsymbol{\theta}_n}(\mathbf{x})}\right) \right). \tag{3.5}
\]
Thus, the likelihood function for the data $(\mathbf{y}_n, \mathbf{X}_n) = (y_i, \mathbf{x}_i)_{i=1}^n$ under the model is
\[
L(\boldsymbol{\theta}_n) = \prod_{i=1}^n \ell_{\boldsymbol{\theta}_n}(y_i, \mathbf{x}_i) = \exp\left( \sum_{i=1}^n \left[ y_i\eta_{\boldsymbol{\theta}_n}(\mathbf{x}_i) - \log\left(1 + e^{\eta_{\boldsymbol{\theta}_n}(\mathbf{x}_i)}\right) \right] \right). \tag{3.6}
\]
In view of (4.1), the conditional density of $Y|X = \mathbf{x}$ under the truth is
\[
\ell_0(y, \mathbf{x}) = \exp\left( y\eta_0(\mathbf{x}) - \log\left(1 + e^{\eta_0(\mathbf{x})}\right) \right). \tag{3.7}
\]
Therefore, the likelihood function for the data under the truth is given by
\[
L_0 = \prod_{i=1}^n \ell_0(y_i, \mathbf{x}_i) = \exp\left( \sum_{i=1}^n \left[ y_i\eta_0(\mathbf{x}_i) - \log\left(1 + e^{\eta_0(\mathbf{x}_i)}\right) \right] \right). \tag{3.8}
\]

3.3 Bayesian Inference with Variational Algorithm

3.3.1 Prior Choice

For Bayesian analysis, prior distributions have to be assigned to all parameters defining the model. Although one may have prior knowledge concerning the function represented by a neural network, it is generally difficult to translate this into a meaningful prior on the neural network weights. We assume an independent normal prior as follows:
\[
p(\boldsymbol{\theta}_n) = \prod_{j=1}^{K_n} \frac{1}{\sqrt{2\pi\sigma_{jn}^2}} e^{-\frac{(\theta_{jn} - \mu_{jn})^2}{2\sigma_{jn}^2}}. \tag{3.9}
\]
(A1) For $\boldsymbol{\Sigma}_n = [\sigma_{1n}, \cdots, \sigma_{K_n n}]$ and $\boldsymbol{\Lambda}_n = [1/\sigma_{1n}, \cdots, 1/\sigma_{K_n n}]$, assume $\log \|\boldsymbol{\Sigma}_n\|_\infty = O(\log n)$ and $\|\boldsymbol{\Lambda}_n\|_\infty = O(1)$, where $\|\cdot\|_\infty$ is the supremum norm of a vector as in definition A.0.1 in appendix B.

Note, the above assumption ensures that the variance associated with each $\theta_{jn}$ does not grow at an arbitrarily large rate, in which case the consistency of both the Bayesian and the variational Bayes approach would break down. Restrictions on the mean parameter $\boldsymbol{\mu}_n = [\mu_{1n}, \cdots, \mu_{K_n n}]$ directly impact the consistency rate and are more case specific (see section 3.4 for a thorough discussion). The reason for choosing the above form of prior is twofold: (1) it guarantees that the true posterior distribution is consistent; (2) it guarantees, under a suitable choice of the variational family, that the approximated variational posterior is also consistent. The choice of prior in (3.9) is not unique.
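A direct implementation of the log of (3.6) can overflow for large $|\eta|$; a numerically stable sketch uses the identity $\log(1 + e^{\eta}) = \operatorname{logaddexp}(0, \eta)$. The helper name is hypothetical.

```python
import numpy as np

def log_lik(eta, y):
    """Bernoulli log-likelihood sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ],
    with logaddexp used to avoid overflow for large |eta_i|."""
    eta = np.asarray(eta, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

# Sanity check: at eta = 0, each observation contributes -log 2 regardless of y.
ll = log_lik([0.0, 0.0], [0, 1])
assert np.isclose(ll, -2 * np.log(2))
```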
Indeed, one can work with a much more generic class of priors such that (1) and (2) hold. Note, each prior comes with its own associated computational complexity, implementation cost and theoretical justification; we choose one which does a fairly good job under all three criteria. In view of (3.6) and (3.9), the posterior distribution of $\boldsymbol{\theta}_n$ given $\mathbf{y}_n = [y_1, \cdots, y_n]^{\top}$ and $\mathbf{X}_n = [\mathbf{x}_1, \cdots, \mathbf{x}_n]^{\top}$ is
\[
\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n) = \frac{\pi(\boldsymbol{\theta}_n, \mathbf{y}_n, \mathbf{X}_n)}{\pi(\mathbf{y}_n, \mathbf{X}_n)} = \frac{L(\boldsymbol{\theta}_n)p(\boldsymbol{\theta}_n)}{\int L(\boldsymbol{\theta}_n)p(\boldsymbol{\theta}_n)d\boldsymbol{\theta}_n} \tag{3.10}
\]
where $\pi(\mathbf{y}_n, \mathbf{X}_n)$ is free of the parameter and depends only on $\mathbf{y}_n$ and $\mathbf{X}_n$.

3.3.2 Variational Inference

As a first step of the variational inference (VI) procedure, one has to choose a variational family. Among several options, we work with one which is simple, computationally and structurally tractable, and, more importantly, provides statistically consistent posterior estimation. We posit a mean field Gaussian variational family of the form
\[
\mathcal{Q}_n = \left\{ q(\boldsymbol{\theta}_n) : q(\boldsymbol{\theta}_n) = \prod_{j=1}^{K_n} \frac{1}{\sqrt{2\pi s_{jn}^2}} e^{-\frac{(\theta_{jn} - m_{jn})^2}{2s_{jn}^2}} \right\}. \tag{3.11}
\]
Note that the variational family assumes each $\theta_{jn}$ is independent with mean and standard deviation equal to $m_{jn}$ and $s_{jn}$ respectively. The variational posterior aims to minimize the KL-distance between the variational family and the true posterior [9, 41, 11]. For the true posterior $\pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)$ in (4.9), the variational posterior is
\[
\pi^* = \underset{q \in \mathcal{Q}_n}{\operatorname{argmin}} \; d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) \tag{3.12}
\]
where $d_{\mathrm{KL}}$, the Kullback-Leibler (KL) divergence between a variational family member $q(\boldsymbol{\theta}_n)$ and the true posterior $\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)$, is given by
\[
d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) = \int \log\left( q(\boldsymbol{\theta}_n)/\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n) \right) q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n. \tag{3.13}
\]
Based on (4.9), simplifying further, we get
\[
d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) = \int [\log q(\boldsymbol{\theta}_n) - \log \pi(\boldsymbol{\theta}_n, \mathbf{y}_n, \mathbf{X}_n)] q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n + \log \pi(\mathbf{y}_n, \mathbf{X}_n) = -\mathrm{ELBO}(q, \pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)) + \log \pi(\mathbf{y}_n, \mathbf{X}_n). \tag{3.14}
\]
Since the last term in (3.14) does not depend on $q$, optimizing (3.14) w.r.t. $q$ boils down to optimizing the first term.
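For the Gaussian prior (3.9) and the mean field family (3.11), the KL-distance between $q$ and the prior (a term that recurs in the ELBO computations of this chapter, with exact expressions deferred to appendix A) has the standard closed form $\sum_j \big[\log(\sigma_{jn}/s_{jn}) + (s_{jn}^2 + (m_{jn}-\mu_{jn})^2)/(2\sigma_{jn}^2) - 1/2\big]$. A sketch:

```python
import numpy as np

def kl_mean_field_gauss(m, s, mu, sigma):
    """d_KL(q, p) for q = prod_j N(m_j, s_j^2) and p = prod_j N(mu_j, sigma_j^2):
    sum_j [ log(sigma_j/s_j) + (s_j^2 + (m_j - mu_j)^2) / (2 sigma_j^2) - 1/2 ]."""
    m, s, mu, sigma = map(np.asarray, (m, s, mu, sigma))
    return float(np.sum(np.log(sigma / s) + (s**2 + (m - mu)**2) / (2 * sigma**2) - 0.5))

# The KL vanishes exactly when q coincides with the prior.
assert np.isclose(kl_mean_field_gauss([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]), 0.0)
```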
Indeed, the first term is nothing but the negative of the evidence lower bound (ELBO). Thus, in order to minimize the KL-distance, we shall instead maximize the ELBO between $q$ and $\pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)$. Alternatively, we define $\pi^*$ as
\[
\pi^* = \underset{q \in \mathcal{Q}_n}{\operatorname{argmax}} \; \mathrm{ELBO}(q, \pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)). \tag{3.15}
\]
To maximize the ELBO in (3.14), let $\boldsymbol{\beta}_q = (m_{1n}, \cdots, m_{K_n n}, s_{1n}^2, \cdots, s_{K_n n}^2)$, where $m_{jn}$ and $s_{jn}$ are the mean and standard deviation of $\theta_{jn}$ under the density $q$. Thus, each $q \in \mathcal{Q}_n$ is indexed by its parameters. Consequently,
\[
\begin{aligned}
\mathrm{ELBO}(q(\cdot|\boldsymbol{\beta}_q), \pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)) &= \int [\log \pi(\boldsymbol{\theta}_n, \mathbf{y}_n, \mathbf{X}_n) - \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)] q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n \\
&= \int \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n + \int [\log p(\boldsymbol{\theta}_n) - \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)] q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n \\
&= \int \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)) = \mathcal{L}_{\boldsymbol{\beta}_q} - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)).
\end{aligned} \tag{3.16}
\]
The derivative of $d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$ w.r.t. $\boldsymbol{\beta}_q$ has a closed form expression (see appendix A). The key challenge is the derivative of $\mathcal{L}_{\boldsymbol{\beta}_q}$ w.r.t. $\boldsymbol{\beta}_q$, which we discuss next:
\[
\begin{aligned}
\nabla_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q} &= \nabla_{\boldsymbol{\beta}_q} \int \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n = \int \log L(\boldsymbol{\theta}_n) \nabla_{\boldsymbol{\beta}_q} q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n \\
&= \int \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n = E_{q(\cdot|\boldsymbol{\beta}_q)}\left( \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n) \right)
\end{aligned} \tag{3.17}
\]
where the last equality holds since $\nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) \, q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) = \nabla_{\boldsymbol{\beta}_q} q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)$. The black-box variational inference (BBVI) algorithm [107] optimizes the ELBO by stochastic gradient ascent using a similar approach. The key challenge in evaluating the gradient in (3.17) is the computation of the expectation: exact computation leads to high computational complexity, whereas using noisy estimates leads to high variability. In section 3.3.3, we elucidate how to ensure fast and efficient estimation of the gradient.

3.3.3 Black Box Variational Algorithm using score function estimator

The gradient in (3.17) is difficult to evaluate for problems with complex likelihood structures arising out of deep network models.
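The score-function identity underlying (3.17) can be verified numerically on a toy example where the gradient is known in closed form; the choices of $q$ and of the target function below are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Check of the identity behind (3.17):
#   d/dm E_{q(.|m)}[f(theta)] = E_{q(.|m)}[ (d/dm log q(theta|m)) f(theta) ]
# with q = N(m, 1) and f(theta) = theta^2, so the exact gradient is 2m.
m = 0.7
W = 400_000
theta = rng.normal(m, 1.0, size=W)
score = theta - m                      # d/dm log N(theta | m, 1)
grad_mc = np.mean(score * theta**2)    # Monte Carlo estimate of the gradient
assert abs(grad_mc - 2 * m) < 0.05
```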
Alternatively, the above expectation is evaluated by sampling from the variational distribution and forming the corresponding Monte Carlo estimate of the gradient. We next explain the computation of the Monte Carlo estimate of the gradient in (3.17) using ideas similar to [107, 124]. Let $\boldsymbol{\beta}_q$ denote the current value of the variational parameters. We generate $W$ samples from the variational distribution $q(\cdot|\boldsymbol{\beta}_q)$ and define the noisy but unbiased estimate of $\nabla_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q}$ as
\[
\widetilde{\nabla}_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q} = \frac{1}{W} \sum_{w=1}^{W} \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n[w]) \tag{3.18}
\]
where $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ are samples generated from $q(\cdot|\boldsymbol{\beta}_q)$. Similarly, a noisy but unbiased estimate of $\mathcal{L}_{\boldsymbol{\beta}_q}$ is given by
\[
\widehat{\mathcal{L}}_{\boldsymbol{\beta}_q} = \frac{1}{W} \sum_{w=1}^{W} \log L(\boldsymbol{\theta}_n[w]). \tag{3.19}
\]
Algorithm 1 provides the pseudocode summarizing the overall algorithm for BBVI.

Algorithm 1 BBVI
1. Fix an initial value for the variational family parameters $\boldsymbol{\beta}_q^1$.
2. Fix a step size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$.
4. Simulate $W$ samples $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ from $q(\cdot|\boldsymbol{\beta}_q^t)$.
5. Compute $\widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t}$ as in (3.18).
6. Update
\[
\boldsymbol{\beta}_q^{t+1} = \boldsymbol{\beta}_q^t + \rho_t \left( \widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t} - \nabla_{\boldsymbol{\beta}_q^t} d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)) \right). \tag{3.20}
\]
7. Set $t = t + 1$.
8. Repeat steps 4-7 until the convergence of the ELBO, using $\widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t} - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$.

In the implementation of the above algorithm, one needs to compute $d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$, $\nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)$ and $\nabla_{\boldsymbol{\beta}_q} d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$ for the variational parameters $\boldsymbol{\beta}_q$. For the choice of $p$ and $q$ as in (3.9) and (3.11), the explicit expressions are presented in appendix A. For the variational parameters $(s_{1n}, \cdots, s_{K_n n})$, the updating rule in (3.20) may lead to negative estimates. One must guard against this since standard deviation terms cannot be negative. Thus, to perform the optimization, we reparametrize as $s_{jn} = \log(1 + e^{r_{jn}})$, $j = 1, \cdots, K_n$, and update the quantities $r_{jn}$ in each step instead of $s_{jn}$.
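Algorithm 1, together with the softplus reparametrization of $s_{jn}$, can be sketched end to end on a deliberately tiny model: a one-parameter logistic regression with a $N(0,1)$ prior, rather than a deep network. The data-generating value `theta_true`, the step size, and $W$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softplus = lambda z: np.log1p(np.exp(z))

# Toy data from a one-parameter logistic model.
n, theta_true = 300, 1.5
x = rng.normal(size=n)
y = rng.binomial(1, sigmoid(theta_true * x))

def log_lik(thetas):
    # log L(theta) of (3.6), evaluated for a batch of sampled parameters.
    eta = np.outer(thetas, x)
    return np.sum(y * eta - np.logaddexp(0.0, eta), axis=1)

# Mean-field q = N(m, s^2) with s = softplus(r); prior p = N(0, 1).
m, r, rho, W = 0.0, 0.0, 2e-4, 100
elbo = []
for t in range(1500):
    s = softplus(r)
    theta = rng.normal(m, s, size=W)
    logL = log_lik(theta)
    # Score-function estimates (3.18) of the gradient of E_q[log L].
    g_m = np.mean((theta - m) / s**2 * logL)
    g_s = np.mean(((theta - m) ** 2 - s**2) / s**3 * logL)
    # Closed-form d_KL(q, p) for the N(0, 1) prior, and its gradients m and s - 1/s.
    kl = -np.log(s) + (s**2 + m**2) / 2.0 - 0.5
    # ELBO ascent step (3.20); chain rule through s = softplus(r) for the r update.
    m += rho * (g_m - m)
    r += rho * (g_s - (s - 1.0 / s)) * sigmoid(r)
    elbo.append(np.mean(logL) - kl)

# The Monte Carlo ELBO should have improved substantially over the run.
assert np.mean(elbo[-100:]) > np.mean(elbo[:100])
```

The noisy trajectory of `elbo` here is exactly the high-variance behavior that motivates the control variates and adaptive learning rates of the next two subsections.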
By the chain rule, for any function $g(\boldsymbol{\beta}_q)$,
\[
\nabla_{r_{jn}} g(\boldsymbol{\beta}_q) = \left. \nabla_{s_{jn}} g(\boldsymbol{\beta}_q) \right|_{s_{jn} = \log(1 + e^{r_{jn}})} \frac{e^{r_{jn}}}{1 + e^{r_{jn}}}
\]
where the first factor is the derivative of $g(\boldsymbol{\beta}_q)$ w.r.t. $s_{jn}$ evaluated at $s_{jn} = \log(1 + e^{r_{jn}})$. The explicit expressions of the derivatives w.r.t. $r_{jn}$ are also provided in appendix A.

3.3.4 Control Variate: Stabilizing the stochastic gradient

We can use algorithm 1 to maximize the ELBO; however, a major drawback is that the noisy estimator of the gradient has high variance. There are two major techniques to reduce the variance of the gradients. One of them is "Rao-Blackwellization", where the idea is to replace the noisy estimate of the gradient with its conditional expectation w.r.t. a subset of the variables [107]. This method is useful when the posterior distribution is separable across subsets of variables or when dealing with latent variables. A convoluted likelihood as in (3.6) is not separable across the components of $\boldsymbol{\theta}_n$, and there are no latent variables in our model; we thereby refrain from using the Rao-Blackwellization approach.

Algorithm 2 BBVI-CV
1. Fix an initial value for the variational parameters $\boldsymbol{\beta}_q^1$.
2. Fix a step size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$.
4. Simulate $W$ samples $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ from $q(\cdot|\boldsymbol{\beta}_q^t)$.
5. Compute $\widehat{c}^*_t = \mathrm{cov}(\mathbf{u}_{1t}, \mathbf{u}_{2t})/\mathrm{var}(\mathbf{u}_{2t})$ where $\mathbf{u}_{1t}$ and $\mathbf{u}_{2t}$ are as in (3.22).
6. Compute $\widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t}$ as in (3.21).
7. Update
\[
\boldsymbol{\beta}_q^{t+1} = \boldsymbol{\beta}_q^t + \rho_t \left( \widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t} - \nabla_{\boldsymbol{\beta}_q^t} d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)) \right).
\]
8. Set $t = t + 1$.
9. Repeat steps 4-8 until the convergence of the ELBO, using $\widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t} - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$.

Another method which also gives an efficient technique for stabilizing the gradient is the control variate (CV) (see [110, 97, 107]). We use CVs to reduce the variance of the MC approximations of the gradients.
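The variance reduction achieved by the optimal coefficient $\widehat{c}^* = \mathrm{cov}(\mathbf{u}_1, \mathbf{u}_2)/\mathrm{var}(\mathbf{u}_2)$ can be illustrated on a toy score-function gradient; the target function and its large offset (mimicking the scale of a log-likelihood, which is what inflates the variance of plain BBVI) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Control-variate demo with q = N(m, 1) and target f(theta) = theta^2 - 30.
m, W, reps = 0.5, 100, 2000
plain, cv = [], []
for _ in range(reps):
    theta = rng.normal(m, 1.0, size=W)
    u2 = theta - m                      # score; has expectation zero under q
    u1 = u2 * (theta**2 - 30.0)         # plain estimator terms, as in (3.18)
    c = np.cov(u1, u2)[0, 1] / np.var(u2)
    plain.append(np.mean(u1))
    cv.append(np.mean(u2 * (theta**2 - 30.0 - c)))   # CV estimator, as in (3.21)

# The CV estimator is far less variable, and both target d/dm E[theta^2] = 2m.
assert np.var(cv) < np.var(plain)
assert abs(np.mean(cv) - 2 * m) < 0.05
```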
The key idea behind the variance reduction proposed in [110] is to replace the target function, whose expectation is being approximated by Monte Carlo, with an auxiliary function that has the same expectation but a smaller variance. To reduce the variance of a function $f(\boldsymbol{\phi})$, one instead considers the function
\[
\hat{f}(\boldsymbol{\phi}) = f(\boldsymbol{\phi}) - c\left( h(\boldsymbol{\phi}) - E_q(h(\boldsymbol{\phi})) \right)
\]
where $h(\boldsymbol{\phi})$ is a function with finite expectation and $c$ is a scalar. Such a choice ensures $E_q(\hat{f}(\boldsymbol{\phi})) = E_q(f(\boldsymbol{\phi}))$ and
\[
\mathrm{Var}_q(\hat{f}(\boldsymbol{\phi})) = \mathrm{Var}_q(f(\boldsymbol{\phi})) + c^2 \mathrm{Var}_q(h(\boldsymbol{\phi})) - 2c \, \mathrm{Cov}_q(f(\boldsymbol{\phi}), h(\boldsymbol{\phi}))
\]
which is minimized at $c^* = \mathrm{Cov}_q(f(\boldsymbol{\phi}), h(\boldsymbol{\phi}))/\mathrm{Var}_q(h(\boldsymbol{\phi}))$. Thus, the greater the correlation between $f$ and $h$, the greater the variance reduction. Similar to [107], we use $\nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}|\boldsymbol{\beta}_q)$ as the choice of $h(\boldsymbol{\phi})$. The stochastic approximation of the gradient in (3.17) is then modified as
\[
\widetilde{\nabla}_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q} = \frac{1}{W} \sum_{w=1}^{W} \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q) \left[ \log L(\boldsymbol{\theta}_n[w]) - \widehat{c}^* \right]. \tag{3.21}
\]
It is impossible to obtain an exact expression for $c^*$; one thus uses $\widehat{c}^* = \mathrm{cov}(\mathbf{u}_1, \mathbf{u}_2)/\mathrm{var}(\mathbf{u}_2)$, where
\[
\mathbf{u}_1[w] = \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n[w]), \qquad \mathbf{u}_2[w] = \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q). \tag{3.22}
\]
The extension of algorithm 1 with variance reduction of the MC approximations due to CV is annotated as BBVI-CV and summarized in algorithm 2. As in the implementation of algorithm 1, we use the reparametrization $s_{jn} = \log(1 + e^{r_{jn}})$ explained in section 3.3.3.

3.3.5 RMSprop Learning Rate: Stabilizing the learning rate

Both the BBVI and BBVI-CV algorithms, as described in sections 3.3.3 and 3.3.4, work fairly well with a fixed learning rate for a single layer network. However, their performance deteriorates significantly when the neural network has two or more layers. This can be attributed to the fact that the gradients for the different parameters change at significantly different rates. In order to overcome these issues, a wide class of adaptive learning rates has been explored in [117], [150], etc.
for the frequentist optimization of parameters in deep neural networks. One such popular technique which performs well in practice, called RMSprop, was introduced in [57]; there, the gradient is divided by a running average of its recent magnitude. As described in both [57] and [51], let $G_t$ denote the value of the current gradient. Define
\[
R_t = 0.9 R_{t-1} + 0.1 G_t^2, \qquad t = 1, 2, \cdots
\]
and replace the learning rate $\rho_t$ by the effective learning rate $\rho_t/(\sqrt{R_t} + \epsilon)$ for some small $\epsilon > 0$. Numerical studies show that for a one layer network RMSprop leads to faster convergence, and that for multiple layer networks convergence is not possible without an adaptive learning rate similar to RMSprop. One could also experiment with other adaptive learning rates like AdaGrad, AdaDelta, ADAM, etc. to serve the same purpose as RMSprop (see [88] and [38] for more details on other adaptive learning rates). The updated versions of BBVI and BBVI-CV using RMSprop, renamed BBVI-RMS and BBVI-CV-RMS, are summarized as algorithms 4 and 5 and provided in appendix A.

3.3.6 Classification using variational posterior

Define $\hat{\eta}(\mathbf{x})$, the variational estimator of $\eta_0(\mathbf{x})$, as
\[
\hat{\eta}(\mathbf{x}) = \sigma^{-1}\left( \int \sigma(\eta_{\boldsymbol{\theta}_n}(\mathbf{x})) \pi^*(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \right) \tag{3.23}
\]
where $\pi^*$ is the variational posterior. Analogous to (4.3), the classifier based on $\hat{\eta}(\mathbf{x})$ is
\[
\hat{C}(\mathbf{x}) = \begin{cases} 1, & \sigma(\hat{\eta}(\mathbf{x})) \ge 1/2, \\ 0, & \text{otherwise.} \end{cases} \tag{3.24}
\]
Note, the formulation in (3.23) guarantees that we directly approximate the main quantity of interest, $\sigma(\eta_0(\mathbf{x}))$ as in (4.1), by its posterior mean $\int \sigma(\eta_{\boldsymbol{\theta}_n}(\mathbf{x})) \pi^*(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n$, which is empirically estimated as
\[
\sigma(\hat{\eta}(\mathbf{x})) \approx \frac{1}{W} \sum_{w=1}^{W} \sigma(\eta_{\boldsymbol{\theta}_n[w]}(\mathbf{x})) \tag{3.25}
\]
where $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ are multiple samples from the variational posterior $\pi^*$. Since the generation of multiple samples from the variational posterior is cheap, the error between (3.23) and (3.25) is negligible.
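The prediction rule (3.24)-(3.25) can be sketched as follows. For brevity the logit is a linear function $\boldsymbol{\theta}^{\top}\mathbf{x}$ standing in for the deep network, and the fitted variational parameters are hypothetical values, not outputs of an actual run.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def variational_predict(x, m, s, W=1000):
    """Classify via (3.24)-(3.25): draw W parameter vectors from the mean-field
    variational posterior N(m, diag(s^2)), average sigma(eta_theta(x)), threshold."""
    theta = rng.normal(m, s, size=(W, len(m)))   # samples from pi*
    prob = np.mean(sigmoid(theta @ x))           # empirical estimate (3.25)
    return int(prob >= 0.5), prob

# Hypothetical fitted variational means and standard deviations.
m = np.array([2.0, -1.0])
s = np.array([0.3, 0.3])
label, prob = variational_predict(np.array([1.0, 0.5]), m, s)
assert label == 1 and prob > 0.5
```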
3.4 Posterior and Classification Consistency

In this section, we establish that the Bayesian inference procedure proposed in section 3.3 enjoys theoretical guarantees in terms of consistency of the posterior estimation and classification. For the simple Gaussian mean field family in (3.11), we establish that the variational posterior (3.12) is consistent under suitable assumptions on the prior parameters. We also discuss how the true function $\eta_0$ impacts the rate of consistency of the variational posterior. Finally, we present how the consistency rates of the variational posterior differ from those of the true posterior.

Let $f_0$ and $f_{\boldsymbol{\theta}_n}$ be the joint densities of the observations $(y_i, \mathbf{x}_i)_{i=1}^n$ under the truth and the model respectively. Without loss of generality, we assume $X_i \sim U[0,1]^{p_n}$, which implies $f_0(\mathbf{x}) = 1$ and $f_{\boldsymbol{\theta}_n}(\mathbf{x}) = 1$. This implies that the joint distribution of $(y_i, \mathbf{x}_i)_{i=1}^n$ depends only on the conditional distribution of $Y|X = \mathbf{x}$. From (4.1) and (3.4), with $\ell_{\boldsymbol{\theta}_n}$ and $\ell_0$ as in (3.5) and (3.7),
\[
\begin{aligned}
f_{\boldsymbol{\theta}_n}(y, \mathbf{x}) &= f_{\boldsymbol{\theta}_n}(y|\mathbf{x}) f_{\boldsymbol{\theta}_n}(\mathbf{x}) = \exp\left( y\eta_{\boldsymbol{\theta}_n}(\mathbf{x}) - \log\left(1 + e^{\eta_{\boldsymbol{\theta}_n}(\mathbf{x})}\right) \right) = \ell_{\boldsymbol{\theta}_n}(y, \mathbf{x}) \\
f_0(y, \mathbf{x}) &= f_0(y|\mathbf{x}) f_0(\mathbf{x}) = \exp\left( y\eta_0(\mathbf{x}) - \log\left(1 + e^{\eta_0(\mathbf{x})}\right) \right) = \ell_0(y, \mathbf{x}).
\end{aligned} \tag{3.26}
\]
We next define the Hellinger neighborhood of the true density function $f_0 = \ell_0$ as
\[
\mathcal{U}_{\varepsilon} = \{ \boldsymbol{\theta}_n : d_{\mathrm{H}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) < \varepsilon \} \tag{3.27}
\]
where the Hellinger distance $d_{\mathrm{H}}(\ell_0, \ell_{\boldsymbol{\theta}_n})$ is given by
\[
d_{\mathrm{H}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) = \left( \frac{1}{2} \int_{\mathbf{x} \in [0,1]^{p_n}} \sum_{y \in \{0,1\}} \left( \sqrt{\ell_0(y, \mathbf{x})} - \sqrt{\ell_{\boldsymbol{\theta}_n}(y, \mathbf{x})} \right)^2 d\mathbf{x} \right)^{1/2}.
\]
Also, the Kullback-Leibler (KL) neighborhood of the true density function $f_0 = \ell_0$ is
\[
\mathcal{N}_{\varepsilon} = \{ \boldsymbol{\theta}_n : d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) < \varepsilon \} \tag{3.28}
\]
where the KL distance $d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n})$ is given by
\[
d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) = \int_{\mathbf{x} \in [0,1]^{p_n}} \sum_{y \in \{0,1\}} \log\left( \frac{\ell_0(y, \mathbf{x})}{\ell_{\boldsymbol{\theta}_n}(y, \mathbf{x})} \right) \ell_0(y, \mathbf{x}) d\mathbf{x}.
\]
Let $P_0^n$ denote the true distribution of $(\mathbf{y}_n, \mathbf{X}_n) = (y_i, \mathbf{x}_i)_{i=1}^n$ under the true density $\ell_0$.

3.4.1 Posterior consistency and its implication in practice

In the following two theorems, we establish the posterior consistency of $\pi^*$ defined in (3.12).
In this direction, we show that the variational posterior concentrates in small $\varepsilon$ Hellinger neighborhoods of the true density $\ell_0$. In theorem 3.4.1, we establish this result for a fixed choice of the neighborhood size $\varepsilon$. In theorem 3.4.2, we establish the same result for shrinking neighborhood sizes. For both these theorems, the total number of parameters $K_n$ is allowed to grow at a rate $n^a$ for some $0 < a < 1$. Note, theorem 3.4.1 is a simple consistency result and holds due to the universal approximation properties of neural networks (see [60]) when the number of layers and input variables are fixed. This is an important result since it shows that, irrespective of the function under study, BDNNs enjoy consistency properties if the number of input variables and the number of layers are fixed. Additionally, we provide a characterization of the prior distribution, namely the rate of growth of the $L_2$ norm of the prior mean parameter necessary to guarantee the consistency result in theorem 3.4.1 (see (A2)) and the contraction result in theorem 3.4.2 (see (A4)). Theorem 3.4.2 studies the contraction rate of the variational posterior; it is more restrictive in nature and requires additional assumptions on the approximating neural network solution to the true function $\eta_0$ (see assumption (A3) below).

We next describe how our theoretical development contrasts with the recent works of [105] and [3]. First, theorem 3.4.2 establishes the variational posterior contraction rates following the classical definition of contraction as in theorem 2.1 of [47] and theorem 2.1 of [122]. It differs from the consistency results in [3], which deal with the posterior expectation of the square of the Hellinger distance, and from [105], which considers a lower bound involving $\|\eta_{\boldsymbol{\theta}_n} - \eta_0\|_{\infty}$.
Second, unlike the two aforementioned works, we assume a restriction only on the total number of parameters $K_n$ in the system instead of developing the results for the same number of nodes in each layer, an assumption which can severely restrict the space of neural network solutions one works with. Third, both [105] and [3] assume that there exists a true sparse solution all of whose coefficients are bounded above by a constant $B$ (see condition 4.3 in [3]). We impose no such restriction on our true neural network solution to begin with, but derive the most relaxed condition on the joint growth of the number of nodes and the strength of connections between active nodes that allows the rates of contraction to hold (see condition 3 in (A3)). Indeed, if we make the assumption that all coefficients of the neural network are bounded above by $B$, condition 3 in (A3) simplifies to a restriction only on the number of nodes, as in [105] and [3]. Lastly, both of these works establish their contraction results in the context of regression problems, which allows them to use results from [111]. Our systematic development here requires the derivation of the corresponding tools for a classification setup, and the ideas may be extended to other generalized linear models.

Theorem 3.4.1 Let $K_n \sim n^a$, $0 < a < 1$, and let $p_n = p$, $L_n = L$ be constants independent of $n$. Suppose

(A2) the prior parameters in (3.9) satisfy assumption (A1) and $\|\boldsymbol{\mu}_n\|_2^2 = o(n)$.

Then, $\pi^*(\mathcal{U}_{\varepsilon}^c) \to 0$ in $P_0^n$-probability.

Here, $\|\cdot\|_2$ is the $L_2$ norm of a vector as in definition A.0.1 in appendix B. By the above theorem, for any $\nu > 0$, $\pi^*(\mathcal{U}_{\varepsilon}^c) < \nu$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.1, it can be established that the true posterior satisfies $\pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n) < 2e^{-n\varepsilon^2/2}$ with probability tending to 1 as $n \to \infty$ (see theorem A.0.18 part 1 in appendix D).
This implies that the probability of the $\varepsilon$ Hellinger neighborhood of the true density $\ell_0$ under the true posterior increases at the rate $1 - 2e^{-n\varepsilon^2/2}$, in contrast to the slower rate $1 - \nu$ for the variational posterior.

Theorem 3.4.2 Suppose $K_n \sim n^a$, $0 < a < 1$, $L_n \sim \log n$, and $\epsilon_n^2 \sim n^{-\delta}$, $0 < \delta < 1 - a$. Suppose

(A3) there exists a sequence of neural network functions $\eta_{\boldsymbol{\theta}_n^*}$ satisfying
1. $\|\eta_0 - \eta_{\boldsymbol{\theta}_n^*}\|_{\infty} = o(\epsilon_n^2)$;
2. $\|\boldsymbol{\theta}_n^*\|_2^2 = o(n\epsilon_n^2)$;
3. $\log\left( \sum_{v=0}^{L_n} k_{vn} \prod_{v'=v+1}^{L_n} a^*_{v'n} \right) = O(\log n)$, where $a^*_{v'n} = \sup_{s=0,\cdots,k_{(v'+1)n}} \|\mathbf{A}^*_{v'}[s]\|_1$ is the largest $L_1$ norm among the rows of $\mathbf{A}^*_{v'}$;

(A4) the prior parameters satisfy assumption (A1) and $\|\boldsymbol{\mu}_n\|_2^2 = o(n\epsilon_n^2)$.

Then, $\pi^*(\mathcal{U}_{\varepsilon\epsilon_n}^c) \to 0$ in $P_0^n$-probability.

By the above theorem, for any $\nu > 0$, $\pi^*(\mathcal{U}_{\varepsilon\epsilon_n}^c) < \nu$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.2, it can be established that the true posterior satisfies $\pi(\mathcal{U}_{\varepsilon\epsilon_n}^c|\mathbf{y}_n, \mathbf{X}_n) < 2e^{-n\varepsilon^2\epsilon_n^2/2}$ with probability tending to 1 as $n \to \infty$ (see theorem A.0.19 part 1 in appendix D). This implies that the probability of the shrinking $\varepsilon\epsilon_n$ Hellinger neighborhoods of the true density $\ell_0$ under the true posterior increases at the rate $1 - 2e^{-n\varepsilon^2\epsilon_n^2/2}$, in contrast to the slower rate $1 - \nu$ for the variational posterior.

Remark: For a single layer, condition 3 of assumption (A3) holds if the number of input features increases at a rate polynomial in $n$. As the number of layers increases, one needs the row sums in the true solution $\mathbf{A}^*_v$, $v = 0, \cdots, L_n$, to be bounded. This shows that even with a control on the number of nodes, the strength of the signal into every active node must be well controlled (this corresponds to edge selection following node selection).

3.4.2 Discussion of the proof

We next briefly outline the main steps in the proof of theorems 3.4.1 and 3.4.2; the details are deferred to appendix C. The first step of the proof is to establish that $d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ is bounded below by a quantity determined by the rate of consistency of the true posterior.
The second step is to show that $d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ is bounded above at a rate greater than its lower bound if and only if the variational posterior is consistent. Note,
\[
\begin{aligned}
d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) &= \int_{\mathcal{U}_{\varepsilon}} \pi^*(\boldsymbol{\theta}_n) \log \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)} d\boldsymbol{\theta}_n + \int_{\mathcal{U}_{\varepsilon}^c} \pi^*(\boldsymbol{\theta}_n) \log \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)} d\boldsymbol{\theta}_n \\
&= -\pi^*(\mathcal{U}_{\varepsilon}) \int_{\mathcal{U}_{\varepsilon}} \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi^*(\mathcal{U}_{\varepsilon})} \log \frac{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)}{\pi^*(\boldsymbol{\theta}_n)} d\boldsymbol{\theta}_n - \pi^*(\mathcal{U}_{\varepsilon}^c) \int_{\mathcal{U}_{\varepsilon}^c} \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi^*(\mathcal{U}_{\varepsilon}^c)} \log \frac{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)}{\pi^*(\boldsymbol{\theta}_n)} d\boldsymbol{\theta}_n \\
&\ge \pi^*(\mathcal{U}_{\varepsilon}) \log \frac{\pi^*(\mathcal{U}_{\varepsilon})}{\pi(\mathcal{U}_{\varepsilon}|\mathbf{y}_n, \mathbf{X}_n)} + \pi^*(\mathcal{U}_{\varepsilon}^c) \log \frac{\pi^*(\mathcal{U}_{\varepsilon}^c)}{\pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n)}, \quad \text{by Jensen's inequality,}
\end{aligned}
\]
with $\mathcal{U}_{\varepsilon}$ as in (3.27). Since $\pi(\mathcal{U}_{\varepsilon}|\mathbf{y}_n, \mathbf{X}_n) \le 1$, note that for any $\varepsilon > 0$,
\[
\begin{aligned}
d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) &\ge \pi^*(\mathcal{U}_{\varepsilon}) \log \pi^*(\mathcal{U}_{\varepsilon}) + \pi^*(\mathcal{U}_{\varepsilon}^c) \log \pi^*(\mathcal{U}_{\varepsilon}^c) - \pi^*(\mathcal{U}_{\varepsilon}^c) \log \pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n) \\
&\ge -\pi^*(\mathcal{U}_{\varepsilon}^c) \log \pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n) - \log 2, \quad \text{since } x \log x + (1 - x)\log(1 - x) \ge -\log 2, \\
&= -\pi^*(\mathcal{U}_{\varepsilon}^c) \left( \log \int_{\mathcal{U}_{\varepsilon}^c} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n - \log \int \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \right) - \log 2.
\end{aligned}
\]
Thus, with
\[
A_n = \log \int_{\mathcal{U}_{\varepsilon}^c} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n, \qquad B_n = \log \int \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n, \tag{3.29}
\]
we get the following main step towards the proof of theorems 3.4.1 and 3.4.2:
\[
\pi^*(\mathcal{U}_{\varepsilon}^c) |A_n| \le d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) + |B_n| + \log 2. \tag{3.30}
\]
In the above proof we have assumed $\pi^*(\mathcal{U}_{\varepsilon}) > 0$ and $\pi^*(\mathcal{U}_{\varepsilon}^c) > 0$. If $\pi^*(\mathcal{U}_{\varepsilon}^c) = 0$, there is nothing to prove. If $\pi^*(\mathcal{U}_{\varepsilon}) = 0$, then following the steps of the proof in appendix C, we get $\varepsilon^2 = o_{P_0^n}(1)$, which is a contradiction. The first term $A_n$ is handled by decomposing
\[
e^{A_n} = \int_{\mathcal{U}_{\varepsilon}^c \cap \mathcal{F}_n} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n + \int_{\mathcal{U}_{\varepsilon}^c \cap \mathcal{F}_n^c} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n
\]
where $\{\mathcal{F}_n\}_{n=1}^{\infty}$ is a suitably chosen sequence of sieves. Indeed, our choice of $\mathcal{F}_n$ is given by
\[
\mathcal{F}_n = \left\{ \boldsymbol{\theta}_n : |\theta_{jn}| \le C_n, \; j = 1, \cdots, K_n \right\} \tag{3.31}
\]
where $C_n = e^{n^{1/b_1}}$ in theorem 3.4.1 and $C_n = e^{n\epsilon_n^2/(b_1 K_n)}$ in theorem 3.4.2 respectively, with $b_1$ chosen to ensure that the Hellinger bracketing entropy (see definition A.0.2 in appendix B) of $\mathcal{F}_n$ is well controlled (proposition A.0.16 in appendix C). Secondly, the prior needs to assign negligible probability to $\mathcal{F}_n^c$ so that the term $e^{A_n}$ is well controlled.
The prior in (3.9) satisfies this requirement for theorem 3.4.1 under assumptions (A1), (A2), and for theorem 3.4.2 under assumptions (A1), (A4). The second quantity $B_n$ is controlled by the rate at which the prior gives mass to shrinking KL neighborhoods of the true density $\ell_0$. In theorem 3.4.1, this rate is controlled as long as the prior parameters in (3.9) satisfy (A1) and (A2). In theorem 3.4.2, the same rate is controlled as long as the prior parameters satisfy (A1) and (A4) and the true function $\eta_0$ has a neural network solution which satisfies assumption (A3). Finally, we bound $d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ by $d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ for a suitable $q \in \mathcal{Q}_n$ (see propositions B.0.5 and A.0.17 in the appendix). From relation (A.30) in the appendix,
\[
d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) \le d_{\mathrm{KL}}(q, p) + \int \log \frac{L_0}{L(\boldsymbol{\theta}_n)} q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n + \log \int \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n. \tag{3.32}
\]
The last term above is nothing but $B_n$. The second term is the most crucial quantity:
\[
\int \log \frac{L_0}{L(\boldsymbol{\theta}_n)} q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \approx n \int d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n.
\]
For both theorems 3.4.1 and 3.4.2, the right hand side can always be controlled by choosing $q = \mathrm{MVN}(\mathbf{m}_n^*, \mathbf{s}_n^*)$ for a suitable choice of the sequences $\mathbf{m}_n^*$ and $\mathbf{s}_n^*$. We discuss the choice of $\mathbf{s}_n^*$ in appendix C. For theorem 3.4.1, $\mathbf{m}_n^* = \boldsymbol{\theta}_n^*$, where $\eta_{\boldsymbol{\theta}_n^*}$ is the finite neural network approximation of $\eta_0$; for theorem 3.4.2, $\mathbf{m}_n^* = \boldsymbol{\theta}_n^*$ corresponds to $\eta_{\boldsymbol{\theta}_n^*}$, the rate controlled neural network approximation of assumption (A3). Finally, the first term in (3.32) is determined by both the prior and $q$. In theorem 3.4.1, it is controlled as long as the prior parameters in (3.9) satisfy (A1), (A2). In theorem 3.4.2, the same rate is controlled as long as the prior parameters satisfy (A1), (A4) and the sequence $\boldsymbol{\theta}_n^*$ satisfies assumption (A3).

In light of the above discussion, there are three main properties which a prior must satisfy to allow for the convergence of the variational posterior. For any $\nu > 0$:
1. For a sequence of sieves $\{\mathcal{F}_n\}_{n=1}^{\infty}$ with well controlled Hellinger bracketing entropy,
\[
\int_{\mathcal{F}_n^c} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \le e^{-n\epsilon_n^2\nu}, \quad n \to \infty.
\]
2. With $\mathcal{N}_{\varepsilon}$ as in (3.28),
\[
\int_{\mathcal{N}_{\varepsilon\epsilon_n^2}} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \ge e^{-n\epsilon_n^2\nu}, \quad n \to \infty.
\]
3. For a $q$ satisfying $\int d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n < \varepsilon$ as $n \to \infty$,
\[
d_{\mathrm{KL}}(q, p) \le n\epsilon_n^2\nu, \quad n \to \infty.
\]
Whereas conditions 1 and 2 are standard assumptions for the consistency of the true posterior (see assumptions 1 and 2 in [4] and theorem 2 in [74]), condition 3 is an additional requirement which makes the variational posterior consistent. The proof presented in this section can be generalized to a much wider class of priors satisfying (1)-(3).

3.4.3 Classification consistency

In this section, we discuss the classification accuracy of the predictions made by the variational posterior by comparing it to the optimal mis-classification error. In view of (4.2), let $R(\hat{C})$ and $R(C^{\text{Bayes}})$ denote the mis-classification risks under the variational classifier in (3.24) and the Bayes classifier in (4.3) respectively. Then
\[
\begin{aligned}
|R(\hat{C}) - R(C^{\text{Bayes}})| &= \left| E_X E_{Y|X}\left[ \mathbb{1}_{\hat{C}(X) \ne Y} - \mathbb{1}_{C^{\text{Bayes}}(X) \ne Y} \right] \right| \\
&= \left| E_X\left[ \left( \mathbb{1}_{\hat{C}(X)=0} - \mathbb{1}_{C^{\text{Bayes}}(X)=0} \right) \sigma(\eta_0(X)) + \left( \mathbb{1}_{\hat{C}(X)=1} - \mathbb{1}_{C^{\text{Bayes}}(X)=1} \right) (1 - \sigma(\eta_0(X))) \right] \right| \\
&\le 2 E_X\left[ \mathbb{1}_{\hat{C}(X) \ne C^{\text{Bayes}}(X)} \, |\sigma(\eta_0(X)) - 1/2| \right] \\
&= 2 E_X\left[ \mathbb{1}_{\sigma(\hat{\eta}(X)) \ge 1/2, \, \sigma(\eta_0(X)) < 1/2} \, |\sigma(\eta_0(X)) - 1/2| + \mathbb{1}_{\sigma(\hat{\eta}(X)) < 1/2, \, \sigma(\eta_0(X)) \ge 1/2} \, |\sigma(\eta_0(X)) - 1/2| \right] \\
&\le 2 E_X |\sigma(\eta_0(X)) - \sigma(\hat{\eta}(X))|.
\end{aligned} \tag{3.33}
\]
The above result establishes how the difference in classification accuracy depends on the logit links $\hat{\eta}(X)$ and $\eta_0(X)$ as defined in (3.23) and (4.1) respectively. Using the above result, in corollary 3.4.3 we establish the classification accuracy of the variational estimate $\hat{\eta}(\mathbf{x})$ under no assumptions on the true function $\eta_0(\mathbf{x})$. In corollary 3.4.4, we establish the same result under assumption (A3) on the true function $\eta_0(\mathbf{x})$. Note, although theorem 3.4.1 requires minimal assumptions, it gives a much weaker convergence result on the classification accuracy.

Corollary 3.4.3 Under the conditions of theorem 3.4.1, $|R(\hat{C}) - R(C^{\text{Bayes}})| \to 0$ in $P_0^n$-probability.
By the above corollary, for any $\nu > 0$, $|R(\hat{C}) - R(C^{\text{Bayes}})| < \nu$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.1, it can be established that the true posterior also gives classification consistency at the same rate, so there is no loss in using the variational posterior approximation (see theorem A.0.18 part 2 in appendix D).

Corollary 3.4.4 Under the conditions of theorem 3.4.2, for every $0 \le \kappa \le 2/3$, $\epsilon_n^{-\kappa} |R(\hat{C}) - R(C^{\text{Bayes}})| \to 0$ in $P_0^n$-probability.

By the above corollary, for any $\nu > 0$ and $0 \le \kappa \le 2/3$, $|R(\hat{C}) - R(C^{\text{Bayes}})| < \nu\epsilon_n^{\kappa}$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.2, it can be established that the true posterior satisfies $|R(\hat{C}) - R(C^{\text{Bayes}})| < \nu\epsilon_n^{\kappa}$ for every $\nu > 0$, $0 \le \kappa \le 1$, with probability tending to 1 as $n \to \infty$ (see theorem A.0.19 part 2 in appendix D). Thus, classification consistency occurs at the rate $\epsilon_n^{2/3}$ for the variational posterior, in contrast to $\epsilon_n$ for the true posterior.

3.5 Simulation Studies

In this section, we study the performance of the four algorithms BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS in the context of two simulation scenarios. We used an approximate 2:1 ratio of training to test cases. All covariates are normalized. We adopted 10-fold cross-validation to avoid optimistically biased estimates of model performance.

3.5.1 Simulation Scenarios

Scenario 1: We simulate $n = 3000$ observations from a 2-2-2-1 network, i.e. a neural network with 2 input features, 2 hidden layers with 2 nodes each and 1 output node, as
\[
y_i = \begin{cases} 0, & \mathbf{b}_2 + \mathbf{A}_2\psi(\mathbf{b}_1 + \mathbf{A}_1\psi(\mathbf{b}_0 + \mathbf{A}_0\mathbf{x}_i)) > 0, \\ 1, & \text{otherwise,} \end{cases}
\]
where $\mathbf{x}_i \in \mathbb{R}^2$ with entries i.i.d. from $N(0,1)$, and the entries of $\mathbf{b}_j$, $\mathbf{A}_j$, $j = 0, 1, 2$ are i.i.d. from $U(0,1)$.

Scenario 2: We simulate $n = 3000$ observations from the following nonlinear function:
\[
y_i = \begin{cases} 0, & 2e^{x_i[1]} + 3\sin(x_i[2]x_i[3]) + 4x_i[4] - 3 > 0, \\ 1, & \text{otherwise,} \end{cases}
\]
where $\mathbf{x}_i \in \mathbb{R}^4$ with entries i.i.d. from $N(0,1)$.
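The data-generating mechanism of scenario 2 can be sketched as follows; the seed and the exact split indices are arbitrary illustrative choices, with the approximate 2:1 train/test ratio taken from the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# Scenario 2: n = 3000 observations; label 0 when the nonlinear score exceeds 0.
n = 3000
X = rng.normal(size=(n, 4))
score = 2 * np.exp(X[:, 0]) + 3 * np.sin(X[:, 1] * X[:, 2]) + 4 * X[:, 3] - 3
y = np.where(score > 0, 0, 1)

# An approximate 2:1 ratio of training to test cases.
n_train = 2 * n // 3
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

assert set(np.unique(y)) == {0, 1} and len(y_test) == n - n_train
```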
3.5.2 Parameter choices for the statistical and computational models

In order to implement BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS, we need to make a valid choice of the prior parameters $\mu_{jn}$, $\sigma_{jn}$, $j = 1, \cdots, K_n$, as in (3.9). We use $\mu_{jn} = 0$ and $\sigma_{jn} = 1$ for our prior parameters. Indeed, this choice satisfies conditions (A1), (A2) and (A4) as assumed in the consistency proofs of theorems 3.4.1 and 3.4.2. Next, we need to choose the number of nodes in each hidden layer. We experiment with 1 and 2 hidden layers with 2 nodes in each layer. This choice of the number of nodes satisfies the assumptions of theorems 3.4.1 and 3.4.2.

3.5.3 Gradient stabilization parameters

The initial learning rate is $\rho_t = 10^{-4}$, $t \ge 1$ for BBVI and BBVI-CV, and $\rho_t = 10^{-1}$, $t \ge 1$ for BBVI-RMS and BBVI-CV-RMS. These values were chosen to ensure the optimal performance of the algorithms; however, little sensitivity to the initial choice was observed. As explained in section 3.3, to allow for stable optimization, we study the sensitivity to the Monte Carlo sample size $S$, the use of control variates and the RMSprop based gradient method. The performance of the model, in terms of algorithmic stability and convergence time, is sensitive to the choice of $S$: each update with a small sample size takes less time, but the variability of the estimate is high; on the other hand, a large sample size leads to less variable estimates, but each update takes much longer. We experimented with $S = 200$, $S = 500$ and $S = 1000$. For scenario 1, figures 3.1 and 3.2 illustrate how the ELBO changes with $S$ for one and two layers respectively. For scenario 2, figures 3.3 and 3.4 provide the same illustration for one and two layers respectively. It is evident that increasing $S$ from 200 to 1000 stabilizes the ELBO and helps with a faster convergence.
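The effect of $S$ on stability reflects the familiar $1/S$ scaling of the Monte Carlo variance, which can be illustrated on a toy score-function gradient (the target function here is an arbitrary stand-in for the network ELBO gradient):

```python
import numpy as np

rng = np.random.default_rng(6)

def grad_estimates(S, reps=3000, m=0.3):
    # One score-function gradient estimate per row, each averaging S samples.
    theta = rng.normal(m, 1.0, size=(reps, S))
    return np.mean((theta - m) * theta**2, axis=1)

v200 = np.var(grad_estimates(200))
v1000 = np.var(grad_estimates(1000))
# Larger S gives a markedly more stable gradient, as seen in figures 3.1-3.4.
assert v1000 < v200
```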
As explained in section 3.3.4, the maximization of the ELBO requires stabilizing the variance of the stochastic gradient in (3.18), which is done by the use of control variates. For scenario 1, figures 3.1 and 3.2 illustrate how the ELBO changes with the use of control variates for one and two layers respectively. For scenario 2, figures 3.3 and 3.4 provide the same illustration for one and two layers respectively. It is evident that the use of control variates stabilizes the ELBO by a huge margin and allows for its faster convergence. Finally, as explained in section 3.3.5, the use of RMSprop stabilizes the optimization of the ELBO by normalizing the gradients by their running magnitude. For scenario 1, figures 3.1 and 3.2 illustrate how the ELBO changes with the use of RMSprop versus a fixed learning rate for one and two layers respectively. For scenario 2, figures 3.3 and 3.4 provide the same illustration for one and two layers respectively. It is evident that the use of RMSprop leads to a stable ELBO and faster convergence rates.

Figure 3.1: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 1 layer.
Figure 3.2: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 2 layers.
Figure 3.3: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 1 layer.
Figure 3.4: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 2 layers.
                                      Testing accuracy (%)           Convergence time (s)
Layers  Method    Sample size (S)     Fixed          RMSprop         Fixed   RMSprop
1       BBVI      200                 97.41 ± 0.50   96.89 ± 0.93    23      114
                  500                 97.72 ± 0.38   97.52 ± 0.74    55      106
                  1000                98.01 ± 0.33   97.38 ± 0.39    108     80
        BBVI-CV   200                 97.82 ± 0.40   97.61 ± 0.60    21      6
                  500                 97.84 ± 0.40   97.67 ± 0.34    52      7
                  1000                97.84 ± 0.42   97.94 ± 0.40    104     10
2       BBVI      200                 97.79 ± 0.71   97.02 ± 1.10    200     98
                  500                 94.34 ± 3.82   97.75 ± 0.95    452     39
                  1000                91.50 ± 5.17   98.11 ± 0.42    904     65
        BBVI-CV   200                 96.34 ± 0.75   97.61 ± 0.44    118     17
                  500                 96.33 ± 0.73   97.30 ± 0.60    272     23
                  1000                96.36 ± 0.74   97.74 ± 0.54    552     40

Table 3.1: Performance of algorithms 1, 2, 4, 5 for scenario 1.

3.5.4 Testing accuracy and convergence

We evaluate the model's performance for all four algorithms BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS under two criteria: (1) testing accuracy and (2) convergence time. The test accuracy $T(C)$ of a classifier is given by $1 - R(C)$, where $R(C)$ is the mis-classification error rate as described in (4.2). The convergence criterion is defined as the point where the Monte Carlo estimate of the ELBO, as in (3.19), converges. For scenario 1, table 3.1 gives the performance of the four algorithms for 1 and 2 layers. The best average accuracy of 98.11% is obtained by BBVI-RMS with $S = 1000$ and 2 layers. The best time is achieved by BBVI-CV-RMS with $S = 200$ for one layer, with an average accuracy of 97.61%. We can draw two conclusions here: although the true data are generated from a 2-layer network structure, a one layer approximation is fairly competitive; and BBVI-CV-RMS with $S = 200$ provides the best convergence time of nearly 6 seconds for one layer and 17 seconds for two layers with competitive accuracy. For scenario 2, table 3.2 gives the performance of the four algorithms for 1 and 2 layers. The best average accuracy of 91.12% is obtained by BBVI-CV-RMS with $S = 1000$ and 2 layers. The best time is achieved by BBVI-CV-RMS with $S = 200$ for one layer, with an average accuracy of 91.11%.
The improvement obtained by moving from 1 to 2 layers is only marginal. BBVI-CV-RMS with S = 200 provides the best convergence time of nearly 19 s for one layer and 11 s for two layers with competitive accuracy.

                                     Testing accuracy (%)            Convergence time (s)
Layers  Method    Sample size (S)    Fixed            RMSprop        Fixed    RMSprop
1       BBVI      200                83.66 ± 14.51    88.71 ± 7.12     190         15
                  500                90.22 ± 0.54     90.32 ± 0.98     364        390
                  1000               90.28 ± 0.75     90.41 ± 0.71     732        710
        BBVI-CV   200                90.51 ± 0.87     90.42 ± 0.64      17         19
                  500                90.51 ± 0.87     90.65 ± 0.61      36         33
                  1000               90.53 ± 0.91     90.78 ± 0.49      69         37
2       BBVI      200                88.40 ± 0.50     89.89 ± 0.88     256        421
                  500                90.52 ± 0.38     90.48 ± 0.74     518        544
                  1000               90.61 ± 0.33     90.32 ± 0.65     906        608
        BBVI-CV   200                90.62 ± 0.40     91.11 ± 0.58     444         11
                  500                90.74 ± 0.40     90.98 ± 0.54     862         12
                  1000               90.72 ± 0.42     91.12 ± 0.53    1646         13

Table 3.2: Performance of algorithms 1, 2, 4, 5 for scenario 2.

3.5.5 Large number of layers and challenges.

We finally discuss the performance of BBVI-RMS and BBVI-CV-RMS when the number of layers is 3. For 3 layers, using a fixed learning rate does not allow for the maximization of the ELBO. This may be attributed to the different scales of the gradients for the different parameters; similar behavior is also observed in parametric optimization of artificial deep neural networks (see [57] for more details). From Table 3.3, it is evident that using 3 layers instead of 2 provides only a marginal improvement for scenario 1. For scenario 2, the performance with 3 layers is worse than with 2 layers. As explained in the previous sections, the performance of both BBVI-RMS and BBVI-CV-RMS improves with increasing sample size S. However, a great deal of sensitivity to the choice of the initial learning rate was observed. This sensitivity was even more pronounced in the case of control variates, especially under scenario 2.
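The "different scales of the gradients" issue above is exactly what RMSprop addresses. The following toy sketch (learning rate, decay and gradients are illustrative, not the settings used in the experiments) shows how dividing by a running root-mean-square of the gradient lets parameters with very different gradient magnitudes advance at comparable rates under a single learning rate:

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.1, decay=0.9, eps=1e-8):
    # One RMSprop (ascent) update: maintain a running average of the
    # squared gradient and normalize the step by its square root.
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param + lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Two parameters whose gradients differ by five orders of magnitude.
p, c = np.zeros(2), np.zeros(2)
for _ in range(50):
    g = np.array([1e-3, 1e2])      # fixed toy gradients
    p, c = rmsprop_step(p, g, c)
print(p)   # both coordinates advance at a comparable rate
```

With a fixed learning rate, the first coordinate would barely move while the second would overshoot; after RMSprop normalization both take near-identical steps, which is why a single initial rate can serve all layers.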
For scenario 1, the optimal learning rate ρ_t, t ≥ 1, was found to be 0.1 for BBVI-RMS and 0.3 (S = 200) or 0.35 (S = 500 and 1000) for BBVI-CV-RMS. For scenario 2 under BBVI-RMS, the optimal learning rates were found to be ρ_t = 0.055, 0.04 and 0.04, t ≥ 1, for S = 200, 500 and 1000, respectively. For scenario 2 under BBVI-CV-RMS, the optimal learning rates were ρ_t = 0.4, 0.55 and 0.63, t ≥ 1, for S = 200, 500 and 1000, respectively. With the optimal choice of ρ_t at hand, BBVI-CV-RMS provided faster convergence with test accuracy comparable to that of BBVI-RMS. This sensitivity to the choice of the initial learning rate, especially in the case of control variates with a large number of layers, needs to be explored as part of future work.

                       Scenario 1                       Scenario 2
Method        S        Testing accuracy (%)  Time (s)   Testing accuracy (%)  Time (s)
BBVI-RMS      200      97.76 ± 0.87          218        84.68 ± 4.85          423
              500      97.65 ± 0.83          169        88.00 ± 5.56          631
              1000     98.21 ± 0.73          132        90.69 ± 0.67          714
BBVI-CV-RMS   200      96.23 ± 1.05          212        84.53 ± 8.90           33
              500      97.83 ± 0.81          166        88.28 ± 2.03           37
              1000     98.42 ± 0.72          124        89.33 ± 1.67           45

Table 3.3: Performance of algorithms 1, 2, 4, 5 for scenario 1 and scenario 2 with 3 layers.

3.6 Numerical Properties and Alzheimer's Disease Study

The transition from mild cognitive impairment (MCI) to Alzheimer's disease (AD) is of great interest to clinical researchers. Several studies over the past decade have demonstrated and compared the performance of different machine learning methods on this classification task. For this classification problem, we illustrate the performance of the variational Bayesian neural networks developed in Section 3.3 in terms of classification accuracy, numerical complexity and time of convergence. We implemented both algorithms 1 and 2 and shall henceforth refer to them as BBVI and BBVI-CV, respectively.
For a comparative baseline, we also report the performance of several machine learning techniques applicable to this task. We emphasize that our primary goal here is to illustrate the computational methodology rather than an incremental improvement for a specific application.

Alzheimer's disease (AD) is a progressive, age-related, neurodegenerative disease and the most common cause of dementia [147, 148, 68]. Behaviorally, AD is commonly preceded by mild cognitive impairment (MCI), a syndrome characterized by decline in memory and other cognitive domains that exceeds the cognitive decrements associated with normal aging [148, 103]. However, the prodromal symptoms of MCI are not prognostically deterministic: individuals with MCI progress to probable AD at a rate of 8%-15% per year, and most conversions occur within 3 years of presentation [24, 44, 2]. We used T1-weighted MRI images from the collection of standardized datasets. The description of the standardized MRI imaging from ADNI can be found at http://adni.loni.usc.edu/methods/mri-analysis/adni-standardized-data. This study used a subset of the MCI subjects from ADNI-1 who had demographic data, clinical cognitive assessments, APOE4 genotyping, and MRI measurements. In total, 819 individuals had a baseline diagnosis of MCI, but we consider only patients whose follow-up period was at least 36 months and who had no missing values. The final sample included 265 subjects, comprising participants who were stable in their diagnosis (MCI-S) and those who converted to a diagnosis of AD over 3 years (MCI-C). We considered a total of 18 clinical predictors as potential predictors of MCI-to-AD progression in our classification analyses. Structural MRI data were collected according to the ADNI acquisition protocol using T1-weighted scans (GradWarp, B1 Correction, N3, Scaled). Based on the extant literature [68, 81], we used 24 ROI features regarded as clinically significant for MCI-to-dementia progression.
The dependence and interactions among and within the different modes of features (clinical, MRI) may differ and are hard to model explicitly. Thus, neural network-based modeling is intuitive from a predictive modeling and machine learning perspective. Of the 265 patients, 186 were selected by simple random sampling as training cases and the remaining 79 as test cases. The approximately 2:1 ratio of training to test cases is, of course, arbitrary. All covariates (except categorical variables) were z-normalized. The outcome y_i for the i-th patient is either 1 for MCI-C or 0 for MCI-S in the classification study. 10-fold cross-validation is used to avoid optimistically biased estimates of model performance.

3.6.1 Parameter choices for statistical and computational models.

To implement BBVI, BBVI-CV, BBVI-RMS, and BBVI-CV-RMS, we set μ_jn = 0 and σ_jn = 1 as in Section 3.5. For the number of layers, we found that one layer provides good enough performance, and the inclusion of additional layers does not offer further improvement in accuracy. We tried k_1n = 2, 10, 20 and obtained the best results at k_1n = 10, which are the results reported in this thesis.

3.6.2 Gradient stabilization parameters.

The initial learning rate is ρ_t = 10⁻⁴, t ≥ 1, for BBVI and BBVI-CV and ρ_t = 10⁻¹, t ≥ 1, for BBVI-RMS and BBVI-CV-RMS. As explained in Section 3.3, to allow for stable optimization, we study the sensitivity to the different sample sizes S, the use of control variates and the RMSprop-based gradient descent method. For ADNI, Figure 3.5 illustrates how the ELBO changes with S. It is evident that increasing S from 200 to 1000 stabilizes the ELBO and helps with faster convergence. Figure 3.5 also illustrates how the ELBO changes with the use of control variates: their use stabilizes the ELBO by a wide margin and allows for its faster convergence.
Similarly, Figure 3.5 also illustrates how the ELBO changes with the use of RMSprop versus a fixed learning rate. It is evident that the use of RMSprop leads to a stable ELBO and faster convergence.

Figure 3.5: ELBO convergence of algorithms 1, 2, 4, 5 for ADNI.

3.6.3 Testing accuracy and convergence.

For ADNI, Table 3.4 gives the performance of BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS for one layer. The best average accuracy of 76.88% was obtained by BBVI with S = 200. The optimal convergence time is achieved by BBVI-CV-RMS with S = 200 for one layer, with an average accuracy of 76.25% and a convergence time of 36 seconds. Thus, the conclusions for the real data corroborate the use of BBVI-CV-RMS for a single-layer NN.

                             Testing accuracy (%)           Convergence time (s)
Method    Sample size (S)    Fixed           RMSprop        Fixed    RMSprop
BBVI      200                76.88 ± 3.32    75.75 ± 3.27      68        49
          500                76.75 ± 3.63    76.50 ± 3.90     105        62
          1000               76.75 ± 3.12    76.63 ± 3.21     231        65
BBVI-CV   200                76.75 ± 3.41    76.25 ± 3.83     146        36
          500                76.75 ± 3.58    76.63 ± 3.95     210        38
          1000               76.75 ± 3.71    76.75 ± 4.07     264        39

Table 3.4: Performance of algorithms 1, 2, 4, 5 for ADNI.

3.6.4 Numerical comparison with popular models

In this section, we numerically compare the testing accuracy of BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS with a few benchmark models, which include logistic regression (LR) and support vector machine (SVM) as developed by [101, 87] and the frequentist artificial neural network (ANN) [20, 54]. We also compare with a Bayesian neural network model that uses stochastic gradient MCMC [137]. For all neural network models, viz. the artificial neural network (ANN) and the stochastic gradient MCMC Bayesian neural network (SG-MCMC), the number of nodes is fixed at k_n = 10 with a single hidden layer. Table 3.5 provides the training and testing accuracy and empirical standard errors for all methods under consideration.
For the four models BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS, the results reported correspond to the optimal parameter combination providing the best average test accuracy. Little to no difference was observed across different choices of the algorithm parameters (see Table 3.4). LR, SVM, ANN and SG-MCMC have considerably larger standard errors for testing accuracy. One might observe an improvement in the performance of the SG-MCMC Bayesian neural network by optimally choosing its tuning parameters. However, studying that is beyond the scope of this thesis, as it is a different methodology whose underlying statistical theory is not well established.

Classifier      Training accuracy (%)    Testing accuracy (%)
LR              82.1 ± 2.5               70.9 ± 5.5
SVM             80.3 ± 2.2               70.6 ± 5.5
ANN             82.0 ± 5.6               74.1 ± 6.8
SG-MCMC         80.8 ± 4.6               73.5 ± 5.9
BBVI            80.7 ± 2.1               76.9 ± 3.3
BBVI-CV         80.3 ± 2.3               76.8 ± 3.4
BBVI-RMS        81.2 ± 2.4               76.8 ± 3.3
BBVI-CV-RMS     82.8 ± 1.6               76.8 ± 4.1

Table 3.5: Performance of different classifiers. LR: logistic regression. SVM: support vector machine. ANN: frequentist artificial neural network. SG-MCMC: stochastic gradient MCMC Bayesian neural network.

3.7 Conclusion and Discussion

The theoretical rigor and computational detail of the variational Bayes neural network classifier presented in this chapter are a novel and unique contribution to the statistical literature. Although variational Bayes is popular in machine learning, neither the computational method nor the statistical properties are well understood for complex models such as neural networks. We characterize the prior distributions and the variational family for consistent Bayesian estimation. The theory also quantifies the loss due to the VB numerical approximation relative to the true posterior distribution.
For practical implementation, we reveal that the algorithm may not be as simple and straightforward as it sounds in the computer science literature; rather, it requires careful tuning of several parameters at various steps. Nevertheless, the computation can be considerably faster than the popular Markov chain Monte Carlo procedures for approximating posterior distributions. Although we build the framework on a multi-layer neural network model with a simple prior structure, the detailed statistical theory and computational methodology are quite involved. This investigation opens up the possibility of exploring a much wider class of models and priors. For example, shrinkage priors, such as the double exponential and horseshoe priors, can be explored for building sparse neural networks, or one can experiment with various other variational families. However, their computational details and associated statistical properties are not immediate. We hope this research will accelerate further development of the statistical and computational foundations for variational inference in general machine learning research.

CHAPTER 4

LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS

4.1 Introduction

Bayesian neural networks (BNNs) have achieved state-of-the-art results in a wide range of tasks, especially in high-dimensional data analysis, including image recognition, biomedical diagnosis and others. One of the major disadvantages of using neural networks and deep networks is that they require a huge amount of training data due to the large number of inherent parameters [140, 45]. Consequently, high-dimensional neural networks have been widely applied with regularization, dropout techniques or early stopping to prevent overfitting [118, 143].
Furthermore, the most commonly used dimension reduction techniques include the Lasso [17], Ridge [58], Elastic net [152], Sparse group lasso [116], Bayesian Lasso [98], the Horseshoe prior [16], and principal component analysis [115]. Even though the ℓ1 and ℓ2 norms can force the weights to become zero or small, they do not have the regularizing effect of making the computed function simpler [70]. Additionally, all these methods rely on the use of the whole data, which severely increases the cost of both computation and memory storage. In this chapter, we propose the use of a BNN on a compressed feature space to take care of the large p, small n problem by projecting the feature space onto a lower-dimensional space using a random projection matrix.

Random projection (RP) is a powerful dimension reduction technique which uses RP matrices to map data into low-dimensional spaces. The use of RP in high-dimensional statistics is motivated by the Johnson-Lindenstrauss lemma [27], which states that for x_1, ..., x_n ∈ R^p, ε ∈ (0, 1) and d > 8 log n / ε², there exists a linear map f: R^p → R^d such that

(1 − ε)‖x_i − x_j‖₂² ≤ ‖f(x_i) − f(x_j)‖₂² ≤ (1 + ε)‖x_i − x_j‖₂²,  for i, j = 1, ..., n.

The properties of RPs and their applications to statistical problems were further explored in [33, 13], etc. To reduce the sensitivity to the choice of random matrices, one must pool information obtained from multiple projections. In this chapter, we adopt a Bayesian model averaging approach for combining information across multiple instances of RP-based neural networks. There are two main challenges in implementing Bayesian model averaging: (1) due to the convoluted structure of the neural network likelihood, closed-form expressions do not exist for the posterior distribution under each model; (2) the posterior distribution of the model weights is completely intractable and no closed-form solutions exist.
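A quick numerical illustration of the Johnson-Lindenstrauss property (a sketch with a Gaussian projection scaled so squared norms are preserved in expectation; the dimension d follows the 8 log n / ε² bound quoted above):

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, eps = 50, 2000, 0.5
d = int(np.ceil(8 * np.log(n) / eps ** 2))      # projected dimension per the lemma
X = rng.normal(size=(n, p))
Gamma = rng.normal(size=(d, p)) / np.sqrt(d)    # E||Gamma x||^2 = ||x||^2
Z = X @ Gamma.T                                 # projected points in R^d

def pdist2(A):
    # Pairwise squared Euclidean distances.
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

mask = ~np.eye(n, dtype=bool)
ratio = pdist2(Z)[mask] / pdist2(X)[mask]
print(ratio.min(), ratio.max())   # pairwise distance ratios stay close to 1
```

Despite collapsing 2000 coordinates to about 126, every pairwise distance ratio remains in a narrow band around 1, which is what makes RP-based compression of the feature space viable.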
Thereby, the implementation of standard Markov chain Monte Carlo (MCMC) is next to impossible. Further, the computation and storage cost associated with an MCMC implementation is enormous, since each posterior model weight depends on the posterior model weights of the remaining models. To address the challenges of MCMC implementation, we use a variational inference (VI) [63, 9] approach to provide an approximate solution for Bayesian model averaging (BMA), allowing BNNs to be combined across multiple instances of compression of the feature space. There is a plethora of literature implementing variational inference in neural networks [10]. However, these implementations make use of the entire feature space, thereby putting a great burden on computational stability and memory storage. We address two main challenges in this thesis: (1) developing a variational Bayes (VB) solution for BNNs with a compressed feature space; (2) providing a VB solution for BMA across multiple instances of RP. Further, for a given instance of random compression, we establish the posterior contraction rates of the variational posterior for classification (the theory is extendable to the regression setup with minor modifications). In this direction, we provide a characterization of the prior, the variational posterior and the RP matrix which guarantees the convergence of the variational Bayes neural network (VBNN) under the compressed feature space to the true density of the observations. The main advantage of implementing a BMA approach is that it gives the posterior model weights under each compression of the feature space. The posterior model weights so obtained in turn induce a probability distribution on the projected dimension of the feature space. The mode of this probability distribution concentrates around the intrinsic dimensionality of the feature space.
To improve prediction performance, the BMA approach is then applied to a pool of RP matrices whose projected dimensions lie in a neighborhood of the intrinsic dimensionality. Finally, we study the numerical behavior of the proposed procedure in light of simulated and real data sets. To the best of our knowledge, there exists no literature which provides theoretical guarantees and a computational algorithm for VBNNs with a compressed feature space.

Feature reduction using projection matrices has long been studied in both supervised and unsupervised learning. [146] proposed semisupervised classification with graph construction using a projection matrix to preserve the local and global structure of the data. Beyond semisupervised learning, projection methods have been used in convolutional neural networks: [125] introduced an efficient convolutional neural network which can control how much context information is incorporated into each specific position using a word-embedding projection matrix. In unsupervised learning, [134] proposed an adaptive embedding method which combines the calculation of the projection matrix and the construction of the affinity graph. Early works on Bayesian neural networks (BNNs) have been comprehensively discussed by [85, 96, 71]. With advances in computational and information science, recent developments on more efficient BNNs can be found in [120, 93, 61, 64] and the references therein. However, as the dimension of the feature space increases, the prediction accuracy of BNNs is severely compromised. To circumvent this issue, penalization and sparse-network-based approaches have been studied by [80, 45, 140, 48, 3], etc. The major drawback of these sparsity-based methods is that one must work with the entire data set, which increases the implementation time manyfold.
With the work of [27], the idea of using RPs to overcome the curse of dimensionality became very popular. Further, RPs have been used in a wide range of statistical problems [1, 86, 84, 39, 55, 40, 49], etc. To ensemble information across projections, [14] uses a bagging approach for classification problems as in [12]. On the other hand, the works of [52] and [53] propose the use of BMA in the context of linear regression and Gaussian processes. There exists a plethora of literature implementing variational inference [9] to overcome the drawbacks of a full MCMC implementation. The majority of black-box variational methods for Bayesian learning of neural networks are based on the pathwise gradient estimator [41, 83, 15, 11, 121], which is computed using the reparameterization trick [106]. Another line of black-box variational extensions is based on the score-function estimator, which uses a Monte Carlo estimate of the full gradient, including control variates [107] and stochastic search [97]. Theoretical properties of the variational posterior in the context of individual models have been studied in [6, 135, 100, 149, 3]. The works of [65] and [72] explore variational inference for BMA in the context of generalized linear models and graphs on functions, respectively. To the best of our knowledge, BMA in the context of Bayesian neural networks with a compressed feature space remains unexplored.

First, we introduce the RP idea in a neural network predictive model where the feature space grows exponentially with the training sample size, which in turn significantly reduces the computational complexity and storage requirements associated with BNNs. Second, we apply the BMA idea in conjunction with VB to allow for parallelization across RPs without compromising the uncertainty quantification of a Bayesian approach. Third, we develop the associated statistical foundation, namely the posterior contraction of the variational posterior for BNNs under a compressed feature space.
The theory not only lends trustworthiness to our model; the results also provide theoretical guidelines for prior selection and the choice of the variational family of distributions. Fourth, we innovatively apply the learned posterior model weights to obtain the intrinsic dimensionality of the feature space. Fifth, to improve predictive accuracy, we employ VB with BMA on a subspace of RPs with projected dimension centered around the intrinsic dimensionality. Sixth, we provide numerical results demonstrating that our proposed approach learns the intrinsic dimension of the feature space well and beats the predictive performance of all competing methods for large p, small n problems. Lastly, the performance of the proposed methodology is demonstrated on real data sets such as ADNI and MNIST.

4.2 Bayesian neural network for random projection based compressed feature space

4.2.1 Bayesian neural network model

For a binary random variable Y, representing the class labels 0 or 1, and a feature vector X ∈ R^p with some marginal distribution P_X, consider the classification problem

P(Y = 1 | X = x) = σ(η₀(x)) = 1 − P(Y = 0 | X = x)   (4.1)

where η₀(·): R^p → R is some continuous function and σ(u) = e^u / (1 + e^u) is the sigmoid function. Following [14] and [141], the test error of a classifier C is given by

R(C) = ∫_{R^p × {0,1}} 1{C(X) ≠ Y} dP_{X,Y}   (4.2)

where the joint density P_{X,Y} is the product of (4.1) and P_X. The Bayes classifier is then

C_Bayes(x) = 1 if σ(η₀(x)) ≥ 1/2, and 0 otherwise.   (4.3)

Since η₀(x) is unknown, we use a single-layer neural network approximation with k nodes:

η_θ(x) = β₀ + Σ_{j=1}^{k} β_j ψ(γ_{j0} + γ_j⊤ x) = β₀ + β⊤ ψ(γ₀ + Λ⊤ x)   (4.4)

where β = [β₁, ..., β_k], γ₀ = [γ₁₀, ..., γ_k0], Λ = [γ₁, ..., γ_k] and θ = (β₀, β, γ₀, vec(Λ)) is the set of all parameters. Note that θ is a K × 1 vector, where K = 1 + k + k(p + 1). Both k and p (p ≫ n) grow as functions of n.
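A minimal sketch of the single-layer logit in (4.4), taking ψ to be the sigmoid activation for illustration (the weights below are arbitrary, not fitted, and the name `Lam` stands for the matrix collecting the hidden-layer weight vectors):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def eta(x, beta0, beta, gamma0, Lam):
    # eta_theta(x) = beta0 + beta^T psi(gamma0 + Lam^T x), with psi taken
    # to be the sigmoid; Lam (p x k) collects [gamma_1, ..., gamma_k].
    return beta0 + beta @ sigmoid(gamma0 + Lam.T @ x)

p, k = 20, 5                                  # features and hidden nodes
x = rng.normal(size=p)
beta0, beta = 0.1, rng.normal(size=k)
gamma0, Lam = rng.normal(size=k), rng.normal(size=(p, k))

logit = eta(x, beta0, beta, gamma0, Lam)
prob = sigmoid(logit)                         # P(Y = 1 | x) as in the model
print(0.0 < prob < 1.0)                       # True
```

The parameter count matches the text: 1 (bias) + k (output weights) + k (hidden biases) + kp (hidden weights) = 1 + k + k(p + 1).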
We then use the following model for the problem in (4.1):

P(Y = 1 | X = x) = σ(η_θ(x)) = 1 − P(Y = 0 | X = x)   (4.5)

4.2.2 Compression of the feature space with random projections

There exist several choices for compressing the feature space X using RP matrices, such as those proposed in [33, 13, 14, 27, 26], etc. For a given choice of the compression matrix Γ, we consider a single-layer neural network with k nodes for the input vector x as

η_θ(Γx) = β₀ + β⊤ ψ(γ₀ + Λ⊤(Γx))   (4.6)

where Γ is a d × p projection matrix, β and γ₀ are k × 1 vectors, and Λ = [γ₁, ..., γ_k] is now a d × k matrix. Thus, in the projected space the number of parameters reduces from K = kp + 2k + 1 to K_Γ = kd + 2k + 1. This leads to the following model under the projected space:

P(Y = 1 | X = x) = σ(η_θ(Γx)) = 1 − P(Y = 0 | X = x)   (4.7)

Experiments with different projection matrices suggested the use of the one in [52]. In this method, we draw the elements Γ_ij independently, setting Γ_ij = 1/√a* with probability a*², 0 with probability 2a*(1 − a*), and −1/√a* with probability (1 − a*)², with the rows of Γ then normalized using Gram-Schmidt orthogonalization. The parameter a* ∈ (0.1, 1) provides a handle on the sparsity of the projection matrix. We do not rely on the data to generate Γ. Also, the algorithmic implementation discussed here can be generalized to any arbitrary class of projection matrices.

4.2.3 Prior choice

For the neural network η_θ(Γx) based on the projected input Γx, we assume an independent Gaussian prior on each entry of θ_Γ, i.e., p(θ_Γ | M_Γ) = MVN(μ_Γ, diag(σ_Γ)), where diag(σ_Γ) is a diagonal matrix. With this choice of prior and the likelihood in (4.7), the posterior distribution based on the compressed data set (y_i, Γx_i)_{i=1}^n is given by

π(θ_Γ | M_Γ) = L(θ_Γ | M_Γ) p(θ_Γ | M_Γ) / ∫ L(θ_Γ | M_Γ) p(θ_Γ | M_Γ) dθ_Γ   (4.8)

where M_Γ is the model induced by the random matrix Γ, with corresponding likelihood L(θ_Γ | M_Γ) = Π_{i=1}^n exp(y_i η_θ(Γx_i) − log(1 + exp(η_θ(Γx_i)))). The denominator in (4.8) is free of θ_Γ.
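The sparse projection matrix just described can be generated as follows (a sketch; a QR factorization stands in for classical Gram-Schmidt to orthonormalize the rows):

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_rp(d, p, a=0.3):
    # Entries are +1/sqrt(a) w.p. a^2, 0 w.p. 2a(1-a), -1/sqrt(a) w.p.
    # (1-a)^2; the rows are then orthonormalized (Gram-Schmidt via QR).
    vals = np.array([1 / np.sqrt(a), 0.0, -1 / np.sqrt(a)])
    probs = np.array([a ** 2, 2 * a * (1 - a), (1 - a) ** 2])
    G = rng.choice(vals, size=(d, p), p=probs)
    Q, _ = np.linalg.qr(G.T)       # QR on G^T orthonormalizes the rows of G
    return Q.T

Gamma = sparse_rp(10, 200, a=0.3)
print(np.allclose(Gamma @ Gamma.T, np.eye(10)))   # True: orthonormal rows
```

Smaller values of a make the raw matrix sparser before orthonormalization; the data never enter the construction, so the same Γ can be reused across models.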
4.3 Variational Bayes model averaging for pooling multiple instances of random projection

4.3.1 Bayesian model averaging

Ensemble learning methods are widely used in the machine learning literature to pool across varying classifiers to solve a given problem [32]. In this section, we address the same problem from a Bayesian perspective. Let A denote a subspace of the space of all random matrices. We assume that each RP matrix Γ induces a separate model M_Γ, Γ ∈ A, on the data D = (y_i, x_i)_{i=1}^n. Thus, the predictive distribution of a new observation y_{n+1} given x_{n+1} is

p(y_{n+1} | x_{n+1}, D) = ∫ p(y_{n+1} | x_{n+1}, M_Γ, θ_Γ, D) π(M_Γ, θ_Γ | D) dμ(M_Γ, θ_Γ)   (4.9)

where μ is the product of counting and Lebesgue measures. Note that, in the implementation of (4.9), the most difficult quantity to compute is π(M_Γ, θ_Γ | D). In [52], explicit forms could be obtained for the linear regression model, something which is next to impossible for the convoluted neural network structure. In the next section, we circumvent this issue using variational inference.

4.3.2 ELBO derivation

Let π(M_Γ, θ_Γ | D) denote the joint density of the parameters and the model conditional on the data. We posit a variational distribution q(M_Γ, θ_Γ) of the form q(θ_Γ | M_Γ) ~ MVN(m_Γ, diag(s_Γ)), where diag(s_Γ) is a diagonal matrix, and q(M_Γ) are the weights of the individual models. Thus, our variational family may be expressed as

Q_n = { q(M_Γ, θ_Γ) = q(M_Γ) q(θ_Γ | M_Γ) = q(M_Γ) (2π)^{-K_Γ/2} |diag(s_Γ)|^{-1/2} exp( -(θ_Γ − m_Γ)⊤ (diag(s_Γ))^{-1} (θ_Γ − m_Γ) / 2 ) }

The optimal variational distribution minimizes the Kullback-Leibler divergence between π(· | D) and the variational family Q_n. Thus, q* = argmin_{q ∈ Q_n} d_KL(q, π(· | D)), where

d_KL(q, π(· | D)) = E_Q(log q(θ_Γ, M_Γ) − log π(θ_Γ, M_Γ | D)) = log π(D) − ELBO

with ELBO = E_Q(log π(θ_Γ, M_Γ, D) − log q(θ_Γ, M_Γ)). Since log π(D) is independent of θ_Γ and M_Γ, we have q*(M_Γ, θ_Γ) = argmax_{q ∈ Q_n} ELBO.
We next simplify the ELBO as

E_Q(log π(θ_Γ, M_Γ, D) − log q(θ_Γ, M_Γ))
  = Σ_{Γ∈A} q(M_Γ) E_{q(θ_Γ|M_Γ)}( log π(D | M_Γ, θ_Γ) + log π(θ_Γ | M_Γ) + log π(M_Γ) − log q(θ_Γ | M_Γ) − log q(M_Γ) )
  = Σ_{Γ∈A} q(M_Γ) E_{q(θ_Γ|M_Γ)}( log L(θ_Γ | M_Γ) + log p(θ_Γ | M_Γ) + log π(M_Γ) − log q(θ_Γ | M_Γ) − log q(M_Γ) )
  = Σ_{Γ∈A} q(M_Γ)( L(· | M_Γ) + log π(M_Γ) − log q(M_Γ) )

where L(· | M_Γ) = E_{q(θ_Γ|M_Γ)}( log L(θ_Γ | M_Γ) + log p(θ_Γ | M_Γ) − log q(θ_Γ | M_Γ) ) is nothing but the ELBO under the model M_Γ. Note that the derivative of the ELBO with respect to the variational parameters m_Γ, s_Γ is given by

∇_{m_Γ, s_Γ} ELBO = q(M_Γ) ∇_{m_Γ, s_Γ} L(· | M_Γ)

Since q(M_Γ) is just a constant, the gradient update for the model-specific variational parameters is nothing but the gradient update from each individual model. Also, equating the derivative of the ELBO with respect to q(M_Γ) to zero, we get

∇_{q(M_Γ)} ELBO = 0 ⟹ log π(M_Γ) − log q(M_Γ) + L(· | M_Γ) − 1 = 0 ⟹ q(M_Γ) ∝ exp(log π(M_Γ) + L(· | M_Γ))

Thus, the optimal model weights are q*(M_Γ) = exp(log π(M_Γ) + L*(· | M_Γ)) / Σ_{Γ'∈A} exp(log π(M_Γ') + L*(· | M_Γ')), where L*(· | M_Γ) is the optimal ELBO under model M_Γ. Note that the main advantage of the above derivation is that the models can be trained individually in a parallel fashion, and the final model weights depend only on the final ELBO values from each model.

4.4 Intrinsic dimensionality and prediction

4.4.1 Optimal dimension neighborhood selection

Let d_Γ × p denote the dimension of an RP matrix Γ ∈ A. Using Section 4.3, one can obtain the posterior model weights q*(M_Γ). The values of d_Γ with the largest posterior model weights q*(M_Γ) tend to concentrate around the optimal dimension of the feature space. Let d₁ < d₂ < ⋯ be an enumeration of the unique values of d_Γ, Γ ∈ A. Define the average posterior probability of each dimension value d_i as

q_i* = (1/|A_i|) Σ_{Γ∈A_i} q*(M_Γ)

where A_i = {Γ ∈ A : d_Γ = d_i}. The plot of (i, q_i*) attains its peak around the optimal dimension of the feature space for prediction of the response.
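Putting the two pieces above together, a small sketch (the ELBO values and projected dimensions below are illustrative only) of computing the optimal model weights q*(M_Γ) via a numerically stable softmax and then averaging them over projections that share the same projected dimension:

```python
import numpy as np

def model_weights(elbos, log_prior):
    # q*(M) proportional to exp(log prior + optimal ELBO); subtract the
    # max logit before exponentiating (log-sum-exp trick) for stability.
    logits = np.asarray(elbos) + np.asarray(log_prior)
    logits = logits - logits.max()
    w = np.exp(logits)
    return w / w.sum()

# Illustrative final ELBOs for eight projections with projected
# dimensions d in {5, 6, 7, 8}, under a uniform model prior.
dims = np.array([5, 5, 6, 6, 7, 7, 8, 8])
elbos = np.array([-1540.2, -1538.9, -1531.5, -1530.8,
                  -1528.3, -1529.1, -1533.7, -1534.4])
w = model_weights(elbos, np.log(np.full(8, 1 / 8)))

# Average the weights over projections sharing the same dimension;
# the peak of (d_i, q_i*) estimates the intrinsic dimensionality.
uniq = np.unique(dims)
q_bar = np.array([w[dims == d].mean() for d in uniq])
d_star = uniq[q_bar.argmax()]
print(d_star)   # modal projected dimension
```

Because raw ELBOs are large negative numbers, exponentiating them directly would underflow; shifting by the maximum leaves the normalized weights unchanged.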
Let d* = argmax_i q_i*; then, for some ν₁, ν₂ > 0,

I_{d*} = [⌊d*(1 − ν₁)⌋, ⌈d*(1 + ν₂)⌉]   (4.10)

is the optimal dimension neighborhood used for the final classification task. Finally, let A_{d*} be the subspace of RP matrices of dimension d_Γ × p with d_Γ ∈ I_{d*}.

4.4.2 Classification based on the optimal neighborhood choice

Using Section 4.3, obtain the variational distribution q*(M_Γ, θ_Γ) for every Γ ∈ A_{d*}. Let η̂_Γ = ∫ η_θ(Γx_{n+1}) q*(θ_Γ | M_Γ) dθ_Γ be the variational Bayes estimator under model M_Γ. Define

η̂(x_{n+1}) = Σ_{Γ∈A_{d*}} q*(M_Γ) η̂_Γ(x_{n+1})   (4.11)

Based on η̂(x_{n+1}), define the classification rule as

ŷ_{n+1} = 1[η̂(x_{n+1}) ≥ 0]   (4.12)

Remark: The proposed estimator η̂(x_{n+1}) is not the exact variational Bayes estimator of η(x_{n+1}) = log(P(y_{n+1} = 1 | x_{n+1}) / P(y_{n+1} = 0 | x_{n+1})). However, it is a good enough approximation for a sufficiently large training size, and it is computationally much faster, especially when the number of models and test samples is large.

4.5 Algorithm and its implementation

Algorithm 3 RPVBNN

1. Initialization: (m_Γ⁰, s_Γ⁰, ρ_Γ^{t})_{Γ∈A}, where ρ_Γ^{t}, t ≥ 0, is the step-size sequence for model M_Γ.
2. Parallelization:
   a) Set t = 1.
   b) For Γ ∈ A, calculate the gradient of L̂(· | M_Γ) in (4.14) with respect to m_Γ and s_Γ.
   c) Update the parameters m_Γ^t and s_Γ^t as
      m_Γ^{t+1} = m_Γ^t + ρ_Γ^t ∇_{m_Γ} L̂(· | M_Γ)|_{m_Γ = m_Γ^t}
      s_Γ^{t+1} = s_Γ^t + ρ_Γ^t ∇_{s_Γ} L̂(· | M_Γ)|_{s_Γ = s_Γ^t}
   d) Set t = t + 1.
   e) Repeat steps (b)-(d) till convergence.
3. Model averaging:
   a) For the optimized values (m_Γ*, s_Γ*)_{Γ∈A}, compute (L̂*(· | M_Γ))_{Γ∈A} using (4.14).
   b) Compute the model weights
      q*(M_Γ) = exp(log π(M_Γ) + L̂*(· | M_Γ)) / Σ_{Γ'∈A} exp(log π(M_Γ') + L̂*(· | M_Γ'))
4. Optimal neighborhood selection: Using the values (q*(M_Γ))_{Γ∈A}, compute
   a) the optimal neighborhood I_{d*} as in (4.10);
   b) the subspace A_{d*} based on I_{d*} (see Section 4.4.1).
5. Classification:
   a) Repeat steps (1)-(3) for Γ ∈ A_{d*}.
   b) Compute η̂(x_{n+1}) and ŷ_{n+1} using relations (4.11) and (4.12), respectively.

Gradient update equations.
For q(θ_Γ | M_Γ) = Π_{j=1}^{K_Γ} (2π s_j²)^{-1/2} exp(−(θ_j − m_j)²/(2 s_j²)) and p(θ_Γ | M_Γ) = Π_{j=1}^{K_Γ} (2π σ_j²)^{-1/2} exp(−(θ_j − μ_j)²/(2σ_j²)), the ELBO is

L(· | M_Γ) = E_{q(θ_Γ|M_Γ)}(log L(θ_Γ | M_Γ)) − d_KL(q(· | M_Γ), p(· | M_Γ))

where d_KL(q(· | M_Γ), p(· | M_Γ)) is given by

Σ_{j=1}^{K_Γ} ( log(σ_j / s_j) + s_j²/(2σ_j²) + (m_j − μ_j)²/(2σ_j²) − 1/2 )   (4.13)

Since E_Q(log L(θ_Γ | M_Γ)) cannot be computed explicitly, we generate W samples θ_Γ[1], ..., θ_Γ[W] from q(θ_Γ | M_Γ) and compute Ê(log L(· | M_Γ)) = (1/W) Σ_{w=1}^{W} log L(θ_Γ[w] | M_Γ). Thus, the final objective function, optimized with respect to the parameters m_Γ and s_Γ, is given by

L̂(· | M_Γ) = Ê(log L(· | M_Γ)) − d_KL(q(· | M_Γ), p(· | M_Γ))   (4.14)

Remark: Since the variance parameter s_j must always be positive, we consider the reparametrization s_j = log(1 + e^{s̃_j}) and update the parameters s̃_j instead, where ∇_{s̃_j} L̂(· | M_Γ) = (e^{s̃_j}/(1 + e^{s̃_j})) ∇_{s_j} L̂(· | M_Γ)|_{s_j = log(1+e^{s̃_j})}, with ∇_{s_j} L̂(· | M_Γ)|_{s_j = log(1+e^{s̃_j})} the derivative of L̂(· | M_Γ) with respect to s_j evaluated at s_j = log(1 + e^{s̃_j}).

4.6 Theoretical results

In this section, we study the convergence properties of the variational posterior for a given projection matrix Γ (without model averaging). The results presented here are similar in spirit to the notion of posterior consistency in [52]. Let f₀(y, x) and f_θ(y, x) be the joint densities of the data D = (y_i, x_i)_{i=1}^n under the truth and under the model, respectively. Without loss of generality, we assume x_i ~ U[0, 1]^p, which implies f₀(x) = f_θ(x) = 1. This implies that the joint distribution of (y_i, x_i)_{i=1}^n depends only on the conditional distribution of Y | X = x. Thus, under the model indexed by the projection matrix Γ,

f_θ^Γ(y, x) = f_θ(y | x) f_θ(x) = ℓ_θ^Γ(y, x),  f₀(y, x) = f₀(y | x) f₀(x) = ℓ₀(y, x)   (4.15)

where ℓ_θ^Γ(y, x) = exp(y η_θ(Γx) − log(1 + exp(η_θ(Γx)))) and ℓ₀(y, x) = exp(y η₀(x) − log(1 + exp(η₀(x)))) are defined respectively.
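The closed-form KL term (4.13) and the softplus reparameterization from the remark in Section 4.5 can be sketched as follows (a minimal illustration with arbitrary values; the convention N(m, diag(s²)) is used in the comments):

```python
import numpy as np

def kl_diag_gauss(m, s, mu, sigma):
    # KL(q || p) for q = N(m, diag(s^2)) and p = N(mu, diag(sigma^2)):
    # sum_j [ log(sigma_j/s_j) + (s_j^2 + (m_j - mu_j)^2)/(2 sigma_j^2) - 1/2 ]
    m, s, mu, sigma = map(np.asarray, (m, s, mu, sigma))
    return np.sum(np.log(sigma / s)
                  + (s ** 2 + (m - mu) ** 2) / (2 * sigma ** 2) - 0.5)

# Softplus reparameterization keeps the variational scales positive:
# s_j = log(1 + exp(s_tilde_j)) with s_tilde_j unconstrained.
softplus = lambda x: np.log1p(np.exp(x))
s = softplus(np.array([-2.0, 0.0, 3.0]))

print(kl_diag_gauss(np.zeros(3), s, np.zeros(3), np.ones(3)))  # nonnegative
print(kl_diag_gauss([0.0], [1.0], [0.0], [1.0]))               # 0.0 when q = p
```

Because the KL term is available in closed form, only the expected log-likelihood needs Monte Carlo samples, which keeps the variance of the overall gradient of (4.14) lower than a fully sampled objective.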
We next define the Hellinger neighborhood of the true density function $f_0 = \ell_0$ as $\mathcal{U}_\varepsilon = \{\theta : d_H(\ell_0, \ell_\theta) > \varepsilon\}$, where
\[
2 d_H^2(\ell_0, \ell_\theta) = \int_x \sum_y \left( \sqrt{\ell_0(y, x)} - \sqrt{\ell_\theta(y, x)} \right)^2 dx. \tag{4.16}
\]
We next give the set of conditions which ensure that the variational posterior for a given projection matrix $\Gamma$ is consistent for the true density function $f_0$. Recall that $p$ is the total number of covariates and $k$ is the number of nodes. If the dimension of $\Gamma$ is $d \times p$, then the total number of parameters in the model indexed by $\Gamma$ is $K_\Gamma = 1 + k + k(d + 1)$, i.e. $\theta_\Gamma$ is a $K_\Gamma \times 1$ vector.

Let $\eta^*_\theta(x) = \beta_0^* + \sum_{j=1}^k \beta_j^* \psi(\gamma_j^{*\top} \Gamma x)$ be the neural network which can approximate the true function $\eta_0(x)$ in the $L_1$ norm. The existence of such a neural network is guaranteed by [60]. Suppose $\Gamma$ is an orthonormal projection matrix, the prior is $p(\theta_\Gamma) = \mathrm{MVN}(\boldsymbol{\mu}_\Gamma, \mathrm{diag}(\boldsymbol{\sigma}_\Gamma^2))$, $n\epsilon_n^2 \to \infty$, and the following conditions hold:

1. (C1): $k \log n = o(n\epsilon_n^2)$, $p = o(e^{n\epsilon_n^2})$.
2. (C2): $\|\boldsymbol{\mu}_\beta\|_2^2 = o(n\epsilon_n^2)$, $\log \|\boldsymbol{\sigma}_\beta\|_\infty = O(\log n)$, $\|\boldsymbol{\sigma}_\beta^{-1}\|_\infty = O(1)$, $\sum_{j=1}^k \|\Gamma^\top \boldsymbol{\mu}_{j\gamma}\|_1 = O(1)$, $\sup_{j=1,\cdots,k} \log \|\boldsymbol{\sigma}_{j\gamma}\|_\infty = O(\log n)$, $\sup_{j=1,\cdots,k} \|\boldsymbol{\sigma}_{j\gamma}^{-1}\|_\infty = O(1)$.
3. (C3): $\|\eta_0 - \eta^*_\theta\|_1 = o(\epsilon_n^2)$, $\|\boldsymbol{\beta}^*\|_2^2 = o(n\epsilon_n^2)$, $\sup_{j=1,\cdots,k} \|(I - \Gamma^\top\Gamma)\gamma_j^*\|_1 = o(n^{-1})$, $\sum_{j=1}^k \|\gamma_j^*\|_1 = O(1)$.
4. (C4): $\log \|\boldsymbol{\sigma}_{\Gamma x}\|_\infty = O(\log n)$, $1/\|\boldsymbol{\sigma}_{\Gamma x}\|_\infty = o(n\epsilon_n^2)$.

Condition (C1) gives restrictions on the number of effective parameters ($\sim kd$) and the true number of covariates ($\sim p$). Condition (C2) puts restrictions on the growth of the prior parameters. Note, although the condition $\sum_{j=1}^k \|\Gamma^\top \boldsymbol{\mu}_{j\gamma}\|_1 = O(1)$ seems to depend on the matrix $\Gamma$, it can be easily ensured by setting $\boldsymbol{\mu}_{j\gamma} = 0$. Condition (C3) quantifies how fast the neural network solution converges to the true function while keeping the magnitude of its coefficients under control. Although the condition $\sup_{j=1,\cdots,k} \|(I - \Gamma^\top\Gamma)\gamma_j^*\|_1 = o(n^{-1})$ is restrictive, it holds for any $\gamma_j^*$ in the column space of the projection matrix $\Gamma$.
Condition (C4) for projection matrices relates to condition (iii) in Theorem 3.1 of [52]. For the posterior in (4.8), let the variational posterior be
\[
q^* = \arg\max_{q \in \mathcal{Q}_n} \mathrm{ELBO}\big(q, \pi(\cdot|\mathcal{D}, M_\Gamma)\big)
\]
where $\mathcal{Q}_n = \{q(\theta) = \mathrm{MVN}(m_\Gamma, \mathrm{diag}(s_\Gamma^2))\}$. For a fixed $\Gamma$, one can obtain $q^*$ by following step 2 of Algorithm 3.

Theorem: Suppose $n\epsilon_n^2 \to \infty$ and conditions (C1)-(C4) hold. Then, for any $\varepsilon > 0$,
\[
P_0^n\, q^*\big(\mathcal{U}_{\varepsilon\epsilon_n^2}\big) \to 0
\]
where $P_0^n$ is the joint distribution of $(y_i, x_i)_{i=1}^n$ under (4.1). The proof has been presented in the supplement section. The above result shows that the variational posterior $q^*$ concentrates around shrinking Hellinger neighborhoods of the true density $f_0$ with overwhelming probability.

4.7 Numerical Study

4.7.1 Problem setup

We mimic the RP generation mechanism as in Section 4.2.2. We fix the value of $a^*$ at 0.3 for this whole section. We also experimented with the RP mechanism in [14], where $\Gamma$ is taken to be the matrix of left singular vectors in the decomposition of $\tilde{\Gamma}$, where $\tilde{\Gamma}$ has all entries drawn from the $N(0, 1)$ distribution. However, we omit the results since no significant improvement was observed. Further, Algorithm 3 is also sensitive to the choice of the learning rate $\rho$, the number of projections and the batch size. In this thesis, we do not explore the sensitivity with respect to these parameters due to their small impact. For the number of projection matrices, we followed the 2-power rule: if the range of the projected dimension $d$ is $[p_1, p_2]$, the number of projections is chosen as $N = \min\{2^i : 2^i/(p_2 - p_1) > u\}$. This ensures that approximately $u$ values of $d$ are chosen from any unit sub-interval of $[p_1, p_2]$. We employ parallel programming across the different projections to reduce computational time. We first learn the optimal dimension of the data and then use it to improve the predictive accuracy.
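The optimal-neighborhood rule (4.10), the model-averaged classifier (4.11)-(4.12), and the 2-power rule described above can be sketched as follows (a minimal sketch: the function names, the dictionary-based interfaces, and the parameter `u` in the 2-power rule are my own notation, not the thesis implementation):

```python
import math

def optimal_neighborhood(weight_by_dim, a1=0.3, a2=0.3):
    # Mode of the (average) posterior weight over projected dimensions, then the
    # interval I_{d*} = [floor(d*(1-a1)), ceil(d*(1+a2))] as in (4.10).
    d_star = max(weight_by_dim, key=weight_by_dim.get)
    return d_star, (math.floor(d_star * (1 - a1)), math.ceil(d_star * (1 + a2)))

def model_averaged_label(model_weights, model_logits):
    # eta_hat(x) = sum_Gamma q*(M_Gamma) * eta_hat_Gamma(x), then y_hat = 1[eta_hat >= 0],
    # as in (4.11)-(4.12); model_logits holds each model's variational-mean logit.
    eta = sum(model_weights[g] * model_logits[g] for g in model_weights)
    return eta, int(eta >= 0)

def num_projections(p1, p2, u=2):
    # "2-power rule": the smallest power of two giving roughly u candidate
    # dimensions per unit sub-interval of [p1, p2] (u is an assumed density knob).
    i = 0
    while 2**i / (p2 - p1) <= u:
        i += 1
    return 2**i
```

With a peaked weight profile such as `{3: 0.1, 5: 0.6, 7: 0.2}` and the default $a_1 = a_2 = 0.3$, the neighborhood comes out as $(3, 7)$, matching the small-simulated-data example reported later.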
We analyze the performance of the algorithm on four data sets: two simulated data sets generated using a non-linear function of the input features, and two real data sets obtained from neuroimaging and computer vision studies.

4.7.2 Datasets

We consider four data sets. The details of each are summarized in Table 4.1. In the first two cases, we use the non-linear system of [80] to generate observations with varying numbers of features. For these two data sets, the intrinsic dimensionality of the feature space is defined by the number of active variables used in the data generation. We employ our algorithm to check whether it can recover the intrinsic dimensionality of the feature space and provide desirable classification accuracy. For the remaining two data sets, the intrinsic dimensionality is unknown. In the third case, we use the ADNI data set from neuroimaging studies. In the last case, we use the MNIST data set from computer vision studies. The proposed algorithm allows us to learn the intrinsic dimensionality of the feature space in both the neuroimaging and computer vision applications. For implementation, all input variables are z-normalized. We first employ RPVBNN to learn the intrinsic dimensionality of the data set. With knowledge of the optimal dimensionality, we compare RPVBNN with several traditional algorithms (logistic regression, random forest and gradient boosting) and with VBNN, the standard variational Bayes neural network based on the whole feature space. For the simulated data sets and ADNI, to prevent over-fitting and optimistically biased estimates of model performance, we consider 10 different splits of the data into train and test sets. We report the mean and standard deviation of the train and test accuracy and AUC score over the 10 splits. For the MNIST data, since author-defined splits exist [73], we only report the train and test accuracy and algorithm run time.
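The evaluation protocol just described — z-normalization of the inputs, then repeated 7:3 train/test splits with the mean and standard deviation of test accuracy reported — can be sketched as follows (a minimal sketch; the `fit_predict` argument is a hypothetical stand-in for any of the classifiers compared here):

```python
import numpy as np

def z_normalize(X):
    # Standardize each input variable (the z-normalization applied before fitting).
    return (X - X.mean(axis=0)) / X.std(axis=0)

def repeated_split_eval(X, y, fit_predict, n_splits=10, train_frac=0.7, seed=0):
    # Mean and standard deviation of test accuracy over repeated random 7:3 splits.
    # `fit_predict` maps (X_train, y_train, X_test) -> predicted labels.
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        n_tr = int(train_frac * len(y))
        tr, te = idx[:n_tr], idx[n_tr:]
        yhat = fit_predict(X[tr], y[tr], X[te])
        accs.append(np.mean(yhat == y[te]))
    return float(np.mean(accs)), float(np.std(accs))
```

Reporting both the mean and the spread over splits guards against the optimistic bias of a single lucky partition.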
4.7.3 Simulated data

We generate two simulated data sets from
\[
y = \begin{cases} 1, & \text{if } e^{x_1} + x_2^2 + 5\sin(x_3 x_4) - 3 > 0 \\ 0, & \text{otherwise} \end{cases} \tag{4.17}
\]
Since the number of active variables is 4, the intrinsic dimensionality of the feature space is 4. To test whether RPVBNN can capture the intrinsic dimensionality, we consider two simulated data examples: 1) small simulated data and 2) large simulated data. For the small simulated data, we work with a smaller number of covariates ($p = 20$); for the large simulated data, a larger number of covariates is used ($p = 200$). Note, for both data sets $n = 3000$ observations are generated from (4.17), with $x \sim \mathrm{MVN}(0, \Sigma)$, where $\sigma_{ii} = 0.5$ and $\sigma_{ij} = 0.25$, and $x$ has dimension $p = 20$ and $p = 200$ under the small and large data sets respectively. The large simulated data exemplifies the small-$n$-large-$p$ problem. For the 10 cross-validation splits, the ratio of observations in the training and test sets is 7:3; thus the train and test data sets have 2100 and 900 subjects respectively.

4.7.4 ADNI Data

We utilized the data provided by the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI). ADNI is an ongoing joint public-private effort to utilize neuroimaging, other biological markers, and clinical and neuropsychological assessment to measure the incidence and progression of MCI to early AD. The data used consisted of 819 subjects with baseline characteristics, genetics and a diagnosis of MCI. For consistency, we only consider patients whose follow-up period was at least 36 months and who had no missing values. The final sample included $n = 265$ subjects, comprising participants who were stable in their diagnosis (MCI-S) and those who converted to a diagnosis of AD over 3 years (MCI-C). We used $p = 277$ variables, which included diagnosis, neuropsychological test scores, the epsilon-4 allele of the apolipoprotein E (APOE) gene, and region-of-interest (ROI) level features derived from T1 magnetic resonance imaging (MRI).
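The simulated-data mechanism (4.17) of Section 4.7.3 can be sketched as follows (a minimal sketch; the seed, function name and generator call are my own choices):

```python
import numpy as np

def make_sim_data(n=3000, p=20, seed=0):
    # Simulated data per (4.17): y = 1 if exp(x1) + x2^2 + 5*sin(x3*x4) - 3 > 0, else 0,
    # with x ~ MVN(0, Sigma), sigma_ii = 0.5 and sigma_ij = 0.25 (i != j). Only the
    # first four covariates are active, so the intrinsic dimensionality is 4.
    rng = np.random.default_rng(seed)
    Sigma = np.full((p, p), 0.25) + 0.25 * np.eye(p)   # diagonal 0.5, off-diagonal 0.25
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = (np.exp(X[:, 0]) + X[:, 1]**2 + 5 * np.sin(X[:, 2] * X[:, 3]) - 3 > 0).astype(int)
    return X, y
```

Setting `p=20` gives the small simulated data and `p=200` the large one; the remaining $p - 4$ covariates are pure noise with respect to the response.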
Analogous to the simulated examples, the ratio of subjects in training and testing was 7:3 for the 10 splits.

Table 4.1: Summary of data, where n, p and c denote the numbers of samples, features and classes.
Data                   Source  n      p    c
Small simulated data   [80]    3000   20   2
Large simulated data   [80]    3000   200  2
ADNI                   [62]    264    278  2
MNIST                  [73]    70000  784  10

4.7.5 MNIST Data

In addition to ADNI, we evaluate model performance on a computer vision data set, MNIST. The MNIST data set is a large collection of handwritten digits derived from the National Institute of Standards and Technology (NIST). The MNIST data set contains $n = 70000$ images, with 60000 and 10000 in the train and test sets respectively, and a feature space of dimension $p = 784$ [73].

4.8 Results

4.8.1 Optimal dimensional region

Figure 4.1: Small simulated data: $N = 32$

Using Section 4.4.1, we obtain the average posterior probabilities of the projected dimensions. For all four data sets, we employ RPVBNN with $k = 32$ nodes to learn the intrinsic dimension. Since obtaining the optimal dimension is a preprocessing step, we avoided experimentation with the number of nodes in this step. Figures 4.1, 4.2, 4.3 and 4.4 give the average probability density curve as a function of the projected dimension for the small and large simulated data sets, ADNI and MNIST respectively.

Figure 4.2: Large simulated data: $N = 128$

Figure 4.3: ADNI data: $N = 128$

The intrinsic dimensionality estimate corresponds to the mode of this density curve, while the optimal dimension neighborhood is a small interval around this posterior mode. For the small simulated data, Figure 4.1 shows a dramatic growth in the average posterior probability as the number of projected dimensions increases from 3 to 5, with a significant drop when the number of projected features reaches 7, followed by stabilization after 8. Thus, with a peak around 5, the optimal dimension neighborhood for the small data is taken to be (3,7).
Figure 4.4: MNIST data: $N = 128$

For the large data, Figure 4.2 shows that the average posterior probability peaks between 3 and 8 and stabilizes after 10. Thus, the optimal neighborhood for the large data was taken to be (3,10). Note, for both the small and large simulated data sets, the true intrinsic dimensionality was 4. The fact that the average posterior probability concentrates around 4 further corroborates that our algorithm learns the intrinsic dimensionality of the feature space well with regard to prediction of the response. Next, for the ADNI data, Figure 4.3 shows that the average posterior probability peaks between projected dimensions of 10 to 20, followed by stabilization after 30. Thus, the optimal dimension neighborhood for the ADNI data was chosen as (1,30). Finally, for the MNIST data, Figure 4.4 shows that the optimal dimension neighborhood can be chosen as (580,600).

4.8.2 Comparative Baselines

For the two simulated examples and the ADNI data, we consider 10 splits of the data as in Section 4.7. With the optimal dimension neighborhood, we use steps (4)-(5) of Algorithm 3 to obtain the mean and standard deviation of the train and test accuracy and AUC score of RPVBNN (see Tables 4.3, 4.4 and 4.5 respectively). In addition to the performance of RPVBNN, we also provide results from logistic regression with $L_1$ penalty (LR-$\ell_1$), random forest (RF) and gradient boosting (GB) as comparative baselines. In particular, we report the LR performance for varying values of $\lambda = 0.01, 0.1, 0.5, 1, 5, 10, 100$, together with the performance at $\lambda_0$, the optimal $\lambda$ obtained from 10-fold cross-validation.

Table 4.2: RPVBNN settings for evaluation
Data        N    Learning rate  Batch size  Optimal Region
Small data  16   0.01           256         (3,7)
Large data  64   0.01           256         (3,10)
ADNI        64   0.01           185         (1,30)
MNIST       128  0.01           512         (580,600)
To build both the RF and GB models, we start with 50 trees and increase the number of trees by 100 each time until we see either no improvement in test accuracy or an increase in the standard deviation of test accuracy [101]. Finally, for the same 10 splits, we report the results obtained using the VBNN algorithm, which works on the whole feature space without any compression (it is indeed a version of RPVBNN with $N = 1$, $d = p$ and $\Gamma = I_p$). The number of nodes for both RPVBNN and VBNN is varied as $k = 32, 64, 128$. For the MNIST data set, since author-defined train and test splits already exist, we only report the train and test accuracy and algorithm run time for RPVBNN (see Table 4.6). As a comparative baseline, we also provide the results of VBNN. For all the data sets, the details of the RPVBNN settings, including the optimal dimension neighborhoods, the number of projections, the learning rate and the batch size, are summarized in Table 4.2.

4.8.3 Experimental Results

For the small simulated data, as shown in Table 4.3, RPVBNN with 128 hidden nodes achieves a test accuracy and AUC of 94.88% and 95.88% respectively, which is considerably better than the performance of the other learning algorithms. Also, the impact of the number of nodes is minimal, which further justifies our obtaining the optimal dimension neighborhood using only $k = 32$ nodes. Whereas for the small simulated data set the second best performer was VBNN, its performance deteriorates significantly for the large simulated data set. This is because, with a large feature space of $p = 200$, the training size of 2100 is comparatively small. Since RPVBNN works with a compressed feature space of $d \in [3, 10]$, it still attains the best test accuracy and AUC, of 94.96% and 96.63% respectively (see Table 4.4). This clearly indicates that RPVBNN is an effective solution to the small-$n$-large-$p$ problem.
Also, since at each instance one works with the compressed feature space, one gains a huge advantage in both memory storage and computational efficiency as long as the intrinsic dimensionality of the feature space lies in a smaller-dimensional subspace (although multiple compressions are needed, one can leverage parallelization across compressions). The ADNI data, with $p = 277$ and a training sample of 180, is another example of a small-$n$-large-$p$ problem. RPVBNN still continues to outperform all its competitors (see Table 4.5), whereas VBNN suffers from the curse of dimensionality. Interestingly, overall, gradient boosting seems to be the second best performer after RPVBNN. For the MNIST data (see Table 4.6), note that VBNN, with the best test accuracy of 97.8%, slightly outperforms RPVBNN, whose best test accuracy is 97.32%. For MNIST, the training size $n = 60000$ is far larger than $p = 784$, so VBNN is the best performer. However, the average run time for one run based on 500 epochs using 512 nodes of VBNN is 2640 seconds, while the corresponding run time for RPVBNN with $d$ in the optimal dimension neighborhood is 2350 seconds.

Table 4.3: Small simulated data performance
Model    Setting       Train Acc(%)    Test Acc(%)     AUC(%)
LR-ℓ₁    λ = 10        67.63 ± 0.64    67.46 ± 1.04    68.04 ± 1.54
         λ = 1         67.59 ± 0.67    67.47 ± 0.97    68.07 ± 1.58
         λ = 0.1       67.32 ± 0.77    67.44 ± 0.97    68.47 ± 1.51
         λ = 0.01      65.49 ± 0.91    66.04 ± 1.36    68.77 ± 1.56
         λ₀ = 0.1      67.32 ± 0.77    67.44 ± 0.97    68.47 ± 1.51
RF       10 trees      66.83 ± 1.76    66.02 ± 1.84    73.76 ± 3.63
         25 trees      68.75 ± 1.91    67.92 ± 1.84    78.30 ± 3.67
         50 trees      68.90 ± 1.09    67.84 ± 1.65    80.07 ± 2.05
GB       10 trees      72.78 ± 0.78    72.05 ± 1.45    74.18 ± 1.58
         50 trees      79.78 ± 1.78    77.34 ± 1.73    88.74 ± 1.72
         100 trees     87.30 ± 1.52    83.22 ± 2.08    92.51 ± 0.88
         150 trees     90.81 ± 0.90    85.56 ± 2.01    93.78 ± 0.84
         250 trees     94.17 ± 0.81    86.91 ± 1.63    94.41 ± 0.66
         350 trees     94.07 ± 1.07    87.44 ± 1.98    94.62 ± 0.75
         450 trees     97.62 ± 0.62    87.80 ± 1.95    94.84 ± 0.86
VBNN     32 nodes      94.88 ± 0.52    89.84 ± 0.64    90.36 ± 0.71
         64 nodes      95.36 ± 0.38    90.27 ± 0.59    90.89 ± 0.53
         128 nodes     95.28 ± 0.45    90.28 ± 0.65    90.88 ± 0.56
RPVBNN   32 nodes      95.70 ± 0.40    94.77 ± 0.68    95.68 ± 0.56
         64 nodes      95.80 ± 0.42    94.80 ± 0.60    95.45 ± 0.64
         128 nodes     95.83 ± 0.31    94.88 ± 0.76    95.88 ± 0.43

Table 4.4: Large simulated data performance
Model    Setting       Train Acc(%)    Test Acc(%)     AUC(%)
LR-ℓ₁    λ = 10        71.34 ± 0.61    64.12 ± 1.58    65.88 ± 0.90
         λ = 1         71.39 ± 0.63    64.21 ± 1.34    66.13 ± 0.91
         λ = 0.1       70.90 ± 0.59    66.31 ± 1.35    68.13 ± 0.95
         λ = 0.01      65.75 ± 0.86    66.07 ± 1.09    70.62 ± 1.41
         λ₀ = 0.01     65.75 ± 0.86    66.07 ± 1.09    70.62 ± 1.41
RF       10 trees      62.78 ± 3.31    61.12 ± 3.70    66.21 ± 6.43
         25 trees      62.93 ± 3.93    61.69 ± 3.07    70.24 ± 3.43
         50 trees      60.04 ± 1.48    59.59 ± 1.58    71.82 ± 2.67
         100 trees     60.14 ± 1.61    59.54 ± 1.39    74.81 ± 2.36
GB       10 trees      73.51 ± 0.82    72.14 ± 1.78    74.43 ± 1.62
         50 trees      80.53 ± 1.30    77.45 ± 2.69    87.68 ± 3.15
         100 trees     87.79 ± 1.37    80.75 ± 2.36    90.69 ± 1.59
         150 trees     91.98 ± 1.06    82.25 ± 2.68    91.56 ± 1.57
         250 trees     96.31 ± 0.08    84.04 ± 2.99    92.41 ± 1.82
         350 trees     98.46 ± 0.05    84.64 ± 2.71    92.70 ± 1.71
         450 trees     99.40 ± 0.04    85.06 ± 2.55    92.98 ± 1.72
         550 trees     99.78 ± 0.01    84.97 ± 2.20    92.95 ± 1.70
VBNN     32 nodes      62.76 ± 1.21    60.88 ± 1.59    65.23 ± 1.78
         64 nodes      62.52 ± 1.71    60.90 ± 1.50    65.62 ± 1.39
         128 nodes     63.61 ± 1.13    61.42 ± 1.11    66.21 ± 1.06
RPVBNN   32 nodes      96.54 ± 0.22    94.70 ± 0.81    96.21 ± 0.45
         64 nodes      96.57 ± 0.41    94.89 ± 0.68    96.45 ± 0.63
         128 nodes     96.66 ± 0.28    94.96 ± 0.90    96.63 ± 0.41

Table 4.5: ADNI data performance
Model    Setting       Train Acc(%)    Test Acc(%)     AUC(%)
LR-ℓ₁    λ = 10        100.00 ± 0.00   65.75 ± 4.07    68.71 ± 3.75
         λ = 1         100.00 ± 0.00   63.25 ± 3.12    65.21 ± 4.44
         λ = 0.1       100.00 ± 0.00   61.00 ± 4.70    61.62 ± 4.00
         λ = 0.01      100.00 ± 0.00   60.12 ± 4.55    61.08 ± 4.15
         λ₀ = 10       100.00 ± 0.00   65.75 ± 4.07    68.71 ± 3.75
RF       10 trees      81.78 ± 2.23    68.87 ± 5.37    75.08 ± 5.18
         25 trees      82.38 ± 2.35    70.88 ± 4.43    79.29 ± 5.00
         50 trees      83.67 ± 1.79    72.00 ± 3.88    78.89 ± 4.90
         100 trees     83.08 ± 2.62    71.25 ± 4.50    79.95 ± 4.85
GB       10 trees      87.24 ± 1.51    73.25 ± 4.40    80.44 ± 4.23
         25 trees      95.41 ± 1.16    74.37 ± 5.34    81.32 ± 4.01
         50 trees      99.78 ± 0.35    74.87 ± 4.95    81.16 ± 4.17
         100 trees     100.00 ± 0.00   73.75 ± 3.95    80.79 ± 3.89
VBNN     32 nodes      62.51 ± 2.02    62.75 ± 4.67    62.54 ± 3.43
         64 nodes      62.51 ± 2.02    62.75 ± 4.67    62.54 ± 3.43
         128 nodes     62.51 ± 2.02    62.75 ± 4.67    62.54 ± 3.43
RPVBNN   32 nodes      78.57 ± 1.76    75.66 ± 3.80    81.88 ± 1.76
         64 nodes      78.62 ± 1.92    75.70 ± 4.85    82.12 ± 1.83
         128 nodes     78.84 ± 1.62    75.94 ± 3.84    82.33 ± 1.91

Table 4.6: MNIST data performance in terms of test accuracy and time (based on 500 epochs)
Model    Setting     Train Acc(%)  Test Acc(%)  Time(s)
VBNN     32 nodes    97.63         96.88        354
         128 nodes   98.08         97.33        758
         256 nodes   98.13         97.40        1385
         512 nodes   99.11         97.80        2640
RPVBNN   32 nodes    97.82         97.18        280
         128 nodes   97.84         97.29        720
         256 nodes   98.00         97.30        1143
         512 nodes   98.06         97.32        2350

To conclude, when $p \gg n$, RPVBNN offers the biggest advantage in terms of memory storage, computational efficiency and prediction accuracy, in addition to providing inference on the intrinsic dimensionality of the feature space.
For $n \gg p$, RPVBNN is equally competitive, while still providing computational and memory gains as long as the input resides in a smaller-dimensional subspace with respect to prediction.

4.9 Conclusion

In this chapter, we consider a variational Bayes neural network predictive model that addresses the curse of dimensionality (small $n$, large $p$) by compressing the feature space using RP matrices. To remove the sensitivity to the choice of the RP matrix, we propose a model averaging approach that bases our projection on the most relevant models. To improve computational complexity, we provide a variational inference technique which can estimate model-specific parameters and model weights at the same time. As a by-product, we use the posterior model weights of the projected dimensions to learn the intrinsic dimensionality of the feature space in the context of prediction. The variational inference approach proposed in the context of Bayesian model averaging has two advantages: (1) it has the computational gain of frequentist ensemble approaches, since one can parallelize across different models; (2) it provides uncertainty quantification associated with each random projection via posterior probabilities. The approach presented in this thesis can be generalized to a wide class of problems arising out of Bayesian neural networks which require learning of model importance or averaging across models.

CHAPTER 5

CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH

5.1 Conclusions and discussion

In this thesis, we first applied two machine learning methods (LR and SVM), under multiple conditions, to test accuracy in classifying patients with MCI who progress to clinically defined dementia (MCI-C) versus those who remain stable (MCI-S). Using multi-modal data from ADNI, we compared LR and SVM classification accuracy under different pre-selection dimension reduction techniques, i.e., feature selection as informed by prior findings in clinical neuroscience and by the $\ell_1$ norm. Notably, the present results demonstrate important boundaries for applying feature selection techniques in statistical classification of MCI-to-dementia conversion. Specifically, we found that while using the $\ell_1$ norm for pre-selection can improve accuracy, it also benefits from a more limited, theoretically based set of feature inputs. In addition, we found that model performance benefited from a longer window of assessment. These results have implications for studies utilizing multi-modal data for such classification, including features from clinical neuropsychological assessment, demographic and genetic markers, MRI-based volumetric brain measures, and other modalities. This thesis also demonstrates that SVM classifier performance is more stable than LR for dealing with the "large p" problem. Clinical researchers should note the value of evaluating different classification and pre-selection approaches in application to clinical or research questions, and be mindful that not all machine learning techniques are equally beneficial for modeling specific clinical outcomes. To further tackle high-dimensional data and the variability and complexity of big data, we introduce the variational Bayes neural network and provide the theoretical rigour and computational detail for BDNNs. Although variational Bayes is popular in machine learning, neither the computational method nor the statistical properties are well understood for complex models such as neural networks. We characterize the prior distributions and the variational family for consistent Bayesian estimation. The theory also quantifies the loss due to the VB numerical approximation compared to the true posterior distribution. For practical implementation, we reveal that the algorithm may not be as simple and straightforward as it sounds in the computer science literature; rather, it requires careful tuning of several parameters in various steps.
Nevertheless, the computation can be substantially faster than the popular Markov chain Monte Carlo procedures for approximating posterior distributions. Even though the BDNN achieved higher model performance in classifying the transition from MCI to dementia, it fails to address the curse of dimensionality and to learn the true dimensionality of the data. We then consider a variational Bayes neural network predictive model that addresses the curse of dimensionality (small $n$, large $p$) by compressing the feature space using RP matrices. To remove the sensitivity to the choice of the RP matrix, we propose a model averaging approach that bases our projection on the most relevant models. To improve computational complexity, we provide a variational inference technique which can estimate model-specific parameters and model weights at the same time. The derivation shows that the use of variational inference provides a huge advantage by allowing parallelization across the different models at hand. Unlike Markov chain Monte Carlo, the variational technique proposed in this thesis allows one to obtain the optimal model weights after the individual models have been trained, using just each model's evidence lower bound (ELBO) and the prior model weights. The approach is generalizable to a wide class of problems where Bayesian model averaging is next to impossible due to the large dimension of the data or an intractable likelihood.

5.2 Directions for future research

Future research is mainly focused on two aspects: the choice of prior structure, and Bayesian compressed deep neural networks. Although this thesis builds the framework on a multi-layer neural network model with a simplistic prior structure, the detailed statistical theory and computational methodology are quite involved. This investigation opens up the possibility of exploring a much wider class of models and priors.
For example, shrinkage priors, such as double exponential and horseshoe priors, can be explored for building sparse neural networks, or one can experiment with various other variational families. However, their computational details and associated statistical properties are not immediate. We hope this research will accelerate further development of the statistical and computational foundations for variational inference in general machine learning research. Moreover, we explored the sensitivity to the number of projections and the dimension of the projection empirically. However, further investigation is needed in order to obtain a statistically optimal solution. Another interesting direction to pursue will be studying the impact of different projections and quantifying prediction accuracy as a function of the projection. The current work presents a proof of concept for shallow networks; however, the methodology developed in this thesis can be extended to deep neural networks. Another interesting line of work will be the extension to more complex feature spaces, to learn the intrinsic dimensionality of those spaces.

APPENDICES

APPENDIX A

SUPPLEMENT FOR CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS

Algorithms of variational implementation

Algorithm 4 BBVI-RMS
1. Fix an initial value for the variational family parameters $\beta_q^1$.
2. Fix a step-size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$ and $\epsilon > 0$.
4. Simulate $W$ samples $\theta_n[1], \cdots, \theta_n[W]$ from $q(\cdot|\beta_q^t)$.
5. Compute $\widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q}$ as in (3.18).
6. Compute
   \[
   G_t = \widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q} - \nabla_{\beta_q^t} d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big), \qquad
   R_t = 0.9 R_{t-1} + 0.1 G_t^2
   \]
7. Update
   \[
   \beta_q^{t+1} = \beta_q^t + \rho_t \frac{G_t}{\sqrt{R_t + \epsilon}} \tag{A.1}
   \]
8. Set $t = t + 1$.
9. Repeat steps 4-7 until convergence of the ELBO, using $\widehat{\mathcal{L}}_{\beta_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\beta_q^t} - d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big)$.

Algorithm 5 BBVI-CV-RMS
1. Fix an initial value for the variational parameter $\beta_q^1$.
2. Fix a step-size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$.
4. Simulate $W$ samples $\theta_n[1], \cdots, \theta_n[W]$ from $q(\cdot|\beta_q^t)$.
5. Compute $\hat{c}_t^* = \mathrm{cov}(u_{1t}, u_{2t})/\mathrm{var}(u_{2t})$, where $u_{1t}$ and $u_{2t}$ are the same as in (3.22).
6. Compute $\widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q}$ as in (3.21).
7. Compute
   \[
   G_t = \widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q} - \nabla_{\beta_q^t} d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big), \qquad
   R_t = 0.9 R_{t-1} + 0.1 G_t^2
   \]
8. Update
   \[
   \beta_q^{t+1} = \beta_q^t + \rho_t \frac{G_t}{\sqrt{R_t + \epsilon}}
   \]
9. Set $t = t + 1$.
10. Repeat steps 4-8 until convergence of the ELBO, using $\widehat{\mathcal{L}}_{\beta_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\beta_q^t} - d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big)$.

With $q$ and $p$ as in (3.11) and (3.9) respectively,
\[
d_{\mathrm{KL}}(q, p) = \sum_{j=1}^{K_n} \left( \log\frac{\sigma_{jn}}{s_{jn}} + \frac{s_{jn}^2}{2\sigma_{jn}^2} + \frac{(m_{jn} - \mu_{jn})^2}{2\sigma_{jn}^2} - \frac{1}{2} \right)
\]
\[
\nabla_{m_{jn}} d_{\mathrm{KL}}(q, p) = \frac{m_{jn} - \mu_{jn}}{\sigma_{jn}^2}, \qquad
\nabla_{s_{jn}} d_{\mathrm{KL}}(q, p) = -\frac{1}{s_{jn}} + \frac{s_{jn}}{\sigma_{jn}^2}
\]
\[
\nabla_{m_{jn}} \mathcal{L}_{\beta_q} = E_{q(\cdot|\beta_q)}\left( \log L(\theta_n)\, \frac{\theta_{jn} - m_{jn}}{s_{jn}^2} \right), \qquad
\nabla_{s_{jn}} \mathcal{L}_{\beta_q} = E_{q(\cdot|\beta_q)}\left( \log L(\theta_n) \left( \frac{(\theta_{jn} - m_{jn})^2}{s_{jn}^3} - \frac{1}{s_{jn}} \right) \right)
\]

Preliminaries

A.0.1 Definitions

Definition A.0.1 For a vector $\boldsymbol{\alpha}$ and a function $g$,
1. $\|\boldsymbol{\alpha}\|_1 = \sum_i |\alpha_i|$, $\|\boldsymbol{\alpha}\|_2 = \sqrt{\sum_i \alpha_i^2}$, $\|\boldsymbol{\alpha}\|_\infty = \max_i |\alpha_i|$.
2. $\|g\|_1 = \int_{x \in \chi} |g(x)|\,dx$, $\|g\|_2 = \sqrt{\int_{x \in \chi} g(x)^2\,dx}$, $\|g\|_\infty = \sup_{x \in \chi} |g(x)|$.

Definition A.0.2 (Bracketing number and entropy) For any two functions $l$ and $u$, define the bracket $[l, u]$ as the set of all functions $f$ such that $l \le f \le u$. Let $\|\cdot\|$ be a metric. Define an $\varepsilon$-bracket as a bracket with $\|u - l\| \le \varepsilon$. Define the bracketing number of a set of functions $\mathcal{F}^*$ as the minimum number of $\varepsilon$-brackets needed to cover $\mathcal{F}^*$, and denote it by $N_{[\,]}(\varepsilon, \mathcal{F}^*, \|\cdot\|)$. Finally, the Hellinger bracketing entropy, denoted by $H_{[\,]}(\varepsilon, \mathcal{F}^*, \|\cdot\|)$, is the natural logarithm of the bracketing number ([104]).

Definition A.0.3 (Covering number and entropy) Let $(V, \|\cdot\|)$ be a normed space, and $\mathcal{F} \subset V$. $\{v_1, \cdots, v_n\}$ is an $\varepsilon$-covering of $\mathcal{F}$ if $\mathcal{F} \subset \cup_{i=1}^n B(v_i, \varepsilon)$, or equivalently, for every $\theta \in \mathcal{F}$ there exists $i$ such that $\|\theta - v_i\| < \varepsilon$. The covering number of $\mathcal{F}$ is denoted by $N(\varepsilon, \mathcal{F}, \|\cdot\|) = \min\{n : \exists\ \varepsilon\text{-covering of } \mathcal{F} \text{ of size } n\}$. Finally, the Hellinger covering entropy, denoted by $H(\varepsilon, \mathcal{F}, \|\cdot\|)$, is the natural logarithm of the covering number ([104]).
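As a concrete illustration of the BBVI-RMS updates (Algorithms 4-5) combined with the Gaussian score-function gradients displayed above, the following is a minimal one-dimensional sketch. The target log-likelihood, prior, step size and sample count are illustrative choices of mine, not the thesis settings, and no control variate is included:

```python
import numpy as np

rng = np.random.default_rng(1)

def bbvi_rms_step(m, s, log_lik, mu, sigma, R, rho=0.05, eps=1e-8, W=500):
    # One BBVI-RMS step for q(theta) = N(m, s^2):
    # score-function Monte Carlo gradients of E_q[log L], exact KL gradients
    # (as in the formulas above), and RMSprop scaling R_t = 0.9 R_{t-1} + 0.1 G_t^2.
    theta = m + s * rng.standard_normal(W)
    ll = np.array([log_lik(t) for t in theta])
    g_m = np.mean(ll * (theta - m) / s**2)                  # grad_m E_q[log L]
    g_s = np.mean(ll * ((theta - m)**2 / s**3 - 1.0 / s))   # grad_s E_q[log L]
    g_m -= (m - mu) / sigma**2                              # minus grad_m KL(q || p)
    g_s -= (-1.0 / s + s / sigma**2)                        # minus grad_s KL(q || p)
    G = np.array([g_m, g_s])
    R = 0.9 * R + 0.1 * G**2
    m, s = np.array([m, s]) + rho * G / np.sqrt(R + eps)    # update (A.1)
    return m, s, R
```

Run with a Gaussian-shaped log-likelihood centred at 2 and a diffuse $N(0, 10^2)$ prior, the variational mean drifts toward 2 within a few hundred steps.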
A.0.2 Lemmas

Lemma A.0.4 With $H_{[\,]}(u, \mathcal{F}_n, \|\cdot\|_2)$ as in Definition A.0.2, if $H_{[\,]}(u, \mathcal{F}_n, \|\cdot\|_2) \le K_n \log(M_n/u)$, then
\[
\int_0^\varepsilon \sqrt{H_{[\,]}(u, \mathcal{F}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\sqrt{K_n(\log M_n - \log \varepsilon)}
\]
Proof. See the proof of Lemma 7.14 in [6].

Lemma A.0.5 Suppose $q$ satisfies $\int d_{\mathrm{KL}}(\ell_0, \ell_{\theta_n})\, q(\theta_n)\,d\theta_n \le \varepsilon$. Then, for any $\nu > 0$,
\[
P_0^n\left( \int q(\theta_n) \log\frac{L(\theta_n)}{L_0}\,d\theta_n \le -n\nu \right) \le \frac{\varepsilon}{\nu}
\]
Proof. See the proof of Lemma 7.13 in [6].

Lemma A.0.6 Suppose $\mathcal{N}_\varepsilon = \{\theta_n : d_{\mathrm{KL}}(\ell_0, \ell_{\theta_n}) < \varepsilon\}$ and $\int_{\mathcal{N}_\varepsilon} p(\theta_n)\,d\theta_n \ge e^{-n\varepsilon}$, $n \to \infty$. Then, for any $\nu > 0$,
\[
P_0^n\left( \log \int \frac{L(\theta_n)}{L_0}\, p(\theta_n)\,d\theta_n \le -n\nu \right) \le \frac{2\varepsilon}{\nu}
\]
Proof. See the proof of Lemma 7.12 in [6].

Lemma A.0.7 Suppose $\int_{\mathcal{F}_n^c} p(\theta_n)\,d\theta_n \le e^{-n\varepsilon}$, $n \to \infty$, for any $\varepsilon > 0$. Then, for every $\tilde{\varepsilon} < \varepsilon$,
\[
P_0^n\left( \int_{\theta_n \in \mathcal{F}_n^c} \frac{L(\theta_n)}{L_0}\, p(\theta_n)\,d\theta_n \ge e^{-n\tilde{\varepsilon}} \right) \le e^{-n(\varepsilon - \tilde{\varepsilon})}
\]
Proof. See the proof of Lemma 7.16 in [6].

Lemma A.0.8 Let $\eta_{\theta_n^*}(x) = b_L^* + A_L^*\psi(b_{L-1}^* + A_{L-1}^*\psi(\cdots\psi(b_1^* + A_1^*\psi(b_0^* + A_0^* x))))$ be a fixed neural network. Let $\eta_{\theta_n}(x) = b_L + A_L\psi(b_{L-1} + A_{L-1}\psi(\cdots\psi(b_1 + A_1\psi(b_0 + A_0 x))))$ be a neural network such that
\[
|\theta_{jn} - \theta_{jn}^*| \le \frac{\varepsilon}{\sum_{v=0}^{L_n} \tilde{k}_{vn} \prod_{v'=v+1}^{L_n} a_{v'n}^*}
\]
where $\tilde{k}_{vn} = k_{vn} + 1$. Then
\[
\int_{x \in [0,1]^{p_n}} |\eta_{\theta_n}(x) - \eta_{\theta_n^*}(x)|\,dx \le \varepsilon
\]
Proof. In the proof, we suppress the dependence on $n$. Define the projection $P_V$ as $P_V \eta_\theta(x) = b_{V-1} + A_{V-1}\psi(\cdots\psi(b_1 + A_1\psi(b_0 + A_0 x)))$. We claim that
\[
|P_V \eta_\theta(x)[s] - P_V \eta_{\theta^*}(x)[s]| \le \frac{\varepsilon \sum_{v=0}^{V} \tilde{k}_v \prod_{v'=v+1}^{V} a_{v'}^*}{\sum_{v=0}^{L} \tilde{k}_v \prod_{v'=v+1}^{L} a_{v'}^*} \tag{A.2}
\]
We prove this by induction, starting with $V = 1$. Let $\tilde{\varepsilon} = \varepsilon / \sum_{v=0}^{L} \tilde{k}_v \prod_{v'=v+1}^{L} a_{v'}^*$; then
\[
|P_1 \eta_\theta(x)[s] - P_1 \eta_{\theta^*}(x)[s]| \le |b_1[s] - b_1^*[s]| + |A_1[s]^\top \psi(b_0 + A_0 x) - A_1^*[s]^\top \psi(b_0^* + A_0^* x)|
\]
\[
\le \tilde{\varepsilon} + \|A_1[s] - A_1^*[s]\|_1 + \sum_{s'=0}^{k_1} |A_1^*[s][s']|\,\big|\psi(b_0[s'] + A_0[s']^\top x) - \psi(b_0^*[s'] + A_0^*[s']^\top x)\big|
\]
\[
\le \tilde{\varepsilon} + k_1\tilde{\varepsilon} + \tilde{\varepsilon}\sum_{s'=0}^{k_1} |A_1^*[s][s']|(k_0 + 1) = \tilde{\varepsilon}\big(1 + k_1 + a_1^*(p + 1)\big) \le \tilde{\varepsilon}(\tilde{k}_1 + a_1^*\tilde{k}_0)
\]
where the second line holds since $\psi(u) \le 1$ and the third step is shown next.
Let $u = b_0[s'] + A_0[s']^\top x$ and $u_\delta = (b_0[s'] + A_0[s']^\top x) - (b_0^*[s'] + A_0^*[s']^\top x)$. Then, for $|u_\delta| < 1$,
\[
|\psi(u) - \psi(u + u_\delta)| = \left| \frac{e^{u+u_\delta}}{1 + e^{u+u_\delta}} - \frac{e^u}{1 + e^u} \right|
= \frac{e^u |e^{u_\delta} - 1|}{(1 + e^u)(1 + e^{u+u_\delta})} \le |u_\delta| \tag{A.3}
\]
since $e^u/\big((1 + e^u)(1 + e^{u-1})\big) \le 1/2$ and $|e^{u_\delta} - 1| \le 2|u_\delta|$ for $|u_\delta| < 1$. Now, $|u_\delta| \le |b_0[s'] - b_0^*[s']| + \sum_{s''=0}^{p} |A_0[s'][s''] - A_0^*[s'][s'']| \le (p + 1)\tilde{\varepsilon} < 1$.

Suppose the claim holds for $V - 1$; we show it for $V$ as follows:
\[
|P_V\eta_\theta(x)[s] - P_V\eta_{\theta^*}(x)[s]| \le |b_V[s] - b_V^*[s]| + |A_V[s]^\top \psi(P_{V-1}\eta_\theta(x)) - A_V^*[s]^\top \psi(P_{V-1}\eta_{\theta^*}(x))|
\]
\[
\le \tilde{\varepsilon} + \|A_V[s] - A_V^*[s]\|_1 + \sum_{s'=0}^{k_V} |A_V^*[s][s']|\,\big|\psi(P_{V-1}\eta_\theta(x)[s']) - \psi(P_{V-1}\eta_{\theta^*}(x)[s'])\big|
\]
\[
\le \tilde{\varepsilon} + \|A_V[s] - A_V^*[s]\|_1 + \sum_{s'=0}^{k_V} |A_V^*[s][s']|\,\big|P_{V-1}\eta_\theta(x)[s'] - P_{V-1}\eta_{\theta^*}(x)[s']\big|
\]
where the second step follows since $\psi(u) \le 1$ and the third step follows by relation (A.3), provided $|P_{V-1}\eta_\theta(x)[s'] - P_{V-1}\eta_{\theta^*}(x)[s']| \le 1$; this holds using relation (A.2) with $V - 1$. Thus, proceeding further, we get
\[
|P_V\eta_\theta(x)[s] - P_V\eta_{\theta^*}(x)[s]| \le \tilde{\varepsilon}(1 + k_V) + 2\tilde{\varepsilon}\sum_{s'=0}^{k_V} |A_V^*[s][s']| \sum_{v=0}^{V-1}\tilde{k}_v \prod_{v'=v+1}^{V-1} a_{v'}^*
\le \tilde{\varepsilon}\sum_{v=0}^{V} \tilde{k}_v \prod_{v'=v+1}^{V} a_{v'}^*
\]
This completes the proof.

Lemma A.0.9 If $|\eta_0(x) - \eta_{\theta_n}(x)| \le \varepsilon$, then $|h_{\theta_n}(x)| \le 2\varepsilon$, where
\[
h_{\theta_n}(x) = \sigma(\eta_0(x))\big(\eta_0(x) - \eta_{\theta_n}(x)\big) + \log\big(1 - \sigma(\eta_0(x))\big) - \log\big(1 - \sigma(\eta_{\theta_n}(x))\big)
\]
Proof. Note that
\[
|h_{\theta_n}(x)| \le |\sigma(\eta_0(x))|\,|\eta_0(x) - \eta_{\theta_n}(x)| + \big|\log(1 - \sigma(\eta_0(x))) - \log(1 - \sigma(\eta_{\theta_n}(x)))\big|
\]
\[
\le |\eta_0(x) - \eta_{\theta_n}(x)| + \left|\log\left(1 + \sigma(\eta_0(x))\big(e^{\eta_{\theta_n}(x) - \eta_0(x)} - 1\big)\right)\right| \le 2|\eta_0(x) - \eta_{\theta_n}(x)|
\]
where the second step follows using $\sigma(x) = e^x/(1 + e^x) \le 1$ and the proof of the third step is shown below. Let $p = \sigma(\eta_0(x))$, so $0 \le p \le 1$, and $r = \eta_{\theta_n}(x) - \eta_0(x)$. Then
\[
r > 0:\ |\log(1 + p(e^r - 1))| = \log(1 + p(e^r - 1)) \le \log(1 + (e^r - 1)) = r = |r|
\]
\[
r < 0:\ |\log(1 + p(e^r - 1))| = -\log(1 + p(e^r - 1)) \le -\log(1 + (e^r - 1)) = -r = |r|
\]

Lemma A.0.10 For $\eta_{\theta_n}(x) = b_L + A_L\psi(b_{L-1} + A_{L-1}\psi(\cdots\psi(b_1 + A_1\psi(b_0 + A_0 x))))$,
\[
\sup_{j=1,\cdots,K_n} |\nabla_{\theta_j}\eta_{\theta_n}(x)| \le \prod_{v'=1}^{L_n} a_{v'n}
\]
where $a_{v'n} = \sup_{v=0,\cdots,k_{(v'+1)n}} \|A_{v'}[v]\|_1$.
Proof. We suppress the dependence on $n$. Let $P_V = b_V + A_V\psi(b_{V-1} + A_{V-1}\psi(\cdots b_1 + A_1\psi(b_0 + A_0 x)))$. Define $G_{V,V} = \mathbf{1}_{k_{V+1}}$ and, for $V = 0, \cdots, L$ and $V' = 0, \cdots, V - 1$,
\[
G_{V',V} = A_V\big(\psi'(P_{V-1}) \odot A_{V-1}\big(\psi'(P_{V-2}) \odot \cdots A_{V'+1}(\psi'(P_{V'}))\big)\big)
\]
where $\odot$ denotes componentwise multiplication. With $\psi(P_{-1}) = x$, we have
\[
\nabla_{b_V}\eta_\theta(x) = G_{V,L}\,\mathbf{1}_{k_{V+1}}, \qquad \nabla_{A_V}\eta_\theta(x) = G_{V,L}\,\mathbf{1}_{k_{V+1}}\psi(P_{V-1})^\top
\]
By the above form and the fact that $\psi(u), \psi'(u), |x_i| \le 1$, it can be easily checked by induction that $|G_{V,L}| \le \prod_{v'=V+1}^{L} a_{v'}$, which completes the proof.

Lemma A.0.11 Let $\widetilde{\mathcal{F}}_n = \{\sqrt{\ell_{\theta_n}(y, x)} : \theta_n \in \mathcal{F}_n\}$, where $\ell_{\theta_n}(y, x)$ is given by
\[
\ell_{\theta_n}(y, x) = \exp\big(y\eta_{\theta_n}(x) - \log(1 + e^{\eta_{\theta_n}(x)})\big) \tag{A.5}
\]
and $\mathcal{F}_n$ is given by
\[
\mathcal{F}_n = \{\theta_n : |\theta_{jn}| \le C_n,\ j = 1, \cdots, K_n\} \tag{A.6}
\]
Then, with $H_{[\,]}(u, \widetilde{\mathcal{F}}_n, \|\cdot\|_2)$ as in Definition A.0.2,
\[
\int_{\varepsilon^2/8}^{\sqrt{2}\varepsilon} \sqrt{H_{[\,]}(u, \widetilde{\mathcal{F}}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\sqrt{K_n\big((L_n + 1)\log k_n + (L_n + 2)\log C_n - \log\varepsilon\big)}
\]
Proof. In this proof, we suppress the dependence on $n$. Note, by Lemma 4.1 in [104],
\[
N(\varepsilon, \mathcal{F}_n, \|\cdot\|_\infty) \le \left(\frac{3C}{\varepsilon}\right)^{K}
\]
For $\theta_1, \theta_2 \in \mathcal{F}$, let $\tilde{\ell}(u) = \ell_{u\theta_1 + (1-u)\theta_2}(x, y)$. Following equation (52) in [6], we get
\[
\left|\sqrt{\ell_{\theta_1}(x, y)} - \sqrt{\ell_{\theta_2}(x, y)}\right| \le \sup_j \left|\frac{\partial\sqrt{\tilde{\ell}}}{\partial\theta_j}\right| \|\theta_1 - \theta_2\|_1 \le \Lambda(x, y)\|\theta_1 - \theta_2\|_1 \tag{A.7}
\]
where the upper bound is $\Lambda(x, y) = (kC)^L$. This is because $|\partial\sqrt{\tilde{\ell}}/\partial\theta_j|$, the derivative of $\sqrt{\tilde{\ell}}$ with respect to $\theta_j$, is bounded above by $|\partial\eta_\theta(x)/\partial\theta_j|$, as shown below:
\[
\left|\frac{\partial\sqrt{\tilde{\ell}}}{\partial\theta_j}\right| = \frac{1}{2}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right| \left| y - \frac{e^{\eta_\theta(x)}}{1 + e^{\eta_\theta(x)}} \right| e^{(y\eta_\theta(x) - \log(1 + e^{\eta_\theta(x)}))/2}
\le \frac{1}{2}\left(\frac{e^{\eta_\theta(x)}}{1 + e^{\eta_\theta(x)}}\right)^{1/2}\left(\frac{1}{1 + e^{\eta_\theta(x)}}\right)^{1/2}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right|
\le \frac{1}{4}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right|
\]
Thus, using $e^{\eta_\theta(x)}/(1 + e^{\eta_\theta(x)}) \le 1$ and Lemma A.0.10, we get
\[
\sup_{j=0,\cdots,K}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right| \le \prod_{v=1}^{L} a_v^* = \prod_{v=1}^{L} k_v C \le (kC)^L
\]
In view of (B.6) and Theorem 2.7.11 in [126], we have
\[
N_{[\,]}(2\varepsilon, \widetilde{\mathcal{F}}_n, \|\cdot\|_2) \le N\!\left(\frac{\varepsilon}{k^{L+1}C^{L+2}}, \mathcal{F}_n, \|\cdot\|_\infty\right), \qquad
H_{[\,]}(2\varepsilon, \widetilde{\mathcal{F}}_n, \|\cdot\|_2) \le K\log\left(\frac{3\,k^{L+1}C^{L+2}}{\varepsilon}\right)
\]
\[
N_{[]}(\varepsilon, \widetilde{\mathcal F}_n, \|\cdot\|_2) \le \left(\frac{3K^{L+1}C^{L+2}}{\varepsilon}\right)^{K} \implies \widetilde H_{[]}(\varepsilon, \widetilde{\mathcal F}_n, \|\cdot\|_2) \lesssim K\log\left(\frac{K^{L+1}C^{L+2}}{\varepsilon}\right),
\]
where $N_{[]}$ and $H_{[]}$ denote the bracketing number and bracketing entropy as in Definition A.0.2. Using Lemma A.0.4 with $M = K^{L+1}C^{L+2}$, we get
\[
\int_0^{\varepsilon} \sqrt{\widetilde H_{[]}(u, \widetilde{\mathcal F}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\sqrt{K\big((L+1)\log K + 2(L+2)\log C - \log\varepsilon\big)}.
\]
Therefore,
\[
\int_{\varepsilon^2/8}^{\sqrt{2\varepsilon}} \sqrt{\widetilde H_{[]}(u, \widetilde{\mathcal F}_n, \|\cdot\|_2)}\,du \le \int_0^{\sqrt{2\varepsilon}} \sqrt{\widetilde H_{[]}(u, \widetilde{\mathcal F}_n, \|\cdot\|_2)}\,du \lesssim \sqrt{2\varepsilon}\,\sqrt{K\big((L+1)\log K + (L+2)\log C - \log\sqrt{2\varepsilon}\big)}.
\]
The proof follows by noting $\log\sqrt{2\varepsilon} \gtrsim \log\varepsilon$.

A.0.2.1 Propositions

Proposition A.0.12 Let $q(\theta_n) = MVN(\theta_n^*, I_{K_n}/n^{2+2d})$ and $p(\theta_n) = MVN(\mu_n, \mathrm{diag}(\sigma_n^2))$, where $\log\|\sigma_n^2\|_\infty = O(\log n)$ and $\|\sigma_n^{-2}\|_\infty = O(1)$. Let $n\epsilon_n^2 \to \infty$, $K_n\log n = o(n\epsilon_n^2)$, $\|\theta_n^*\|_2^2 = o(n\epsilon_n^2)$, $\|\mu_n\|_2^2 = o(n\epsilon_n^2)$; then, for any $a > 0$,
\[
d_{KL}(q, p) \le n\epsilon_n^2 a.
\]
Proof.
\begin{align*}
d_{KL}(q, p) &= \sum_{j=1}^{K_n}\left(\log\big(n^{1+d}\sigma_{jn}\big) + \frac{1}{2n^{2+2d}\sigma_{jn}^2} + \frac{(\theta_{jn}^* - \mu_{jn})^2}{2\sigma_{jn}^2} - \frac12\right)\\
&\le K_n\big((d+1)\log n - \tfrac12\big) + \sum_{j=1}^{K_n}\log\sigma_{jn} + \frac{K_n\|\sigma_n^{-2}\|_\infty}{2n^{2+2d}} + \big(\|\theta_n^*\|_2^2 + \|\mu_n\|_2^2\big)\|\sigma_n^{-2}\|_\infty\\
&\le K_n\big((d+1)\log n - \tfrac12\big) + \frac{K_n}{2}\log\|\sigma_n^2\|_\infty + \frac{K_n\|\sigma_n^{-2}\|_\infty}{2n^{2+2d}} + \big(\|\theta_n^*\|_2^2 + \|\mu_n\|_2^2\big)\|\sigma_n^{-2}\|_\infty = o(n\epsilon_n^2),
\end{align*}
where the second step uses $\sigma_{jn}^{-2} \le \|\sigma_n^{-2}\|_\infty$ and $(a-b)^2 \le 2(a^2+b^2)$. The last equality follows since $\log\|\sigma_n^2\|_\infty = O(\log n)$, $\|\sigma_n^{-2}\|_\infty = O(1)$, $K_n\log n = o(n\epsilon_n^2)$, $\|\mu_n\|_2^2 = o(n\epsilon_n^2)$ and $\|\theta_n^*\|_2^2 = o(n\epsilon_n^2)$.

Proposition A.0.13 Let $p(\theta_n) = MVN(\mu_n, \mathrm{diag}(\sigma_n^2))$ with $\log\|\sigma_n^2\|_\infty = O(\log n)$ and $\|\sigma_n^{-2}\|_\infty = O(1)$. Let $\|\eta_0 - \eta_{\theta_n^*}\|_\infty \le \varepsilon\epsilon_n^2/4$ and $n\epsilon_n^2 \to \infty$. Define
\[
d_{KL}(\ell_0, \ell_{\theta_n}) = \int_{x\in[0,1]^{p_n}} \left(\sigma(\eta_0(x))(\eta_0(x) - \eta_{\theta_n}(x)) + \log\frac{1 - \sigma(\eta_0(x))}{1 - \sigma(\eta_{\theta_n}(x))}\right)dx,
\]
\[
\mathcal N_\varepsilon = \left\{\theta_n : d_{KL}(\ell_0, \ell_{\theta_n}) < \varepsilon\right\}. \tag{A.8}
\]
If $K_n\log n = o(n\epsilon_n^2)$, $\|\theta_n^*\|_2^2 = o(n\epsilon_n^2)$, $\log\big(\sum_{v=0}^{L_n} k_{vn}\prod_{v'=v+1}^{L_n} a^*_{v'n}\big) = O(\log n)$ and $\|\mu_n\|_2^2 = o(n\epsilon_n^2)$, then
\[
\int_{\theta_n\in\mathcal N_{\varepsilon\epsilon_n^2}} p(\theta_n)\,d\theta_n \ge e^{-n\epsilon_n^2 a}, \qquad \forall\, a > 0.
\]
Proof. Let $\eta_{\theta_n^*}(x) = b_L^* + A_L^{*\top}\psi(b_{L-1}^* + A_{L-1}^*\psi(\cdots \psi(b_1^* + A_1^*\psi(b_0^* + A_0^* x))))$ be the neural network such that
\[
\|\eta_{\theta_n^*} - \eta_0\|_\infty \le \frac{\varepsilon\epsilon_n^2}{4}. \tag{A.9}
\]
Such a neural network exists since $\|\eta_0 - \eta_{\theta_n^*}\|_\infty \le \varepsilon\epsilon_n^2/4$.
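The first display in the proof of Proposition A.0.12 is the standard closed-form KL divergence between Gaussians, summed coordinate-wise over the mean-field factors. As a quick numerical sanity check (not part of the dissertation's argument), the sketch below compares the one-dimensional closed form against brute-force numerical integration; the means and standard deviations are arbitrary illustrative values.

```python
import math

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    # closed-form KL( N(mu_q, sd_q^2) || N(mu_p, sd_p^2) ),
    # the per-coordinate term in the mean-field KL sum
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2)
            - 0.5)

def kl_numeric(mu_q, sd_q, mu_p, sd_p, lo=-40.0, hi=40.0, n=400_000):
    # brute-force midpoint Riemann sum of q(x) * log(q(x)/p(x))
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        lq = -0.5 * ((x - mu_q) / sd_q) ** 2 - math.log(sd_q * math.sqrt(2 * math.pi))
        lp = -0.5 * ((x - mu_p) / sd_p) ** 2 - math.log(sd_p * math.sqrt(2 * math.pi))
        total += math.exp(lq) * (lq - lp) * h
    return total

closed = kl_gauss(0.3, 1.5, -0.7, 2.0)
numeric = kl_numeric(0.3, 1.5, -0.7, 2.0)
print(abs(closed - numeric) < 1e-5)
```

For a diagonal-covariance (mean-field) family, the total KL is the sum of such one-dimensional terms, which is exactly the structure the proposition exploits.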
Next define neighborhood M Yn =2 as follows ( ) Yn =2 M Yn =2 = ✓= : |\ 9= \ ⇤9= | < Õ! = Œ! = , 9 = 1, · · · , = 2 E=0 :̃ E= E 0 =E+1 0 ⇤E 0= where :̃ E= = : E= + 1. For every ✓= 2 M Yn =2 , by lemma B.0.1, we have Yn =2 ||[✓= [✓=⇤ || 1  (A.10) 2 Combining (A.9) and (B.1), we get for ✓= 2 M Yn =2 , ||[✓= [0 || 1  Yn =2 /2. This, in view of lemma B.0.2, 3KL (✓0 , ✓✓= )  Yn =2 . Let ✓= 2 NYn =2 for every ✓= 2 M Yn =2 . Therefore, π π ?(✓= )3✓= ?(✓= )3✓= ✓= 2N Yn =2 ✓= 2M Y n 2 = Õ! = Œ! = Let X= = Yn =2 /(2 E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ), then π ÷ = π \ ⇤9= +X = ( \ 9= ` 9= ) 2 1 q 2f 2 ?(✓= )3✓= = 4 9= 3\ 9= ✓= 2M Y n 2 \ ⇤9= X = 2cf 9= 2 = 9=1 ÷ 2X= ( \b9= ` 9= ) 2 , b = q 2f 2 = 4 9= \ 9= 2 [\ ⇤9= X= , \ ⇤9= + X= ] 2cf 9= 2 9=1 ! ÷ 1 2 ( \b ` )2 = 2 log c log X = +log f 9= + 9= 2 9= 2f = 4 9= (A.11) 9=1 where the second last equality holds by mean value theorem. 102 Note that b \ 9= 2 [\ ⇤9= 1, \ ⇤9= + 1] since X= ! 0, therefore (b\ 9= ` 9= ) 2 max((\ ⇤9= ` 9= 1) 2 , (\ ⇤9= ` 9= + 1) 2 ) (\ ⇤9= ` 9= ) 2 1 2  2  2 + 2 2f 9= 2f 9= f 9= f 9= where the last inequality follows since (0 + 1) 2  2(0 2 + 1 2 ). Again using (0 + 1) 2  2(0 2 + 1 2 ), ’ = (b \ 9= ` 9= ) 2 ’ = \⇤ 2 9= ’ = `2 9= ’ = 1 2 2 2 +2 2 + 9=1 2f 9= 9=1 f 9= 9=1 f 9= 9=1 9= f2  2(||✓=⇤ || 22 + ||µ= || 22 + 1)|| ⇤ = || 1  =an =2 (A.12) since ||✓=⇤ || 22 = >(=n =2 ), ||µ= || 22 = >(=n =2 ) and || ⇤ = || 1 = $ (1) and =n =2 ! 1. Also, ’!= ÷ != log X= + log f 9= = log 2 + log( :̃ E= 0 ⇤E 0= ) log Yn =2 E=0 E 0 =E+1 ’!= ÷!=  log 2 + log( :̃ E= 0 ⇤E 0= ) + log f 9= log Y 2 log n = E=0 E 0 =E+1  log 2 + $ (log =) + $ (log =) log Y + $ (log =) Õ! = Œ! = where the last follows since log || = || 1 = $ (log =), log( E=0 : E= E 0 =E+1 0 ⇤E 0= ) = $ (log =) and 1/=n =2 = >(1) which implies 2 log n = = >(log =). 
’ = 1 2 log log X= + log f 9= = $ ( = log =) = >(=n =2 ) (A.13) 9=1 2 c where the last inequality follows since = log = = >(=n =2 ), Combining (B.3) and (B.4) and replacing (B.2), the proof follows. Õ! = Œ! = Proposition A.0.14 Let @(✓= ) ⇠ "+ # (✓=⇤ , = /=2+23 ), 3 > 3 ⇤ where E=0 : E= E 0 =E+1 0 ⇤E 0= = ⇤ $ (= 3 ), 3 ⇤ > 0. Define π ✓ ◆ 1 f([0 (x)) ⌘(✓= ) = f([0 (x))([0 (x) [✓= (x)) + log 3x x2[0,1] ?= 1 f([✓= (x)) Let ||[0 [✓=⇤ || 1  Yn =2 /4 where =n =2 ! 1. If = log = = >(=n =2 ), ||✓=⇤ || 22 = >(=n =2 ), then π ⌘(✓= )@(✓= )3✓=  Yn =2 . 103 Proof. Since ⌘(✓= ) is a KL-distance, ⌘(✓= ) > 0. We shall thus establish an upper bound. π π ⌘(✓= )@(✓= )3✓=  |[✓= (x) [0 (x)|3x π π x2[0,1] ?=  |[✓= (x) [\ ⇤= (x)|3x@(✓= )3✓= + ||[✓=⇤ [0 || 1 π x2[0,1] ?=  ||[✓= [✓=⇤ || 1 @(✓= )3✓= + Yn =2 (A.14) where the first inequality is a consequence of lemma B.0.2 and the last inequality follows since ||[✓=⇤ [0 || 1 = >(n =2 ). Õ! Œ! = Let ( = {✓= : \ 9=1 = |\ 9= \ ⇤9= |  Yn =2 /( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= )}, then π ||[✓= [✓=⇤ || 1 @(✓= )3✓= π π = ||[✓= [✓=⇤ || 1 @(✓= )3✓= + ||[✓= [✓=⇤ || 1 @(✓= )3✓= ( (2 π π ’ : !=  Yn =2 + |b ! [B] b⇤! [B]|@(✓= )3✓= + |A ! [B] [B0] A⇤! [B] [B0] |@(✓= )3✓= (2 (2 B 0 =1 ’ : != π + |A⇤! [1] [B]| @(✓= )3✓= (A.15) B=1 (2 Õ! Œ! = Let ( 2 = [ 9=1 = ( 29 where ( 9 = {|\ 9= \ ⇤9= |  D = } where D = = Yn =2 /( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ). We first compute &(( 2 ) as follows: ’ = ’ = π 2 &(( ) = &([ 9=1 ( 29 ) =  &(( 29 ) = @(\ 9= )3\ 9= 9=1 9=1 |\ 9= \ ⇤9= |>D = ⇣ ⇣ ⌘⌘ =2 = 1 =1+3 D = (A.16) Using (A.16) in the last term of (A.15), we get ’ : != π ’ : != |A⇤! [1] [B]| @(✓= )3✓= = &(( ) 2 |A⇤! [1] [B] | = 0 ⇤! = = = (1 (=1+3 D = )) ✓ ◆ B=1 (2 B=1 1 =2(1+3) D 2= = >(=n =2 )$ = 3 4 = >(n =2 ) (A.17) =1+3 D = 104 Õ! = Œ! = where the second step follows by Mill’s ratio, = = >(=n =2 ) and E=0 : E= E 0 =E+1 0 ⇤E 0= = $ (= 3 ) which implies =1+3 D = ! 1 . The third step holds because ! =2(1+3) Y 2 n=4 =1+3 =2(1+3) D 2= =2(1+3) D 2= log =( Õ! 
Œ != 0⇤ ) 2 (3+1) (A.18) :̃ 4 4 =4 E=0 E = E 0 =E+1 E 0 = = >(1) =1+3 D = Õ! Œ! = 0 ⇤E 0= ) 2 log = = $ (=23 log =) = >(=23 ) for 3 > 3 ⇤ and =2 n =4 ! 1. ⇤ since ( E=0 :̃ E= E 0 =E+1 105 For the second term in (A.15), let (0 = {|b ! [B] b⇤! [B] | > D = } π |b ! [B] b⇤! [B]| @(✓= )3✓= (2 π π = |b ! [B] b⇤! [B]|@(✓= )3✓= + |b ! [B] b⇤! [B] |@(✓= )3✓= π( 2 \( 0 ( 2 \( 0 2  |b ! [B] b⇤! [B]|@(b ! [B])3b ! [B] + ⇢ @(b! [B]) |b ! [B] b⇤! [B] |&( (˜2 ), (A.19) (0 (˜2 is the union of all ( 29 , 9 = 1, · · · , = except the one corresponding to b ! [B]. π |b ! [B] b⇤! [B]|@(b ! [B])3b ! [B] (0 π r =2+23 =2+23 ( b ! [B] b⇤! [B]) 2 = (b ! [B] b⇤! [B])4 2 3b ! [B] 2c π 1 | b ! [B] b⇤! [B] |>=1+3 D = 2 D 1 2 1+3 =p p 4 2 D 3D  4 = D = (A.20) =2+23 =1+3 D = 2c p Also, ⇢ @(b! [B]) |b ! [B] b⇤! [B]| = 2/c(1/=1+3 ). Thus ⇢ @(b! [B]) |b ! [B] b⇤! [B] |&( (˜2 ) ✓ ⇣ ⇣ ⌘⌘ ◆ = 1+3 = =2(1+3) D = =2(1+E) D 2= = $ 1+3 1 = D= ⇠ 2(1+3) 4 4 (A.21) = = D = where the first equality in the above step follows by observing that &( (˜2 ) behaves analogous to &(( 2 ) which was computed in (A.16) and the second equality in the above step follows due to Õ! Œ Mill’s ratio and E=0 :̃ E= E!0==E+1 0 ⇤E 0= = $ (=E ) which implies =1+3 D = ! 1. The third inequality in the above step is a consequence of the fact that =  =1+3 . Combining (A.17), (A.20) and (A.21), we get π =1+3 D = |b ! [B] b⇤! [B] | @(✓= )3✓=  4 (A.22) (2 Note the third term in (A.15) can be handled similar to third term and it can be shown π π ’ : != |b ! [B] b⇤! [B]|@(✓= )3✓= + |A ! [B] [B0] A⇤! [B] [B0] |@(✓= )3✓= (2 ( 2 B 0 =1 =1+3 D = =1+3 D = (=1+3 D = 2 log =)  : ! = +1 =4 = >((=n =2 ) 2 )4  >(n =2 )4 = >(n =2 ) (A.23) 106 where the last equality in the second step follows by = = >(=n =2 ) and the argument in (A.18) by which 4 (=1+3 D = 2 log =) = >(1). Combining (A.17) and (A.23) with (A.15) the proof follows. Proposition A.0.15 Let =n =2 ! 1. 
Let ?(✓= ) = "+ # (µ= , diag( 2 = )) where log || = || 1 = $ (log =) and ||µ= || 22 = >(=n =2 ). Suppose for some 0 < 1 < 1, = log = = >(= 1 n =2 ), then for 1n 2/ ⇠= = 4 = = = and F= as in (3.31), we have for any Y > 0, π 2 ?(✓= )3✓=  4 =Yn = , = ! 1 ✓= 2F=2 Proof. Let F9= = {\ 9= : |\ 9= |  ⇠= } F= = \ 9=1 = F9= =) F=2 = \ 9=1 = F9=2 Note that π ’ = π ( \ 9= ` 9= ) 2 1 q 2f 2 ?(✓= )3✓=  4 9= 3\ 9= ✓= 2F=2 2 F 9= 2cf 9= 2 9=1 ’ = π ’ = π ( \ 9= ` 9= ) 2 ( \ 9= ` 9= ) 2 ⇠= 1 1 1 q q 2f 2 2f 2 = 4 9= 3\ 9= + 4 9= 3\ 9= 1 2cf 9= 2 ⇠= 2cf 9=2 9=1 9=1 ’ = ✓ ✓ ◆◆ ’ = ✓ ✓ ◆◆ ⇠= ` 9= ⇠= + ` 9= = 1 + 1 9=1 f 9= 9=1 f 9= p Since ||µ= || 22 = >(=n =2 ) =) ||µ= || 1 = >( =n = ). Since log || = || 1 = $ (log =) which implies for some " > 0, 3 1, ✓ ◆ p |⇠= ` 9= | |⇠= + ` 9= | (⇠= =) 1 min , 4 log ⇠= (3+1) log = ⇠ 4 '= log = ! 1 f 9= f 9= = "3 = 3 1/2 " (A.24) where the last convergence holds since = log = = >(= 1 n =2 ) implies '= = (= 1 n =2 )/( = log =) (3 + 1) ! 1. Thus, using Mill’s ratio, we get: π ©’ f 9= ’ (⇠= ` 9= ) 2 (⇠= +` 9= ) 2 ™ = = f 9= ?(✓= )3✓= = $ ≠ Æ̈ 2f 2 2f 2 4 9= + 4 9= ⇠= ` 9= ⇠= + ` 9= ´ ✓= 2F=2 9=1 9=1 p (⇠= =) 2 Y=n =2 2 =4 2=2 " 2 4 107 where the last asymptotic inequality holds because p 2 ✓ ◆ (⇠= =) 1 4 2'= 2 log = log 2 = ⇠ 4 2'= log = 2 log = = Y=n =2 2= 3 " 2 2 2 = In the above step, the first asymptotic equivalence holds due to (A.24), the second inequality holds since =  =. The last inequality holds since '= ! 1 and log/= ! 0. Proposition A.0.16 Let =n =2 ! 1. Suppose = log = = >(= 1 n =2 ), for some 0 < 1 < 1, ! = ⇠ log = and ?(✓= ) = "+ # (µ= , diag( 2 where log || = $ (log =) and ||µ= || 22 = >(=n =2 ). Then for = )) = || 1 every Y > 0, π ! (✓= ) log ?(✓= )3✓=  log 2 Y 2 =n =2 + > %0= (1) U Y2 n= !0 Proof. It suffices to show π ! !(✓= ) Y=n =2 %0= ?(✓= )3✓= > 24 ! 0, = ! 1 (A.25) U Y2 n= !0 π ! !(✓= ) Y 2 =n =2 %0= ?(✓= )3✓= > 24 U Y2 n= !0 π ! ✓π ◆ ! 
(✓= ) Y 2 =n =2 !(✓= ) Y 2 =n =2  %0= ?(✓= )3✓= > 4 + %0= ?(✓= )3✓= > 4 U Y2 n= \F= !0 F=2 !0 1n 2/ Using lemma B.0.4 with Y = Yn = and ⇠= = 4 = = = , π p 2Yn = e [] (D, F= , ||.|| 2 )3D Y 2 n =2 /8 p . n= Y = ((! = + 1) log = + (! = + 2) log ⇠= log Yn = ) p p p  Yn = $ (max( = (! = + 1) log = , = (! = + 2) log ⇠= , log n = )) q q p p  Yn = max(>(n = = 1 log =), $ (n = = 1 log =), $ ( log =))  Y 2 n =2 = where e [] (D, F= , ||.|| 2 ) is as in definition A.0.2. The first inequality in the third step follows because ! = ⇠ log =, = log = = >(= 1 n =2 ) and = log ⇠= = = 1 n =2 , log n =2  log =. The second inequality in the third step follows since (= 1 log =)/= = >(1) 108 By theorem 1 in [138], for some constant ⇠ > 0, we have π ! ! ! (✓ = ) 2 2 ! (✓= ) Y 2 =n =2 %0= ?(✓= )3✓= > 4 Y =n =  %0= sup >4 2 ✓= 2U Y n= \F= ! 0 ✓= 2U Y2 n= \F= !0  4 exp( ⇠Y 2 =n =2 ) = >(=n =2 ) (A.26) Using proposition B.0.7 with Y = 2Y, we have π 2=Y 2 n = 2 ?(✓= )3✓=  4 ✓= 2F=2 Therefore, using lemma A.0.7 with Y = 2Y 2 n =2 and Ỹ = Y 2 n =2 , we have ✓π ◆ !(✓= ) Y 2 =n =2 Y 2 =n =2 %0= ?(✓= )3✓= > 4 4 ! 0. (A.27) F=2 !0 Combining (A.26) and (A.27), (A.25) follows. Proposition A.0.17 Let ?(✓= ) = "+ # (µ= , diag( 2 = $ (=) and || ⇤ = $ (1). = ), || = || 1 = || 1 1. Let ! = = !, ? = = ? independent of =. If = log = = >(=) and ||µ= || 22 = >(=), then 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=) (A.28) 2. Let = log = = >(=n =2 ), ! = ⇠ log = and ||µ= || 22 = >(=n =2 ). There exists a neural network such Õ! = Œ ||[0 [✓=⇤ || 1 = >(=n =2 ), ||✓=⇤ || 22 = >(=n =2 ) and log( E=0 :̃ E= E!0==E+1 0 ⇤E 0= ) = $ (log =), then 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=n =2 ) (A.29) Proof. For any @ 2 Q= . π π 3KL (@, c(.|y= , X= )) = @(✓= ) log @(✓= )3✓= @(✓= ) log c(✓= |y= , X= )3✓= π π !(✓= ) ?(✓= ) = @(✓= ) log @(✓= )3✓= @(✓= ) log Ø 3✓= !(✓= ) ?(✓= )3✓= π π !(✓= ) !(✓= ) = 3KL (@, ?) log @(✓= )3✓= + log ?(✓= )3✓= π π !0 !0 !(✓= ) ! (✓= )  3KL (@, ?) 
+ log @(✓= )3✓= + log ?(✓= )3✓= (A.30) !0 !0 109 Since c ⇤ satisfies minimizes the KL-distance to c(.|y= , X= ) in the family Q= , therefore for any ^>0 %0= (3KL (c ⇤ , c(.|y= , X= )) > ^)  %0= (3KL (@, c(.|y= , X= )) > ^) (A.31) Proof of part 1. Note, = log = = >(=), ||`= || 22 = >(=), || = || 1 = $ (=) and || ⇤ = || 1 = $ (1). We p take @(✓= ) = "+ # (✓=⇤ , I = / =) where ✓=⇤ is defined next. For # 1, let [✓⇤# be a finite neural network approximation satisfying ||[✓=⇤ [0 || 1  Y/4. The existence of such a neural network is always guaranteed by [60]. Define ✓=⇤ same as ✓#⇤ for all the non zero coefficients and zeros for all non existent coefficients. Step 1 (a): Using proposition A.0.12, with n = = 1, we get for any a > 0, %0= (3KL (@, ?) > =a) = 0 (A.32) where the above step follows ||✓=⇤ || 22 = ||✓#⇤ || 22 = ||✓=⇤ || 22 = >(=). Step 1 (b): Next, note that π ✓ ◆ f([0 (x)) 1 f([0 (x)) 3KL (✓0 , ✓✓= ) = f([0 (x)) log + (1 f([0 (x))) log 3x f([✓= (x)) 1 f([✓= (x)) π ✓ ◆ x2[0,1] ?= 1 f([0 (x)) = f([0 (x))(f([✓= (x)) f([0 (x))) + log 3x (A.33) x2[0,1] ?= 1 f([✓= (x)) Since ||[0 [✓=⇤ || 1  Y/4, using proposition B.0.5 with n = = 1 and Y = Y π 3KL (✓0 , ✓✓= )@(✓= )3✓=  Y Õ! Œ! which follows by ||✓=⇤ || 22 = ||✓#⇤ || 22 = >(=) and log( E=0 :̃ E# E 0 =E+1 0 ⇤E 0 # ) = $ (log =). Therefore, by lemma A.0.5, ✓π ◆ !(✓= ) Y %0= log @(✓= )3✓= > =a  . (A.34) !0 a Step 1 (c): Since ||[0 [✓=⇤ || 1  Y/4, using proposition B.0.3 with n = = 1 and a = Y, π ?(✓= )3✓= exp( =Y) ✓= 2NY 110 Õ! Œ! which follows by ||✓=⇤ || 22 = ||✓#⇤ || 22 = >(=) and log( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ) = $ (log =). Therefore, using lemma A.0.6, we get ✓ π ◆ !(✓= ) 2Y %0= log ?(✓= )3✓= > =a  (A.35) !0 a Step 1 (d): From (A.31) and (A.30) we get %0= (3KL (c ⇤ , c(.|y= , X= )) > 3=a)  %0= (3KL (@, ?) > =a) ✓π ◆ ✓ π ◆ ! (✓= ) !(✓= ) 3Y + %0 = log @(✓= )3✓= > =a + %0 log = ?(✓= )3✓= > =a  (A.36) !0 !0 a where the last inequality is a consequence of (A.32), (A.34) and (A.35). 
Since Y is arbitrary, taking Y ! 0 completes the proof. Proof of part 2. Note, = log = = >(=n =2 ), ||`= || 22 = >(=n =2 ), log || = || 1 = $ (log =) and || =⇤ || 1 = Õ! = Œ $ (1). Let @(✓= ) = "+ # (✓=⇤ , I = /=2+23 ), 3 > 3 ⇤ where E=0 ⇤ :̃ E= E!0==E+1 0 ⇤E 0= = $ (= 3 ), 3 ⇤ > 0. We next define ✓=⇤ as follows: Let [✓=⇤ be the neural satisfying ||[✓=⇤ [0 || 1  Yn =2 /4 ||✓=⇤ || 22 = >(=n =2 ) The existence of such a neural network is guaranteed since ||[✓=⇤ [0 || 1 = >(n =2 ). Step 2 (a): Since ||✓=⇤ || 22 = >(=n =2 ), by proposition A.0.12, %0= (3KL (@, ?) > a=n =2 ) = 0 (A.37) Õ! = Œ! = Step 2 (b): Since ||[✓=⇤ [0 || 1  Yn =2 /4, ||✓=⇤ || 22 = >(=n =2 ) and ( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ) log = = >(=n =2 ), by proposition B.0.5, π 3KL (✓0 , ✓✓= )@(✓= )3✓=  Yn =2 Therefore, by lemma A.0.5, ✓π ◆ !(✓= ) 2 Y %0= log @(✓= )3✓= > a=n =  . (A.38) !0 a 111 Õ! = Œ! = Step 2 (c): Since ||[✓=⇤ [0 || 1  Yn =2 /4, ||✓=⇤ || 22 = >(=n =2 ) and log( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ) = $ (log =), by proposition B.0.3, π ?(✓= )3✓= exp( Y=n =2 ) ✓= 2N Yn =2 Therefore, using lemma A.0.6, we get ✓ π ◆ !(✓= ) 2 2Y %0 log = @(✓= )3✓= > a=n =  (A.39) !0 a Step 2 (d): From (A.31) and (A.30) we get ⇣ ⌘ %0= (3KL (c ⇤ , c(.|y= , X= )) > 3a=n =2 )  %0= 3KL (@, ?) > a=n =2 ✓π ◆ ✓ π ◆ ! (✓= ) 2 !(✓= ) 2 3Y + %0 = log @(✓= )3✓= > a=n = + %0 log = ?(✓= )3✓= > a=n =  (A.40) !0 !0 a where the last inequality is a consequence of (A.37), (A.38) and (A.39). Since Y is arbitrary, taking Y ! 0 completes the proof. Consistency of the variational posterior. Proof of Theorem 1. We assume Relation (B.13) holds with = and ⌫= are same as in (3.29). By assumptions (A1) and (A2), the prior parameters satisfy ||µ= || 22 = >(=), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), ⇤ = = 1/ =. Note = ⇠ =0 , 0 < 0 < 1 which implies = log = = >(=). By proposition A.0.17 part 1., 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=). 
(A.41) By step 1 (c) in the proof of proposition A.0.17 ⌫= = > %0= (=) (A.42) Since, = ⇠ =0 , = log = = >(= 1 ), 0 < 1 < 1. Using proposition A.0.16 with n = = 1, c ⇤ (UY2 ) = =Y 2 c ⇤ (UY2 ) log 2 + > %0= (1) = =Y 2 c ⇤ (UY2 ) + $ %0= (1) (A.43) 112 Thus, using (A.41), (A.42) and (A.43) in (B.13), we get =Y 2 c ⇤ (UY2 ) + $ %0= (1)  > %0= (=) + > %0= (=) =) c ⇤ (UY2 ) = > %0= (1) Proof of Theorem 2. We assume Relation (B.13) holds with = and ⌫= are same as in (3.29). Let = ⇠ =0 and n =2 ⇠ = X , 0 < X < 1 0. This implies = log = = >(=n =2 ). By assumptions (A1) and (A4), the prior parameters satisfy ||µ= || 22 = >(=n =2 ), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), = ⇤ = 1/ =. Also by assumption (A3), ’!= ÷ != ||[0 [✓=⇤ || 1 = >(n =2 ), ||✓=⇤ || 22 = >(=n =2 ), log( :̃ E= 0 ⇤E 0= ) = $ (log =) E=0 E 0 =E+1 By proposition A.0.17 part 2., 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=n =2 ). (A.44) By step 2 (c) in the proof of proposition A.0.17 ⌫= = > %0= (=n =2 ) (A.45) Since = ⇠ =0 , = log = = >(= 1 n =2 ), 0 + X < 1 < 1. Using proposition A.0.16, it follows that c ⇤ (UYn2 = ) = Y 2 =n =2 c ⇤ (UYn2 = ) log 2 + > %0= (1) = Y 2 =n =2 c ⇤ (UYn 2 = ) + $ %0= (1) (A.46) Thus, using (A.44), (A.45) and (A.46) in (B.13), we get =Y 2 n =2 c ⇤ (UYn 2 = ) + $ %0= (1)  > %0= (=n =2 ) + > %0= (=n =2 ) =) c ⇤ (UYn 2 = ) = > %0= (1) Proof of Corollary 1. Ø Let ✓ˆ= (H, x) = ✓✓= (H, x)c ⇤ (✓= )3✓= . ✓π ◆ 3H ( ✓ˆ= , ✓0 ) = 3H ⇤ ✓✓= c (✓= )3✓= , ✓0 π  3H (✓✓= , ✓0 )c ⇤ (✓= )3✓= Jensen’s inequality π π ⇤ = 3H (✓✓= , ✓0 )c (✓= )3✓= + 3H (✓✓= , ✓0 )c ⇤ (✓= )3✓= UY U Y2  Y + > %0= (1) 113 Taking Y ! 0, we get 3H ( ✓ˆ= , ✓0 ) = > %0= (1). Let ✓π ◆ 1 ˆ [(x) =f f([✓= (x))c (✓= )3✓=⇤ (A.47) ˆ then, note that [(x)ˆ = log ✓✓ˆ= (1, x) (0,x) . = π ’ q 2 ˆ 23H ( ✓= , ✓0 ) =2 2 ✓ˆ= (H, x)✓0 (H, x)3x x2[0,1] ? H2{0,1} π ’ 1 ˆ x) log(1+4 [ˆ ( x) )+H[0 ( x) log(1+4 [0 ( x) )} =2 2 4 { 2 ( H[( 3x x2[0,1] ? H2{0,1} π ⇣p p ⌘ =2 2 f([0 (x))f( [(x)) ˆ + (1 f([0 (x)))(1 ˆ f( [(x))) 3x π q x2[0,1] ? 
p p 2 2 1 ( f([0 (x)) ˆ f( [(x))) 2 3x π π x2[0,1] ? p p 2 1 2 ( f([0 (x)) ˆ f( [(x))) 3x (f([0 (x)) ˆ f( [(x))) 3x x2[0,1] ? 4 x2[0,1] ? (A.48) p In the above equation, the sixth and the seventh step hold because 1 G  1 G/2 and | ? 1 ?2 |  p p p p p p | ? 1 + ? 2 || ? 1 ? 2 |  2| ? 1 ? 2 | respectively. The fifth step holds because ⇣p p ⌘2 p ?1 ?2 + (1 ? 1 )(1 ?2) = ?1 ?2 + 1 ?1 ?2 + ? 1 ? 2 (1 ? 1 ) (1 ? 2 ) p p p p 2  ?1 ?2 + 1 ?1 ? 2 + ? 1 ? 2 = 1 ( ?1 ?2) By (A.48) and Cauchy Schwartz inequality, π ✓π ◆ 1/2 2 |f([0 (x)) f( [(x))|3x ˆ  (f([0 (x)) ˆ f( [(x))) 3x x2* [0,1] ? x2[0,1] ? p  2 23H ( ✓ˆ= , ✓0 ) = > %0= (1) (A.49) The proof follows in lieu (3.33). Proof of Corollary 2. We assume Relation (B.13) holds with = and ⌫= are same as in (3.29). Let = ⇠ =0 and n =2 ⇠ = X , 0 < X < 1 0. This implies = log = = >(=n =2 ). 114 Also, = log = = >(= 1 n =2 ), 0 + X < 1 < 1. This implies = log = = >(= 1 (n =2 ) ^ ), 0  ^  1. Thus, using proposition A.0.16 with n = = n =: , we get c ⇤ (UYn 2 = ^) = Y 2 =n =2^ c ⇤ (UYn2 ^) = log 2 + > %0= (1) = Y 2 =n =2^ c ⇤ (UYn 2 : ) + $ %0 (1) = (A.50) = This together with (A.44), (A.45) and (B.13) implies 2 2^ c ⇤ (UYn2 ^ ) = > % = (n = = 0 ) Ø Let ✓ˆ= (H, x) = ✓✓= (H, x)c ⇤ (✓= )3✓= . π π 3H ( ✓ˆ= , ✓0 )  ⇤ 3H (✓✓= , ✓0 )c (✓= )3✓= + 3H (✓✓= , ✓0 )c ⇤ (✓= )3✓= U Y n=^ U Y2 n ^ =  Yn =^ + > %0= (n =2 2^ ) Dividing by n =^ on both sides we get 1 3H ( ✓ˆ= , ✓0 ) = > %0= (n =2 3^ ) + > %0= (1) = > %0= (1), 0  ^  2/3. n =^ By (A.49), for every 0  ^  2/3, π 1 1 p |f([0 (x)) ˆ f( [(x))|3x  2 23H ( ✓ˆ= , ✓0 ) = > %0= (1). n =^ x2[0,1] ?= n =^ The proof follows in lieu of (3.33). Consistency of the true posterior. From (4.9), note that Ø Ø ! (✓= ) ?(✓= )3✓= (!(✓= )/! 0 ) ?(✓= )3✓= c(UY2 |y= , X= ) = Ø Y = ØY U2 U2 (A.51) ! (✓= ) ?(✓= )3✓= (!(✓= )/! 0 ) ?(✓= )3✓= Theorem A.0.18 Suppose conditions of theorem 3.4.1 hold. Then, 1. ⇣ ⌘ =Y 2 /2 %0= c(UY2 |y= , X= )  24 ! 1, = ! 1 2. p %0= (|'(⇠) ˆ '(⇠ Bayes )|  8 2Y) ! 1, = ! 
1 115 Proof. By assumptions (A1) and (A2), the prior parameters satisfy ||µ= || 22 = >(=), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), ⇤ = = 1/ =. Note = ⇠ =0 , 0 < 0 < 1 which implies = log = = >(=). Thus, the conditions of proposition B.0.3 hold with n = = 1. ✓π ◆ ✓ π ◆ ! (✓= ) ! (✓= ) %0= ?(✓= )3✓=  4 =a  %0= log ?(✓= )3✓= > =a ! 0, = ! 1 (A.52) !0 !0 where the above convergence follows from (A.35) in step 1 (c) in the proof of proposition A.0.17. Since = log = = >(= 1 ), 0 < 1 < 1, the conditions of proposition A.0.16 hold with n = = 1. ✓π ◆ !(✓= ) =Y 2 = %0 ?(✓= )3✓= 24 ! 0, = ! 1 (A.53) U Y2 !0 where the last equality follows from (A.25) with n = = 1 in the proof of proposition A.0.16. Using (A.52) and (A.53) with (A.51), we get ⇣ ⌘ =(Y 2 a) %0= c(UY2 |y= , X= ) 24 ! 0, = ! 1 Take a = Y 2 /2 to complete the proof. Mimicking the steps in the proof of corollary 1, π 3H ( ✓ˆ= , ✓0 )  3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= Jensen’s inequality π π = 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= + 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= UY U Y2 =Y 2 /2  Y + 24  2Y, with probability tending to 1 as = ! 1 where the second last inequality is a consequence of part 1. in theorem A.0.18. The remaining part of the proof follows by (A.49) and (3.33). Theorem A.0.19 Suppose conditions of theorem 3.4.2 hold. Then, 1. ⇣ ⌘ =n =2 Y 2 /2 %0= c(UYn 2 = |y= , X= )  24 ! 1, = ! 1 2. p %0= (|'(⇠) ˆ '(⇠ Bayes )|  8 2Yn = ) ! 1, = ! 1 116 Proof. By assumptions (A1) and (A4), the prior parameters satisfy ||µ= || 22 = >(=n =2 ), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), ⇤ = = 1/ =. Also by assumption (A3), ’!= ÷ != ||[0 [✓=⇤ || 1 = >(n =2 ), ||✓=⇤ || 22 = >(=n =2 ), log( :̃ E= 0 ⇤E 0= ) = $ (log =) E=0 E 0 =E+1 Note = ⇠ =0 , 0 < 0 < 1 and n = ⇠ = X , 0 < X < 1 0, thus = log = = >(=n =2 ). Thus, the conditions of proposition B.0.3 hold. ✓π ◆ ✓ π ◆ ! (✓= ) =n =2 a ! (✓= ) 2 %0= ?(✓= )3✓=  4  %0= log ?(✓= )3✓= > =n = a ! 0, = ! 
1 !0 !0 (A.54) where the above convergence follows from (A.39) in step 2 (c) in the proof of proposition A.0.17. Also, since = log = = >(= 1 n =2 ), 0 + X < 1 < 1. Thus conditions of proposition A.0.16 hold. π ! ! (✓ = ) 2 2 %0= ?(✓= )3✓= 24 =n = Y ! 0, = ! 1 (A.55) 2 U Y n= ! 0 where the last equality follows from (A.25) in the proof of proposition A.0.16. ⇣ 2 2 ⌘ Using (A.54) and (A.55) with (A.51), we get %0= c(UYn 2 |y , X ) = = = 24 =n = (Y a) ! 0, = ! 1. Take a = Y 2 /2 to complete the proof. Mimicking the steps in the proof of corollary 2, π 3H ( ✓ˆ= , ✓0 )  3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= Jensen’s inequality π π = 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= + 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= U Y n= U Y2 n= 2=n =2 Y 2  Yn = + 24  2Yn = , with probability tending to 1 as = ! 1 where the second last inequality is a consequence of part 1. in theorem A.0.19 and the last inequality last equality follows since n = ⇠ = X . Dividing by n = on both sides we get n = 1 3H ( ✓ˆ= , ✓0 )  2Y, with probability tending to 1 as = ! 1 The remaining part of the proof follows by (A.49) and (3.33). 117 APPENDIX B SUPPLEMENT FOR LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS Proof of Lemmas Õ: = Õ: = Lemma B.0.1 Consider, [✓= ( x) = V0 + 9=1 V 9 k(W 9 > x) and [ ⇤ (x) = V⇤ + ✓= 0 9=1 V⇤9 k(W ⇤9 > x). If |V 9 V⇤9 |  n, 9 = 1, · · · , : = |W 9 > x W ⇤9 > x|  n, 9 = 1, · · · , : = , π |[✓= ( x) [✓=⇤ (x)|3x  2n (: = + || ⇤ || 1 ) x2[0,1] ?= Proof. This proof uses somes ideas in the proof of theorem 1 in [74]. Õ = Note that |[✓= ( x) [✓=⇤ (x)|  |V0 V0⇤ | + :9=1 |V 9 k(W >9 x) V⇤9 k(W ⇤9 > x)|. 
Let $u_j = \gamma_j^\top x$ and $r_j = \gamma_j^{*\top} x - \gamma_j^\top x$; then $|\eta_{\theta_n}(x) - \eta_{\theta_n^*}(x)|$ is bounded above by
\begin{align*}
|\beta_0 - \beta_0^*| + \sum_{j=1}^{k_n}\left|\frac{\beta_j e^{u_j}}{1+e^{u_j}} - \frac{\beta_j^* e^{u_j+r_j}}{1+e^{u_j+r_j}}\right|
&= |\beta_0 - \beta_0^*| + \sum_{j=1}^{k_n}\frac{\big|\beta_j(1+e^{u_j+r_j}) - \beta_j^*(1+e^{u_j})e^{r_j}\big|\,e^{u_j}}{(1+e^{u_j+r_j})(1+e^{u_j})}\\
&\le 2\sum_{j=0}^{k_n}|\beta_j - \beta_j^*| + \sum_{j=1}^{k_n}|\beta_j^*|\,|e^{r_j} - 1|.
\end{align*}
Since $|r_j| < \epsilon < 1$, thus $|1 - e^{r_j}| < 2|r_j| \le 2\epsilon$, and the proof follows.

Lemma B.0.2 For any two functions $\eta_0$ and $\eta_1$,
\[
h(x) = \sigma(\eta_0(x))(\eta_0(x) - \eta_1(x)) + \log(1 - \sigma(\eta_0(x))) - \log(1 - \sigma(\eta_1(x)))
\]
satisfies $|h(x)| \le 2|\eta_0(x) - \eta_1(x)|$.

Proof. Using $\sigma(u) = e^u/(1+e^u) \le 1$,
\begin{align*}
|h(x)| &\le |\sigma(\eta_0(x))|\,|\eta_0(x) - \eta_1(x)| + |\log(1 - \sigma(\eta_0(x))) - \log(1 - \sigma(\eta_1(x)))|\\
&\le |\eta_0(x) - \eta_1(x)| + \left|\log\left(1 + \sigma(\eta_0(x))\big(e^{\eta_1(x) - \eta_0(x)} - 1\big)\right)\right| \le 2\,|\eta_0(x) - \eta_1(x)|,
\end{align*}
where the proof of the last step is as follows. Let $p = \sigma(\eta_0(x))$, so that $0 \le p \le 1$, and let $r = \eta_1(x) - \eta_0(x)$; then
\[
\left|\log\left(1 + \sigma(\eta_0(x))\big(e^{\eta_1(x) - \eta_0(x)} - 1\big)\right)\right| = \left|\log\big(1 + p(e^r - 1)\big)\right|.
\]
For $r > 0$: $|\log(1 + p(e^r - 1))| = \log(1 + p(e^r - 1)) \le \log(1 + (e^r - 1)) = r = |r|$.
For $r < 0$: $|\log(1 + p(e^r - 1))| = -\log(1 + p(e^r - 1)) \le -\log(1 + (e^r - 1)) = -r = |r|$.

Lemma B.0.3 Let $p(\theta_n|\lambda) = MVN(\mu_n, \Sigma_n)$, where $\Sigma_n$ is diagonal. Let $\mathcal N_a = \{\theta_n : d_{KL}(\ell_0, \ell_{\theta_n}) < a\}$, where
\[
d_{KL}(\ell_0, \ell_{\theta_n}) = \int_{x\in[0,1]^{p_n}} \left(\sigma(\eta_0(x))(\eta_0(x) - \eta_{\theta_n}(x)) + \log\frac{1 - \sigma(\eta_0(x))}{1 - \sigma(\eta_{\theta_n}(x))}\right)dx.
\]
Assume conditions (C1), (C2), (C3) and (C4) hold; then
\[
\int_{\theta_n\in\mathcal N_{a\epsilon_n^2}} p(\theta_n|\lambda)\,d\theta_n \ge e^{-an\epsilon_n^2}.
\]
Proof. This proof uses some ideas from the proof of Theorem 1 in [74]. By condition (C3), let $\eta_{\theta_n^*}(x)$ be the neural network such that $\|\eta_{\theta_n^*} - \eta_0\|_\infty \le a\epsilon_n^2/4$. Define
\[
\mathcal N'_{a\epsilon_n^2} = \left\{\theta_n : |\beta_j - \beta_j^*|,\ |\gamma_j^\top x - \gamma_j^{*\top} x| < \frac{a\epsilon_n^2}{8(k_n + \|\beta^*\|_1)},\ j = 1,\dots,k_n\right\}.
\]
For every $\theta_n\in\mathcal N'_{a\epsilon_n^2}$, by Lemma B.0.1, we have
\[
\int_{x\in[0,1]^{p_n}} |\eta_{\theta_n}(x) - \eta_{\theta_n^*}(x)|\,dx \le \frac{a\epsilon_n^2}{4}. \tag{B.1}
\]
For $\theta_n\in\mathcal N'_{a\epsilon_n^2}$, $\int_{x\in[0,1]^{p_n}} |\eta_{\theta_n}(x) - \eta_0(x)|\,dx \le a\epsilon_n^2/2$, which with Lemma B.0.2 implies $d_{KL}(\ell_0, \ell_{\theta_n}) \le a\epsilon_n^2$.
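The bound in Lemma B.0.2 (and its copy, Lemma A.0.9) can be checked numerically. The sketch below is an illustrative verification only: `eta0` and `eta1` range over an arbitrary grid, and the `log1p_exp` helper is our own numerically stable device for computing $\log(1+e^u)$.

```python
import math

def sigmoid(u):
    # numerically stable logistic function sigma(u) = e^u / (1 + e^u)
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    return math.exp(u) / (1.0 + math.exp(u))

def log1p_exp(u):
    # stable evaluation of log(1 + e^u)
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def h(eta0, eta1):
    # integrand of the KL decomposition in Lemma B.0.2:
    # sigma(eta0)(eta0 - eta1) + log(1 - sigma(eta0)) - log(1 - sigma(eta1))
    return sigmoid(eta0) * (eta0 - eta1) - log1p_exp(eta0) + log1p_exp(eta1)

grid = [i / 10.0 for i in range(-80, 81)]
worst = max(abs(h(a, b)) - 2.0 * abs(a - b) for a in grid for b in grid)
print(worst <= 1e-12)  # the bound |h| <= 2|eta0 - eta1| holds on the grid
```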
Thus, for every ✓= 2 N 0 2 , ✓= 2 Nan =2 which implies π π an = ?(✓= | )3✓= ?(✓= | )3✓= ✓= 2Na n 2 ✓= 2N 0 = a n=2 Let U 9 = x> >W 9 and U⇤9 = x> W ⇤9 , then ` 9 U = x> >µ 9W and f 92U = x> >⌃ 9W x. Also, using |G B |  1, U⇤9 2  x> W ⇤9 W ⇤9 > x  || ⇤ 2 9 || 1 `29 U = x> > µ 9 W µ>9 W x  || > µ 9 W || 21 1 f 92U = x> > ⌃ 9W x ⇤ || 2 || x|| 22 f 92U  || 2 9 W || 1 || x|| 22 || 9W 1 119 Let X = an =2 /(8(: = + || ⇤ || 1 )), then π ÷:= π V⇤9 +X (V 9 ` 9 V ) 2 π U⇤9 +X ( U 9 ` 9 U)2 1 1 q q 2f 2 2f 2 ?(✓= | )3✓= = 4 9V 3V 9 4 9U 3U 9 ✓= 2N 0 a n=2 9=1 V⇤9 X 2cf 92V U⇤9 X 2cf 92U ÷:= 2X ( Ṽ ` 9 V ) 2 2X ( Ũ ` 9 U ) 2 q q 2f 2 2f 2 = 4 9V 4 9U , Ṽ 9 2 V⇤9 ± X, Ũ 9 2 U⇤9 ± X 2cf 9 V 2 2cf 9 U 2 9=1 !! Õ := 2 ( Ṽ 9 ` 9 V ) 2 ( Ũ 9 ` 9 U ) 2 log 2 log X+log f 9 V +log f 9 U + + 9=1 c 2f 2 2f 2 =4 9V 9U (B.2) where the second last equality holds by mean value theorem. Note that Ṽ 9 2 V⇤9 ± 1 and Ũ 9 2 U⇤9 ± 1 since X ! 0, therefore ( Ṽ 9 ` 9 V)2 max((V⇤9 `9V 1) 2 , (V⇤9 ` 9 V + 1) 2 ) (V⇤9 ` 9 V)2 1   + , 2f 92V 2f 92V f 92V f 92V ( Ũ 9 ` 9 U)2 (U⇤9 ` 9 U)2 1  + 2f 92U f 92U f 92U Õ = ⇣ ⌘ which further implies :9=1 ( Ṽ ` 9 V ) 2 /(2f 92V ) + ( Ũ ` 9 U ) 2 /(2f 92U ) is bounded above by ’ : = V⇤ 2 9 ’ : = `2 9V ’:= 1 ’: = U⇤ 2 9 ’: = `2 9U ’ := 1 2 +2 + +2 +2 + 9=1 f 92V 9=1 f 92V 9=1 9 U f 2 9=1 9 U f 2 f 9=1 9 U 2 f2 9=1 9 U 1 ’ := ⇤ 2 . (|| || 1 + ||µ V || 21 )|| 1 2 V || 1 + (|| ⇤ 2 9 || 1 + || > µW 9 || 21 )||f 9 W1 || 21 = >(=n =2 ) (B.3) || x|| 22 9=1 where the last equality follows since || ⇤ || 21 , ||µ V || 21 , 1/|| x|| 22 = >(=n =2 ), Õ: = ⇤ 2 Õ: = > µ || 2 = $ (1) and || 1 1 9=1 ||W 9 || 1 , 9=1 || W9 1 V || 1 , sup 9=1,··· ,: = || 9 W || 1 = $ (1). ’ := 2 ’ := ( log 2 log X + log f 9 V + log f 9 U )  (2 log 8 + log(: = + || ⇤ || 1 )+ 9=1 c 9=1 log f 9 V + log f 9 U 2 log n = . 
: = (log : = + log || ⇤ || 1 + log || V || 1 2 log n = ) ’:= + log || 9 U || 1 = >(=n =2 ) (B.4) 9=1 120 where the last equality follows since : = log = = >(=n =2 ), log ||V⇤ || 1 = $ (log =), log || V || 1 = Õ = Õ = $ (log =), log n = = >(log =) and :9=1 log || 9 U || 1  : = log || x|| + : = :9=1 log || 9 W || 1 = $ (: = log =) = >(=n =2 ). Using (B.3) and (B.4) in (B.2), the proof follows. e= = { ✓ : ✓✓ (H, x), ✓= 2 F= } where ✓✓ (H, x) is given by p Lemma B.0.4 Let, F = = ⇣ ⇣ ⌘⌘ ✓✓= (H, x) = exp H[✓= ( x) log 1 + 4 [✓= ( x) (B.5) n o and F= = ✓= : |\ 9 |  ⇠= , 9 = 1, · · · , :̃ = where :̃ = = : = 3= + 2: = + 1 = $ (: = 3= ). Then, π p 2Y q p e [] (D, F= , ||.|| 2 )3D . Y : = 3= (log : = + (1/2) log ? = + 2 log ⇠= log Y) Y 2 /8 where e [] (D, F= , ||.|| 2 ) is the hellinger bracketing entropy of F̃= (see definition in 3 in [74]). Proof. This proof uses somes ideas in the proof of lemma 2 of [74]. In this proof, let ✓ = ✓= . Note, by lemma 4.1 in [104], # (Y, F= , ||.|| 1 )  (3⇠= /Y) : = . p e = ✓D✓ +(1 D) ✓ ( x, H). For ✓1 , ✓2 2 F= , let ✓(D) 1 2 q q ✓✓1 ( x, H) ✓✓2 ( x, H)  :̃ = sup e 9 ||✓1 m ✓/m\ ✓2 || 1  ( x, H)||✓1 ✓2 || 1 (B.6) 9=1,··· , :̃ = p p where the upper bound ( x, H) = :̃ = ? = ⇠= . This is because |m ✓/m\ e 9 |, the derivative of ✓ w.r.t. is bounded above by |m[✓ ( x)/m\ 9 | as shown below. ✓ [ ( x) ◆ 1/2 ✓ ◆ 1/2 m ✓e 1 m[✓ ( x) 4 ✓ 1  m\ 9 2 m\ 9 1+4 ✓ [ ( x ) 1 + 4 [✓ ( x) Thus, using 4 D /(1 + 4 D ), 1/(1 + 4 [✓ (x) )  1, we get 8 > > > e m✓ m[✓ (x) < 1, > \ 9 = VA for some A = 0, · · · , : = 2   > > > m\ 9 m\ 9 > |VA k 0 (WA> G) [ x] A 0 |, \ 9 = WAA 0 for some A = 0, · · · , : = , A 0 = 0, · · · , 3= : Õ ?= Õ ?= 2 1/2 ( Õ ?= 2 1/2 p Note, |VA |  ⇠= , |k 0 (D)|  1 and |[ G] A 0 | = | B=1 0 A B G 9 |  ( 0 B=1 0 A 0 B ) B=1 G B )  ?= since is orthonormal and |G B |  1. Hence the bound on ( x, H) follows. 121 In view of (B.6) and theorem 2.7.11 in [126] (also see theorem 3 in [74] for more details), we have p ! :̃ = p 3 :̃ = ? 
= ⇠=2 :̃ = ? = ⇠=2 # [] (Y, F e= , ||.|| 2 )  =) e [] (Y, F= , ||.|| 2 ) . :̃ = log 2Y Y where # [] and [] denote the bracketing number and bracketing entropy as in definition 3 of [74]. p Using, the proof of lemma 1 in [74] (equation (34)) with 3= = :̃ = and ⇠= = ? = ⇠=2 , we get π q p e Y [] (D, F= , ||.|| 2 )3D . Y : = 3= (log : = + (1/2) log ? = + 2 log ⇠= log Y) 0 π p 2Y π p 2Y =) e [] (D, F= , ||.|| 2 )3D  e [] (D, F= , ||.|| 2 )3D q Y 2 /8 0 . Y :̃ = (log :̃ = + (log ? = )/2 + 2 log ⇠= log Y) p Lemma B.0.5 Let @(✓= | ) ⇠ "+ # (m= , S= ) with ` 9 V = V⇤9 , f 9 V = 1/ =, m 9 W = W ⇤9 and S9W = 3 = /(=|| x|| 2 ) 2 . Define π 1 f([0 (x)) 3KL (✓0 , ✓✓= ) = f([0 (x))([0 (x) [✓= ( x)) + log 3x. x2[0,1] ?= 1 f([✓= ( x)) Suppose conditions (C1) and (C3) hold, then π 3KL (✓0 , ✓✓= )@(✓= | )3✓=  an =2 , 8a > 0 Ø Proof. Since 3KL (✓0 , ✓✓= ) is a KL-distance, 3KL (✓0 , ✓✓= ) > 0. We shall thus establish an upper Ø bound. By lemma B.0.2, 3KL (✓0 , ✓✓= )@(✓= | )3✓= is upper bounded by π 2 |[0 (x) [\ = ( x)|3x π π  |[0 (x) [\ ⇤= (x)|3x@(✓= | )3✓= + π π x2[0,1] ?= |[✓=⇤ (x) [\ = ( x)|3x@(✓= | )3✓= π π x2[0,1] ?= an =2  + |[✓=⇤ (x) [\ = ( x)|3x @(✓= | )3✓= 2 | {z } x2[0,1] ?= ⌘( ✓= ) 122 π π |[✓= ( x) [✓=⇤ (x)|3x  |V0 V0⇤ |3x x2[0,1] ?= x2[0,1] ?= ’:= π > ⇤> + |V 9 k( 9 x) V⇤9 k( 9 G)|3x 9=1 x2[0,1] ?= ’:= π  |V V0⇤ | + |V 9 k( > 9 x) V⇤9 k( > 9 x)|3x 9=1 x2[0,1] ?= ’:= π ⇤> + |V⇤9 k( > 9 x) V⇤9 k( 9 G)|3x 9=1 x2[0,1] ?= ’:=  |V 9 V⇤9 | π 9=0 ⇤ > ⇤ + || || 1 |k( 9 x) k( 9 >x)|3x x2[0,1] ?= Therefore, π ’ := ⌘(✓= )@(✓= | )3✓=  |V 9 V⇤9 |@(V 9 )3V 9 + π π 9=0 ⇤ || || 1 |k( >9 x) k( ⇤ 9 >x)|3x@( 9 )3 9 r ? 
π π x2[0,1] = := 2 ⇤ > ⇤> = + || || 1 |k( 9 x) k( 9 x)|3x@( 9 )3 9 = c x2[0,1] ?= (B.7) 123 Now, let " 9 = { : | >9 x ⇤ > x| 9  an =2 /(16|| ⇤ || 1 )}, then π π > ⇤> |k( 9 x) k( 9 x)|3x@( 9 )3 9 π π x2[0,1] ?= > ⇤> = |k( 9 x) k( 9 x)|3x@( 9 )3 9+ π π "9 x2[0,1] ?= > ⇤> |k( 9 x) k( 9 x)|3x@( 9 )3 9 " 29 x2[0,1] ?= an =2  + 2& 9 (" 29 ) (B.8) 8|| || 1 ⇤ Thus, combining (B.7) and (B.8) and using : = = >(=n =2 ), we get π an 2 an 2 ⌘(✓= )@(✓= | )3✓=  = + = + 2|| ⇤ || 1 & 9 (" 29 ) (B.9) 4 8 In the next steps, we deal with & 9 (" 29 ). Let X = an =2 /(16|| || ⇤1 ) ⇤> %(| > 9 x 9 x| > X) = %(|U 9 U⇤9 | X) (B.10) where U 9 = x> > 9 and U⇤9 = x> W ⇤9 . Note that U 9 U⇤9 ⇠ # (` 9 U , f 92U ) with ` 9 U = x> > W ⇤9 G > W ⇤9 = G > ( > )W ⇤9 and f 92U = (1/(=2 || x|| 22 ))|| x|| 22 = 1/=2 . Further note that since |G B |  1, ’ ?= ’?= |` 9 U | = |G B ||[( > ) ⇤ 9 ]B|  | [( > ) ⇤ 9 ]B| = ||( > ) ⇤ || 1 = >(= 1 ) = >(X) B=1 B=1 where the last equality holds since X ⇠ n =2 /|| ⇤ || 1 = 1 because || ⇤ || 2 1 = >(=n =2 ). This also p implies, (X ± ` 9 U )/f 9 U ⇠ (=n =2 )/|| ⇤ || 1 =n = ! 1 which implies ⇤> %(| > 9 x 9 x| > X) = 1 ((X ` 9 U )/f 9 U )) + 1 ((X + ` 9 U )/f 9 U ) ⇠ (f 9 U /(X ` 9 U ))q((X ` 9 U )/f 9 U )+ (B.11) =n =2 (f 9 U /(X + ` 9 U ))q((X + ` 9 U )/f 9 U ) . 4 where the asymptotic equivalence in the second step is a consequence of Mill’ratio. Thus, using the above relation in (B.9), π ✓ 2 ◆ n n2 =n =2 an =2 ⌘(✓= )@(✓= | )3✓=  a = + = + 2|| ⇤ || 1 4  4 8 2 p where the last equality holds || ⇤ || 1 = >( =n =2 ). 124 p Lemma B.0.6 Let @(✓= | ) ⇠ "+ # (m= , S= ) with < 9 V = V⇤9 , B 9 V = 1/ =, m 9 W = W ⇤9 and S9W = 3 = /(=|| x|| 2 ) 2 . Let, ?(✓= | ) = "+ # (µ= , ⌃= ), where ⌃= is diagonal. Suppose conditions (C1), (C2), (C3) and (C4) hold, then 3KL (@, ?) = >(=n =2 ) Proof: With :̃ = ⇠ : = + 1 + : = (3= + 1), here 3KL (@, ?) 
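Mill's ratio, invoked in (B.11) above and repeatedly throughout these proofs, says that the Gaussian upper tail $1 - \Phi(t)$ is bounded by $\phi(t)/t$ and is asymptotically equivalent to it as $t\to\infty$. A quick numerical illustration (a sketch using only the standard library; the grid of $t$ values is arbitrary):

```python
import math

def phi(t):
    # standard normal density
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def upper_tail(t):
    # 1 - Phi(t) via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2.0))

ok = True
for t in [0.5, 1.0, 2.0, 5.0, 10.0]:
    tail = upper_tail(t)
    ok &= tail <= phi(t) / t                  # Mill's-ratio upper bound
    ok &= tail >= phi(t) * t / (1.0 + t * t)  # matching lower bound

# asymptotic equivalence: 1 - Phi(t) ~ phi(t)/t for large t
ratio = upper_tail(10.0) / (phi(10.0) / 10.0)
print(ok and abs(ratio - 1.0) < 0.02)
```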
can be simplified as

$$d_{KL}(q, p) = \sum_{j=0}^{k_n}\left(\log(\sqrt{n}\,\sigma_{j\beta}) + \frac{1}{2n\sigma_{j\beta}^2} + \frac{(\beta_j^* - \mu_{j\beta})^2}{2\sigma_{j\beta}^2}\right) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\left(\log(n\|x\|_2\,\sigma_{jj'\gamma}) + \frac{1}{2(n\|x\|_2)^2\sigma_{jj'\gamma}^2} + \frac{(\hat{\gamma}_{jj'}^* - \mu_{jj'\gamma})^2}{2\sigma_{jj'\gamma}^2}\right) - \frac{\tilde{k}_n}{2}$$

$$\lesssim \tilde{k}_n\left(\log n + \log\|x\|_2 + \log\|\sigma_\beta\|_\infty + \frac{1}{(n\|x\|_2)^2}\right) + \sum_{j=1}^{k_n}\log\|\sigma_{j\gamma}\|_\infty + \|\sigma_\beta^{-2}\|_\infty\big(\|\beta^*\|_2^2 + \|\mu_\beta\|_2^2\big) + \sum_{j=1}^{k_n}\|\sigma_{j\gamma}^{-2}\|_\infty\big(\|\hat{\gamma}_j^*\|_2^2 + \|\mu_{j\gamma}\|_2^2\big) - \frac{\tilde{k}_n}{2} = o(n\epsilon_n^2), \tag{B.12}$$

where the last equality holds since $\tilde{k}_n\log n = o(n\epsilon_n^2)$; $\log\|x\|_2$, $\log\|\sigma_\beta\|_\infty$ and $\log\|\sigma_{j\gamma}\|_\infty$ are all $O(\log n)$; and $1/\|x\|_2^2 = o(n\epsilon_n^2)$. Since $\|\cdot\|_2 \le \|\cdot\|_1$, we also have $\|\beta^*\|_2^2,\ \|\mu_\beta\|_2^2 = o(n\epsilon_n^2)$, $\sum_{j=1}^{k_n}\|\hat{\gamma}_j^*\|_2^2 = O(1)$ and $\sum_{j=1}^{k_n}\|\mu_{j\gamma}\|_2^2 = O(1)$, as a consequence of which the proof follows.

Lemma B.0.7 Let $p(\theta_n|\lambda) = \mathrm{MVN}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$, where $\boldsymbol{\Sigma}_n$ is diagonal. Suppose conditions (C1) and (C2) hold; then

$$\int_{\theta_n \in \mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n \le e^{-\nu n\epsilon_n^2}, \quad \forall\,\nu > 0,$$

where $\mathcal{F}_n = \{\theta_n : |\theta_j| \le C_n,\ j = 1, \cdots, \tilde{k}_n\}$ with $C_n = e^{\varepsilon n\epsilon_n^2/\tilde{k}_n}$.

Proof: This proof uses some ideas from the proof of Theorem 1 in [74]. Let $\mathcal{F}_{jn} = \{\theta_j : |\theta_j| \le C_n\}$, which implies $\mathcal{F}_n = \cap_{j=1}^{\tilde{k}_n}\mathcal{F}_{jn}$ and hence $\mathcal{F}_n^c = \cup_{j=1}^{\tilde{k}_n}\mathcal{F}_{jn}^c$. Note that $\int_{\theta_n\in\mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n$ is bounded above by $\sum_{j=1}^{\tilde{k}_n} P(\mathcal{F}_{jn}^c)$, which equals

$$\sum_{j=0}^{k_n}\int_{(-\infty,-C_n)\cup(C_n,\infty)} \frac{1}{\sqrt{2\pi\sigma_{j\beta}^2}}\, e^{-(\beta_j - \mu_{j\beta})^2/(2\sigma_{j\beta}^2)}\,d\beta_j + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\int_{(-\infty,-C_n)\cup(C_n,\infty)} \frac{1}{\sqrt{2\pi\sigma_{jj'\gamma}^2}}\, e^{-(\gamma_{jj'} - \mu_{jj'\gamma})^2/(2\sigma_{jj'\gamma}^2)}\,d\gamma_{jj'}$$

$$= \sum_{j=0}^{k_n}\Big(2 - \Phi\big((C_n - \mu_{j\beta})/\sigma_{j\beta}\big) - \Phi\big((C_n + \mu_{j\beta})/\sigma_{j\beta}\big)\Big) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\Big(2 - \Phi\big((C_n - \mu_{jj'\gamma})/\sigma_{jj'\gamma}\big) - \Phi\big((C_n + \mu_{jj'\gamma})/\sigma_{jj'\gamma}\big)\Big)$$

$$\sim \sum_{j=0}^{k_n}\left(\frac{\sigma_{j\beta}}{C_n - \mu_{j\beta}}\,\phi\Big(\frac{C_n - \mu_{j\beta}}{\sigma_{j\beta}}\Big) + \frac{\sigma_{j\beta}}{C_n + \mu_{j\beta}}\,\phi\Big(\frac{C_n + \mu_{j\beta}}{\sigma_{j\beta}}\Big)\right) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\left(\frac{\sigma_{jj'\gamma}}{C_n - \mu_{jj'\gamma}}\,\phi\Big(\frac{C_n - \mu_{jj'\gamma}}{\sigma_{jj'\gamma}}\Big) + \frac{\sigma_{jj'\gamma}}{C_n + \mu_{jj'\gamma}}\,\phi\Big(\frac{C_n + \mu_{jj'\gamma}}{\sigma_{jj'\gamma}}\Big)\right),$$

where the asymptotic equivalence holds due to Mills' ratio and the facts $\mu_{j\beta}, \mu_{jj'\gamma} = o(\sqrt{n\epsilon_n^2})$ and $\sigma_{j\beta}, \sigma_{jj'\gamma} = O(n^r)$, $r > 0$, which imply

$$\frac{C_n \pm \mu}{\sigma} \gtrsim \frac{C_n - \sqrt{n\epsilon_n^2}}{M n^r} \sim \frac{C_n}{n^r} = \exp\left(\frac{n\epsilon_n^2}{\tilde{k}_n}\Big(\varepsilon - \frac{r\tilde{k}_n\log n}{n\epsilon_n^2}\Big)\right) \to \infty,$$

since $\tilde{k}_n\log n = o(n\epsilon_n^2)$. Further,

$$\int_{\theta_n\in\mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n \lesssim \sum_{j=0}^{k_n}\Big(e^{-(C_n - \mu_{j\beta})^2/(2\sigma_{j\beta}^2)} + e^{-(C_n + \mu_{j\beta})^2/(2\sigma_{j\beta}^2)}\Big) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\Big(e^{-(C_n - \mu_{jj'\gamma})^2/(2\sigma_{jj'\gamma}^2)} + e^{-(C_n + \mu_{jj'\gamma})^2/(2\sigma_{jj'\gamma}^2)}\Big)$$

$$\sim \tilde{k}_n \exp\left(-\exp\left(\frac{n\epsilon_n^2}{\tilde{k}_n}\Big(\varepsilon - \frac{r\tilde{k}_n\log n}{n\epsilon_n^2}\Big)\right)\right) \le e^{-\nu n\epsilon_n^2},$$

since $(n\epsilon_n^2/\tilde{k}_n)\big(\varepsilon - r\tilde{k}_n\log n/(n\epsilon_n^2)\big) \ge \log(\nu n\epsilon_n^2)$ eventually, when $\tilde{k}_n\log n = o(n\epsilon_n^2)$ and $n\epsilon_n^2 \to \infty$.

Proof of Theorem 1. This proof uses some ideas from the proofs of Lemmas 3 and 5 in [74].
Let $\mathcal{D}_n = (\boldsymbol{y}_n, x_1, \cdots, x_n)$. Then

$$d_{KL}(q^*, \pi(\cdot|\mathcal{D}_n)) = \int_{\mathcal{A}_{\varepsilon\epsilon_n}} q^*(\theta_n)\log\frac{q^*(\theta_n)}{\pi(\theta_n|\mathcal{D}_n)}\,d\theta_n + \int_{\mathcal{A}_{\varepsilon\epsilon_n}^c} q^*(\theta_n)\log\frac{q^*(\theta_n)}{\pi(\theta_n|\mathcal{D}_n)}\,d\theta_n$$

$$= -q^*(\mathcal{A}_{\varepsilon\epsilon_n})\int_{\mathcal{A}_{\varepsilon\epsilon_n}} \frac{q^*(\theta_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n})}\log\frac{\pi(\theta_n|\mathcal{D}_n)}{q^*(\theta_n)}\,d\theta_n - q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c} \frac{q^*(\theta_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)}\log\frac{\pi(\theta_n|\mathcal{D}_n)}{q^*(\theta_n)}\,d\theta_n$$

$$\ge -q^*(\mathcal{A}_{\varepsilon\epsilon_n})\log\frac{\pi(\mathcal{A}_{\varepsilon\epsilon_n}|\mathcal{D}_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n})} - q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log\frac{\pi(\mathcal{A}_{\varepsilon\epsilon_n}^c|\mathcal{D}_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)}, \quad \text{by Jensen's inequality}$$

$$= q^*(\mathcal{A}_{\varepsilon\epsilon_n})\log q^*(\mathcal{A}_{\varepsilon\epsilon_n}) + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c) - q^*(\mathcal{A}_{\varepsilon\epsilon_n})\log\pi(\mathcal{A}_{\varepsilon\epsilon_n}|\mathcal{D}_n) - q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log\pi(\mathcal{A}_{\varepsilon\epsilon_n}^c|\mathcal{D}_n)$$

$$\ge -q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log\pi(\mathcal{A}_{\varepsilon\epsilon_n}^c|\mathcal{D}_n) - \log 2, \quad \text{since } x\log x + (1-x)\log(1-x) \ge -\log 2$$

$$= -q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\left(\log\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n - \log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\right) - \log 2.$$

Let $E_{1n} = -\log\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c} (L(\theta_n|\cdot)/L_0)\,p(\theta_n|\lambda)\,d\theta_n$, $E_{2n} = \log\int (L(\theta_n|\cdot)/L_0)\,p(\theta_n|\lambda)\,d\theta_n$ and $E_{3n} = \int \log(L_0/L(\theta_n|\cdot))\,q(\theta_n|\lambda)\,d\theta_n$. Then, for any $q \in \mathcal{Q}_n$,

$$q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,E_{1n} \le d_{KL}(q, \pi(\cdot|\mathcal{D}_n)) + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,E_{2n} + \log 2$$

$$= d_{KL}(q, p(\cdot|\lambda)) - \int\log\frac{L(\theta_n|\cdot)}{L_0}\,q(\theta_n|\lambda)\,d\theta_n + \log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,E_{2n} + \log 2 \tag{B.13}$$

$$= d_{KL}(q, p(\cdot|\lambda)) + E_{3n} + \big(1 + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\big)E_{2n} + \log 2 \le o(n\epsilon_n^2) + E_{3n} + 2E_{2n} + \log 2, \tag{B.14}$$

where the last inequality holds due to Lemma B.0.6. We show three main things: $E_{3n} = o_{P_0^n}(n\epsilon_n^2)$, $E_{2n} = o_{P_0^n}(n\epsilon_n^2)$, and $E_{1n} \ge n\varepsilon^2\epsilon_n^2 - \log 2$ with $P_0^n$-probability tending to one. This completes the proof, because then

$$q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,n\varepsilon^2\epsilon_n^2 \le o(n\epsilon_n^2) + o_{P_0^n}(n\epsilon_n^2) + o_{P_0^n}(n\epsilon_n^2) + O_{P_0^n}(1) \implies q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c) = o_{P_0^n}(1).$$

Handling $E_{3n}$: $P_0^n(|E_{3n}| > \varepsilon n\epsilon_n^2)$ can be bounded above using Markov's inequality as

$$\frac{1}{\varepsilon n\epsilon_n^2}\,E_0^n\left(\Big|\int q(\theta_n|\lambda)\log\frac{L_0}{L(\theta_n|\cdot)}\,d\theta_n\Big|\right) \le \frac{1}{\varepsilon n\epsilon_n^2}\,E_0^n\left(\int q(\theta_n|\lambda)\,\Big|\log\frac{L_0}{L(\theta_n|\cdot)}\Big|\,d\theta_n\right) = \frac{1}{\varepsilon n\epsilon_n^2}\int\!\!\int q(\theta_n|\lambda)\,\Big|\log\frac{L_0}{L(\theta_n|\cdot)}\Big|\,L_0\,d\mu\,d\theta_n$$

$$\le \frac{1}{\varepsilon n\epsilon_n^2}\int q(\theta_n|\lambda)\big(d_{KL}(L_0, L(\theta_n|\cdot)) + 2/e\big)\,d\theta_n = \frac{1}{\varepsilon n\epsilon_n^2}\int q(\theta_n|\lambda)\big(n\,d_{KL}(\ell_0, \ell_{\theta_n}) + 2/e\big)\,d\theta_n \le \nu/\varepsilon + o(1),$$

where the third step follows from Lemma 4 in [74] and the fourth step follows from Lemma B.0.5. Since $\nu$ is arbitrary, $E_{3n} = o_{P_0^n}(n\epsilon_n^2)$. We next show $E_{2n} = o_{P_0^n}(n\epsilon_n^2)$.
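Before turning to $E_{2n}$, note that the bound (B.14) leans on Lemma B.0.6, whose core ingredient is the closed-form KL divergence between mean-field (diagonal) Gaussians. A minimal sketch of that formula, with made-up numbers purely for illustration (none of the values come from the thesis):

```python
import math

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    # Coordinate-wise KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ):
    #   log(sig_p / sig_q) + (sig_q^2 + (mu_q - mu_p)^2) / (2 sig_p^2) - 1/2,
    # summed over coordinates, as expanded at the start of (B.12).
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sig_q, mu_p, sig_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2.0 * sp**2) - 0.5
    return total

# Hypothetical example: a variational factor centred at beta* with standard
# deviation 1/sqrt(n), against a unit-variance prior centred at zero.
n = 10_000
print(kl_diag_gauss([0.3, -0.1], [n ** -0.5] * 2, [0.0, 0.0], [1.0, 1.0]))
```

Each summand vanishes when the two factors coincide and grows with the mean shift and with the log of the variance ratio; Lemma B.0.6 bounds exactly these three ingredients term by term to conclude $d_{KL}(q, p) = o(n\epsilon_n^2)$.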
Handling $E_{2n}$: $P_0^n(|E_{2n}| > \varepsilon n\epsilon_n^2)$ can be bounded above using Markov's inequality as

$$\frac{1}{\varepsilon n\epsilon_n^2}\,E_0^n\left(\Big|\log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\Big|\right) = \frac{1}{\varepsilon n\epsilon_n^2}\int\Big|\log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\Big|\,L_0\,d\mu \le \frac{1}{\varepsilon n\epsilon_n^2}\big(d_{KL}(L_0, L^*) + 2/e\big).$$

With $L^* = \int L(\theta_n|\cdot)\,p(\theta_n|\lambda)\,d\theta_n$, the last inequality follows from Lemma 4 in [74]. Further,

$$d_{KL}(L_0, L^*) = E_0^n\big(\log(L_0/L^*)\big) = E_0^n\left(\log\Big(L_0\Big/\int L(\theta_n|\cdot)\,p(\theta_n|\lambda)\,d\theta_n\Big)\right) \le E_0^n\left(\log\Big(L_0\Big/\int_{\mathcal{N}_{\nu\epsilon_n}} L(\theta_n|\cdot)\,p(\theta_n|\lambda)\,d\theta_n\Big)\right)$$

$$\le -\log\int_{\mathcal{N}_{\nu\epsilon_n}} p(\theta_n|\lambda)\,d\theta_n + \frac{\int_{\mathcal{N}_{\nu\epsilon_n}} d_{KL}(L_0, L(\theta_n|\cdot))\,p(\theta_n|\lambda)\,d\theta_n}{\int_{\mathcal{N}_{\nu\epsilon_n}} p(\theta_n|\lambda)\,d\theta_n} \le \log e^{\nu n\epsilon_n^2} + \nu n\epsilon_n^2 = 2\nu n\epsilon_n^2,$$

where the second step follows from Jensen's inequality and the last step follows from Lemma B.0.3. Thus $E_{2n} = o_{P_0^n}(n\epsilon_n^2)$. Lastly, we show $E_{1n} \ge n\varepsilon^2\epsilon_n^2 - \log 2$ with $P_0^n$-probability tending to one.

Handling $E_{1n}$: Recall $\mathcal{F}_n = \{\theta_n : |\theta_j| \le C_n\}$ with $C_n = e^{n\varepsilon\epsilon_n^2/\tilde{k}_n}$. Then $P_0^n(E_{1n} \le n\varepsilon^2\epsilon_n^2 - \log 2)$ is bounded above by

$$\underbrace{P_0^n\left(\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c\cap\mathcal{F}_n}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n \ge e^{-n\varepsilon^2\epsilon_n^2}\right)}_{E_{11n}} + \underbrace{P_0^n\left(\int_{\mathcal{F}_n^c}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n \ge e^{-n\varepsilon^2\epsilon_n^2}\right)}_{E_{12n}}.$$

Using Lemma B.0.4 with $\varepsilon$ replaced by $\varepsilon\epsilon_n$ and $C_n = e^{n\varepsilon\epsilon_n^2/\tilde{k}_n}$,

$$\int_0^{\sqrt{2}\,\varepsilon\epsilon_n}\sqrt{H_{[]}(u, \mathcal{F}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\epsilon_n\sqrt{\tilde{k}_n\big(\log\tilde{k}_n + \tfrac{1}{2}\log p_n + 2\log C_n - \log\epsilon_n\big)} \le \sqrt{n}\,\varepsilon^2\epsilon_n^2,$$

where the last inequality holds since $\tilde{k}_n\log n = o(n\epsilon_n^2)$, $p_n = o(e^{n\epsilon_n^2/\tilde{k}_n})$ and $n\epsilon_n^2 \to \infty$. Therefore, by Theorem 1 in [138], $E_{11n} \to 0$ as $n \to \infty$.

Additionally, $P_0^n\big(\int_{\mathcal{F}_n^c}(L(\theta_n|\cdot)/L_0)\,p(\theta_n|\lambda)\,d\theta_n \ge e^{-n\varepsilon^2\epsilon_n^2}\big)$ is bounded above by Markov's inequality by

$$e^{n\varepsilon^2\epsilon_n^2}\,E_0^n\left(\int_{\mathcal{F}_n^c}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\right) = e^{n\varepsilon^2\epsilon_n^2}\int_{\mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n \le e^{-n\epsilon_n^2(2\varepsilon^2 - \varepsilon^2)} = e^{-n\varepsilon^2\epsilon_n^2},$$

where the inequality holds due to Lemma B.0.7 with $\nu = 2\varepsilon^2$. Thus $E_{12n} \to 0$ as $n \to \infty$.

BIBLIOGRAPHY

[1] Ailon, N., Chazelle, B. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. Proceedings of the Symposium on Theory of Computing (2006), 557-563.
[2] Allison, J. R., Rivers, R. C., Christodoulou, J., Vendruscolo, M., Dobson, C. M.
A relationship between the transient structure in the monomeric state and the aggregation propensities of α-synuclein and β-synuclein. Biochemistry 53 (11 2014).
[3] Bai, J., Song, Q., Cheng, G. Efficient variational inference for sparse deep learning with theoretical guarantee. In Advances in Neural Information Processing Systems (2020), H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33, Curran Associates, Inc., pp. 466-476.
[4] Barron, A., Schervish, M. J., Wasserman, L. The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27, 2 (1999), 536-561.
[5] Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39, 3 (1993), 930-945.
[6] Bhattacharya, S., Maiti, T. Statistical foundation of variational Bayes neural networks. Neural Networks 137 (2021), 151-173.
[7] Bishop, C. M. Bayesian neural networks. Journal of the Brazilian Computer Society 4, 1 (1997), 61-68.
[8] Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993-1022.
[9] Blei, D. M., Lafferty, J. D. A correlated topic model of science. The Annals of Applied Statistics 1, 1 (2007), 17-35.
[10] Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D. Weight uncertainty in neural network.
[11] Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D. Weight uncertainty in neural network. In Proceedings of Machine Learning Research, vol. 37. PMLR, 2015, pp. 1613-1622.
[12] Breiman, L. Bagging predictors. Machine Learning 24, 2 (Aug. 1996), 123-140.
[13] Candès, E. J., Romberg, J. K., Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics 59, 8 (2006), 1207-1223.
[14] Cannings, T. I., Samworth, R. J. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 4 (2017), 959-1035.
[15] Carbonetto, P., Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies.
Bayesian Analysis 7 (2012).
[16] Carvalho, C. M., Polson, N. G., Scott, J. G. Handling sparsity via the horseshoe. D. van Dyk and M. Welling, Eds., vol. 5 of Proceedings of Machine Learning Research, PMLR, pp. 73-80.
[17] Casella, G., Robert, C. P. Rao-Blackwellisation of sampling schemes. Biometrika 83, 1 (1996), 81-94.
[18] Chapman, R., Mapstone, M., McCrary, J., Gardner, M., Porsteinsson, A., Sandoval, T., Guillily, M., DeGrush, E., Reilly, L. Predicting conversion from mild cognitive impairment to Alzheimer's disease using neuropsychological tests and multivariate methods. Journal of Clinical and Experimental Neuropsychology 33 (02 2011), 187-99.
[19] C, S.-T., H, Y.-H., H, Y.-L., K, S.-J., T, H.-S., W, H.-K., C, D.-R. Comparative analysis of logistic regression, support vector machine and artificial neural network for the differential diagnosis of benign and malignant solid breast tumors by the use of three-dimensional power Doppler imaging. Korean Journal of Radiology 10 (08 2009), 464-71.
[20] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems.
[21] Cheng, B., Zhang, D., Shen, D. Domain transfer learning for MCI conversion prediction. vol. 15, pp. 82-90.
[22] Chérief-Abdellatif, B.-E. Convergence rates of variational inference in sparse deep learning. In Proceedings of the 37th International Conference on Machine Learning (13-18 Jul 2020), H. D. III and A. Singh, Eds., vol. 119 of Proceedings of Machine Learning Research, PMLR, pp. 1831-1842.
[23] Cristianini, N., Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[24] C, Y., L, B., L, S., Z, X., F, M., L, T., Z, W., P, M., J, T., J, J. S., Alzheimer's Disease Neuroimaging Initiative. Identification of conversion from mild cognitive impairment to Alzheimer's disease using multivariate predictors. PLOS ONE 6, 7 (07 2011), 1-10.
[25] Cybenko, G.
Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 4 (1989), 303-314.
[26] Dasgupta, S. Experiments with random projection. Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (2000).
[27] Dasgupta, S., Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22, 1 (2003), 60-65.
[28] Davatzikos, C., Bhatt, P., Shaw, L., Batmanghelich, K., Trojanowski, J. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiol Aging (2011).
[29] D, D., L, X., T, M., P, G., C, K., B, K., M, J., D, R., S, Y., P, G. Combining early markers strongly predicts conversion from mild cognitive impairment to Alzheimer's disease. Biological Psychiatry 64 (09 2008), 871-9.
[30] D A S, J I V, J. C. S. Comparison between SVM and logistic regression: Which one is better to discriminate? Revista Colombiana de Estadística, Número especial en Bioestadística 35 (06 2012), 223-237.
[31] D, J., H, Q. Prediction of MCI to AD conversion using Laplace eigenmaps learned from FDG and MRI images of AD patients and healthy controls. In 2017 2nd International Conference on Image, Vision and Computing (ICIVC) (2017), pp. 660-664.
[32] Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q. A survey on ensemble learning. Frontiers of Computer Science 14 (2019), 241-258.
[33] Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory 52, 4 (2006), 1289-1306.
[34] Doshi, J., Erus, G., Ou, Y., Gaonkar, B., Davatzikos, C. Multi-atlas skull-stripping. Acad Radiol (2013), 1566-1576.
[35] Doshi, J., Erus, G., Ou, Y., Resnick, S., Gur, R., Gur, R., Satterthwaite, T., Davatzikos, C. MUSE: Multi-atlas region segmentation utilizing ensembles of registration algorithms and parameters, and locally optimal atlas selection. NeuroImage 127 (12 2015).
[36] Doshi, J., Erus, G., R, M., Davatzikos, C. Hierarchical parcellation of MRI using multi-atlas labeling methods. Alzheimer's Disease Neuroimaging Initiative.
[37] Dreiseitl, S., Ohno-Machado, L.
Logistic regression and artificial neural network classification models: A methodology review. Journal of Biomedical Informatics 35 (10 2002), 352-9.
[38] Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 61 (2011), 2121-2159.
[39] Durrant, R., Kabán, A. Sharp generalization error bounds for randomly-projected classifiers. In Proceedings of the 30th International Conference on Machine Learning (2013), S. Dasgupta and D. McAllester, Eds., vol. 28, PMLR, pp. 693-701.
[40] Durrant, R. J., Kabán, A. Random projections as regularizers: Learning a linear discriminant from fewer observations than dimensions. Machine Learning 99, 2 (2015), 257-286.
[41] Hinton, G. E., van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93. ACM Press, 1993, pp. 5-13.
[42] Eckerström, C., O, E., B, M., E, S., R, S., R, S., S, G., E, A., W, A., M, H. Small baseline volume of left hippocampus is associated with subsequent conversion of MCI into dementia: The Göteborg MCI study. J Neurol Sci 271(2) (2008), 48-59.
[43] Ewers, M., W, C., T, J., S, L., P, R., J, C., F, H., B, AL., A, G., S, P., V, B., D, B., W, M., H, H. Prediction of conversion from mild cognitive impairment to Alzheimer's disease dementia based upon biomarkers and neuropsychological test performance. Neurobiol Aging 33(7) (2012), 1203-1214.
[44] Farlow, M. Treatment of mild cognitive impairment (MCI). Curr. Alzheimer Res 6(4) (2009), 273-297.
[45] Feng, J., Simon, N. Sparse-input neural networks for high-dimensional nonparametric regression and classification, 2019.
[46] Friedman, J., Hastie, T., Tibshirani, R. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2009.
[47] Ghosal, S., Ghosh, J. K., van der Vaart, A. W. Convergence rates of posterior distributions. The Annals of Statistics 28, 2 (2000), 500-531.
[48] Ghosh, S., Yao, J., Doshi-Velez, F. Model selection in Bayesian neural networks via horseshoe priors. Journal of Machine Learning Research 20, 182 (2019), 1-46.
[49] Goel, N., Bebis, G., Nefian, A. Face recognition experiments with random projection. In Proceedings of SPIE - The International Society for Optical Engineering (2005), vol. 5776.
[50] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., vol. 24. Curran Associates, Inc., 2011, pp. 2348-2356.
[51] Graves, A. Generating sequences with recurrent neural networks, 2014. arXiv:1308.0850.
[52] Guhaniyogi, R., Dunson, D. B. Bayesian compressed regression. Journal of the American Statistical Association 110, 512 (2015), 1500-1514.
[53] Guhaniyogi, R., Dunson, D. B. Compressed Gaussian process for manifold regression. Journal of Machine Learning Research 17, 69 (2016), 1-26.
[54] Gurney, K. An Introduction to Neural Networks. Taylor & Francis, Inc., USA, 1997.
[55] Heinze, C., McWilliams, B., Meinshausen, N., Krummenacher, G. LOCO: Distributing ridge regression with random projections, 2015. arXiv:1406.3469.
[56] Hinrichs, C., Singh, V., Xu, G., Johnson, S. Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage 55 (03 2011), 574-89.
[57] Hinton, G., Srivastava, N., Swersky, K. Lecture 6a: Overview of mini-batch gradient descent. http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf.
[58] Hoerl, A. E., Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 1 (1970), 55-67.
[59] Hojjati, S. H., Ebrahimzadeh, A., Khazaee, A., Babajani-Feremi, A. Predicting conversion from MCI to AD using resting-state fMRI, graph theoretical approach and SVM. Journal of Neuroscience Methods 282 (03 2017).
[60] Hornik, K., Stinchcombe, M., White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359-366.
[61] Hubin, A., Storvik, G., Frommlet, F. Deep Bayesian regression models. arXiv:1806.02160.
[62] Alzheimer's Disease Neuroimaging Initiative. Accessed on: Nov. 3, 2020. [Online]. Available: http://adni.loni.usc.edu.
[63] Jaakkola, T., Jordan, M. I. A variational approach to Bayesian logistic regression problems and their extensions.
[64] Javid, K., Handley, W., Hobson, M. P., Lasenby, A. Compromise-free Bayesian neural networks. arXiv:2004.12211.
[65] Kejzlar, V., Bhattacharya, S., Son, M., Maiti, T. Black box variational Bayes model averaging, 2021.
[66] Kingma, D. P., Salimans, T., Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems (2015), C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28, Curran Associates, Inc., pp. 2575-2583.
[67] Korolev, I. Alzheimer's disease: A clinical and basic science review (MSRJ: Medical Student Research Journal). Medical Student Research Journal 4 (09 2014), 24-33.
[68] Korolev, I. O., Symonds, L. L., Bozoki, A. C., Alzheimer's Disease Neuroimaging Initiative. Predicting progression from mild cognitive impairment to Alzheimer's dementia using clinical, MRI, and plasma biomarkers via probabilistic pattern classification. PLOS ONE 11, 2 (02 2016), 1-25.
[69] K, J., Z, P., C, T., Z, Z., L, L., W, N., W, L. Prediction of transition from mild cognitive impairment to Alzheimer's disease based on a logistic regression-artificial neural network-decision tree model. Geriatr Gerontol Int 21(1) (2021), 43-47.
[70] van Laarhoven, T. L2 regularization versus batch and weight normalization. arXiv:1706.05350.
[71] Lampinen, J., Vehtari, A. Bayesian approach for neural networks: review and case studies. Neural Networks 14, 3 (2001), 257-274.
[72] Latouche, P., Robin, S. Variational Bayes model averaging for graphon functions and motif frequencies inference in W-graph models. Statistics and Computing 26, 6 (2015), 1173-1185.
[73] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278-2324.
[74] Lee, H.
Consistency of posterior distributions for neural networks. Neural Networks 13, 6 (2000), 629-642.
[75] L, S., B, A., Y, D., L, J., A, B. Predicting progression from mild cognitive impairment to Alzheimer's disease using longitudinal callosal atrophy. Alzheimer's and Dementia: Diagnosis, Assessment and Disease Monitoring 2 (03 2016).
[76] Lee, S.-I., Lee, H., Abbeel, P., Ng, A. Efficient L1 regularized logistic regression. In AAAI (2006).
[77] Leshno, M., Lin, V. Y., Pinkus, A., Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 6 (1993), 861-867.
[78] L, F., T, L., T, K.-H., J, S., S, D., L, J. Robust deep learning for improved classification of AD/MCI patients. In Machine Learning in Medical Imaging (Cham, 2014), G. Wu, D. Zhang, and L. Zhou, Eds., Springer International Publishing, pp. 240-247.
[79] L, X., L, C., C, J., O, J. Variance reduction in black-box variational inference by adaptive importance sampling. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18 (2018), International Joint Conferences on Artificial Intelligence Organization, pp. 2404-2410.
[80] Liang, F., Li, Q., Zhou, L. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association 113, 523 (2018), 955-972.
[81] Liu, Z., Maiti, T., Bender, A. A role for prior knowledge in statistical classification of the transition from MCI to Alzheimer's disease. Unpublished report, 2020.
[82] L, D. A., B, S., M, R. A., D, V., Alzheimer's Disease Neuroimaging Initiative (ADNI). A multivariate predictive modeling approach reveals a novel CSF peptide signature for both Alzheimer's disease state classification and for predicting future disease progression. PLOS ONE 12, 8 (08 2017), 1-18.
[83] Logsdon, B., Hoffman, G., Mezey, J. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics 11, 58 (2010).
[84] L, M., J, L., W, M. J. In Advances in Neural Information Processing Systems (2011), J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., vol. 24, Curran Associates, Inc., pp. 1206-1214.
[85] MacKay, D. J. C. A practical Bayesian framework for backpropagation networks.
[86] Marzetta, T. L., Tucci, G. H., Simon, S. H. A random matrix-theoretic approach to handling singular covariance estimates. IEEE Transactions on Information Theory 57, 9 (2011), 6256-6271.
[87] McKinney, W. Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference (01 2010).
[88] McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. Journal of Machine Learning Research 18, 90 (2017), 1-50.
[89] M, M., N, C., S, M. Deep learning on brain cortical thickness data for disease classification. In 2018 Digital Image Computing: Techniques and Applications (DICTA) (2018), pp. 1-5.
[90] M, S., K, A., R, F., A, A., K, S. A. A nonparametric approach for mild cognitive impairment to AD conversion prediction: Results on longitudinal data. IEEE Journal of Biomedical and Health Informatics 21, 5 (2017), 1403-1410.
[91] Misra, C., Fan, Y., Davatzikos, C. Baseline and longitudinal patterns of brain atrophy in MCI patients, and their use in prediction of short-term conversion to AD: Results from ADNI. NeuroImage 44(4) (2009), 1415-1422.
[92] Mitchell, A. J., Shiri-Feshki, M. Temporal trends in the long term risk of progression of mild cognitive impairment: a pooled analysis. Journal of Neurology, Neurosurgery & Psychiatry 79, 12 (2008), 1386-1391.
[93] Mullachery, V., Khera, A., Husain, A. Bayesian neural networks. arXiv:1801.07710.
[94] Nagapetyan, T., Duncan, A. B., Hasenclever, L., Vollmer, S. J., Szpruch, L., Zygalakis, K. The true cost of stochastic gradient Langevin dynamics. arXiv:1706.02692.
[95] Neal, R. M. Bayesian training of backpropagation networks by the hybrid Monte-Carlo method. https://www.cs.toronto.edu/~radford/ftp/bbp.pdf.
[96] Neal, R. M. Bayesian Learning for Neural Networks.
Springer-Verlag, Berlin, Heidelberg, 1996.
[97] Paisley, J., Blei, D., Jordan, M. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning, ICML '12 (2012), ACM Press, pp. 1363-1370.
[98] Park, T., Casella, G. The Bayesian lasso. Journal of the American Statistical Association 103, 482 (2008), 681-686.
[99] Pati, D., Bhattacharya, A., Yang, Y. On statistical optimality of variational Bayes. In Proceedings of Machine Learning Research, A. Storkey and F. Perez-Cruz, Eds., vol. 84. PMLR, 2018, pp. 1579-1588.
[100] Pati, D., Bhattacharya, A., Yang, Y. On statistical optimality of variational Bayes. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (2018), A. Storkey and F. Perez-Cruz, Eds., vol. 84, PMLR, pp. 1579-1588.
[101] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
[102] P, T., L, L., C, S., S, D., P R, A., S, I., M, A., G, M., C. M, S. Predicting progression of mild cognitive impairment to dementia using neuropsychological data: A supervised learning approach using time windows. BMC Medical Informatics and Decision Making 17 (07 2017).
[103] Petersen, R. C., Roberts, R. O., Knopman, D. S., Boeve, B. F., Geda, Y. E., Ivnik, R. J., Smith, G. E., Jack, C. R. Mild cognitive impairment: ten years later. Archives of Neurology 66, 12 (December 2009), 1447-1455.
[104] Pollard, D. Empirical processes: Theory and applications. NSF-CBMS Regional Conference Series in Probability and Statistics 2 (1990), i-86.
[105] Polson, N. G., Ročková, V. Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc.
[106] Price, R.
A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory 4, 2 (1958), 69-72.
[107] Ranganath, R., Gerrish, S., Blei, D. M. Black box variational inference. arXiv:1401.0118.
[108] Risacher, S., Saykin, A., West, J., Shen, L., Firpi, H., McDonald, B. Baseline MRI predictors of conversion from MCI to probable AD in the ADNI cohort. Current Alzheimer Research 6 (08 2009), 347-61.
[109] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (1996), 267-288.
[110] Ross, S. M. Simulation, fifth ed. Academic Press, 2013.
[111] Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics 48, 4 (2020), 1875-1897.
[112] Schölkopf, B., Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
[113] S, J., P, J., S, F., C, K., C, V., C, R., D, P. Predicting cognitive decline in subjects at risk for Alzheimer disease by using combined cerebrospinal fluid, MR imaging, and PET biomarkers. Radiology 266 (12 2012).
[114] S, T., J, J., L, Y., W, P., Z, C., Y, Z. Decision supporting model for one-year conversion probability from MCI to AD using CNN and SVM. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2018), pp. 738-741.
[115] Shlens, J. A tutorial on principal component analysis. arXiv:1404.1100.
[116] Simon, N., Friedman, J., Hastie, T., Tibshirani, R. A sparse-group lasso. Journal of Computational and Graphical Statistics 22, 2 (2013), 231-245.
[117] Singh, B., De, S., Zhang, Y., Goldstein, T., Taylor, G. Layer-specific adaptive learning rates for deep networks, 2015. arXiv:1510.04609.
[118] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929-1958.
[119] Suk, H.-I., Shen, D. Deep learning-based feature representation for AD/MCI classification.
In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013 (Berlin, Heidelberg, 2013), K. Mori, I. Sakuma, Y. Sato, C. Barillot, and N. Navab, Eds., Springer Berlin Heidelberg, pp. 583-590.
[120] Sun, S., Chen, C., Carin, L. Learning structured weight uncertainty in Bayesian neural networks. A. Singh and J. Zhu, Eds., vol. 54 of Proceedings of Machine Learning Research, PMLR, pp. 1283-1292.
[121] Sun, S., Zhang, G., Shi, J., Grosse, R. B. Functional variational Bayesian neural networks. In 7th International Conference on Learning Representations, ICLR 2019 (2019), OpenReview.net.
[122] Sun, Y., Song, Q., Liang, F. Consistent sparse deep learning: Theory and computation. Journal of the American Statistical Association (2021), 1-42.
[123] Tabatabaei-Jafari, H., Shaw, M., Cherbuin, N. Cerebral atrophy in mild cognitive impairment: A systematic review with meta-analysis. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 1 (12 2015).
[124] T, J. Lecture notes, part III: Black-box variational inference. http://www.it.uu.se/research/systems_and_control/education/2018/pml/lectures/VILectureNotesPart3.pdf.
[125] T, Z., C, J., K, Q., Z, M., A, A., S, K. Dynamic embedding projection-gated convolutional neural networks for text classification. IEEE Transactions on Neural Networks and Learning Systems (2021), 1-10.
[126] van der Vaart, A., Wellner, J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York, 1996.
[127] Vapnik, V. The Support Vector Method of Function Estimation. Springer US, Boston, MA, 1998, pp. 55-85.
[128] Vapnik, V. The Nature of Statistical Learning Theory. Springer, 1999.
[129] Vapnik, V. N. Statistical Learning Theory. Wiley-Interscience, 1998.
[130] V, Y., R, V., I, R., V, P. Predicting short-term MCI-to-AD progression using imaging, CSF, genetic factors, cognitive resilience, and demographics. Scientific Reports 9 (2019), 2235.
[131] V, T., L, S., B, D., V, S., D, P., D T, F., D, J.
Support vector machine versus logistic regression modeling for prediction of hospital mortality in critically ill patients with haematological malignancies. BMC Medical Informatics and Decision Making 8 (01 2009), 56.
[132] Wan, R., Zhong, M., Xiong, H., Zhu, Z. Neural control variates for variance reduction. arXiv:1806.00159.
[133] W, B., H, R., X, Y., Z, F., P, W. Identifying mild cognitive impairment conversion to Alzheimer's disease from medical image information. In 2016 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW) (2016), pp. 1-2.
[134] W, J., X, F., N, F., L, X. Unsupervised adaptive embedding for dimensionality reduction. IEEE Transactions on Neural Networks and Learning Systems (2021), 1-12.
[135] Wang, Y., Blei, D. M. Frequentist consistency of variational Bayes. Journal of the American Statistical Association 114, 527 (2019), 1147-1161.
[136] W, R., L, C., F, N., L, L. Prediction of conversion from mild cognitive impairment to Alzheimer's disease using MRI and structural network features. Frontiers in Aging Neuroscience (2016).
[137] Welling, M., Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML '11 (2011), ACM Press, pp. 681-688.
[138] Wong, W. H., Shen, X. Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Annals of Statistics 23, 2 (1995), 339-362.
[139] Wu, A., Nowozin, S., Meeds, E., Turner, R. E., Hernández-Lobato, J. M., Gaunt, A. L. Deterministic variational inference for robust Bayesian neural networks.
[140] Yang, K., Maiti, T. On the classification consistency of high-dimensional sparse neural network. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (2019), pp. 173-182.
[141] Yang, K., Maiti, T. Statistical aspects of high-dimensional sparse artificial neural network models. Machine Learning and Knowledge Extraction 2, 1 (2020), 1-19.
[142] Yang, Y., Pati, D., Bhattacharya, A.
α-variational inference with statistical guarantees. Annals of Statistics 48, 2 (2020), 886-905.
[143] Yao, Y., Rosasco, L., Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation 26 (2007), 289-315.
[144] Y, J., F, M., V, R., L, V., R, N., N, G., D, A., N, V. Sparse learning and stability selection for predicting MCI to AD conversion using baseline ADNI data. BMC Neurology 12 (06 2012), 46.
[145] Y, J., M, M., M, J., M, A., C, D., O, S. Accurate multimodal probabilistic prediction of conversion to Alzheimer's disease in patients with mild cognitive impairment. NeuroImage: Clinical (05 2013), 735-745.
[146] Y, Z., Y, F., Y, K., C, W., C, C. L. P., C, L., Y, J., W, H.-S. Semisupervised classification with novel graph construction for high-dimensional data. IEEE Transactions on Neural Networks and Learning Systems (2020), 1-14.
[147] Zhang, D., Shen, D. Multi-modal multi-task learning for joint prediction of clinical scores in Alzheimer's disease. pp. 60-67.
[148] Zhang, D., Shen, D., Alzheimer's Disease Neuroimaging Initiative. Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers. PLOS ONE 7, 3 (03 2012), 1-15.
[149] Zhang, F., Gao, C. Convergence rates of variational posterior distributions. Annals of Statistics 48, 4 (2020), 2180-2207.
[150] Z, C., C, Y., G, Z., H, F., L, J., G, T. Adaptive learning rates with maximum variation averaging, 2020. arXiv:2006.11918.
[151] Zhu, J., Rosset, S., Hastie, T., Tibshirani, R. 1-norm support vector machines. MIT Press, 2003, pp. 49-56.
[152] Zou, H., Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301-320.