VARIATIONAL BAYES DEEP NEURAL NETWORK: THEORY, METHODS AND APPLICATIONS

By Zihuan Liu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statistics – Doctor of Philosophy

2021

ABSTRACT

Bayesian neural networks (BNNs) have achieved state-of-the-art results in a wide range of tasks, especially in high-dimensional data analysis, including image recognition, biomedical diagnosis and others. My thesis mainly focuses on high-dimensional data, including simulated data and brain images of Alzheimer's disease. We develop the variational Bayesian deep neural network (VBDNN) and the Bayesian compressed neural network (BCNN), and discuss the related statistical theory and algorithmic implementations for predicting MCI-to-dementia conversion in multi-modal data from ADNI. The transition from mild cognitive impairment (MCI) to dementia is of great interest to clinical research on Alzheimer's disease (AD) and related dementias. This phenomenon also serves as a valuable data source for quantitative methodological researchers developing new approaches for classification. The development of the VBDNN is motivated by an important biomedical engineering application, namely, building predictive tools for the transition from MCI to dementia. The predictors are multi-modal and may involve complex interactive relations. In Chapter 2, we numerically compare the classification accuracy of logistic regression (LR) with that of the support vector machine (SVM) in classifying MCI-to-dementia conversion. The results show that although SVM and other machine learning techniques are capable of relatively accurate classification, similar or higher accuracy can often be achieved by LR, calling into question the necessity or added value of SVM for many clinical researchers.
Further, when faced with many potential features that could be used for classifying the transition, clinical researchers are often unaware of the relative value of different approaches for variable selection. Other than algorithmic feature selection techniques, manually trimming the list of potential predictor variables can also protect against over-fitting and offers possible insight into why the selected features are important to the model. We demonstrate how similar performance can be achieved using user-guided, clinically informed pre-selection versus algorithmic feature selection techniques. Besides LR and SVM, Bayesian deep neural networks (BDNNs) have quickly become among the most popular machine learning classifiers for prediction and classification with ADNI data. However, their Markov chain Monte Carlo (MCMC) based implementation suffers from high computational cost, limiting the use of this powerful technique in large-scale studies. Variational Bayes (VB) has emerged as a competitive alternative that overcomes some of these computational issues. Although VB is popular in machine learning, neither its computational nor its statistical properties are well understood for complex models such as neural networks. First, we develop the VBDNN estimation methodology and characterize the prior distributions and the variational family needed for consistent Bayesian estimation (Chapter 3). The thesis compares and contrasts the consistency and contraction rates of the true posterior for deep neural network-based classification with those of the corresponding variational posterior. Based on the complexity of the deep neural network (DNN), the thesis assesses the loss in classification accuracy due to the use of VB and provides guidelines on the characterization of the prior distributions and the variational family. The difficulty of the optimization associated with the variational Bayes solution is quantified as a function of the complexity of the DNN. Chapter 4 proposes a BCNN that takes care of the large p, small n problem by projecting the feature space onto a smaller-dimensional space using a random projection matrix. In particular, for dimension reduction, we propose a randomly compressed feature space instead of other popular dimension reduction techniques. We adopt a model averaging approach to pool information across multiple projections. As the main contribution, we propose a variational Bayes approach to simultaneously estimate both the model weights and the model-specific parameters. By avoiding standard Markov chain Monte Carlo and parallelizing across multiple compressions, we dramatically reduce both computation and storage requirements with minimal loss in prediction accuracy. We provide theoretical and empirical justifications of the proposed methodology.

ACKNOWLEDGEMENTS

Throughout the writing of this dissertation I have received a great deal of support and assistance. Firstly, I would like to express my sincere gratitude to my advisors, Prof. Maiti and Prof. Bhattacharya, for their continuous support of my Ph.D. study and related research, and for their patience, motivation, and immense knowledge. Their guidance helped me throughout the research and writing of this thesis. I could not have imagined having better advisors and mentors for my Ph.D. study. Besides my advisors, I would like to thank the rest of my thesis committee, Prof. Bender, Prof. Hong, and Prof. Zhu, for their insightful comments and encouragement, and also for the hard questions that prompted me to widen my research from various perspectives. My sincere thanks also go to Prof. Wang from Western Connecticut State University and Dr. Zhao from Cleveland Clinic, who provided me an opportunity to join their teams as a researcher, and who gave me access to the laboratory and research facilities. Without their precious support it would not have been possible to conduct this research.
Thanks to the Michigan State University Graduate School for awarding me a Dissertation Completion Fellowship, providing me with the financial means to complete this thesis. In addition, I would like to thank my parents for their wise counsel and sympathetic ear. You are always there for me. Finally, I could not have completed this dissertation without the support of my friends, Liangliang Zhang and Cheukyin Lee, who provided stimulating discussions as well as happy distractions to rest my mind outside of my research.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
  1.1 MCI-to-dementia conversion
  1.2 Bayesian Deep Neural Networks
  1.3 Variational inference
  1.4 Posterior Consistency
  1.5 Bayesian Compressed Neural Networks
CHAPTER 2 A ROLE FOR PRIOR KNOWLEDGE IN STATISTICAL CLASSIFICATION OF THE TRANSITION FROM MCI TO ALZHEIMER'S DISEASE
  2.1 Introduction
  2.2 Transition from MCI to dementia
  2.3 Materials and Data
    2.3.1 Data Used in Classification
    2.3.2 Clinical Cognitive Assessment and Genetic data
    2.3.3 MRI data
  2.4 Method and Algorithm
    2.4.1 Logistic Regression
    2.4.2 Support Vector Machine
    2.4.3 Experimental Design
  2.5 Results and Analysis
    2.5.1 Comparison with different modalities
    2.5.2 Comparison of Pre-selection and ℓ1 norm
    2.5.3 Comparison of Groups One and Two
    2.5.4 Comparison of LR and SVM
  2.6 Discussion and Conclusion
CHAPTER 3 CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS
  3.1 Introduction
  3.2 The Neural Networks Classifier and Likelihoods
  3.3 Bayesian Inference with Variational Algorithm
    3.3.1 Prior Choice
    3.3.2 Variational Inference
    3.3.3 Black Box Variational Algorithm using score function estimator
    3.3.4 Control Variate: Stabilizing the stochastic gradient
    3.3.5 RMSprop Learning Rate: Stabilizing the learning rate
    3.3.6 Classification using variational posterior
  3.4 Posterior and Classification Consistency
    3.4.1 Posterior consistency and its implication in practice
    3.4.2 Discussion of the proof
    3.4.3 Classification consistency
  3.5 Simulation Studies
    3.5.1 Simulation Scenarios
    3.5.2 Parameter choice for statistical and computational models
    3.5.3 Gradient stabilization parameters
    3.5.4 Testing accuracy and convergence
    3.5.5 Large number of layers and challenges
  3.6 Numerical Properties and Alzheimer's Disease Study
    3.6.1 Parameter choice for statistical and computational models
    3.6.2 Gradient stabilization parameters
    3.6.3 Testing accuracy and convergence
    3.6.4 Numerical comparison with popular models
  3.7 Conclusion and Discussion
CHAPTER 4 LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS
  4.1 Introduction
  4.2 Bayesian neural network for random projection based compressed feature space
    4.2.1 Bayesian neural network model
    4.2.2 Compression in the feature space with random projections
    4.2.3 Prior choice
  4.3 Variational Bayes model averaging for pooling multiple instances of random projection
    4.3.1 Bayesian model averaging
    4.3.2 ELBO derivation
  4.4 Intrinsic dimensionality and prediction
    4.4.1 Optimal dimension neighborhood selection
    4.4.2 Classification based on optimal neighborhood choice
  4.5 Algorithm and its implementation
  4.6 Theoretical results
  4.7 Numerical Study
    4.7.1 Problem setup
    4.7.2 Datasets
    4.7.3 Simulated data
    4.7.4 ADNI Data
    4.7.5 MNIST Data
  4.8 Results
    4.8.1 Optimal dimensional region
    4.8.2 Comparative Baselines
    4.8.3 Experimental Results
  4.9 Conclusion
CHAPTER 5 CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH
  5.1 Conclusions and discussion
  5.2 Directions for future research
APPENDICES
  APPENDIX A SUPPLEMENT FOR CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS
  APPENDIX B SUPPLEMENT FOR LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Sample Sizes by Timing and Diagnosis: Groups One and Two
Table 2.2: Clinical Features and Cognitive Assessment Scores of Group One
Table 2.3: Pre-selected MRI Features of Group One
Table 2.4: Modalities
Table 2.5: Top 10 features of Group One obtained by ℓ1 regularization
Table 2.6: LR and SVM performance of Group One (Time = 3 years) for models on single- and multi-modal feature sets
Table 2.7: LR and SVM performance of Group Two (Time = 2 years) for single- and multi-modal data
Table 3.1: Performance of algorithms 1, 2, 4, 5 for scenario 1
Table 3.2: Performance of algorithms 1, 2, 4, 5 for scenario 2
Table 3.3: Performance of algorithms 1, 2, 4, 5 for scenarios 1 and 2 with 3 layers
Table 3.4: Performance of algorithms 1, 2, 4, 5 for ADNI
Table 3.5: Performance for different classifiers. LR: logistic regression. SVM: support vector machine. ANN: frequentist artificial neural network. SG-MCMC: stochastic gradient MCMC Bayesian neural network
Table 4.1: Summary of data, where n, p and c denote the numbers of samples, features and classes
Table 4.2: RPVBNN settings for evaluation
Table 4.3: Small simulated data performance
Table 4.4: Large simulated data performance
Table 4.5: ADNI data performance
Table 4.6: MNIST data performance in terms of testing accuracy and time (based on 500 epochs)

LIST OF FIGURES

Figure 2.1: Comparison of distributions for baseline predictor variables between MCI-S and MCI-C groups. (a) The mean MMSE score in MCI-S is higher than in MCI-C. (b) Mean Learning scores of the MCI-C and MCI-S groups are 2.5 and 5.
Figure 2.2: Comparisons between MCI-S and MCI-C groups on baseline predictor variables. The y-axis of panels (a) through (d) represents the number of participants developing AD. Blue and red bars represent non-converters and converters, respectively. Panel (a) shows a greater number of converters than non-converters for both men and women. Panel (b) shows that more than half of MCI-C subjects are APOE4 carriers and approximately 70% of MCI-S subjects are non-APOE4 carriers. Panel (c) shows that MCI-S subjects have relatively lower CDR scores and MCI-C subjects have higher CDR scores; the number of people in the MCI-C group trends downward as the CDR score increases. Panel (d) shows that MCI-C subjects have relatively higher ADASQ4 scores; the average ADASQ4 scores of MCI-S and MCI-C subjects are approximately 5 and 8, respectively.
Figure 2.3: Flowchart of the LR and SVM methods. A) ROI-P: ROI-level data with pre-selection; B) ROI-NP: ROI-level data with no pre-selection; C) CCAR: clinical, cognitive assessment score, APOE4 and ROI-level data.
Figure 2.4: Model performance on the ROI feature set by number of features for LR and SVM. Panel (a) shows dramatic growth in AUC with LR as the number of features increases from 1 to 30, then a plateau at approximately 74% as the number of features increases from 30 to 40, followed by a significant drop when the number of features reaches 41. Panel (b) shows that the AUC increases dramatically as the number of features grows from 1 to 28, but fluctuates after 29. The optimal numbers of ROI features for the two methods are 29 and 28, with corresponding optimized AUCs of approximately 74.0% and 78.0%.
Figure 2.5: Model performance on the CCA feature set by number of features for LR and SVM. Panel (a) shows a significant increase in AUC with LR as the number of features increases from 1 to 5, then a slight decrease in testing accuracy when the number of features exceeds 5. Panel (b) shows that the AUC rises dramatically as the number of features increases from 1 to 4. The optimal numbers of CCA features obtained by LR and SVM are 5 and 4, with corresponding optimized AUCs of approximately 84.0% and 83.0%.
Figure 3.1: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 1 layer.
Figure 3.2: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 2 layers.
Figure 3.3: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 1 layer.
Figure 3.4: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 2 layers.
Figure 3.5: ELBO convergence of algorithms 1, 2, 4, 5 for ADNI.
Figure 4.1: Small simulated data: m = 32
Figure 4.2: Large simulated data: m = 128
Figure 4.3: ADNI data: m = 128
Figure 4.4: MNIST data: m = 128

LIST OF ALGORITHMS

Algorithm 1: BBVI
Algorithm 2: BBVI-CV
Algorithm 3: RPVBNN
Algorithm 4: BBVI-RMS
Algorithm 5: BBVI-CV-RMS

CHAPTER 1 INTRODUCTION

In this thesis, we develop the variational Bayesian deep neural network (VBDNN) and the Bayesian compressed neural network (BCNN), and discuss the related statistical theory and algorithmic implementations in the context of classification, such as classifying MCI-to-dementia conversion. Chapter 1 reviews the background, research questions and development of Bayesian neural networks (BNNs). Chapter 2 introduces the prediction of the transition from mild cognitive impairment (MCI) to dementia from brain images of Alzheimer's disease using traditional machine learning models (logistic regression and support vector machine). Chapter 3 introduces the VBDNN estimation methodology and the choice of the prior distributions and the variational family. In particular, we discuss the statistical framework for the neural network-based classification problem and provide posterior consistency and classification consistency. Chapter 4 introduces a variational Bayes neural network predictive model for addressing the curse of dimensionality (large p, small n)
by compressing the feature space using random projection matrices. Finally, Chapter 5 presents conclusions, a discussion, and directions for future research.

1.1 MCI-to-dementia conversion

The transition from mild cognitive impairment (MCI) to dementia is of great interest to clinical research on Alzheimer's disease (AD) and related dementias. Alzheimer's disease is a progressive, age-related, neurodegenerative disease and the most common cause of dementia [147, 148, 67]. Behaviorally, Alzheimer's dementia is commonly preceded by mild cognitive impairment, a syndrome characterized by declines in memory and other cognitive domains that exceed the cognitive decrements associated with normal aging [148, 103]. However, the prodromal symptoms of MCI are not prognostically deterministic: individuals with MCI tend to progress to diagnoses of probable AD at a rate of 8%–15% per year, and many conversions are detectable within 3 years of initial presentation [24, 44, 2]. Research efforts to provide new insights into the incidence of MCI-to-AD conversion have focused largely on clinically or biologically relevant features (i.e., neuroimaging markers, clinical exam data, neuropsychological test scores) and on different methods for statistical classification [145].

1.2 Bayesian Deep Neural Networks

Due to the universal approximation theory of stochastic functions and greater access to computational power, Bayesian deep neural networks (BDNNs) are fashionable in machine learning and statistics for classification and prediction from big data. BDNN-based prediction has several advantages over standard parametric statistical models. BDNNs implicitly consider the interactions or dependence among predictor variables and model the unknown functional relationship between the predictors and responses. For example, we consider classifying Alzheimer's disease status from brain imaging, an important biomedical problem.
The image features are segmented into voxels or regions of interest (ROIs). Due to their physical adjacency and biological proximities, simple parametric or semi-parametric models, such as logistic regression or generalized additive models, may not be appropriate. Besides the (spatial) dependence among the predictors, there may be network structure in the feature space when modeling the brain images. BDNNs can take these data features into account without any explicit assumptions about their dependence structure. Further, these studies often include additional features in different modes, such as genetic and demographic information, which brings additional complexity when modeling dependence among the features. Thus, machine learning-based approaches, such as deep neural networks, become useful in this type of application. Bayesian deep neural networks have been comprehensively studied by [7], [95], [71], and many others. More recent developments that establish the efficacy of BDNNs can be found in [120], [93], [61], [80], [64] and the references therein. The estimation of the posterior distribution is a key part of Bayesian inference and represents the information about the uncertainties for both data and parameters. However, an exact analytical solution for the posterior distribution is intractable, as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration (see [11]). Several approaches have been proposed for approximating the posterior distribution of the weights of BDNNs, based both on optimization-based techniques, such as variational inference (VI), and on sampling-based approaches, such as Markov chain Monte Carlo (MCMC).

1.3 Variational inference

Markov chain Monte Carlo (MCMC) techniques are typically used to obtain sampling-based estimates of the posterior distribution.
Indeed, BDNNs with MCMC have not seen widespread adoption, due to the computational cost, in terms of both time and storage, on large datasets [66, 94, 132, 139]. In contrast to MCMC, VI tends to converge faster, and it has been applied to many popular Bayesian models, such as factorial models and topic models [79, 9, 8]. We take a variational approximation approach for posterior estimation in the context of deep neural network classification models. The basic idea of VI is to first define a family of variational distributions and then minimize, over that family, the Kullback-Leibler (KL) divergence to the posterior. Many recent works have discussed the application of variational inference to Bayesian deep neural networks, e.g., [50], [11], [121]. Although there is a plethora of literature on variational inference for neural networks, the theoretical properties of the variational posterior in BDNNs remain relatively unexplored, and this limits the use of this powerful computational tool beyond the machine learning community. Previous works that focused on theoretical properties of the variational posterior include the frequentist consistency of variational inference in parametric models in the presence of latent variables (see [135]). Optimal risk bounds for mean-field variational Bayes for Gaussian mixture (GM) and latent Dirichlet allocation (LDA) models have been discussed in [99]. The work of [142] proposed α-variational inference and derived Bayes risk bounds for GM and LDA models. The work of [149] discusses variational posterior contraction rates in Gaussian sequence models, infinite exponential families and piece-wise constant models. The works of [105] and [122] study the posterior contraction rates for Bayesian sparse deep neural network models under spike-and-slab priors.
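The KL-minimization idea behind VI can be illustrated on a toy model. The sketch below (an illustration under stated assumptions, not the thesis's BDNN implementation; all variable names are made up) fits a mean-field Gaussian variational family to the posterior of a conjugate normal-normal model, where the exact posterior is known, so the KL-minimizing solution can be checked directly.

```python
# Minimal sketch of variational inference: choose a family q and minimize
# KL(q || posterior).  The model is the conjugate normal-normal case, so the
# exact posterior N(post_mu, post_var) is available in closed form and the
# KL between two Gaussians has a simple gradient.
import numpy as np

rng = np.random.default_rng(0)
prior_mu, prior_var = 0.0, 10.0      # prior:  theta ~ N(0, 10)
noise_var = 1.0                      # data:   y_i ~ N(theta, 1)
y = rng.normal(2.0, 1.0, size=50)

# Exact posterior from conjugacy.
post_var = 1.0 / (1.0 / prior_var + len(y) / noise_var)
post_mu = post_var * (prior_mu / prior_var + y.sum() / noise_var)

# Variational family q(theta) = N(m, s^2); gradient descent on KL(q || post).
m, log_s = 0.0, 0.0
lr = 0.01
for _ in range(5000):
    s2 = np.exp(2 * log_s)
    grad_m = (m - post_mu) / post_var        # d KL / d m
    grad_log_s = s2 / post_var - 1.0         # d KL / d log_s
    m -= lr * grad_m
    log_s -= lr * grad_log_s

print(m, post_mu)                    # variational mean -> exact posterior mean
print(np.exp(2 * log_s), post_var)   # variational var  -> exact posterior var
```

In the conjugate case the KL-optimal Gaussian recovers the posterior exactly; for a BDNN the posterior is intractable, which is what motivates the stochastic (black-box) gradient estimators discussed later in the thesis.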
Three more closely related works that study the variational posterior are: (1) [3], which discusses the contraction rates of VB in sparse BDNN models with spike and Gaussian slab priors and a mean-field spike and Gaussian slab variational family; (2) [22], which discusses the contraction rates of a tempered VB solution with spike and Gaussian slab priors and a mean-field spike and Gaussian slab variational family; and (3) [6], which discusses the consistency of VB for single-layer neural network models with Gaussian priors and a mean-field Gaussian variational family. All three works focus on a regression setting, unlike our classification setup, which in turn allows for the generalization of VBDNNs to generalized linear models. Further, none of these works discuss computational details or theoretical guidelines for a BDNN to achieve a desired level of accuracy. The work of [6] does not establish contraction rates or deal with deep networks. The works of [22] and [3] establish contraction rates; however, their notions of convergence do not agree with the classical definition of posterior contraction as established in Theorem 2.1 of [47], wherein one needs to find the rates at which the variational posterior gives probability to shrinking Hellinger neighborhoods of the true density. The notion of contraction used in [22] considers the contraction rate of the quantity ||f_0 − f_θ||_1 instead of Hellinger neighborhoods of the true density. The work of [3], on the other hand, considers the posterior expectation of the square of the Hellinger distance instead of the posterior probability of shrinking Hellinger neighborhoods. Note that, in terms of the notion of consistency, our work is similar to those of [105] (Theorem 5.1) and [122] (Theorem 2.1), but in the context of the variational posterior instead of the true posterior.
The derivation of the posterior contraction rates in the classical sense provides the additional advantage of quantifying the loss in classification accuracy of BDNNs incurred by using the VI approach instead of the MCMC approach, a result which, to the best of our knowledge, does not exist in the literature. Additionally, our current work does not assume a sparsity constant s* that controls the overall complexity of the model. We instead start with a dense network and break down the complexity of a deep neural network into three components: (1) the number of layers, (2) the number of nodes and (3) the strength of interactions between active nodes. Then, we study the impact of each of these components on the consistency, contraction rates and classification accuracy of the variational posterior based on the BDNN. Finally, this thesis adapts the control variates and adaptive learning rate approach proposed in [107] to BDNNs. This allows us to analyze the stability of the numerical optimization used for obtaining a variational Bayes solution as a function of the complexity of the model. We would like to emphasize that, unlike in the high-dimensional regression model, the sparsity constant s* is not well defined in a DNN, as the layers can be thought of as a sequence and there should not be any gap between layers.

1.4 Posterior Consistency

To evaluate the validity of a posterior in non-parametric models, one must establish its consistency and contraction rates. Unlike any of the previous works, we establish the posterior consistency and contraction rates of the variational posterior in the classical sense; see Theorems 3.4.1 and 3.4.2. For a simple consistency result, one needs to show that the posterior concentrates around a Hellinger neighborhood of the true density function with overwhelming probability.
A deep neural network model for which the input feature space and the number of layers are fixed enjoys consistency properties irrespective of the true function under study, as long as the total number of parameters grows at a rate smaller than the sample size n. In this direction, we establish that the true posterior probability of an ε-Hellinger neighborhood grows at the rate 1 − 2e^{−nε²/2}, in contrast to the slower rate 1 − a, a → 0 as n → ∞, for the variational posterior. For establishing the rates of contraction, one needs to show that the posterior concentrates around shrinking Hellinger neighborhoods of the true density with overwhelming probability. To determine the rates of contraction, one needs assumptions on the neural network solution that approximates the true function, together with the total number of parameters being less than n. Treating the input feature space as the number of nodes in the 0th layer, we find that the approximating neural network solution must satisfy three main properties: (1) the number of layers grows at a rate smaller than log n, (2) the number of nodes in each layer is well controlled and (3) the number of connections between active nodes is well controlled. In this direction, we establish that the true posterior probability of a shrinking ε_n-Hellinger neighborhood grows at the rate 1 − 2e^{−nε_n²/2}, in contrast to the slower 1 − a rate for the variational posterior. For the BDNN, we next establish the connection between the posterior contraction rate and classification accuracy. In this direction, we first show that the classification accuracy of a consistent posterior asymptotically approaches the Bayes classifier's classification accuracy. With no assumptions on the true function, we show that, for a deep neural network model in which the number of input features and the number of layers are fixed, the convergence rates of the classification accuracy are the same for both the variational approximation and the true posterior.
However, under suitable assumptions on the approximating neural network solution as described in the above paragraph, we establish that the classification accuracy of the variational posterior approaches the classification accuracy of the Bayes classifier at the rate ε_n^{2/3}, in contrast to the faster rate ε_n for the true posterior. This theoretical discovery quantifies the loss due to using the variational posterior instead of the true posterior density. We provide prior elicitation for Bayesian estimation. Our detailed mathematical treatment provides theoretical guidelines for selecting the prior distributions that might affect prediction accuracy. For example, even if one works with fairly vague priors, there is a limit on the hyper-parameter values one can choose while still achieving a desired level of consistency. We also discuss how the choice of variational distribution, along with the prior distribution, affects posterior consistency. Besides the theoretical validation, the challenges of implementing a VI-based approach are twofold: (1) the choice of the variational family and (2) the optimization of the KL-divergence. For the first issue, we show that a simple mean-field Gaussian variational family suffices for posterior consistency along with good numerical performance. For the second issue, this thesis discusses the associated computational challenges of using a VI approach and provides statistically principled guidelines to overcome them. We first adapted the black-box variational inference (BBVI) algorithm in [107] to classification based on BDNNs and used Monte Carlo estimates of the gradient of the evidence lower bound (ELBO) for stochastic optimization of the variational parameters. We then adapted the control variates approach of [107] to allow for faster convergence to the solution. We found that control variates offer a great deal of improvement in run time when using one or two layers.
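The BBVI ingredients just described (a score-function Monte Carlo estimate of the ELBO gradient plus a control variate) can be sketched for a single variational parameter. The toy target below is an illustrative assumption, not the thesis code: q = N(μ, 1) approximating p = N(2, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumption): approximate p = N(2, 1) with q = N(mu, 1).
def log_p(z):
    return -0.5 * (z - 2.0) ** 2

def log_q(z, mu):
    return -0.5 * (z - mu) ** 2

def score(z, mu):            # d/dmu log q(z; mu)
    return z - mu

def grad_elbo(mu, n_samples=500, use_cv=True):
    """Score-function (BBVI) estimate of the ELBO gradient w.r.t. mu."""
    z = rng.normal(mu, 1.0, size=n_samples)
    f = score(z, mu) * (log_p(z) - log_q(z, mu))   # naive estimator terms
    if not use_cv:
        return f.mean()
    h = score(z, mu)                               # control variate with E[h] = 0
    a = np.cov(f, h)[0, 1] / h.var()               # estimated optimal scaling
    return (f - a * h).mean()                      # same mean, smaller variance

# Repeat the estimate many times: the control variate leaves the mean
# unchanged but shrinks the Monte Carlo variance.
naive = [grad_elbo(0.0, use_cv=False) for _ in range(200)]
cv = [grad_elbo(0.0, use_cv=True) for _ in range(200)]
print(np.var(naive) > np.var(cv))
```

Both estimators are unbiased for the true gradient (here 2.0 at μ = 0); the control variate only changes the spread, which is what makes larger networks trainable with fewer Monte Carlo samples.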
With an increase in the number of layers, we observed that adaptive learning rates such as Adagrad, as in [107], offer a huge advantage by stabilizing the optimization. We, however, propose the use of RMSprop due to its superior performance over Adagrad. Finally, we discuss in detail the selection of the learning rate, the number of Monte Carlo samples and other tuning controls in the context of the variational Bayes implementation.

1.5 Bayesian Compressed Neural Networks

Bayesian neural networks (BNNs) have achieved state-of-the-art results in a wide range of tasks, especially in high-dimensional data analysis, including image recognition, biomedical diagnosis and others. One of the major disadvantages of neural networks and deep networks is that they require a large amount of training data due to the large number of inherent parameters [140, 45]. For example, high-dimensional neural networks have been widely applied with regularization, dropout techniques or early stopping to prevent overfitting [118, 143]. Furthermore, the most commonly used dimension reduction techniques include the Lasso [17], Ridge [58], Elastic net [152], Sparse group lasso [116], Bayesian Lasso [98], Horseshoe prior [16] and principal component analysis [115]. Even though the ℓ1 and ℓ2 norms can force the weights to become zero or small, they do not have the regularizing effect of making the computed function simpler [70]. Additionally, all these methods rely on the use of the whole data, which severely increases the cost of both computation and memory storage. In this thesis, we propose the use of a BNN on a compressed feature space to take care of the large p, small n problem by projecting the feature space onto a smaller dimensional space using a random projection matrix. Random projection (RP) is a powerful dimension reduction technique which uses RP matrices to map data into low-dimensional spaces.
The use of RP in high-dimensional statistics is motivated by the Johnson–Lindenstrauss lemma [27], which states that for x_1, ..., x_n ∈ ℝ^p, ε ∈ (0, 1) and d > 8 log n / ε², there exists a linear map f : ℝ^p → ℝ^d such that (1 − ε)||x_i − x_j||₂² ≤ ||f(x_i) − f(x_j)||₂² ≤ (1 + ε)||x_i − x_j||₂² for i, j = 1, ..., n. The properties of RPs and their applications to statistical problems were further explored in [33, 13], among others. To reduce the sensitivity to the choice of random matrices, one must pool information obtained from multiple projections. In this thesis, we adopt a Bayesian model averaging approach for combining information across multiple instances of RP-based neural networks. There are two main challenges in implementing Bayesian model averaging: (1) due to the convoluted structure of the neural network likelihood, closed-form expressions do not exist for the posterior distribution under each model; (2) the posterior distribution of the model weights is completely intractable and no closed-form solutions exist. Thereby, the implementation of standard Markov Chain Monte Carlo (MCMC) is next to impossible. Further, the computation and storage costs associated with an MCMC implementation are enormous, since each posterior model weight depends on the remaining models' posterior model weights. To address the challenges of MCMC implementation, we use the variational inference (VI) [63, 9] approach to provide an approximate solution for Bayesian model averaging (BMA), allowing us to combine BNNs across multiple instances of compression on the feature space. There is a plethora of literature implementing variational inference in neural networks [10]. However, these implementations make use of the entire feature space, thereby putting a great burden on computational stability and memory storage.
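The Johnson–Lindenstrauss bound above is easy to check empirically. The sketch below (the dimensions and the 1/√d scaling of the Gaussian RP matrix are illustrative assumptions) projects a small point cloud and reports how many pairwise distances fall within the (1 ± ε) distortion band.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions (assumptions, not the thesis data): n points in R^p.
n, p, eps = 50, 5000, 0.5
d = int(np.ceil(8 * np.log(n) / eps**2))       # d > 8 log n / eps^2, as in the lemma

X = rng.normal(size=(n, p))
R = rng.normal(size=(p, d)) / np.sqrt(d)       # Gaussian RP: E||x R||^2 = ||x||^2
Y = X @ R                                      # compressed features in R^d

def sq_dists(A):
    """Squared Euclidean distances between all pairs of rows."""
    G = A @ A.T
    s = np.diag(G)
    return s[:, None] + s[None, :] - 2.0 * G

iu = np.triu_indices(n, k=1)
ratio = sq_dists(Y)[iu] / sq_dists(X)[iu]
ok = np.mean((ratio >= 1 - eps) & (ratio <= 1 + eps))
print(f"d = {d}, fraction of pairs within (1 +/- eps): {ok:.3f}")
```

Since the guarantee is probabilistic, a few pairs may exceed the band for any single draw of R; pooling across several draws, as the BMA approach described above does, reduces this sensitivity to the particular random matrix.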
We address two main challenges in this thesis: (1) developing a variational Bayes (VB) solution for BNNs with a compressed feature space and (2) providing a VB solution for performing BMA across multiple instances of RP. Further, we establish the posterior contraction rates for the variational posterior for classification (the theory is extendable to the regression setup with minor modifications). In this direction, we provide a characterization of the prior, the variational posterior and the RP matrix which guarantees the convergence of the variational Bayes neural network (VBNN) under the compressed feature space to the true density of the observations. The main advantage of implementing a BMA approach is that it gives the posterior model weights under each compression of the feature space. The posterior model weights so obtained in turn induce a probability distribution on the projected dimension of the feature space. The mode of this probability distribution concentrates around the intrinsic dimensionality of the feature space. The BMA approach is then applied to a pool of RP matrices whose projected dimension lies in a neighborhood of the intrinsic dimensionality to improve the prediction performance. Finally, we study the numerical behavior of the proposed procedure in the light of simulation and real data sets. To the best of our knowledge, no literature provides theoretical guarantees and computational algorithms for VBNNs with a compressed feature space.

CHAPTER 2

A ROLE FOR PRIOR KNOWLEDGE IN STATISTICAL CLASSIFICATION OF THE TRANSITION FROM MCI TO ALZHEIMER'S DISEASE

2.1 Introduction

The transition from mild cognitive impairment (MCI) to dementia is of great interest to clinical research on Alzheimer's disease (AD) and related dementias. This phenomenon also serves as a valuable data source for quantitative methodological researchers developing new approaches for classification.
However, the growth of machine learning (ML) approaches for classification may falsely lead many clinical researchers to underestimate the value of logistic regression (LR), which often demonstrates classification accuracy equivalent or superior to other ML methods. Further, when faced with many potential features that could be used for classifying the transition, clinical researchers are often unaware of the relative value of different approaches for variable selection. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), the present study investigated automated and theoretically guided feature selection techniques in the context of LR and support vector machine (SVM) classification methods for predicting conversion from MCI to dementia. The present findings demonstrate how similar performance can be achieved using user-guided, clinically informed pre-selection versus algorithmic feature selection techniques. These results show that although SVM and other ML techniques are capable of relatively accurate classification, similar or higher accuracy can often be achieved by LR, mitigating SVM's necessity or value for many clinical researchers.

2.2 Transition from MCI to dementia

Alzheimer's disease (AD) is a progressive, age-related, neurodegenerative disease and the most common cause of dementia [147, 148, 67]. Behaviorally, Alzheimer's dementia is commonly preceded by mild cognitive impairment (MCI), a syndrome characterized by declines in memory and other cognitive domains that exceed the cognitive decrements associated with normal aging [148, 103]. However, the prodromal symptoms of MCI are not prognostically deterministic: individuals with MCI tend to progress to diagnoses of probable AD at a rate of 8%–15% per year, and many conversions are detectable within 3 years of initial presentation [24, 44, 2].
Research efforts to provide new insights into the incidence of MCI-to-AD conversion have focused largely on clinically or biologically relevant features (i.e., neuroimaging markers, clinical exam data, neuropsychological test scores) and on different methods for statistical classification [145]. For clinical researchers, however, there may be a tendency to conflate more sophisticated, novel analytic approaches with the value of multimodal information from neuroimaging and clinical assessment. Moreover, whereas statisticians may inherently understand the comparability of different quantitative approaches, the novelty of both big data and data-driven approaches for studying MCI-to-AD conversion may lead clinical researchers to assume that such data-driven methods are inherently superior to more theoretically grounded approaches. Thus, the value of using extant findings and domain expertise to help guide and constrain the application of newer data-driven approaches capable of capitalizing on emergent big data may be a particularly important consideration for clinical researchers. Statistical classification in clinical research has traditionally utilized binary logistic regression (LR). However, key attributes of modern clinical and neuroimaging data, including high dimensionality and the presence of ground-truth estimates of pathology and diagnosis, provide new opportunities for quantitative research. This has led to a substantial expansion in the use of data from the Alzheimer's Disease Neuroimaging Initiative (ADNI; http://adni.loni.usc.edu) for quantitative research and methodological development, particularly by researchers utilizing and developing prediction and classification methods in machine learning (ML). Besides LR, the support vector machine (SVM) has quickly become the most common type of ML classifier for diagnostic prediction and classification with ADNI data.
In general, LR works well when the data are linearly separable and the number of observations is greater than the number of features. Moreover, SVM and LR have similar misclassification rates (MCRs) when used to diagnose malignant tumors from imaging data [19, 30]. Indeed, before the rapid expansion of ML research and applied work over the past decade, many clinical researchers and those outside of engineering and mathematically intensive disciplines had little exposure to classification approaches other than LR. Despite its growing popularity, the relative benefits of SVM or other forms of ML [101, 87] over LR for such classification are not always apparent. Although this may be of little surprise to statisticians and quantitative researchers, such perspectives are often lost on clinical researchers, whose implicit belief in the superiority of ML is driven by the volume of publications, rather than by training or empirical demonstration. Most efforts to develop new classification methods for prediction of MCI-to-AD conversion are well suited to integrate measures from multiple sources such as demographics, clinical rating scores, neuropsychological testing, neuroimaging, genetic markers, etc. However, identifying which combination of features most accurately classifies conversion from MCI to AD is a key challenge for ADNI, and may vary by method. The ℓ1 norm regularization method (i.e., the lasso) is a widely used feature selection technique for LR and SVM. ℓ1 regularization is popular for addressing circumstances in which the number of features is quite large or even larger than the sample size. At some risk of abusing statistical terminology, this problem is often generically referred to as the "small n, large p" or high-dimensional problem. The ℓ1 technique has a dual impact: the algorithm can (i) optimize a higher number of parameters relative to the sample size, and (ii) reduce the effective number of parameters (i.e., perform variable selection).
This powerful technique has been implemented in ADNI data with LR [144]. However, ℓ1 and other algorithmic feature selection methods used in ML suffer from one key limitation: they are agnostic to theoretical considerations, and as such, they cannot explain why selected features are meaningful and important to the model. When sampling from a large pool of features, algorithmic approaches fail to incorporate prior knowledge of features and their associations with the relevant systems into variable selection. Therefore, domain expertise and prior knowledge may afford additive or differential value for choosing features and interpreting model results over algorithmic feature selection methods alone. Most real-world problems occur in the context of additional information about each potential feature and its conceptual relationship with the phenomenon being classified. Other than using ℓ1 feature selection, manually trimming the list of potential predictor variables can also protect against over-fitting, and offers potential insight into why selected features are important to the model. When guided by prior knowledge, user-guided or 'manual' feature selection may be a valuable additional step to help minimize potentially spurious effects. This perspective is frequently lost on applied researchers, as most commonly used variable selection algorithms are context-free; that is, they only look at relationships within the data set and cannot factor in the wider meanings of variables. Furthermore, this also means that automated algorithms may identify relationships among a large number of predictor variables that are spurious and unlikely to generalize outside the data set.
Although there are a vast number of potential neuroimaging features in ADNI data, the present study focused only on regional brain volumes segmented from structural magnetic resonance imaging (MRI) data, the most commonly used neuroimaging data type for classifying MCI-to-dementia conversion. In contrast to prior studies that used a limited set of volumetric brain features, the present study utilized data generated by modern multi-atlas segmentation methods, and analyses included up to 259 features: anatomically specific gray and white matter volumes. The large pool of extant findings from studies evaluating regional brain MRI volumetry in prediction and classification of MCI-to-dementia conversion, using both limited and expansive feature sets, also provides a valuable set of priors for relevant brain regions [18, 43, 123, 91, 42, 108]. Thus, applied researchers are often left with the conundrum of choosing between more confirmatory approaches that use few regions in classification and more exploratory methods in which prior findings have little value. The present study addressed two questions regarding commonly used classification approaches for predicting MCI-to-dementia conversion in multi-modal data from ADNI. First, we compared the performance accuracy of binary LR with SVM in classifying MCI-to-dementia conversion. Second, we asked whether applying prior knowledge in feature selection outperforms algorithmic variable selection alone. We hypothesized that (1) LR would perform comparably to SVM, and (2) user-guided variable selection would outperform algorithmic variable selection alone. This work is intended to demonstrate to clinical researchers the benefit of using ML in an informed fashion, rather than as a 'black box' that obscures clear interpretation.
Moreover, we wish to emphasize that this study is not meant to highlight a novel innovation in quantitative methods, but rather to provide an important example to applied researchers regarding the comparable value of ML methods and the importance of domain expertise in classification with ADNI data.

2.3 Materials and Data

The data used in the preparation of this study were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI). ADNI is an ongoing joint public-private effort to utilize neuroimaging, other biological markers, and clinical and neuropsychological assessment to measure the incidence and progression of MCI to early dementia. Determination of sensitive and specific markers of preclinical AD and MCI is intended to help researchers and clinicians develop new treatments and monitor their effectiveness, as well as reduce the time and cost of clinical trials. Data in the present study came from all sites across the U.S. and Canada. All ADNI study participants included in the present analyses were between 55 and 90 years old, spoke English or Spanish as their native language, and had a study partner who provided an independent assessment of functioning. This study used a subset of the 819 participants from ADNI-1 diagnosed with MCI at baseline and for whom demographic, clinical cognitive assessment, APOE4 genotyping, and MRI measurements were also available. To evaluate differences in classification performance due to participant inclusion and dropout, we subdivided the sample into two overlapping groups. After applying other criteria for inclusion, Group One included all patients whose follow-up period was at least 36 months (n = 265); Group Two consisted of all patients with follow-up assessments at 24 months (n = 308).
Although the ADNI study protocol includes additional follow-up visits at 6-month intervals, the present study only evaluated baseline data for features (i.e., clinical, neuropsychological, brain volumetric) in the classification analyses. In addition, identification of stable vs. converting clinical outcomes only considered longer-term outcomes based on assessments at 2 and 3 years after baseline. The final samples included 265 and 308 study participants in Groups One and Two, respectively, who met the criteria for inclusion. Both groups included participants who were stable in their diagnosis (MCI-S) and those who converted to a diagnosis of dementia over the 2 or 3 years (MCI-C). Table 2.1 shows the participant characteristics. Diagnostic criteria for MCI included an MMSE score at baseline between 24 and 30, a CDR score of 0.5, and a subjective memory complaint, in addition to objective memory loss measured by education-adjusted scores on the Logical Memory II subscale of the Wechsler Memory Scale, generally preserved activities of daily living, and no dementia. The diagnostic criteria for dementia were an MMSE score between 20 and 26, and a CDR score between 0.5 and 1.0. The clinical status of each participant diagnosed with MCI was re-assessed at each follow-up visit and updated to reflect one of several outcomes (e.g., MCI or dementia subtypes). The MCI-C and MCI-S group designations were based on this follow-up clinical diagnosis and coded as 1 for MCI-C or 0 for MCI-S in the classification study.

Table 2.1: Sample Sizes by Timing and Diagnosis: Group One and Two

Group   Time        # MCI-S (y=0)   # MCI-C (y=1)   # Total patients
One     36 months   101             164             265
Two     24 months   122             186             308

The table shows the number of MCI-C, MCI-S and total subjects in Groups One and Two. The number of MCI-C patients is higher than that of MCI-S patients in both groups.
2.3.1 Data Used in Classification

Evaluation of extant reports of common predictors of conversion from MCI to dementia focused on dimensions of neuropsychological test performance, clinical assessment, genetic data, and regional brain volumes. In the present study, we first divided these variables into two sets of features, with all non-brain volumetric variables in one set and all variables representing regional brain volumes in a second set. In addition, we created a third set of features from the volumetry feature set that only included 26 of the 259 brain volumes. Henceforth, we refer to models that only include one of these three feature sets as 'single-modality,' whereas models that combine brain and non-brain feature sets are referred to as 'multi-modal.'

Figure 2.1: Comparison of distributions for baseline predictor variables between the MCI-S and MCI-C groups. (a) MMSE scores: the mean MMSE score in MCI-S is higher than in MCI-C. (b) Learning scores: the mean Learning scores of the MCI-C and MCI-S groups are 2.5 and 5, respectively.

2.3.2 Clinical Cognitive Assessment and Genetic data

We considered a total of 19 clinical features as potential predictors of MCI-to-AD progression in our classification analyses. These included the following assessment scores: the Mini-Mental State Examination (MMSE), Clinical Dementia Rating Sum of Boxes (CDR-SB), Alzheimer's Disease Assessment Scale-cognitive subscale (ADAS-cog), Functional Activities Questionnaire (FAQ) measures of activities of daily living, Trail Making Test-B (TRABSCOR), the immediate and delayed recall components of the Rey Auditory Verbal Learning Test (RAVLT), the Digit-Symbol Coding test (DIGT) and the Digit Symbol Substitution Test from the Preclinical Alzheimer Cognitive Composite (mPACCdigit). We also considered carrier status for the epsilon-4 allele of the apolipoprotein E (APOE) gene [145] as a genetic predictor in this study.
Table 2.2 summarizes all 19 clinical, demographic and genetic features used in this study. Preliminary comparison of six clinical and genetic predictors by MCI-C and MCI-S subgroups showed that five of them (APOE4, ADAS4, CDR, MMSE and RAVLT.learning) significantly differ between the groups, whereas one (SEX) does not. Figs. 2.1 and 2.2 illustrate the distributions of these predictors for both groups. Overall, in comparison to MCI-S participants, those in the MCI-C group were more cognitively and functionally impaired at baseline, exhibited greater verbal memory impairments, and included a greater proportion of APOE4 carriers.

Figure 2.2: Comparisons between MCI-S and MCI-C groups on baseline predictor variables: (a) sex distributions, (b) APOE4 genotype distributions, (c) CDR distributions, (d) ADAS distributions. The y-axis of panels (a) through (d) represents the number of participants developing AD. Blue and red bars represent non-converters and converters, respectively. Panel (a) shows a greater number of converters than non-converters for both men and women. Panel (b) shows that more than half of MCI-C subjects are APOE4 carriers and approximately 70% of MCI-S subjects are non-carriers. Panel (c) shows that MCI-S subjects have relatively lower CDR scores and MCI-C subjects have higher CDR scores; the number of people in the MCI-C group trends downward as the CDR score increases. Panel (d) shows that MCI-C subjects have relatively higher ADASQ4 scores; the average ADASQ4 scores of MCI-S and MCI-C subjects are approximately 5 and 8, respectively.
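The group comparisons reported in Table 2.2 combine two standard tests: t-tests for continuous baseline measures and a chi-square test for categorical ones such as APOE4 carrier status. A minimal sketch with made-up illustrative numbers (not the actual ADNI values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical baseline MMSE scores for the two groups (illustrative only).
mmse_s = rng.normal(27.6, 1.7, size=101)   # MCI-S
mmse_c = rng.normal(26.8, 1.7, size=164)   # MCI-C
t, p_t = stats.ttest_ind(mmse_s, mmse_c)   # two-sample t-test

# Hypothetical APOE4 carrier counts by group (columns: carrier, non-carrier).
table = np.array([[35, 66],                # MCI-S
                  [102, 62]])              # MCI-C
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

print(f"t = {t:.2f} (p = {p_t:.4f}); chi2 = {chi2:.2f} (p = {p_chi:.4f}), dof = {dof}")
```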
Table 2.2: Clinical Features and Cognitive Assessment Scores of Group One

Characteristics          MCI-S             MCI-C             Test statistic   P-value
Age (years)              74.34 ± 7.78      74.84 ± 6.83      -0.528           > 0.50
Education (years)        15.57 ± 2.94      15.73 ± 2.91      -0.527           > 0.51
Sex, % female            33.67%            34.14%            0                11
APOE4 carriers, %        34.65%            62.19%            17.900           < 0.001
CDRSB                    1.23 ± 0.61       1.72 ± 0.92       -5.237           < 0.001
MMSE score               27.61 ± 1.74      26.82 ± 1.71      3.645            < 0.001
ADAS11                   8.89 ± 3.79       12.29 ± 4.16      -6.823           < 0.001
ADAS13                   14.48 ± 5.50      20.01 ± 5.79      -7.795           < 0.001
ADASQ4                   4.76 ± 2.19       6.77 ± 2.21       -7.339           < 0.001
RAVLT.immediate          36.21 ± 10.10     29.10 ± 7.98      6.021            < 0.001
RAVLT.learning           4.19 ± 2.47       2.91 ± 2.26       4.231            < 0.001
RAVLT.forgetting         4.31 ± 2.59       4.47 ± 2.15       -1.501           0.135
RAVLT.perc.forgetting    51.55 ± 31.04     72.85 ± 30.45     -5.464           < 0.001
LEDLTOTAL                4.96 ± 2.36       3.41 ± 2.66       4.931            < 0.001
DIGTSCOR                 40.75 ± 11.09     36.72 ± 10.96     2.883            < 0.005
TRABSCOR                 109.43 ± 62.94    132.09 ± 71.36    -2.704           0.007
FAQ                      1.50 ± 2.99       4.96 ± 4.79       -7.243           < 0.001
mPACCdigit               5.376 ± 2.96      8.06 ± 2.96       7.174            < 0.001
mPACCtrailsB             5.47 ± 3.06       8.22 ± 2.98       7.174            < 0.001

The table reports Group One only (265 patients; 36-month follow-up). Values are shown as mean ± standard deviation or percentage. Test statistics and P-values for differences between MCI-S and MCI-C are based on (a) t-tests or (b) chi-square tests. MCI-S = non-progressive MCI; MCI-C = progressive MCI; APOE = apolipoprotein E; MMSE = Mini-Mental State Examination.
RAVLT = Rey Auditory Verbal Learning Test (immediate: sum of trials 1–5; learning: trial 5 − trial 1; forgetting: trial 5 − delayed; perc.forgetting: percent forgetting); DIGT = Digit-Symbol Coding test; TRAB = Trail Making Test; CDRSB = Clinical Dementia Rating Sum of Boxes; FAQ = Activities of Daily Living score; ADAS = Alzheimer's Disease Assessment Scale–cognitive subscale; mPACCdigit = Digit Symbol Substitution Test from the Preclinical Alzheimer Cognitive Composite.

2.3.3 MRI data

Structural MRI data were collected according to the ADNI acquisition protocol using T1-weighted scans (GradWarp, B1 Correction, N3, Scaled) [36]. These data included baseline structural MRI scans of 840 ADNI participants, including 230 diagnosed as cognitively normal, 200 with diagnoses of dementia, and 410 diagnosed with MCI. Processing for the ROI-based volumetric data used in the present study included brain extraction [34] and a multi-atlas, consensus-based label fusion scheme for anatomical parcellation [35] to generate template-based ROIs deformed to individual subject space. MRI scans were automatically segmented into 145 anatomic regions of interest (ROIs) spanning the entire brain. An additional 114 derived ROIs were calculated by combining single ROIs within a tree hierarchy to obtain volumetric measurements from larger structures [36]. In total, 259 ROIs were measured and used as potential predictors of MCI-to-dementia progression in this study. One of the goals of this study is to investigate whether manually selecting predictors improves a model's performance. Based on the extant literature [68], we manually selected 26 of the 259 features as theoretically significant predictors of MCI-to-dementia progression (Table 2.3) [18, 43, 123, 91, 42, 108].
While many brain regions have been reported as showing some relationship to MCI-to-dementia progression, prior reports and reviews clearly implicate hippocampal and entorhinal cortical volumes as markers of such conversion. In addition, we manually selected additional regions based on their common occurrence across reports, including the cingulate gyrus, precuneus, amygdala, inferior frontal gyrus, superior parietal lobule, and lobar white matter volumes.

2.4 Method and Algorithm

In the following section, we utilize binary LR and SVM classification techniques to investigate which approach yields superior discrimination accuracy in the context of ADNI data. Prior comparisons of logistic regression and SVM have reported that SVM requires fewer variables than logistic regression to achieve an equivalent misclassification rate (MCR) [131, 30]. These reports also indicate that SVM performs better than LR with microarray expression data [30]. Furthermore, SVMs have a nice dual form, giving sparse solutions when using the kernel trick. In addition, both methods involve minimizing a cost associated with misclassification based on the likelihood ratio of a probabilistic model. Therefore, LR and SVM share common roots in statistical pattern recognition, which we utilize in the comparison of their performance on multi-modal ADNI data.

Table 2.3: Pre-selected MRI Features of Group One

Characteristics   MCI-S             MCI-C             Test statistic   P-value
HippoR            3684 ± 438        3366 ± 437        5.735            < 0.001
HippoL            3414 ± 418        3105 ± 388        5.994            < 0.001
flWMR             96720 ± 6218      96976 ± 5585      -0.338           0.73
flWML             93671 ± 5836      94238 ± 5160      -0.802           0.42
plWMR             47197 ± 3415      47141 ± 3098      0.135            0.89
plWML             50149 ± 3714      50038 ± 3467      0.242            0.81
tlWMR             56076 ± 3252      55934 ± 2931      0.359            0.72
tlWML             55412 ± 3396      55468 ± 3023      -0.136           0.89
ACgCR             3167 ± 756        3128 ± 641        0.438            0.66
ACgCL             4104 ± 787        4075 ± 689        0.312            0.76
EntR              2189 ± 365        1983 ± 373        4.412            < 0.001
EntL              2050 ± 399        1844 ± 356        4.240            < 0.001
MCgCR             4176 ± 547        4200 ± 541        -0.341           0.73
MCgCL             3988 ± 493        4002 ± 559        -0.213           0.83
MFCR              1581 ± 342        1505 ± 524        1.805            0.07
MFCL              1566 ± 285        1548 ± 291        0.487            0.62
OpIFGR            2575 ± 608        2425 ± 546        2.021            0.04
OpIFGL            2465 ± 550        2361 ± 579        1.466            0.14
OrIFGR            1252 ± 315        1196 ± 362        1.322            0.18
OrIFGL            1514 ± 335        1398 ± 356        2.658            < 0.001
PCgCR             3679 ± 466        3528 ± 415        2.657            < 0.001
PCgCL             3991 ± 442        3789 ± 424        3.676            < 0.001
PCuR              10129 ± 1193      9862 ± 1313       1.701            0.09
PCuL              10005 ± 1263      9759 ± 1299       1.522            0.13
SPLR              8867 ± 1140       8693 ± 1219       1.180            0.02
SPLL              8880 ± 1192       8662 ± 1313       1.390            0.17

Values are shown as mean ± standard deviation. Test statistics and P-values for differences between MCI-C and MCI-S are based on t-tests. MCI-S = non-progressive MCI; MCI-C = progressive MCI. HippoR/HippoL = right/left hippocampus; flWMR/flWML = right/left frontal lobe WM; plWMR/plWML = right/left parietal lobe WM; tlWMR/tlWML = right/left temporal lobe WM; ACgCR/ACgCL = right/left anterior cingulate gyrus; EntR/EntL = right/left entorhinal area; MCgCR/MCgCL = right/left middle cingulate gyrus; MFCR/MFCL = right/left medial frontal cortex; OpIFGR/OpIFGL = right/left opercular part of the inferior frontal gyrus; OrIFGR/OrIFGL = right/left orbital part of the inferior frontal gyrus; PCgCR/PCgCL = right/left posterior cingulate gyrus; PCuR/PCuL = right/left precuneus; SPLR/SPLL = right/left superior parietal lobule.

2.4.1 Logistic Regression

Logistic regression (LR) is the most commonly used machine learning approach for binary classification.
In the past decade, LR has been applied to the task of classifying MCI-to-dementia conversion [29, 144, 82]. In the present study, we consider a supervised learning task where we are given M training examples D = {(x_i, y_i), i = 1, ..., M}. Here each x_i ∈ ℝ^N is an N-dimensional feature vector, and y_i ∈ {0, 1} is a class label. The goal of LR is to model the probability p of a random variable y being 1 or 0 given the experimental data x. The logit, the natural logarithm of the odds, is the key concept underlying logistic regression:

logit(p) = log( p / (1 − p) )   (2.1)

The equation for LR is:

log[ P(y_i = 1 | x_i; β) / (1 − P(y_i = 1 | x_i; β)) ] = Σ_{j=1}^{N} β_j x_ij   (2.2)

where β = (β_1, ..., β_N)^T are the parameters or weights of the logistic regression model and x_i = (x_i1, ..., x_iN), i = 1, ..., M. Here, P(y_i = 1 | x_i; β) is the probability that the i-th MCI patient will develop dementia and P(y_i = 0 | x_i; β) is the probability that the i-th MCI patient will not. Denoting P(y_i = 1 | x_i; β) = h(x_i), we have

h(x_i) = 1 / ( 1 + exp(−Σ_{j=1}^{N} β_j x_ij) )   (2.3)

LR is usually trained by minimizing an error function; an appropriate choice for binary classification problems is the cross-entropy error:

e_i(β) = −y_i log(h(x_i)) − (1 − y_i) log(1 − h(x_i))   (2.4)

The total cost over the data D = {(x_i, y_i), i = 1, ..., M} is:

J(β) = (1/M) Σ_{i=1}^{M} [ −y_i log(h(x_i)) − (1 − y_i) log(1 − h(x_i)) ]   (2.5)

Consider the problem of finding the maximum likelihood estimate (MLE) of the parameters β for the unregularized logistic regression model. To find the optimal weights β, the total cost must be minimized:

β_optimal = argmin_β (1/M) Σ_{i=1}^{M} [ −y_i log(h(x_i)) − (1 − y_i) log(1 − h(x_i)) ]   (2.6)

Solving Eq. (2.6) yields the optimal weights β. However, the model-building challenge is to abstract the underlying distribution from the particular instance D of samples, because of the relatively small sample size compared to the number of features.
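The minimization in Eq. (2.6) has no closed form, so in practice the cross-entropy cost is minimized iteratively. A minimal gradient-descent sketch on synthetic data (the dimensions, true weights and learning rate below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: M examples, N features, known "true" weights (illustrative).
M, N = 200, 5
X = rng.normal(size=(M, N))
beta_true = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
y = (rng.uniform(size=M) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def cost(beta):
    """Cross-entropy cost J(beta) of Eq. (2.5)."""
    h = 1 / (1 + np.exp(-X @ beta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

beta = np.zeros(N)
lr = 0.5
for _ in range(2000):
    h = 1 / (1 + np.exp(-X @ beta))
    beta -= lr * X.T @ (h - y) / M     # gradient of Eq. (2.5) w.r.t. beta

print(cost(np.zeros(N)) > cost(beta))  # training reduced the cost
```

The gradient step uses the well-known identity that the derivative of Eq. (2.5) with respect to β is (1/M) X^T (h − y), which follows from differentiating the cross-entropy through the sigmoid of Eq. (2.3).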
The problem of replicating the data set instead of identifying the underlying distribution is known as overfitting [37]. To avoid overfitting, it is often necessary to apply a dimension-reduction technique. The L_1 and L_2 norms are widely used for this purpose, especially when there is only a small number of training examples or a large number of features to be learned. The L_1 norm (lasso) is also often used for feature selection and has been shown to generalize well in the presence of many irrelevant features [76, 109]. L_1 regularization is implemented by adding the L_1 norm to the cost function; the cost function and the optimization problem become

    J(\beta) = \frac{1}{M} \sum_{i=1}^{M} \left[ -y_i \log(h(x_i)) - (1 - y_i) \log(1 - h(x_i)) \right] + \lambda \|\beta\|_1    (2.7)

and

    \beta_{\mathrm{optimal}} = \arg\min_{\beta} \left\{ \frac{1}{M} \sum_{i=1}^{M} \left[ -y_i \log(h(x_i)) - (1 - y_i) \log(1 - h(x_i)) \right] + \lambda \|\beta\|_1 \right\}    (2.8)

where \lambda is a positive tuning parameter. Eq. (2.8) is referred to as L_1-regularized logistic regression.

2.4.2 Support Vector Machine

The support vector machine (SVM) is another classification and regression method that can handle high-dimensional feature vectors. Algorithmically, SVMs build optimal boundaries between data sets by solving a constrained quadratic optimization problem [23, 112, 129, 128, 127]. The number of studies applying SVM to the classification of conversion from MCI to dementia has grown over the past decade [147, 148, 24, 145, 56, 68, 31, 59, 28, 136].

We briefly review the basic support vector machine with a linear kernel (SVM-linear) for classification problems. Let \beta^\top h(x) + \beta_0 = 0 denote a hyperplane (decision surface) equidistant from the closest point of each class in the new space. The goal of the SVM is to find \beta and \beta_0 such that |\beta^\top h(x) + \beta_0| = 1 for the points closest to the hyperplane.
In the following classifier construction, one assumes that

    \beta^\top h(x_i) + \beta_0 \;\; \begin{cases} \geq 1, & \text{if } y_i = 1 \\ \leq -1, & \text{if } y_i = 0 \end{cases}    (2.9)

such that the distance from the closest point of each class to the hyperplane is 1/\|\beta\| and the distance between the two groups is 2/\|\beta\|. To maximize the margin, the SVM requires the solution of the following primal optimization problem [151]:

    \min_{\beta, \beta_0} \; \sum_{i=1}^{M} \left\{ 1 - y_i \left[ \beta_0 + \sum_{j=1}^{N} \beta_j h_j(x_{ij}) \right] \right\}_{+}    (2.10)

where h_j is the kernel function, which is linear for SVM-linear; specifically, we choose h_j(x_j) = x_j for the j-th covariate. To make the algorithm work with highly correlated features and improve the fitted model's prediction accuracy, we reformulate the optimization by adding the L_1 norm of \beta, i.e., the lasso penalty:

    \min_{\beta, \beta_0} \; \sum_{i=1}^{M} \left\{ 1 - y_i \left[ \beta_0 + \sum_{j=1}^{N} \beta_j h_j(x_{ij}) \right] \right\}_{+} + \lambda \|\beta\|_1    (2.11)

where \lambda is the tuning parameter that controls the trade-off between loss and penalty. The lasso penalty shrinks the fitted coefficients \beta toward zero and hence benefits from the reduction in the variance of the fitted coefficients.

2.4.3 Experimental Design

We built four different classifiers, each designed to classify individual ADNI participants as belonging to either the MCI-C group or the MCI-S group: Classifier 1 is logistic regression (C-LR); Classifier 2 is logistic regression with the L_1 norm (C-LR-1); Classifier 3 is a support vector machine (C-SVM); and Classifier 4 is an SVM with the L_1 norm (C-SVM-1). To test the classifiers' performance, we constructed five different data sources (Table 2.4). The first three single-modality data sets included clinical cognitive assessment scores and APOE4 status (CCA), all MRI volumes (ROI-NP), and MRI volumes with pre-selection (ROI-P), respectively. Two additional multi-modal data sets were constructed by combining the CCA data separately with the ROI-NP and ROI-P data sets (i.e., brain volumes with and without pre-selection).
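A minimal subgradient-descent sketch of the lasso-penalized hinge objective in Eq. (2.11). It recodes the {0, 1} labels to {-1, +1}, the convention under which the margin constraints of Eq. (2.9) take the usual form y_i(\beta_0 + x_i^\top \beta) \geq 1; the synthetic data, learning rate, and iteration count are illustrative assumptions, not the solver used in the study.

```python
import numpy as np

def l1_svm(X, y01, lam, lr=0.01, iters=2000):
    """Subgradient descent on the lasso-penalized hinge loss of Eq. (2.11).

    Labels are recoded from {0,1} to {-1,+1} so that points with
    y_i * (b0 + x_i @ beta) < 1 are the margin violators."""
    y = 2.0 * y01 - 1.0
    M, N = X.shape
    beta, b0 = np.zeros(N), 0.0
    for _ in range(iters):
        margin = y * (b0 + X @ beta)
        active = margin < 1.0                        # points inside the margin
        g_beta = -(X[active] * y[active, None]).sum(axis=0) / M
        g_b0 = -y[active].sum() / M
        beta -= lr * (g_beta + lam * np.sign(beta))  # subgradient of the L1 term
        b0 -= lr * g_b0
    return beta, b0

# Nearly separable toy data: the label depends on x0 - x1 plus a little noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
y01 = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=400) > 0).astype(float)
beta_hat, b0_hat = l1_svm(X, y01, lam=0.01)
acc = np.mean(((b0_hat + X @ beta_hat > 0).astype(float)) == y01)
```

The recovered weights carry the expected signs on the two informative features, and the L1 term keeps the remaining coefficients small.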
Furthermore, it is interesting to note that the number of MCI-S subjects is 101 (38%) in Group One and 122 (39%) in Group Two, which makes the data rather imbalanced. Consequently, to report the results precisely, the present study also assessed additional model performance parameters, including AUC score, sensitivity, and specificity (the accuracy coefficient is unreliable for imbalanced data). The prediction procedure consisted of three processing stages for Group One (Time = 36 months) and Group Two (Time = 24 months): 1) split the data into training, validation, and testing sets; 2) train the classifiers on the training set, tune the hyper-parameters on the validation set, and assess the classifiers on the testing set, then train the classifiers again with the L_1 norm on the same training set; 3) report the testing accuracy, AUC score, sensitivity, and specificity of each classifier on single-modality data. Specifically, the first stage used 80% of the sample as a training set, while the remaining 20% of the data constituted the testing set. In the second stage, the optimal subsets of features for each data source are determined and chosen following application of the L_1 norm. We then list the top 10 features of each data set for each of the models. In the last stage, we report the AUC score, sensitivity (percent of MCI-C subjects correctly classified), and specificity (percent of MCI-S subjects correctly classified) as measures of classification accuracy. To protect against over-fitting and to avoid optimistically biased estimates of model performance, we report 20 replicate measures of predictive performance for each classifier (1-4); across these different partitions of the data, we report the mean and standard deviation of testing accuracy, AUC score, sensitivity, and specificity (Tables 2.6 and 2.7). We also investigate the relationship between the number of features and model performance.
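The performance measures reported throughout can be computed directly: sensitivity and specificity from the per-class correct-classification rates, and the AUC via the rank (Mann-Whitney) statistic, i.e., the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. The labels and scores below are made up for illustration.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sn = fraction of positives (MCI-C, label 1) correctly classified;
    Sp = fraction of negatives (MCI-S, label 0) correctly classified."""
    sn = np.mean(y_pred[y_true == 1] == 1)
    sp = np.mean(y_pred[y_true == 0] == 0)
    return sn, sp

def auc_score(y_true, scores):
    """AUC as the Mann-Whitney U statistic over all positive/negative pairs,
    counting ties as one half."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Toy example: 3 positive and 4 negative cases with predicted scores.
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])
y_pred = (scores >= 0.5).astype(int)
```

Here 11 of the 12 positive/negative pairs are correctly ordered, so the AUC is 11/12; thresholding at 0.5 gives Sn = 2/3 and Sp = 3/4.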
Finally, we compare the performance of LR with SVM based on their ability to handle problems with a large number of covariates. Figure 2.3 illustrates the prediction framework.

Table 2.4: Modalities

| Data sources | # features |
| Single-modality: Clinical Cognitive Assessment scores and APOE4 data (CCA) | 19 |
| Single-modality: ROI with no pre-selection data (ROI-NP) | 259 |
| Single-modality: ROI with pre-selection data (ROI-P) | 26 |
| Multi-modal: CCA and ROI with no pre-selection data (CCAR-NP) | 278 |
| Multi-modal: CCA and ROI with pre-selection data (CCAR-P) | 45 |

2.5 Results and Analysis

Cross-validation and choice of \lambda. We adopted 10-fold cross-validation to tune the hyper-parameters for each model, which included dividing the data into separate training and validation sets; the ratio of cases in training to validation was 8:2. Here, the training set was used to train the model and the validation set was used to select the hyper-parameters. The results of a 10-fold cross-validation run are summarized by the mean and standard deviation of the model skill scores on the testing data. We use \lambda to denote the hyper-parameter for both LR-L1 and SVM-L1. To select the optimal \lambda, we tried the values \lambda = 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, applied to Eqs. (2.8) and (2.11). Next, we selected the \lambda value with the best cross-validation score and used the selected \lambda with Classifiers 2 and 4 to select optimal features. For brevity, the model performance estimates are reported in Tables 2.6 and 2.7 for the different modalities, and the top 10 selected features are reported in Table 2.5. For example, the best \lambda for ROI-NP-L1 was 0.01, and the top 3 features selected by LR were the left amygdala, right accumbens area, and right middle temporal gyrus.

Figure 2.3: Flowchart of the LR and SVM method. A) ROI-P: ROI-level data with pre-selection; B) ROI-NP: ROI-level data with no pre-selection; C) CCAR: clinical, cognitive assessment scores, APOE4, and ROI-level data.

After the hyper-parameters were selected, we again adopted 10-fold cross-validation to avoid optimistically biased estimates of model performance. In each iteration, 212 of the 265 participants were selected by simple random sampling as training cases and the remaining 53 were used as test cases. The approximately 4:1 ratio of training to test cases is, of course, arbitrary.

Table 2.5: Top 10 features of Group One obtained by L1 regularization

| Rank | C2: CCA | C2: ROI-NP | C2: CCAR-NP | C4: CCA | C4: ROI-NP | C4: CCAR-NP |
| 1 | FAQ | AmyL | FAQ | FAQ | AmyL | FAQ |
| 2 | mPACCtrailsB | AccmR | AmyL | Yrs. Educ. | AccmR | AmyL |
| 3 | APOE4 | MTGR | ADASQ4 | APOE4 | AOrGL | AccmR |
| 4 | ADASQ4 | HippoL | HippoL | mPACCdigit | PCgGL | AOrGL |
| 5 | Learning | AOrGL | MTGR | ADASQ4 | HippoL | PTR |
| 6 | Yrs. Educ. | PrGR | APOE4 | Learning | PrGR | AnGR |
| 7 | Forgetting | PCgGL | AOrGL | ADAS11 | POrGR | APOE4 |
| 8 | mPACCdigit | InfR | Learning | mPACCtrailsB | PTR | PCgGL |
| 9 | ADAS13 | POR | mPACCtrailsB | DELTOTAL | LOrGL | Learning |
| 10 | ADAS11 | MOGL | mPACCdigit | Forgetting | MOrGL | POrGR |

C2 = LR-L1 (Classifier 2); C4 = SVM-L1 (Classifier 4). AccmR = Right Accumbens Area; AmyL = Left Amygdala; HippoL = Left Hippocampus; InfR = Right Inf Lat Vent; AOrGL = Left anterior orbital gyrus; AnGR = Right angular gyrus; LOrGL = Left lateral orbital gyrus; MOGL = Left middle occipital gyrus; MOrGL = Left medial orbital gyrus; MTGR = Right middle temporal gyrus; PCgGL = Left posterior cingulate gyrus; POR = Right parietal operculum; POrGR = Right posterior orbital gyrus; PrGR = Right precentral gyrus; PTR = Right planum temporale.

2.5.1 Comparison with different modalities

We compared the performance of each classifier (1-4) on the five feature sets (Table 2.4) based on estimates of AUC, sensitivity, and specificity. As shown in Table 2.6, LR with L1 regularization (Classifier 2) achieves a high AUC of 81.2% and sensitivity of 81.4% on the single-modality data (CCA), considerably better than the performance of LR on the other four modalities. Similarly, the best AUC and sensitivity achieved by SVM are 81.4% and 81.6%, based on the combination of CCA and SVM-L1. Furthermore, the highest accuracy achieved by both classifiers without regularization was on the single-modality data (CCA), indicating that both classifiers perform best on single-modality data.

2.5.2 Comparison of Pre-selection and the L1 Norm

We found that using prior knowledge to inform feature selection improves model performance and protects against over-fitting. As shown in Table 2.6, model performance (i.e., AUC) on ROI-P (64.3%) and CCAR-P (76.3%) outperformed ROI-NP (60.6%) and CCAR-NP (60.1%). The performance of Classifier 2 on the ROI-NP-L1 and CCAR-NP-L1 data sets yielded AUC scores of 64.1% and 64.0%, while ROI-P-L1 and CCAR-P-L1 had respective AUC scores of 64.3% and 77.9%; this suggests that user-guided pre-selection significantly improved model performance over the L1 norm alone. In addition, the SVM classifiers (3 and 4) had results comparable to the LR classifiers. First, as with the LR models, the observed AUC estimates for CCAR-P and ROI-P (69.2% and 64.1%, respectively) were superior to the AUCs from the CCAR-NP (59.1%) and ROI-NP (61.4%) analyses. Classifier 4 exhibited performance on CCAR-P-L1 similar to Classifier 2, with an AUC of 79.6%, higher than the model for CCAR-NP-L1 (74.0%). Therefore, manually selecting features improves model performance whether or not the L1 norm is applied. Second, these results show that pre-selection is necessary and important, because both the LR and SVM models on CCAR-P-L1, with respective AUC estimates of 77.9% and 78.5%, exhibited superior performance over the models without such pre-selection (i.e., LR and SVM on CCAR-NP-L1 had AUC estimates of 64.0% and 74.0%, respectively).

2.5.3 Comparison of Groups One and Two

In addition to the results from the models of Group One (i.e., MCI-to-AD conversion over 36 months), we also evaluated performance for Group Two (i.e., MCI-to-AD conversion over 24 months) in an effort to gain further insight into the possible benefits of shorter or longer assessment periods for classifying the progression of MCI to dementia. Table 2.7 summarizes the predictive performance of LR and SVM for Group Two. We again evaluated classifier performance on single- and multi-modality feature sets. The best result is obtained by the SVM-L1 model (Classifier 4) on CCAR-P, with a corresponding AUC of 76.2%, specificity of 60.1%, and sensitivity of 79.2%, which again verifies the assumption that manual feature selection improves model performance.
However, it warrants mention that all classifiers performed better on the Group One data than on the corresponding data sets in Group Two. For example, Classifier 2 of Group One on CCA achieved AUC and Sn values of 81.2% and 83.1%, considerably better than the same classifier of Group Two on CCA (76.3% and 79.8%). Similarly, Classifier 3 on ROI-NP had an AUC of 61.4% for Group One and 56.6% for Group Two. The experimental results indicate superior model performance on data obtained with longer rather than shorter follow-up periods. Given the uncertainty in conversion, a longer time window for assessing cognitive and functional change clearly yields more accurate classification.

2.5.4 Comparison of LR and SVM

In addition to comparing classification between different time windows of assessment, we also compared performance differences between LR and SVM. The results, including the models' ability to address the over-fitting problem with different modalities, are displayed in Tables 2.6 and 2.7 and Figures 2.4 and 2.5. First, it is worth noting that neither LR nor SVM works well if no L1 penalization is used, since Classifiers 2 and 4 outperform Classifiers 1 and 3 on the same data sets. Second, SVM performs better on MRI data when the L1 feature selection method is employed. Third, it was possible to obtain good performance accuracy with LR, which had model performance equivalent to SVM for "large p" data (ROI-P), as evidenced by respective AUC estimates for Classifiers 1 and 3 of 64.3% and 64.1%. Finally, as shown in Figures 2.4 and 2.5, the SVM method is more stable and robust than LR to a large number of features when n is small. To summarize, the best performance for Group One was achieved by Classifier 4 (SVM with the L1 norm) when using multi-modal data (CCAR-L1), with an AUC of 81.4%.
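The repeated-split protocol underlying Tables 2.6 and 2.7 (random partitions into 212 training and 53 test cases, with each metric summarized as mean ± standard deviation over 20 replicates) can be sketched as follows. The tiny gradient-descent logistic classifier and the synthetic data stand in for the real models and the ADNI sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lr=0.5, iters=300):
    """Plain gradient descent on the cross-entropy cost (stand-in classifier)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        beta -= lr * X.T @ (sigmoid(X @ beta) - y) / len(y)
    return beta

def repeated_splits(X, y, n_rep=20, seed=0):
    """Random 80/20 train/test partitions; per-split test accuracy is
    summarized as mean and standard deviation, as in Tables 2.6 and 2.7."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_rep):
        idx = rng.permutation(len(y))
        cut = int(0.8 * len(y))          # 212 of 265 cases for training
        tr, te = idx[:cut], idx[cut:]
        beta = fit_lr(X[tr], y[tr])
        pred = (sigmoid(X[te] @ beta) >= 0.5).astype(float)
        accs.append(np.mean(pred == y[te]))
    return np.mean(accs), np.std(accs)

rng = np.random.default_rng(6)
X = rng.normal(size=(265, 4))
y = (rng.uniform(size=265) < sigmoid(X @ np.array([2.0, -1.0, 0.0, 0.0]))).astype(float)
mean_acc, sd_acc = repeated_splits(X, y)
```

Reporting the spread across partitions, not just a single split, is what guards against optimistically biased performance estimates.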
Table 2.6: LR and SVM performance of Group One (Time = 3 years) for models on single- and multi-modal feature sets

| # Features (2) | Modality | LR: Test Acc % | LR: AUC % | LR: Sp % | LR: Sn % | SVM: Test Acc % | SVM: AUC % | SVM: Sp % | SVM: Sn % | # Features (1) |
| 19 | CCA | 74.3 ± 6.0 | 80.8 ± 7.0 | 62.3 ± 12.1 | 81.5 ± 6.2 | 72.4 ± 6.9 | 80.0 ± 7.3 | 53.6 ± 13.2 | 79.4 ± 7.7 | 19 |
| 259 | ROI-NP | 58.1 ± 7.0 | 60.6 ± 8.1 | 45.5 ± 13.4 | 65.3 ± 7.9 | 59.5 ± 7.3 | 61.4 ± 8.5 | 46.5 ± 11.9 | 67.3 ± 8.5 | 259 |
| 26 | ROI-P | 64.4 ± 6.5 | 64.3 ± 6.6 | 46.1 ± 10.4 | 75.0 ± 9.6 | 62.1 ± 5.9 | 64.1 ± 6.2 | 43.6 ± 9.5 | 78.4 ± 10.4 | 26 |
| 278 | CCAR-NP | 57.6 ± 7.2 | 60.1 ± 8.1 | 44.8 ± 12.9 | 65.1 ± 9.0 | 57.8 ± 6.8 | 59.1 ± 7.0 | 45.9 ± 10.4 | 65.1 ± 7.5 | 278 |
| 45 | CCAR-P | 72.7 ± 6.4 | 76.3 ± 6.5 | 60.5 ± 10.4 | 80.4 ± 8.2 | 66.9 ± 6.0 | 69.2 ± 6.4 | 53.6 ± 13.2 | 74.4 ± 10.5 | 45 |
| 3 | CCA-L1 | 74.9 ± 6.4 | 81.2 ± 6.7 | 61.3 ± 12.0 | 83.1 ± 6.6 | 74.2 ± 6.0 | 81.4 ± 6.9 | 61.6 ± 11.5 | 81.6 ± 5.9 | 4 |
| 27 | ROI-NP-L1 | 62.2 ± 6.6 | 64.1 ± 7.9 | 53.1 ± 13.1 | 68.1 ± 7.2 | 62.7 ± 5.8 | 67.0 ± 6.7 | 53.7 ± 11.6 | 67.7 ± 7.4 | 29 |
| 17 | ROI-P-L1 | 64.4 ± 6.5 | 64.3 ± 6.2 | 46.2 ± 11.0 | 74.9 ± 9.6 | 64.4 ± 5.7 | 64.7 ± 5.8 | 46.7 ± 11.1 | 75.4 ± 8.3 | 5 |
| 27 | CCAR-NP-L1 | 62.6 ± 7.2 | 64.0 ± 8.2 | 51.8 ± 12.7 | 69.5 ± 7.3 | 67.4 ± 6.4 | 74.0 ± 7.4 | 55.7 ± 12.1 | 74.1 ± 7.1 | 18 |
| 25 | CCAR-P-L1 | 73.1 ± 6.5 | 77.9 ± 5.9 | 61.6 ± 10.9 | 79.6 ± 7.7 | 73.5 ± 6.2 | 78.5 ± 6.4 | 61.6 ± 9.3 | 80.8 ± 7.5 | 14 |

Predictive performance of LR (Classifiers 1 and 2) and SVM (Classifiers 3 and 4), shown as mean ± standard deviation for all models. Performance estimates include testing accuracy (Test Acc %), area under the curve (AUC), specificity (Sp), and sensitivity (Sn). The number (#) of features was determined via (1): Classifier 2; (2): Classifier 4.
Table 2.7: LR and SVM performance of Group Two (Time = 2 years) for models on single- and multi-modal feature sets

| # Features (2) | Modality | LR: Test Acc % | LR: AUC % | LR: Sp % | LR: Sn % | SVM: Test Acc % | SVM: AUC % | SVM: Sp % | SVM: Sn % | # Features (1) |
| 19 | CCA | 69.9 ± 5.3 | 76.2 ± 5.5 | 56.7 ± 9.0 | 79.3 ± 7.3 | 69.4 ± 5.4 | 75.4 ± 5.5 | 56.7 ± 8.8 | 78.5 ± 7.1 | 19 |
| 259 | ROI-NP | 58.1 ± 4.2 | 58.8 ± 5.6 | 49.7 ± 7.1 | 64.4 ± 5.9 | 57.8 ± 5.0 | 56.6 ± 6.4 | 50.3 ± 7.1 | 62.9 ± 7.5 | 259 |
| 25 | ROI-P | 63.4 ± 4.7 | 65.8 ± 4.3 | 43.7 ± 10.2 | 77.8 ± 8.6 | 64.5 ± 4.7 | 66.2 ± 5.0 | 44.5 ± 8.5 | 79.1 ± 9.1 | 25 |
| 278 | CCAR-NP | 57.3 ± 4.0 | 58.8 ± 5.4 | 47.5 ± 8.3 | 64.3 ± 5.8 | 56.6 ± 5.5 | 56.4 ± 5.2 | 48.9 ± 7.9 | 62.3 ± 10.4 | 278 |
| 45 | CCAR-P | 70.2 ± 5.4 | 74.0 ± 5.0 | 56.7 ± 9.5 | 80.6 ± 7.0 | 69.5 ± 4.9 | 72.0 ± 5.3 | 58.1 ± 8.1 | 78.0 ± 8.2 | 45 |
| 4 | CCA-L1 | 70.1 ± 4.8 | 76.3 ± 5.3 | 56.8 ± 9.9 | 79.8 ± 7.6 | 70.4 ± 4.9 | 76.4 ± 7.7 | 56.8 ± 9.8 | 79.4 ± 7.7 | 4 |
| 31 | ROI-NP-L1 | 62.2 ± 6.0 | 64.7 ± 6.0 | 48.9 ± 9.2 | 72.0 ± 6.8 | 60.8 ± 4.5 | 65.9 ± 6.1 | 53.6 ± 7.5 | 64.3 ± 7.9 | 29 |
| 14 | ROI-P-L1 | 64.1 ± 4.6 | 66.8 ± 3.8 | 42.8 ± 11.3 | 79.8 ± 8.4 | 65.4 ± 4.0 | 67.8 ± 3.9 | 46.3 ± 9.4 | 81.1 ± 7.2 | 6 |
| 32 | CCAR-NP-L1 | 62.6 ± 6.3 | 64.8 ± 6.0 | 49.1 ± 9.1 | 72.1 ± 6.1 | 64.5 ± 5.1 | 71.7 ± 4.8 | 55.4 ± 7.8 | 71.4 ± 8.9 | 26 |
| 27 | CCAR-P-L1 | 70.0 ± 5.5 | 74.3 ± 5.5 | 57.8 ± 8.0 | 78.3 ± 8.8 | 71.3 ± 4.9 | 76.2 ± 4.7 | 60.1 ± 7.1 | 79.2 ± 8.5 | 14 |

For each modality, the predictive performance of LR and SVM is shown (mean ± standard deviation), including testing accuracy, AUC, specificity (Sp), and sensitivity (Sn). The number (#) of features was determined via (1): Classifier 2; (2): Classifier 4.

2.6 Discussion and Conclusion

In this thesis, we applied two machine learning methods under multiple conditions to test accuracy in classifying patients with MCI who progress to clinically defined dementia (MCI-C) versus those who remain stable (MCI-S).
Using multi-modal data from ADNI, we compared LR and SVM classification accuracy under two dimension-reduction strategies: feature selection informed by prior findings in clinical neuroscience (pre-selection) and feature selection by the L1 norm. Notably, the present results demonstrate important boundaries for applying feature selection techniques in statistical classification of MCI-to-dementia conversion. Specifically, we found that while using the L1 norm for feature selection can improve accuracy, it benefits further from a more limited, theoretically based set of feature inputs. In addition, we found that model performance benefited from a longer window of assessment. These results have implications for studies utilizing multi-modal data for such classification, including features from clinical neuropsychological assessment, demographic and genetic markers, MRI-based volumetric brain measures, and other modalities. Comparison of user-defined and L1 pre-selection for the LR and SVM classifiers yielded multiple noteworthy findings, consistent with previously published reports [147, 148, 24, 29, 145, 56, 144, 68]. First, the classification results showed that the model using multi-modal data with cognitive, clinical, and volumetric features (CCAR) achieved better classification accuracy than the methods based on a single modality (CCA, ROI). Moreover, the AUC of CCAR based on LR or SVM was statistically significantly, or at least numerically, greater than those based on the single-modality models. Based on AUC, the highest accuracy was observed for the CCAR data: 78.5% by L1-SVM and 77.9% by L1-LR. Second, SVM demonstrated several advantages over LR in discriminating MCI-C from MCI-S (Fig. 2.4). For one, SVM performance tended to be more stable than LR when the number of features was relatively large. In other words, the model performance of SVM on ROI data remained more stable than that of LR when using larger numbers of features without user-defined pre-selection.
In particular, SVM performance on ROI data improved as the number of features increased from 20 to 30, and the AUC values then remained fairly static as further features were added, whereas LR model performance decreased gradually after the number of ROI features reached 40. Third, the classification results clearly demonstrate that manually selecting features for the MRI data not only improved model performance and protected the classifier from overfitting, but also affords easier interpretation of each selected feature's contribution to the model. In addition, we show that pre-selection improves performance: Tables 2.6 and 2.7 suggest it is the best strategy for obtaining maximum model performance, compared to feature selection based on the L1 norm. The present findings can also be interpreted in the context of other reports over the past decade that investigated the prognostic capacity of brain volumetry data to predict the conversion of MCI to dementia, using either SVM or LR, and that combined volumetry data with other imaging and biomarker modalities, from functional MRI (fMRI) and positron emission tomography (PET) to cerebrospinal fluid (CSF) protein markers [147, 148, 24, 29, 145, 56, 144, 68, 78, 130, 69]. In addition, one can vary the degree of non-linearity and flexibility in the model by employing different kernel functions. For example, Young et al. (2013) [145] report results from both SVM and Gaussian process (GP) classification of MCI progression in ADNI data using MRI, PET, APOE4, and CSF biomarkers. In contrast to the present study and other published work that used the MCI-C and MCI-S groups as training and test data sets, they trained a classifier to distinguish cognitively normal older adults from those diagnosed as probable AD. They reported that the accuracy using GP, an AUC value of 79.5%, was substantially higher than that obtained using any individual modality or multi-kernel SVM.
Other studies of MCI-to-dementia classification reporting high accuracy have implemented approaches such as multiple kernel learning (pMKL) classification using clinical, MRI, and plasma biomarker data. One method using this approach first grouped the data set into five different data sources to identify the important features and then applied a filter-wrapper approach to feature selection in combination with the Joint Mutual Information (JMI) criterion, achieving an AUC of 82% [68]. We also found consistently superior classification performance for patients classified under a longer window of assessment. MCI-to-dementia conversion is a process that can take several years to reliably track an individual from the onset of amnestic MCI to early-stage dementia [145, 92, 75]. For the modeled features to be useful for classification, the classes must be well defined, if not orthogonal. However, MCI is not inherently prodromal to dementia: a large proportion of individuals with MCI never progress, either reverting to cognitively normal status or remaining rather stable. Furthermore, others may show early evidence of brain atrophy that precedes cognitive impairment by years. To account for this variable timing, others have employed methods such as supervised learning using time windows [102]; however, even those methods benefit strongly from longer follow-up periods. Thus, MCI is an inherently heterogeneous and poorly defined class, particularly in terms of the relationships between brain characteristics and the likelihood and timing of further cognitive decline. Most computational neuroimaging studies in the past few years have utilized multi-modal features [24, 31, 82, 59, 28, 114, 89, 90, 133, 136]. For example, when Ding et al. applied SVM with PET and MRI data to classify the transition from MCI to AD, they reported a sensitivity of 66.67% and a specificity of 64.52% [31].
In addition to PET and structural MRI data, CSF protein markers can be used to predict progression from MCI to AD, alongside proteomic, demographic, and cognitive data [28, 113, 21]. Applying LR with the L1 norm to CSF markers for classifying individual patients as belonging to either the MCI-C or the MCI-S group, one study reported a sensitivity of 80% and a specificity of 75% [82]. Furthermore, Varatharajah and colleagues (2020) showed that linear classifiers, including multiple kernel learning (MKL) with linear kernels, SVM with a linear kernel, and the generalized linear model (GLM), outperform other advanced classification methods in predicting the transition from MCI to AD [130]. In general, LR works well when the data are linearly separable and the number of observations is greater than the number of features, whereas SVM with a Gaussian kernel is mostly used when the data are not linearly separable. In addition to LR and SVM, deep neural network approaches also offer benefits [78, 119], but they have not seen the extent of application to ADNI data that SVM and LR have. Using LR, an artificial neural network (ANN) model, and a decision tree (DT) model to classify the progression of MCI to AD, Kuang (2021) reported that the ANN exhibited the highest sensitivity, at 82.1% [69]. In conclusion, models applying prior knowledge for classification and prediction of MCI-to-dementia conversion outperform those without pre-selection. This theoretically guided pre-selection of features from MRI-based regional brain volumes appears to protect the model against over-fitting. In addition, the present findings demonstrate that SVM classifier performance is more stable than LR for dealing with the "large p" problem.
Clinical researchers should note the value of evaluating different classification and pre-selection approaches in application to clinical or research questions, and be mindful that not all machine learning techniques are equally beneficial for modeling specific clinical outcomes.

Figure 2.4: Model performance on the ROI feature set by number of features for LR and SVM. Panel (a) shows dramatic growth in AUC with LR as the number of features increases from 1 to 30, becoming more static at approximately 74% as the number of features increases from 30 to 40, then dropping significantly when the number of features reaches 41. Panel (b) shows that the AUC increased dramatically as the number of features grew from 1 to 28 but fluctuated after 29. The optimal numbers of ROI features for the two methods are 29 and 28, with corresponding optimized AUCs of approximately 74.0% and 78.0%.

Figure 2.5: Model performance on the CCA feature set by number of features for LR and SVM. Panel (a) shows a significant increase in the AUC with LR as the number of features increases from 1 to 5, then a slight decrease in testing accuracy when the number of features exceeds 5. Panel (b) shows the AUC rising dramatically as the number of features increases from 1 to 4. The optimal numbers of CCA features obtained by LR and SVM are 5 and 4, with corresponding optimized AUCs of approximately 84.0% and 83.0%.

CHAPTER 3

CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS

3.1 Introduction

Bayesian deep neural network (BDNN) models are ubiquitous in classification problems; however, their Markov chain Monte Carlo (MCMC) based implementation suffers from high computational cost, limiting the use of this powerful technique in large-scale studies. Variational Bayes (VB) has emerged as a competitive alternative to overcome some of these computational issues.
This thesis focuses on the variational Bayesian deep neural network (VBDNN) estimation methodology and discusses the related statistical theory and algorithmic implementations in the context of classification. For a deep neural network-based classification, the thesis compares and contrasts the consistency and contraction rates of the true posterior and the corresponding variational posterior. Based on the complexity of the deep neural network (DNN), this thesis provides an assessment of the loss in classification accuracy due to the use of VB and guidelines on the characterization of the prior distributions and the variational family. The difficulty of the numerical optimization for obtaining the variational Bayes solution has also been quantified as a function of the complexity of the DNN. The development is motivated by an important biomedical engineering application, namely building predictive tools for the transition from mild cognitive impairment to Alzheimer's disease. The predictors are multi-modal and may involve complex interactive relations.

3.2 The Neural Network Classifier and Likelihoods

Let Y be a binary random variable taking values 0 or 1, representing the class labels, and let X \in \mathbb{R}^p be a feature vector drawn from a feature space with marginal distribution P_X. We consider the following binary classification problem:

    P(Y = 1 \mid X = x) = \sigma(\eta_0(x)), \qquad P(Y = 0 \mid X = x) = 1 - \sigma(\eta_0(x))    (3.1)

where \eta_0(\cdot): \mathbb{R}^p \to \mathbb{R} is some continuous function and \sigma(t) = e^t/(1 + e^t) is the sigmoid function. Thus P_{X,Y}, the joint distribution of (X, Y), is the product of the conditional distribution in (3.1) and the marginal distribution P_X. Borrowing notation from [14] and [141], a classifier C is a Borel measurable function C: \mathbb{R}^p \to \{0, 1\}, with the interpretation that we assign a point x \in \mathbb{R}^p to class C(x). The test error of a classifier C is given by

    R(C) = \int_{\mathbb{R}^p \times \{0,1\}} \mathbf{1}_{\{C(X) \neq Y\}} \, dP_{X,Y}    (3.2)

Based on (3.1), we define the Bayes classifier as

    C^{\mathrm{Bayes}}(x) = \begin{cases} 1, & \sigma(\eta_0(x)) \geq 1/2 \\ 0, & \text{otherwise} \end{cases}    (3.3)

The Bayes classifier is optimal [46] since it minimizes the misclassification risk in (3.2). However, the Bayes classifier is not useful in practice, since the function \eta_0(x) is unknown. Thus, a classifier is obtained from a set of training observations \{(x_1, y_1), \ldots, (x_n, y_n)\} drawn from P_{X,Y}. A good classifier based on the sample should have risk tending to the Bayes risk as the number of observations tends to infinity, without any requirement on the underlying probability distribution; this is so-called universal consistency. Multiple methods have been adopted to estimate \eta_0(x), including logistic regression (a linear approximation), generalized additive models (GAM, a nonparametric nonlinear approximation), and deep neural networks (a complicated structure that is dense in the space of continuous functions). The first two methods usually work well in practice and have a good theoretical foundation; however, they may fail to capture the complicated dependency on the feature vector x in a wide range of applications, including the problem considered in this thesis. On the other hand, the neural network structure, which can exploit the dependency implicitly without any specific parametric structure, has relatively few theoretical works establishing its statistical efficacy in Bayesian models. In this thesis, we therefore focus our attention on classification using deep neural networks.

Consider a single-layer neural network model with p predictor variables. The layer has k_n nodes, where k_n may be a diverging sequence depending on n. The validity of neural network approximations rests on the universal approximation results [25], which state that a single-layer neural network is able to approximate any continuous function with a small approximation error when k_n is large.
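The plug-in rule of Eq. (3.3) thresholds \sigma(\eta_0(x)) at 1/2, which is equivalent to thresholding \eta_0(x) at 0, since \sigma is monotone with \sigma(0) = 1/2. A minimal sketch follows; the particular \eta_0 below is a hypothetical choice for illustration only, since the true \eta_0 is unknown in practice.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bayes_classifier(X, eta0):
    """Eq. (3.3): predict 1 iff sigma(eta0(x)) >= 1/2, i.e. iff eta0(x) >= 0."""
    return (sigmoid(eta0(X)) >= 0.5).astype(int)

# A hypothetical true eta_0 (nonlinear in both coordinates), for illustration.
eta0 = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2 - 0.5

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))
labels = bayes_classifier(X, eta0)
```

The equivalence of the two thresholds is what lets estimation focus on \eta_0 itself: any estimator whose sign agrees with \eta_0 reproduces the Bayes rule.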
Assume a Fourier representation of [0 (x) of the form π 48! x ˜ (3!) ) 5 (x) = R? Ø and denote ⌫,⇠ = { 5 (·) : ⌫ k!k 2 | 5˜|(3!) < ⇠} for some bounded subset ⌫ of R ? containing zero for some constant ⇠ > 0. Then, for all functions [0 2 ⌫,⇠ , there exist a single layer neural p network output [(x) such that k[ [0 k 2 = $ (1/ : = ) [5]. This result ensures good approximation property of single layer neural network, and the convergence rate depends only on the number of nodes under mild conditions on [0 (x). [77] proved that as long as the activation function is not algebraic polynomials, the single layer neural network is dense in the continuous function space, thus can be used to approximate any given continuous function. We use ✓= to index the set of all the parameters. For ? = ⇥ 1 input vector x, consider a deep neural network with ! = hidden layers and : 1= , · · · , : ! = = being the number of nodes in the hidden layers. Let : 0= = ? = + 1 and : (! = +1)= = 1. It can be checked that the total number of parameters is Œ! = = = E=0 : (E+1)= (: E= + 1) due to the formulation below. [✓= (x) = b ! + A ! k(b ! 1 + A ! 1 k(b ! 2 + A ! 2 k(· · · k(b1 + A1 k(b0 + A0 x))) bE , E = 0, · · · , ! are vectors of dimension : E+1 ⇥ 1 and AE , E = 1, · · · , ! 1 are matrices each of dimension : E+1 ⇥ : E . We have suppressed the dependence on = for notation simplicity. For the purposes of this thesis, we use the activation function to be the sigmoid function, k(G) = 4 G /(1 + 4 G ), although the theoretical results are valid to a wider class of activation functions such as tan-hyperbolic, Gaussian etc.. Thus, using the neural network in (3.4) as an approximation to the true function [0 (x) in (4.1), the conditional probabilities of . given - = x is given by %(. = 1|- = x) = f([✓= (x)), %(. = 0|- = x) = 1 f([✓= (x)) (3.4) Assuming Bernoulli distribution, the conditional density of . 
given $X = \mathbf{x}$ under the model is
\[
\ell_{\boldsymbol{\theta}_n}(y, \mathbf{x}) = \exp\left( y\eta_{\boldsymbol{\theta}_n}(\mathbf{x}) - \log\left(1 + e^{\eta_{\boldsymbol{\theta}_n}(\mathbf{x})}\right) \right). \tag{3.5}
\]
Thus, the likelihood function for the data $(\mathbf{y}_n, \mathbf{X}_n) = (y_i, \mathbf{x}_i)_{i=1}^n$ under the model is
\[
L(\boldsymbol{\theta}_n) = \prod_{i=1}^n \ell_{\boldsymbol{\theta}_n}(y_i, \mathbf{x}_i) = \exp\left( \sum_{i=1}^n \left[ y_i\eta_{\boldsymbol{\theta}_n}(\mathbf{x}_i) - \log\left(1 + e^{\eta_{\boldsymbol{\theta}_n}(\mathbf{x}_i)}\right) \right] \right). \tag{3.6}
\]
In view of (4.1), the conditional density of $Y|X = \mathbf{x}$ under the truth is
\[
\ell_0(y, \mathbf{x}) = \exp\left( y\eta_0(\mathbf{x}) - \log\left(1 + e^{\eta_0(\mathbf{x})}\right) \right). \tag{3.7}
\]
Therefore, the likelihood function for the data under the truth is given by
\[
L_0 = \prod_{i=1}^n \ell_0(y_i, \mathbf{x}_i) = \exp\left( \sum_{i=1}^n \left[ y_i\eta_0(\mathbf{x}_i) - \log\left(1 + e^{\eta_0(\mathbf{x}_i)}\right) \right] \right). \tag{3.8}
\]

3.3 Bayesian Inference with Variational Algorithm

3.3.1 Prior Choice

For Bayesian analysis, prior distributions have to be assigned to all parameters defining the model. Although one may have prior knowledge concerning the function represented by a neural network, it is generally difficult to translate this into a meaningful prior on the neural network weights. We assume an independent normal prior as follows:
\[
p(\boldsymbol{\theta}_n) = \prod_{j=1}^{K_n} \frac{1}{\sqrt{2\pi\sigma_{jn}^2}} e^{-\frac{(\theta_{jn} - \mu_{jn})^2}{2\sigma_{jn}^2}}. \tag{3.9}
\]
(A1) For $\boldsymbol{\Sigma}_n = [\sigma_{1n}, \cdots, \sigma_{K_n n}]$ and $\boldsymbol{\Lambda}_n = [1/\sigma_{1n}, \cdots, 1/\sigma_{K_n n}]$, assume $\log \|\boldsymbol{\Sigma}_n\|_\infty = O(\log n)$ and $\|\boldsymbol{\Lambda}_n\|_\infty = O(1)$, where $\|\cdot\|_\infty$ is the supremum norm of a vector as in definition A.0.1 in appendix B.

Note, the above assumption ensures that the variance associated with each $\theta_{jn}$ does not grow at an arbitrarily large rate, in which case the consistency of both the Bayesian and the variational Bayes approach would break down. Restrictions on the mean parameter $\boldsymbol{\mu}_n = [\mu_{1n}, \cdots, \mu_{K_n n}]$ directly impact the consistency rate and are more case specific (see section 3.4 for a thorough discussion). The reason for choosing the above form of prior is twofold: (1) it guarantees that the true posterior distribution is consistent; (2) it guarantees, under a suitable choice of the variational family, that the approximated variational posterior is also consistent. The choice of prior in (3.9) is not unique.
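A direct implementation of the log of (3.6) can overflow for large $|\eta|$; a numerically stable sketch uses the identity $\log(1 + e^{\eta}) = \operatorname{logaddexp}(0, \eta)$. The helper name is hypothetical.

```python
import numpy as np

def log_lik(eta, y):
    """Bernoulli log-likelihood sum_i [ y_i * eta_i - log(1 + exp(eta_i)) ],
    with logaddexp used to avoid overflow for large |eta_i|."""
    eta = np.asarray(eta, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

# Sanity check: at eta = 0, each observation contributes -log 2 regardless of y.
ll = log_lik([0.0, 0.0], [0, 1])
assert np.isclose(ll, -2 * np.log(2))
```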
Indeed, one can work with a much more generic class of priors such that (1) and (2) hold. Note, each prior comes with its own associated computational complexity, implementation cost and theoretical justification; we choose one which does a fairly good job under all three criteria. In view of (3.6) and (3.9), the posterior distribution of $\boldsymbol{\theta}_n$ given $\mathbf{y}_n = [y_1, \cdots, y_n]^{\top}$ and $\mathbf{X}_n = [\mathbf{x}_1, \cdots, \mathbf{x}_n]^{\top}$ is
\[
\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n) = \frac{\pi(\boldsymbol{\theta}_n, \mathbf{y}_n, \mathbf{X}_n)}{\pi(\mathbf{y}_n, \mathbf{X}_n)} = \frac{L(\boldsymbol{\theta}_n)p(\boldsymbol{\theta}_n)}{\int L(\boldsymbol{\theta}_n)p(\boldsymbol{\theta}_n)d\boldsymbol{\theta}_n} \tag{3.10}
\]
where $\pi(\mathbf{y}_n, \mathbf{X}_n)$ is free of the parameter and depends only on $\mathbf{y}_n$ and $\mathbf{X}_n$.

3.3.2 Variational Inference

As a first step of the variational inference (VI) procedure, one has to choose a variational family. Among several options, we work with one which is simple, computationally and structurally tractable, and, more importantly, provides statistically consistent posterior estimation. We posit a mean field Gaussian variational family of the form
\[
\mathcal{Q}_n = \left\{ q(\boldsymbol{\theta}_n) : q(\boldsymbol{\theta}_n) = \prod_{j=1}^{K_n} \frac{1}{\sqrt{2\pi s_{jn}^2}} e^{-\frac{(\theta_{jn} - m_{jn})^2}{2s_{jn}^2}} \right\}. \tag{3.11}
\]
Note that the variational family assumes each $\theta_{jn}$ is independent with mean and standard deviation equal to $m_{jn}$ and $s_{jn}$ respectively. The variational posterior aims to minimize the KL-distance between the variational family and the true posterior [9, 41, 11]. For the true posterior $\pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)$ in (4.9), the variational posterior is
\[
\pi^* = \underset{q \in \mathcal{Q}_n}{\operatorname{argmin}} \; d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) \tag{3.12}
\]
where $d_{\mathrm{KL}}$, the Kullback-Leibler (KL) divergence between a variational family member $q(\boldsymbol{\theta}_n)$ and the true posterior $\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)$, is given by
\[
d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) = \int \log\left( q(\boldsymbol{\theta}_n)/\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n) \right) q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n. \tag{3.13}
\]
Based on (4.9), simplifying further, we get
\[
d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) = \int [\log q(\boldsymbol{\theta}_n) - \log \pi(\boldsymbol{\theta}_n, \mathbf{y}_n, \mathbf{X}_n)] q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n + \log \pi(\mathbf{y}_n, \mathbf{X}_n) = -\mathrm{ELBO}(q, \pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)) + \log \pi(\mathbf{y}_n, \mathbf{X}_n). \tag{3.14}
\]
Since the last term in (3.14) does not depend on $q$, optimizing (3.14) w.r.t. $q$ boils down to optimizing the first term.
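For the Gaussian prior (3.9) and the mean field family (3.11), the KL-distance between $q$ and the prior (a term that recurs in the ELBO computations of this chapter, with exact expressions deferred to appendix A) has the standard closed form $\sum_j \big[\log(\sigma_{jn}/s_{jn}) + (s_{jn}^2 + (m_{jn}-\mu_{jn})^2)/(2\sigma_{jn}^2) - 1/2\big]$. A sketch:

```python
import numpy as np

def kl_mean_field_gauss(m, s, mu, sigma):
    """d_KL(q, p) for q = prod_j N(m_j, s_j^2) and p = prod_j N(mu_j, sigma_j^2):
    sum_j [ log(sigma_j/s_j) + (s_j^2 + (m_j - mu_j)^2) / (2 sigma_j^2) - 1/2 ]."""
    m, s, mu, sigma = map(np.asarray, (m, s, mu, sigma))
    return float(np.sum(np.log(sigma / s) + (s**2 + (m - mu)**2) / (2 * sigma**2) - 0.5))

# The KL vanishes exactly when q coincides with the prior.
assert np.isclose(kl_mean_field_gauss([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0]), 0.0)
```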
Indeed, the first term is nothing but the negative of the evidence lower bound (ELBO). Thus, in order to minimize the KL-distance, we shall instead maximize the ELBO between $q$ and $\pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)$. Alternatively, we define $\pi^*$ as
\[
\pi^* = \underset{q \in \mathcal{Q}_n}{\operatorname{argmax}} \; \mathrm{ELBO}(q, \pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)). \tag{3.15}
\]
To maximize the ELBO in (3.14), let $\boldsymbol{\beta}_q = (m_{1n}, \cdots, m_{K_n n}, s_{1n}^2, \cdots, s_{K_n n}^2)$, where $m_{jn}$ and $s_{jn}$ are the mean and standard deviation of $\theta_{jn}$ under the density $q$. Thus, each $q \in \mathcal{Q}_n$ is indexed by its parameters. Consequently,
\[
\begin{aligned}
\mathrm{ELBO}(q(\cdot|\boldsymbol{\beta}_q), \pi(\cdot, \mathbf{y}_n, \mathbf{X}_n)) &= \int [\log \pi(\boldsymbol{\theta}_n, \mathbf{y}_n, \mathbf{X}_n) - \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)] q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n \\
&= \int \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n + \int [\log p(\boldsymbol{\theta}_n) - \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)] q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n \\
&= \int \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)) = \mathcal{L}_{\boldsymbol{\beta}_q} - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)).
\end{aligned} \tag{3.16}
\]
The derivative of $d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$ w.r.t. $\boldsymbol{\beta}_q$ has a closed form expression (see appendix A). The key challenge is the derivative of $\mathcal{L}_{\boldsymbol{\beta}_q}$ w.r.t. $\boldsymbol{\beta}_q$, which we discuss next:
\[
\begin{aligned}
\nabla_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q} &= \nabla_{\boldsymbol{\beta}_q} \int \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n = \int \log L(\boldsymbol{\theta}_n) \nabla_{\boldsymbol{\beta}_q} q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n \\
&= \int \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n) q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) d\boldsymbol{\theta}_n = E_{q(\cdot|\boldsymbol{\beta}_q)}\left( \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n) \right)
\end{aligned} \tag{3.17}
\]
where the last equality holds since $\nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) \, q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q) = \nabla_{\boldsymbol{\beta}_q} q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)$. The black-box variational inference (BBVI) algorithm [107] optimizes the ELBO by stochastic gradient ascent using a similar approach. The key challenge in evaluating the gradient in (3.17) is the computation of the expectation: exact computation leads to high computational complexity, whereas using noisy estimates leads to high variability. In section 3.3.3, we elucidate how to ensure fast and efficient estimation of the gradient.

3.3.3 Black Box Variational Algorithm using score function estimator

The gradient in (3.17) is difficult to evaluate for problems with complex likelihood structures arising out of deep network models.
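The score-function identity underlying (3.17) can be verified numerically on a toy example where the gradient is known in closed form; the choices of $q$ and of the target function below are arbitrary and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Check of the identity behind (3.17):
#   d/dm E_{q(.|m)}[f(theta)] = E_{q(.|m)}[ (d/dm log q(theta|m)) f(theta) ]
# with q = N(m, 1) and f(theta) = theta^2, so the exact gradient is 2m.
m = 0.7
W = 400_000
theta = rng.normal(m, 1.0, size=W)
score = theta - m                      # d/dm log N(theta | m, 1)
grad_mc = np.mean(score * theta**2)    # Monte Carlo estimate of the gradient
assert abs(grad_mc - 2 * m) < 0.05
```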
Alternatively, the above expectation is evaluated by sampling from the variational distribution and forming the corresponding Monte Carlo estimate of the gradient. We next explain the computation of the Monte Carlo estimate of the gradient in (3.17) using ideas similar to [107, 124]. Let $\boldsymbol{\beta}_q$ denote the current value of the variational parameters. We generate $W$ samples from the variational distribution $q(\cdot|\boldsymbol{\beta}_q)$ and define the noisy but unbiased estimate of $\nabla_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q}$ as
\[
\widetilde{\nabla}_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q} = \frac{1}{W} \sum_{w=1}^{W} \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n[w]) \tag{3.18}
\]
where $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ are samples generated from $q(\cdot|\boldsymbol{\beta}_q)$. Similarly, a noisy but unbiased estimate of $\mathcal{L}_{\boldsymbol{\beta}_q}$ is given by
\[
\widehat{\mathcal{L}}_{\boldsymbol{\beta}_q} = \frac{1}{W} \sum_{w=1}^{W} \log L(\boldsymbol{\theta}_n[w]). \tag{3.19}
\]
Algorithm 1 provides the pseudocode summarizing the overall algorithm for BBVI.

Algorithm 1 BBVI
1. Fix an initial value for the variational family parameters $\boldsymbol{\beta}_q^1$.
2. Fix a step size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$.
4. Simulate $W$ samples $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ from $q(\cdot|\boldsymbol{\beta}_q^t)$.
5. Compute $\widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t}$ as in (3.18).
6. Update
\[
\boldsymbol{\beta}_q^{t+1} = \boldsymbol{\beta}_q^t + \rho_t \left( \widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t} - \nabla_{\boldsymbol{\beta}_q^t} d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)) \right). \tag{3.20}
\]
7. Set $t = t + 1$.
8. Repeat steps 4-7 until the convergence of the ELBO, using $\widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t} - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$.

In the implementation of the above algorithm, one needs to compute $d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$, $\nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n|\boldsymbol{\beta}_q)$ and $\nabla_{\boldsymbol{\beta}_q} d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$ for the variational parameters $\boldsymbol{\beta}_q$. For the choice of $p$ and $q$ as in (3.9) and (3.11), the explicit expressions are presented in appendix A. For the variational parameters $(s_{1n}, \cdots, s_{K_n n})$, the updating rule in (3.20) may lead to negative estimates. One must guard against this since standard deviation terms cannot be negative. Thus, to perform the optimization, we reparametrize as $s_{jn} = \log(1 + e^{r_{jn}})$, $j = 1, \cdots, K_n$, and update the quantities $r_{jn}$ in each step instead of $s_{jn}$.
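Algorithm 1, together with the softplus reparametrization of $s_{jn}$, can be sketched end to end on a deliberately tiny model: a one-parameter logistic regression with a $N(0,1)$ prior, rather than a deep network. The data-generating value `theta_true`, the step size, and $W$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softplus = lambda z: np.log1p(np.exp(z))

# Toy data from a one-parameter logistic model.
n, theta_true = 300, 1.5
x = rng.normal(size=n)
y = rng.binomial(1, sigmoid(theta_true * x))

def log_lik(thetas):
    # log L(theta) of (3.6), evaluated for a batch of sampled parameters.
    eta = np.outer(thetas, x)
    return np.sum(y * eta - np.logaddexp(0.0, eta), axis=1)

# Mean-field q = N(m, s^2) with s = softplus(r); prior p = N(0, 1).
m, r, rho, W = 0.0, 0.0, 2e-4, 100
elbo = []
for t in range(1500):
    s = softplus(r)
    theta = rng.normal(m, s, size=W)
    logL = log_lik(theta)
    # Score-function estimates (3.18) of the gradient of E_q[log L].
    g_m = np.mean((theta - m) / s**2 * logL)
    g_s = np.mean(((theta - m) ** 2 - s**2) / s**3 * logL)
    # Closed-form d_KL(q, p) for the N(0, 1) prior, and its gradients m and s - 1/s.
    kl = -np.log(s) + (s**2 + m**2) / 2.0 - 0.5
    # ELBO ascent step (3.20); chain rule through s = softplus(r) for the r update.
    m += rho * (g_m - m)
    r += rho * (g_s - (s - 1.0 / s)) * sigmoid(r)
    elbo.append(np.mean(logL) - kl)

# The Monte Carlo ELBO should have improved substantially over the run.
assert np.mean(elbo[-100:]) > np.mean(elbo[:100])
```

The noisy trajectory of `elbo` here is exactly the high-variance behavior that motivates the control variates and adaptive learning rates of the next two subsections.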
By the chain rule, for any function $g(\boldsymbol{\beta}_q)$,
\[
\nabla_{r_{jn}} g(\boldsymbol{\beta}_q) = \left. \nabla_{s_{jn}} g(\boldsymbol{\beta}_q) \right|_{s_{jn} = \log(1 + e^{r_{jn}})} \frac{e^{r_{jn}}}{1 + e^{r_{jn}}}
\]
where the first factor is the derivative of $g(\boldsymbol{\beta}_q)$ w.r.t. $s_{jn}$ evaluated at $s_{jn} = \log(1 + e^{r_{jn}})$. The explicit expressions of the derivatives w.r.t. $r_{jn}$ are also provided in appendix A.

3.3.4 Control Variate: Stabilizing the stochastic gradient

We can use algorithm 1 to maximize the ELBO; however, a major drawback is that the noisy estimator of the gradient has high variance. There are two major techniques to reduce the variance of the gradients. One of them is "Rao-Blackwellization", where the idea is to replace the noisy estimate of the gradient with its conditional expectation w.r.t. a subset of the variables [107]. This method is useful when the posterior distribution is separable across subsets of variables or when dealing with latent variables. A convoluted likelihood as in (3.6) is not separable across the components of $\boldsymbol{\theta}_n$, and there are no latent variables in our model; we thereby refrain from using the Rao-Blackwellization approach.

Algorithm 2 BBVI-CV
1. Fix an initial value for the variational parameters $\boldsymbol{\beta}_q^1$.
2. Fix a step size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$.
4. Simulate $W$ samples $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ from $q(\cdot|\boldsymbol{\beta}_q^t)$.
5. Compute $\widehat{c}^*_t = \mathrm{cov}(\mathbf{u}_{1t}, \mathbf{u}_{2t})/\mathrm{var}(\mathbf{u}_{2t})$ where $\mathbf{u}_{1t}$ and $\mathbf{u}_{2t}$ are as in (3.22).
6. Compute $\widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t}$ as in (3.21).
7. Update
\[
\boldsymbol{\beta}_q^{t+1} = \boldsymbol{\beta}_q^t + \rho_t \left( \widetilde{\nabla}_{\boldsymbol{\beta}_q^t} \mathcal{L}_{\boldsymbol{\beta}_q^t} - \nabla_{\boldsymbol{\beta}_q^t} d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot)) \right).
\]
8. Set $t = t + 1$.
9. Repeat steps 4-8 until the convergence of the ELBO, using $\widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\boldsymbol{\beta}_q^t} - d_{\mathrm{KL}}(q(\cdot|\boldsymbol{\beta}_q), p(\cdot))$.

Another method which also gives an efficient technique for stabilizing the gradient is the control variate (CV) (see [110, 97, 107]). We use CVs to reduce the variance of the MC approximations of the gradients.
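The variance reduction achieved by the optimal coefficient $\widehat{c}^* = \mathrm{cov}(\mathbf{u}_1, \mathbf{u}_2)/\mathrm{var}(\mathbf{u}_2)$ can be illustrated on a toy score-function gradient; the target function and its large offset (mimicking the scale of a log-likelihood, which is what inflates the variance of plain BBVI) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Control-variate demo with q = N(m, 1) and target f(theta) = theta^2 - 30.
m, W, reps = 0.5, 100, 2000
plain, cv = [], []
for _ in range(reps):
    theta = rng.normal(m, 1.0, size=W)
    u2 = theta - m                      # score; has expectation zero under q
    u1 = u2 * (theta**2 - 30.0)         # plain estimator terms, as in (3.18)
    c = np.cov(u1, u2)[0, 1] / np.var(u2)
    plain.append(np.mean(u1))
    cv.append(np.mean(u2 * (theta**2 - 30.0 - c)))   # CV estimator, as in (3.21)

# The CV estimator is far less variable, and both target d/dm E[theta^2] = 2m.
assert np.var(cv) < np.var(plain)
assert abs(np.mean(cv) - 2 * m) < 0.05
```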
The key idea behind the variance reduction proposed in [110] is to replace the target function, whose expectation is being approximated by Monte Carlo, with an auxiliary function that has the same expectation but a smaller variance. To reduce the variance of a function $f(\boldsymbol{\phi})$, one instead considers the function
\[
\hat{f}(\boldsymbol{\phi}) = f(\boldsymbol{\phi}) - c\left( h(\boldsymbol{\phi}) - E_q(h(\boldsymbol{\phi})) \right)
\]
where $h(\boldsymbol{\phi})$ is a function with finite expectation and $c$ is a scalar. Such a choice ensures $E_q(\hat{f}(\boldsymbol{\phi})) = E_q(f(\boldsymbol{\phi}))$ and
\[
\mathrm{Var}_q(\hat{f}(\boldsymbol{\phi})) = \mathrm{Var}_q(f(\boldsymbol{\phi})) + c^2 \mathrm{Var}_q(h(\boldsymbol{\phi})) - 2c \, \mathrm{Cov}_q(f(\boldsymbol{\phi}), h(\boldsymbol{\phi}))
\]
which is minimized at $c^* = \mathrm{Cov}_q(f(\boldsymbol{\phi}), h(\boldsymbol{\phi}))/\mathrm{Var}_q(h(\boldsymbol{\phi}))$. Thus, the greater the correlation between $f$ and $h$, the greater the variance reduction. Similar to [107], we use $\nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}|\boldsymbol{\beta}_q)$ as the choice of $h(\boldsymbol{\phi})$. The stochastic approximation of the gradient in (3.17) is then modified as
\[
\widetilde{\nabla}_{\boldsymbol{\beta}_q} \mathcal{L}_{\boldsymbol{\beta}_q} = \frac{1}{W} \sum_{w=1}^{W} \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q) \left[ \log L(\boldsymbol{\theta}_n[w]) - \widehat{c}^* \right]. \tag{3.21}
\]
It is impossible to obtain an exact expression for $c^*$; one thus uses $\widehat{c}^* = \mathrm{cov}(\mathbf{u}_1, \mathbf{u}_2)/\mathrm{var}(\mathbf{u}_2)$, where
\[
\mathbf{u}_1[w] = \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q) \log L(\boldsymbol{\theta}_n[w]), \qquad \mathbf{u}_2[w] = \nabla_{\boldsymbol{\beta}_q} \log q(\boldsymbol{\theta}_n[w]|\boldsymbol{\beta}_q). \tag{3.22}
\]
The extension of algorithm 1 with variance reduction of the MC approximations due to CV is annotated as BBVI-CV and summarized in algorithm 2. As in the implementation of algorithm 1, we use the reparametrization $s_{jn} = \log(1 + e^{r_{jn}})$ explained in section 3.3.3.

3.3.5 RMSprop Learning Rate: Stabilizing the learning rate

Both the BBVI and BBVI-CV algorithms, as described in sections 3.3.3 and 3.3.4, work fairly well with a fixed learning rate for a single layer network. However, their performance deteriorates significantly when the neural network has two or more layers. This can be attributed to the fact that the gradients for the different parameters change at significantly different rates. In order to overcome these issues, a wide class of adaptive learning rates has been explored in [117], [150], etc.
for the frequentist optimization of parameters in deep neural networks. One such popular technique which performs well in practice, called RMSprop, was introduced in [57]; there, the gradient is divided by a running average of its recent magnitude. As described in both [57] and [51], let $G_t$ denote the value of the current gradient. Define
\[
R_t = 0.9 R_{t-1} + 0.1 G_t^2, \qquad t = 1, 2, \cdots
\]
and replace the learning rate $\rho_t$ by the effective learning rate $\rho_t/(\sqrt{R_t} + \epsilon)$ for some small $\epsilon > 0$. Numerical studies show that for a one layer network RMSprop leads to faster convergence, and that for multiple layer networks convergence is not possible without an adaptive learning rate similar to RMSprop. One could also experiment with other adaptive learning rates like AdaGrad, AdaDelta, ADAM, etc. to serve the same purpose as RMSprop (see [88] and [38] for more details on other adaptive learning rates). The updated versions of BBVI and BBVI-CV using RMSprop, renamed BBVI-RMS and BBVI-CV-RMS, are summarized as algorithms 4 and 5 and provided in appendix A.

3.3.6 Classification using variational posterior

Define $\hat{\eta}(\mathbf{x})$, the variational estimator of $\eta_0(\mathbf{x})$, as
\[
\hat{\eta}(\mathbf{x}) = \sigma^{-1}\left( \int \sigma(\eta_{\boldsymbol{\theta}_n}(\mathbf{x})) \pi^*(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \right) \tag{3.23}
\]
where $\pi^*$ is the variational posterior. Analogous to (4.3), the classifier based on $\hat{\eta}(\mathbf{x})$ is
\[
\hat{C}(\mathbf{x}) = \begin{cases} 1, & \sigma(\hat{\eta}(\mathbf{x})) \ge 1/2, \\ 0, & \text{otherwise.} \end{cases} \tag{3.24}
\]
Note, the formulation in (3.23) guarantees that we directly approximate the main quantity of interest, $\sigma(\eta_0(\mathbf{x}))$ as in (4.1), by its posterior mean $\int \sigma(\eta_{\boldsymbol{\theta}_n}(\mathbf{x})) \pi^*(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n$, which is empirically estimated as
\[
\sigma(\hat{\eta}(\mathbf{x})) \approx \frac{1}{W} \sum_{w=1}^{W} \sigma(\eta_{\boldsymbol{\theta}_n[w]}(\mathbf{x})) \tag{3.25}
\]
where $\boldsymbol{\theta}_n[1], \cdots, \boldsymbol{\theta}_n[W]$ are multiple samples from the variational posterior $\pi^*$. Since the generation of multiple samples from the variational posterior is cheap, the error between (3.23) and (3.25) is negligible.
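The prediction rule (3.24)-(3.25) can be sketched as follows. For brevity the logit is a linear function $\boldsymbol{\theta}^{\top}\mathbf{x}$ standing in for the deep network, and the fitted variational parameters are hypothetical values, not outputs of an actual run.

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def variational_predict(x, m, s, W=1000):
    """Classify via (3.24)-(3.25): draw W parameter vectors from the mean-field
    variational posterior N(m, diag(s^2)), average sigma(eta_theta(x)), threshold."""
    theta = rng.normal(m, s, size=(W, len(m)))   # samples from pi*
    prob = np.mean(sigmoid(theta @ x))           # empirical estimate (3.25)
    return int(prob >= 0.5), prob

# Hypothetical fitted variational means and standard deviations.
m = np.array([2.0, -1.0])
s = np.array([0.3, 0.3])
label, prob = variational_predict(np.array([1.0, 0.5]), m, s)
assert label == 1 and prob > 0.5
```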
3.4 Posterior and Classification Consistency

In this section, we establish that the Bayesian inference procedure proposed in section 3.3 enjoys theoretical guarantees in terms of consistency of the posterior estimation and classification. For the simple Gaussian mean field family in (3.11), we establish that the variational posterior (3.12) is consistent under suitable assumptions on the prior parameters. We also discuss how the true function $\eta_0$ impacts the rate of consistency of the variational posterior. Finally, we present how the consistency rates of the variational posterior differ from those of the true posterior.

Let $f_0$ and $f_{\boldsymbol{\theta}_n}$ be the joint densities of the observations $(y_i, \mathbf{x}_i)_{i=1}^n$ under the truth and the model respectively. Without loss of generality, we assume $X_i \sim U[0,1]^{p_n}$, which implies $f_0(\mathbf{x}) = 1$ and $f_{\boldsymbol{\theta}_n}(\mathbf{x}) = 1$. This implies that the joint distribution of $(y_i, \mathbf{x}_i)_{i=1}^n$ depends only on the conditional distribution of $Y|X = \mathbf{x}$. From (4.1) and (3.4), with $\ell_{\boldsymbol{\theta}_n}$ and $\ell_0$ as in (3.5) and (3.7),
\[
\begin{aligned}
f_{\boldsymbol{\theta}_n}(y, \mathbf{x}) &= f_{\boldsymbol{\theta}_n}(y|\mathbf{x}) f_{\boldsymbol{\theta}_n}(\mathbf{x}) = \exp\left( y\eta_{\boldsymbol{\theta}_n}(\mathbf{x}) - \log\left(1 + e^{\eta_{\boldsymbol{\theta}_n}(\mathbf{x})}\right) \right) = \ell_{\boldsymbol{\theta}_n}(y, \mathbf{x}) \\
f_0(y, \mathbf{x}) &= f_0(y|\mathbf{x}) f_0(\mathbf{x}) = \exp\left( y\eta_0(\mathbf{x}) - \log\left(1 + e^{\eta_0(\mathbf{x})}\right) \right) = \ell_0(y, \mathbf{x}).
\end{aligned} \tag{3.26}
\]
We next define the Hellinger neighborhood of the true density function $f_0 = \ell_0$ as
\[
\mathcal{U}_{\varepsilon} = \{ \boldsymbol{\theta}_n : d_{\mathrm{H}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) < \varepsilon \} \tag{3.27}
\]
where the Hellinger distance $d_{\mathrm{H}}(\ell_0, \ell_{\boldsymbol{\theta}_n})$ is given by
\[
d_{\mathrm{H}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) = \left( \frac{1}{2} \int_{\mathbf{x} \in [0,1]^{p_n}} \sum_{y \in \{0,1\}} \left( \sqrt{\ell_0(y, \mathbf{x})} - \sqrt{\ell_{\boldsymbol{\theta}_n}(y, \mathbf{x})} \right)^2 d\mathbf{x} \right)^{1/2}.
\]
Also, the Kullback-Leibler (KL) neighborhood of the true density function $f_0 = \ell_0$ is
\[
\mathcal{N}_{\varepsilon} = \{ \boldsymbol{\theta}_n : d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) < \varepsilon \} \tag{3.28}
\]
where the KL distance $d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n})$ is given by
\[
d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) = \int_{\mathbf{x} \in [0,1]^{p_n}} \sum_{y \in \{0,1\}} \log\left( \frac{\ell_0(y, \mathbf{x})}{\ell_{\boldsymbol{\theta}_n}(y, \mathbf{x})} \right) \ell_0(y, \mathbf{x}) d\mathbf{x}.
\]
Let $P_0^n$ denote the true distribution of $(\mathbf{y}_n, \mathbf{X}_n) = (y_i, \mathbf{x}_i)_{i=1}^n$ under the true density $\ell_0$.

3.4.1 Posterior consistency and its implication in practice

In the following two theorems, we establish the posterior consistency of $\pi^*$ defined in (3.12).
In this direction, we show that the variational posterior concentrates in small $\varepsilon$ Hellinger neighborhoods of the true density $\ell_0$. In theorem 3.4.1, we establish this result for a fixed choice of the neighborhood size $\varepsilon$. In theorem 3.4.2, we establish the same result for shrinking neighborhood sizes. For both these theorems, the total number of parameters $K_n$ is allowed to grow at a rate $n^a$ for some $0 < a < 1$. Note, theorem 3.4.1 is a simple consistency result and holds due to the universal approximation properties of neural networks (see [60]) when the number of layers and input variables are fixed. This is an important result since it shows that, irrespective of the function under study, BDNNs enjoy consistency properties if the number of input variables and the number of layers are fixed. Additionally, we provide a characterization of the prior distribution, namely the rate of growth of the $L_2$ norm of the prior mean parameter necessary to guarantee the consistency result in theorem 3.4.1 (see (A2)) and the contraction result in theorem 3.4.2 (see (A4)). Theorem 3.4.2 studies the contraction rate of the variational posterior; it is more restrictive in nature and requires additional assumptions on the approximating neural network solution to the true function $\eta_0$ (see assumption (A3) below).

We next describe how our theoretical development contrasts with the recent works of [105] and [3]. First, theorem 3.4.2 establishes the variational posterior contraction rates following the classical definition of contraction as in theorem 2.1 of [47] and theorem 2.1 of [122]. It differs from the consistency results in [3], which deal with the posterior expectation of the square of the Hellinger distance, and from [105], which considers a lower bound involving $\|\eta_{\boldsymbol{\theta}_n} - \eta_0\|_{\infty}$.
Second, unlike the two aforementioned works, we assume a restriction only on the total number of parameters $K_n$ in the system instead of developing the results for the same number of nodes in each layer, an assumption which can severely restrict the space of neural network solutions one works with. Third, both [105] and [3] assume that there exists a true sparse solution all of whose coefficients are bounded above by a constant $B$ (see condition 4.3 in [3]). We impose no such restriction on our true neural network solution to begin with, but derive the most relaxed condition on the joint growth of the number of nodes and the strength of connections between active nodes that allows the rates of contraction to hold (see condition 3 in (A3)). Indeed, if we make the assumption that all coefficients of the neural network are bounded above by $B$, condition 3 in (A3) simplifies to a restriction only on the number of nodes, as in [105] and [3]. Lastly, both of these works establish their contraction results in the context of regression problems, which allows them to use results from [111]. Our systematic development here requires the derivation of the corresponding tools for a classification setup, and the ideas may be extended to other generalized linear models.

Theorem 3.4.1 Let $K_n \sim n^a$, $0 < a < 1$, and let $p_n = p$, $L_n = L$ be constants independent of $n$. Suppose

(A2) the prior parameters in (3.9) satisfy assumption (A1) and $\|\boldsymbol{\mu}_n\|_2^2 = o(n)$.

Then, $\pi^*(\mathcal{U}_{\varepsilon}^c) \to 0$ in $P_0^n$-probability.

Here, $\|\cdot\|_2$ is the $L_2$ norm of a vector as in definition A.0.1 in appendix B. By the above theorem, for any $\nu > 0$, $\pi^*(\mathcal{U}_{\varepsilon}^c) < \nu$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.1, it can be established that the true posterior satisfies $\pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n) < 2e^{-n\varepsilon^2/2}$ with probability tending to 1 as $n \to \infty$ (see theorem A.0.18 part 1 in appendix D).
This implies that the probability of the $\varepsilon$ Hellinger neighborhood of the true density $\ell_0$ under the true posterior increases at the rate $1 - 2e^{-n\varepsilon^2/2}$, in contrast to the slower rate $1 - \nu$ for the variational posterior.

Theorem 3.4.2 Suppose $K_n \sim n^a$, $0 < a < 1$, $L_n \sim \log n$, and $\epsilon_n^2 \sim n^{-\delta}$, $0 < \delta < 1 - a$. Suppose

(A3) there exists a sequence of neural network functions $\eta_{\boldsymbol{\theta}_n^*}$ satisfying
1. $\|\eta_0 - \eta_{\boldsymbol{\theta}_n^*}\|_{\infty} = o(\epsilon_n^2)$;
2. $\|\boldsymbol{\theta}_n^*\|_2^2 = o(n\epsilon_n^2)$;
3. $\log\left( \sum_{v=0}^{L_n} k_{vn} \prod_{v'=v+1}^{L_n} a^*_{v'n} \right) = O(\log n)$, where $a^*_{v'n} = \sup_{s=0,\cdots,k_{(v'+1)n}} \|\mathbf{A}^*_{v'}[s]\|_1$ is the largest $L_1$ norm among the rows of $\mathbf{A}^*_{v'}$;

(A4) the prior parameters satisfy assumption (A1) and $\|\boldsymbol{\mu}_n\|_2^2 = o(n\epsilon_n^2)$.

Then, $\pi^*(\mathcal{U}_{\varepsilon\epsilon_n}^c) \to 0$ in $P_0^n$-probability.

By the above theorem, for any $\nu > 0$, $\pi^*(\mathcal{U}_{\varepsilon\epsilon_n}^c) < \nu$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.2, it can be established that the true posterior satisfies $\pi(\mathcal{U}_{\varepsilon\epsilon_n}^c|\mathbf{y}_n, \mathbf{X}_n) < 2e^{-n\varepsilon^2\epsilon_n^2/2}$ with probability tending to 1 as $n \to \infty$ (see theorem A.0.19 part 1 in appendix D). This implies that the probability of the shrinking $\varepsilon\epsilon_n$ Hellinger neighborhoods of the true density $\ell_0$ under the true posterior increases at the rate $1 - 2e^{-n\varepsilon^2\epsilon_n^2/2}$, in contrast to the slower rate $1 - \nu$ for the variational posterior.

Remark: For a single layer, condition 3 of assumption (A3) holds if the number of input features increases at a rate polynomial in $n$. As the number of layers increases, one needs the row sums in the true solution $\mathbf{A}^*_v$, $v = 0, \cdots, L_n$, to be bounded. This shows that even with a control on the number of nodes, the strength of the signal into every active node must be well controlled (this corresponds to edge selection following node selection).

3.4.2 Discussion of the proof

We next briefly outline the main steps in the proof of theorems 3.4.1 and 3.4.2; the details are deferred to appendix C. The first step of the proof is to establish that $d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ is bounded below by a quantity determined by the rate of consistency of the true posterior.
The second step is to show that $d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ is bounded above at a rate greater than its lower bound if and only if the variational posterior is consistent. Note,
\[
\begin{aligned}
d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) &= \int_{\mathcal{U}_{\varepsilon}} \pi^*(\boldsymbol{\theta}_n) \log \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)} d\boldsymbol{\theta}_n + \int_{\mathcal{U}_{\varepsilon}^c} \pi^*(\boldsymbol{\theta}_n) \log \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)} d\boldsymbol{\theta}_n \\
&= -\pi^*(\mathcal{U}_{\varepsilon}) \int_{\mathcal{U}_{\varepsilon}} \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi^*(\mathcal{U}_{\varepsilon})} \log \frac{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)}{\pi^*(\boldsymbol{\theta}_n)} d\boldsymbol{\theta}_n - \pi^*(\mathcal{U}_{\varepsilon}^c) \int_{\mathcal{U}_{\varepsilon}^c} \frac{\pi^*(\boldsymbol{\theta}_n)}{\pi^*(\mathcal{U}_{\varepsilon}^c)} \log \frac{\pi(\boldsymbol{\theta}_n|\mathbf{y}_n, \mathbf{X}_n)}{\pi^*(\boldsymbol{\theta}_n)} d\boldsymbol{\theta}_n \\
&\ge \pi^*(\mathcal{U}_{\varepsilon}) \log \frac{\pi^*(\mathcal{U}_{\varepsilon})}{\pi(\mathcal{U}_{\varepsilon}|\mathbf{y}_n, \mathbf{X}_n)} + \pi^*(\mathcal{U}_{\varepsilon}^c) \log \frac{\pi^*(\mathcal{U}_{\varepsilon}^c)}{\pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n)}, \quad \text{by Jensen's inequality,}
\end{aligned}
\]
with $\mathcal{U}_{\varepsilon}$ as in (3.27). Since $\pi(\mathcal{U}_{\varepsilon}|\mathbf{y}_n, \mathbf{X}_n) \le 1$, note that for any $\varepsilon > 0$,
\[
\begin{aligned}
d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) &\ge \pi^*(\mathcal{U}_{\varepsilon}) \log \pi^*(\mathcal{U}_{\varepsilon}) + \pi^*(\mathcal{U}_{\varepsilon}^c) \log \pi^*(\mathcal{U}_{\varepsilon}^c) - \pi^*(\mathcal{U}_{\varepsilon}^c) \log \pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n) \\
&\ge -\pi^*(\mathcal{U}_{\varepsilon}^c) \log \pi(\mathcal{U}_{\varepsilon}^c|\mathbf{y}_n, \mathbf{X}_n) - \log 2, \quad \text{since } x \log x + (1 - x)\log(1 - x) \ge -\log 2, \\
&= -\pi^*(\mathcal{U}_{\varepsilon}^c) \left( \log \int_{\mathcal{U}_{\varepsilon}^c} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n - \log \int \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \right) - \log 2.
\end{aligned}
\]
Thus, with
\[
A_n = \log \int_{\mathcal{U}_{\varepsilon}^c} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n, \qquad B_n = \log \int \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n, \tag{3.29}
\]
we get the following main step towards the proof of theorems 3.4.1 and 3.4.2:
\[
\pi^*(\mathcal{U}_{\varepsilon}^c) |A_n| \le d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) + |B_n| + \log 2. \tag{3.30}
\]
In the above proof we have assumed $\pi^*(\mathcal{U}_{\varepsilon}) > 0$ and $\pi^*(\mathcal{U}_{\varepsilon}^c) > 0$. If $\pi^*(\mathcal{U}_{\varepsilon}^c) = 0$, there is nothing to prove. If $\pi^*(\mathcal{U}_{\varepsilon}) = 0$, then following the steps of the proof in appendix C, we get $\varepsilon^2 = o_{P_0^n}(1)$, which is a contradiction. The first term $A_n$ is handled by decomposing
\[
e^{A_n} = \int_{\mathcal{U}_{\varepsilon}^c \cap \mathcal{F}_n} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n + \int_{\mathcal{U}_{\varepsilon}^c \cap \mathcal{F}_n^c} \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n
\]
where $\{\mathcal{F}_n\}_{n=1}^{\infty}$ is a suitably chosen sequence of sieves. Indeed, our choice of $\mathcal{F}_n$ is given by
\[
\mathcal{F}_n = \left\{ \boldsymbol{\theta}_n : |\theta_{jn}| \le C_n, \; j = 1, \cdots, K_n \right\} \tag{3.31}
\]
where $C_n = e^{n^{1/b_1}}$ in theorem 3.4.1 and $C_n = e^{n\epsilon_n^2/(b_1 K_n)}$ in theorem 3.4.2 respectively, with $b_1$ chosen to ensure that the Hellinger bracketing entropy (see definition A.0.2 in appendix B) of $\mathcal{F}_n$ is well controlled (proposition A.0.16 in appendix C). Secondly, the prior needs to assign negligible probability to $\mathcal{F}_n^c$ so that the term $e^{A_n}$ is well controlled.
The prior in (3.9) satisfies this requirement for theorem 3.4.1 under assumptions (A1), (A2), and for theorem 3.4.2 under assumptions (A1), (A4). The second quantity $B_n$ is controlled by the rate at which the prior gives mass to shrinking KL neighborhoods of the true density $\ell_0$. In theorem 3.4.1, this rate is controlled as long as the prior parameters in (3.9) satisfy (A1) and (A2). In theorem 3.4.2, the same rate is controlled as long as the prior parameters satisfy (A1) and (A4) and the true function $\eta_0$ has a neural network solution which satisfies assumption (A3). Finally, we bound $d_{\mathrm{KL}}(\pi^*, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ by $d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n))$ for a suitable $q \in \mathcal{Q}_n$ (see propositions B.0.5 and A.0.17 in the appendix). From relation (A.30) in the appendix,
\[
d_{\mathrm{KL}}(q, \pi(\cdot|\mathbf{y}_n, \mathbf{X}_n)) \le d_{\mathrm{KL}}(q, p) + \int \log \frac{L_0}{L(\boldsymbol{\theta}_n)} q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n + \log \int \frac{L(\boldsymbol{\theta}_n)}{L_0} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n. \tag{3.32}
\]
The last term above is nothing but $B_n$. The second term is the most crucial quantity:
\[
\int \log \frac{L_0}{L(\boldsymbol{\theta}_n)} q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \approx n \int d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n.
\]
For both theorems 3.4.1 and 3.4.2, the right hand side can always be controlled by choosing $q = \mathrm{MVN}(\mathbf{m}_n^*, \mathbf{s}_n^*)$ for a suitable choice of the sequences $\mathbf{m}_n^*$ and $\mathbf{s}_n^*$. We discuss the choice of $\mathbf{s}_n^*$ in appendix C. For theorem 3.4.1, $\mathbf{m}_n^* = \boldsymbol{\theta}_n^*$, where $\eta_{\boldsymbol{\theta}_n^*}$ is the finite neural network approximation of $\eta_0$; for theorem 3.4.2, $\mathbf{m}_n^* = \boldsymbol{\theta}_n^*$ corresponds to $\eta_{\boldsymbol{\theta}_n^*}$, the rate controlled neural network approximation of assumption (A3). Finally, the first term in (3.32) is determined by both the prior and $q$. In theorem 3.4.1, it is controlled as long as the prior parameters in (3.9) satisfy (A1), (A2). In theorem 3.4.2, the same rate is controlled as long as the prior parameters satisfy (A1), (A4) and the sequence $\boldsymbol{\theta}_n^*$ satisfies assumption (A3).

In light of the above discussion, there are three main properties which a prior must satisfy to allow for the convergence of the variational posterior. For any $\nu > 0$:
1. For a sequence of sieves $\{\mathcal{F}_n\}_{n=1}^{\infty}$ with well controlled Hellinger bracketing entropy,
\[
\int_{\mathcal{F}_n^c} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \le e^{-n\epsilon_n^2\nu}, \quad n \to \infty.
\]
2. With $\mathcal{N}_{\varepsilon}$ as in (3.28),
\[
\int_{\mathcal{N}_{\varepsilon\epsilon_n^2}} p(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n \ge e^{-n\epsilon_n^2\nu}, \quad n \to \infty.
\]
3. For a $q$ satisfying $\int d_{\mathrm{KL}}(\ell_0, \ell_{\boldsymbol{\theta}_n}) q(\boldsymbol{\theta}_n) d\boldsymbol{\theta}_n < \varepsilon$ as $n \to \infty$,
\[
d_{\mathrm{KL}}(q, p) \le n\epsilon_n^2\nu, \quad n \to \infty.
\]
Whereas conditions 1 and 2 are standard assumptions for the consistency of the true posterior (see assumptions 1 and 2 in [4] and theorem 2 in [74]), condition 3 is an additional requirement which makes the variational posterior consistent. The proof presented in this section can be generalized to a much wider class of priors satisfying (1)-(3).

3.4.3 Classification consistency

In this section, we discuss the classification accuracy of the predictions made by the variational posterior by comparing it to the optimal mis-classification error. In view of (4.2), let $R(\hat{C})$ and $R(C^{\text{Bayes}})$ denote the mis-classification risks under the variational classifier in (3.24) and the Bayes classifier in (4.3) respectively. Then
\[
\begin{aligned}
|R(\hat{C}) - R(C^{\text{Bayes}})| &= \left| E_X E_{Y|X}\left[ \mathbb{1}_{\hat{C}(X) \ne Y} - \mathbb{1}_{C^{\text{Bayes}}(X) \ne Y} \right] \right| \\
&= \left| E_X\left[ \left( \mathbb{1}_{\hat{C}(X)=0} - \mathbb{1}_{C^{\text{Bayes}}(X)=0} \right) \sigma(\eta_0(X)) + \left( \mathbb{1}_{\hat{C}(X)=1} - \mathbb{1}_{C^{\text{Bayes}}(X)=1} \right) (1 - \sigma(\eta_0(X))) \right] \right| \\
&\le 2 E_X\left[ \mathbb{1}_{\hat{C}(X) \ne C^{\text{Bayes}}(X)} \, |\sigma(\eta_0(X)) - 1/2| \right] \\
&= 2 E_X\left[ \mathbb{1}_{\sigma(\hat{\eta}(X)) \ge 1/2, \, \sigma(\eta_0(X)) < 1/2} \, |\sigma(\eta_0(X)) - 1/2| + \mathbb{1}_{\sigma(\hat{\eta}(X)) < 1/2, \, \sigma(\eta_0(X)) \ge 1/2} \, |\sigma(\eta_0(X)) - 1/2| \right] \\
&\le 2 E_X |\sigma(\eta_0(X)) - \sigma(\hat{\eta}(X))|.
\end{aligned} \tag{3.33}
\]
The above result establishes how the difference in classification accuracy depends on the logit links $\hat{\eta}(X)$ and $\eta_0(X)$ as defined in (3.23) and (4.1) respectively. Using the above result, in corollary 3.4.3 we establish the classification accuracy of the variational estimate $\hat{\eta}(\mathbf{x})$ under no assumptions on the true function $\eta_0(\mathbf{x})$. In corollary 3.4.4, we establish the same result under assumption (A3) on the true function $\eta_0(\mathbf{x})$. Note, although theorem 3.4.1 requires minimal assumptions, it gives a much weaker convergence result on the classification accuracy.

Corollary 3.4.3 Under the conditions of theorem 3.4.1, $|R(\hat{C}) - R(C^{\text{Bayes}})| \to 0$ in $P_0^n$-probability.
By the above corollary, for any $\nu > 0$, $|R(\hat{C}) - R(C^{\text{Bayes}})| < \nu$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.1, it can be established that the true posterior also gives classification consistency at the same rate, so there is no loss in using the variational posterior approximation (see theorem A.0.18 part 2 in appendix D).

Corollary 3.4.4 Under the conditions of theorem 3.4.2, for every $0 \le \kappa \le 2/3$, $\epsilon_n^{-\kappa} |R(\hat{C}) - R(C^{\text{Bayes}})| \to 0$ in $P_0^n$-probability.

By the above corollary, for any $\nu > 0$ and $0 \le \kappa \le 2/3$, $|R(\hat{C}) - R(C^{\text{Bayes}})| < \nu\epsilon_n^{\kappa}$ with probability tending to 1 as $n \to \infty$. Under the conditions of theorem 3.4.2, it can be established that the true posterior satisfies $|R(\hat{C}) - R(C^{\text{Bayes}})| < \nu\epsilon_n^{\kappa}$ for every $\nu > 0$, $0 \le \kappa \le 1$, with probability tending to 1 as $n \to \infty$ (see theorem A.0.19 part 2 in appendix D). Thus, classification consistency occurs at the rate $\epsilon_n^{2/3}$ for the variational posterior, in contrast to $\epsilon_n$ for the true posterior.

3.5 Simulation Studies

In this section, we study the performance of the four algorithms BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS in the context of two simulation scenarios. We used an approximate 2:1 ratio of training to test cases. All covariates are normalized. We adopted 10-fold cross-validation to avoid optimistically biased estimates of model performance.

3.5.1 Simulation Scenarios

Scenario 1: We simulate $n = 3000$ observations from a 2-2-2-1 network, i.e. a neural network with 2 input features, 2 hidden layers with 2 nodes each and 1 output node, as
\[
y_i = \begin{cases} 0, & \mathbf{b}_2 + \mathbf{A}_2\psi(\mathbf{b}_1 + \mathbf{A}_1\psi(\mathbf{b}_0 + \mathbf{A}_0\mathbf{x}_i)) > 0, \\ 1, & \text{otherwise,} \end{cases}
\]
where $\mathbf{x}_i \in \mathbb{R}^2$ with entries i.i.d. from $N(0,1)$, and the entries of $\mathbf{b}_j$, $\mathbf{A}_j$, $j = 0, 1, 2$ are i.i.d. from $U(0,1)$.

Scenario 2: We simulate $n = 3000$ observations from the following nonlinear function:
\[
y_i = \begin{cases} 0, & 2e^{x_i[1]} + 3\sin(x_i[2]x_i[3]) + 4x_i[4] - 3 > 0, \\ 1, & \text{otherwise,} \end{cases}
\]
where $\mathbf{x}_i \in \mathbb{R}^4$ with entries i.i.d. from $N(0,1)$.
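The data-generating mechanism of scenario 2 can be sketched as follows; the seed and the exact split indices are arbitrary illustrative choices, with the approximate 2:1 train/test ratio taken from the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# Scenario 2: n = 3000 observations; label 0 when the nonlinear score exceeds 0.
n = 3000
X = rng.normal(size=(n, 4))
score = 2 * np.exp(X[:, 0]) + 3 * np.sin(X[:, 1] * X[:, 2]) + 4 * X[:, 3] - 3
y = np.where(score > 0, 0, 1)

# An approximate 2:1 ratio of training to test cases.
n_train = 2 * n // 3
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

assert set(np.unique(y)) == {0, 1} and len(y_test) == n - n_train
```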
3.5.2 Parameter choices for the statistical and computational models

In order to implement BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS, we need to make a valid choice of the prior parameters $\mu_{jn}$, $\sigma_{jn}$, $j = 1, \cdots, K_n$, as in (3.9). We use $\mu_{jn} = 0$ and $\sigma_{jn} = 1$ for our prior parameters. Indeed, this choice satisfies conditions (A1), (A2) and (A4) as assumed in the consistency proofs of theorems 3.4.1 and 3.4.2. Next, we need to choose the number of nodes in each hidden layer. We experiment with 1 and 2 hidden layers with 2 nodes in each layer. This choice of the number of nodes satisfies the assumptions of theorems 3.4.1 and 3.4.2.

3.5.3 Gradient stabilization parameters

The initial learning rate is $\rho_t = 10^{-4}$, $t \ge 1$ for BBVI and BBVI-CV, and $\rho_t = 10^{-1}$, $t \ge 1$ for BBVI-RMS and BBVI-CV-RMS. These values were chosen to ensure the optimal performance of the algorithms; however, little sensitivity to the initial choice was observed. As explained in section 3.3, to allow for stable optimization, we study the sensitivity to the Monte Carlo sample size $S$, the use of control variates and the RMSprop based gradient method. The performance of the model, in terms of algorithmic stability and convergence time, is sensitive to the choice of $S$: each update with a small sample size takes less time, but the variability of the estimate is high; on the other hand, a large sample size leads to less variable estimates, but each update takes much longer. We experimented with $S = 200$, $S = 500$ and $S = 1000$. For scenario 1, figures 3.1 and 3.2 illustrate how the ELBO changes with $S$ for one and two layers respectively. For scenario 2, figures 3.3 and 3.4 provide the same illustration for one and two layers respectively. It is evident that increasing $S$ from 200 to 1000 stabilizes the ELBO and helps with a faster convergence.
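The effect of $S$ on stability reflects the familiar $1/S$ scaling of the Monte Carlo variance, which can be illustrated on a toy score-function gradient (the target function here is an arbitrary stand-in for the network ELBO gradient):

```python
import numpy as np

rng = np.random.default_rng(6)

def grad_estimates(S, reps=3000, m=0.3):
    # One score-function gradient estimate per row, each averaging S samples.
    theta = rng.normal(m, 1.0, size=(reps, S))
    return np.mean((theta - m) * theta**2, axis=1)

v200 = np.var(grad_estimates(200))
v1000 = np.var(grad_estimates(1000))
# Larger S gives a markedly more stable gradient, as seen in figures 3.1-3.4.
assert v1000 < v200
```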
As explained in section 3.3.4, the maximization of the ELBO requires stabilizing the variance of the stochastic gradient in (3.18), which is done by the use of control variates. For scenario 1, figures 3.1 and 3.2 illustrate how the ELBO changes with the use of control variates for one and two layers respectively. For scenario 2, figures 3.3 and 3.4 provide the same illustration for one and two layers respectively. It is evident that the use of control variates stabilizes the ELBO by a huge margin and allows for its faster convergence. Finally, as explained in section 3.3.5, the use of RMSprop stabilizes the optimization of the ELBO by normalizing the gradients by their running magnitude. For scenario 1, figures 3.1 and 3.2 illustrate how the ELBO changes with the use of RMSprop versus a fixed learning rate for one and two layers respectively. For scenario 2, figures 3.3 and 3.4 provide the same illustration for one and two layers respectively. It is evident that the use of RMSprop leads to a stable ELBO and faster convergence rates.

Figure 3.1: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 1 layer.
Figure 3.2: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 1 with 2 layers.
Figure 3.3: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 1 layer.
Figure 3.4: ELBO convergence of algorithms 1, 2, 4, 5 for scenario 2 with 2 layers.
                                      Testing accuracy (%)           Convergence time (s)
Layers  Method    Sample size (S)     Fixed          RMSprop         Fixed   RMSprop
1       BBVI      200                 97.41 ± 0.50   96.89 ± 0.93    23      114
                  500                 97.72 ± 0.38   97.52 ± 0.74    55      106
                  1000                98.01 ± 0.33   97.38 ± 0.39    108     80
        BBVI-CV   200                 97.82 ± 0.40   97.61 ± 0.60    21      6
                  500                 97.84 ± 0.40   97.67 ± 0.34    52      7
                  1000                97.84 ± 0.42   97.94 ± 0.40    104     10
2       BBVI      200                 97.79 ± 0.71   97.02 ± 1.10    200     98
                  500                 94.34 ± 3.82   97.75 ± 0.95    452     39
                  1000                91.50 ± 5.17   98.11 ± 0.42    904     65
        BBVI-CV   200                 96.34 ± 0.75   97.61 ± 0.44    118     17
                  500                 96.33 ± 0.73   97.30 ± 0.60    272     23
                  1000                96.36 ± 0.74   97.74 ± 0.54    552     40

Table 3.1: Performance of algorithms 1, 2, 4, 5 for scenario 1.

3.5.4 Testing accuracy and convergence

We evaluate the model's performance for all four algorithms BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS under two criteria: (1) testing accuracy and (2) convergence time. The test accuracy $T(C)$ of a classifier is given by $1 - R(C)$, where $R(C)$ is the mis-classification error rate as described in (4.2). The convergence criterion is defined as the point where the Monte Carlo estimate of the ELBO, as in (3.19), converges. For scenario 1, table 3.1 gives the performance of the four algorithms for 1 and 2 layers. The best average accuracy of 98.11% is obtained by BBVI-RMS with $S = 1000$ and 2 layers. The best time is achieved by BBVI-CV-RMS with $S = 200$ for one layer, with an average accuracy of 97.61%. We can draw two conclusions here: although the true data are generated from a 2-layer network structure, a one layer approximation is fairly competitive; and BBVI-CV-RMS with $S = 200$ provides the best convergence time of nearly 6 seconds for one layer and 17 seconds for two layers with competitive accuracy. For scenario 2, table 3.2 gives the performance of the four algorithms for 1 and 2 layers. The best average accuracy of 91.12% is obtained by BBVI-CV-RMS with $S = 1000$ and 2 layers. The best time is achieved by BBVI-CV-RMS with $S = 200$ for one layer, with an average accuracy of 91.11%.
The improvement obtained by moving from 1 to 2 layers is only marginal. BBVI-CV-RMS with S = 200 provides the best convergence time of nearly 19 s for one layer and 11 s for two layers with competitive accuracy.

                                     Testing accuracy (%)            Convergence time (s)
Layers  Method    Sample size (S)    Fixed            RMSprop        Fixed    RMSprop
1       BBVI      200                83.66 ± 14.51    88.71 ± 7.12     190         15
                  500                90.22 ± 0.54     90.32 ± 0.98     364        390
                  1000               90.28 ± 0.75     90.41 ± 0.71     732        710
        BBVI-CV   200                90.51 ± 0.87     90.42 ± 0.64      17         19
                  500                90.51 ± 0.87     90.65 ± 0.61      36         33
                  1000               90.53 ± 0.91     90.78 ± 0.49      69         37
2       BBVI      200                88.40 ± 0.50     89.89 ± 0.88     256        421
                  500                90.52 ± 0.38     90.48 ± 0.74     518        544
                  1000               90.61 ± 0.33     90.32 ± 0.65     906        608
        BBVI-CV   200                90.62 ± 0.40     91.11 ± 0.58     444         11
                  500                90.74 ± 0.40     90.98 ± 0.54     862         12
                  1000               90.72 ± 0.42     91.12 ± 0.53    1646         13

Table 3.2: Performance of algorithms 1, 2, 4, 5 for scenario 2.

3.5.5 Large number of layers and challenges.

We finally discuss the performance of BBVI-RMS and BBVI-CV-RMS when the number of layers is 3. For 3 layers, using a fixed learning rate does not allow for the maximization of the ELBO. This may be attributed to the different scales of the gradients for the different parameters; similar behavior is also observed in parametric optimization of artificial deep neural networks (see [57] for more details). From Table 3.3, it is evident that using 3 layers instead of 2 provides only a marginal improvement for scenario 1. For scenario 2, the performance with 3 layers is worse than with 2 layers. As explained in the previous sections, the performance of both BBVI-RMS and BBVI-CV-RMS improves with increasing sample size S. However, a great deal of sensitivity to the choice of the initial learning rate was observed. This sensitivity was even more pronounced in the case of control variates, especially under scenario 2.
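The "different scales of the gradients" issue above is exactly what RMSprop addresses. The following toy sketch (learning rate, decay and gradients are illustrative, not the settings used in the experiments) shows how dividing by a running root-mean-square of the gradient lets parameters with very different gradient magnitudes advance at comparable rates under a single learning rate:

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.1, decay=0.9, eps=1e-8):
    # One RMSprop (ascent) update: maintain a running average of the
    # squared gradient and normalize the step by its square root.
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param + lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Two parameters whose gradients differ by five orders of magnitude.
p, c = np.zeros(2), np.zeros(2)
for _ in range(50):
    g = np.array([1e-3, 1e2])      # fixed toy gradients
    p, c = rmsprop_step(p, g, c)
print(p)   # both coordinates advance at a comparable rate
```

With a fixed learning rate, the first coordinate would barely move while the second would overshoot; after RMSprop normalization both take near-identical steps, which is why a single initial rate can serve all layers.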
For scenario 1, the optimal learning rate ρ_t, t ≥ 1, was found to be 0.1 for BBVI-RMS and 0.3 (S = 200) or 0.35 (S = 500 and 1000) for BBVI-CV-RMS. For scenario 2 under BBVI-RMS, the optimal learning rates were found to be ρ_t = 0.055, 0.04 and 0.04, t ≥ 1, for S = 200, 500 and 1000, respectively. For scenario 2 under BBVI-CV-RMS, the optimal learning rates were ρ_t = 0.4, 0.55 and 0.63, t ≥ 1, for S = 200, 500 and 1000, respectively. With the optimal choice of ρ_t at hand, BBVI-CV-RMS provided faster convergence with test accuracy comparable to that of BBVI-RMS. This sensitivity to the choice of the initial learning rate, especially in the case of control variates with a large number of layers, needs to be explored as part of future work.

                       Scenario 1                       Scenario 2
Method        S        Testing accuracy (%)  Time (s)   Testing accuracy (%)  Time (s)
BBVI-RMS      200      97.76 ± 0.87          218        84.68 ± 4.85          423
              500      97.65 ± 0.83          169        88.00 ± 5.56          631
              1000     98.21 ± 0.73          132        90.69 ± 0.67          714
BBVI-CV-RMS   200      96.23 ± 1.05          212        84.53 ± 8.90           33
              500      97.83 ± 0.81          166        88.28 ± 2.03           37
              1000     98.42 ± 0.72          124        89.33 ± 1.67           45

Table 3.3: Performance of algorithms 1, 2, 4, 5 for scenario 1 and scenario 2 with 3 layers.

3.6 Numerical Properties and Alzheimer's Disease Study

The transition from mild cognitive impairment (MCI) to Alzheimer's disease (AD) is of great interest to clinical researchers. Several studies over the past decade have demonstrated and compared the performance of different machine learning methods on this classification task. For this classification problem, we illustrate the performance of the variational Bayesian neural networks developed in Section 3.3 in terms of classification accuracy, numerical complexity and time of convergence. We implemented both algorithms 1 and 2 and shall henceforth refer to them as BBVI and BBVI-CV, respectively.
For a comparative baseline, we also report the performance of several machine learning techniques applicable to this task. We emphasize that our primary goal here is to illustrate the computational methodology rather than an incremental improvement for a specific application.

Alzheimer's disease (AD) is a progressive, age-related, neurodegenerative disease and the most common cause of dementia [147, 148, 68]. Behaviorally, AD is commonly preceded by mild cognitive impairment (MCI), a syndrome characterized by decline in memory and other cognitive domains that exceeds the cognitive decrements associated with normal aging [148, 103]. However, the prodromal symptoms of MCI are not prognostically deterministic: individuals with MCI progress to probable AD at a rate of 8%-15% per year, and most conversions occur within 3 years of presentation [24, 44, 2]. We used T1-weighted MRI images from the collection of standardized datasets. The description of the standardized MRI imaging from ADNI can be found at http://adni.loni.usc.edu/methods/mri-analysis/adni-standardized-data. This study used a subset of the MCI subjects from ADNI-1 who had demographic data, clinical cognitive assessments, APOE4 genotyping, and MRI measurements. In total, 819 individuals had a baseline diagnosis of MCI, but we consider only patients whose follow-up period was at least 36 months and who had no missing values. The final sample included 265 subjects, comprising participants who were stable in their diagnosis (MCI-S) and those who converted to a diagnosis of AD over 3 years (MCI-C). We considered a total of 18 clinical predictors as potential predictors of MCI-to-AD progression in our classification analyses. Structural MRI data were collected according to the ADNI acquisition protocol using T1-weighted scans (GradWarp, B1 Correction, N3, Scaled). Based on the extant literature [68, 81], we used 24 ROI features regarded as clinically significant for MCI-to-dementia progression.
The dependence and interactions among and within the different modes of features (clinical, MRI) may differ and are hard to model explicitly. Thus, neural network-based modeling is intuitive from a predictive modeling and machine learning perspective. Of the 265 patients, 186 were selected by simple random sampling as training cases and the remaining 79 as test cases. The approximately 2:1 ratio of training to test cases is, of course, arbitrary. All covariates (except categorical variables) were z-normalized. The outcome y_i for the i-th patient is either 1 for MCI-C or 0 for MCI-S in the classification study. 10-fold cross-validation is used to avoid optimistically biased estimates of model performance.

3.6.1 Parameter choices for statistical and computational models.

To implement BBVI, BBVI-CV, BBVI-RMS, and BBVI-CV-RMS, we set μ_jn = 0 and σ_jn = 1 as in Section 3.5. For the number of layers, we found that one layer provides good enough performance, and the inclusion of additional layers does not offer further improvement in accuracy. We tried k_1n = 2, 10, 20 and obtained the best results at k_1n = 10, which are the results reported in this thesis.

3.6.2 Gradient stabilization parameters.

The initial learning rate is ρ_t = 10⁻⁴, t ≥ 1, for BBVI and BBVI-CV and ρ_t = 10⁻¹, t ≥ 1, for BBVI-RMS and BBVI-CV-RMS. As explained in Section 3.3, to allow for stable optimization, we study the sensitivity to the different sample sizes S, the use of control variates and the RMSprop-based gradient descent method. For ADNI, Figure 3.5 illustrates how the ELBO changes with S. It is evident that increasing S from 200 to 1000 stabilizes the ELBO and helps with faster convergence. Figure 3.5 also illustrates how the ELBO changes with the use of control variates: their use stabilizes the ELBO by a wide margin and allows for its faster convergence.
Similarly, Figure 3.5 also illustrates how the ELBO changes with the use of RMSprop versus a fixed learning rate. It is evident that the use of RMSprop leads to a stable ELBO and faster convergence.

Figure 3.5: ELBO convergence of algorithms 1, 2, 4, 5 for ADNI.

3.6.3 Testing accuracy and convergence.

For ADNI, Table 3.4 gives the performance of BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS for one layer. The best average accuracy of 76.88% was obtained by BBVI with S = 200. The optimal convergence time is achieved by BBVI-CV-RMS with S = 200 for one layer, with an average accuracy of 76.25% and a convergence time of 36 seconds. Thus, the conclusions for the real data corroborate the use of BBVI-CV-RMS for a single-layer NN.

                             Testing accuracy (%)           Convergence time (s)
Method    Sample size (S)    Fixed           RMSprop        Fixed    RMSprop
BBVI      200                76.88 ± 3.32    75.75 ± 3.27      68        49
          500                76.75 ± 3.63    76.50 ± 3.90     105        62
          1000               76.75 ± 3.12    76.63 ± 3.21     231        65
BBVI-CV   200                76.75 ± 3.41    76.25 ± 3.83     146        36
          500                76.75 ± 3.58    76.63 ± 3.95     210        38
          1000               76.75 ± 3.71    76.75 ± 4.07     264        39

Table 3.4: Performance of algorithms 1, 2, 4, 5 for ADNI.

3.6.4 Numerical comparison with popular models

In this section, we numerically compare the testing accuracy of BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS with a few benchmark models, which include logistic regression (LR) and support vector machine (SVM) as developed by [101, 87] and the frequentist artificial neural network (ANN) [20, 54]. We also compare with a Bayesian neural network model that uses stochastic gradient MCMC [137]. For all neural network models, viz. the artificial neural network (ANN) and the stochastic gradient MCMC Bayesian neural network (SG-MCMC), the number of nodes is fixed at k_n = 10 with a single hidden layer. Table 3.5 provides the training and testing accuracy and empirical standard errors for all methods under consideration.
For the four models BBVI, BBVI-CV, BBVI-RMS and BBVI-CV-RMS, the results reported correspond to the optimal parameter combination providing the best average test accuracy. Little to no difference was observed across different choices of the algorithm parameters (see Table 3.4). LR, SVM, ANN and SG-MCMC have considerably larger standard errors for testing accuracy. One might observe an improvement in the performance of the SG-MCMC Bayesian neural network by optimally choosing its tuning parameters. However, studying that is beyond the scope of this thesis, as it is a different methodology whose underlying statistical theory is not well established.

Classifier      Training accuracy (%)    Testing accuracy (%)
LR              82.1 ± 2.5               70.9 ± 5.5
SVM             80.3 ± 2.2               70.6 ± 5.5
ANN             82.0 ± 5.6               74.1 ± 6.8
SG-MCMC         80.8 ± 4.6               73.5 ± 5.9
BBVI            80.7 ± 2.1               76.9 ± 3.3
BBVI-CV         80.3 ± 2.3               76.8 ± 3.4
BBVI-RMS        81.2 ± 2.4               76.8 ± 3.3
BBVI-CV-RMS     82.8 ± 1.6               76.8 ± 4.1

Table 3.5: Performance of different classifiers. LR: logistic regression. SVM: support vector machine. ANN: frequentist artificial neural network. SG-MCMC: stochastic gradient MCMC Bayesian neural network.

3.7 Conclusion and Discussion

The theoretical rigor and computational detail of the variational Bayes neural network classifier presented in this chapter are a novel and unique contribution to the statistical literature. Although variational Bayes is popular in machine learning, neither the computational method nor the statistical properties are well understood for complex models such as neural networks. We characterize the prior distributions and the variational family for consistent Bayesian estimation. The theory also quantifies the loss due to the VB numerical approximation relative to the true posterior distribution.
For practical implementation, we reveal that the algorithm may not be as simple and straightforward as it sounds in the computer science literature; rather, it requires careful tuning of several parameters at various steps. Nevertheless, the computation can be considerably faster than the popular Markov chain Monte Carlo procedures for approximating posterior distributions. Although we build the framework on a multi-layer neural network model with a simple prior structure, the detailed statistical theory and computational methodology are quite involved. This investigation opens up the possibility of exploring a much wider class of models and priors. For example, shrinkage priors, such as the double exponential and horseshoe priors, can be explored for building sparse neural networks, or one can experiment with various other variational families. However, their computational details and associated statistical properties are not immediate. We hope this research will accelerate further development of the statistical and computational foundations for variational inference in general machine learning research.

CHAPTER 4

LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS

4.1 Introduction

Bayesian neural networks (BNNs) have achieved state-of-the-art results in a wide range of tasks, especially in high-dimensional data analysis, including image recognition, biomedical diagnosis and others. One of the major disadvantages of using neural networks and deep networks is that they require a huge amount of training data due to the large number of inherent parameters [140, 45]. Consequently, high-dimensional neural networks have been widely applied with regularization, dropout techniques or early stopping to prevent overfitting [118, 143].
Furthermore, the most commonly used dimension reduction techniques include the Lasso [17], Ridge [58], Elastic net [152], Sparse group lasso [116], Bayesian Lasso [98], the Horseshoe prior [16], and principal component analysis [115]. Even though the ℓ1 and ℓ2 norms can force the weights to become zero or small, they do not have the regularizing effect of making the computed function simpler [70]. Additionally, all these methods rely on the use of the whole data, which severely increases the cost of both computation and memory storage. In this chapter, we propose the use of a BNN on a compressed feature space to take care of the large p, small n problem by projecting the feature space onto a lower-dimensional space using a random projection matrix.

Random projection (RP) is a powerful dimension reduction technique which uses RP matrices to map data into low-dimensional spaces. The use of RP in high-dimensional statistics is motivated by the Johnson-Lindenstrauss lemma [27], which states that for x_1, ..., x_n ∈ R^p, ε ∈ (0, 1) and d > 8 log n / ε², there exists a linear map f: R^p → R^d such that

(1 − ε)‖x_i − x_j‖₂² ≤ ‖f(x_i) − f(x_j)‖₂² ≤ (1 + ε)‖x_i − x_j‖₂²,  for i, j = 1, ..., n.

The properties of RPs and their applications to statistical problems were further explored in [33, 13], etc. To reduce the sensitivity to the choice of random matrices, one must pool information obtained from multiple projections. In this chapter, we adopt a Bayesian model averaging approach for combining information across multiple instances of RP-based neural networks. There are two main challenges in implementing Bayesian model averaging: (1) due to the convoluted structure of the neural network likelihood, closed-form expressions do not exist for the posterior distribution under each model; (2) the posterior distribution of the model weights is completely intractable and no closed-form solutions exist.
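A quick numerical illustration of the Johnson-Lindenstrauss property (a sketch with a Gaussian projection scaled so squared norms are preserved in expectation; the dimension d follows the 8 log n / ε² bound quoted above):

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, eps = 50, 2000, 0.5
d = int(np.ceil(8 * np.log(n) / eps ** 2))      # projected dimension per the lemma
X = rng.normal(size=(n, p))
Gamma = rng.normal(size=(d, p)) / np.sqrt(d)    # E||Gamma x||^2 = ||x||^2
Z = X @ Gamma.T                                 # projected points in R^d

def pdist2(A):
    # Pairwise squared Euclidean distances.
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

mask = ~np.eye(n, dtype=bool)
ratio = pdist2(Z)[mask] / pdist2(X)[mask]
print(ratio.min(), ratio.max())   # pairwise distance ratios stay close to 1
```

Despite collapsing 2000 coordinates to about 126, every pairwise distance ratio remains in a narrow band around 1, which is what makes RP-based compression of the feature space viable.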
Thereby, the implementation of standard Markov chain Monte Carlo (MCMC) is next to impossible. Further, the computation and storage cost associated with an MCMC implementation is enormous, since each posterior model weight depends on the posterior model weights of the remaining models. To address the challenges of MCMC implementation, we use a variational inference (VI) [63, 9] approach to provide an approximate solution for Bayesian model averaging (BMA), allowing BNNs to be combined across multiple instances of compression of the feature space. There is a plethora of literature implementing variational inference in neural networks [10]. However, these implementations make use of the entire feature space, thereby putting a great burden on computational stability and memory storage. We address two main challenges in this thesis: (1) developing a variational Bayes (VB) solution for BNNs with a compressed feature space; (2) providing a VB solution for BMA across multiple instances of RP. Further, for a given instance of random compression, we establish the posterior contraction rates of the variational posterior for classification (the theory is extendable to the regression setup with minor modifications). In this direction, we provide a characterization of the prior, the variational posterior and the RP matrix which guarantees the convergence of the variational Bayes neural network (VBNN) under the compressed feature space to the true density of the observations. The main advantage of implementing a BMA approach is that it gives the posterior model weights under each compression of the feature space. The posterior model weights so obtained in turn induce a probability distribution on the projected dimension of the feature space. The mode of this probability distribution concentrates around the intrinsic dimensionality of the feature space.
To improve prediction performance, the BMA approach is then applied to a pool of RP matrices whose projected dimensions lie in a neighborhood of the intrinsic dimensionality. Finally, we study the numerical behavior of the proposed procedure in light of simulated and real data sets. To the best of our knowledge, there exists no literature which provides theoretical guarantees and a computational algorithm for VBNNs with a compressed feature space.

Feature reduction using projection matrices has long been studied in both supervised and unsupervised learning. [146] proposed semisupervised classification with graph construction using a projection matrix to preserve the local and global structure of the data. Beyond semisupervised learning, projection methods have been used in convolutional neural networks: [125] introduced an efficient convolutional neural network which can control how much context information is incorporated into each specific position using a word-embedding projection matrix. In unsupervised learning, [134] proposed an adaptive embedding method which combines the calculation of the projection matrix and the construction of the affinity graph. Early works on Bayesian neural networks (BNNs) have been comprehensively discussed by [85, 96, 71]. With advances in computational and information science, recent developments on more efficient BNNs can be found in [120, 93, 61, 64] and the references therein. However, as the dimension of the feature space increases, the prediction accuracy of BNNs is severely compromised. To circumvent this issue, penalization and sparse-network-based approaches have been studied by [80, 45, 140, 48, 3], etc. The major drawback of these sparsity-based methods is that one must work with the entire data set, which increases the implementation time manyfold.
With the work of [27], the idea of using RPs to overcome the curse of dimensionality became very popular. Further, RPs have been used in a wide range of statistical problems [1, 86, 84, 39, 55, 40, 49], etc. To ensemble information across projections, [14] uses a bagging approach for classification problems as in [12]. On the other hand, the works of [52] and [53] propose the use of BMA in the context of linear regression and Gaussian processes. There exists a plethora of literature implementing variational inference [9] to overcome the drawbacks of a full MCMC implementation. The majority of black-box variational methods for Bayesian learning of neural networks are based on the pathwise gradient estimator [41, 83, 15, 11, 121], which is computed using the reparameterization trick [106]. Another line of black-box variational extensions is based on the score-function estimator, which uses a Monte Carlo estimate of the full gradient, including control variates [107] and stochastic search [97]. Theoretical properties of the variational posterior in the context of individual models have been studied in [6, 135, 100, 149, 3]. The works of [65] and [72] explore variational inference for BMA in the context of generalized linear models and graphs on functions, respectively. To the best of our knowledge, BMA in the context of Bayesian neural networks with a compressed feature space remains unexplored.

First, we introduce the RP idea in a neural network predictive model where the feature space grows exponentially with the training sample size, which in turn significantly reduces the computational complexity and storage requirements associated with BNNs. Second, we apply the BMA idea in conjunction with VB to allow for parallelization across RPs without compromising the uncertainty quantification of a Bayesian approach. Third, we develop the associated statistical foundation, namely the posterior contraction of the variational posterior for BNNs under a compressed feature space.
The theory not only lends trustworthiness to our model; the results also provide theoretical guidelines for prior selection and the choice of the variational family of distributions. Fourth, we innovatively apply the learned posterior model weights to obtain the intrinsic dimensionality of the feature space. Fifth, to improve predictive accuracy, we employ VB with BMA on a subspace of RPs with projected dimension centered around the intrinsic dimensionality. Sixth, we provide numerical results demonstrating that our proposed approach learns the intrinsic dimension of the feature space well and beats the predictive performance of all competing methods for large p, small n problems. Lastly, the performance of the proposed methodology is demonstrated on real data sets such as ADNI and MNIST.

4.2 Bayesian neural network for random projection based compressed feature space

4.2.1 Bayesian neural network model

For a binary random variable Y, representing the class labels 0 or 1, and a feature vector X ∈ R^p with some marginal distribution P_X, consider the classification problem

P(Y = 1 | X = x) = σ(η₀(x)) = 1 − P(Y = 0 | X = x)   (4.1)

where η₀(·): R^p → R is some continuous function and σ(u) = e^u / (1 + e^u) is the sigmoid function. Following [14] and [141], the test error of a classifier C is given by

R(C) = ∫_{R^p × {0,1}} 1{C(X) ≠ Y} dP_{X,Y}   (4.2)

where the joint density P_{X,Y} is the product of (4.1) and P_X. The Bayes classifier is then

C_Bayes(x) = 1 if σ(η₀(x)) ≥ 1/2, and 0 otherwise.   (4.3)

Since η₀(x) is unknown, we use a single-layer neural network approximation with k nodes:

η_θ(x) = β₀ + Σ_{j=1}^{k} β_j ψ(γ_{j0} + γ_j⊤ x) = β₀ + β⊤ ψ(γ₀ + Λ⊤ x)   (4.4)

where β = [β₁, ..., β_k], γ₀ = [γ₁₀, ..., γ_k0], Λ = [γ₁, ..., γ_k] and θ = (β₀, β, γ₀, vec(Λ)) is the set of all parameters. Note that θ is a K × 1 vector, where K = 1 + k + k(p + 1). Both k and p (p ≫ n) grow as functions of n.
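A minimal sketch of the single-layer logit in (4.4), taking ψ to be the sigmoid activation for illustration (the weights below are arbitrary, not fitted, and the name `Lam` stands for the matrix collecting the hidden-layer weight vectors):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def eta(x, beta0, beta, gamma0, Lam):
    # eta_theta(x) = beta0 + beta^T psi(gamma0 + Lam^T x), with psi taken
    # to be the sigmoid; Lam (p x k) collects [gamma_1, ..., gamma_k].
    return beta0 + beta @ sigmoid(gamma0 + Lam.T @ x)

p, k = 20, 5                                  # features and hidden nodes
x = rng.normal(size=p)
beta0, beta = 0.1, rng.normal(size=k)
gamma0, Lam = rng.normal(size=k), rng.normal(size=(p, k))

logit = eta(x, beta0, beta, gamma0, Lam)
prob = sigmoid(logit)                         # P(Y = 1 | x) as in the model
print(0.0 < prob < 1.0)                       # True
```

The parameter count matches the text: 1 (bias) + k (output weights) + k (hidden biases) + kp (hidden weights) = 1 + k + k(p + 1).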
We then use the following model for the problem in (4.1):

P(Y = 1 | X = x) = σ(η_θ(x)) = 1 − P(Y = 0 | X = x)   (4.5)

4.2.2 Compression of the feature space with random projections

There exist several choices for compressing the feature space X using RP matrices, such as those proposed in [33, 13, 14, 27, 26], etc. For a given choice of the compression matrix Γ, we consider a single-layer neural network with k nodes for the input vector x as

η_θ(Γx) = β₀ + β⊤ ψ(γ₀ + Λ⊤(Γx))   (4.6)

where Γ is a d × p projection matrix, β and γ₀ are k × 1 vectors, and Λ = [γ₁, ..., γ_k] is now a d × k matrix. Thus, in the projected space the number of parameters reduces from K = kp + 2k + 1 to K_Γ = kd + 2k + 1. This leads to the following model under the projected space:

P(Y = 1 | X = x) = σ(η_θ(Γx)) = 1 − P(Y = 0 | X = x)   (4.7)

Experiments with different projection matrices suggested the use of the one in [52]. In this method, we draw the elements Γ_ij independently, setting Γ_ij = 1/√a* with probability a*², 0 with probability 2a*(1 − a*), and −1/√a* with probability (1 − a*)², with the rows of Γ then normalized using Gram-Schmidt orthogonalization. The parameter a* ∈ (0.1, 1) provides a handle on the sparsity of the projection matrix. We do not rely on the data to generate Γ. Also, the algorithmic implementation discussed here can be generalized to any arbitrary class of projection matrices.

4.2.3 Prior choice

For the neural network η_θ(Γx) based on the projected input Γx, we assume an independent Gaussian prior on each entry of θ_Γ, i.e., p(θ_Γ | M_Γ) = MVN(μ_Γ, diag(σ_Γ)), where diag(σ_Γ) is a diagonal matrix. With this choice of prior and the likelihood in (4.7), the posterior distribution based on the compressed data set (y_i, Γx_i)_{i=1}^n is given by

π(θ_Γ | M_Γ) = L(θ_Γ | M_Γ) p(θ_Γ | M_Γ) / ∫ L(θ_Γ | M_Γ) p(θ_Γ | M_Γ) dθ_Γ   (4.8)

where M_Γ is the model induced by the random matrix Γ, with corresponding likelihood L(θ_Γ | M_Γ) = Π_{i=1}^n exp(y_i η_θ(Γx_i) − log(1 + exp(η_θ(Γx_i)))). The denominator in (4.8) is free of θ_Γ.
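The sparse projection matrix just described can be generated as follows (a sketch; a QR factorization stands in for classical Gram-Schmidt to orthonormalize the rows):

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_rp(d, p, a=0.3):
    # Entries are +1/sqrt(a) w.p. a^2, 0 w.p. 2a(1-a), -1/sqrt(a) w.p.
    # (1-a)^2; the rows are then orthonormalized (Gram-Schmidt via QR).
    vals = np.array([1 / np.sqrt(a), 0.0, -1 / np.sqrt(a)])
    probs = np.array([a ** 2, 2 * a * (1 - a), (1 - a) ** 2])
    G = rng.choice(vals, size=(d, p), p=probs)
    Q, _ = np.linalg.qr(G.T)       # QR on G^T orthonormalizes the rows of G
    return Q.T

Gamma = sparse_rp(10, 200, a=0.3)
print(np.allclose(Gamma @ Gamma.T, np.eye(10)))   # True: orthonormal rows
```

Smaller values of a make the raw matrix sparser before orthonormalization; the data never enter the construction, so the same Γ can be reused across models.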
4.3 Variational Bayes model averaging for pooling multiple instances of random projection

4.3.1 Bayesian model averaging

Ensemble learning methods are widely used in the machine learning literature to pool across varying classifiers to solve a given problem [32]. In this section, we address the same problem from a Bayesian perspective. Let A denote a subspace of the space of all random matrices. We assume that each RP matrix Γ induces a separate model M_Γ, Γ ∈ A, on the data D = (y_i, x_i)_{i=1}^n. Thus, the predictive distribution of a new observation y_{n+1} given x_{n+1} is

p(y_{n+1} | x_{n+1}, D) = ∫ p(y_{n+1} | x_{n+1}, M_Γ, θ_Γ, D) π(M_Γ, θ_Γ | D) dμ(M_Γ, θ_Γ)   (4.9)

where μ is the product of counting and Lebesgue measures. Note that, in the implementation of (4.9), the most difficult quantity to compute is π(M_Γ, θ_Γ | D). In [52], explicit forms could be obtained for the linear regression model, something which is next to impossible for the convoluted neural network structure. In the next section, we circumvent this issue using variational inference.

4.3.2 ELBO derivation

Let π(M_Γ, θ_Γ | D) denote the joint density of the parameters and the model conditional on the data. We posit a variational distribution q(M_Γ, θ_Γ) of the form q(θ_Γ | M_Γ) ~ MVN(m_Γ, diag(s_Γ)), where diag(s_Γ) is a diagonal matrix, and q(M_Γ) are the weights of the individual models. Thus, our variational family may be expressed as

Q_n = { q(M_Γ, θ_Γ) = q(M_Γ) q(θ_Γ | M_Γ) = q(M_Γ) (2π)^{-K_Γ/2} |diag(s_Γ)|^{-1/2} exp( -(θ_Γ − m_Γ)⊤ (diag(s_Γ))^{-1} (θ_Γ − m_Γ) / 2 ) }

The optimal variational distribution minimizes the Kullback-Leibler divergence between π(· | D) and the variational family Q_n. Thus, q* = argmin_{q ∈ Q_n} d_KL(q, π(· | D)), where

d_KL(q, π(· | D)) = E_Q(log q(θ_Γ, M_Γ) − log π(θ_Γ, M_Γ | D)) = log π(D) − ELBO

with ELBO = E_Q(log π(θ_Γ, M_Γ, D) − log q(θ_Γ, M_Γ)). Since log π(D) is independent of θ_Γ and M_Γ, we have q*(M_Γ, θ_Γ) = argmax_{q ∈ Q_n} ELBO.
We next simplify the ELBO as

E_Q(log π(θ_Γ, M_Γ, D) − log q(θ_Γ, M_Γ))
  = Σ_{Γ∈A} q(M_Γ) E_{q(θ_Γ|M_Γ)}( log π(D | M_Γ, θ_Γ) + log π(θ_Γ | M_Γ) + log π(M_Γ) − log q(θ_Γ | M_Γ) − log q(M_Γ) )
  = Σ_{Γ∈A} q(M_Γ) E_{q(θ_Γ|M_Γ)}( log L(θ_Γ | M_Γ) + log p(θ_Γ | M_Γ) + log π(M_Γ) − log q(θ_Γ | M_Γ) − log q(M_Γ) )
  = Σ_{Γ∈A} q(M_Γ)( L(· | M_Γ) + log π(M_Γ) − log q(M_Γ) )

where L(· | M_Γ) = E_{q(θ_Γ|M_Γ)}( log L(θ_Γ | M_Γ) + log p(θ_Γ | M_Γ) − log q(θ_Γ | M_Γ) ) is nothing but the ELBO under the model M_Γ. Note that the derivative of the ELBO with respect to the variational parameters m_Γ, s_Γ is given by

∇_{m_Γ, s_Γ} ELBO = q(M_Γ) ∇_{m_Γ, s_Γ} L(· | M_Γ)

Since q(M_Γ) is just a constant, the gradient update for the model-specific variational parameters is nothing but the gradient update from each individual model. Also, equating the derivative of the ELBO with respect to q(M_Γ) to zero, we get

∇_{q(M_Γ)} ELBO = 0 ⟹ log π(M_Γ) − log q(M_Γ) + L(· | M_Γ) − 1 = 0 ⟹ q(M_Γ) ∝ exp(log π(M_Γ) + L(· | M_Γ))

Thus, the optimal model weights are q*(M_Γ) = exp(log π(M_Γ) + L*(· | M_Γ)) / Σ_{Γ'∈A} exp(log π(M_Γ') + L*(· | M_Γ')), where L*(· | M_Γ) is the optimal ELBO under model M_Γ. Note that the main advantage of the above derivation is that the models can be trained individually in a parallel fashion, and the final model weights depend only on the final ELBO values from each model.

4.4 Intrinsic dimensionality and prediction

4.4.1 Optimal dimension neighborhood selection

Let d_Γ × p denote the dimension of an RP matrix Γ ∈ A. Using Section 4.3, one can obtain the posterior model weights q*(M_Γ). The values of d_Γ with the largest posterior model weights q*(M_Γ) tend to concentrate around the optimal dimension of the feature space. Let d₁ < d₂ < ⋯ be an enumeration of the unique values of d_Γ, Γ ∈ A. Define the average posterior probability of each dimension value d_i as

q_i* = (1/|A_i|) Σ_{Γ∈A_i} q*(M_Γ)

where A_i = {Γ ∈ A : d_Γ = d_i}. The plot of (i, q_i*) attains its peak around the optimal dimension of the feature space for prediction of the response.
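Putting the two pieces above together, a small sketch (the ELBO values and projected dimensions below are illustrative only) of computing the optimal model weights q*(M_Γ) via a numerically stable softmax and then averaging them over projections that share the same projected dimension:

```python
import numpy as np

def model_weights(elbos, log_prior):
    # q*(M) proportional to exp(log prior + optimal ELBO); subtract the
    # max logit before exponentiating (log-sum-exp trick) for stability.
    logits = np.asarray(elbos) + np.asarray(log_prior)
    logits = logits - logits.max()
    w = np.exp(logits)
    return w / w.sum()

# Illustrative final ELBOs for eight projections with projected
# dimensions d in {5, 6, 7, 8}, under a uniform model prior.
dims = np.array([5, 5, 6, 6, 7, 7, 8, 8])
elbos = np.array([-1540.2, -1538.9, -1531.5, -1530.8,
                  -1528.3, -1529.1, -1533.7, -1534.4])
w = model_weights(elbos, np.log(np.full(8, 1 / 8)))

# Average the weights over projections sharing the same dimension;
# the peak of (d_i, q_i*) estimates the intrinsic dimensionality.
uniq = np.unique(dims)
q_bar = np.array([w[dims == d].mean() for d in uniq])
d_star = uniq[q_bar.argmax()]
print(d_star)   # modal projected dimension
```

Because raw ELBOs are large negative numbers, exponentiating them directly would underflow; shifting by the maximum leaves the normalized weights unchanged.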
Let d* = argmax_i q_i*; then, for some ν₁, ν₂ > 0,

I_{d*} = [⌊d*(1 − ν₁)⌋, ⌈d*(1 + ν₂)⌉]   (4.10)

is the optimal dimension neighborhood used for the final classification task. Finally, let A_{d*} be the subspace of RP matrices of dimension d_Γ × p with d_Γ ∈ I_{d*}.

4.4.2 Classification based on the optimal neighborhood choice

Using Section 4.3, obtain the variational distribution q*(M_Γ, θ_Γ) for every Γ ∈ A_{d*}. Let η̂_Γ = ∫ η_θ(Γx_{n+1}) q*(θ_Γ | M_Γ) dθ_Γ be the variational Bayes estimator under model M_Γ. Define

η̂(x_{n+1}) = Σ_{Γ∈A_{d*}} q*(M_Γ) η̂_Γ(x_{n+1})   (4.11)

Based on η̂(x_{n+1}), define the classification rule as

ŷ_{n+1} = 1[η̂(x_{n+1}) ≥ 0]   (4.12)

Remark: The proposed estimator η̂(x_{n+1}) is not the exact variational Bayes estimator of η(x_{n+1}) = log(P(y_{n+1} = 1 | x_{n+1}) / P(y_{n+1} = 0 | x_{n+1})). However, it is a good enough approximation for a sufficiently large training size, and it is computationally much faster, especially when the number of models and test samples is large.

4.5 Algorithm and its implementation

Algorithm 3 RPVBNN

1. Initialization: (m_Γ⁰, s_Γ⁰, ρ_Γ^{t})_{Γ∈A}, where ρ_Γ^{t}, t ≥ 0, is the step-size sequence for model M_Γ.
2. Parallelization:
   a) Set t = 1.
   b) For Γ ∈ A, calculate the gradient of L̂(· | M_Γ) in (4.14) with respect to m_Γ and s_Γ.
   c) Update the parameters m_Γ^t and s_Γ^t as
      m_Γ^{t+1} = m_Γ^t + ρ_Γ^t ∇_{m_Γ} L̂(· | M_Γ)|_{m_Γ = m_Γ^t}
      s_Γ^{t+1} = s_Γ^t + ρ_Γ^t ∇_{s_Γ} L̂(· | M_Γ)|_{s_Γ = s_Γ^t}
   d) Set t = t + 1.
   e) Repeat steps (b)-(d) till convergence.
3. Model averaging:
   a) For the optimized values (m_Γ*, s_Γ*)_{Γ∈A}, compute (L̂*(· | M_Γ))_{Γ∈A} using (4.14).
   b) Compute the model weights
      q*(M_Γ) = exp(log π(M_Γ) + L̂*(· | M_Γ)) / Σ_{Γ'∈A} exp(log π(M_Γ') + L̂*(· | M_Γ'))
4. Optimal neighborhood selection: Using the values (q*(M_Γ))_{Γ∈A}, compute
   a) the optimal neighborhood I_{d*} as in (4.10);
   b) the subspace A_{d*} based on I_{d*} (see Section 4.4.1).
5. Classification:
   a) Repeat steps (1)-(3) for Γ ∈ A_{d*}.
   b) Compute η̂(x_{n+1}) and ŷ_{n+1} using relations (4.11) and (4.12), respectively.

Gradient update equations.
For q(θ_Γ | M_Γ) = Π_{j=1}^{K_Γ} (2π s_j²)^{-1/2} exp(−(θ_j − m_j)²/(2 s_j²)) and p(θ_Γ | M_Γ) = Π_{j=1}^{K_Γ} (2π σ_j²)^{-1/2} exp(−(θ_j − μ_j)²/(2σ_j²)), the ELBO is

L(· | M_Γ) = E_{q(θ_Γ|M_Γ)}(log L(θ_Γ | M_Γ)) − d_KL(q(· | M_Γ), p(· | M_Γ))

where d_KL(q(· | M_Γ), p(· | M_Γ)) is given by

Σ_{j=1}^{K_Γ} ( log(σ_j / s_j) + s_j²/(2σ_j²) + (m_j − μ_j)²/(2σ_j²) − 1/2 )   (4.13)

Since E_Q(log L(θ_Γ | M_Γ)) cannot be computed explicitly, we generate W samples θ_Γ[1], ..., θ_Γ[W] from q(θ_Γ | M_Γ) and compute Ê(log L(· | M_Γ)) = (1/W) Σ_{w=1}^{W} log L(θ_Γ[w] | M_Γ). Thus, the final objective function, optimized with respect to the parameters m_Γ and s_Γ, is given by

L̂(· | M_Γ) = Ê(log L(· | M_Γ)) − d_KL(q(· | M_Γ), p(· | M_Γ))   (4.14)

Remark: Since the variance parameter s_j must always be positive, we consider the reparametrization s_j = log(1 + e^{s̃_j}) and update the parameters s̃_j instead, where ∇_{s̃_j} L̂(· | M_Γ) = (e^{s̃_j}/(1 + e^{s̃_j})) ∇_{s_j} L̂(· | M_Γ)|_{s_j = log(1+e^{s̃_j})}, with ∇_{s_j} L̂(· | M_Γ)|_{s_j = log(1+e^{s̃_j})} the derivative of L̂(· | M_Γ) with respect to s_j evaluated at s_j = log(1 + e^{s̃_j}).

4.6 Theoretical results

In this section, we study the convergence properties of the variational posterior for a given projection matrix Γ (without model averaging). The results presented here are similar in spirit to the notion of posterior consistency in [52]. Let f₀(y, x) and f_θ(y, x) be the joint densities of the data D = (y_i, x_i)_{i=1}^n under the truth and under the model, respectively. Without loss of generality, we assume x_i ~ U[0, 1]^p, which implies f₀(x) = f_θ(x) = 1. This implies that the joint distribution of (y_i, x_i)_{i=1}^n depends only on the conditional distribution of Y | X = x. Thus, under the model indexed by the projection matrix Γ,

f_θ^Γ(y, x) = f_θ(y | x) f_θ(x) = ℓ_θ^Γ(y, x),  f₀(y, x) = f₀(y | x) f₀(x) = ℓ₀(y, x)   (4.15)

where ℓ_θ^Γ(y, x) = exp(y η_θ(Γx) − log(1 + exp(η_θ(Γx)))) and ℓ₀(y, x) = exp(y η₀(x) − log(1 + exp(η₀(x)))) are defined respectively.
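The closed-form KL term (4.13) and the softplus reparameterization from the remark in Section 4.5 can be sketched as follows (a minimal illustration with arbitrary values; the convention N(m, diag(s²)) is used in the comments):

```python
import numpy as np

def kl_diag_gauss(m, s, mu, sigma):
    # KL(q || p) for q = N(m, diag(s^2)) and p = N(mu, diag(sigma^2)):
    # sum_j [ log(sigma_j/s_j) + (s_j^2 + (m_j - mu_j)^2)/(2 sigma_j^2) - 1/2 ]
    m, s, mu, sigma = map(np.asarray, (m, s, mu, sigma))
    return np.sum(np.log(sigma / s)
                  + (s ** 2 + (m - mu) ** 2) / (2 * sigma ** 2) - 0.5)

# Softplus reparameterization keeps the variational scales positive:
# s_j = log(1 + exp(s_tilde_j)) with s_tilde_j unconstrained.
softplus = lambda x: np.log1p(np.exp(x))
s = softplus(np.array([-2.0, 0.0, 3.0]))

print(kl_diag_gauss(np.zeros(3), s, np.zeros(3), np.ones(3)))  # nonnegative
print(kl_diag_gauss([0.0], [1.0], [0.0], [1.0]))               # 0.0 when q = p
```

Because the KL term is available in closed form, only the expected log-likelihood needs Monte Carlo samples, which keeps the variance of the overall gradient of (4.14) lower than a fully sampled objective.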
We next define the Hellinger neighborhood of the true density function $f_0 = \ell_0$ as $\mathcal{U}_\varepsilon = \{\theta : d_H(\ell_0, \ell_\theta) > \varepsilon\}$, where
\[
2 d_H^2(\ell_0, \ell_\theta) = \int_x \sum_y \left( \sqrt{\ell_0(y, x)} - \sqrt{\ell_\theta(y, x)} \right)^2 dx. \tag{4.16}
\]
We next give the set of conditions which ensure that the variational posterior for a given projection matrix $\Gamma$ is consistent for the true density function $f_0$. Recall that $p$ is the total number of covariates and $k$ is the number of nodes. If the dimension of $\Gamma$ is $d \times p$, then the total number of parameters in the model indexed by $\Gamma$ is $K_\Gamma = 1 + k + k(d + 1)$, i.e. $\theta_\Gamma$ is a $K_\Gamma \times 1$ vector.

Let $\eta^*_\theta(x) = \beta_0^* + \sum_{j=1}^k \beta_j^* \psi(\gamma_j^{*\top} \Gamma x)$ be the neural network which can approximate the true function $\eta_0(x)$ in the $L_1$ norm. The existence of such a neural network is guaranteed by [60]. Suppose $\Gamma$ is an orthonormal projection matrix, the prior is $p(\theta_\Gamma) = \mathrm{MVN}(\boldsymbol{\mu}_\Gamma, \mathrm{diag}(\boldsymbol{\sigma}_\Gamma^2))$, $n\epsilon_n^2 \to \infty$, and the following conditions hold:

1. (C1): $k \log n = o(n\epsilon_n^2)$, $p = o(e^{n\epsilon_n^2})$.
2. (C2): $\|\boldsymbol{\mu}_\beta\|_2^2 = o(n\epsilon_n^2)$, $\log \|\boldsymbol{\sigma}_\beta\|_\infty = O(\log n)$, $\|\boldsymbol{\sigma}_\beta^{-1}\|_\infty = O(1)$, $\sum_{j=1}^k \|\Gamma^\top \boldsymbol{\mu}_{j\gamma}\|_1 = O(1)$, $\sup_{j=1,\cdots,k} \log \|\boldsymbol{\sigma}_{j\gamma}\|_\infty = O(\log n)$, $\sup_{j=1,\cdots,k} \|\boldsymbol{\sigma}_{j\gamma}^{-1}\|_\infty = O(1)$.
3. (C3): $\|\eta_0 - \eta^*_\theta\|_1 = o(\epsilon_n^2)$, $\|\boldsymbol{\beta}^*\|_2^2 = o(n\epsilon_n^2)$, $\sup_{j=1,\cdots,k} \|(I - \Gamma^\top\Gamma)\gamma_j^*\|_1 = o(n^{-1})$, $\sum_{j=1}^k \|\gamma_j^*\|_1 = O(1)$.
4. (C4): $\log \|\boldsymbol{\sigma}_{\Gamma x}\|_\infty = O(\log n)$, $1/\|\boldsymbol{\sigma}_{\Gamma x}\|_\infty = o(n\epsilon_n^2)$.

Condition (C1) gives restrictions on the number of effective parameters ($\sim kd$) and the true number of covariates ($\sim p$). Condition (C2) puts restrictions on the growth of the prior parameters. Note, although the condition $\sum_{j=1}^k \|\Gamma^\top \boldsymbol{\mu}_{j\gamma}\|_1 = O(1)$ seems to depend on the matrix $\Gamma$, it can be easily ensured by setting $\boldsymbol{\mu}_{j\gamma} = 0$. Condition (C3) quantifies how fast the neural network solution converges to the true function while keeping the magnitude of its coefficients under control. Although the condition $\sup_{j=1,\cdots,k} \|(I - \Gamma^\top\Gamma)\gamma_j^*\|_1 = o(n^{-1})$ is restrictive, it holds for any $\gamma_j^*$ in the column space of the projection matrix $\Gamma$.
Condition (C4) for projection matrices relates to condition (iii) in Theorem 3.1 of [52]. For the posterior in (4.8), let the variational posterior be
\[
q^* = \arg\max_{q \in \mathcal{Q}_n} \mathrm{ELBO}\big(q, \pi(\cdot|\mathcal{D}, M_\Gamma)\big)
\]
where $\mathcal{Q}_n = \{q(\theta) = \mathrm{MVN}(m_\Gamma, \mathrm{diag}(s_\Gamma^2))\}$. For a fixed $\Gamma$, one can obtain $q^*$ by following step 2 of Algorithm 3.

Theorem: Suppose $n\epsilon_n^2 \to \infty$ and conditions (C1)-(C4) hold. Then, for any $\varepsilon > 0$,
\[
P_0^n\, q^*\big(\mathcal{U}_{\varepsilon\epsilon_n^2}\big) \to 0
\]
where $P_0^n$ is the joint distribution of $(y_i, x_i)_{i=1}^n$ under (4.1). The proof has been presented in the supplement section. The above result shows that the variational posterior $q^*$ concentrates around shrinking Hellinger neighborhoods of the true density $f_0$ with overwhelming probability.

4.7 Numerical Study

4.7.1 Problem setup

We mimic the RP generation mechanism as in Section 4.2.2. We fix the value of $a^*$ at 0.3 for this whole section. We also experimented with the RP mechanism in [14], where $\Gamma$ is taken to be the matrix of left singular vectors in the decomposition of $\tilde{\Gamma}$, where $\tilde{\Gamma}$ has all entries drawn from the $N(0, 1)$ distribution. However, we omit the results since no significant improvement was observed. Further, Algorithm 3 is also sensitive to the choice of the learning rate $\rho$, the number of projections and the batch size. In this thesis, we do not explore the sensitivity with respect to these parameters due to their small impact. For the number of projection matrices, we followed the 2-power rule: if the range of the projected dimension $d$ is $[p_1, p_2]$, the number of projections is chosen as $N = \min\{2^i : 2^i/(p_2 - p_1) > u\}$. This ensures that approximately $u$ values of $d$ are chosen from any unit sub-interval of $[p_1, p_2]$. We employ parallel programming across the different projections to reduce computational time. We first learn the optimal dimension of the data and then use it to improve the predictive accuracy.
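The optimal-neighborhood rule (4.10), the model-averaged classifier (4.11)-(4.12), and the 2-power rule described above can be sketched as follows (a minimal sketch: the function names, the dictionary-based interfaces, and the parameter `u` in the 2-power rule are my own notation, not the thesis implementation):

```python
import math

def optimal_neighborhood(weight_by_dim, a1=0.3, a2=0.3):
    # Mode of the (average) posterior weight over projected dimensions, then the
    # interval I_{d*} = [floor(d*(1-a1)), ceil(d*(1+a2))] as in (4.10).
    d_star = max(weight_by_dim, key=weight_by_dim.get)
    return d_star, (math.floor(d_star * (1 - a1)), math.ceil(d_star * (1 + a2)))

def model_averaged_label(model_weights, model_logits):
    # eta_hat(x) = sum_Gamma q*(M_Gamma) * eta_hat_Gamma(x), then y_hat = 1[eta_hat >= 0],
    # as in (4.11)-(4.12); model_logits holds each model's variational-mean logit.
    eta = sum(model_weights[g] * model_logits[g] for g in model_weights)
    return eta, int(eta >= 0)

def num_projections(p1, p2, u=2):
    # "2-power rule": the smallest power of two giving roughly u candidate
    # dimensions per unit sub-interval of [p1, p2] (u is an assumed density knob).
    i = 0
    while 2**i / (p2 - p1) <= u:
        i += 1
    return 2**i
```

With a peaked weight profile such as `{3: 0.1, 5: 0.6, 7: 0.2}` and the default $a_1 = a_2 = 0.3$, the neighborhood comes out as $(3, 7)$, matching the small-simulated-data example reported later.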
We analyze the performance of the algorithm on four data sets: two simulated data sets generated using a non-linear function of the input features, and two real data sets obtained from neuroimaging and computer vision studies.

4.7.2 Datasets

We consider four data sets. The details of each are summarized in Table 4.1. In the first two cases, we use the non-linear system of [80] to generate observations with varying numbers of features. For these two data sets, the intrinsic dimensionality of the feature space is defined by the number of active variables used in the data generation. We employ our algorithm to check whether it can recover the intrinsic dimensionality of the feature space and provide desirable classification accuracy. For the remaining two data sets, the intrinsic dimensionality is unknown. In the third case, we use the ADNI data set from neuroimaging studies. In the last case, we use the MNIST data set from computer vision studies. The proposed algorithm allows us to learn the intrinsic dimensionality of the feature space in both the neuroimaging and computer vision applications. For implementation, all input variables are z-normalized. We first employ RPVBNN to learn the intrinsic dimensionality of the data set. With knowledge of the optimal dimensionality, we compare RPVBNN with several traditional algorithms (logistic regression, random forest and gradient boosting) and with VBNN, the standard variational Bayes neural network based on the whole feature space. For the simulated data sets and ADNI, to prevent over-fitting and optimistically biased estimates of model performance, we consider 10 different splits of the data into train and test sets. We report the mean and standard deviation of the train and test accuracy and AUC score over the 10 splits. For the MNIST data, since author-defined splits exist [73], we only report the train and test accuracy and algorithm run time.
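The evaluation protocol just described — z-normalization of the inputs, then repeated 7:3 train/test splits with the mean and standard deviation of test accuracy reported — can be sketched as follows (a minimal sketch; the `fit_predict` argument is a hypothetical stand-in for any of the classifiers compared here):

```python
import numpy as np

def z_normalize(X):
    # Standardize each input variable (the z-normalization applied before fitting).
    return (X - X.mean(axis=0)) / X.std(axis=0)

def repeated_split_eval(X, y, fit_predict, n_splits=10, train_frac=0.7, seed=0):
    # Mean and standard deviation of test accuracy over repeated random 7:3 splits.
    # `fit_predict` maps (X_train, y_train, X_test) -> predicted labels.
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        n_tr = int(train_frac * len(y))
        tr, te = idx[:n_tr], idx[n_tr:]
        yhat = fit_predict(X[tr], y[tr], X[te])
        accs.append(np.mean(yhat == y[te]))
    return float(np.mean(accs)), float(np.std(accs))
```

Reporting both the mean and the spread over splits guards against the optimistic bias of a single lucky partition.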
4.7.3 Simulated data

We generate two simulated data sets from
\[
y = \begin{cases} 1, & \text{if } e^{x_1} + x_2^2 + 5\sin(x_3 x_4) - 3 > 0 \\ 0, & \text{otherwise} \end{cases} \tag{4.17}
\]
Since the number of active variables is 4, the intrinsic dimensionality of the feature space is 4. To test whether RPVBNN can capture the intrinsic dimensionality, we consider two simulated data examples: 1) small simulated data and 2) large simulated data. For the small simulated data, we work with a smaller number of covariates ($p = 20$); for the large simulated data, a larger number of covariates is used ($p = 200$). Note, for both data sets $n = 3000$ observations are generated from (4.17), with $x \sim \mathrm{MVN}(0, \Sigma)$, where $\sigma_{ii} = 0.5$ and $\sigma_{ij} = 0.25$, and $x$ has dimension $p = 20$ and $p = 200$ under the small and large data sets respectively. The large simulated data exemplifies the small-$n$-large-$p$ problem. For the 10 cross-validation splits, the ratio of observations in the training and test sets is 7:3; thus the train and test data sets have 2100 and 900 subjects respectively.

4.7.4 ADNI Data

We utilized the data provided by the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://www.loni.ucla.edu/ADNI). ADNI is an ongoing joint public-private effort to utilize neuroimaging, other biological markers, and clinical and neuropsychological assessment to measure the incidence and progression of MCI to early AD. The data used consisted of 819 subjects with baseline characteristics, genetics and a diagnosis of MCI. For consistency, we only consider patients whose follow-up period was at least 36 months and who had no missing values. The final sample included $n = 265$ subjects, comprising participants who were stable in their diagnosis (MCI-S) and those who converted to a diagnosis of AD over 3 years (MCI-C). We used $p = 277$ variables, which included diagnosis, neuropsychological test scores, the epsilon-4 allele of the apolipoprotein E (APOE) gene, and region-of-interest (ROI) level features derived from T1 magnetic resonance imaging (MRI).
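The simulated-data mechanism (4.17) of Section 4.7.3 can be sketched as follows (a minimal sketch; the seed, function name and generator call are my own choices):

```python
import numpy as np

def make_sim_data(n=3000, p=20, seed=0):
    # Simulated data per (4.17): y = 1 if exp(x1) + x2^2 + 5*sin(x3*x4) - 3 > 0, else 0,
    # with x ~ MVN(0, Sigma), sigma_ii = 0.5 and sigma_ij = 0.25 (i != j). Only the
    # first four covariates are active, so the intrinsic dimensionality is 4.
    rng = np.random.default_rng(seed)
    Sigma = np.full((p, p), 0.25) + 0.25 * np.eye(p)   # diagonal 0.5, off-diagonal 0.25
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = (np.exp(X[:, 0]) + X[:, 1]**2 + 5 * np.sin(X[:, 2] * X[:, 3]) - 3 > 0).astype(int)
    return X, y
```

Setting `p=20` gives the small simulated data and `p=200` the large one; the remaining $p - 4$ covariates are pure noise with respect to the response.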
Analogous to the simulated examples, the ratio of subjects in training and testing was 7:3 for the 10 splits.

Table 4.1: Summary of data, where n, p and c denote the numbers of samples, features and classes.
Data                   Source  n      p    c
Small simulated data   [80]    3000   20   2
Large simulated data   [80]    3000   200  2
ADNI                   [62]    264    278  2
MNIST                  [73]    70000  784  10

4.7.5 MNIST Data

In addition to ADNI, we evaluate model performance on a computer vision data set, MNIST. The MNIST data set is a large collection of handwritten digits derived from the National Institute of Standards and Technology (NIST). The MNIST data set contains $n = 70000$ images, with 60000 and 10000 in the train and test sets respectively, and a feature space of dimension $p = 784$ [73].

4.8 Results

4.8.1 Optimal dimensional region

Figure 4.1: Small simulated data: $N = 32$

Using Section 4.4.1, we obtain the average posterior probabilities of the projected dimensions. For all four data sets, we employ RPVBNN with $k = 32$ nodes to learn the intrinsic dimension. Since obtaining the optimal dimension is a preprocessing step, we avoided experimentation with the number of nodes in this step. Figures 4.1, 4.2, 4.3 and 4.4 give the average probability density curve as a function of the projected dimension for the small and large simulated data sets, ADNI and MNIST respectively.

Figure 4.2: Large simulated data: $N = 128$

Figure 4.3: ADNI data: $N = 128$

The intrinsic dimensionality estimate corresponds to the mode of this density curve, while the optimal dimension neighborhood is a small interval around this posterior mode. For the small simulated data, Figure 4.1 shows a dramatic growth in the average posterior probability as the number of projected dimensions increases from 3 to 5, with a significant drop when the number of projected features reaches 7, followed by stabilization after 8. Thus, with a peak around 5, the optimal dimension neighborhood for the small data is taken to be (3,7).
Figure 4.4: MNIST data: $N = 128$

For the large data, Figure 4.2 shows that the average posterior probability peaks between 3 and 8 and stabilizes after 10. Thus, the optimal neighborhood for the large data was taken to be (3,10). Note, for both the small and large simulated data sets, the true intrinsic dimensionality was 4. The fact that the average posterior probability concentrates around 4 further corroborates that our algorithm learns the intrinsic dimensionality of the feature space well with regard to prediction of the response. Next, for the ADNI data, Figure 4.3 shows that the average posterior probability peaks between projected dimensions of 10 to 20, followed by stabilization after 30. Thus, the optimal dimension neighborhood for the ADNI data was chosen as (1,30). Finally, for the MNIST data, Figure 4.4 shows that the optimal dimension neighborhood can be chosen as (580,600).

4.8.2 Comparative Baselines

For the two simulated examples and the ADNI data, we consider 10 splits of the data as in Section 4.7. With the optimal dimension neighborhood, we use steps (4)-(5) of Algorithm 3 to obtain the mean and standard deviation of the train and test accuracy and AUC score of RPVBNN (see Tables 4.3, 4.4 and 4.5 respectively). In addition to the performance of RPVBNN, we also provide results from logistic regression with $L_1$ penalty (LR-$\ell_1$), random forest (RF) and gradient boosting (GB) as comparative baselines. In particular, we report the LR performance for varying values of $\lambda = 0.01, 0.1, 0.5, 1, 5, 10, 100$, together with the performance at $\lambda_0$, the optimal $\lambda$ obtained from 10-fold cross-validation.

Table 4.2: RPVBNN settings for evaluation
Data        N    Learning rate  Batch size  Optimal Region
Small data  16   0.01           256         (3,7)
Large data  64   0.01           256         (3,10)
ADNI        64   0.01           185         (1,30)
MNIST       128  0.01           512         (580,600)
To build both the RF and GB models, we start with 50 trees and increase the number of trees by 100 each time until we see either no improvement in test accuracy or an increase in the standard deviation of test accuracy [101]. Finally, for the same 10 splits, we report the results obtained using the VBNN algorithm, which works on the whole feature space without any compression (it is indeed a version of RPVBNN with $N = 1$, $d = p$ and $\Gamma = I_p$). The number of nodes for both RPVBNN and VBNN is varied as $k = 32, 64, 128$. For the MNIST data set, since author-defined train and test splits already exist, we only report the train and test accuracy and algorithm run time for RPVBNN (see Table 4.6). As a comparative baseline, we also provide the results of VBNN. For all the data sets, the details of the RPVBNN settings, including the optimal dimension neighborhoods, the number of projections, the learning rate and the batch size, are summarized in Table 4.2.

4.8.3 Experimental Results

For the small simulated data, as shown in Table 4.3, RPVBNN with 128 hidden nodes achieves a test accuracy and AUC of 94.88% and 95.88% respectively, which is considerably better than the performance of the other learning algorithms. Also, the impact of the number of nodes is minimal, which further justifies our obtaining the optimal dimension neighborhood using only $k = 32$ nodes. Whereas for the small simulated data set the second best performer was VBNN, its performance deteriorates significantly for the large simulated data set. This is because, with a large feature space of $p = 200$, the training size of 2100 is comparatively small. Since RPVBNN works with a compressed feature space of $d \in [3, 10]$, it still attains the best test accuracy and AUC, of 94.96% and 96.63% respectively (see Table 4.4). This clearly indicates that RPVBNN is an effective solution to the small-$n$-large-$p$ problem.
Also, since at each instance one works with the compressed feature space, one gains a huge advantage in both memory storage and computational efficiency as long as the intrinsic dimensionality of the feature space lies in a smaller-dimensional subspace (although multiple compressions are needed, one can leverage parallelization across compressions). The ADNI data, with $p = 277$ and a training sample of 180, is another example of a small-$n$-large-$p$ problem. RPVBNN still continues to outperform all its competitors (see Table 4.5), whereas VBNN suffers from the curse of dimensionality. Interestingly, overall, gradient boosting seems to be the second best performer after RPVBNN. For the MNIST data (see Table 4.6), note that VBNN, with the best test accuracy of 97.8%, slightly outperforms RPVBNN, whose best test accuracy is 97.32%. For MNIST, the training size $n = 60000$ is far larger than $p = 784$, so VBNN is the best performer. However, the average run time for one run based on 500 epochs using 512 nodes of VBNN is 2640 seconds, while the corresponding run time for RPVBNN with $d$ in the optimal dimension neighborhood is 2350 seconds.

Table 4.3: Small simulated data performance
Model    Setting       Train Acc(%)    Test Acc(%)     AUC(%)
LR-ℓ₁    λ = 10        67.63 ± 0.64    67.46 ± 1.04    68.04 ± 1.54
         λ = 1         67.59 ± 0.67    67.47 ± 0.97    68.07 ± 1.58
         λ = 0.1       67.32 ± 0.77    67.44 ± 0.97    68.47 ± 1.51
         λ = 0.01      65.49 ± 0.91    66.04 ± 1.36    68.77 ± 1.56
         λ₀ = 0.1      67.32 ± 0.77    67.44 ± 0.97    68.47 ± 1.51
RF       10 trees      66.83 ± 1.76    66.02 ± 1.84    73.76 ± 3.63
         25 trees      68.75 ± 1.91    67.92 ± 1.84    78.30 ± 3.67
         50 trees      68.90 ± 1.09    67.84 ± 1.65    80.07 ± 2.05
GB       10 trees      72.78 ± 0.78    72.05 ± 1.45    74.18 ± 1.58
         50 trees      79.78 ± 1.78    77.34 ± 1.73    88.74 ± 1.72
         100 trees     87.30 ± 1.52    83.22 ± 2.08    92.51 ± 0.88
         150 trees     90.81 ± 0.90    85.56 ± 2.01    93.78 ± 0.84
         250 trees     94.17 ± 0.81    86.91 ± 1.63    94.41 ± 0.66
         350 trees     94.07 ± 1.07    87.44 ± 1.98    94.62 ± 0.75
         450 trees     97.62 ± 0.62    87.80 ± 1.95    94.84 ± 0.86
VBNN     32 nodes      94.88 ± 0.52    89.84 ± 0.64    90.36 ± 0.71
         64 nodes      95.36 ± 0.38    90.27 ± 0.59    90.89 ± 0.53
         128 nodes     95.28 ± 0.45    90.28 ± 0.65    90.88 ± 0.56
RPVBNN   32 nodes      95.70 ± 0.40    94.77 ± 0.68    95.68 ± 0.56
         64 nodes      95.80 ± 0.42    94.80 ± 0.60    95.45 ± 0.64
         128 nodes     95.83 ± 0.31    94.88 ± 0.76    95.88 ± 0.43

Table 4.4: Large simulated data performance
Model    Setting       Train Acc(%)    Test Acc(%)     AUC(%)
LR-ℓ₁    λ = 10        71.34 ± 0.61    64.12 ± 1.58    65.88 ± 0.90
         λ = 1         71.39 ± 0.63    64.21 ± 1.34    66.13 ± 0.91
         λ = 0.1       70.90 ± 0.59    66.31 ± 1.35    68.13 ± 0.95
         λ = 0.01      65.75 ± 0.86    66.07 ± 1.09    70.62 ± 1.41
         λ₀ = 0.01     65.75 ± 0.86    66.07 ± 1.09    70.62 ± 1.41
RF       10 trees      62.78 ± 3.31    61.12 ± 3.70    66.21 ± 6.43
         25 trees      62.93 ± 3.93    61.69 ± 3.07    70.24 ± 3.43
         50 trees      60.04 ± 1.48    59.59 ± 1.58    71.82 ± 2.67
         100 trees     60.14 ± 1.61    59.54 ± 1.39    74.81 ± 2.36
GB       10 trees      73.51 ± 0.82    72.14 ± 1.78    74.43 ± 1.62
         50 trees      80.53 ± 1.30    77.45 ± 2.69    87.68 ± 3.15
         100 trees     87.79 ± 1.37    80.75 ± 2.36    90.69 ± 1.59
         150 trees     91.98 ± 1.06    82.25 ± 2.68    91.56 ± 1.57
         250 trees     96.31 ± 0.08    84.04 ± 2.99    92.41 ± 1.82
         350 trees     98.46 ± 0.05    84.64 ± 2.71    92.70 ± 1.71
         450 trees     99.40 ± 0.04    85.06 ± 2.55    92.98 ± 1.72
         550 trees     99.78 ± 0.01    84.97 ± 2.20    92.95 ± 1.70
VBNN     32 nodes      62.76 ± 1.21    60.88 ± 1.59    65.23 ± 1.78
         64 nodes      62.52 ± 1.71    60.90 ± 1.50    65.62 ± 1.39
         128 nodes     63.61 ± 1.13    61.42 ± 1.11    66.21 ± 1.06
RPVBNN   32 nodes      96.54 ± 0.22    94.70 ± 0.81    96.21 ± 0.45
         64 nodes      96.57 ± 0.41    94.89 ± 0.68    96.45 ± 0.63
         128 nodes     96.66 ± 0.28    94.96 ± 0.90    96.63 ± 0.41

Table 4.5: ADNI data performance
Model    Setting       Train Acc(%)    Test Acc(%)     AUC(%)
LR-ℓ₁    λ = 10        100.00 ± 0.00   65.75 ± 4.07    68.71 ± 3.75
         λ = 1         100.00 ± 0.00   63.25 ± 3.12    65.21 ± 4.44
         λ = 0.1       100.00 ± 0.00   61.00 ± 4.70    61.62 ± 4.00
         λ = 0.01      100.00 ± 0.00   60.12 ± 4.55    61.08 ± 4.15
         λ₀ = 10       100.00 ± 0.00   65.75 ± 4.07    68.71 ± 3.75
RF       10 trees      81.78 ± 2.23    68.87 ± 5.37    75.08 ± 5.18
         25 trees      82.38 ± 2.35    70.88 ± 4.43    79.29 ± 5.00
         50 trees      83.67 ± 1.79    72.00 ± 3.88    78.89 ± 4.90
         100 trees     83.08 ± 2.62    71.25 ± 4.50    79.95 ± 4.85
GB       10 trees      87.24 ± 1.51    73.25 ± 4.40    80.44 ± 4.23
         25 trees      95.41 ± 1.16    74.37 ± 5.34    81.32 ± 4.01
         50 trees      99.78 ± 0.35    74.87 ± 4.95    81.16 ± 4.17
         100 trees     100.00 ± 0.00   73.75 ± 3.95    80.79 ± 3.89
VBNN     32 nodes      62.51 ± 2.02    62.75 ± 4.67    62.54 ± 3.43
         64 nodes      62.51 ± 2.02    62.75 ± 4.67    62.54 ± 3.43
         128 nodes     62.51 ± 2.02    62.75 ± 4.67    62.54 ± 3.43
RPVBNN   32 nodes      78.57 ± 1.76    75.66 ± 3.80    81.88 ± 1.76
         64 nodes      78.62 ± 1.92    75.70 ± 4.85    82.12 ± 1.83
         128 nodes     78.84 ± 1.62    75.94 ± 3.84    82.33 ± 1.91

Table 4.6: MNIST data performance in terms of test accuracy and time (based on 500 epochs)
Model    Setting     Train Acc(%)  Test Acc(%)  Time(s)
VBNN     32 nodes    97.63         96.88        354
         128 nodes   98.08         97.33        758
         256 nodes   98.13         97.40        1385
         512 nodes   99.11         97.80        2640
RPVBNN   32 nodes    97.82         97.18        280
         128 nodes   97.84         97.29        720
         256 nodes   98.00         97.30        1143
         512 nodes   98.06         97.32        2350

To conclude, when $p \gg n$, RPVBNN offers the biggest advantage in terms of memory storage, computational efficiency and prediction accuracy, in addition to providing inference on the intrinsic dimensionality of the feature space.
For $n \gg p$, RPVBNN is equally competitive, while still providing computational and memory gains as long as the input resides in a smaller-dimensional subspace with respect to prediction.

4.9 Conclusion

In this chapter, we consider a variational Bayes neural network predictive model that addresses the curse of dimensionality (small $n$, large $p$) by compressing the feature space using RP matrices. To remove the sensitivity to the choice of the RP matrix, we propose a model averaging approach that bases our projection on the most relevant models. To improve computational complexity, we provide a variational inference technique which can estimate model-specific parameters and model weights at the same time. As a by-product, we use the posterior model weights of the projected dimensions to learn the intrinsic dimensionality of the feature space in the context of prediction. The variational inference approach proposed in the context of Bayesian model averaging has two advantages: (1) it has the computational gain of frequentist ensemble approaches, since one can parallelize across different models; (2) it provides uncertainty quantification associated with each random projection via posterior probabilities. The approach presented in this thesis can be generalized to a wide class of problems arising out of Bayesian neural networks which require learning of model importance or averaging across models.

CHAPTER 5

CONCLUSIONS, DISCUSSION, AND DIRECTIONS FOR FUTURE RESEARCH

5.1 Conclusions and discussion

In this thesis, we first applied two machine learning methods (LR and SVM), under multiple conditions, to test accuracy in classifying patients with MCI who progress to clinically defined dementia (MCI-C) versus those who remain stable (MCI-S). Using multi-modal data from ADNI, we compared LR and SVM classification accuracy under different pre-selection dimension reduction techniques, i.e., feature selection as informed by prior findings in clinical neuroscience and by the $\ell_1$ norm. Notably, the present results demonstrate important boundaries for applying feature selection techniques in statistical classification of MCI-to-dementia conversion. Specifically, we found that while using the $\ell_1$ norm for pre-selection can improve accuracy, it also benefits from a more limited, theoretically based set of feature inputs. In addition, we found that model performance benefited from a longer window of assessment. These results have implications for studies utilizing multi-modal data for such classification, including features from clinical neuropsychological assessment, demographic and genetic markers, MRI-based volumetric brain measures, and other modalities. This thesis also demonstrates that SVM classifier performance is more stable than LR for dealing with the "large p" problem. Clinical researchers should note the value of evaluating different classification and pre-selection approaches in application to clinical or research questions, and be mindful that not all machine learning techniques are equally beneficial for modeling specific clinical outcomes. To further tackle high-dimensional data and the variability and complexity of big data, we introduce the variational Bayes neural network and provide the theoretical rigour and computational detail for BDNNs. Although variational Bayes is popular in machine learning, neither the computational method nor the statistical properties are well understood for complex models such as neural networks. We characterize the prior distributions and the variational family for consistent Bayesian estimation. The theory also quantifies the loss due to the VB numerical approximation compared to the true posterior distribution. For practical implementation, we reveal that the algorithm may not be as simple and straightforward as it sounds in the computer science literature; rather, it requires careful tuning of several parameters in various steps.
Nevertheless, the computation can be substantially faster than the popular Markov chain Monte Carlo procedures for approximating posterior distributions. Even though the BDNN achieved higher model performance in classifying the transition from MCI to dementia, it fails to address the curse of dimensionality and to learn the true dimensionality of the data. We then consider a variational Bayes neural network predictive model that addresses the curse of dimensionality (small $n$, large $p$) by compressing the feature space using RP matrices. To remove the sensitivity to the choice of the RP matrix, we propose a model averaging approach that bases our projection on the most relevant models. To improve computational complexity, we provide a variational inference technique which can estimate model-specific parameters and model weights at the same time. The derivation shows that the use of variational inference provides a huge advantage by allowing parallelization across the different models at hand. Unlike Markov chain Monte Carlo, the variational technique proposed in this thesis allows one to obtain the optimal model weights after the individual models have been trained, using just each model's evidence lower bound (ELBO) and the prior model weights. The approach is generalizable to a wide class of problems where Bayesian model averaging is next to impossible due to the large dimension of the data or an intractable likelihood.

5.2 Directions for future research

Future research is mainly focused on two aspects: the choice of prior structure, and Bayesian compressed deep neural networks. Although this thesis builds the framework on a multi-layer neural network model with a simplistic prior structure, the detailed statistical theory and computational methodology are quite involved. This investigation opens up the possibility of exploring a much wider class of models and priors.
For example, shrinkage priors, such as double exponential and horseshoe priors, can be explored for building sparse neural networks, or one can experiment with various other variational families. However, their computational details and associated statistical properties are not immediate. We hope this research will accelerate further development of the statistical and computational foundations for variational inference in general machine learning research. Moreover, we explored the sensitivity to the number of projections and the dimension of the projection empirically. However, further investigation is needed in order to obtain a statistically optimal solution. Another interesting direction to pursue will be studying the impact of different projections and quantifying prediction accuracy as a function of the projection. The current work presents a proof of concept for shallow networks; however, the methodology developed in this thesis can be extended to deep neural networks. Another interesting line of work will be the extension to more complex feature spaces, to learn the intrinsic dimensionality of those spaces.

APPENDICES

APPENDIX A

SUPPLEMENT FOR CONSISTENT VARIATIONAL BAYES CLASSIFICATION WITH DEEP NEURAL NETWORKS

Algorithms of variational implementation

Algorithm 4 BBVI-RMS
1. Fix an initial value for the variational family parameters $\beta_q^1$.
2. Fix a step-size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$ and $\epsilon > 0$.
4. Simulate $W$ samples $\theta_n[1], \cdots, \theta_n[W]$ from $q(\cdot|\beta_q^t)$.
5. Compute $\widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q}$ as in (3.18).
6. Compute
   \[
   G_t = \widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q} - \nabla_{\beta_q^t} d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big), \qquad
   R_t = 0.9 R_{t-1} + 0.1 G_t^2
   \]
7. Update
   \[
   \beta_q^{t+1} = \beta_q^t + \rho_t \frac{G_t}{\sqrt{R_t + \epsilon}} \tag{A.1}
   \]
8. Set $t = t + 1$.
9. Repeat steps 4-7 until convergence of the ELBO, using $\widehat{\mathcal{L}}_{\beta_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\beta_q^t} - d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big)$.

Algorithm 5 BBVI-CV-RMS
1. Fix an initial value for the variational parameter $\beta_q^1$.
2. Fix a step-size sequence $\rho_t$, $t = 1, \cdots$.
3. Set $t = 1$.
4. Simulate $W$ samples $\theta_n[1], \cdots, \theta_n[W]$ from $q(\cdot|\beta_q^t)$.
5. Compute $\hat{c}_t^* = \mathrm{cov}(u_{1t}, u_{2t})/\mathrm{var}(u_{2t})$, where $u_{1t}$ and $u_{2t}$ are the same as in (3.22).
6. Compute $\widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q}$ as in (3.21).
7. Compute
   \[
   G_t = \widetilde{\nabla}_{\beta_q^t} \mathcal{L}_{\beta_q} - \nabla_{\beta_q^t} d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big), \qquad
   R_t = 0.9 R_{t-1} + 0.1 G_t^2
   \]
8. Update
   \[
   \beta_q^{t+1} = \beta_q^t + \rho_t \frac{G_t}{\sqrt{R_t + \epsilon}}
   \]
9. Set $t = t + 1$.
10. Repeat steps 4-8 until convergence of the ELBO, using $\widehat{\mathcal{L}}_{\beta_q^t}$ as in (3.19) and $\mathrm{ELBO} = \widehat{\mathcal{L}}_{\beta_q^t} - d_{\mathrm{KL}}\big(q(\cdot|\beta_q), p(\cdot)\big)$.

With $q$ and $p$ as in (3.11) and (3.9) respectively,
\[
d_{\mathrm{KL}}(q, p) = \sum_{j=1}^{K_n} \left( \log\frac{\sigma_{jn}}{s_{jn}} + \frac{s_{jn}^2}{2\sigma_{jn}^2} + \frac{(m_{jn} - \mu_{jn})^2}{2\sigma_{jn}^2} - \frac{1}{2} \right)
\]
\[
\nabla_{m_{jn}} d_{\mathrm{KL}}(q, p) = \frac{m_{jn} - \mu_{jn}}{\sigma_{jn}^2}, \qquad
\nabla_{s_{jn}} d_{\mathrm{KL}}(q, p) = -\frac{1}{s_{jn}} + \frac{s_{jn}}{\sigma_{jn}^2}
\]
\[
\nabla_{m_{jn}} \mathcal{L}_{\beta_q} = E_{q(\cdot|\beta_q)}\left( \log L(\theta_n)\, \frac{\theta_{jn} - m_{jn}}{s_{jn}^2} \right), \qquad
\nabla_{s_{jn}} \mathcal{L}_{\beta_q} = E_{q(\cdot|\beta_q)}\left( \log L(\theta_n) \left( \frac{(\theta_{jn} - m_{jn})^2}{s_{jn}^3} - \frac{1}{s_{jn}} \right) \right)
\]

Preliminaries

A.0.1 Definitions

Definition A.0.1 For a vector $\boldsymbol{\alpha}$ and a function $g$,
1. $\|\boldsymbol{\alpha}\|_1 = \sum_i |\alpha_i|$, $\|\boldsymbol{\alpha}\|_2 = \sqrt{\sum_i \alpha_i^2}$, $\|\boldsymbol{\alpha}\|_\infty = \max_i |\alpha_i|$.
2. $\|g\|_1 = \int_{x \in \chi} |g(x)|\,dx$, $\|g\|_2 = \sqrt{\int_{x \in \chi} g(x)^2\,dx}$, $\|g\|_\infty = \sup_{x \in \chi} |g(x)|$.

Definition A.0.2 (Bracketing number and entropy) For any two functions $l$ and $u$, define the bracket $[l, u]$ as the set of all functions $f$ such that $l \le f \le u$. Let $\|\cdot\|$ be a metric. Define an $\varepsilon$-bracket as a bracket with $\|u - l\| \le \varepsilon$. Define the bracketing number of a set of functions $\mathcal{F}^*$ as the minimum number of $\varepsilon$-brackets needed to cover $\mathcal{F}^*$, and denote it by $N_{[\,]}(\varepsilon, \mathcal{F}^*, \|\cdot\|)$. Finally, the Hellinger bracketing entropy, denoted by $H_{[\,]}(\varepsilon, \mathcal{F}^*, \|\cdot\|)$, is the natural logarithm of the bracketing number ([104]).

Definition A.0.3 (Covering number and entropy) Let $(V, \|\cdot\|)$ be a normed space, and $\mathcal{F} \subset V$. $\{v_1, \cdots, v_n\}$ is an $\varepsilon$-covering of $\mathcal{F}$ if $\mathcal{F} \subset \cup_{i=1}^n B(v_i, \varepsilon)$, or equivalently, for every $\theta \in \mathcal{F}$ there exists $i$ such that $\|\theta - v_i\| < \varepsilon$. The covering number of $\mathcal{F}$ is denoted by $N(\varepsilon, \mathcal{F}, \|\cdot\|) = \min\{n : \exists\ \varepsilon\text{-covering of } \mathcal{F} \text{ of size } n\}$. Finally, the Hellinger covering entropy, denoted by $H(\varepsilon, \mathcal{F}, \|\cdot\|)$, is the natural logarithm of the covering number ([104]).
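As a concrete illustration of the BBVI-RMS updates (Algorithms 4-5) combined with the Gaussian score-function gradients displayed above, the following is a minimal one-dimensional sketch. The target log-likelihood, prior, step size and sample count are illustrative choices of mine, not the thesis settings, and no control variate is included:

```python
import numpy as np

rng = np.random.default_rng(1)

def bbvi_rms_step(m, s, log_lik, mu, sigma, R, rho=0.05, eps=1e-8, W=500):
    # One BBVI-RMS step for q(theta) = N(m, s^2):
    # score-function Monte Carlo gradients of E_q[log L], exact KL gradients
    # (as in the formulas above), and RMSprop scaling R_t = 0.9 R_{t-1} + 0.1 G_t^2.
    theta = m + s * rng.standard_normal(W)
    ll = np.array([log_lik(t) for t in theta])
    g_m = np.mean(ll * (theta - m) / s**2)                  # grad_m E_q[log L]
    g_s = np.mean(ll * ((theta - m)**2 / s**3 - 1.0 / s))   # grad_s E_q[log L]
    g_m -= (m - mu) / sigma**2                              # minus grad_m KL(q || p)
    g_s -= (-1.0 / s + s / sigma**2)                        # minus grad_s KL(q || p)
    G = np.array([g_m, g_s])
    R = 0.9 * R + 0.1 * G**2
    m, s = np.array([m, s]) + rho * G / np.sqrt(R + eps)    # update (A.1)
    return m, s, R
```

Run with a Gaussian-shaped log-likelihood centred at 2 and a diffuse $N(0, 10^2)$ prior, the variational mean drifts toward 2 within a few hundred steps.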
A.0.2 Lemmas

Lemma A.0.4 With $H_{[\,]}(u, \mathcal{F}_n, \|\cdot\|_2)$ as in Definition A.0.2, if $H_{[\,]}(u, \mathcal{F}_n, \|\cdot\|_2) \le K_n \log(M_n/u)$, then
\[
\int_0^\varepsilon \sqrt{H_{[\,]}(u, \mathcal{F}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\sqrt{K_n(\log M_n - \log \varepsilon)}
\]
Proof. See the proof of Lemma 7.14 in [6].

Lemma A.0.5 Suppose $q$ satisfies $\int d_{\mathrm{KL}}(\ell_0, \ell_{\theta_n})\, q(\theta_n)\,d\theta_n \le \varepsilon$. Then, for any $\nu > 0$,
\[
P_0^n\left( \int q(\theta_n) \log\frac{L(\theta_n)}{L_0}\,d\theta_n \le -n\nu \right) \le \frac{\varepsilon}{\nu}
\]
Proof. See the proof of Lemma 7.13 in [6].

Lemma A.0.6 Suppose $\mathcal{N}_\varepsilon = \{\theta_n : d_{\mathrm{KL}}(\ell_0, \ell_{\theta_n}) < \varepsilon\}$ and $\int_{\mathcal{N}_\varepsilon} p(\theta_n)\,d\theta_n \ge e^{-n\varepsilon}$, $n \to \infty$. Then, for any $\nu > 0$,
\[
P_0^n\left( \log \int \frac{L(\theta_n)}{L_0}\, p(\theta_n)\,d\theta_n \le -n\nu \right) \le \frac{2\varepsilon}{\nu}
\]
Proof. See the proof of Lemma 7.12 in [6].

Lemma A.0.7 Suppose $\int_{\mathcal{F}_n^c} p(\theta_n)\,d\theta_n \le e^{-n\varepsilon}$, $n \to \infty$, for any $\varepsilon > 0$. Then, for every $\tilde{\varepsilon} < \varepsilon$,
\[
P_0^n\left( \int_{\theta_n \in \mathcal{F}_n^c} \frac{L(\theta_n)}{L_0}\, p(\theta_n)\,d\theta_n \ge e^{-n\tilde{\varepsilon}} \right) \le e^{-n(\varepsilon - \tilde{\varepsilon})}
\]
Proof. See the proof of Lemma 7.16 in [6].

Lemma A.0.8 Let $\eta_{\theta_n^*}(x) = b_L^* + A_L^*\psi(b_{L-1}^* + A_{L-1}^*\psi(\cdots\psi(b_1^* + A_1^*\psi(b_0^* + A_0^* x))))$ be a fixed neural network. Let $\eta_{\theta_n}(x) = b_L + A_L\psi(b_{L-1} + A_{L-1}\psi(\cdots\psi(b_1 + A_1\psi(b_0 + A_0 x))))$ be a neural network such that
\[
|\theta_{jn} - \theta_{jn}^*| \le \frac{\varepsilon}{\sum_{v=0}^{L_n} \tilde{k}_{vn} \prod_{v'=v+1}^{L_n} a_{v'n}^*}
\]
where $\tilde{k}_{vn} = k_{vn} + 1$. Then
\[
\int_{x \in [0,1]^{p_n}} |\eta_{\theta_n}(x) - \eta_{\theta_n^*}(x)|\,dx \le \varepsilon
\]
Proof. In the proof, we suppress the dependence on $n$. Define the projection $P_V$ as $P_V \eta_\theta(x) = b_{V-1} + A_{V-1}\psi(\cdots\psi(b_1 + A_1\psi(b_0 + A_0 x)))$. We claim that
\[
|P_V \eta_\theta(x)[s] - P_V \eta_{\theta^*}(x)[s]| \le \frac{\varepsilon \sum_{v=0}^{V} \tilde{k}_v \prod_{v'=v+1}^{V} a_{v'}^*}{\sum_{v=0}^{L} \tilde{k}_v \prod_{v'=v+1}^{L} a_{v'}^*} \tag{A.2}
\]
We prove this by induction, starting with $V = 1$. Let $\tilde{\varepsilon} = \varepsilon / \sum_{v=0}^{L} \tilde{k}_v \prod_{v'=v+1}^{L} a_{v'}^*$; then
\[
|P_1 \eta_\theta(x)[s] - P_1 \eta_{\theta^*}(x)[s]| \le |b_1[s] - b_1^*[s]| + |A_1[s]^\top \psi(b_0 + A_0 x) - A_1^*[s]^\top \psi(b_0^* + A_0^* x)|
\]
\[
\le \tilde{\varepsilon} + \|A_1[s] - A_1^*[s]\|_1 + \sum_{s'=0}^{k_1} |A_1^*[s][s']|\,\big|\psi(b_0[s'] + A_0[s']^\top x) - \psi(b_0^*[s'] + A_0^*[s']^\top x)\big|
\]
\[
\le \tilde{\varepsilon} + k_1\tilde{\varepsilon} + \tilde{\varepsilon}\sum_{s'=0}^{k_1} |A_1^*[s][s']|(k_0 + 1) = \tilde{\varepsilon}\big(1 + k_1 + a_1^*(p + 1)\big) \le \tilde{\varepsilon}(\tilde{k}_1 + a_1^*\tilde{k}_0)
\]
where the second line holds since $\psi(u) \le 1$ and the third step is shown next.
Let $u = b_0[s'] + A_0[s']^\top x$ and $u_\delta = (b_0[s'] + A_0[s']^\top x) - (b_0^*[s'] + A_0^*[s']^\top x)$. Then, for $|u_\delta| < 1$,
\[
|\psi(u) - \psi(u + u_\delta)| = \left| \frac{e^{u+u_\delta}}{1 + e^{u+u_\delta}} - \frac{e^u}{1 + e^u} \right|
= \frac{e^u |e^{u_\delta} - 1|}{(1 + e^u)(1 + e^{u+u_\delta})} \le |u_\delta| \tag{A.3}
\]
since $e^u/\big((1 + e^u)(1 + e^{u-1})\big) \le 1/2$ and $|e^{u_\delta} - 1| \le 2|u_\delta|$ for $|u_\delta| < 1$. Now, $|u_\delta| \le |b_0[s'] - b_0^*[s']| + \sum_{s''=0}^{p} |A_0[s'][s''] - A_0^*[s'][s'']| \le (p + 1)\tilde{\varepsilon} < 1$.

Suppose the claim holds for $V - 1$; we show it for $V$ as follows:
\[
|P_V\eta_\theta(x)[s] - P_V\eta_{\theta^*}(x)[s]| \le |b_V[s] - b_V^*[s]| + |A_V[s]^\top \psi(P_{V-1}\eta_\theta(x)) - A_V^*[s]^\top \psi(P_{V-1}\eta_{\theta^*}(x))|
\]
\[
\le \tilde{\varepsilon} + \|A_V[s] - A_V^*[s]\|_1 + \sum_{s'=0}^{k_V} |A_V^*[s][s']|\,\big|\psi(P_{V-1}\eta_\theta(x)[s']) - \psi(P_{V-1}\eta_{\theta^*}(x)[s'])\big|
\]
\[
\le \tilde{\varepsilon} + \|A_V[s] - A_V^*[s]\|_1 + \sum_{s'=0}^{k_V} |A_V^*[s][s']|\,\big|P_{V-1}\eta_\theta(x)[s'] - P_{V-1}\eta_{\theta^*}(x)[s']\big|
\]
where the second step follows since $\psi(u) \le 1$ and the third step follows by relation (A.3), provided $|P_{V-1}\eta_\theta(x)[s'] - P_{V-1}\eta_{\theta^*}(x)[s']| \le 1$; this holds using relation (A.2) with $V - 1$. Thus, proceeding further, we get
\[
|P_V\eta_\theta(x)[s] - P_V\eta_{\theta^*}(x)[s]| \le \tilde{\varepsilon}(1 + k_V) + 2\tilde{\varepsilon}\sum_{s'=0}^{k_V} |A_V^*[s][s']| \sum_{v=0}^{V-1}\tilde{k}_v \prod_{v'=v+1}^{V-1} a_{v'}^*
\le \tilde{\varepsilon}\sum_{v=0}^{V} \tilde{k}_v \prod_{v'=v+1}^{V} a_{v'}^*
\]
This completes the proof.

Lemma A.0.9 If $|\eta_0(x) - \eta_{\theta_n}(x)| \le \varepsilon$, then $|h_{\theta_n}(x)| \le 2\varepsilon$, where
\[
h_{\theta_n}(x) = \sigma(\eta_0(x))\big(\eta_0(x) - \eta_{\theta_n}(x)\big) + \log\big(1 - \sigma(\eta_0(x))\big) - \log\big(1 - \sigma(\eta_{\theta_n}(x))\big)
\]
Proof. Note that
\[
|h_{\theta_n}(x)| \le |\sigma(\eta_0(x))|\,|\eta_0(x) - \eta_{\theta_n}(x)| + \big|\log(1 - \sigma(\eta_0(x))) - \log(1 - \sigma(\eta_{\theta_n}(x)))\big|
\]
\[
\le |\eta_0(x) - \eta_{\theta_n}(x)| + \left|\log\left(1 + \sigma(\eta_0(x))\big(e^{\eta_{\theta_n}(x) - \eta_0(x)} - 1\big)\right)\right| \le 2|\eta_0(x) - \eta_{\theta_n}(x)|
\]
where the second step follows using $\sigma(x) = e^x/(1 + e^x) \le 1$ and the proof of the third step is shown below. Let $p = \sigma(\eta_0(x))$, so $0 \le p \le 1$, and $r = \eta_{\theta_n}(x) - \eta_0(x)$. Then
\[
r > 0:\ |\log(1 + p(e^r - 1))| = \log(1 + p(e^r - 1)) \le \log(1 + (e^r - 1)) = r = |r|
\]
\[
r < 0:\ |\log(1 + p(e^r - 1))| = -\log(1 + p(e^r - 1)) \le -\log(1 + (e^r - 1)) = -r = |r|
\]

Lemma A.0.10 For $\eta_{\theta_n}(x) = b_L + A_L\psi(b_{L-1} + A_{L-1}\psi(\cdots\psi(b_1 + A_1\psi(b_0 + A_0 x))))$,
\[
\sup_{j=1,\cdots,K_n} |\nabla_{\theta_j}\eta_{\theta_n}(x)| \le \prod_{v'=1}^{L_n} a_{v'n}
\]
where $a_{v'n} = \sup_{v=0,\cdots,k_{(v'+1)n}} \|A_{v'}[v]\|_1$.
Proof. We suppress the dependence on $n$. Let $P_V = b_V + A_V\psi(b_{V-1} + A_{V-1}\psi(\cdots b_1 + A_1\psi(b_0 + A_0 x)))$. Define $G_{V,V} = \mathbf{1}_{k_{V+1}}$ and, for $V = 0, \cdots, L$ and $V' = 0, \cdots, V - 1$,
\[
G_{V',V} = A_V\big(\psi'(P_{V-1}) \odot A_{V-1}\big(\psi'(P_{V-2}) \odot \cdots A_{V'+1}(\psi'(P_{V'}))\big)\big)
\]
where $\odot$ denotes componentwise multiplication. With $\psi(P_{-1}) = x$, we have
\[
\nabla_{b_V}\eta_\theta(x) = G_{V,L}\,\mathbf{1}_{k_{V+1}}, \qquad \nabla_{A_V}\eta_\theta(x) = G_{V,L}\,\mathbf{1}_{k_{V+1}}\psi(P_{V-1})^\top
\]
By the above form and the fact that $\psi(u), \psi'(u), |x_i| \le 1$, it can be easily checked by induction that $|G_{V,L}| \le \prod_{v'=V+1}^{L} a_{v'}$, which completes the proof.

Lemma A.0.11 Let $\widetilde{\mathcal{F}}_n = \{\sqrt{\ell_{\theta_n}(y, x)} : \theta_n \in \mathcal{F}_n\}$, where $\ell_{\theta_n}(y, x)$ is given by
\[
\ell_{\theta_n}(y, x) = \exp\big(y\eta_{\theta_n}(x) - \log(1 + e^{\eta_{\theta_n}(x)})\big) \tag{A.5}
\]
and $\mathcal{F}_n$ is given by
\[
\mathcal{F}_n = \{\theta_n : |\theta_{jn}| \le C_n,\ j = 1, \cdots, K_n\} \tag{A.6}
\]
Then, with $H_{[\,]}(u, \widetilde{\mathcal{F}}_n, \|\cdot\|_2)$ as in Definition A.0.2,
\[
\int_{\varepsilon^2/8}^{\sqrt{2}\varepsilon} \sqrt{H_{[\,]}(u, \widetilde{\mathcal{F}}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\sqrt{K_n\big((L_n + 1)\log k_n + (L_n + 2)\log C_n - \log\varepsilon\big)}
\]
Proof. In this proof, we suppress the dependence on $n$. Note, by Lemma 4.1 in [104],
\[
N(\varepsilon, \mathcal{F}_n, \|\cdot\|_\infty) \le \left(\frac{3C}{\varepsilon}\right)^{K}
\]
For $\theta_1, \theta_2 \in \mathcal{F}$, let $\tilde{\ell}(u) = \ell_{u\theta_1 + (1-u)\theta_2}(x, y)$. Following equation (52) in [6], we get
\[
\left|\sqrt{\ell_{\theta_1}(x, y)} - \sqrt{\ell_{\theta_2}(x, y)}\right| \le \sup_j \left|\frac{\partial\sqrt{\tilde{\ell}}}{\partial\theta_j}\right| \|\theta_1 - \theta_2\|_1 \le \Lambda(x, y)\|\theta_1 - \theta_2\|_1 \tag{A.7}
\]
where the upper bound is $\Lambda(x, y) = (kC)^L$. This is because $|\partial\sqrt{\tilde{\ell}}/\partial\theta_j|$, the derivative of $\sqrt{\tilde{\ell}}$ with respect to $\theta_j$, is bounded above by $|\partial\eta_\theta(x)/\partial\theta_j|$, as shown below:
\[
\left|\frac{\partial\sqrt{\tilde{\ell}}}{\partial\theta_j}\right| = \frac{1}{2}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right| \left| y - \frac{e^{\eta_\theta(x)}}{1 + e^{\eta_\theta(x)}} \right| e^{(y\eta_\theta(x) - \log(1 + e^{\eta_\theta(x)}))/2}
\le \frac{1}{2}\left(\frac{e^{\eta_\theta(x)}}{1 + e^{\eta_\theta(x)}}\right)^{1/2}\left(\frac{1}{1 + e^{\eta_\theta(x)}}\right)^{1/2}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right|
\le \frac{1}{4}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right|
\]
Thus, using $e^{\eta_\theta(x)}/(1 + e^{\eta_\theta(x)}) \le 1$ and Lemma A.0.10, we get
\[
\sup_{j=0,\cdots,K}\left|\frac{\partial\eta_\theta(x)}{\partial\theta_j}\right| \le \prod_{v=1}^{L} a_v^* = \prod_{v=1}^{L} k_v C \le (kC)^L
\]
In view of (B.6) and Theorem 2.7.11 in [126], we have
\[
N_{[\,]}(2\varepsilon, \widetilde{\mathcal{F}}_n, \|\cdot\|_2) \le N\!\left(\frac{\varepsilon}{k^{L+1}C^{L+2}}, \mathcal{F}_n, \|\cdot\|_\infty\right), \qquad
H_{[\,]}(2\varepsilon, \widetilde{\mathcal{F}}_n, \|\cdot\|_2) \le K\log\left(\frac{3\,k^{L+1}C^{L+2}}{\varepsilon}\right)
\]
\[
N_{[]}(\varepsilon, \widetilde{\mathcal F}_n, \|\cdot\|_2) \le \left(\frac{3K^{L+1}C^{L+2}}{\varepsilon}\right)^{K} \implies \widetilde H_{[]}(\varepsilon, \widetilde{\mathcal F}_n, \|\cdot\|_2) \lesssim K\log\left(\frac{K^{L+1}C^{L+2}}{\varepsilon}\right),
\]
where $N_{[]}$ and $H_{[]}$ denote the bracketing number and bracketing entropy as in Definition A.0.2. Using Lemma A.0.4 with $M = K^{L+1}C^{L+2}$, we get
\[
\int_0^{\varepsilon} \sqrt{\widetilde H_{[]}(u, \widetilde{\mathcal F}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\sqrt{K\big((L+1)\log K + 2(L+2)\log C - \log\varepsilon\big)}.
\]
Therefore,
\[
\int_{\varepsilon^2/8}^{\sqrt{2\varepsilon}} \sqrt{\widetilde H_{[]}(u, \widetilde{\mathcal F}_n, \|\cdot\|_2)}\,du \le \int_0^{\sqrt{2\varepsilon}} \sqrt{\widetilde H_{[]}(u, \widetilde{\mathcal F}_n, \|\cdot\|_2)}\,du \lesssim \sqrt{2\varepsilon}\,\sqrt{K\big((L+1)\log K + (L+2)\log C - \log\sqrt{2\varepsilon}\big)}.
\]
The proof follows by noting $\log\sqrt{2\varepsilon} \gtrsim \log\varepsilon$.

A.0.2.1 Propositions

Proposition A.0.12 Let $q(\theta_n) = MVN(\theta_n^*, I_{K_n}/n^{2+2d})$ and $p(\theta_n) = MVN(\mu_n, \mathrm{diag}(\sigma_n^2))$, where $\log\|\sigma_n^2\|_\infty = O(\log n)$ and $\|\sigma_n^{-2}\|_\infty = O(1)$. Let $n\epsilon_n^2 \to \infty$, $K_n\log n = o(n\epsilon_n^2)$, $\|\theta_n^*\|_2^2 = o(n\epsilon_n^2)$, $\|\mu_n\|_2^2 = o(n\epsilon_n^2)$; then, for any $a > 0$,
\[
d_{KL}(q, p) \le n\epsilon_n^2 a.
\]
Proof.
\begin{align*}
d_{KL}(q, p) &= \sum_{j=1}^{K_n}\left(\log\big(n^{1+d}\sigma_{jn}\big) + \frac{1}{2n^{2+2d}\sigma_{jn}^2} + \frac{(\theta_{jn}^* - \mu_{jn})^2}{2\sigma_{jn}^2} - \frac12\right)\\
&\le K_n\big((d+1)\log n - \tfrac12\big) + \sum_{j=1}^{K_n}\log\sigma_{jn} + \frac{K_n\|\sigma_n^{-2}\|_\infty}{2n^{2+2d}} + \big(\|\theta_n^*\|_2^2 + \|\mu_n\|_2^2\big)\|\sigma_n^{-2}\|_\infty\\
&\le K_n\big((d+1)\log n - \tfrac12\big) + \frac{K_n}{2}\log\|\sigma_n^2\|_\infty + \frac{K_n\|\sigma_n^{-2}\|_\infty}{2n^{2+2d}} + \big(\|\theta_n^*\|_2^2 + \|\mu_n\|_2^2\big)\|\sigma_n^{-2}\|_\infty = o(n\epsilon_n^2),
\end{align*}
where the second step uses $\sigma_{jn}^{-2} \le \|\sigma_n^{-2}\|_\infty$ and $(a-b)^2 \le 2(a^2+b^2)$. The last equality follows since $\log\|\sigma_n^2\|_\infty = O(\log n)$, $\|\sigma_n^{-2}\|_\infty = O(1)$, $K_n\log n = o(n\epsilon_n^2)$, $\|\mu_n\|_2^2 = o(n\epsilon_n^2)$ and $\|\theta_n^*\|_2^2 = o(n\epsilon_n^2)$.

Proposition A.0.13 Let $p(\theta_n) = MVN(\mu_n, \mathrm{diag}(\sigma_n^2))$ with $\log\|\sigma_n^2\|_\infty = O(\log n)$ and $\|\sigma_n^{-2}\|_\infty = O(1)$. Let $\|\eta_0 - \eta_{\theta_n^*}\|_\infty \le \varepsilon\epsilon_n^2/4$ and $n\epsilon_n^2 \to \infty$. Define
\[
d_{KL}(\ell_0, \ell_{\theta_n}) = \int_{x\in[0,1]^{p_n}} \left(\sigma(\eta_0(x))(\eta_0(x) - \eta_{\theta_n}(x)) + \log\frac{1 - \sigma(\eta_0(x))}{1 - \sigma(\eta_{\theta_n}(x))}\right)dx,
\]
\[
\mathcal N_\varepsilon = \left\{\theta_n : d_{KL}(\ell_0, \ell_{\theta_n}) < \varepsilon\right\}. \tag{A.8}
\]
If $K_n\log n = o(n\epsilon_n^2)$, $\|\theta_n^*\|_2^2 = o(n\epsilon_n^2)$, $\log\big(\sum_{v=0}^{L_n} k_{vn}\prod_{v'=v+1}^{L_n} a^*_{v'n}\big) = O(\log n)$ and $\|\mu_n\|_2^2 = o(n\epsilon_n^2)$, then
\[
\int_{\theta_n\in\mathcal N_{\varepsilon\epsilon_n^2}} p(\theta_n)\,d\theta_n \ge e^{-n\epsilon_n^2 a}, \qquad \forall\, a > 0.
\]
Proof. Let $\eta_{\theta_n^*}(x) = b_L^* + A_L^{*\top}\psi(b_{L-1}^* + A_{L-1}^*\psi(\cdots \psi(b_1^* + A_1^*\psi(b_0^* + A_0^* x))))$ be the neural network such that
\[
\|\eta_{\theta_n^*} - \eta_0\|_\infty \le \frac{\varepsilon\epsilon_n^2}{4}. \tag{A.9}
\]
Such a neural network exists since $\|\eta_0 - \eta_{\theta_n^*}\|_\infty \le \varepsilon\epsilon_n^2/4$.
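The first display in the proof of Proposition A.0.12 is the standard closed-form KL divergence between Gaussians, summed coordinate-wise over the mean-field factors. As a quick numerical sanity check (not part of the dissertation's argument), the sketch below compares the one-dimensional closed form against brute-force numerical integration; the means and standard deviations are arbitrary illustrative values.

```python
import math

def kl_gauss(mu_q, sd_q, mu_p, sd_p):
    # closed-form KL( N(mu_q, sd_q^2) || N(mu_p, sd_p^2) ),
    # the per-coordinate term in the mean-field KL sum
    return (math.log(sd_p / sd_q)
            + (sd_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p ** 2)
            - 0.5)

def kl_numeric(mu_q, sd_q, mu_p, sd_p, lo=-40.0, hi=40.0, n=400_000):
    # brute-force midpoint Riemann sum of q(x) * log(q(x)/p(x))
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        lq = -0.5 * ((x - mu_q) / sd_q) ** 2 - math.log(sd_q * math.sqrt(2 * math.pi))
        lp = -0.5 * ((x - mu_p) / sd_p) ** 2 - math.log(sd_p * math.sqrt(2 * math.pi))
        total += math.exp(lq) * (lq - lp) * h
    return total

closed = kl_gauss(0.3, 1.5, -0.7, 2.0)
numeric = kl_numeric(0.3, 1.5, -0.7, 2.0)
print(abs(closed - numeric) < 1e-5)
```

For a diagonal-covariance (mean-field) family, the total KL is the sum of such one-dimensional terms, which is exactly the structure the proposition exploits.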
Next define neighborhood M Yn =2 as follows ( ) Yn =2 M Yn =2 = ✓= : |\ 9= \ ⇤9= | < Õ! = Œ! = , 9 = 1, · · · , = 2 E=0 :̃ E= E 0 =E+1 0 ⇤E 0= where :̃ E= = : E= + 1. For every ✓= 2 M Yn =2 , by lemma B.0.1, we have Yn =2 ||[✓= [✓=⇤ || 1  (A.10) 2 Combining (A.9) and (B.1), we get for ✓= 2 M Yn =2 , ||[✓= [0 || 1  Yn =2 /2. This, in view of lemma B.0.2, 3KL (✓0 , ✓✓= )  Yn =2 . Let ✓= 2 NYn =2 for every ✓= 2 M Yn =2 . Therefore, π π ?(✓= )3✓= ?(✓= )3✓= ✓= 2N Yn =2 ✓= 2M Y n 2 = Õ! = Œ! = Let X= = Yn =2 /(2 E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ), then π ÷ = π \ ⇤9= +X = ( \ 9= ` 9= ) 2 1 q 2f 2 ?(✓= )3✓= = 4 9= 3\ 9= ✓= 2M Y n 2 \ ⇤9= X = 2cf 9= 2 = 9=1 ÷ 2X= ( \b9= ` 9= ) 2 , b = q 2f 2 = 4 9= \ 9= 2 [\ ⇤9= X= , \ ⇤9= + X= ] 2cf 9= 2 9=1 ! ÷ 1 2 ( \b ` )2 = 2 log c log X = +log f 9= + 9= 2 9= 2f = 4 9= (A.11) 9=1 where the second last equality holds by mean value theorem. 102 Note that b \ 9= 2 [\ ⇤9= 1, \ ⇤9= + 1] since X= ! 0, therefore (b\ 9= ` 9= ) 2 max((\ ⇤9= ` 9= 1) 2 , (\ ⇤9= ` 9= + 1) 2 ) (\ ⇤9= ` 9= ) 2 1 2  2  2 + 2 2f 9= 2f 9= f 9= f 9= where the last inequality follows since (0 + 1) 2  2(0 2 + 1 2 ). Again using (0 + 1) 2  2(0 2 + 1 2 ), ’ = (b \ 9= ` 9= ) 2 ’ = \⇤ 2 9= ’ = `2 9= ’ = 1 2 2 2 +2 2 + 9=1 2f 9= 9=1 f 9= 9=1 f 9= 9=1 9= f2  2(||✓=⇤ || 22 + ||µ= || 22 + 1)|| ⇤ = || 1  =an =2 (A.12) since ||✓=⇤ || 22 = >(=n =2 ), ||µ= || 22 = >(=n =2 ) and || ⇤ = || 1 = $ (1) and =n =2 ! 1. Also, ’!= ÷ != log X= + log f 9= = log 2 + log( :̃ E= 0 ⇤E 0= ) log Yn =2 E=0 E 0 =E+1 ’!= ÷!=  log 2 + log( :̃ E= 0 ⇤E 0= ) + log f 9= log Y 2 log n = E=0 E 0 =E+1  log 2 + $ (log =) + $ (log =) log Y + $ (log =) Õ! = Œ! = where the last follows since log || = || 1 = $ (log =), log( E=0 : E= E 0 =E+1 0 ⇤E 0= ) = $ (log =) and 1/=n =2 = >(1) which implies 2 log n = = >(log =). 
’ = 1 2 log log X= + log f 9= = $ ( = log =) = >(=n =2 ) (A.13) 9=1 2 c where the last inequality follows since = log = = >(=n =2 ), Combining (B.3) and (B.4) and replacing (B.2), the proof follows. Õ! = Œ! = Proposition A.0.14 Let @(✓= ) ⇠ "+ # (✓=⇤ , = /=2+23 ), 3 > 3 ⇤ where E=0 : E= E 0 =E+1 0 ⇤E 0= = ⇤ $ (= 3 ), 3 ⇤ > 0. Define π ✓ ◆ 1 f([0 (x)) ⌘(✓= ) = f([0 (x))([0 (x) [✓= (x)) + log 3x x2[0,1] ?= 1 f([✓= (x)) Let ||[0 [✓=⇤ || 1  Yn =2 /4 where =n =2 ! 1. If = log = = >(=n =2 ), ||✓=⇤ || 22 = >(=n =2 ), then π ⌘(✓= )@(✓= )3✓=  Yn =2 . 103 Proof. Since ⌘(✓= ) is a KL-distance, ⌘(✓= ) > 0. We shall thus establish an upper bound. π π ⌘(✓= )@(✓= )3✓=  |[✓= (x) [0 (x)|3x π π x2[0,1] ?=  |[✓= (x) [\ ⇤= (x)|3x@(✓= )3✓= + ||[✓=⇤ [0 || 1 π x2[0,1] ?=  ||[✓= [✓=⇤ || 1 @(✓= )3✓= + Yn =2 (A.14) where the first inequality is a consequence of lemma B.0.2 and the last inequality follows since ||[✓=⇤ [0 || 1 = >(n =2 ). Õ! Œ! = Let ( = {✓= : \ 9=1 = |\ 9= \ ⇤9= |  Yn =2 /( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= )}, then π ||[✓= [✓=⇤ || 1 @(✓= )3✓= π π = ||[✓= [✓=⇤ || 1 @(✓= )3✓= + ||[✓= [✓=⇤ || 1 @(✓= )3✓= ( (2 π π ’ : !=  Yn =2 + |b ! [B] b⇤! [B]|@(✓= )3✓= + |A ! [B] [B0] A⇤! [B] [B0] |@(✓= )3✓= (2 (2 B 0 =1 ’ : != π + |A⇤! [1] [B]| @(✓= )3✓= (A.15) B=1 (2 Õ! Œ! = Let ( 2 = [ 9=1 = ( 29 where ( 9 = {|\ 9= \ ⇤9= |  D = } where D = = Yn =2 /( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ). We first compute &(( 2 ) as follows: ’ = ’ = π 2 &(( ) = &([ 9=1 ( 29 ) =  &(( 29 ) = @(\ 9= )3\ 9= 9=1 9=1 |\ 9= \ ⇤9= |>D = ⇣ ⇣ ⌘⌘ =2 = 1 =1+3 D = (A.16) Using (A.16) in the last term of (A.15), we get ’ : != π ’ : != |A⇤! [1] [B]| @(✓= )3✓= = &(( ) 2 |A⇤! [1] [B] | = 0 ⇤! = = = (1 (=1+3 D = )) ✓ ◆ B=1 (2 B=1 1 =2(1+3) D 2= = >(=n =2 )$ = 3 4 = >(n =2 ) (A.17) =1+3 D = 104 Õ! = Œ! = where the second step follows by Mill’s ratio, = = >(=n =2 ) and E=0 : E= E 0 =E+1 0 ⇤E 0= = $ (= 3 ) which implies =1+3 D = ! 1 . The third step holds because ! =2(1+3) Y 2 n=4 =1+3 =2(1+3) D 2= =2(1+3) D 2= log =( Õ! 
Œ != 0⇤ ) 2 (3+1) (A.18) :̃ 4 4 =4 E=0 E = E 0 =E+1 E 0 = = >(1) =1+3 D = Õ! Œ! = 0 ⇤E 0= ) 2 log = = $ (=23 log =) = >(=23 ) for 3 > 3 ⇤ and =2 n =4 ! 1. ⇤ since ( E=0 :̃ E= E 0 =E+1 105 For the second term in (A.15), let (0 = {|b ! [B] b⇤! [B] | > D = } π |b ! [B] b⇤! [B]| @(✓= )3✓= (2 π π = |b ! [B] b⇤! [B]|@(✓= )3✓= + |b ! [B] b⇤! [B] |@(✓= )3✓= π( 2 \( 0 ( 2 \( 0 2  |b ! [B] b⇤! [B]|@(b ! [B])3b ! [B] + ⇢ @(b! [B]) |b ! [B] b⇤! [B] |&( (˜2 ), (A.19) (0 (˜2 is the union of all ( 29 , 9 = 1, · · · , = except the one corresponding to b ! [B]. π |b ! [B] b⇤! [B]|@(b ! [B])3b ! [B] (0 π r =2+23 =2+23 ( b ! [B] b⇤! [B]) 2 = (b ! [B] b⇤! [B])4 2 3b ! [B] 2c π 1 | b ! [B] b⇤! [B] |>=1+3 D = 2 D 1 2 1+3 =p p 4 2 D 3D  4 = D = (A.20) =2+23 =1+3 D = 2c p Also, ⇢ @(b! [B]) |b ! [B] b⇤! [B]| = 2/c(1/=1+3 ). Thus ⇢ @(b! [B]) |b ! [B] b⇤! [B] |&( (˜2 ) ✓ ⇣ ⇣ ⌘⌘ ◆ = 1+3 = =2(1+3) D = =2(1+E) D 2= = $ 1+3 1 = D= ⇠ 2(1+3) 4 4 (A.21) = = D = where the first equality in the above step follows by observing that &( (˜2 ) behaves analogous to &(( 2 ) which was computed in (A.16) and the second equality in the above step follows due to Õ! Œ Mill’s ratio and E=0 :̃ E= E!0==E+1 0 ⇤E 0= = $ (=E ) which implies =1+3 D = ! 1. The third inequality in the above step is a consequence of the fact that =  =1+3 . Combining (A.17), (A.20) and (A.21), we get π =1+3 D = |b ! [B] b⇤! [B] | @(✓= )3✓=  4 (A.22) (2 Note the third term in (A.15) can be handled similar to third term and it can be shown π π ’ : != |b ! [B] b⇤! [B]|@(✓= )3✓= + |A ! [B] [B0] A⇤! [B] [B0] |@(✓= )3✓= (2 ( 2 B 0 =1 =1+3 D = =1+3 D = (=1+3 D = 2 log =)  : ! = +1 =4 = >((=n =2 ) 2 )4  >(n =2 )4 = >(n =2 ) (A.23) 106 where the last equality in the second step follows by = = >(=n =2 ) and the argument in (A.18) by which 4 (=1+3 D = 2 log =) = >(1). Combining (A.17) and (A.23) with (A.15) the proof follows. Proposition A.0.15 Let =n =2 ! 1. 
Let ?(✓= ) = "+ # (µ= , diag( 2 = )) where log || = || 1 = $ (log =) and ||µ= || 22 = >(=n =2 ). Suppose for some 0 < 1 < 1, = log = = >(= 1 n =2 ), then for 1n 2/ ⇠= = 4 = = = and F= as in (3.31), we have for any Y > 0, π 2 ?(✓= )3✓=  4 =Yn = , = ! 1 ✓= 2F=2 Proof. Let F9= = {\ 9= : |\ 9= |  ⇠= } F= = \ 9=1 = F9= =) F=2 = \ 9=1 = F9=2 Note that π ’ = π ( \ 9= ` 9= ) 2 1 q 2f 2 ?(✓= )3✓=  4 9= 3\ 9= ✓= 2F=2 2 F 9= 2cf 9= 2 9=1 ’ = π ’ = π ( \ 9= ` 9= ) 2 ( \ 9= ` 9= ) 2 ⇠= 1 1 1 q q 2f 2 2f 2 = 4 9= 3\ 9= + 4 9= 3\ 9= 1 2cf 9= 2 ⇠= 2cf 9=2 9=1 9=1 ’ = ✓ ✓ ◆◆ ’ = ✓ ✓ ◆◆ ⇠= ` 9= ⇠= + ` 9= = 1 + 1 9=1 f 9= 9=1 f 9= p Since ||µ= || 22 = >(=n =2 ) =) ||µ= || 1 = >( =n = ). Since log || = || 1 = $ (log =) which implies for some " > 0, 3 1, ✓ ◆ p |⇠= ` 9= | |⇠= + ` 9= | (⇠= =) 1 min , 4 log ⇠= (3+1) log = ⇠ 4 '= log = ! 1 f 9= f 9= = "3 = 3 1/2 " (A.24) where the last convergence holds since = log = = >(= 1 n =2 ) implies '= = (= 1 n =2 )/( = log =) (3 + 1) ! 1. Thus, using Mill’s ratio, we get: π ©’ f 9= ’ (⇠= ` 9= ) 2 (⇠= +` 9= ) 2 ™ = = f 9= ?(✓= )3✓= = $ ≠ Æ̈ 2f 2 2f 2 4 9= + 4 9= ⇠= ` 9= ⇠= + ` 9= ´ ✓= 2F=2 9=1 9=1 p (⇠= =) 2 Y=n =2 2 =4 2=2 " 2 4 107 where the last asymptotic inequality holds because p 2 ✓ ◆ (⇠= =) 1 4 2'= 2 log = log 2 = ⇠ 4 2'= log = 2 log = = Y=n =2 2= 3 " 2 2 2 = In the above step, the first asymptotic equivalence holds due to (A.24), the second inequality holds since =  =. The last inequality holds since '= ! 1 and log/= ! 0. Proposition A.0.16 Let =n =2 ! 1. Suppose = log = = >(= 1 n =2 ), for some 0 < 1 < 1, ! = ⇠ log = and ?(✓= ) = "+ # (µ= , diag( 2 where log || = $ (log =) and ||µ= || 22 = >(=n =2 ). Then for = )) = || 1 every Y > 0, π ! (✓= ) log ?(✓= )3✓=  log 2 Y 2 =n =2 + > %0= (1) U Y2 n= !0 Proof. It suffices to show π ! !(✓= ) Y=n =2 %0= ?(✓= )3✓= > 24 ! 0, = ! 1 (A.25) U Y2 n= !0 π ! !(✓= ) Y 2 =n =2 %0= ?(✓= )3✓= > 24 U Y2 n= !0 π ! ✓π ◆ ! 
(✓= ) Y 2 =n =2 !(✓= ) Y 2 =n =2  %0= ?(✓= )3✓= > 4 + %0= ?(✓= )3✓= > 4 U Y2 n= \F= !0 F=2 !0 1n 2/ Using lemma B.0.4 with Y = Yn = and ⇠= = 4 = = = , π p 2Yn = e [] (D, F= , ||.|| 2 )3D Y 2 n =2 /8 p . n= Y = ((! = + 1) log = + (! = + 2) log ⇠= log Yn = ) p p p  Yn = $ (max( = (! = + 1) log = , = (! = + 2) log ⇠= , log n = )) q q p p  Yn = max(>(n = = 1 log =), $ (n = = 1 log =), $ ( log =))  Y 2 n =2 = where e [] (D, F= , ||.|| 2 ) is as in definition A.0.2. The first inequality in the third step follows because ! = ⇠ log =, = log = = >(= 1 n =2 ) and = log ⇠= = = 1 n =2 , log n =2  log =. The second inequality in the third step follows since (= 1 log =)/= = >(1) 108 By theorem 1 in [138], for some constant ⇠ > 0, we have π ! ! ! (✓ = ) 2 2 ! (✓= ) Y 2 =n =2 %0= ?(✓= )3✓= > 4 Y =n =  %0= sup >4 2 ✓= 2U Y n= \F= ! 0 ✓= 2U Y2 n= \F= !0  4 exp( ⇠Y 2 =n =2 ) = >(=n =2 ) (A.26) Using proposition B.0.7 with Y = 2Y, we have π 2=Y 2 n = 2 ?(✓= )3✓=  4 ✓= 2F=2 Therefore, using lemma A.0.7 with Y = 2Y 2 n =2 and Ỹ = Y 2 n =2 , we have ✓π ◆ !(✓= ) Y 2 =n =2 Y 2 =n =2 %0= ?(✓= )3✓= > 4 4 ! 0. (A.27) F=2 !0 Combining (A.26) and (A.27), (A.25) follows. Proposition A.0.17 Let ?(✓= ) = "+ # (µ= , diag( 2 = $ (=) and || ⇤ = $ (1). = ), || = || 1 = || 1 1. Let ! = = !, ? = = ? independent of =. If = log = = >(=) and ||µ= || 22 = >(=), then 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=) (A.28) 2. Let = log = = >(=n =2 ), ! = ⇠ log = and ||µ= || 22 = >(=n =2 ). There exists a neural network such Õ! = Œ ||[0 [✓=⇤ || 1 = >(=n =2 ), ||✓=⇤ || 22 = >(=n =2 ) and log( E=0 :̃ E= E!0==E+1 0 ⇤E 0= ) = $ (log =), then 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=n =2 ) (A.29) Proof. For any @ 2 Q= . π π 3KL (@, c(.|y= , X= )) = @(✓= ) log @(✓= )3✓= @(✓= ) log c(✓= |y= , X= )3✓= π π !(✓= ) ?(✓= ) = @(✓= ) log @(✓= )3✓= @(✓= ) log Ø 3✓= !(✓= ) ?(✓= )3✓= π π !(✓= ) !(✓= ) = 3KL (@, ?) log @(✓= )3✓= + log ?(✓= )3✓= π π !0 !0 !(✓= ) ! (✓= )  3KL (@, ?) 
+ log @(✓= )3✓= + log ?(✓= )3✓= (A.30) !0 !0 109 Since c ⇤ satisfies minimizes the KL-distance to c(.|y= , X= ) in the family Q= , therefore for any ^>0 %0= (3KL (c ⇤ , c(.|y= , X= )) > ^)  %0= (3KL (@, c(.|y= , X= )) > ^) (A.31) Proof of part 1. Note, = log = = >(=), ||`= || 22 = >(=), || = || 1 = $ (=) and || ⇤ = || 1 = $ (1). We p take @(✓= ) = "+ # (✓=⇤ , I = / =) where ✓=⇤ is defined next. For # 1, let [✓⇤# be a finite neural network approximation satisfying ||[✓=⇤ [0 || 1  Y/4. The existence of such a neural network is always guaranteed by [60]. Define ✓=⇤ same as ✓#⇤ for all the non zero coefficients and zeros for all non existent coefficients. Step 1 (a): Using proposition A.0.12, with n = = 1, we get for any a > 0, %0= (3KL (@, ?) > =a) = 0 (A.32) where the above step follows ||✓=⇤ || 22 = ||✓#⇤ || 22 = ||✓=⇤ || 22 = >(=). Step 1 (b): Next, note that π ✓ ◆ f([0 (x)) 1 f([0 (x)) 3KL (✓0 , ✓✓= ) = f([0 (x)) log + (1 f([0 (x))) log 3x f([✓= (x)) 1 f([✓= (x)) π ✓ ◆ x2[0,1] ?= 1 f([0 (x)) = f([0 (x))(f([✓= (x)) f([0 (x))) + log 3x (A.33) x2[0,1] ?= 1 f([✓= (x)) Since ||[0 [✓=⇤ || 1  Y/4, using proposition B.0.5 with n = = 1 and Y = Y π 3KL (✓0 , ✓✓= )@(✓= )3✓=  Y Õ! Œ! which follows by ||✓=⇤ || 22 = ||✓#⇤ || 22 = >(=) and log( E=0 :̃ E# E 0 =E+1 0 ⇤E 0 # ) = $ (log =). Therefore, by lemma A.0.5, ✓π ◆ !(✓= ) Y %0= log @(✓= )3✓= > =a  . (A.34) !0 a Step 1 (c): Since ||[0 [✓=⇤ || 1  Y/4, using proposition B.0.3 with n = = 1 and a = Y, π ?(✓= )3✓= exp( =Y) ✓= 2NY 110 Õ! Œ! which follows by ||✓=⇤ || 22 = ||✓#⇤ || 22 = >(=) and log( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ) = $ (log =). Therefore, using lemma A.0.6, we get ✓ π ◆ !(✓= ) 2Y %0= log ?(✓= )3✓= > =a  (A.35) !0 a Step 1 (d): From (A.31) and (A.30) we get %0= (3KL (c ⇤ , c(.|y= , X= )) > 3=a)  %0= (3KL (@, ?) > =a) ✓π ◆ ✓ π ◆ ! (✓= ) !(✓= ) 3Y + %0 = log @(✓= )3✓= > =a + %0 log = ?(✓= )3✓= > =a  (A.36) !0 !0 a where the last inequality is a consequence of (A.32), (A.34) and (A.35). 
Since Y is arbitrary, taking Y ! 0 completes the proof. Proof of part 2. Note, = log = = >(=n =2 ), ||`= || 22 = >(=n =2 ), log || = || 1 = $ (log =) and || =⇤ || 1 = Õ! = Œ $ (1). Let @(✓= ) = "+ # (✓=⇤ , I = /=2+23 ), 3 > 3 ⇤ where E=0 ⇤ :̃ E= E!0==E+1 0 ⇤E 0= = $ (= 3 ), 3 ⇤ > 0. We next define ✓=⇤ as follows: Let [✓=⇤ be the neural satisfying ||[✓=⇤ [0 || 1  Yn =2 /4 ||✓=⇤ || 22 = >(=n =2 ) The existence of such a neural network is guaranteed since ||[✓=⇤ [0 || 1 = >(n =2 ). Step 2 (a): Since ||✓=⇤ || 22 = >(=n =2 ), by proposition A.0.12, %0= (3KL (@, ?) > a=n =2 ) = 0 (A.37) Õ! = Œ! = Step 2 (b): Since ||[✓=⇤ [0 || 1  Yn =2 /4, ||✓=⇤ || 22 = >(=n =2 ) and ( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ) log = = >(=n =2 ), by proposition B.0.5, π 3KL (✓0 , ✓✓= )@(✓= )3✓=  Yn =2 Therefore, by lemma A.0.5, ✓π ◆ !(✓= ) 2 Y %0= log @(✓= )3✓= > a=n =  . (A.38) !0 a 111 Õ! = Œ! = Step 2 (c): Since ||[✓=⇤ [0 || 1  Yn =2 /4, ||✓=⇤ || 22 = >(=n =2 ) and log( E=0 :̃ E= E 0 =E+1 0 ⇤E 0= ) = $ (log =), by proposition B.0.3, π ?(✓= )3✓= exp( Y=n =2 ) ✓= 2N Yn =2 Therefore, using lemma A.0.6, we get ✓ π ◆ !(✓= ) 2 2Y %0 log = @(✓= )3✓= > a=n =  (A.39) !0 a Step 2 (d): From (A.31) and (A.30) we get ⇣ ⌘ %0= (3KL (c ⇤ , c(.|y= , X= )) > 3a=n =2 )  %0= 3KL (@, ?) > a=n =2 ✓π ◆ ✓ π ◆ ! (✓= ) 2 !(✓= ) 2 3Y + %0 = log @(✓= )3✓= > a=n = + %0 log = ?(✓= )3✓= > a=n =  (A.40) !0 !0 a where the last inequality is a consequence of (A.37), (A.38) and (A.39). Since Y is arbitrary, taking Y ! 0 completes the proof. Consistency of the variational posterior. Proof of Theorem 1. We assume Relation (B.13) holds with = and ⌫= are same as in (3.29). By assumptions (A1) and (A2), the prior parameters satisfy ||µ= || 22 = >(=), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), ⇤ = = 1/ =. Note = ⇠ =0 , 0 < 0 < 1 which implies = log = = >(=). By proposition A.0.17 part 1., 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=). 
(A.41) By step 1 (c) in the proof of proposition A.0.17 ⌫= = > %0= (=) (A.42) Since, = ⇠ =0 , = log = = >(= 1 ), 0 < 1 < 1. Using proposition A.0.16 with n = = 1, c ⇤ (UY2 ) = =Y 2 c ⇤ (UY2 ) log 2 + > %0= (1) = =Y 2 c ⇤ (UY2 ) + $ %0= (1) (A.43) 112 Thus, using (A.41), (A.42) and (A.43) in (B.13), we get =Y 2 c ⇤ (UY2 ) + $ %0= (1)  > %0= (=) + > %0= (=) =) c ⇤ (UY2 ) = > %0= (1) Proof of Theorem 2. We assume Relation (B.13) holds with = and ⌫= are same as in (3.29). Let = ⇠ =0 and n =2 ⇠ = X , 0 < X < 1 0. This implies = log = = >(=n =2 ). By assumptions (A1) and (A4), the prior parameters satisfy ||µ= || 22 = >(=n =2 ), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), = ⇤ = 1/ =. Also by assumption (A3), ’!= ÷ != ||[0 [✓=⇤ || 1 = >(n =2 ), ||✓=⇤ || 22 = >(=n =2 ), log( :̃ E= 0 ⇤E 0= ) = $ (log =) E=0 E 0 =E+1 By proposition A.0.17 part 2., 3KL (c ⇤ , c(.|y= , X= )) = > %0= (=n =2 ). (A.44) By step 2 (c) in the proof of proposition A.0.17 ⌫= = > %0= (=n =2 ) (A.45) Since = ⇠ =0 , = log = = >(= 1 n =2 ), 0 + X < 1 < 1. Using proposition A.0.16, it follows that c ⇤ (UYn2 = ) = Y 2 =n =2 c ⇤ (UYn2 = ) log 2 + > %0= (1) = Y 2 =n =2 c ⇤ (UYn 2 = ) + $ %0= (1) (A.46) Thus, using (A.44), (A.45) and (A.46) in (B.13), we get =Y 2 n =2 c ⇤ (UYn 2 = ) + $ %0= (1)  > %0= (=n =2 ) + > %0= (=n =2 ) =) c ⇤ (UYn 2 = ) = > %0= (1) Proof of Corollary 1. Ø Let ✓ˆ= (H, x) = ✓✓= (H, x)c ⇤ (✓= )3✓= . ✓π ◆ 3H ( ✓ˆ= , ✓0 ) = 3H ⇤ ✓✓= c (✓= )3✓= , ✓0 π  3H (✓✓= , ✓0 )c ⇤ (✓= )3✓= Jensen’s inequality π π ⇤ = 3H (✓✓= , ✓0 )c (✓= )3✓= + 3H (✓✓= , ✓0 )c ⇤ (✓= )3✓= UY U Y2  Y + > %0= (1) 113 Taking Y ! 0, we get 3H ( ✓ˆ= , ✓0 ) = > %0= (1). Let ✓π ◆ 1 ˆ [(x) =f f([✓= (x))c (✓= )3✓=⇤ (A.47) ˆ then, note that [(x)ˆ = log ✓✓ˆ= (1, x) (0,x) . = π ’ q 2 ˆ 23H ( ✓= , ✓0 ) =2 2 ✓ˆ= (H, x)✓0 (H, x)3x x2[0,1] ? H2{0,1} π ’ 1 ˆ x) log(1+4 [ˆ ( x) )+H[0 ( x) log(1+4 [0 ( x) )} =2 2 4 { 2 ( H[( 3x x2[0,1] ? H2{0,1} π ⇣p p ⌘ =2 2 f([0 (x))f( [(x)) ˆ + (1 f([0 (x)))(1 ˆ f( [(x))) 3x π q x2[0,1] ? 
p p 2 2 1 ( f([0 (x)) ˆ f( [(x))) 2 3x π π x2[0,1] ? p p 2 1 2 ( f([0 (x)) ˆ f( [(x))) 3x (f([0 (x)) ˆ f( [(x))) 3x x2[0,1] ? 4 x2[0,1] ? (A.48) p In the above equation, the sixth and the seventh step hold because 1 G  1 G/2 and | ? 1 ?2 |  p p p p p p | ? 1 + ? 2 || ? 1 ? 2 |  2| ? 1 ? 2 | respectively. The fifth step holds because ⇣p p ⌘2 p ?1 ?2 + (1 ? 1 )(1 ?2) = ?1 ?2 + 1 ?1 ?2 + ? 1 ? 2 (1 ? 1 ) (1 ? 2 ) p p p p 2  ?1 ?2 + 1 ?1 ? 2 + ? 1 ? 2 = 1 ( ?1 ?2) By (A.48) and Cauchy Schwartz inequality, π ✓π ◆ 1/2 2 |f([0 (x)) f( [(x))|3x ˆ  (f([0 (x)) ˆ f( [(x))) 3x x2* [0,1] ? x2[0,1] ? p  2 23H ( ✓ˆ= , ✓0 ) = > %0= (1) (A.49) The proof follows in lieu (3.33). Proof of Corollary 2. We assume Relation (B.13) holds with = and ⌫= are same as in (3.29). Let = ⇠ =0 and n =2 ⇠ = X , 0 < X < 1 0. This implies = log = = >(=n =2 ). 114 Also, = log = = >(= 1 n =2 ), 0 + X < 1 < 1. This implies = log = = >(= 1 (n =2 ) ^ ), 0  ^  1. Thus, using proposition A.0.16 with n = = n =: , we get c ⇤ (UYn 2 = ^) = Y 2 =n =2^ c ⇤ (UYn2 ^) = log 2 + > %0= (1) = Y 2 =n =2^ c ⇤ (UYn 2 : ) + $ %0 (1) = (A.50) = This together with (A.44), (A.45) and (B.13) implies 2 2^ c ⇤ (UYn2 ^ ) = > % = (n = = 0 ) Ø Let ✓ˆ= (H, x) = ✓✓= (H, x)c ⇤ (✓= )3✓= . π π 3H ( ✓ˆ= , ✓0 )  ⇤ 3H (✓✓= , ✓0 )c (✓= )3✓= + 3H (✓✓= , ✓0 )c ⇤ (✓= )3✓= U Y n=^ U Y2 n ^ =  Yn =^ + > %0= (n =2 2^ ) Dividing by n =^ on both sides we get 1 3H ( ✓ˆ= , ✓0 ) = > %0= (n =2 3^ ) + > %0= (1) = > %0= (1), 0  ^  2/3. n =^ By (A.49), for every 0  ^  2/3, π 1 1 p |f([0 (x)) ˆ f( [(x))|3x  2 23H ( ✓ˆ= , ✓0 ) = > %0= (1). n =^ x2[0,1] ?= n =^ The proof follows in lieu of (3.33). Consistency of the true posterior. From (4.9), note that Ø Ø ! (✓= ) ?(✓= )3✓= (!(✓= )/! 0 ) ?(✓= )3✓= c(UY2 |y= , X= ) = Ø Y = ØY U2 U2 (A.51) ! (✓= ) ?(✓= )3✓= (!(✓= )/! 0 ) ?(✓= )3✓= Theorem A.0.18 Suppose conditions of theorem 3.4.1 hold. Then, 1. ⇣ ⌘ =Y 2 /2 %0= c(UY2 |y= , X= )  24 ! 1, = ! 1 2. p %0= (|'(⇠) ˆ '(⇠ Bayes )|  8 2Y) ! 1, = ! 
1 115 Proof. By assumptions (A1) and (A2), the prior parameters satisfy ||µ= || 22 = >(=), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), ⇤ = = 1/ =. Note = ⇠ =0 , 0 < 0 < 1 which implies = log = = >(=). Thus, the conditions of proposition B.0.3 hold with n = = 1. ✓π ◆ ✓ π ◆ ! (✓= ) ! (✓= ) %0= ?(✓= )3✓=  4 =a  %0= log ?(✓= )3✓= > =a ! 0, = ! 1 (A.52) !0 !0 where the above convergence follows from (A.35) in step 1 (c) in the proof of proposition A.0.17. Since = log = = >(= 1 ), 0 < 1 < 1, the conditions of proposition A.0.16 hold with n = = 1. ✓π ◆ !(✓= ) =Y 2 = %0 ?(✓= )3✓= 24 ! 0, = ! 1 (A.53) U Y2 !0 where the last equality follows from (A.25) with n = = 1 in the proof of proposition A.0.16. Using (A.52) and (A.53) with (A.51), we get ⇣ ⌘ =(Y 2 a) %0= c(UY2 |y= , X= ) 24 ! 0, = ! 1 Take a = Y 2 /2 to complete the proof. Mimicking the steps in the proof of corollary 1, π 3H ( ✓ˆ= , ✓0 )  3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= Jensen’s inequality π π = 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= + 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= UY U Y2 =Y 2 /2  Y + 24  2Y, with probability tending to 1 as = ! 1 where the second last inequality is a consequence of part 1. in theorem A.0.18. The remaining part of the proof follows by (A.49) and (3.33). Theorem A.0.19 Suppose conditions of theorem 3.4.2 hold. Then, 1. ⇣ ⌘ =n =2 Y 2 /2 %0= c(UYn 2 = |y= , X= )  24 ! 1, = ! 1 2. p %0= (|'(⇠) ˆ '(⇠ Bayes )|  8 2Yn = ) ! 1, = ! 1 116 Proof. By assumptions (A1) and (A4), the prior parameters satisfy ||µ= || 22 = >(=n =2 ), log || = || 1 = $ (log =), || ⇤ = || 1 = $ (1), ⇤ = = 1/ =. Also by assumption (A3), ’!= ÷ != ||[0 [✓=⇤ || 1 = >(n =2 ), ||✓=⇤ || 22 = >(=n =2 ), log( :̃ E= 0 ⇤E 0= ) = $ (log =) E=0 E 0 =E+1 Note = ⇠ =0 , 0 < 0 < 1 and n = ⇠ = X , 0 < X < 1 0, thus = log = = >(=n =2 ). Thus, the conditions of proposition B.0.3 hold. ✓π ◆ ✓ π ◆ ! (✓= ) =n =2 a ! (✓= ) 2 %0= ?(✓= )3✓=  4  %0= log ?(✓= )3✓= > =n = a ! 0, = ! 
1 !0 !0 (A.54) where the above convergence follows from (A.39) in step 2 (c) in the proof of proposition A.0.17. Also, since = log = = >(= 1 n =2 ), 0 + X < 1 < 1. Thus conditions of proposition A.0.16 hold. π ! ! (✓ = ) 2 2 %0= ?(✓= )3✓= 24 =n = Y ! 0, = ! 1 (A.55) 2 U Y n= ! 0 where the last equality follows from (A.25) in the proof of proposition A.0.16. ⇣ 2 2 ⌘ Using (A.54) and (A.55) with (A.51), we get %0= c(UYn 2 |y , X ) = = = 24 =n = (Y a) ! 0, = ! 1. Take a = Y 2 /2 to complete the proof. Mimicking the steps in the proof of corollary 2, π 3H ( ✓ˆ= , ✓0 )  3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= Jensen’s inequality π π = 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= + 3H (✓✓= , ✓0 )c(✓= |y= , X= )3✓= U Y n= U Y2 n= 2=n =2 Y 2  Yn = + 24  2Yn = , with probability tending to 1 as = ! 1 where the second last inequality is a consequence of part 1. in theorem A.0.19 and the last inequality last equality follows since n = ⇠ = X . Dividing by n = on both sides we get n = 1 3H ( ✓ˆ= , ✓0 )  2Y, with probability tending to 1 as = ! 1 The remaining part of the proof follows by (A.49) and (3.33). 117 APPENDIX B SUPPLEMENT FOR LEARNING INTRINSIC DIMENSIONALITY OF FEATURE SPACE WITH VARIATIONAL BAYES NEURAL NETWORKS Proof of Lemmas Õ: = Õ: = Lemma B.0.1 Consider, [✓= ( x) = V0 + 9=1 V 9 k(W 9 > x) and [ ⇤ (x) = V⇤ + ✓= 0 9=1 V⇤9 k(W ⇤9 > x). If |V 9 V⇤9 |  n, 9 = 1, · · · , : = |W 9 > x W ⇤9 > x|  n, 9 = 1, · · · , : = , π |[✓= ( x) [✓=⇤ (x)|3x  2n (: = + || ⇤ || 1 ) x2[0,1] ?= Proof. This proof uses somes ideas in the proof of theorem 1 in [74]. Õ = Note that |[✓= ( x) [✓=⇤ (x)|  |V0 V0⇤ | + :9=1 |V 9 k(W >9 x) V⇤9 k(W ⇤9 > x)|. 
Let $u_j = \gamma_j^\top x$ and $r_j = \gamma_j^{*\top} x - \gamma_j^\top x$; then $|\eta_{\theta_n}(x) - \eta_{\theta_n^*}(x)|$ is bounded above by
\begin{align*}
|\beta_0 - \beta_0^*| + \sum_{j=1}^{k_n}\left|\frac{\beta_j e^{u_j}}{1+e^{u_j}} - \frac{\beta_j^* e^{u_j+r_j}}{1+e^{u_j+r_j}}\right|
&= |\beta_0 - \beta_0^*| + \sum_{j=1}^{k_n}\frac{\big|\beta_j(1+e^{u_j+r_j}) - \beta_j^*(1+e^{u_j})e^{r_j}\big|\,e^{u_j}}{(1+e^{u_j+r_j})(1+e^{u_j})}\\
&\le 2\sum_{j=0}^{k_n}|\beta_j - \beta_j^*| + \sum_{j=1}^{k_n}|\beta_j^*|\,|e^{r_j} - 1|.
\end{align*}
Since $|r_j| < \epsilon < 1$, thus $|1 - e^{r_j}| < 2|r_j| \le 2\epsilon$, and the proof follows.

Lemma B.0.2 For any two functions $\eta_0$ and $\eta_1$,
\[
h(x) = \sigma(\eta_0(x))(\eta_0(x) - \eta_1(x)) + \log(1 - \sigma(\eta_0(x))) - \log(1 - \sigma(\eta_1(x)))
\]
satisfies $|h(x)| \le 2|\eta_0(x) - \eta_1(x)|$.

Proof. Using $\sigma(u) = e^u/(1+e^u) \le 1$,
\begin{align*}
|h(x)| &\le |\sigma(\eta_0(x))|\,|\eta_0(x) - \eta_1(x)| + |\log(1 - \sigma(\eta_0(x))) - \log(1 - \sigma(\eta_1(x)))|\\
&\le |\eta_0(x) - \eta_1(x)| + \left|\log\left(1 + \sigma(\eta_0(x))\big(e^{\eta_1(x) - \eta_0(x)} - 1\big)\right)\right| \le 2\,|\eta_0(x) - \eta_1(x)|,
\end{align*}
where the proof of the last step is as follows. Let $p = \sigma(\eta_0(x))$, so that $0 \le p \le 1$, and let $r = \eta_1(x) - \eta_0(x)$; then
\[
\left|\log\left(1 + \sigma(\eta_0(x))\big(e^{\eta_1(x) - \eta_0(x)} - 1\big)\right)\right| = \left|\log\big(1 + p(e^r - 1)\big)\right|.
\]
For $r > 0$: $|\log(1 + p(e^r - 1))| = \log(1 + p(e^r - 1)) \le \log(1 + (e^r - 1)) = r = |r|$.
For $r < 0$: $|\log(1 + p(e^r - 1))| = -\log(1 + p(e^r - 1)) \le -\log(1 + (e^r - 1)) = -r = |r|$.

Lemma B.0.3 Let $p(\theta_n|\lambda) = MVN(\mu_n, \Sigma_n)$, where $\Sigma_n$ is diagonal. Let $\mathcal N_a = \{\theta_n : d_{KL}(\ell_0, \ell_{\theta_n}) < a\}$, where
\[
d_{KL}(\ell_0, \ell_{\theta_n}) = \int_{x\in[0,1]^{p_n}} \left(\sigma(\eta_0(x))(\eta_0(x) - \eta_{\theta_n}(x)) + \log\frac{1 - \sigma(\eta_0(x))}{1 - \sigma(\eta_{\theta_n}(x))}\right)dx.
\]
Assume conditions (C1), (C2), (C3) and (C4) hold; then
\[
\int_{\theta_n\in\mathcal N_{a\epsilon_n^2}} p(\theta_n|\lambda)\,d\theta_n \ge e^{-an\epsilon_n^2}.
\]
Proof. This proof uses some ideas from the proof of Theorem 1 in [74]. By condition (C3), let $\eta_{\theta_n^*}(x)$ be the neural network such that $\|\eta_{\theta_n^*} - \eta_0\|_\infty \le a\epsilon_n^2/4$. Define
\[
\mathcal N'_{a\epsilon_n^2} = \left\{\theta_n : |\beta_j - \beta_j^*|,\ |\gamma_j^\top x - \gamma_j^{*\top} x| < \frac{a\epsilon_n^2}{8(k_n + \|\beta^*\|_1)},\ j = 1,\dots,k_n\right\}.
\]
For every $\theta_n\in\mathcal N'_{a\epsilon_n^2}$, by Lemma B.0.1, we have
\[
\int_{x\in[0,1]^{p_n}} |\eta_{\theta_n}(x) - \eta_{\theta_n^*}(x)|\,dx \le \frac{a\epsilon_n^2}{4}. \tag{B.1}
\]
For $\theta_n\in\mathcal N'_{a\epsilon_n^2}$, $\int_{x\in[0,1]^{p_n}} |\eta_{\theta_n}(x) - \eta_0(x)|\,dx \le a\epsilon_n^2/2$, which with Lemma B.0.2 implies $d_{KL}(\ell_0, \ell_{\theta_n}) \le a\epsilon_n^2$.
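The bound in Lemma B.0.2 (and its copy, Lemma A.0.9) can be checked numerically. The sketch below is an illustrative verification only: `eta0` and `eta1` range over an arbitrary grid, and the `log1p_exp` helper is our own numerically stable device for computing $\log(1+e^u)$.

```python
import math

def sigmoid(u):
    # numerically stable logistic function sigma(u) = e^u / (1 + e^u)
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    return math.exp(u) / (1.0 + math.exp(u))

def log1p_exp(u):
    # stable evaluation of log(1 + e^u)
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def h(eta0, eta1):
    # integrand of the KL decomposition in Lemma B.0.2:
    # sigma(eta0)(eta0 - eta1) + log(1 - sigma(eta0)) - log(1 - sigma(eta1))
    return sigmoid(eta0) * (eta0 - eta1) - log1p_exp(eta0) + log1p_exp(eta1)

grid = [i / 10.0 for i in range(-80, 81)]
worst = max(abs(h(a, b)) - 2.0 * abs(a - b) for a in grid for b in grid)
print(worst <= 1e-12)  # the bound |h| <= 2|eta0 - eta1| holds on the grid
```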
Thus, for every ✓= 2 N 0 2 , ✓= 2 Nan =2 which implies π π an = ?(✓= | )3✓= ?(✓= | )3✓= ✓= 2Na n 2 ✓= 2N 0 = a n=2 Let U 9 = x> >W 9 and U⇤9 = x> W ⇤9 , then ` 9 U = x> >µ 9W and f 92U = x> >⌃ 9W x. Also, using |G B |  1, U⇤9 2  x> W ⇤9 W ⇤9 > x  || ⇤ 2 9 || 1 `29 U = x> > µ 9 W µ>9 W x  || > µ 9 W || 21 1 f 92U = x> > ⌃ 9W x ⇤ || 2 || x|| 22 f 92U  || 2 9 W || 1 || x|| 22 || 9W 1 119 Let X = an =2 /(8(: = + || ⇤ || 1 )), then π ÷:= π V⇤9 +X (V 9 ` 9 V ) 2 π U⇤9 +X ( U 9 ` 9 U)2 1 1 q q 2f 2 2f 2 ?(✓= | )3✓= = 4 9V 3V 9 4 9U 3U 9 ✓= 2N 0 a n=2 9=1 V⇤9 X 2cf 92V U⇤9 X 2cf 92U ÷:= 2X ( Ṽ ` 9 V ) 2 2X ( Ũ ` 9 U ) 2 q q 2f 2 2f 2 = 4 9V 4 9U , Ṽ 9 2 V⇤9 ± X, Ũ 9 2 U⇤9 ± X 2cf 9 V 2 2cf 9 U 2 9=1 !! Õ := 2 ( Ṽ 9 ` 9 V ) 2 ( Ũ 9 ` 9 U ) 2 log 2 log X+log f 9 V +log f 9 U + + 9=1 c 2f 2 2f 2 =4 9V 9U (B.2) where the second last equality holds by mean value theorem. Note that Ṽ 9 2 V⇤9 ± 1 and Ũ 9 2 U⇤9 ± 1 since X ! 0, therefore ( Ṽ 9 ` 9 V)2 max((V⇤9 `9V 1) 2 , (V⇤9 ` 9 V + 1) 2 ) (V⇤9 ` 9 V)2 1   + , 2f 92V 2f 92V f 92V f 92V ( Ũ 9 ` 9 U)2 (U⇤9 ` 9 U)2 1  + 2f 92U f 92U f 92U Õ = ⇣ ⌘ which further implies :9=1 ( Ṽ ` 9 V ) 2 /(2f 92V ) + ( Ũ ` 9 U ) 2 /(2f 92U ) is bounded above by ’ : = V⇤ 2 9 ’ : = `2 9V ’:= 1 ’: = U⇤ 2 9 ’: = `2 9U ’ := 1 2 +2 + +2 +2 + 9=1 f 92V 9=1 f 92V 9=1 9 U f 2 9=1 9 U f 2 f 9=1 9 U 2 f2 9=1 9 U 1 ’ := ⇤ 2 . (|| || 1 + ||µ V || 21 )|| 1 2 V || 1 + (|| ⇤ 2 9 || 1 + || > µW 9 || 21 )||f 9 W1 || 21 = >(=n =2 ) (B.3) || x|| 22 9=1 where the last equality follows since || ⇤ || 21 , ||µ V || 21 , 1/|| x|| 22 = >(=n =2 ), Õ: = ⇤ 2 Õ: = > µ || 2 = $ (1) and || 1 1 9=1 ||W 9 || 1 , 9=1 || W9 1 V || 1 , sup 9=1,··· ,: = || 9 W || 1 = $ (1). ’ := 2 ’ := ( log 2 log X + log f 9 V + log f 9 U )  (2 log 8 + log(: = + || ⇤ || 1 )+ 9=1 c 9=1 log f 9 V + log f 9 U 2 log n = . 
: = (log : = + log || ⇤ || 1 + log || V || 1 2 log n = ) ’:= + log || 9 U || 1 = >(=n =2 ) (B.4) 9=1 120 where the last equality follows since : = log = = >(=n =2 ), log ||V⇤ || 1 = $ (log =), log || V || 1 = Õ = Õ = $ (log =), log n = = >(log =) and :9=1 log || 9 U || 1  : = log || x|| + : = :9=1 log || 9 W || 1 = $ (: = log =) = >(=n =2 ). Using (B.3) and (B.4) in (B.2), the proof follows. e= = { ✓ : ✓✓ (H, x), ✓= 2 F= } where ✓✓ (H, x) is given by p Lemma B.0.4 Let, F = = ⇣ ⇣ ⌘⌘ ✓✓= (H, x) = exp H[✓= ( x) log 1 + 4 [✓= ( x) (B.5) n o and F= = ✓= : |\ 9 |  ⇠= , 9 = 1, · · · , :̃ = where :̃ = = : = 3= + 2: = + 1 = $ (: = 3= ). Then, π p 2Y q p e [] (D, F= , ||.|| 2 )3D . Y : = 3= (log : = + (1/2) log ? = + 2 log ⇠= log Y) Y 2 /8 where e [] (D, F= , ||.|| 2 ) is the hellinger bracketing entropy of F̃= (see definition in 3 in [74]). Proof. This proof uses somes ideas in the proof of lemma 2 of [74]. In this proof, let ✓ = ✓= . Note, by lemma 4.1 in [104], # (Y, F= , ||.|| 1 )  (3⇠= /Y) : = . p e = ✓D✓ +(1 D) ✓ ( x, H). For ✓1 , ✓2 2 F= , let ✓(D) 1 2 q q ✓✓1 ( x, H) ✓✓2 ( x, H)  :̃ = sup e 9 ||✓1 m ✓/m\ ✓2 || 1  ( x, H)||✓1 ✓2 || 1 (B.6) 9=1,··· , :̃ = p p where the upper bound ( x, H) = :̃ = ? = ⇠= . This is because |m ✓/m\ e 9 |, the derivative of ✓ w.r.t. is bounded above by |m[✓ ( x)/m\ 9 | as shown below. ✓ [ ( x) ◆ 1/2 ✓ ◆ 1/2 m ✓e 1 m[✓ ( x) 4 ✓ 1  m\ 9 2 m\ 9 1+4 ✓ [ ( x ) 1 + 4 [✓ ( x) Thus, using 4 D /(1 + 4 D ), 1/(1 + 4 [✓ (x) )  1, we get 8 > > > e m✓ m[✓ (x) < 1, > \ 9 = VA for some A = 0, · · · , : = 2   > > > m\ 9 m\ 9 > |VA k 0 (WA> G) [ x] A 0 |, \ 9 = WAA 0 for some A = 0, · · · , : = , A 0 = 0, · · · , 3= : Õ ?= Õ ?= 2 1/2 ( Õ ?= 2 1/2 p Note, |VA |  ⇠= , |k 0 (D)|  1 and |[ G] A 0 | = | B=1 0 A B G 9 |  ( 0 B=1 0 A 0 B ) B=1 G B )  ?= since is orthonormal and |G B |  1. Hence the bound on ( x, H) follows. 121 In view of (B.6) and theorem 2.7.11 in [126] (also see theorem 3 in [74] for more details), we have p ! :̃ = p 3 :̃ = ? 
= ⇠=2 :̃ = ? = ⇠=2 # [] (Y, F e= , ||.|| 2 )  =) e [] (Y, F= , ||.|| 2 ) . :̃ = log 2Y Y where # [] and [] denote the bracketing number and bracketing entropy as in definition 3 of [74]. p Using, the proof of lemma 1 in [74] (equation (34)) with 3= = :̃ = and ⇠= = ? = ⇠=2 , we get π q p e Y [] (D, F= , ||.|| 2 )3D . Y : = 3= (log : = + (1/2) log ? = + 2 log ⇠= log Y) 0 π p 2Y π p 2Y =) e [] (D, F= , ||.|| 2 )3D  e [] (D, F= , ||.|| 2 )3D q Y 2 /8 0 . Y :̃ = (log :̃ = + (log ? = )/2 + 2 log ⇠= log Y) p Lemma B.0.5 Let @(✓= | ) ⇠ "+ # (m= , S= ) with ` 9 V = V⇤9 , f 9 V = 1/ =, m 9 W = W ⇤9 and S9W = 3 = /(=|| x|| 2 ) 2 . Define π 1 f([0 (x)) 3KL (✓0 , ✓✓= ) = f([0 (x))([0 (x) [✓= ( x)) + log 3x. x2[0,1] ?= 1 f([✓= ( x)) Suppose conditions (C1) and (C3) hold, then π 3KL (✓0 , ✓✓= )@(✓= | )3✓=  an =2 , 8a > 0 Ø Proof. Since 3KL (✓0 , ✓✓= ) is a KL-distance, 3KL (✓0 , ✓✓= ) > 0. We shall thus establish an upper Ø bound. By lemma B.0.2, 3KL (✓0 , ✓✓= )@(✓= | )3✓= is upper bounded by π 2 |[0 (x) [\ = ( x)|3x π π  |[0 (x) [\ ⇤= (x)|3x@(✓= | )3✓= + π π x2[0,1] ?= |[✓=⇤ (x) [\ = ( x)|3x@(✓= | )3✓= π π x2[0,1] ?= an =2  + |[✓=⇤ (x) [\ = ( x)|3x @(✓= | )3✓= 2 | {z } x2[0,1] ?= ⌘( ✓= ) 122 π π |[✓= ( x) [✓=⇤ (x)|3x  |V0 V0⇤ |3x x2[0,1] ?= x2[0,1] ?= ’:= π > ⇤> + |V 9 k( 9 x) V⇤9 k( 9 G)|3x 9=1 x2[0,1] ?= ’:= π  |V V0⇤ | + |V 9 k( > 9 x) V⇤9 k( > 9 x)|3x 9=1 x2[0,1] ?= ’:= π ⇤> + |V⇤9 k( > 9 x) V⇤9 k( 9 G)|3x 9=1 x2[0,1] ?= ’:=  |V 9 V⇤9 | π 9=0 ⇤ > ⇤ + || || 1 |k( 9 x) k( 9 >x)|3x x2[0,1] ?= Therefore, π ’ := ⌘(✓= )@(✓= | )3✓=  |V 9 V⇤9 |@(V 9 )3V 9 + π π 9=0 ⇤ || || 1 |k( >9 x) k( ⇤ 9 >x)|3x@( 9 )3 9 r ? 
π π x2[0,1] = := 2 ⇤ > ⇤> = + || || 1 |k( 9 x) k( 9 x)|3x@( 9 )3 9 = c x2[0,1] ?= (B.7) 123 Now, let " 9 = { : | >9 x ⇤ > x| 9  an =2 /(16|| ⇤ || 1 )}, then π π > ⇤> |k( 9 x) k( 9 x)|3x@( 9 )3 9 π π x2[0,1] ?= > ⇤> = |k( 9 x) k( 9 x)|3x@( 9 )3 9+ π π "9 x2[0,1] ?= > ⇤> |k( 9 x) k( 9 x)|3x@( 9 )3 9 " 29 x2[0,1] ?= an =2  + 2& 9 (" 29 ) (B.8) 8|| || 1 ⇤ Thus, combining (B.7) and (B.8) and using : = = >(=n =2 ), we get π an 2 an 2 ⌘(✓= )@(✓= | )3✓=  = + = + 2|| ⇤ || 1 & 9 (" 29 ) (B.9) 4 8 In the next steps, we deal with & 9 (" 29 ). Let X = an =2 /(16|| || ⇤1 ) ⇤> %(| > 9 x 9 x| > X) = %(|U 9 U⇤9 | X) (B.10) where U 9 = x> > 9 and U⇤9 = x> W ⇤9 . Note that U 9 U⇤9 ⇠ # (` 9 U , f 92U ) with ` 9 U = x> > W ⇤9 G > W ⇤9 = G > ( > )W ⇤9 and f 92U = (1/(=2 || x|| 22 ))|| x|| 22 = 1/=2 . Further note that since |G B |  1, ’ ?= ’?= |` 9 U | = |G B ||[( > ) ⇤ 9 ]B|  | [( > ) ⇤ 9 ]B| = ||( > ) ⇤ || 1 = >(= 1 ) = >(X) B=1 B=1 where the last equality holds since X ⇠ n =2 /|| ⇤ || 1 = 1 because || ⇤ || 2 1 = >(=n =2 ). This also p implies, (X ± ` 9 U )/f 9 U ⇠ (=n =2 )/|| ⇤ || 1 =n = ! 1 which implies ⇤> %(| > 9 x 9 x| > X) = 1 ((X ` 9 U )/f 9 U )) + 1 ((X + ` 9 U )/f 9 U ) ⇠ (f 9 U /(X ` 9 U ))q((X ` 9 U )/f 9 U )+ (B.11) =n =2 (f 9 U /(X + ` 9 U ))q((X + ` 9 U )/f 9 U ) . 4 where the asymptotic equivalence in the second step is a consequence of Mill’ratio. Thus, using the above relation in (B.9), π ✓ 2 ◆ n n2 =n =2 an =2 ⌘(✓= )@(✓= | )3✓=  a = + = + 2|| ⇤ || 1 4  4 8 2 p where the last equality holds || ⇤ || 1 = >( =n =2 ). 124 p Lemma B.0.6 Let @(✓= | ) ⇠ "+ # (m= , S= ) with < 9 V = V⇤9 , B 9 V = 1/ =, m 9 W = W ⇤9 and S9W = 3 = /(=|| x|| 2 ) 2 . Let, ?(✓= | ) = "+ # (µ= , ⌃= ), where ⌃= is diagonal. Suppose conditions (C1), (C2), (C3) and (C4) hold, then 3KL (@, ?) = >(=n =2 ) Proof: With :̃ = ⇠ : = + 1 + : = (3= + 1), here 3KL (@, ?) 
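Mill's ratio, invoked in (B.11) above and repeatedly throughout these proofs, says that the Gaussian upper tail $1 - \Phi(t)$ is bounded by $\phi(t)/t$ and is asymptotically equivalent to it as $t\to\infty$. A quick numerical illustration (a sketch using only the standard library; the grid of $t$ values is arbitrary):

```python
import math

def phi(t):
    # standard normal density
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def upper_tail(t):
    # 1 - Phi(t) via the complementary error function
    return 0.5 * math.erfc(t / math.sqrt(2.0))

ok = True
for t in [0.5, 1.0, 2.0, 5.0, 10.0]:
    tail = upper_tail(t)
    ok &= tail <= phi(t) / t                  # Mill's-ratio upper bound
    ok &= tail >= phi(t) * t / (1.0 + t * t)  # matching lower bound

# asymptotic equivalence: 1 - Phi(t) ~ phi(t)/t for large t
ratio = upper_tail(10.0) / (phi(10.0) / 10.0)
print(ok and abs(ratio - 1.0) < 0.02)
```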
can be simplified as

$$d_{KL}(q, p) = \sum_{j=0}^{k_n}\left(\log(\sqrt{n}\,\sigma_{j\beta}) + \frac{1}{2n\sigma_{j\beta}^2} + \frac{(\beta_j^* - \mu_{j\beta})^2}{2\sigma_{j\beta}^2}\right) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\left(\log(n\|x\|_2\,\sigma_{jj'\gamma}) + \frac{1}{2(n\|x\|_2)^2\sigma_{jj'\gamma}^2} + \frac{(\hat{\gamma}_{jj'}^* - \mu_{jj'\gamma})^2}{2\sigma_{jj'\gamma}^2}\right) - \frac{\tilde{k}_n}{2}$$

$$\lesssim \tilde{k}_n\left(\log n + \log\|x\|_2 + \log\|\sigma_\beta\|_\infty + \frac{1}{(n\|x\|_2)^2}\right) + \sum_{j=1}^{k_n}\log\|\sigma_{j\gamma}\|_\infty + \|\sigma_\beta^{-2}\|_\infty\big(\|\beta^*\|_2^2 + \|\mu_\beta\|_2^2\big) + \sum_{j=1}^{k_n}\|\sigma_{j\gamma}^{-2}\|_\infty\big(\|\hat{\gamma}_j^*\|_2^2 + \|\mu_{j\gamma}\|_2^2\big) - \frac{\tilde{k}_n}{2} = o(n\epsilon_n^2), \tag{B.12}$$

where the last equality holds since $\tilde{k}_n\log n = o(n\epsilon_n^2)$; $\log\|x\|_2$, $\log\|\sigma_\beta\|_\infty$ and $\log\|\sigma_{j\gamma}\|_\infty$ are all $O(\log n)$; and $1/\|x\|_2^2 = o(n\epsilon_n^2)$. Since $\|\cdot\|_2 \le \|\cdot\|_1$, we also have $\|\beta^*\|_2^2,\ \|\mu_\beta\|_2^2 = o(n\epsilon_n^2)$, $\sum_{j=1}^{k_n}\|\hat{\gamma}_j^*\|_2^2 = O(1)$ and $\sum_{j=1}^{k_n}\|\mu_{j\gamma}\|_2^2 = O(1)$, as a consequence of which the proof follows.

Lemma B.0.7 Let $p(\theta_n|\lambda) = \mathrm{MVN}(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$, where $\boldsymbol{\Sigma}_n$ is diagonal. Suppose conditions (C1) and (C2) hold; then

$$\int_{\theta_n \in \mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n \le e^{-\nu n\epsilon_n^2}, \quad \forall\,\nu > 0,$$

where $\mathcal{F}_n = \{\theta_n : |\theta_j| \le C_n,\ j = 1, \cdots, \tilde{k}_n\}$ with $C_n = e^{\varepsilon n\epsilon_n^2/\tilde{k}_n}$.

Proof: This proof uses some ideas from the proof of Theorem 1 in [74]. Let $\mathcal{F}_{jn} = \{\theta_j : |\theta_j| \le C_n\}$, which implies $\mathcal{F}_n = \cap_{j=1}^{\tilde{k}_n}\mathcal{F}_{jn}$ and hence $\mathcal{F}_n^c = \cup_{j=1}^{\tilde{k}_n}\mathcal{F}_{jn}^c$. Note that $\int_{\theta_n\in\mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n$ is bounded above by $\sum_{j=1}^{\tilde{k}_n} P(\mathcal{F}_{jn}^c)$, which equals

$$\sum_{j=0}^{k_n}\int_{(-\infty,-C_n)\cup(C_n,\infty)} \frac{1}{\sqrt{2\pi\sigma_{j\beta}^2}}\, e^{-(\beta_j - \mu_{j\beta})^2/(2\sigma_{j\beta}^2)}\,d\beta_j + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\int_{(-\infty,-C_n)\cup(C_n,\infty)} \frac{1}{\sqrt{2\pi\sigma_{jj'\gamma}^2}}\, e^{-(\gamma_{jj'} - \mu_{jj'\gamma})^2/(2\sigma_{jj'\gamma}^2)}\,d\gamma_{jj'}$$

$$= \sum_{j=0}^{k_n}\Big(2 - \Phi\big((C_n - \mu_{j\beta})/\sigma_{j\beta}\big) - \Phi\big((C_n + \mu_{j\beta})/\sigma_{j\beta}\big)\Big) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\Big(2 - \Phi\big((C_n - \mu_{jj'\gamma})/\sigma_{jj'\gamma}\big) - \Phi\big((C_n + \mu_{jj'\gamma})/\sigma_{jj'\gamma}\big)\Big)$$

$$\sim \sum_{j=0}^{k_n}\left(\frac{\sigma_{j\beta}}{C_n - \mu_{j\beta}}\,\phi\Big(\frac{C_n - \mu_{j\beta}}{\sigma_{j\beta}}\Big) + \frac{\sigma_{j\beta}}{C_n + \mu_{j\beta}}\,\phi\Big(\frac{C_n + \mu_{j\beta}}{\sigma_{j\beta}}\Big)\right) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\left(\frac{\sigma_{jj'\gamma}}{C_n - \mu_{jj'\gamma}}\,\phi\Big(\frac{C_n - \mu_{jj'\gamma}}{\sigma_{jj'\gamma}}\Big) + \frac{\sigma_{jj'\gamma}}{C_n + \mu_{jj'\gamma}}\,\phi\Big(\frac{C_n + \mu_{jj'\gamma}}{\sigma_{jj'\gamma}}\Big)\right),$$

where the asymptotic equivalence holds due to Mills' ratio and the facts $\mu_{j\beta}, \mu_{jj'\gamma} = o(\sqrt{n\epsilon_n^2})$ and $\sigma_{j\beta}, \sigma_{jj'\gamma} = O(n^r)$, $r > 0$, which imply

$$\frac{C_n \pm \mu}{\sigma} \gtrsim \frac{C_n - \sqrt{n\epsilon_n^2}}{M n^r} \sim \frac{C_n}{n^r} = \exp\left(\frac{n\epsilon_n^2}{\tilde{k}_n}\Big(\varepsilon - \frac{r\tilde{k}_n\log n}{n\epsilon_n^2}\Big)\right) \to \infty,$$

since $\tilde{k}_n\log n = o(n\epsilon_n^2)$. Further,

$$\int_{\theta_n\in\mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n \lesssim \sum_{j=0}^{k_n}\Big(e^{-(C_n - \mu_{j\beta})^2/(2\sigma_{j\beta}^2)} + e^{-(C_n + \mu_{j\beta})^2/(2\sigma_{j\beta}^2)}\Big) + \sum_{j=1}^{k_n}\sum_{j'=1}^{d_n}\Big(e^{-(C_n - \mu_{jj'\gamma})^2/(2\sigma_{jj'\gamma}^2)} + e^{-(C_n + \mu_{jj'\gamma})^2/(2\sigma_{jj'\gamma}^2)}\Big)$$

$$\sim \tilde{k}_n \exp\left(-\exp\left(\frac{n\epsilon_n^2}{\tilde{k}_n}\Big(\varepsilon - \frac{r\tilde{k}_n\log n}{n\epsilon_n^2}\Big)\right)\right) \le e^{-\nu n\epsilon_n^2},$$

since $(n\epsilon_n^2/\tilde{k}_n)\big(\varepsilon - r\tilde{k}_n\log n/(n\epsilon_n^2)\big) \ge \log(\nu n\epsilon_n^2)$ eventually, when $\tilde{k}_n\log n = o(n\epsilon_n^2)$ and $n\epsilon_n^2 \to \infty$.

Proof of Theorem 1. This proof uses some ideas from the proofs of Lemmas 3 and 5 in [74].
Let $\mathcal{D}_n = (\boldsymbol{y}_n, x_1, \cdots, x_n)$. Then

$$d_{KL}(q^*, \pi(\cdot|\mathcal{D}_n)) = \int_{\mathcal{A}_{\varepsilon\epsilon_n}} q^*(\theta_n)\log\frac{q^*(\theta_n)}{\pi(\theta_n|\mathcal{D}_n)}\,d\theta_n + \int_{\mathcal{A}_{\varepsilon\epsilon_n}^c} q^*(\theta_n)\log\frac{q^*(\theta_n)}{\pi(\theta_n|\mathcal{D}_n)}\,d\theta_n$$

$$= -q^*(\mathcal{A}_{\varepsilon\epsilon_n})\int_{\mathcal{A}_{\varepsilon\epsilon_n}} \frac{q^*(\theta_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n})}\log\frac{\pi(\theta_n|\mathcal{D}_n)}{q^*(\theta_n)}\,d\theta_n - q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c} \frac{q^*(\theta_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)}\log\frac{\pi(\theta_n|\mathcal{D}_n)}{q^*(\theta_n)}\,d\theta_n$$

$$\ge -q^*(\mathcal{A}_{\varepsilon\epsilon_n})\log\frac{\pi(\mathcal{A}_{\varepsilon\epsilon_n}|\mathcal{D}_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n})} - q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log\frac{\pi(\mathcal{A}_{\varepsilon\epsilon_n}^c|\mathcal{D}_n)}{q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)}, \quad \text{by Jensen's inequality}$$

$$= q^*(\mathcal{A}_{\varepsilon\epsilon_n})\log q^*(\mathcal{A}_{\varepsilon\epsilon_n}) + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c) - q^*(\mathcal{A}_{\varepsilon\epsilon_n})\log\pi(\mathcal{A}_{\varepsilon\epsilon_n}|\mathcal{D}_n) - q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log\pi(\mathcal{A}_{\varepsilon\epsilon_n}^c|\mathcal{D}_n)$$

$$\ge -q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\log\pi(\mathcal{A}_{\varepsilon\epsilon_n}^c|\mathcal{D}_n) - \log 2, \quad \text{since } x\log x + (1-x)\log(1-x) \ge -\log 2$$

$$= -q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\left(\log\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n - \log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\right) - \log 2.$$

Let $E_{1n} = -\log\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c} (L(\theta_n|\cdot)/L_0)\,p(\theta_n|\lambda)\,d\theta_n$, $E_{2n} = \log\int (L(\theta_n|\cdot)/L_0)\,p(\theta_n|\lambda)\,d\theta_n$ and $E_{3n} = \int \log(L_0/L(\theta_n|\cdot))\,q(\theta_n|\lambda)\,d\theta_n$. Then, for any $q \in \mathcal{Q}_n$,

$$q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,E_{1n} \le d_{KL}(q, \pi(\cdot|\mathcal{D}_n)) + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,E_{2n} + \log 2$$

$$= d_{KL}(q, p(\cdot|\lambda)) - \int\log\frac{L(\theta_n|\cdot)}{L_0}\,q(\theta_n|\lambda)\,d\theta_n + \log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,E_{2n} + \log 2 \tag{B.13}$$

$$= d_{KL}(q, p(\cdot|\lambda)) + E_{3n} + \big(1 + q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\big)E_{2n} + \log 2 \le o(n\epsilon_n^2) + E_{3n} + 2E_{2n} + \log 2, \tag{B.14}$$

where the last inequality holds due to Lemma B.0.6. We show three main things: $E_{3n} = o_{P_0^n}(n\epsilon_n^2)$, $E_{2n} = o_{P_0^n}(n\epsilon_n^2)$, and $E_{1n} \ge n\varepsilon^2\epsilon_n^2 - \log 2$ with $P_0^n$-probability tending to one. This completes the proof, because then

$$q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c)\,n\varepsilon^2\epsilon_n^2 \le o(n\epsilon_n^2) + o_{P_0^n}(n\epsilon_n^2) + o_{P_0^n}(n\epsilon_n^2) + O_{P_0^n}(1) \implies q^*(\mathcal{A}_{\varepsilon\epsilon_n}^c) = o_{P_0^n}(1).$$

Handling $E_{3n}$: $P_0^n(|E_{3n}| > \varepsilon n\epsilon_n^2)$ can be bounded above using Markov's inequality as

$$\frac{1}{\varepsilon n\epsilon_n^2}\,E_0^n\left(\Big|\int q(\theta_n|\lambda)\log\frac{L_0}{L(\theta_n|\cdot)}\,d\theta_n\Big|\right) \le \frac{1}{\varepsilon n\epsilon_n^2}\,E_0^n\left(\int q(\theta_n|\lambda)\,\Big|\log\frac{L_0}{L(\theta_n|\cdot)}\Big|\,d\theta_n\right) = \frac{1}{\varepsilon n\epsilon_n^2}\int\!\!\int q(\theta_n|\lambda)\,\Big|\log\frac{L_0}{L(\theta_n|\cdot)}\Big|\,L_0\,d\mu\,d\theta_n$$

$$\le \frac{1}{\varepsilon n\epsilon_n^2}\int q(\theta_n|\lambda)\big(d_{KL}(L_0, L(\theta_n|\cdot)) + 2/e\big)\,d\theta_n = \frac{1}{\varepsilon n\epsilon_n^2}\int q(\theta_n|\lambda)\big(n\,d_{KL}(\ell_0, \ell_{\theta_n}) + 2/e\big)\,d\theta_n \le \nu/\varepsilon + o(1),$$

where the third step follows from Lemma 4 in [74] and the fourth step follows from Lemma B.0.5. Since $\nu$ is arbitrary, $E_{3n} = o_{P_0^n}(n\epsilon_n^2)$. We next show $E_{2n} = o_{P_0^n}(n\epsilon_n^2)$.
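Before turning to $E_{2n}$, note that the bound (B.14) leans on Lemma B.0.6, whose core ingredient is the closed-form KL divergence between mean-field (diagonal) Gaussians. A minimal sketch of that formula, with made-up numbers purely for illustration (none of the values come from the thesis):

```python
import math

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    # Coordinate-wise KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ):
    #   log(sig_p / sig_q) + (sig_q^2 + (mu_q - mu_p)^2) / (2 sig_p^2) - 1/2,
    # summed over coordinates, as expanded at the start of (B.12).
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sig_q, mu_p, sig_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp)**2) / (2.0 * sp**2) - 0.5
    return total

# Hypothetical example: a variational factor centred at beta* with standard
# deviation 1/sqrt(n), against a unit-variance prior centred at zero.
n = 10_000
print(kl_diag_gauss([0.3, -0.1], [n ** -0.5] * 2, [0.0, 0.0], [1.0, 1.0]))
```

Each summand vanishes when the two factors coincide and grows with the mean shift and with the log of the variance ratio; Lemma B.0.6 bounds exactly these three ingredients term by term to conclude $d_{KL}(q, p) = o(n\epsilon_n^2)$.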
Handling $E_{2n}$: $P_0^n(|E_{2n}| > \varepsilon n\epsilon_n^2)$ can be bounded above using Markov's inequality as

$$\frac{1}{\varepsilon n\epsilon_n^2}\,E_0^n\left(\Big|\log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\Big|\right) = \frac{1}{\varepsilon n\epsilon_n^2}\int\Big|\log\int\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\Big|\,L_0\,d\mu \le \frac{1}{\varepsilon n\epsilon_n^2}\big(d_{KL}(L_0, L^*) + 2/e\big).$$

With $L^* = \int L(\theta_n|\cdot)\,p(\theta_n|\lambda)\,d\theta_n$, the last inequality follows from Lemma 4 in [74]. Further,

$$d_{KL}(L_0, L^*) = E_0^n\big(\log(L_0/L^*)\big) = E_0^n\left(\log\Big(L_0\Big/\int L(\theta_n|\cdot)\,p(\theta_n|\lambda)\,d\theta_n\Big)\right) \le E_0^n\left(\log\Big(L_0\Big/\int_{\mathcal{N}_{\nu\epsilon_n}} L(\theta_n|\cdot)\,p(\theta_n|\lambda)\,d\theta_n\Big)\right)$$

$$\le -\log\int_{\mathcal{N}_{\nu\epsilon_n}} p(\theta_n|\lambda)\,d\theta_n + \frac{\int_{\mathcal{N}_{\nu\epsilon_n}} d_{KL}(L_0, L(\theta_n|\cdot))\,p(\theta_n|\lambda)\,d\theta_n}{\int_{\mathcal{N}_{\nu\epsilon_n}} p(\theta_n|\lambda)\,d\theta_n} \le \log e^{\nu n\epsilon_n^2} + \nu n\epsilon_n^2 = 2\nu n\epsilon_n^2,$$

where the second step follows from Jensen's inequality and the last step follows from Lemma B.0.3. Thus $E_{2n} = o_{P_0^n}(n\epsilon_n^2)$. Lastly, we show $E_{1n} \ge n\varepsilon^2\epsilon_n^2 - \log 2$ with $P_0^n$-probability tending to one.

Handling $E_{1n}$: Recall $\mathcal{F}_n = \{\theta_n : |\theta_j| \le C_n\}$ with $C_n = e^{n\varepsilon\epsilon_n^2/\tilde{k}_n}$. Then $P_0^n(E_{1n} \le n\varepsilon^2\epsilon_n^2 - \log 2)$ is bounded above by

$$\underbrace{P_0^n\left(\int_{\mathcal{A}_{\varepsilon\epsilon_n}^c\cap\mathcal{F}_n}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n \ge e^{-n\varepsilon^2\epsilon_n^2}\right)}_{E_{11n}} + \underbrace{P_0^n\left(\int_{\mathcal{F}_n^c}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n \ge e^{-n\varepsilon^2\epsilon_n^2}\right)}_{E_{12n}}.$$

Using Lemma B.0.4 with $\varepsilon$ replaced by $\varepsilon\epsilon_n$ and $C_n = e^{n\varepsilon\epsilon_n^2/\tilde{k}_n}$,

$$\int_0^{\sqrt{2}\,\varepsilon\epsilon_n}\sqrt{H_{[]}(u, \mathcal{F}_n, \|\cdot\|_2)}\,du \lesssim \varepsilon\epsilon_n\sqrt{\tilde{k}_n\big(\log\tilde{k}_n + \tfrac{1}{2}\log p_n + 2\log C_n - \log\epsilon_n\big)} \le \sqrt{n}\,\varepsilon^2\epsilon_n^2,$$

where the last inequality holds since $\tilde{k}_n\log n = o(n\epsilon_n^2)$, $p_n = o(e^{n\epsilon_n^2/\tilde{k}_n})$ and $n\epsilon_n^2 \to \infty$. Therefore, by Theorem 1 in [138], $E_{11n} \to 0$ as $n \to \infty$.

Additionally, $P_0^n\big(\int_{\mathcal{F}_n^c}(L(\theta_n|\cdot)/L_0)\,p(\theta_n|\lambda)\,d\theta_n \ge e^{-n\varepsilon^2\epsilon_n^2}\big)$ is bounded above by Markov's inequality by

$$e^{n\varepsilon^2\epsilon_n^2}\,E_0^n\left(\int_{\mathcal{F}_n^c}\frac{L(\theta_n|\cdot)}{L_0}\,p(\theta_n|\lambda)\,d\theta_n\right) = e^{n\varepsilon^2\epsilon_n^2}\int_{\mathcal{F}_n^c} p(\theta_n|\lambda)\,d\theta_n \le e^{-n\epsilon_n^2(2\varepsilon^2 - \varepsilon^2)} = e^{-n\varepsilon^2\epsilon_n^2},$$

where the inequality holds due to Lemma B.0.7 with $\nu = 2\varepsilon^2$. Thus $E_{12n} \to 0$ as $n \to \infty$.

BIBLIOGRAPHY

[1] Ailon, N., Chazelle, B. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. Proceedings of the Symposium on Theory of Computing (2006), 557-563.
[2] Allison, J. R., Rivers, R. C., Christodoulou, J., Vendruscolo, M., Dobson, C. M.
A relationship between the transient structure in the monomeric state and the aggregation propensities of α-synuclein and β-synuclein. Biochemistry 53 (11 2014).
[3] Bai, J., Song, Q., Cheng, G. Efficient variational inference for sparse deep learning with theoretical guarantee. In Advances in Neural Information Processing Systems (2020), H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, Eds., vol. 33, Curran Associates, Inc., pp. 466-476.
[4] Barron, A., Schervish, M. J., Wasserman, L. The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27, 2 (1999), 536-561.
[5] Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 39, 3 (1993), 930-945.
[6] Bhattacharya, S., Maiti, T. Statistical foundation of variational Bayes neural networks. Neural Networks 137 (2021), 151-173.
[7] Bishop, C. M. Bayesian neural networks. Journal of the Brazilian Computer Society 4, 1 (1997), 61-68.
[8] Blei, D., Ng, A., Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993-1022.
[9] Blei, D. M., Lafferty, J. D. A correlated topic model of science. The Annals of Applied Statistics 1, 1 (2007), 17-35.
[10] Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D. Weight uncertainty in neural network.
[11] Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D. Weight uncertainty in neural network. In Proceedings of Machine Learning Research, vol. 37. PMLR, 2015, pp. 1613-1622.
[12] Breiman, L. Bagging predictors. Machine Learning 24, 2 (Aug. 1996), 123-140.
[13] Candès, E. J., Romberg, J. K., Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics 59, 8 (2006), 1207-1223.
[14] Cannings, T. I., Samworth, R. J. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 4 (2017), 959-1035.
[15] Carbonetto, P., Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies.
Bayesian Analysis 7 (2012).
[16] Carvalho, C. M., Polson, N. G., Scott, J. G. Handling sparsity via the horseshoe. D. van Dyk and M. Welling, Eds., vol. 5 of Proceedings of Machine Learning Research, PMLR, pp. 73-80.
[17] Casella, G., Robert, C. P. Rao-Blackwellisation of sampling schemes. Biometrika 83, 1 (1996), 81-94.
[18] Chapman, R., Mapstone, M., McCrary, J., Gardner, M., Porsteinsson, A., Sandoval, T., Guillily, M., DeGrush, E., Reilly, L. Predicting conversion from mild cognitive impairment to Alzheimer's disease using neuropsychological tests and multivariate methods. Journal of Clinical and Experimental Neuropsychology 33 (02 2011), 187-99.
[19] C, S.-T., H, Y.-H., H, Y.-L., K, S.-J., T, H.-S., W, H.-K., C, D.-R. Comparative analysis of logistic regression, support vector machine and artificial neural network for the differential diagnosis of benign and malignant solid breast tumors by the use of three-dimensional power Doppler imaging. Korean Journal of Radiology 10 (08 2009), 464-71.
[20] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems.
[21] Cheng, B., Zhang, D., Shen, D. Domain transfer learning for MCI conversion prediction. vol. 15, pp. 82-90.
[22] Chérief-Abdellatif, B.-E. Convergence rates of variational inference in sparse deep learning. In Proceedings of the 37th International Conference on Machine Learning (13-18 Jul 2020), H. D. III and A. Singh, Eds., vol. 119 of Proceedings of Machine Learning Research, PMLR, pp. 1831-1842.
[23] Cristianini, N., Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[24] C, Y., L, B., L, S., Z, X., F, M., L, T., Z, W., P, M., J, T., J, J. S., Alzheimer's Disease Neuroimaging Initiative. Identification of conversion from mild cognitive impairment to Alzheimer's disease using multivariate predictors. PLOS ONE 6, 7 (07 2011), 1-10.
[25] Cybenko, G.
Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 4 (1989), 303-314.
[26] Dasgupta, S. Experiments with random projection. Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (2000).
[27] Dasgupta, S., Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22, 1 (2003), 60-65.
[28] Davatzikos, C., Bhatt, P., Shaw, L., Batmanghelich, K., Trojanowski, J. Prediction of MCI to AD conversion, via MRI, CSF biomarkers, and pattern classification. Neurobiol Aging (2011).
[29] D, D., L, X., T, M., P, G., C, K., B, K., M, J., D, R., S, Y., P, G. Combining early markers strongly predicts conversion from mild cognitive impairment to Alzheimer's disease. Biological Psychiatry 64 (09 2008), 871-9.
[30] D A S, J I V, J. C. S. Comparison between SVM and logistic regression: Which one is better to discriminate? Revista Colombiana de Estadística, Número especial en Bioestadística 35 (06 2012), 223-237.
[31] D, J., H, Q. Prediction of MCI to AD conversion using Laplace eigenmaps learned from FDG and MRI images of AD patients and healthy controls. In 2017 2nd International Conference on Image, Vision and Computing (ICIVC) (2017), pp. 660-664.
[32] Dong, X., Yu, Z., Cao, W., Shi, Y., Ma, Q. A survey on ensemble learning. Frontiers of Computer Science 14 (2019), 241-258.
[33] Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory 52, 4 (2006), 1289-1306.
[34] Doshi, J., Erus, G., Ou, Y., Gaonkar, B., Davatzikos, C. Multi-atlas skull-stripping. Acad Radiol (2013), 1566-1576.
[35] Doshi, J., Erus, G., Ou, Y., Resnick, S., Gur, R., Gur, R., Satterthwaite, T., Davatzikos, C. MUSE: Multi-atlas region segmentation utilizing ensembles of registration algorithms and parameters, and locally optimal atlas selection. NeuroImage 127 (12 2015).
[36] Doshi, J., Erus, G., R, M., Davatzikos, C. Hierarchical parcellation of MRI using multi-atlas labeling methods. Alzheimer's Disease Neuroimaging Initiative.
[37] Dreiseitl, S., Ohno-Machado, L.
Logistic regression and artificial neural network classification models: A methodology review. Journal of Biomedical Informatics 35 (10 2002), 352-9.
[38] Duchi, J., Hazan, E., Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 61 (2011), 2121-2159.
[39] Durrant, R., Kabán, A. Sharp generalization error bounds for randomly-projected classifiers. In Proceedings of the 30th International Conference on Machine Learning (2013), S. Dasgupta and D. McAllester, Eds., vol. 28, PMLR, pp. 693-701.
[40] Durrant, R. J., Kabán, A. Random projections as regularizers: Learning a linear discriminant from fewer observations than dimensions. Machine Learning 99, 2 (2015), 257-286.
[41] Hinton, G. E., van Camp, D. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93. ACM Press, 1993, pp. 5-13.
[42] Eckerström, C., O, E., B, M., E, S., R, S., R, S., S, G., E, A., W, A., M, H. Small baseline volume of left hippocampus is associated with subsequent conversion of MCI into dementia: The Göteborg MCI study. J Neurol Sci 271(2) (2008), 48-59.
[43] Ewers, M., W, C., T, J., S, L., P, R., J, C., F, H., B, AL., A, G., S, P., V, B., D, B., W, M., H, H. Prediction of conversion from mild cognitive impairment to Alzheimer's disease dementia based upon biomarkers and neuropsychological test performance. Neurobiol Aging 33(7) (2012), 1203-1214.
[44] Farlow, M. Treatment of mild cognitive impairment (MCI). Curr. Alzheimer Res 6(4) (2009), 273-297.
[45] Feng, J., Simon, N. Sparse-input neural networks for high-dimensional nonparametric regression and classification, 2019.
[46] Friedman, J., Hastie, T., Tibshirani, R. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2009.
[47] Ghosal, S., Ghosh, J. K., van der Vaart, A. W. Convergence rates of posterior distributions. The Annals of Statistics 28, 2 (2000), 500-531.
[48] Ghosh, S., Yao, J., Doshi-Velez, F. Model selection in Bayesian neural networks via horseshoe priors. Journal of Machine Learning Research 20, 182 (2019), 1-46.
[49] Goel, N., Bebis, G., Nefian, A. Face recognition experiments with random projection. In Proceedings of SPIE - The International Society for Optical Engineering (2005), vol. 5776.
[50] Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., vol. 24. Curran Associates, Inc., 2011, pp. 2348-2356.
[51] Graves, A. Generating sequences with recurrent neural networks, 2014. arXiv:1308.0850.
[52] Guhaniyogi, R., Dunson, D. B. Bayesian compressed regression. Journal of the American Statistical Association 110, 512 (2015), 1500-1514.
[53] Guhaniyogi, R., Dunson, D. B. Compressed Gaussian process for manifold regression. Journal of Machine Learning Research 17, 69 (2016), 1-26.
[54] Gurney, K. An Introduction to Neural Networks. Taylor & Francis, Inc., USA, 1997.
[55] Heinze, C., McWilliams, B., Meinshausen, N., Krummenacher, G. LOCO: Distributing ridge regression with random projections, 2015. arXiv:1406.3469.
[56] Hinrichs, C., Singh, V., Xu, G., Johnson, S. Predictive markers for AD in a multi-modality framework: An analysis of MCI progression in the ADNI population. NeuroImage 55 (03 2011), 574-89.
[57] Hinton, G., Srivastava, N., Swersky, K. Lecture 6a: Overview of mini-batch gradient descent. http://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf.
[58] Hoerl, A. E., Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 1 (1970), 55-67.
[59] Hojjati, S. H., Ebrahimzadeh, A., Khazaee, A., Babajani-Feremi, A. Predicting conversion from MCI to AD using resting-state fMRI, graph theoretical approach and SVM. Journal of Neuroscience Methods 282 (03 2017).
[60] Hornik, K., Stinchcombe, M., White, H. Multilayer feedforward networks are universal approximators. Neural Networks 2, 5 (1989), 359-366.
[61] Hubin, A., Storvik, G., Frommlet, F. Deep Bayesian regression models. arXiv:1806.02160.
[62] Alzheimer's Disease Neuroimaging Initiative. Accessed on: Nov. 3, 2020. [Online]. Available: http://adni.loni.usc.edu.
[63] Jaakkola, T., Jordan, M. I. A variational approach to Bayesian logistic regression problems and their extensions.
[64] Javid, K., Handley, W., Hobson, M. P., Lasenby, A. Compromise-free Bayesian neural networks. arXiv:2004.12211.
[65] Kejzlar, V., Bhattacharya, S., Son, M., Maiti, T. Black box variational Bayes model averaging, 2021.
[66] Kingma, D. P., Salimans, T., Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems (2015), C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, Eds., vol. 28, Curran Associates, Inc., pp. 2575-2583.
[67] Korolev, I. Alzheimer's disease: A clinical and basic science review (MSRJ: Medical Student Research Journal). Medical Student Research Journal 4 (09 2014), 24-33.
[68] Korolev, I. O., Symonds, L. L., Bozoki, A. C., Alzheimer's Disease Neuroimaging Initiative. Predicting progression from mild cognitive impairment to Alzheimer's dementia using clinical, MRI, and plasma biomarkers via probabilistic pattern classification. PLOS ONE 11, 2 (02 2016), 1-25.
[69] K, J., Z, P., C, T., Z, Z., L, L., W, N., W, L. Prediction of transition from mild cognitive impairment to Alzheimer's disease based on a logistic regression-artificial neural network-decision tree model. Geriatr Gerontol Int 21(1) (2021), 43-47.
[70] van Laarhoven, T. L2 regularization versus batch and weight normalization. arXiv:1706.05350.
[71] Lampinen, J., Vehtari, A. Bayesian approach for neural networks: review and case studies. Neural Networks 14, 3 (2001), 257-274.
[72] Latouche, P., Robin, S. Variational Bayes model averaging for graphon functions and motif frequencies inference in W-graph models. Statistics and Computing 26, 6 (2015), 1173-1185.
[73] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278-2324.
[74] Lee, H.
Consistency of posterior distributions for neural networks. Neural Networks 13, 6 (2000), 629-642.
[75] L, S., B, A., Y, D., L, J., A, B. Predicting progression from mild cognitive impairment to Alzheimer's disease using longitudinal callosal atrophy. Alzheimer's and Dementia: Diagnosis, Assessment and Disease Monitoring 2 (03 2016).
[76] Lee, S.-I., Lee, H., Abbeel, P., Ng, A. Efficient L1 regularized logistic regression. In AAAI (2006).
[77] Leshno, M., Lin, V. Y., Pinkus, A., Schocken, S. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 6 (1993), 861-867.
[78] L, F., T, L., T, K.-H., J, S., S, D., L, J. Robust deep learning for improved classification of AD/MCI patients. In Machine Learning in Medical Imaging (Cham, 2014), G. Wu, D. Zhang, and L. Zhou, Eds., Springer International Publishing, pp. 240-247.
[79] L, X., L, C., C, J., O, J. Variance reduction in black-box variational inference by adaptive importance sampling. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18 (2018), International Joint Conferences on Artificial Intelligence Organization, pp. 2404-2410.
[80] Liang, F., Li, Q., Zhou, L. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association 113, 523 (2018), 955-972.
[81] Liu, Z., Maiti, T., Bender, A. A role for prior knowledge in statistical classification of the transition from MCI to Alzheimer's disease. Unpublished report, 2020.
[82] L, D. A., B, S., M, R. A., D, V., Alzheimer's Disease Neuroimaging Initiative (ADNI). A multivariate predictive modeling approach reveals a novel CSF peptide signature for both Alzheimer's disease state classification and for predicting future disease progression. PLOS ONE 12, 8 (08 2017), 1-18.
[83] Logsdon, B., Hoffman, G., Mezey, J. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinformatics 11, 58 (2010).
[84] L, M., J, L., W, M. J. In Advances in Neural Information Processing Systems (2011), J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., vol. 24, Curran Associates, Inc., pp. 1206-1214.
[85] MacKay, D. J. C. A practical Bayesian framework for backpropagation networks.
[86] Marzetta, T. L., Tucci, G. H., Simon, S. H. A random matrix-theoretic approach to handling singular covariance estimates. IEEE Transactions on Information Theory 57, 9 (2011), 6256-6271.
[87] McKinney, W. Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference (01 2010).
[88] McMahan, H. B. A survey of algorithms and analysis for adaptive online learning. Journal of Machine Learning Research 18, 90 (2017), 1-50.
[89] M, M., N, C., S, M. Deep learning on brain cortical thickness data for disease classification. In 2018 Digital Image Computing: Techniques and Applications (DICTA) (2018), pp. 1-5.
[90] M, S., K, A., R, F., A, A., K, S. A. A nonparametric approach for mild cognitive impairment to AD conversion prediction: Results on longitudinal data. IEEE Journal of Biomedical and Health Informatics 21, 5 (2017), 1403-1410.
[91] Misra, C., Fan, Y., Davatzikos, C. Baseline and longitudinal patterns of brain atrophy in MCI patients, and their use in prediction of short-term conversion to AD: Results from ADNI. NeuroImage 44(4) (2009), 1415-1422.
[92] Mitchell, A. J., Shiri-Feshki, M. Temporal trends in the long term risk of progression of mild cognitive impairment: a pooled analysis. Journal of Neurology, Neurosurgery & Psychiatry 79, 12 (2008), 1386-1391.
[93] Mullachery, V., Khera, A., Husain, A. Bayesian neural networks. arXiv:1801.07710.
[94] Nagapetyan, T., Duncan, A. B., Hasenclever, L., Vollmer, S. J., Szpruch, L., Zygalakis, K. The true cost of stochastic gradient Langevin dynamics. arXiv:1706.02692.
[95] Neal, R. M. Bayesian training of backpropagation networks by the hybrid Monte-Carlo method. https://www.cs.toronto.edu/~radford/ftp/bbp.pdf.
[96] Neal, R. M. Bayesian Learning for Neural Networks.
Springer-Verlag, Berlin, Heidelberg, 1996.
[97] Paisley, J., Blei, D., Jordan, M. Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning, ICML '12 (2012), ACM Press, pp. 1363-1370.
[98] Park, T., Casella, G. The Bayesian lasso. Journal of the American Statistical Association 103, 482 (2008), 681-686.
[99] Pati, D., Bhattacharya, A., Yang, Y. On statistical optimality of variational Bayes. In Proceedings of Machine Learning Research, A. Storkey and F. Perez-Cruz, Eds., vol. 84. PMLR, 2018, pp. 1579-1588.
[100] Pati, D., Bhattacharya, A., Yang, Y. On statistical optimality of variational Bayes. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (2018), A. Storkey and F. Perez-Cruz, Eds., vol. 84, PMLR, pp. 1579-1588.
[101] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
[102] P, T., L, L., C, S., S, D., P R, A., S, I., M, A., G, M., C. M, S. Predicting progression of mild cognitive impairment to dementia using neuropsychological data: A supervised learning approach using time windows. BMC Medical Informatics and Decision Making 17 (07 2017).
[103] Petersen, R. C., Roberts, R. O., Knopman, D. S., Boeve, B. F., Geda, Y. E., Ivnik, R. J., Smith, G. E., Jack, C. R. Mild cognitive impairment: ten years later. Archives of Neurology 66, 12 (December 2009), 1447-1455.
[104] Pollard, D. Empirical processes: Theory and applications. NSF-CBMS Regional Conference Series in Probability and Statistics 2 (1990), i-86.
[105] Polson, N. G., Ročková, V. Posterior concentration for sparse deep learning. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc.
[106] Price, R.
A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory 4, 2 (1958), 69-72.
[107] Ranganath, R., Gerrish, S., Blei, D. M. Black box variational inference. arXiv:1401.0118.
[108] Risacher, S., Saykin, A., West, J., Shen, L., Firpi, H., McDonald, B. Baseline MRI predictors of conversion from MCI to probable AD in the ADNI cohort. Current Alzheimer Research 6 (08 2009), 347-61.
[109] Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (1996), 267-288.
[110] Ross, S. M. Simulation, fifth ed. Academic Press, 2013.
[111] Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics 48, 4 (2020), 1875-1897.
[112] Schölkopf, B., Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
[113] S, J., P, J., S, F., C, K., C, V., C, R., D, P. Predicting cognitive decline in subjects at risk for Alzheimer disease by using combined cerebrospinal fluid, MR imaging, and PET biomarkers. Radiology 266 (12 2012).
[114] S, T., J, J., L, Y., W, P., Z, C., Y, Z. Decision supporting model for one-year conversion probability from MCI to AD using CNN and SVM. In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (2018), pp. 738-741.
[115] Shlens, J. A tutorial on principal component analysis. arXiv:1404.1100.
[116] Simon, N., Friedman, J., Hastie, T., Tibshirani, R. A sparse-group lasso. Journal of Computational and Graphical Statistics 22, 2 (2013), 231-245.
[117] Singh, B., De, S., Zhang, Y., Goldstein, T., Taylor, G. Layer-specific adaptive learning rates for deep networks, 2015. arXiv:1510.04609.
[118] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929-1958.
[119] Suk, H.-I., Shen, D. Deep learning-based feature representation for AD/MCI classification.
In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2013 (Berlin, Heidelberg, 2013), K. Mori, I. Sakuma, Y. Sato, C. Barillot, and N. Navab, Eds., Springer Berlin Heidelberg, pp. 583-590.
[120] Sun, S., Chen, C., Carin, L. Learning structured weight uncertainty in Bayesian neural networks. A. Singh and J. Zhu, Eds., vol. 54 of Proceedings of Machine Learning Research, PMLR, pp. 1283-1292.
[121] Sun, S., Zhang, G., Shi, J., Grosse, R. B. Functional variational Bayesian neural networks. In 7th International Conference on Learning Representations, ICLR 2019 (2019), OpenReview.net.
[122] Sun, Y., Song, Q., Liang, F. Consistent sparse deep learning: Theory and computation. Journal of the American Statistical Association (2021), 1-42.
[123] Tabatabaei-Jafari, H., Shaw, M., Cherbuin, N. Cerebral atrophy in mild cognitive impairment: A systematic review with meta-analysis. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 1 (12 2015).
[124] T, J. Lecture notes, part III: Black-box variational inference. http://www.it.uu.se/research/systems_and_control/education/2018/pml/lectures/VILectureNotesPart3.pdf.
[125] T, Z., C, J., K, Q., Z, M., A, A., S, K. Dynamic embedding projection-gated convolutional neural networks for text classification. IEEE Transactions on Neural Networks and Learning Systems (2021), 1-10.
[126] van der Vaart, A., Wellner, J. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer, New York, 1996.
[127] Vapnik, V. The Support Vector Method of Function Estimation. Springer US, Boston, MA, 1998, pp. 55-85.
[128] Vapnik, V. The Nature of Statistical Learning Theory. Springer, 1999.
[129] Vapnik, V. N. Statistical Learning Theory. Wiley-Interscience, 1998.
[130] V, Y., R, V., I, R., V, P. Predicting short-term MCI-to-AD progression using imaging, CSF, genetic factors, cognitive resilience, and demographics. Scientific Reports 9 (2019), 2235.
[131] V, T., L, S., B, D., V, S., D, P., D T, F., D, J.
Support vector machine versus logistic regression modeling for prediction of hospital mortality in critically ill patients with haematological malignancies. BMC Medical Informatics and Decision Making 8 (01 2009), 56.
[132] Wan, R., Zhong, M., Xiong, H., Zhu, Z. Neural control variates for variance reduction. arXiv:1806.00159.
[133] W, B., H, R., X, Y., Z, F., P, W. Identifying mild cognitive impairment conversion to Alzheimer's disease from medical image information. In 2016 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW) (2016), pp. 1-2.
[134] W, J., X, F., N, F., L, X. Unsupervised adaptive embedding for dimensionality reduction. IEEE Transactions on Neural Networks and Learning Systems (2021), 1-12.
[135] Wang, Y., Blei, D. M. Frequentist consistency of variational Bayes. Journal of the American Statistical Association 114, 527 (2019), 1147-1161.
[136] W, R., L, C., F, N., L, L. Prediction of conversion from mild cognitive impairment to Alzheimer's disease using MRI and structural network features. Frontiers in Aging Neuroscience (2016).
[137] Welling, M., Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML '11 (2011), ACM Press, pp. 681-688.
[138] Wong, W. H., Shen, X. Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Annals of Statistics 23, 2 (1995), 339-362.
[139] Wu, A., Nowozin, S., Meeds, E., Turner, R. E., Hernández-Lobato, J. M., Gaunt, A. L. Deterministic variational inference for robust Bayesian neural networks.
[140] Yang, K., Maiti, T. On the classification consistency of high-dimensional sparse neural network. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (2019), pp. 173-182.
[141] Yang, K., Maiti, T. Statistical aspects of high-dimensional sparse artificial neural network models. Machine Learning and Knowledge Extraction 2, 1 (2020), 1-19.
[142] Yang, Y., Pati, D., Bhattacharya, A.
α-variational inference with statistical guarantees. Annals of Statistics 48, 2 (2020), 886-905.
[143] Yao, Y., Rosasco, L., Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation 26 (2007), 289-315.
[144] Y, J., F, M., V, R., L, V., R, N., N, G., D, A., N, V. Sparse learning and stability selection for predicting MCI to AD conversion using baseline ADNI data. BMC Neurology 12 (06 2012), 46.
[145] Y, J., M, M., M, J., M, A., C, D., O, S. Accurate multimodal probabilistic prediction of conversion to Alzheimer's disease in patients with mild cognitive impairment. NeuroImage: Clinical (05 2013), 735-745.
[146] Y, Z., Y, F., Y, K., C, W., C, C. L. P., C, L., Y, J., W, H.-S. Semisupervised classification with novel graph construction for high-dimensional data. IEEE Transactions on Neural Networks and Learning Systems (2020), 1-14.
[147] Zhang, D., Shen, D. Multi-modal multi-task learning for joint prediction of clinical scores in Alzheimer's disease. pp. 60-67.
[148] Zhang, D., Shen, D., Alzheimer's Disease Neuroimaging Initiative. Predicting future clinical changes of MCI patients using longitudinal and multimodal biomarkers. PLOS ONE 7, 3 (03 2012), 1-15.
[149] Zhang, F., Gao, C. Convergence rates of variational posterior distributions. Annals of Statistics 48, 4 (2020), 2180-2207.
[150] Z, C., C, Y., G, Z., H, F., L, J., G, T. Adaptive learning rates with maximum variation averaging, 2020. arXiv:2006.11918.
[151] Zhu, J., Rosset, S., Hastie, T., Tibshirani, R. 1-norm support vector machines. MIT Press, 2003, pp. 49-56.
[152] Zou, H., Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301-320.