Statistical machine learning theory and methods for high-dimensional low sample size (HDLSS) problems
High-dimensional, low sample size (HDLSS) data analysis has become prevalent in statistical machine learning. Such applications involve a huge number of features or variables, while the sample size is limited for reasons such as cost, ethics, and so on. It is therefore important to develop approaches that learn the underlying relationships from a small amount of data. In this dissertation, we study the statistical properties of several non-parametric machine learning models for HDLSS problems and apply these models to various fields for validation.

In Chapter 2, we study the generalized additive model in the high-dimensional setting with a general link function, where the response distribution belongs to the exponential family. We apply a two-step approach for variable selection and estimation: a group lasso step provides an initial estimator, followed by an adaptive group lasso step that yields the final selected variables and estimates. We show that, under certain conditions, the two-step approach consistently selects the truly nonzero variables, and we derive the rate of convergence of the estimator. Moreover, we show that the tuning parameter minimizing the generalized information criterion (GIC) has asymptotically minimal risk. Simulation studies of variable selection and estimation are presented. Real data examples, including spam email data and prostate cancer genetic data, support the theory. We also discuss the possibility of using an l0 norm penalty.

In Chapter 3, we study a shallow neural network model for high-dimensional classification. The sparse group lasso, also known as the lp,1 + l1 norm penalty, is applied to obtain feature sparsity and a sparse network structure. By the universal approximation theorem, a neural network can approximate any continuous function with arbitrarily small approximation error provided the number of hidden nodes is large enough; neural networks can therefore model complicated relationships between the response and predictors involving rich interactions. We prove that, under certain conditions, the classification risk of the sparse group lasso penalized shallow neural network converges to the Bayes risk, which is optimal among all possible classifiers. Real data examples, including prostate cancer genetic data, Alzheimer's disease (AD) magnetic resonance imaging (MRI) data, and autonomous driving data, support the theory. Moreover, we propose an l0 + l1 penalty and show that the resulting problem can be formulated as a mixed integer second order cone optimization (MISOCO) problem.

In Chapter 4, we propose a stage-wise variable selection technique with deep neural networks in the high-dimensional setting, named ensemble neural network selection (ENNS). We apply an ensemble to the stage-wise neural network variable selection method to filter out falsely selected variables; under certain conditions, this procedure consistently removes unwanted variables and selects the truly nonzero ones. Moreover, we propose a second step that further simplifies the neural network structure by specifying the desired percentage of nonzero parameters in each hidden layer, and we develop a coordinate descent type algorithm to compute its solution. We show that the two-step approach achieves universal consistency for both regression and classification problems. Simulation studies support these arguments.
Real data examples, including riboflavin production data, prostate cancer genetic data, and a region of interest (ROI) in MRI data, validate the method. Schematic code sketches of the three core procedures appear below.
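A minimal sketch of Chapter 2's two-step approach, assuming a logistic link and a cubic polynomial basis in place of the dissertation's general exponential-family, spline-based setup. The proximal gradient solver, the tuning values, and the function names (basis_expand, group_lasso_logistic) are illustrative choices, and the intercept is omitted for brevity.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def basis_expand(X, degree=3):
    """Per-feature polynomial basis; columns j*degree..(j+1)*degree-1 form group j."""
    return np.hstack([np.column_stack([X[:, j] ** d for d in range(1, degree + 1)])
                      for j in range(X.shape[1])])

def group_lasso_logistic(Xb, y, groups, lam, weights=None, n_iter=500):
    """Proximal gradient descent for logistic loss + weighted group lasso penalty."""
    n, p = Xb.shape
    w = np.ones(len(groups)) if weights is None else weights
    beta = np.zeros(p)
    step = 4.0 * n / (np.linalg.norm(Xb, 2) ** 2)  # 1/L for the logistic loss
    for _ in range(n_iter):
        grad = Xb.T @ (expit(Xb @ beta) - y) / n
        beta -= step * grad
        for g, idx in enumerate(groups):           # group soft-thresholding (prox step)
            nrm = np.linalg.norm(beta[idx])
            if nrm > 0:
                beta[idx] *= max(0.0, 1.0 - step * lam * w[g] / nrm)
    return beta

# Synthetic additive data: only features 0 and 1 carry signal.
rng = np.random.default_rng(0)
n, p, degree = 200, 50, 3
X = rng.standard_normal((n, p))
logits = 1.5 * X[:, 0] + np.sin(2 * X[:, 1])
y = (rng.random(n) < expit(logits)).astype(float)

Xb = basis_expand(X, degree)
Xb = (Xb - Xb.mean(0)) / Xb.std(0)                 # put basis columns on one scale
groups = [np.arange(j * degree, (j + 1) * degree) for j in range(p)]

beta_init = group_lasso_logistic(Xb, y, groups, lam=0.05)            # step 1: group lasso
norms = np.array([np.linalg.norm(beta_init[idx]) for idx in groups])
adap_w = 1.0 / np.maximum(norms, 1e-10)                              # adaptive weights
beta_final = group_lasso_logistic(Xb, y, groups, lam=0.05,
                                  weights=adap_w)                    # step 2: adaptive group lasso
selected = [j for j, idx in enumerate(groups) if np.linalg.norm(beta_final[idx]) > 1e-8]
print("selected features:", selected)
```

The adaptive weights blow up the penalty on groups the first step already shrank toward zero, which is what drives the consistent selection in the second step.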
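The Chapter 3 penalty can be sketched as a one-hidden-layer classifier where, after each SGD step, a proximal update shrinks the first-layer weights column-wise (one group per input feature) and soft-thresholds all weights for the l1 part. The tanh activation, network sizes, and tuning constants below are assumptions for illustration, not the dissertation's exact estimator.

```python
import torch

torch.manual_seed(0)
n, p, h = 400, 30, 16                                  # samples, features, hidden nodes
X = torch.randn(n, p)
y = ((X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2) > 0.3).float()  # only 3 informative features

W1 = (0.3 * torch.randn(h, p)).requires_grad_()        # column j = weights leaving feature j
b1 = torch.zeros(h, requires_grad=True)
W2 = (0.3 * torch.randn(1, h)).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)

lam_group, lam_l1, lr = 1e-2, 1e-3, 0.05
opt = torch.optim.SGD([W1, b1, W2, b2], lr=lr)
for _ in range(2000):
    opt.zero_grad()
    logits = torch.tanh(X @ W1.T + b1) @ W2.T + b2
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits.squeeze(1), y)
    loss.backward()
    opt.step()
    with torch.no_grad():
        # proximal step for the group part: shrink each input feature's column of W1
        col = W1.norm(dim=0, keepdim=True)
        W1 *= torch.clamp(1 - lr * lam_group / (col + 1e-12), min=0.0)
        # proximal step for the l1 part: soft-threshold all weights
        for W in (W1, W2):
            W.copy_(W.sign() * torch.clamp(W.abs() - lr * lam_l1, min=0.0))

active = (W1.norm(dim=0) > 1e-6).nonzero().squeeze(1)
print("input features kept:", active.tolist())
```

Zeroing an entire column of W1 disconnects that input feature from the network, so the group part delivers feature sparsity while the l1 part thins the remaining connections.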
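A schematic of the ensemble idea behind ENNS from Chapter 4, under strong simplifying assumptions: the base stage-wise step here greedily adds the feature most correlated with the current neural network residual (a stand-in for the dissertation's stage-wise selection criterion), and majority voting over bootstrap resamples filters out falsely selected variables. The helper stagewise_select and the 60% vote threshold are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def stagewise_select(X, y, n_stages=5):
    """Greedy stage-wise selection: add the feature most correlated with the
    residual of a small neural net refit on the currently selected features."""
    selected, resid = [], y - y.mean()
    for _ in range(n_stages):
        scores = np.abs(X.T @ resid)          # correlation-type score per feature
        scores[selected] = -np.inf            # skip already-selected features
        selected.append(int(np.argmax(scores)))
        net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
        net.fit(X[:, selected], y)
        resid = y - net.predict(X[:, selected])
    return set(selected)

# Synthetic data: features 0, 1, 2 carry the signal among p = 100 candidates.
rng = np.random.default_rng(1)
n, p = 300, 100
X = rng.standard_normal((n, p))
y = X[:, 0] - 2 * X[:, 1] + np.tanh(X[:, 2]) + 0.3 * rng.standard_normal(n)

B, votes = 20, np.zeros(p)                    # ensemble over bootstrap resamples
for b in range(B):
    idx = rng.integers(0, n, size=n)
    for j in stagewise_select(X[idx], y[idx]):
        votes[j] += 1
kept = np.flatnonzero(votes >= 0.6 * B)       # majority-vote filter
print("features kept by the ensemble:", kept.tolist())
```

A single greedy run can pick up spurious features; requiring a feature to be chosen in most bootstrap runs is what lets the ensemble filter those out.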
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Yang, Kaixu
- Thesis Advisors: Maiti, Tapabrata
- Committee Members: Sakhanenko, Lyudmila; Zhong, Ping-shou; Zhu, David
- Date Published: 2020
- Program of Study: Statistics - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: xii, 245 pages
- ISBN: 9798635297742
- Permalink: https://doi.org/doi:10.25335/az68-3w02