Machine Learning Methods for feature selection and prediction applied to large scale genetics data

The age of big data has brought exciting opportunities to elicit new insights in many scientific fields. In genetics, big data-driven technologies can be transformative. Although big data-powered innovations are making strides in genetics, many vital challenges remain. This dissertation focuses on developing statistical and machine learning methods for addressing the critical challenges of these complex genomic data sets.
The first chapter addresses the challenges of variable selection due to severe collinearity among the features in Genome-Wide Association (GWA) studies, which involve millions of DNA variants (e.g., SNPs), many of which may be highly correlated. We devised a novel Bayesian hierarchical hypothesis testing (BHHT) method. We present simulation results that demonstrate that the proposed method can lead to high power with adequate error control and fine-mapping resolution. Furthermore, we demonstrate the feasibility of using the proposed methodology with big data by using it to map risk variants for serum urate using data from the UK-Biobank (n ~ 300,000, p ~ 15 million SNPs).
Chapter two focuses on developing a Bayesian prior for improving the robustness of deep latent variable models. The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract relevant information for predictions. We present a novel sparsity-inducing spike-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as to other sparsity-induction mechanisms for latent variable models proposed in the literature.
In the third chapter, we develop and benchmark machine learning methods for genomic prediction with ancestry-diverse data sets. Genomic prediction is commonly made by constructing polygenic scores (PGS), which are linear combinations of risk variants (SNPs). However, in modern genetic data sets, complexities arise due to the presence of diverse ancestry groups. We develop a strategy to use deep learning for genomic prediction that leverages non-linear input-output patterns among physically proximal SNPs. Using TOPMed genotype data and Monte Carlo simulations, we evaluated whether local regressions using machine learning methodology can outperform linear models in cross-ancestry prediction.
In summary, this thesis contributes novel statistical methods for mapping risk variants in the presence of collinearity, and novel machine learning methodology to infer latent representations of complex data sets and to predict disease risk using ultra-high dimensional genomic data.
