MACHINE LEARNING METHODS FOR FEATURE SELECTION AND PREDICTION APPLIED TO LARGE SCALE GENETICS DATA

By

Anirban Samaddar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics - Doctor of Philosophy

2023

ABSTRACT

The age of big data has brought exciting opportunities to elicit new insights in many scientific fields. In Genetics, big-data-driven technologies can be transformative. Although big-data-powered innovations are making strides in Genetics, many vital challenges remain. This dissertation focuses on developing statistical and machine learning methods for addressing the critical challenges of these complex genomic data sets.

The first chapter addresses the challenges of variable selection arising from severe collinearity among the features in Genome-Wide Association (GWA) studies involving millions of DNA variants (e.g., SNPs), many of which may be highly correlated. We devised a novel Bayesian hierarchical hypothesis testing (BHHT) procedure. We present simulation results demonstrating that the proposed method can lead to high power with adequate error control and fine-mapping resolution. Furthermore, we demonstrate the feasibility of using the proposed methodology with big data by using it to map risk variants for serum urate using data from the UK-Biobank (n ∼ 300,000, p ∼ 15 million SNPs).

Chapter two focuses on developing a Bayesian prior for improving the robustness of deep latent variable models. The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract relevant information for predictions. We present a novel sparsity-inducing spike-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity-induction mechanisms for latent variable models proposed in the literature.

In the third chapter, we develop and benchmark machine learning methods for genomic prediction with ancestry-diverse data sets. Genomic prediction is commonly made by constructing polygenic scores (PGS), which are linear combinations of risk variants (SNPs). However, in modern genetic data sets complexities arise due to the presence of diverse ancestry groups. We develop a strategy to use deep learning for genomic prediction that leverages non-linear input-output patterns among physically proximal SNPs. Using TOPMed genotype data and Monte Carlo simulations, we evaluated whether local regressions using machine learning methodology can outperform linear models in cross-ancestry prediction.

In summary, this thesis contributes novel statistical methods for mapping risk variants in the presence of collinearity, and novel machine learning methodology to infer latent representations of complex data sets and to predict disease risk using ultra-high-dimensional genomic data.

Copyright by ANIRBAN SAMADDAR 2023

To my parents Anita Samaddar and Pradip Kumar Samaddar

ACKNOWLEDGEMENTS

I want to convey my gratitude to my advisors, Dr. Gustavo de los Campos and Dr. Tapabrata Maiti, for their guidance, encouragement, and insight that have helped shape my Ph.D. experience.
I would also like to extend thanks to my wonderful committee members Dr. Shrijita Bhattacharya and Dr. Chih-li Sung for their valuable feedback on my dissertation. Additionally, I would like to thank Dr. Sandeep Madireddy and Dr. Prasanna Balaprakash from the Argonne National Laboratory for providing me with the opportunity to work on cutting-edge problems during my internship and beyond. Finally, I would like to thank all my professors at Michigan State University (MSU) for enhancing my knowledge in the field of Statistics and making my academic experience truly fulfilling.

Beyond MSU, I am thankful to some of the exceptional teachers who have shaped my academic journey. I am grateful to my high-school math teacher Tushar sir for nurturing my early interest in Mathematics. I want to thank my professors at Narendrapur Ramakrishna Mission and the University of Calcutta who ignited my passion for statistics. Among them, I am particularly indebted to Dr. Gaurangadeb Chattopadhyay for his guidance and motivation during my Master's at the University of Calcutta, which led me to pursue higher studies in Statistics.

This dissertation is not only a product of my efforts but of those who have always been there for me along the way. I would like to start by thanking my mother and father for the many sacrifices they made to support me in pursuing my dreams. Graduate life can be difficult sometimes, but thinking about their struggles has always helped me to persevere. I would like to thank my elder brother for being a constant source of positivity and inspiration throughout my life. I would also like to thank my sister-in-law, who always treated me like her own brother and who I feel is always on my side. Finally, I would like to thank my wife Sampriti, who makes my life brighter every day and who is always there for me when it rains.

I am fortunate to have been born and brought up in a close-knit family. I want to pay tribute to my late paternal and maternal grandparents, who would have been proud to see this day. In addition, I would like to say thanks to Tata, Boro Piso, Mani, Choto Piso, Jemma, Jethu, Akaki, Akaka, Tu Kaki, Tu vi Kaka, Notun Dadu, Notun Mammam, Boro Mami, Choto Mama, and Choto Mami, who have inspired and supported me throughout my career. I would also like to thank my father- and mother-in-law, Mani, Jaya di, and Apu da for providing constant encouragement to achieve my goals. Finally, I would like to pay respect to my late pisemosai Dr. Shyamal Chakraborty, who advised me to pursue a bachelor's degree in Statistics; now, thirteen years later, when I am at the other end of my educational journey, I miss him very much.

My academic experience in the past five years is incomplete without my friends around East Lansing and beyond. I extend my heartfelt gratitude to Arka, who has been my companion since the beginning of my Ph.D. voyage, consistently offering motivation and encouragement. Sikta, too, holds a special place for teaching me how to find joy amidst the demanding graduate studies routine. I am also thankful to Arka, Sikta, and Sampriti for the memorable trips and Catan nights that added a wonderful balance to my life. In addition, I would like to thank my friends in the department - Mookyong, Hema, Tathagata, Sumegha - and my friends from the applied economics department (MSU-AFRE), Myat and Yeyoung, for their constant support and friendship. I want to extend my heartfelt gratitude to the friends who extend beyond the borders of East Lansing.
To Sayak, whose engaging conversations about science and history have consistently ignited my curiosity; to Dinu, with whom I've had illuminating discussions about research and music; to Arindam, a confidant with whom I can openly share my thoughts; and to Vaibhav, who has stood by me as a true brother. Furthermore, my appreciation extends to an exceptional group of individuals - Amartya, Parthajit, Santanu, Arabinda, Prolay, and Mishra - whose friendship has enriched my life beyond measure and who offer the finest companionship one could hope for. Lastly, I am immensely thankful to Sourjya, who has remained unwaveringly by my side as both a friend and a brother, a constant source of support throughout my life. To anyone I may have unintentionally omitted, please know that your love and support are deeply cherished, and my heart remains forever indebted to you. Your presence in my journey has been invaluable, and for that, I am eternally thankful.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION ................................................. 1
BIBLIOGRAPHY ............................................................ 4

CHAPTER 2  BAYESIAN HIERARCHICAL HYPOTHESIS TESTING IN LARGE SCALE
           GENOME-WIDE ASSOCIATION ANALYSIS ............................. 5
  2.1 Introduction ....................................................... 5
  2.2 Bayesian Hierarchical Hypothesis Testing ........................... 7
  2.3 Simulation Study .................................................. 13
  2.4 Real Data Application ............................................. 16
  2.5 Discussion ........................................................ 20
  2.6 Software availability ............................................. 23
  BIBLIOGRAPHY ......................................................... 24
  APPENDIX A2  SUPPLEMENTARY METHODS ................................... 27
  APPENDIX B2  SUPPLEMENTARY DATA ...................................... 31

CHAPTER 3  SPARSITY-INDUCING CATEGORICAL PRIOR IMPROVES ROBUSTNESS OF
           THE INFORMATION BOTTLENECK .................................. 38
  3.1 Introduction ...................................................... 38
  3.2 Related Works ..................................................... 40
  3.3 Information Bottleneck with Sparsity-Inducing Categorical Prior ... 42
  3.4 Experimental Results .............................................. 48
  3.5 Conclusions ....................................................... 56
  BIBLIOGRAPHY ......................................................... 57
  APPENDIX A3  SUPPLEMENTARY DATA ...................................... 60

CHAPTER 4  PREDICTION OF THE POLYGENIC RISK SCORES USING DEEP LEARNING
           IN ANCESTRY-DIVERSE GENETIC DATA SETS ....................... 68
  4.1 Introduction ...................................................... 68
  4.2 Materials and Methods ............................................. 70
  4.3 Results ........................................................... 75
  4.4 Discussion ........................................................ 77
  BIBLIOGRAPHY ......................................................... 79
  APPENDIX A4  TOPMED DATA SET ......................................... 81
  APPENDIX B4  SUPPLEMENTARY METHODS ................................... 82

CHAPTER 5  CONCLUSION .................................................. 86
CHAPTER 1

INTRODUCTION

The age of big data has brought exciting opportunities to elicit new insights in many scientific fields. In Genetics, big-data-driven technologies, such as pharmacogenomics and precision medicine, can be transformative. Although big-data-powered innovations are making strides in Genetics, many vital challenges remain. This dissertation focuses on developing statistical methods for addressing critical challenges that emerge in the analysis of ultra-high-dimensional biobank-size data sets. Specifically, in this thesis, we have developed methods for the two main aspects of statistical learning: (1) variable selection, and (2) prediction.

Chapter one addresses the challenges of variable selection due to severe collinearity among the features in Genome-Wide Association Studies (GWAS) involving millions of variants, many of which may be in high linkage disequilibrium. We devised Bayesian hierarchical hypothesis testing (BHHT), a novel multi-resolution testing procedure. Using Monte Carlo simulations, we show that the proposed procedure offers high power with adequate error control and fine-mapping resolution. Finally, we demonstrate the feasibility of using the proposed methodology to map risk variants for serum urate using ultra-high-dimensional (∼15 million SNPs) biobank-size data (n ∼ 300,000) from the UK-Biobank. Our results show that the proposed methodology leads to many more discoveries than those obtained using traditional feature-centered inference procedures. We provide publicly available software for applying the proposed procedure at scale in the following repository: https://github.com/AnirbanSamaddar/Bayes_HHT.

Chapter two focuses on developing a Bayesian prior for improving the robustness of deep latent variable models. The information bottleneck framework [4] provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. We present a novel sparsity-inducing spike-slab categorical prior (similar to the one used for variable selection in chapter one) that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST [3], CIFAR-10 [6], and ImageNet [2] data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity-induction mechanisms for latent variable models proposed in the literature. The open-source software implementation is made available in the following repository: https://github.com/AnirbanSamaddar/SparC-IB.

Chapter three integrates the Bayesian linear variable selection methodology used in chapter one with deep learning methods to develop a novel machine learning (ML) framework for genomic prediction in ancestry-diverse data sets. Genomic prediction (i.e., prediction of phenotypes or disease risk using DNA information, e.g., SNPs) is commonly made by constructing polygenic scores (PGS), which are linear combinations of risk variants (SNPs) that were found in a Genome-Wide Association (GWA) study to be associated with a phenotype or disease.
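To make the PGS construction concrete, the following minimal R sketch uses made-up placeholders (a hypothetical genotype matrix X coded as 0/1/2 allele counts and a hypothetical vector of estimated SNP effects beta_hat); a PGS is simply the genotype-weighted sum of those effects.

```r
# Hypothetical illustration of a polygenic score (PGS): a linear combination
# of SNP allele counts (0/1/2) weighted by estimated SNP effects.
set.seed(1)
n <- 5; p <- 10
X        <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)  # genotypes (0/1/2)
beta_hat <- rnorm(p, sd = 0.1)                                     # estimated SNP effects
pgs      <- as.vector(X %*% beta_hat)                              # one PGS per individual
pgs
```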
Most GWA studies use linear models to construct PGS because of their simplicity and effectiveness in capturing phenotypic variability within ancestry-homogeneous populations. However, in modern genetic data sets complexities arise due to the presence of diverse ancestry groups. There are biological (e.g., gene-gene or gene-by-environment interactions) as well as technological reasons (the fact that prediction models use SNPs that are imperfect surrogates for those with causal effects) that can lead to ancestry-specific SNP effects. Such heterogeneity is not well captured by linear PGS. Theory and some empirical evidence suggest that SNP-SNP interactions are more prominent among SNPs that are physically proximal in their genome position. Therefore, we develop an ML methodology that aims at capturing non-linear local patterns. Our approach also considers the computational complexities that arise from the application of ML methods to ultra-high-dimensional data.

Using TOPMed [5] genotype data and Monte Carlo simulations, we evaluated whether local regressions using machine learning methodology can outperform linear models in cross-ancestry prediction. In a first simulation using short chromosome segments (100 or 1,000 kilo-base-pairs, kbp), our results show that, even in cases when the underlying causal model is strictly linear, ML methodology can potentially outperform linear models due to imperfect linkage disequilibrium (a term used in genetics to describe the imperfect correlation between pairs of SNPs) between the SNPs used in the model and those with causal effects. The difference between ML and linear models was particularly sizable for short segments (100 kbp), perhaps illustrating difficulties that ML methods face when trying to extract fine patterns when the underlying structure is sparse and the input set becomes high-dimensional. Then we extend the simulation and the methods to entire chromosomes. This setting presents a serious computational challenge (in TOPMed, in chromosome one there are more than 40 million SNPs). To address the statistical and computational challenges that emerge when applying ML methods to ultra-high-dimensional data, and building upon ideas that were previously used for linear models [1], we propose a two-step procedure whereby, in the first step, local PGS (ML-derived or linear) are derived for short chromosome segments and, in a second step, these scores are combined using either ML or linear methods. In total, we evaluate four procedures that emerge from the use of either ML or linear models in the first and second steps. Our simulations show that using a local Bayesian Variable Selection method in the first step and then combining the resulting scores using ML yields higher prediction accuracy in cross-ancestry prediction than using ML or linear models alone. These results are significant because they provide avenues to improve genomic prediction with ancestry-diverse data using methods that scale to whole-genome applications.

Overall, this thesis contributes advanced methods for research problems involving both inference (e.g., fine mapping) and prediction. The methods that we develop are implemented in scalable open-source software which we make available through GitHub repositories.

BIBLIOGRAPHY

[1] de Los Campos, G., Grueneberg, A., Funkhouser, S., Pérez-Rodríguez, P., and Samaddar, A. (2023). Fine mapping and accurate prediction of complex traits using bayesian variable selection models applied to biobank-size data.
European Journal of Human Genetics, 31(3):313โ€“320. [2] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large- scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248โ€“255. Ieee. [3] Deng, L. (2012). The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141โ€“142. [4] Goldfeld, Z. and Polyanskiy, Y. (2020). The information bottleneck problem and its applications in machine learning. IEEE Journal on Selected Areas in Information Theory, 1(1):19โ€“38. [5] Kowalski, M. H., Qian, H., Hou, Z., Rosen, J. D., Tapia, A. L., Shan, Y., Jain, D., Argos, M., Arnett, D. K., Avery, C., et al. (2019). Use of > 100,000 nhlbi trans-omics for precision medicine (topmed) consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed african and hispanic/latino populations. PLoS genetics, 15(12):e1008500. [6] Krizhevsky, A., Hinton, G., et al. (2009). Learning multiple layers of features from tiny images. 4 CHAPTER 2 BAYESIAN HIERARCHICAL HYPOTHESIS TESTING IN LARGE SCALE GENOME-WIDE ASSOCIATION ANALYSIS 2.1 Introduction Many modern statistical learning problems require detecting from a large number of features a small fraction of them that are jointly associated with an outcome. In a regression set-up, a common approach to detect the important variables is variable selection. In recent years, many variable selection methods have been proposed (for reviews see e.g., [7], [23]). One can also pose a variable selection task as a (multiple) hypothesis testing problem. An advantage of the hypothesis testing framework is that one can adopt a decision-theoretic approach to find an optimal decision rule that minimizes some loss function in terms of Type-I and Type-II errors. However, designing rules that can attain high power with low error rates can be challenging [6]. Despite the important advancements in theory and methods for feature selection in high- dimension regression, variable selection, and inference in the presence of highly collinear features remain challenging. The problem of selecting a subset of predictors among a large set of correlated features is ubiquitous in Genome-Wide Association (GWA) studies where the objective is to map regions of the genome (either individual variants or chromosome segments) associated with a phenotype. In the last decade, several public (e.g., UK-Biobank, Million Veteran Program, TOPMed, All of Us) and private (e.g., 23andMeยฎ) initiatives have generated unprecedentedly large biomedical datasets that comprise genotype data linked to extensive data on phenotype/disease. The advent of big data brings unprecedented opportunities to advance genetic research and predict complex traits ([5]). However, together with larger sample sizes, these modern data sets come with a remarkable increase in marker density, with potentially millions of single nucleotide polymorphisms (SNPs) distributed throughout the genome. With such a high SNP density, many SNPs can be highly correlated due to a phenomenon called linkage disequilibrium (LD). Bayesian variable selection methods (BVS) ([10], [16]) can be used to identify risk variants in GWA studies. 
Following the seminal contribution of [21], these methods have been widely used for the prediction of complex traits in plant and animal breeding (e.g., [21], [4], [14]) and also in human genetics (e.g., [3], [19], [13], [17], [18], [32]). Bayesian methods offer adequate quantification of uncertainty in variable selection problems. This feature, which is essential for any inferential task, has some unwanted consequences when the goal is to select individual variants associated with a phenotype. For example, when multiple variants are all in LD with one or more causal variants, the posterior probability of non-null effect (aka posterior inclusion probabilities or PIPs) of individual SNPs may not achieve a high value for any individual locus because of the uncertainty about individual SNP effects introduced by collinearity. Therefore, variant-centered BVS may not map important risk loci if those risk loci are located in regions of high LD. This phenomenon has been reported and studied before ([11]) and the general recommendation is to avoid using marginal posterior probabilities and focus instead on credible set inferences, i.e., identifying sets of predictors that are jointly associated with an outcome with a high level of credibility. The joint posterior distribution (or samples from it) of BVS contains all the information needed to identify such sets. However, we lack methods to identify credible sets from posterior samples efficiently.

Our main objective is to fill this gap by developing methods for multi-resolution Bayesian hypothesis testing that can lead to powerful inferences, with adequate error control and fine-mapping resolution, even in the presence of highly collinear predictors. Our approach integrates ideas first proposed for frequentist hierarchical hypothesis testing ([34], [20], [1], [25], [27]) into a Bayesian framework that can offer powerful inferences with adequate error control.

Our study offers the following contributions: First, we develop algorithms for Bayesian Hierarchical Hypothesis Testing (BHHT) that offer high power, with adequate error control and fine-mapping resolution. These algorithms provide a solution to the challenge of identifying credible sets from posterior samples. The algorithms presented control the FDR at one of three possible levels: (i) individual credible sets, (ii) sub-families, and (iii) the entire credible set. Second, we present analytical proofs that these methods adequately control the FDR (i.e., the expected proportion of false discoveries over conceptual repeated sampling). Third, using extensive simulations, we show that the proposed BHHT methodology offers higher power than inferences based on marginal posterior probability (i.e., variant-centered) methods and is competitive with, and in some settings superior to, state-of-the-art credible-set methods such as SuSiE [32]. Fourth, we demonstrate the feasibility of applying the proposed methods with ultra-high-dimensional, large-sample-size data by using BHHT to map variants associated with serum urate (SU) using data from the UK-Biobank (n ∼ 300,000) involving ultra-high-density SNP genotypes (15 million variants). Fifth, we provide open-source software that implements the methods described in the study using algorithms that scale to ultra-high-dimensional, biobank-sample-size data.

2.2 Bayesian Hierarchical Hypothesis Testing

Consider a multiple regression model of the form,

y_i = \mu + x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{ip}\beta_p + \epsilon_i \quad (i = 1, \dots, n) \quad (2.1)
+ ๐‘ฅ๐‘– ๐‘ ๐›ฝ๐‘ + ๐œ–๐‘– (๐‘– = 1, ..., ๐‘›) (2.1) where ๐‘ฆ๐‘– is a phenotype of interest, ๐‘ฅ๐‘–1, ..., ๐‘ฅ๐‘– ๐‘, are omics features, ๐›ฝ1, ..., ๐›ฝ๐‘ are individual feature effects, and ๐œ–๐‘– is an error term for the i-th individual in the sample of size ๐‘›. Bayesian Variable Selection models often use IID priors from the Spike-and-Slab family, which include a point of mass at zero (i.e., no-effect, ๐›ฝ ๐‘— = 0) and a slab (non-zero effect) of the form ๐‘(๐›ฝ ๐‘— ) = 1(๐›ฝ ๐‘— = 0) (1 โˆ’ ๐œ‹) + ๐œ‹ ๐‘(๐›ฝ ๐‘— |๐œƒ) ( ๐‘— = 1, ..., ๐‘) (2.2) Above, ๐œ‹ is the prior probability of a non-zero effect, ๐‘(๐›ฝ ๐‘— |๐œƒ) is a density function (the โ€˜slabโ€™) describing the distribution of non-null effects, and 1(๐›ฝ ๐‘— = 0) = {1 ๐‘– ๐‘“ ๐›ฝ ๐‘— = 0; 0 ๐‘œ๐‘กโ„Ž๐‘’๐‘Ÿ๐‘ค๐‘–๐‘ ๐‘’} is an indicator function. In this study, we use the BGLR R-package [24] which implements various models from the spike slab family with Gaussian and scaled-t slabs. The software uses a flat prior for the intercept and allows treating hyper-parameters controlling model sparsity and shrinkage (the error variance, and the parameters indexing the prior distribution of effect {๐œ‹, ๐œƒ}) as unknown and produces samples from the posterior distribution of effects which are marginal with respect to hyper-parameters. 7 The samples from the posterior distribution of BVS models can be used to estimate the (marginal) posterior probability of non-zero effects for each of the predictors: ๐‘ƒ(๐›ฝ ๐‘— โ‰  0|๐‘‘๐‘Ž๐‘ก๐‘Ž) = (cid:100)๐‘ƒ๐ผ ๐‘ƒ ๐‘— = ๐พ โˆ’1 (cid:205)๐พ ๐‘˜=1 1(๐›ฝ ๐‘— ๐‘˜ โ‰  0). These posterior probabilities can be used to identify features that are confidently associated with an outcome, e.g., those with (cid:100)๐‘ƒ๐ผ ๐‘ƒ ๐‘— โ‰ฅ 0.9 corresponding to a local-FDR โ‰ค 0.1. However, as noted, marginal inferences can be under-powered in the presence of collinearity. We will demonstrate this behavior in detail with a GWA study example in Sec 2.4. 2.2.1 Credible Set Inference In the presence of collinearity, more powerful inferences can be obtained by making inferences about sets of parameters i.e., using credible set methods, e.g., [11], [32]. A level-๐›ผ credible set, defined by Wang et al. (2020) [32], is a set of features whose joint probability of association is greater that ๐›ผ. The joint probability of association of a set is the probability that at least one of the features in the set is associated with the outcome. For a set of features, the set-PIP denoted by ๐‘ƒ๐ผ ๐‘ƒฮฉ of a set of features ฮฉ can be defined as follows. ๐‘(๐‘Ž๐‘ก ๐‘™๐‘’๐‘Ž๐‘ ๐‘ก ๐‘œ๐‘›๐‘’ ๐›ฝ ๐‘— โ‰  0 ๐‘คโ„Ž๐‘’๐‘Ÿ๐‘’ ๐‘ฅ ๐‘— โˆˆ ฮฉ|๐‘‘๐‘Ž๐‘ก๐‘Ž) = 1 โˆ’ ๐‘(๐‘Ž๐‘™๐‘™ ๐›ฝ ๐‘— = 0 ๐‘คโ„Ž๐‘’๐‘Ÿ๐‘’ ๐‘ฅ ๐‘— โˆˆ ฮฉ|๐‘‘๐‘Ž๐‘ก๐‘Ž) These set-PIPs can be estimated using samples from the posterior distribution by simply counting the proportion of posterior samples that contain at least one of the effects in the set different than zero, for a set ฮฉ = { ๐‘—, ๐‘— โ€ฒ}, (cid:100)๐‘ƒ๐ผ ๐‘ƒฮฉ = 1 โˆ’ ๐พ โˆ’1 ๐พ โˆ‘๏ธ (cid:214) 1( ห†๐›ฝ ๐‘— ๐‘˜ = 0) (2.3) ๐‘˜=1 Above, ๐‘˜ = 1, ..., ๐พ is an index for posterior samples. However, as noted before, given ๐›ผ, we ๐‘— โˆˆฮฉ lack algorithms that can identify the credible sets from posterior samples, offering high power, with low error rate, and as fine-mapping precision (i.e., small CS) as allowed by the data. 
To fill this gap, we propose using Bayesian Hierarchical Hypothesis Testing (BHHT).

2.2.2 Bayesian Hierarchical Hypothesis Testing

Bayesian Hierarchical Hypothesis Testing (BHHT) tests a sequence of nested hypotheses associated with hierarchical clusters of the inputs (x_{i1}, ..., x_{ip}). The procedure is schematically described in Fig. 2.1, which depicts an example involving four features clustered according to a binary tree.

Figure 2.1 Schematic representation of Hierarchical Hypothesis Testing. The example involves 4 features (X_1-X_4). Testing starts at the top node and proceeds to lower levels of the hierarchy whenever a null hypothesis at the parent node is rejected. Red (white) squares depict significant (non-significant) test results; the path involving significant results is highlighted in red.

Testing starts at the top node; every time a hypothesis is rejected, the two nested hypotheses involving the effects under the left and right branches (aka 'children') are tested. In Fig. 2.1, nodes with red (white) squares denote null hypotheses that were (were not) rejected. The path containing rejected nulls is marked in red. In the example in Fig. 2.1, testing involves four steps:

(i) H_{0(1)}: β_1 = β_2 = β_3 = β_4 = 0 vs H_{a(1)}: at least one coefficient different from 0. Decision: reject H_{0(1)}.

(ii) H_{0(2L)}: β_1 = 0 vs H_{a(2L)}: β_1 ≠ 0, and H_{0(2R)}: β_2 = β_3 = β_4 = 0 vs H_{a(2R)}: at least one of the three coefficients ≠ 0. Decisions: do not reject H_{0(2L)} and reject H_{0(2R)}.

(iii) H_{0(3L)}: β_2 = 0 vs H_{a(3L)}: β_2 ≠ 0, and H_{0(3R)}: β_3 = β_4 = 0 vs H_{a(3R)}: either β_3 or β_4 or both coefficients ≠ 0. Decisions: reject both H_{0(3L)} and H_{0(3R)}.

(iv) H_{0(4L)}: β_3 = 0 vs H_{a(4L)}: β_3 ≠ 0, and H_{0(4R)}: β_4 = 0 vs H_{a(4R)}: β_4 ≠ 0. Decision: no null rejected.

The final discovery set includes two elements: X_2 (a singleton) and a duplet {X_3, X_4}.

2.2.3 Error Control

We focus on controlling the Bayesian FDR (BFDR); controlling the BFDR at an α-level also provides FDR control (over conceptual repeated sampling) at the same level (see Lemma 1 and its proof in the Appendix). We consider decision rules that control the BFDR at three levels:

(i) Local (or node) BFDR (LFDR): For each test, the local BFDR is the posterior probability of the null given the data: LFDR_Ω = 1 - PIP_Ω, where Ω is the set of all coefficients under that node. This quantity can be estimated for any hypothesis using expression (2.3). Thus, for LFDR control, in BHHT a null hypothesis at a node is rejected if the LFDR_Ω of the node is smaller than α. Note 1 of the Appendix provides a formal definition of the algorithm we use to traverse a tree, rejecting the null hypothesis provided that the LFDR_Ω is smaller than a tolerable threshold (α ∈ [0, 1], e.g., α = 0.05). The algorithm renders a discovery set, DS(α) = {Ω_1, Ω_2, ..., Ω_D}, consisting of the smallest nodes (i.e., deepest in the tree) that have LFDR_Ω < α. This algorithm controls the posterior probability of Type-I error for each element of the discovery set. The Bayesian FDR of the discovery set, denoted DS-BFDR, is simply the average of the LFDRs of the elements of the discovery set.
Formally, for a discovery set DS(α) = {Ω_1, Ω_2, ..., Ω_D}, the DS-BFDR is defined as:

DS\text{-}BFDR = \frac{\sum_{d=1}^{D} p(\text{all } \beta_j = 0 \text{ where } x_j \in \Omega_d \mid data)}{D} \quad (2.4)

It follows that a decision rule that controls the LFDR for each of the elements of the DS yields DSs with a DS-BFDR < α. However, it is known that controlling the LFDRs of each element in the DS can be conservative. Therefore, a more powerful decision rule can be obtained by controlling the DS-BFDR directly.

(ii) Discovery-Set BFDR: To control the DS-BFDR in BHHT we propose an algorithm that conducts tests over a grid of threshold values for the LFDRs (α̃_1 < α̃_2 < ...). For each value in the grid, we determine DS(α̃_j) and estimate the DS-BFDR using (2.3). If the estimated DS-BFDR is below the target threshold (e.g., α = 0.05) we move to the next α̃ in the grid; otherwise, if DS-BFDR > α, we use DS(α̃_{j-1}) as the final DS. To achieve computational efficiency, we set our grid of values to the sorted LFDRs (see Note 2 of the Appendix for a formal description of the proposed algorithm).

(iii) Subfamily-wise BFDR: A decision rule that controls the DS-BFDR does not bound the LFDR in each cluster. Hence, there can be cases where clusters with low inclusion probabilities are included in a DS not based on their own merit but by benefiting from the fact that other clusters have very high inclusion probabilities. This behavior has been noted in multiple testing [28], where many true rejections can inflate the denominator of the FDR so that it allows a small number of false discoveries. Therefore, following [34], we consider a third procedure that controls the subfamily-wise FDR. In this setting, a subfamily is the set of hypotheses in the hierarchy that share the same parent. The subfamily FDR method rejects the null at a node provided that the average local FDR of the subfamily is below the chosen FDR threshold. This approach is more conservative than a method controlling the DS-BFDR in (ii) and less conservative than those based on the node-wise BFDR in (i). Note 3 in the Appendix describes an algorithm to control the subfamily-wise BFDR.

2.2.4 Demonstration with a simplified example

To demonstrate the application of BHHT we present a simplified example involving mouse genomic data from the Wellcome Trust [31]. The data used in the example are available with the R-package BGLR [24] and consist of genotypes of 1,814 mice. For our example, we selected the first p = 300 SNPs from chromosome 1. We simulated a phenotype (y) under the linear regression model (2.1) with Gaussian iid error terms. Only four of the 300 SNPs had non-zero effects. We fixed the error variance to ensure that the four causal variants explained 10% of the variance of the phenotype (y). Subsequently, we used the BGLR R-package to fit a Bayesian linear regression model to the simulated data set with an independent point-mass spike and Gaussian slab prior (2.2) (also known as "BayesC" for genomic prediction, [14]) on the regression coefficients. The scripts used to simulate and analyze the data are provided in the Supplementary Materials file.
Fig. 2.2 shows the posterior probability of non-zero effect (PIP) for each SNP; we observe that none of the SNPs reaches the 0.95 PIP cut-off (dashed horizontal line, corresponding to a local BFDR of 0.05). The reason why no individual variant achieves a high PIP is the high collinearity among variants near the ones with causal effects. We then apply the proposed BHHT algorithm that controls the DS-BFDR (Note 2) using α = 0.05. BHHT with those parameters produced a DS with four elements (Table 2.1), each with a high set-PIP (see the vertical shaded areas in Fig. 2.2 and Table 2.1). Each of the elements of the DS contained a causal variant, plus variants in high LD with it (Table 2.1).

Figure 2.2 Posterior probability of non-zero effect by SNP and Credible Set (CS) identification using Bayesian Hierarchical Hypothesis Testing (BHHT). Each point represents a SNP, the vertical axis gives the posterior probability of non-zero effect (PIP), and the horizontal axis is the map position in mega-base-pairs (Mbp). The purple vertical bars give the joint posterior probability of association of each of the CS identified through BHHT (see Table 2.1 below).

Features included                                                  Set size   Set-PIP   Local BFDR
63, 64, 65, 66, 68                                                     5        0.96        0.05
125, 126, 129, 131                                                     4        0.98        0.02
221, 223, 218, 219, 220, 222, 216, 217, 224, 225, 226, 228, 227      13        0.96        0.04
270, 272, 271, 266, 267, 265                                           6        1.00        0.00
Entire Discovery Set                                                  28        0.9721      0.0280

Table 2.1 Four credible sets discovered through Bayesian Hierarchical Hypothesis Testing using the node-BFDR criterion (variants with a non-zero effect in the simulation are highlighted in bold).

2.3 Simulation Study

We conducted an extensive simulation study to evaluate the power-FDR performance of BHHT under scenarios that varied: (i) the degree of collinearity (moderate or high), (ii) the sample size (n = 10,000 and n = 50,000), (iii) model sparsity (5 or 20 non-zero effects), and (iv) mapping resolution (the maximum cluster size allowed in the discovery set; we varied this from 1 to 10 features). We also used this simulation to benchmark BHHT against inferences based on marginal PIPs and against SuSiE [32], a state-of-the-art procedure for CS inference proposed for GWA studies.

For each Monte Carlo replicate we simulated 525 predictors using a Markov process such that the correlation between predictors was Cor(x_{i(j)}, x_{i(j+k)}) = ρ^k. Here, i is an index for the subject, j is an index for the feature in the sequence, and k is the lag between predictors. We considered two scenarios for collinearity: ρ = 0.45 (moderate) and ρ = 0.90 (high). Then, we simulated a response y_i according to the linear model (2.1) with either 5 or 20 of the 525 predictors having non-zero effects (S = 5 or 20). The errors were iid Gaussian with zero mean and a variance tuned such that the proportion of variance of the outcome explained by the signal (aka heritability, h^2) was 0.0125. Further details about the simulation can be found in the Appendix.

We analyzed the simulated data using a Bayesian linear model with an independent spike-slab prior (described in Sec. 2.2) for the regression coefficients. Models were fitted using the BGLR function of the homonymous R-package BGLR [24]. Then, we clustered the features into a hierarchy using hierarchical clustering ([15], [33]) and applied BHHT (described in Sec. 2.2) with each of the three error control methods previously described (i.e., node-BFDR, DS-BFDR, and subfamily-wise BFDR).
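The R sketch below illustrates one way the clustering-plus-testing pipeline just described could be organized. It is an illustrative sketch under stated assumptions (X is the n x p predictor matrix; B is a K x p matrix of posterior effect draws with exact zeros for excluded coefficients; the hierarchy is built with base-R hclust()), it implements only the node-BFDR (local) error control, and it is not the authors' Bayes_HHT implementation.

```r
bhht_node_bfdr <- function(X, B, alpha = 0.05) {
  # Hierarchical clustering of the features based on their correlations.
  tree <- hclust(as.dist(1 - abs(cor(X))), method = "ward.D2")
  # Set-PIP of a cluster: share of posterior draws with at least one non-zero effect.
  set_pip <- function(members) mean(apply(B[, members, drop = FALSE] != 0, 1, any))
  # Feature indices under a node of tree$merge (negative entries denote single features).
  members_of <- function(node) {
    if (node < 0) return(-node)
    unlist(lapply(tree$merge[node, ], members_of))
  }
  discoveries <- list()
  visit <- function(node) {   # 'node' is assumed to have been rejected already
    if (node < 0) {           # a rejected singleton is reported as a discovery
      discoveries[[length(discoveries) + 1]] <<- -node
      return(invisible(NULL))
    }
    children <- tree$merge[node, ]
    rejected <- vapply(children,
                       function(ch) (1 - set_pip(members_of(ch))) < alpha,
                       logical(1))
    if (!any(rejected)) {     # smallest rejected node on this path: report it
      discoveries[[length(discoveries) + 1]] <<- sort(members_of(node))
    } else {
      for (ch in children[rejected]) visit(ch)
    }
  }
  root <- nrow(tree$merge)
  if ((1 - set_pip(members_of(root))) < alpha) visit(root)
  discoveries               # list of credible sets (vectors of feature indices)
}
```

In practice, B would be obtained by configuring the spike-slab sampler to store the draws of the regression coefficients and reading them back as a matrix; the DS-BFDR and subfamily-wise variants replace the node-level threshold with the grid-based and subfamily-average rules described above.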
We considered two benchmarks for BHHT: (i) marginal hypothesis testing, in which we used marginal SNP-PIPs to produce discovery sets (see Note 4 of the Appendix for further details), and (ii) SuSiE, a credible-set variable selection procedure proposed in [32]. This method uses a prior distribution that leads to the formation of credible sets. We fitted SuSiE and obtained the credible sets using the R-package susieR [32].

We consider a credible set a true discovery if it contains at least one causal variant (i.e., a variant with a non-zero effect in the simulation). However, to prevent favoring methods that yield large credible sets (i.e., poor mapping resolution), we conducted power-FDR evaluations restricting the maximum discovery set size to either 3, 5, or 10 SNPs.

2.3.1 Power-FDR performance

Figure 2.3 Power vs FDR curves for different simulation settings. n denotes the sample size and r denotes the correlation between adjacent SNPs. The left and right panels represent scenarios with 5 and 20 causal variants, respectively. We hold h^2 fixed at 1.25% for all settings. As collinearity increases, credible set inference gains significant power over individual SNP-level inference. Furthermore, the proposed method achieves higher power than SuSiE when there is a greater number of causal variants in the true model.

Fig. 2.3 shows the power-FDR performances of SNP-PIP, BHHT (using the DS-BFDR error control method), and SuSiE. The results in Fig. 2.3 were obtained by restricting the maximum discovery set size to 5 SNPs. When the number of causal effects was equal to 5 and collinearity was moderate (top-left panel in Fig. 2.3), we observed that all methods performed similarly, with the method based on marginal probabilities (SNP-PIP) performing only slightly worse in the scenario with n = 10K. However, for the same trait architecture, when collinearity was high (ρ = 0.9, the bottom row of the left panel), the two credible set methods (DS-BFDR and SuSiE) outperformed marginal hypothesis testing (SNP-PIP), highlighting the power of credible set inference. In the scenario with only 5 causal variants and high collinearity, SuSiE performed slightly better than BHHT using the DS-BFDR method. However, when we considered a model with more causal variants (20 SNPs with non-zero effect, panel B in Fig. 2.3), the opposite happened, that is, BHHT outperformed SuSiE. This may reflect a limitation of the variational algorithm in SuSiE, which in multi-modal problems can get stuck at local optima (see the Discussion of [32]). Regardless of the scenario and method, as one would expect, for any FDR level, power was higher with the largest sample size.

The results in Fig. 2.3 were obtained by restricting the maximum credible set size to 5 variants. We also performed the same evaluation restricting the maximum credible set size to 3 or 10 variants. The results from these analyses are presented in Fig. B2.1 and Fig. B2.2, respectively. The overall results were conceptually similar to the ones presented in Fig. 2.3 and previously discussed, with a few differences: (i) as expected, imposing a more stringent restriction on the maximum cluster size reduced the power advantage of the CS inference methods relative to marginal SNP-PIP inferences; (ii) when the maximum CS size was restricted to be only 3 SNPs, there were small differences
between SuSiE and BHHT; and, finally, (iii) when the maximum CS size was allowed to be 10 SNPs, BHHT outperformed SuSiE by a sizable margin in the scenario with high LD and 20 causal variants.

The results from BHHT previously discussed were based on the DS-BFDR error control procedure. We also evaluated BHHT using the node (local) BFDR and the subfamily-wise BFDR. Overall, there were no noticeable differences between the three error control methods (see Fig. B2.3 for results using the three methods for error control, for the same scenarios displayed in Fig. 2.3).

2.3.2 FDR control

In Sec. 2.3.1, we observed that DS-BFDR can achieve high power at low FDR even in the scenario of high collinearity. Another important aspect of the method is how accurately it estimates the true FDR. In each panel of Fig. 2.4, we plot the empirical FDR vs the BFDR (estimated from posterior samples) for the four simulation settings. For all settings, we observed that all methods accurately controlled the FDR at the desired level. Note that the different methods control different error rates. In SNP-PIP and DS-BFDR, we control the global error rate. Hence, we can see that, e.g., at α = 0.1, the estimated BFDR for the two methods is much closer to α (the orange and green lines reach very close to 0.1) than for SuSiE (blue line) for all settings.

Figure 2.4 FDR vs Bayesian FDR plots for different simulation settings. n denotes the sample size and r denotes the correlation between adjacent SNPs. The left and right panels represent scenarios with 5 and 20 causal variants, respectively. We hold h^2 fixed at 1.25% for all settings. All the methods accurately control the FDR.

2.4 Real Data Application

To demonstrate feasibility, we applied the methods discussed in the previous sections to a GWAS for serum urate using the UK Biobank dataset. The dataset contains genotype and phenotype information for n ∼ 300,000 unrelated individuals of European ancestry. Hyperuricemia is a prevalent condition among adults in developed countries and is a risk factor for gout [29]. Importantly, hyperuricemia has also been found to be associated with cardiovascular disease [2] and many of the conditions that define metabolic syndrome. Furthermore, serum urate is moderately heritable (30-40% in [26]). Previous studies have reported many genes and genomic regions associated with serum urate; however, reported GWAS discoveries still explain only a fraction of the heritability of the trait.

We first log-transformed the serum urate levels to make their distribution symmetric. We then adjusted them for age, sex, center, and the top 10 SNP-derived principal components. We fitted models using (1) ∼785,000 SNPs from the SNP arrays used in the UK Biobank (hereinafter we refer to this set as the "calls") and (2) ∼15 million SNPs from imputed genotypes provided by the UK-Biobank (the imputed SNPs were filtered to satisfy minor-allele frequency ≥ 0.1% and missing call rate ≤ 5%). Due to the high density, the imputed genotypes have LD blocks with near-perfect collinearity; therefore, the analyses involving the ultra-high-density SNPs serve as an example of how the proposed methods can work under extreme collinearity.

It is computationally infeasible to fit a multiple regression model jointly with millions of SNPs. However, the unrelated white Europeans from the UK Biobank have almost no population structure.
Under these conditions, LD decays rather quickly with physical distance, reaching values very close to zero at 500 kilobase pairs to 1 megabase pairs. Therefore, following [8], we conducted the association study by fitting local Bayesian regression models to overlapping segments throughout the genome. Details of the modeling strategy are discussed in the Appendix.

We first selected individual SNPs with marginal hypothesis testing (Note 4) with BFDR ≤ 0.05. This approach yielded 123 discoveries in the analysis that used the genotyped SNPs, and 98 in the analysis that used the imputed SNPs (Table 2.2). The fact that fewer findings were obtained with the imputed SNPs, which include the genotyped SNPs as a subset, may seem counterintuitive. However, this happened because LD is much stronger in the imputed SNPs than in the calls. Under these conditions, due to high LD, many SNPs that are in regions harboring causal variants do not reach a high PIP. This problem can be overcome by identifying sets of SNPs with a high joint inclusion probability.

Therefore, we used the function segments() of the BGData R-package [12] to identify chromosome segments harboring SNPs with elevated PIP. Specifically, we identified segments with SNPs that had PIP > 0.05 and were at a mutual distance smaller than 100 kbp. Finally, we applied BHHT with DS-BFDR error control (Note 2) within each of these segments to identify credible sets. This approach led to the identification of 43 additional discoveries for the calls and 82 for the imputed genotypes. The physical positions of these segments are displayed for the imputed data set, together with the singletons, in Fig. B2.4, where singletons are represented by blue lines and segments are shown as bands on a yellow-to-red scale.

Discoveries (by SNP-set)      Calls   Imputed
Singletons                      123        98
Credible sets:
  Size in [1, 5)                 27        27
  Size in [5, 10)                 4        11
  Size in [10, 20)                1        13
  Size >= 20                     11        31
  Sub-total                      43        82
Total                           166       180

Table 2.2 Number of discoveries for serum urate by SNP-set used and type. The singletons and credible sets are from applying the marginal testing and BHHT algorithms, respectively, with α = 0.05.

For the calls, the use of CS inference increased the number of discoveries by 35%, and for the imputed genotypes, the increase in the number of discoveries was 83% (Table 2.2). These results highlight the importance of using credible set inferences when using high- and ultra-high-density genotypes. Most of the credible sets identified through BHHT were small (between 1 and 4 SNPs); however, some credible sets involved more than 20 SNPs (Table 2.2).

To provide further insight into how BHHT works, in Fig. 2.5(a) we show the SNP-PIPs of two chromosome segments identified in the GWAS, with the correlation heatmap added below the horizontal axis. The solid red points show the discovery set obtained by applying BHHT to the imputed SNP data. None of the five SNPs identified in the segment in chromosome 2 that is displayed in the left panel of Fig. 2.5(a), and none of the four SNPs forming the CS identified in chromosome 6, had a high PIP; however, in both cases, jointly, they achieve a high set-PIP.

Figure 2.5 (a) Plot of the PIPs of the SNPs of two chromosome segments with the LD structure given below the horizontal axis. The solid red points indicate the discovery set of SNPs from BHHT with DS-BFDR control at α = 0.05. (b) Correlation matrix between the discovery set SNPs of the segments depicted in panel (a).
Fig. 2.5(b) shows the LD between the SNPs forming each of the sets. We observe that the procedure has managed to discover a small set of SNPs that are in very high LD with each other and jointly have a high set-PIP.

2.5 Discussion

In this study, we propose a Bayesian hierarchical hypothesis testing (BHHT) approach for identifying credible sets in the context of GWAS with large datasets. In the not-too-distant future, the UK-Biobank (and many other large-scale projects) will release full genome sequences for almost half a million individuals. The availability of whole-genome sequence genotypes will bring unprecedented opportunities for fine mapping of disease risk alleles; at least in principle, fully sequenced genomes will enable the identification of causal variants without relying on LD between causal variants and those included in a genotyping array. However, whole-genome-sequence-derived SNPs will be ultra-high-dimensional (tens of millions of SNPs), and variants within a short genome distance will be highly correlated, making fine mapping increasingly challenging. Analysis based on marginal association testing (e.g., single-SNP-phenotype association testing) will lead to many discoveries, but the vast majority of such discoveries will not be risk loci; instead, these will be SNPs in LD with risk variants.

To refine the results of marginal association tests, a common approach is to apply Bayesian variable selection methods to the SNP segments discovered using marginal association tests. When this approach is used, risk loci are often identified using PIPs. However, as we show in this study, when LD is extremely high, the use of individual SNP PIPs can lead to a loss of power. Furthermore, the issue is aggravated when the SNPs are highly dense. This problem can be addressed using the BHHT approach described in this study. BHHT can be used to identify individual SNPs with high PIP and credible sets consisting of SNPs that are jointly confidently associated with the trait being studied. We have shown through simulations and real data analysis that these procedures can provide (1) high power with low FDR, (2) accurate error control, and (3) fine-mapping resolution. We further show that these approaches gain significant power over tests for individual effects, which are the commonplace practice in GWAS.

In linear models, collinearity poses various challenges. The most popular variable selection procedures, e.g., lasso [30] and elastic net [35], tend to produce a discovery set with many false discoveries when the features with non-zero effects are highly correlated with other features. Therefore, studies in this area have changed the objective from selecting singletons to selecting credible sets, which capture the uncertainty in selecting from a group of highly correlated features. In Bayesian variable selection, one approach (used in the SuSiE [32] method) is to specify a prior distribution that will lead to the identification of credible sets. Previous studies have shown that SuSiE outperforms the other procedures in the literature in terms of power, FDR, and size of credible sets (e.g., [32], [27]). The problem of variable selection under collinearity has also been discussed in the frequentist literature. Recent methods such as KnockoffZoom [27] applied hierarchical testing to find the credible sets. Although our approach is similar, the KnockoffZoom procedure controls the FDR at each resolution separately. Therefore, theoretically, it does not guarantee control over the discovery set FDR.
In addition, it depends on the distribution of the covariates, which for many applications is difficult to estimate. According to [27], KnockoffZoom performs like SuSiE (Figure 4 in [27]) in terms of power; however, SuSiE achieved higher-resolution discoveries. A potential advantage of the BHHT presented in this study is that, unlike SuSiE, BHHT can be used with any variable selection prior; thus, as shown in our simulations, BHHT can outperform SuSiE if the assumptions made by the SuSiE prior do not hold. On the other hand, BHHT can sometimes produce very large credible sets; this is particularly a problem when one uses a very low FDR threshold (e.g., 0.01), which forces the method to include many SNPs in the set to meet such a threshold. SuSiE does not have this limitation, and hence in some cases it can discover smaller credible sets. For this reason, in Fig. 2.3, when S = 5 and r = 0.90, we see that SuSiE shows a slight power gain over the multi-resolution tests.

We evaluated BHHT with three FDR-control methods (local-BFDR, DS-BFDR, and subfamily-wise BFDR). Overall, the simulation and real-data analyses showed that all criteria performed similarly. This may seem counterintuitive because a global FDR-controlling procedure (Note 2) is likely to have more power than a procedure that controls the local FDR (Note 1 and Note 3). One reason for such behavior is the distribution of the set-PIP values. We have observed that across features these probabilities exhibit a U-shaped distribution, with a high frequency of set-PIPs very close to zero for clusters not harboring causal variants and set-PIPs very close to one for those clusters harboring one or more causal variants. With such a distribution of the set-PIPs, global and local control of the FDR lead to similar decision rules.

There are many interesting avenues that can lead to further research in BHHT and credible set inference. To address the issue of discovering large credible sets, these approaches can be extended by considering a penalty based on cluster size in the FDR calculation. One key challenge in multi-resolution inference is how to balance power-FDR performance and mapping resolution (i.e., credible set size). Here we followed [20], who used a simple approach consisting of comparing power-FDR performance while constraining the credible set size to be fixed according to the hierarchy. However, there seems to be an interesting avenue of research focused on considering a constrained search problem with a multi-objective function based on maximizing power, minimizing Type-I error, and maximizing mapping resolution simultaneously.

On the application side, the methods discussed in this study can be applied to a wide range of fields beyond GWAS. An interesting application area is the field of neuroscience, which focuses on identifying regions of the human brain related to neural disorders. Modern imaging techniques such as tractography provide data on the individual neuronal tracts in the human brain, which has a highly complex correlation structure. BHHT can be applied to these complex datasets to better understand the relationship between tracts and the disorder. In general, BHHT is suited to the area of image analysis. Since images have structure in which nearby pixels tend to be strongly correlated, one can apply the multi-resolution tests to identify groups of pixels exhibiting some properties of interest.
Likewise, we believe there may be interesting applications of BHHT for analysis of high-dimensional phenotype/risk factors, where in this method could help identifying sets of correlated risk factors that are jointly associated with an outcome. Finally, in a GWAS setting, there are some potentially interesting extensions of the method proposed in this study for mapping sets of loci that are simultaneously associated with more than one trait or disease (i.e., for the study of pleiotropy). In summary, the BHHT approach proposed in this paper are shown to provide accurate and powerful inference in a wide range of scenarios. With the use of modern software and with access 22 to computing clusters such as those available at many universities and research institutions, this method can be applied to ultra-high dimensional data sets. 2.6 Software availability The code for producing the results can be found in the Supplementary Materials file and at Github https://github.com/AnirbanSamaddar/Bayes_HHT. 23 BIBLIOGRAPHY [1] Barber, R. F. and Ramdas, A. (2016). The p-filter: multi-layer fdr control for grouped hypothe- ses. [2] Choi, H. K., De Vera, M. A., and Krishnan, E. (2008). Gout and the risk of type 2 diabetes among men with a high cardiovascular risk profile. Rheumatology (Oxford), 47(10):1567โ€“1570. [3] de los Campos, G., Gianola, D., and Allison, D. B. (2010). Predicting genetic predisposition in humans: the promise of whole-genome markers. Nat Rev Genet, 11(12):880โ€“886. [4] de los Campos, G., Naya, H., Gianola, D., Crossa, J., Legarra, A., Manfredi, E., Weigel, K., and Cotes, J. M. (2009). Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics, 182(1):375โ€“385. [5] de los Campos, G., Vazquez, A. I., Hsu, S., and Lello, L. (2018). Complex-trait prediction in the era of big data. Trends in Genetics, 34(10):746โ€“754. [6] Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Institute of Mathematical Statistics Monographs. Cambridge University Press. [7] Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101โ€“148. [8] Funkhouser, S. A., Vazquez, A. I., Steibel, J. P., Ernst, C. W., and Los Campos, G. d. (2020). Deciphering sex-specific genetic architectures using local bayesian regressions. Genetics, 215(1):231โ€“241. 32198180[pmid]. [9] Genovese, C. and Wasserman, L. (2004). Bayesian Frequentist Multiple Testing. [10] George, E. I. and McCulloch, R. E. (1997). Approaches for bayesian variable selection. Statistica Sinica, 7(2):339โ€“373. [11] Ghosh, J. and Ghattas, A. E. (2015). Bayesian variable selection under collinearity. The American Statistician, 69(3):165โ€“173. [12] Grueneberg, A. and de los Campos, G. (2019). Bgdata - a suite of r packages for genomic analysis with big data. G3: Genes, Genomes, Genetics, 9(5):1377โ€“1383. [13] Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780โ€“ 1815. [14] Habier, D., Fernando, R. L., Kizilkaya, K., and Garrick, D. J. (2011). Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics, 12(1):186. 24 [15] Hartigan, J. A. (1975). Clustering Algorithms. John Wiley & Sons, Inc., USA, 99th edition. [16] Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Frequentist and bayesian strategies. Ann. 
Statist., 33(2):730โ€“773. [17] Kim, H., Grueneberg, A., Vazquez, A. I., Hsu, S., and de los Campos, G. (2017). Will big data close the missing heritability gap? Genetics, 207(3):1135โ€“1145. [18] Lee, Y., Luca, F., Pique-Regi, R., and Wen, X. (2018). Bayesian multi-snp genetic association analysis: Control of fdr and use of summary statistics. bioRxiv. [19] Makowsky, R., Pajewski, N. M., Klimentidis, Y. C., Vazquez, A. I., Duarte, C. W., Allison, D. B., and de los Campos, G. (2011). Beyond missing heritability: prediction of complex traits. PLoS Genet, 7(4):e1002051. [20] Meinshausen, N. (2008). Hierarchical testing of variable importance. Biometrika, 95(2):265โ€“ 278. [21] Meuwissen, T. H. E., Hayes, B. J., and Goddard, M. E. (2001). Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics, 157(4):1819โ€“1829. [22] Mรผller, P., Parmigiani, G., and Rice, K. (2006). Fdr and bayesian multiple comparisons rules. Johns Hopkins University, Dept. of Biostatistics Working Papers. Working Paper 115. [23] Oโ€™Hara, R. B. and Sillanpรครค, M. J. (2009). A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85 โ€“ 117. [24] Pรฉrez, P. and de los Campos, G. (2014). Genome-Wide Regression and Prediction with the BGLR Statistical Package. Genetics, 198(2):483โ€“495. [25] Renaux, C., Buzdugan, L., Kalisch, M., and Bรผhlmann, P. (2020). Hierarchical inference for genome-wide association studies: a view on methodology with software. Computational Statistics, 35(1):1โ€“40. [26] Reynolds, R. J., Irvin, M. R., Bridges, S. L., Kim, H., Merriman, T. R., Arnett, D. K., Singh, J. A., Sumpter, N. A., Lupi, A. S., and Vazquez, A. I. (2021). Genetic correlations between traits associated with hyperuricemia, gout, and comorbidities. European Journal of Human Genetics, 29(9):1438โ€“1445. [27] Sesia, M., Katsevich, E., Bates, S., Candรจs, E., and Sabatti, C. (2020). Multi-resolution localization of causal variants across the genome. Nature Communications, 11(1):1093. [28] Siegmund, D. O., Zhang, N. R., and Yakir, B. (2011). False discovery rate for scanning statistics. Biometrika, 98(4):979โ€“985. 25 [29] Sun, M., Vazquez, A. I., Reynolds, R. J., Singh, J. A., Reeves, M., Merriman, T. R., Gaffo, A. L., and Los Campos, G. d. (2018). Untangling the complex relationships between incident gout risk, serum urate, and its comorbidities. Arthritis Res Ther, 20(1):90. [30] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267โ€“288. [31] Valdar, W., Solberg, L. C., Gauguier, D., Cookson, W. O., Rawlins, J. N. P., Mott, R., and Flint, J. (2006). Genetic and Environmental Effects on Complex Traits in Mice. Genetics, 174(2):959โ€“984. [32] Wang, G., Sarkar, A., Carbonetto, P., and Stephens, M. (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(5):1273โ€“1300. [33] Ward Jr., J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236โ€“244. [34] Yekutieli, D. (2008). Hierarchical false discovery rateโ€“controlling methodology. Journal of the American Statistical Association, 103(481):309โ€“316. [35] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.

APPENDIX A2

SUPPLEMENTARY METHODS

A2.1 Set up for Lemma 1: Consider the multiple regression problem in (1). We are interested in testing a set of $K$ null hypotheses about the regression coefficients, denoted by $H_{01}, \dots, H_{0K}$. These hypotheses can be simple, for example $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$, $\forall j = 1, \dots, p$, or composite, for example $H_{0j}: \beta_j = \beta_{j'} = 0$ vs. $H_{1j}$: at least one of $\beta_j$ or $\beta_{j'}$ is not zero, $\forall j \neq j' = 1, \dots, p$. Let $d_j$ denote the decision rule for testing $H_{0j}$, where $d_j = 1$ implies a rejection of $H_{0j}$ and $d_j = 0$ implies otherwise. The FDR is defined by

$$FDR = \mathbb{E}_{p(y)}\left(\frac{\#\text{ of true null hypotheses rejected by } d_j, \forall j = 1, \dots, K}{\#\text{ of null hypotheses rejected by } d_j, \forall j = 1, \dots, K}\right)$$

Note that the expectation is with respect to the marginal density $p(y)$, which is not tractable. Therefore, following [9], [22], we use the Bayesian FDR, or BFDR, to estimate the FDR. The BFDR is defined as

$$BFDR = \frac{\sum_{j=1}^{K} p(H_{0j} \mid data)\, d_j}{\sum_{j=1}^{K} d_j}$$

In the above definition, data represents the available training data, that is, $\{y_i, x_{ij} : i = 1, \dots, n;\ j = 1, \dots, p\}$. In the lemma below, we show that the BFDR is an unbiased estimate of the FDR (over conceptual repeated sampling from $p(y)$). Therefore, by controlling the BFDR we control the FDR at a desired level. The proof is a straightforward application of the law of iterated expectation.

Lemma A2.1.1. Under the above set up, $FDR = \mathbb{E}_{p(y)}(BFDR)$.

Proof. From the definition,

$$FDR = \mathbb{E}_{p(y)}\left(\frac{\#\text{ of true null hypotheses rejected by } d_j, \forall j = 1, \dots, K}{\#\text{ of null hypotheses rejected by } d_j, \forall j = 1, \dots, K}\right) = \mathbb{E}_{p(y)}\left(\frac{\sum_{j=1}^{K} 1(H_{0j} \text{ is true})\, d_j}{\sum_{j=1}^{K} d_j}\right)$$

$$= \mathbb{E}_{p(y)}\left(\mathbb{E}_{p(\beta_1, \dots, \beta_p \mid data)}\left(\frac{\sum_{j=1}^{K} 1(H_{0j} \text{ is true})\, d_j}{\sum_{j=1}^{K} d_j}\right)\right) = \mathbb{E}_{p(y)}\left(\frac{\sum_{j=1}^{K} p(H_{0j} \mid data)\, d_j}{\sum_{j=1}^{K} d_j}\right) = \mathbb{E}_{p(y)}(BFDR)$$

The third equality is due to the law of iterated expectation. The fourth equality is valid since the $d_j$, $\forall j = 1, \dots, K$, are functions of the training data, and therefore we can apply the conditional expectation only to the indicators $1(H_{0j} \text{ is true})$, $\forall j = 1, \dots, K$, in the numerator. Hence the proof. □

A2.2 Set up for the algorithms: Let the $m$ hierarchical hypotheses in a tree be denoted by $(H_{01}, H_{11}), \dots, (H_{0m}, H_{1m})$, where $(H_{0j}, H_{1j})$ denotes the null (i.e., none of the predictors in the cluster has an effect) and the alternative (at least one of the predictors belonging to the node has an effect) for the $j$-th cluster. Following Sec 2.1, the posterior samples can be used to estimate the set-PIPs at each cluster, denoted by $v_j = p(H_{1j} \mid data)$, $\forall j = 1, \dots, m$. Note 1 describes the local BFDR control algorithm.

Algorithm A2.1 Local BFDR control
1: Arrange the $v_j$'s in descending order.
Let the ordered set-PIPs be, ๐‘ฃ (๐‘š) โ‰ฅ ๐‘ฃ (๐‘šโˆ’1) โ‰ฅ ยท ยท ยท โ‰ฅ ๐‘ฃ (1); 2: Given threshold ๐›ผ โˆˆ (0, 1), start from ๐‘— = ๐‘š; 3: while (1 โˆ’ ๐‘ฃ ( ๐‘—)) โ‰ค ๐›ผ do 4: 5: end while 6: return Discovery set including rejected hypotheses where no further refinement is possible. Reject null hypothesis corresponding to ๐‘ฃ ๐‘— and set ๐‘— โ† ๐‘— โˆ’ 1; Note 2 describes the DS-BFDR control algorithm. 28 Algorithm A2.2 DS-BFDR control 1: Arrange ๐‘ฃ ๐‘— โ€™s in descending order. Let the ordered set-PIPs be, ๐‘ฃ (๐‘š) โ‰ฅ ๐‘ฃ (๐‘šโˆ’1) โ‰ฅ ยท ยท ยท โ‰ฅ ๐‘ฃ (1); 2: Given threshold ๐›ผ โˆˆ (0, 1), start from ๐‘— = ๐‘š; 3: Using Algorithm A2.4 with threshold ๐‘ฃ ( ๐‘—) obtain discovery set ๐ท๐‘†(๐‘ฃ ( ๐‘—)) and ๐ท๐‘†โˆ’๐ต๐น ๐ท ๐‘…(๐‘ฃ ( ๐‘—)); 4: while ๐ท๐‘† โˆ’ ๐ต๐น ๐ท ๐‘…(๐‘ฃ ( ๐‘—)) โ‰ค ๐›ผ do 5: Using Algorithm A2.4 with threshold ๐‘ฃ ( ๐‘—) obtain discovery set ๐ท๐‘†(๐‘ฃ ( ๐‘—)) and ๐ท๐‘† โˆ’ ๐ต๐น ๐ท ๐‘…(๐‘ฃ ( ๐‘—)); Set ๐‘— โ† ๐‘— โˆ’ 1; 6: 7: end while 8: return Discovery set ๐ท๐‘†(๐‘ฃ ( ๐‘—โˆ’1)). Figure A2.1 Demonstrating subfamily Note 3 describes the subfamily-wise FDR control algorithm. In a hierarchy, a subfamily is defined by the clusters which share the same parent. For example, in Fig. A2.1, the parent node hypothesis is ๐ป0 : ๐›ฝ1 = ๐›ฝ2 = ๐›ฝ3 = ๐›ฝ4 = 0 Vs ๐ป0: at least one coeff. โ‰  0 which is marked by solid red squares. The subfamily of the parent node is marked by solid blue rectangles. The subfamily-wise FDR is defined as the FDR in the discovery set within a subfamily. Algorithm A2.3 Subfamily-wise BFDR control 1: Given threshold ๐›ผ โˆˆ (0, 1), start from the top node of the hierarchy; 2: Apply Algorithm 4 with threshold ๐›ผ to each subfamily to control subfamily-wise BFDR; 3: return Discovery set including rejected hypotheses where there is no rejection in their respec- tive subfamilies. Note 4 Consider the multiple regression problem in (1). We are interested in testing a set of p marginal null hypotheses denoted by ๐ป01, ..., ๐ป0๐‘ where ๐ป0 ๐‘— : ๐›ฝ ๐‘— = 0 vs ๐ป1 ๐‘— : ๐›ฝ ๐‘— โ‰  0, โˆ€ ๐‘— = 1, ..., ๐‘. 29 Given the PIPs denoted by ๐‘ฃ ๐‘—โˆ€ ๐‘— = 1, ..., ๐‘ and a threshold ๐›ผ โˆˆ (0, 1), we outline the below steps for marginal hypothesis testing with BFDR controlled at a desired level: Algorithm A2.4 Marginal hypothesis testing 1: Arrange ๐‘ฃ ๐‘— โ€™s in a descending order. Let the ordered set-PIPs be, ๐‘ฃ ( ๐‘) โ‰ฅ ๐‘ฃ ( ๐‘โˆ’1) โ‰ฅ ยท ยท ยท โ‰ฅ ๐‘ฃ (1); 2: Given threshold ๐›ผ โˆˆ (0, 1), start from ๐‘— = ๐‘; 3: Obtain the discovery set ๐ท๐‘†(๐‘ฃ ( ๐‘—)) consisting hypotheses corresponding to ๐‘— largest ๐‘ฃ ๐‘— values Obtain the discovery set ๐ท๐‘†(๐‘ฃ ( ๐‘—)) consisting hypotheses corresponding to ๐‘— largest ๐‘ฃ ๐‘— and calculate ๐ต๐น ๐ท ๐‘…(๐‘ฃ ( ๐‘—)); 4: while ๐ต๐น ๐ท ๐‘…(๐‘ฃ ( ๐‘—)) โ‰ค ๐›ผ do 5: values and calculate ๐ต๐น ๐ท ๐‘…(๐‘ฃ ( ๐‘—)); Set ๐‘— โ† ๐‘— โˆ’ 1; 6: 7: end while 8: return Discovery set ๐ท๐‘†(๐‘ฃ ( ๐‘—โˆ’1)). 30 APPENDIX B2 SUPPLEMENTARY DATA B2.1 Generation of the simulation study covariates Here, we provide details on the generation of the covariates {๐‘ฅ๐‘– ๐‘— : ๐‘– = 1, ..., ๐‘›; ๐‘— = 1, ..., ๐‘} for the simulation study. As mentioned in the Simulation study section, we have generated covariates where ๐ถ๐‘œ๐‘Ÿ (๐‘ฅ๐‘–( ๐‘—), ๐‘ฅ๐‘–( ๐‘—+๐‘˜)) = ๐œŒ๐‘˜ decays with distance ๐‘˜. 
Here, ๐‘– is an index for the subject, ๐‘— is an index for the feature in the sequence, and ๐‘˜ is the lag-between predictors. To generate the columns of the covariate matrix in this way we followed the below two steps, 1. Generate the first column of the covariate matrix with the n elements from the i.i.d. ๐ต๐‘–๐‘›๐‘œ๐‘š๐‘–๐‘Ž๐‘™ (2, ๐‘Ž). We set ๐‘Ž = 0.2 in all simulations. 2. Generate the j-th column by randomly permuting a random fraction ( ๐‘“ ) of elements of (j-1)-th column. The random fraction ๐‘“ โˆผ ๐‘๐‘’๐‘ก๐‘Ž(๐‘Ž, ๐‘). By changing the values of (๐‘Ž, ๐‘), we simulate different degrees of correlation between adjacent columns. For example, setting (๐‘Ž, ๐‘) = (2, 3) implies correlation โˆผ 0.45 and (๐‘Ž, ๐‘) = (0.5, 3) implies correlation โˆผ 0.9. Note that, as the distance between two columns grows, the correlation decreases as more elements change positions randomly due to permutation. B2.2 Model hyperparameters In this section, we provide details of the hyperparameters used in both the simulation study and the real data application. In this study, we have fitted two Bayesian methods to the data, which are independent spike-and-slab regression and SuSiE. We have used the respective R-packages BGLR [24] and susieR [32] to fit the models. Below are the models and key hyperparameter values used. Independent spike-and-slab model: The Bayesian spike-and-slab model (aka BayesC) puts a 31 Hyperparameters ๐‘Ž ๐‘Ž+๐‘ ๐‘Ž + ๐‘ # chains iterations per chain burn-in Simulation study Real data S = 5 10 ๐‘ 110 4 15000 2500 application 1.01 1000 1000 4 55000 5000 S = 20 30 ๐‘ 110 4 15000 2500 Table B2.1 Table showing hyperparameter values for simulation study and real data application for the spike-and-slab model. point mass and a gaussian slab on the regression coefficients. The model hierarchy is as follows, ๐‘ฆ๐‘– = ๐œ‡ + ๐‘ฅ๐‘–1๐›ฝ1 + ๐‘ฅ๐‘–2๐›ฝ2 + ... + ๐‘ฅ๐‘– ๐‘ ๐›ฝ๐‘ + ๐œ–๐‘– (๐‘– = 1, ..., ๐‘›) ๐œ–๐‘– โˆผ ๐‘ (0, ๐œŽ2 ๐œ– ) ๐‘(๐›ฝ ๐‘— ) = 1(๐›ฝ ๐‘— = 0)(1 โˆ’ ๐œ‹) + ๐œ‹ ๐‘(๐›ฝ ๐‘— |๐œƒ) ( ๐‘— = 1, ..., ๐‘) ๐œ‹ โˆผ ๐‘๐‘’๐‘ก๐‘Ž(๐‘Ž, ๐‘) ๐‘ โˆผ ๐œ’โˆ’2(๐‘‘๐‘“๐‘, ๐‘†๐‘) ๐œŽ2 ๐œ– โˆผ ๐œ’โˆ’2(๐‘‘๐‘“๐œ– , ๐‘†๐œ– ) ๐œŽ2 Here, ๐œ’โˆ’2 denotes the scaled-inverse chi-square distribution. Also, since we used the Gibbs sampler to draw samples from the posterior, a few other important hyperparameters are the number of chains, number of iterations, and number of burn-in steps. We set the hyperparameters in Table B4.1. Note that the only hyperparameter changed between the two simulation settings, ๐‘† = 5and ๐‘† = 10, is the prior probability of inclusion ๐‘Ž/(๐‘Ž + ๐‘). In our simulations, if we hold this to be fixed at 10/๐‘ for ๐‘† = 20 the power of the BHHT method drops marginally. However, the key reason to change this hyperparameter with ๐‘† is to be consistent with SuSiE (described below). For the other parameters, we set them to the default values of the software implementation. The sum of Single Effects model (SuSiE): The SuSiE model apriori assumes that out of the p 32 coefficients L of them are non-zero. The model hierarchy is as follows, ๐‘ฆ๐‘– = ๐œ‡ + ๐‘ฅ๐‘–1๐›ฝ1 + ๐‘ฅ๐‘–2๐›ฝ2 + ... 
+ ๐‘ฅ๐‘– ๐‘ ๐›ฝ๐‘ + ๐œ–๐‘– (๐‘– = 1, ..., ๐‘›) ๐œ–๐‘– โˆผ ๐‘ (0, ๐œŽ2 ๐œ– ) ๐ฟ โˆ‘๏ธ ๐œท = ๐œทโ„“ โ„“=1 ๐œทโ„“ = ๐œธโ„“ ๐›ฝโ„“ ๐œธโ„“ โˆผ ๐‘€๐‘ข๐‘™๐‘ก (1, ๐œ‹) ๐›ฝโ„“ โˆผ ๐‘ (0, ๐œŽ2 0โ„“) In the above set of equations, ๐œท represents the ๐‘ร—1 dimensional vector of regression coefficients. The hyperparameter ๐ฟ here has a similar interpretation as the prior probability of inclusion in the spike-and-slab model. However, the misspecification of ๐ฟ has a significant impact on the power of SuSiE. Especially when๐ฟ << ๐‘†, the SuSiE method tends to lose significant power. The general advice in [32] is to choose ๐ฟ to be higher than ๐‘†. Therefore, for the simulation settings where ๐‘† = 5 and ๐‘† = 10 we fix ๐ฟ = 10 and ๐ฟ = 30, respectively, to be consistent with the spike-and-slab prior. The other hyperparameters in the model are set to the default values of the software implementation. B2.3 Real data modeling strategy In the main paper we described the application of BHHT in a GWAS of the trait serum urate. In this section, we discuss additional details used to fit a Bayesian linear regression model using the calls (โˆผ 785,000 SNPs) and imputed (โˆผ 15 million SNPs) data sets. It is computationally infeasible to fit a multiple linear regression jointly with all the SNPs. Therefore, following [8] we conducted the association study by fitting local Bayesian regression models to overlapping segments throughout the genome. For calls, we created segments of 2900 SNPs by taking disjoint cores of 1500 SNPs and adding overlapping flanking regions of 700 SNPs on both sides. Similarly for imputed data, we considered disjoint cores of 7000 SNPs and overlapping flanking regions of 1000 SNPs on both sides. From the local regression fit, we only kept samples of the effects of the SNPs from the core region. We collected samples from the 33 posterior distribution of the model using the BLRXy() function of the BGLR R-package [24]. This function implements the same set of models implemented in BGLR but performs the computation using sufficient statistics (๐‘‹โ€ฒ๐‘‹, ๐‘‹โ€ฒ๐‘ฆ, ๐‘ฆโ€ฒ๐‘ฆ); where ๐‘‹ is the design matrix and y is the response vector. Following this strategy offers great computational advantages when ๐‘› >> ๐‘. B2.4 Supplementary figures Below are the supplementary figures that are referred to in the main paper. Figure B2.1 Power vs FDR curve for different simulation settings restricting maximum cluster size to 3. ๐‘› denotes the sample size and ๐‘Ÿ denotes the correlation between adjacent SNPs. The Left and right panel represents scenarios where there are 5 and 20 causal variants respectively. We hold โ„Ž2 fixed at 1.25% for all settings. Putting more restriction on the cluster size criteria have reduced the power advantage of CS approaches relative to marginal SNP-PIP inference. 34 Figure B2.2 Power vs FDR curve for different simulation settings restricting maximum cluster size to 10. ๐‘› denotes the sample size and ๐‘Ÿ denotes the correlation between adjacent SNPs. The Left and right panel represents scenarios where there are 5 and 20 causal variants respectively. We hold โ„Ž2 fixed at 1.25% for all settings. Putting less restriction on the cluster size criteria has increased the power advantage of CS approaches relative to marginal SNP-PIP inference. Furthermore, the DS-BFDR approach gains more power due to lower-resolution discoveries relative to SuSiE. 
35 Figure B2.3 Power vs FDR curve of the three error controlling approaches for different simulation settings. ๐‘› denotes the sample size and ๐‘Ÿ denotes the correlation between adjacent SNPs. The Left and right panel represents scenarios where there are 5 and 20 causal variants respectively. We hold โ„Ž2 fixed at 1.25% for all settings. All the error-controlling approaches show similar performance. 36 Figure B2.4 Ideogram displaying each of the 22 autosomal chromosomes for the imputed data set and singleton and segment discovered in the GWAS of the trait serum urate. The individual SNPs that cleared the 0.05 BFDR threshold are represented using blue lines along with the segments identified which are represented by bars colored as per their joint probability of inclusion in a yellow-to-red scale. 37 CHAPTER 3 SPARSITY-INDUCING CATEGORICAL PRIOR IMPROVES ROBUSTNESS OF THE INFORMATION BOTTLENECK 3.1 Introduction Information bottleneck (IB) ([28]) is a deep latent variable model that poses representation compression as a constrained optimization problem to find representations ๐‘ that are maximally informative about the outputs ๐‘Œ while being maximally compressive about the inputs ๐‘‹, using a loss function expressed using a mutual information (MI) metric and a Lagrangian formulation of the constrained optimization: L๐ผ ๐ต = MI(๐‘‹; ๐‘) โˆ’ ๐›ฝ MI(๐‘; ๐‘Œ ). Here, MI(๐‘‹; ๐‘) is the MI that reflects how much the representation compresses ๐‘‹, and MI(๐‘; ๐‘Œ ) reflects how much information the representation has kept from ๐‘Œ . It is standard practice to use parametric priors, such as a mean-field Gaussian prior for the latent variable ๐‘, as seen with most latent-variable models in the literature ([30]). In general, however, a major limitation of these priors is the requirement to preselect a latent space complexity for all data, which can be very restrictive and lead to models that are less robust. Sparsity, when used as a mechanism to choose the complexity of the model in a flexible and data-driven fashion, has the potential to improve the robustness of machine learning systems without loss of accuracy, especially when dealing with high-dimensional data ([1]). Sparsity has been considered in the context of latent variable models in a handful of works. In linear latent variable modeling, [17] proposes a sparse factor model in the context of learning analytics, [4] proposes a sparse partial least squares (sPLS) method to resolve the inconsistency issue that arises in standard PLS in a high-dimensional setting, and [34] reviews advances in sparse canonical correlation analysis. Most of the work in sparse linear latent variable modeling fixes the latent dimensionality or treats it as a hyperparameter. In nonlinear latent variable modeling, sparsity was proposed primarily in the context of unsupervised learning. Notable works include sparse Dirichlet variational autoencoder (sDVAE) ([3]), epitome VAE (eVAE) ([35]), variational sparse coding (VSC) ([31]), and InteL-VAE ([21]). Most of these approaches do not take into 38 Figure 3.1 Plot of latent dimension distribution learned by SparC-IB aggregated for each of the MNIST data classes. In the vertical axis, we have 10 digits classes and in the horizontal axis we have their distribution of posterior modes of latent dimension aggregated across the testset data points. It shows our SparC-IB prior provides flexibility to learn the data-specific latent dimension distribution, in contrast to fixing to a single value. 
account the uncertainty involved in introducing sparsity, and treat this as a deterministic selection problem. Approaches such as the Indian buffet process VAE ([26]) relax this and allow learning a distribution over the selection parameters that induce sparsity, but only allow global sparsity. Ignoring uncertainty in selection and the flexibility to learn a local data-driven dimensionality of the latent space for each data point can lead to a loss of robustness in inference and prediction. We also note that the aforementioned approaches have been proposed for unsupervised learning, and we are interested in the supervised learning scenario introduced with the IB approach, which poses a different set of challenges because the sparsity has to accommodate accurate prediction of ๐‘Œ . Only one recent work ([13]) that we know of has considered sparsity in the latent variables of the IB model, but here the latent variable is assumed to be deterministic, and the sparsity applied indirectly by weighting each dimension using a Bernoulli distribution, where zero weight is equivalent to sparsification. 39 To that end, we make the following contributions: 1. We introduce a novel sparsity-inducing Bayesian spike-slab prior that is based on a beta- binomial formulation, where the sparsity in the latent variables of the IB model is modeled stochastically through a categorical distribution, and thus the joint distribution of latent variable and sparsity is learned with this categorical prior IB (SparC-IB) model through Bayesian inference. Unlike traditional spike-and-slab priors that are based on the beta- Bernoulli distribution ([9],[11]) and can select dimensions randomly, SparC-IB imposes an order in which the dimensions are selected/activated for each data point. This helps to infer the distribution of dimensionality more effectively. 2. We derive variational lower bounds for efficient inference of the proposed SparC-IB model. 3. Using in-distribution and out-of-distribution experiments with MNIST, CIFAR-10, and Im- ageNet data, we show an improvement in accuracy and robustness with SparC-IB compared to vanilla VIB models and other sparsity-inducing strategies. 4. Through extensive analysis of the latent space, we show that learning the joint distribu- tion provides the flexibility to systematically learn parsimonious data-specific (local) latent dimension complexity (as shown in Fig. 3.1), enabling them to learn robust representations. 3.2 Related Works Previous works in the literature on latent-variable models has looked at different sparsity- inducing mechanisms in the latent space in supervised and unsupervised settings. [3] propose a sparse Dirichlet variational autoencoder (sDVAE) that assumes a degenerate Dirichlet distribution on the latent variable. They introduce sparsity in the concentration parameter of the Dirichlet distribution through a deterministic neural network. Epitome VAE (eVAE) by [35] imposes sparsity through epitomes that select adjacent coordinates of the mean-field Gaussian latent vector and mask the others before feeding to the decoder. The authors propose training a deterministic epitome selector to select from all possible epitomes. Similar to eVAE, the variational sparse coding (VSC) proposed in [31] introduces sparsity through a deterministic classifier. 
In this case, the classifier outputs an index from a set of pseudo-inputs that define the prior for 40 Approach sDVAE eVAE VSC InteL-VAE Drop-B IBP SparC-IB Latent Variable S S S S D S S Sparsity (global/local) D (L) D (L) D (L) D (L) S (G) S (G) S (L) Table 3.1 Latent-variable models with different sparsity induction strategies, where D=Deterministic and S=Stochastic. The type of induced sparsity is in parentheses, where G is global sparsity and L the local sparsity. the latent variable. In a more recent work, InteL-VAE ([21]) introduces sparsity via a dimension selector (DS) network on top of the Gaussian encoder layer in standard VAEs. The output of DS is multiplied by the Gaussian encoder to induce sparsity and then fed to the decoder. InteL-VAE has empirically shown an improvement over VSC in unsupervised learning tasks, such as image generation. The Indian buffet process (IBP) ([26]) learns a distribution on infinite-dimensional sparse matrices where each element of the matrix is an independent Bernoulli random variable, where the probabilities are global and come from a Beta distribution. Therefore, the sparsity induced by IBP is global and can make the coordinates of the latent variable zero for all data points. In the IB literature, we find the aspect of sparsity in the latent space rarely explored. To the best of our knowledge, Drop-Bottleneck (Drop-B) by [13] is the only work that attempts this problem. In Drop-B, a neural network extracts features from the data; then, a Bernoulli random variable stochastically drops certain features before passing them to the decoder. The probabilities of the Bernoulli distribution, similar to IBP, are assumed as global parameters and trained with other parameters of the model. In this paper, with SparC-IB, we model stochasticity in both latent variables and sparsity, and we relax the global sparsity assumption by learning the distribution of (local) sparsity for each data point. In this regard, we differ from other latent-variable models. In Table 3.1 we summarize these different approaches by the types, stochastic (S) or deterministic (D), of the latent variable 41 and the sparsity. Furthermore, we characterize the sparsity induced by each method by whether they impose global (G) or local (L) sparsity. The table shows that very few works incorporate stochasticity in both latent variable and sparsity-inducing mechanism. 3.3 Information Bottleneck with Sparsity-Inducing Categorical Prior 3.3.1 Information Bottleneck: Preliminaries Taking into account a joint distribution ๐‘ƒ(๐‘‹, ๐‘Œ ) of the input variable ๐‘‹ and the corresponding target variable ๐‘Œ , the information bottleneck principle aims to find a (low-dimensional) latent en- coding ๐‘ by maximizing prediction accuracy, formulated in terms of mutual information MI(๐‘; ๐‘Œ ), given a constraint on compressing the latent encoding, formulated in terms of mutual information MI(๐‘‹; ๐‘). This can be cast as a constrained optimization problem: max๐‘ MI(๐‘; ๐‘Œ ) s.t. MI(๐‘‹; ๐‘) โ‰ค ๐ถ, (3.1) where ๐ถ can be interpreted as the compression rate or the minimum number of bits needed to describe the input data. 
Mutual information is obtained through a multidimensional integral that depends on the joint and marginal distributions of the random variables: $MI(X;Z)$ is given by $\int_{\mathcal{X}} \int_{\mathcal{Z}} P_\theta(x,z) \log\left(\frac{P_\theta(x,z)}{P(x)P(z)}\right) dx\, dz$, where $P_\theta(x,z) := P_\theta(z|x) P(x)$; a similar expression for $MI(Z;Y)$ needs $P_\theta(y,z) := \int P(y|x) P_\theta(z|x) P(x)\, dx$. The integral required to calculate MI is generally computationally intractable for large data. Thus, in practice, the Lagrangian relaxation of the constrained optimization problem is adopted [29]:

$$\mathcal{L}_{IB}(Z) = MI(Z;Y) - \beta\, MI(X;Z) \qquad (3.2)$$

where $\beta$ is a Lagrange multiplier that enforces the constraint $MI(X;Z) \leq C$ such that a latent encoding $Z$ is desired that is maximally expressive about $Y$ while being maximally compressive about $X$. In other words, $MI(X;Z)$ is the mutual information that reflects how much the representation ($Z$) compresses $X$, and $MI(Z;Y)$ reflects how much information the representation has kept from $Y$. Several approaches have been proposed in the literature to approximate the mutual information $MI(X;Z)$, ranging from parametric bounds defined by variational lower bounds ([2]) to non-parametric bounds (based on kernel density estimates) ([16]) and adversarial f-divergences [37]. In this research, we focus primarily on the variational lower-bound-based approximation. Furthermore, we take a square transformation of the term $MI(X;Z)$ following [16]. Taking a convex transformation of the compression term makes the solution of the IB Lagrangian identifiable w.r.t. $\beta$ ([25]). However, for the sake of clarity, we drop this transformation from the loss function derivation. From the convexity property, all derivations with the standard IB loss carry over to the loss function with this transformation.

Role of Prior Distribution and Stochasticity of Latent Variable: In the IB formulation presented above, the latent variable is assumed to be stochastic, and hence its posterior distribution is learned using Bayesian inference. In [2], the variational lower bound of $\mathcal{L}_{IB}(Z)$ is given as

$$\mathcal{L}_{VIB}(Z) = \mathbb{E}_{X,Y}\left[\mathbb{E}_{Z|X} \log q(Y|Z) - \beta\, D_{KL}(q(Z|X)\,\|\,q(Z))\right] \qquad (3.3)$$

In equation 3.3, the prior $q(z)$ serves as a penalty on the variational encoder $q(z|x)$, and $q(y|z)$ is the decoder, which is the variational approximation to $p(y|z)$. [2] choose the encoder family and the prior to be a fixed $K$-dimensional isotropic multivariate Gaussian distribution ($N(0_K, I_{K\times K})$). This choice is motivated by the simplicity and efficiency of inference using a differentiable reparameterization of the Gaussian random variable $Z$ in terms of its mean and sigma. For complex datasets, however, this can be restrictive since the same dimensionality is imposed on all the data and hence can prohibit the latent space from learning robust features ([21]).

We describe a new family of sparsity-inducing priors that allow stochastic exploration of latent spaces of different dimensions. For the proposed variational family, we derive the variational lower bound that consists of discrete and continuous variables, and show that simple reparameterization steps can be taken to draw samples efficiently from the encoder for inference and prediction.
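To make the baseline objective in equation 3.3 concrete, below is a minimal PyTorch-style sketch of the variational IB loss with a mean-field Gaussian encoder and a fixed K-dimensional isotropic Gaussian prior; the layer sizes, the single Monte Carlo sample, and the value of β are illustrative assumptions rather than the settings used in our experiments.

# A minimal sketch (not the implementation used in this chapter) of the VIB objective in
# equation 3.3 with q(Z|X) = N(mu(x), diag(sigma^2(x))) and prior q(Z) = N(0, I).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianVIB(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32, n_classes=10, beta=0.08):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                     nn.Linear(512, 2 * latent_dim))   # outputs (mu, log sigma^2)
        self.decoder = nn.Linear(latent_dim, n_classes)                 # decoder q(Y|Z)
        self.beta = beta

    def forward(self, x, y):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)        # reparameterization trick
        log_lik = -F.cross_entropy(self.decoder(z), y)                  # one-sample estimate of E log q(Y|Z)
        # closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over coordinates, averaged over the batch
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1).mean()
        return -(log_lik - self.beta * kl)                               # negative L_VIB, to be minimized

# usage (hypothetical shapes): loss = GaussianVIB()(torch.randn(64, 784), torch.randint(0, 10, (64,)))

Minimizing this quantity with stochastic gradients recovers the fixed-dimensional VIB baseline against which the proposed prior is compared.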
3.3.2 Sparsity-Inducing Categorical Prior A key aspect of the IB model and latent variable models in general is the dimension of ๐‘. Fixing a very low dimension of ๐‘ can impose a limitation on the capacity of the model, while a very large ๐‘ can lead to learning a lot more nuisance information that can be detrimental to model 43 Figure 3.2 Schematic of the categorical prior information bottleneck (SparC-IB) model. The inputs are passed through the encoder layer to estimate the feature allocation vector ๐ด, while simultaneously using a categorical prior to sample the number of active dimensions ๐‘‘ of ๐ด for given data that selects the complexity of latent variable ๐‘. The obtained ๐‘ is then fed to the decoder to predict the supervised learning responses. robustness and generalizability. We formulate a data-driven approach to learn the distribution of dimensionality ๐‘ through the design of the sparsity-inducing prior. Let {๐‘‹๐‘›, ๐‘Œ๐‘›}๐‘ ๐‘›=1 be ๐‘ data points with ๐‘‹๐‘› โˆˆ R1ร—๐‘ and let the latent variable ๐‘๐‘› = (๐‘๐‘›,1, ๐‘๐‘›,2, ..., ๐‘๐‘›,๐พ), where ๐พ is the latent dimensionality. The idea is that we will fix ๐พ to be large a priori and make the prior assumption that the ๐‘˜ < ๐พ columns of ๐‘๐‘› are zero and therefore do not contribute to the prediction of ๐‘Œ๐‘›. Therefore, the prior distribution of ๐‘๐‘› can be specified as follows. ๐‘๐‘›,๐‘˜ |๐›พ๐‘›,๐‘˜ โˆผ (1 โˆ’ ๐›พ๐‘›,๐‘˜ ) 1(๐‘๐‘›,๐‘˜ = 0) + ๐›พ๐‘›,๐‘˜ N(๐œ‡๐‘›,๐‘˜ , ๐œŽ2 ๐‘›,๐‘˜ ) OR ๐‘๐‘› = ๐ด๐‘› โ—ฆ ๐›พ๐‘›; ๐ด๐‘› โˆผ N (๐œ‡๐‘›, ฮฃ๐‘›) ๐›พ๐‘›,๐‘˜ |๐‘‘๐‘› = 1(๐‘˜ โฉฝ ๐‘‘๐‘›); ๐‘‘๐‘› โˆผ Categorical(๐œ‹๐‘›) (3.4) (3.5) This prior is a variation of the spike-slab family ([9]) where the sparsity-inducing parameters follow a categorical distribution. The latent variable ๐‘๐‘› is an element-wise product between the feature allocation vector ๐ด๐‘› and a sparsity-inducing vector ๐›พ๐‘›. The ๐พ dimensional latent ๐ด๐‘› is assumed to follow a ๐พ dimensional Gaussian distribution with mean vector ๐œ‡๐‘› and diagonal covariance matrix 44 ๐‘‹๐‘Œ๐ด๐›พ๐œ‹๐‘‘๐œŽSLABSPIKE๐‘๐œ‡โˆ˜ ฮฃ๐‘› with the variance vector ๐œŽ2 ๐‘› . Note that ๐›พ๐‘› is a vector whose first ๐‘‘๐‘› coordinates are 1s and the rest are 0s, and hence ๐ด๐‘› โ—ฆ ๐›พ๐‘›, with the result that the first ๐‘‘๐‘› coordinates of ๐‘๐‘› are nonzero and the rest are 0. Therefore, as we sample ๐‘‘๐‘›, we consider the different dimensionality of the latent space; ๐‘‘๐‘› follows a categorical distribution over the ๐พ categories with ๐œ‹๐‘› as the vector of categorical probabilities whose elements ๐œ‹๐‘›,๐‘˜ denote the prior probability of the ๐‘˜th dimension for the data point ๐‘›. In Fig. 3.2 we present the flow of the model from the input ๐‘‹ to the output ๐‘Œ through the latent encoding layer. 3.3.3 Variational Bound Derivation The variable ๐‘‘๐‘› augments ๐‘๐‘› for latent space characterization. With a Markov chain assumption, ๐‘Œ โ†” ๐‘‹ โ†” ๐‘, the latent variable distribution factorization as ๐‘(๐‘, ๐‘‘) = (cid:206)๐‘ ๐‘›=1 ๐‘(๐‘๐‘›|๐‘‘๐‘›) ๐‘(๐‘‘๐‘›), and the same family with the same factorization of the variational posterior ๐‘ž(๐‘๐‘›, ๐‘‘๐‘›|๐‘‹) as the prior, we derive the variational lower bound of the IB Lagrangian as follows. 
In the following derivation, for simplicity, we use ๐‘‹ interchangeably for both the random variable โˆˆ R1ร—๐‘ and the sample covariate matrix โˆˆ R๐‘ร—๐‘. L๐ผ ๐ต (๐‘) = MI(๐‘; ๐‘Œ ) โˆ’ ๐›ฝ MI(๐‘‹; ๐‘) โฉพ L๐‘‰ ๐ผ ๐ต (๐‘) (cid:20) (cid:20) = E๐‘‹,๐‘Œ = E๐‘‹,๐‘Œ E๐‘ |๐‘‹ log ๐‘ž(๐‘Œ |๐‘) โˆ’ ๐›ฝ DKL(๐‘ž(๐‘ |๐‘‹)||๐‘ž(๐‘)) (cid:21) (from 3.3) E๐‘ |๐‘‹ log ๐‘ž(๐‘Œ |๐‘) โˆ’ ๐›ฝ DKL(๐‘ž(๐‘, ๐‘‘|๐‘‹)||๐‘ž(๐‘, ๐‘‘)) (cid:21) = LSparC-IB(๐‘, ๐‘‘) Note that we have introduced the random variable ๐‘‘ along with ๐‘ in the density function of the second KL term. This is because the KL divergence between ๐‘ž(๐‘ |๐‘‹) and ๐‘ž(๐‘) is intractable. The equality of the loss function is valid since ๐‘‘๐‘› = (cid:205)๐พ ๐‘˜=1 1(๐‘๐‘›,๐‘˜ โ‰  0) is a deterministic function of ๐‘๐‘›, and we can write the density ๐‘ž(๐‘ง, ๐‘‘) = ๐‘ž(๐‘ง)๐‘ž(๐‘‘|๐‘ง) = ๐‘ž(๐‘ง)๐›ฟ๐‘‘๐‘ง (๐‘‘), where ๐›ฟ๐‘ (๐‘‘) is the Dirac delta 45 function at ๐‘. Therefore, we have the following. DKL(๐‘ž(๐‘, ๐‘‘|๐‘‹)||๐‘ž(๐‘, ๐‘‘)) = E๐‘,๐‘‘|๐‘‹ log ๐‘ž(๐‘ |๐‘‹)๐›ฟ๐‘‘๐‘ (๐‘‘) ๐‘ž(๐‘)๐›ฟ๐‘‘๐‘ (๐‘‘) ๐‘ž(๐‘ |๐‘‹) ๐‘ž(๐‘) = E๐‘ |๐‘‹ log = E๐‘ |๐‘‹E๐‘‘|๐‘ log ๐‘ž(๐‘ |๐‘‹)๐›ฟ๐‘‘๐‘ (๐‘‘) ๐‘ž(๐‘)๐›ฟ๐‘‘๐‘ (๐‘‘) = DKL(๐‘ž(๐‘ |๐‘‹)||๐‘ž(๐‘)) Note that ๐›ฟ๐‘‘๐‘ (๐‘‘) = 1, given ๐‘, almost everywhere since ๐›ฟ๐‘‘๐‘ (๐‘‘) = 0 โ‡โ‡’ ๐‘‘ โ‰  ๐‘‘๐‘ง has measure 0 under ๐‘ž(๐‘‘|๐‘ง) = ๐›ฟ๐‘‘๐‘ง (๐‘‘). We now replace E๐‘‹,๐‘Œ with the empirical version. LSparC-IB(๐‘, ๐‘‘) ห†= 1 ๐‘ ๐‘ โˆ‘๏ธ (cid:20) ๐‘›=1 E๐‘๐‘› |๐‘‹ [log ๐‘ž(๐‘Œ๐‘›|๐‘๐‘›)] โˆ’ ๐›ฝ DKL(๐‘ž(๐‘๐‘›, ๐‘‘๐‘›|๐‘‹)||๐‘ž(๐‘๐‘›, ๐‘‘๐‘›)) (cid:21) 1 ๐‘ = ๐‘ โˆ‘๏ธ (cid:20) E๐‘‘๐‘› |๐‘‹E๐‘๐‘› |๐‘‹,๐‘‘๐‘› [log ๐‘ž(๐‘Œ๐‘›|๐‘๐‘›)] ๐‘›=1 โˆ’ ๐›ฝ (cid:2)E๐‘‘๐‘› |๐‘‹ [DKL(๐‘ž(๐‘๐‘›|๐‘‹, ๐‘‘๐‘›)||๐‘ž(๐‘๐‘›|๐‘‘๐‘›))] + DKL(๐‘ž(๐‘‘๐‘›|๐‘‹)||๐‘ž(๐‘‘๐‘›))(cid:3) (cid:21) We analyze the three terms in the above decomposition as follows. (๐‘–) E๐‘‘๐‘› |๐‘‹E๐‘๐‘› |๐‘‹,๐‘‘๐‘› [log ๐‘ž(๐‘Œ๐‘›|๐‘๐‘›)] = ๐พ โˆ‘๏ธ ๐‘˜=1 E๐‘๐‘› |๐‘‹,๐‘‘๐‘›=๐‘˜ [log ๐‘ž(๐‘Œ๐‘›|๐‘๐‘›, ๐‘‘๐‘› = ๐‘˜)]๐œ‹๐‘›,๐‘˜ (๐‘‹) This term is a weighted average of the negative cross-entropy losses from models with increasing dimension of latent space, where the weights are the posterior probabilities of the dimension encoder. Therefore, maximizing this term implies putting large weights on the dimensions of the latent space where log-likelihood is high. During training, this term can be computed using the Monte Carlo approximation, that is, (๐‘–) ห†= 1 ๐ฝ (cid:205)๐ฝ ๐‘—=1 log ๐‘ž(๐‘Œ๐‘›|๐‘๐‘› = ๐‘ ( ๐‘—) ๐‘› , ๐‘‘๐‘› = ๐‘‘ ( ๐‘—) ๐‘› ), where we draw ๐ฝ randomly drawn samples from ๐‘ž(๐‘๐‘›, ๐‘‘๐‘›|๐‘‹). In our experiments, we fixed ๐ฝ = 10 everywhere 46 during training. (๐‘–๐‘–) E๐‘‘๐‘› |๐‘‹ [DKL(๐‘ž(๐‘๐‘›|๐‘‹, ๐‘‘๐‘›)||๐‘ž(๐‘๐‘›|๐‘‘๐‘›))] = ๐พ โˆ‘๏ธ ๐‘˜=1 DKL(๐‘ž(๐‘๐‘›|๐‘‹, ๐‘‘๐‘› = ๐‘˜)||๐‘ž(๐‘๐‘›|๐‘‘๐‘› = ๐‘˜))๐œ‹๐‘›,๐‘˜ (๐‘‹) Note that ๐‘ž(๐‘ง๐‘›|๐‘‘๐‘›) = ๐‘ž(๐‘ง๐‘›|๐›พ๐‘›) = N (๐‘ง๐‘›; หœ๐œ‡๐‘›, หœฮฃ๐‘›), where หœ๐œ‡๐‘› = ๐œ‡๐‘› โ—ฆ ๐›พ๐‘› and หœฮฃ๐‘› are diagonal with the ๐‘› = ๐œŽ2 entries หœ๐œŽ2 ๐‘› โ—ฆ ๐›พ๐‘›. When ๐‘˜ < ๐พ, this density does not exist w.r.t. the Lebesgue measure in R๐พ. 
However, we can still define a density w.r.t. the Lebesgue measure restricted to R๐‘˜ (see Chapter 8 in [24]), and it is the ๐‘˜-dimensional multivariate normal density with mean ๐œ‡๐‘›,โˆ’๐พโˆ’๐‘˜ = (๐œ‡๐‘›,1, ..., ๐œ‡๐‘›,๐‘˜ )โ€ฒ and diagonal covariance matrix ฮฃ๐‘›,โˆ’๐พโˆ’๐‘˜ with diagonal entries ๐œŽ2 ๐‘›,๐‘˜ )โ€ฒ. Denoting ๐œ‡๐‘›,โˆ’0 = ๐œ‡๐‘›, we have the following. 1,๐‘›, ..., ๐œŽ2 = (๐œŽ2 ๐‘›,โˆ’๐พโˆ’๐‘˜ (๐‘–๐‘–)๐‘ก๐‘’๐‘Ÿ๐‘š = (cid:18) DKL ๐พ โˆ‘๏ธ ๐‘˜=1 N (๐œ‡๐‘›,โˆ’๐พโˆ’๐‘˜ (๐‘‹), ฮฃ๐‘›,โˆ’๐พโˆ’๐‘˜ (๐‘‹)) ||N (๐œ‡๐‘›,โˆ’๐พโˆ’๐‘˜ , ฮฃ๐‘›,โˆ’๐พโˆ’๐‘˜ ) (cid:19) ๐œ‹๐‘›,๐‘˜ (๐‘‹) ๐พ โˆ‘๏ธ ๐‘˜ โˆ‘๏ธ (cid:18) DKL = N (๐œ‡๐‘›,โ„“ (๐‘‹), ๐œŽ2 ๐‘›,โ„“ (๐‘‹)) ๐‘˜=1 โ„“=1 ||N (๐œ‡๐‘›,โ„“, ๐œŽ2 ๐‘›,โ„“) (cid:19) ๐œ‹๐‘›,๐‘˜ (๐‘‹) 1 2 = ๐พ โˆ‘๏ธ ๐‘˜ โˆ‘๏ธ ๐‘˜=1 โ„“=1 [๐œŽ2 ๐‘›,โ„“ (๐‘‹) โˆ’ 1 โˆ’ log(๐œŽ2 ๐‘›,โ„“ (๐‘‹)) + ๐œ‡2 ๐‘›,โ„“ (๐‘‹)]๐œ‹๐‘›,๐‘˜ (๐‘‹) The second-last equality is due to the fact that the KL divergence of multivariate Gaussian densities whose covariances are diagonal can be written as a sum of coordinate-wise KL divergences. Since KL-divergence is always non-negative, minimizing the above expression implies putting more probability to the smaller-dimensional latent space models since the second summation term is expected to grow with dimension ๐‘˜. (๐‘–๐‘–๐‘–) DKL(๐‘ž(๐‘‘๐‘›|๐‘‹)||๐‘ž(๐‘‘๐‘›)) = ๐พ โˆ‘๏ธ ๐‘˜=1 log ๐œ‹๐‘›,๐‘˜ (๐‘‹) ๐œ‹๐‘›,๐‘˜ ๐œ‹๐‘›,๐‘˜ (๐‘‹) 47 Minimizing this term forces the learned probabilities to be close to the prior. Note that we are learning these probabilities for each data point (since ๐œ‹๐‘›,๐‘˜ (๐‘‹) is indexed by ๐‘›). In this respect, we differ from most of the stochastic sparsity-inducing approaches, such as Drop-B ([13]) and IBP ([26]). In these approaches, the sparsity is induced from a probability distribution with global parameterization and is not learned for each data point. Modeling Choices for the SparC-IB Components: Since ๐‘๐‘› = ๐ด๐‘› โ—ฆ ๐›พ๐‘›, we are required to fix the priors for ( ๐ด๐‘›, ๐›พ๐‘›) or ( ๐ด๐‘›, ๐‘‘๐‘›). We chose ๐พ-dimensional spherical Gaussian N (0, ๐ผ๐พ) as the prior for the latent variable ๐ด๐‘›. For ๐‘‘๐‘›, we assume that the ๐‘˜th categorical probability comes from the compound distribution of a beta-binomial model, also known as the Polya urn model [20]. Therefore, ๐œ‹๐‘› = P(๐‘‘๐‘› = ๐‘˜) = (cid:18)๐พ โˆ’ 1 ๐‘˜ โˆ’ 1 (cid:19) B(๐‘Ž๐‘› + ๐‘˜ โˆ’ 1, ๐‘๐‘› + ๐พ โˆ’ ๐‘˜) B(๐‘Ž๐‘›, ๐‘๐‘›) . (3.6) For simplicity, we set the prior value to be constant across the data points, that is, (๐‘Ž๐‘›, ๐‘๐‘›) = (๐‘Ž, ๐‘). The key advantage of this choice is that we can write the probability as a differentiable function of the two shape parameters (๐‘Ž๐‘›, ๐‘๐‘›). Therefore, we can assume the same categorical distribution for the encoder; and instead of learning ๐พ probabilities ๐œ‹๐‘›,๐‘˜ (๐‘‹) we can learn (๐‘Ž๐‘› (๐‘‹), ๐‘๐‘› (๐‘‹)), which significantly reduces the dimensionality of the parameter space. However, learning ๐œ‹๐‘›,๐‘˜ (๐‘‹) provides more flexibility because the shape of the distribution is not constrained, while learning (๐‘Ž๐‘› (๐‘‹), ๐‘๐‘› (๐‘‹)) constrains ๐œ‹๐‘›,๐‘˜ (๐‘‹) to follow according to the shape of the compound distribution, which depends on ๐‘Ž๐‘› (๐‘‹) and ๐‘๐‘› (๐‘‹). 
In our experiments, we tested both approaches and found that learning (๐‘Ž๐‘› (๐‘‹), ๐‘๐‘› (๐‘‹)) produces better results. 3.4 Experimental Results (a) Test error vs ๐›ฝ (b) Log likelihood vs noise (c) Log likelihood vs rotation (d) Log likelihood vs attack Figure 3.3 In- and out-of-distribution performance on MNIST. 48 (a) Test error vs ๐›ฝ (b) Log likelihood vs noise Figure 3.4 In- and out-of- distribution performance on CIFAR-10. (c) Log likelihood vs attack In this section, we present experimental results that compare the performance of SparC-IB with the most recent sparsity-inducing strategies proposed in the literature: the Drop-B model ([13]) and InteL-VAE ([21]). We could not find an open source implementation for either of the two approaches and therefore we have coded our own implementation where we adapt them in the context of information bottleneck (link to the code base is provided in the Appendix A3.5). In this section, these two approaches are called Drop-VIB and Intel-VIB. In addition, we compare our model with the baseline mean-field Gaussian VIB approach, where the latent dimension is fixed to a single value across all data. Note that we apply the square transformation ([16]) to the estimator of MI(๐‘‹; ๐‘) for all the methods. We evaluated the SparC-IB model for in-distribution prediction accuracy in a supervised clas- sification scenario using the MNIST and CIFAR-10 data, which have a small number of classes, and also the ImageNet data, where the number of classes is large. Furthermore, we evaluated the robustness of SparC-IB trained on these three datasets in out-of-distribution scenarios, specifically with rotation, white-box attacks with noise corruptions, and black-box adversarial attacks. The selection of the Lagrange multiplier ๐›ฝ controls the amount of information learned from the input (that is, MI(๐‘‹; ๐‘)) by the latent space. We have chosen a common ๐›ฝ where (MI(๐‘‹; ๐‘), MI(๐‘; ๐‘Œ )) is close to the minimum necessary information or MNI (suggested in [7]) for all models. MNI is a point in the information plane where MI(๐‘‹; ๐‘) = MI(๐‘; ๐‘Œ ) = H(๐‘Œ ), where the entropy is indicated by H(๐‘Œ ). We evaluated the robustness of each model using a single value of ๐›ฝ. The value of ๐›ฝ we chose to compare the models for MNIST is โˆผ 0.08, for CIFAR-10 it is โˆผ 0.04, and for ImageNet it is โˆผ 0.02. The choice of ๐›ฝ is discussed in more detail in the Appendix A3.1. We use the encoder-decoder architecture from [25] for MNIST, from [36] for CIFAR-10, and 49 from [2] for ImageNet. Note that we learn the parameters of the dimension distribution in both compound and categorical strategies by splitting the encoder network head into two parts, as depicted in Fig. 3.2. Full details of the architectures have been discussed in the Appendix A3.1. Furthermore, following [8], we pass the mean of the encoder ๐‘ to the decoder at the test time to make a prediction. Prior probabilities act as regularizers in learning the dimension probabilities in both categorical and compound strategies (the third term of LSparC-IB(๐‘, ๐‘‘)). They also model the prior knowledge or inductive bias that one may have. The prior probability distribution, in this case, 3.6 can be set by the choice of hyperparameters (๐‘Ž, ๐‘). we evaluated two different cases, (๐‘Ž, ๐‘) = (1, 3) and (2,2) for both the categorical and compound distribution strategies. 
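As a concrete illustration of equation 3.6, the short numerical sketch below evaluates the beta-binomial (Polya urn) prior probabilities for these two choices of (a, b); the latent dimensionality K = 32 used here is an assumption made only for the illustration.

# A minimal sketch of the dimension prior in equation 3.6:
# pi_k = C(K-1, k-1) * B(a+k-1, b+K-k) / B(a, b), for k = 1, ..., K.
from math import comb, exp, lgamma

def log_beta(x, y):
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def dim_prior(a, b, K=32):
    """Prior probability P(d = k) for k = 1, ..., K under the compound beta-binomial model."""
    return [comb(K - 1, k - 1) * exp(log_beta(a + k - 1, b + K - k) - log_beta(a, b))
            for k in range(1, K + 1)]

for a, b in [(1, 3), (2, 2)]:
    p = dim_prior(a, b)
    print(f"(a, b) = ({a}, {b}): P(d=1) = {p[0]:.3f}, P(d=16) = {p[15]:.4f}, sum = {sum(p):.3f}")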
The choice (๐‘Ž, ๐‘) = (1, 3) puts more probability mass on the lower dimensions, and gradually decays with dimension, whereas (๐‘Ž, ๐‘) = (2, 2) penalizes models of too high or too low dimensions. The appendix A3.7 shows the prior probabilities of the dimensions for both choices. 3.4.1 In-distribution Data In this section, we compare the performance of SparC-IB with Intel-VIB, Drop-VIB, and the vanilla fixed-dimensional VIB approach on the MNIST, CIFAR-10 test set, and ImageNet validation set. Beyond these choices, we have also compared SparC-IB with a discrete latent space IB model following VQ-VAE ([33]) on MNIST data. We train each model for 5 values of the Lagrange multiplier ๐›ฝ in the set (0.02, 0.04, 0.06, 0.08, 0.1). We calculate the validation set error for each model for these ๐›ฝ values. Since increasing ๐›ฝ penalizes the amount of information retained by the latent space about the inputs, we expect the error to increase as ๐›ฝ increases. For MNIST, we find that, across ๐›ฝ, the compound distribution prior with (๐‘Ž, ๐‘) = (2, 2) performs best in terms of in-distribution prediction accuracy across ๐›ฝ as compared to other SparC-IB choices, fixed-dimensional VIB approaches and Drop-VIB and Intel-VIB, as shown in Fig. 3.3(a). For CIFAR-10 data, we observe that compound strategy with prior (๐‘Ž, ๐‘) = (1, 3) has the best accuracy compared to other SparC-IB choices at MNI, fixed-dimensional VIB models and Intel-VIB but is slightly lower than Drop-VIB (numbers are shown in Table 3.2). However, we find from Table 3.2 50 Methods SparC-IB Drop-VIB Intel-VIB Fixed K: 6 Fixed K: 32 Fixed K: 128 MNIST CIFAR-10 Acc % 98.65 (0.001) 98.25 (0.000) 98.51 (0.001) 98.27 (0.001) 98.54 (0.001) 85.28 (0.165) LL 3.24 (0.004) 3.12 (0.003) 3.23 (0.007) 3.22 (0.004) 3.23 (0.003) 2.85 (0.358) Acc % 84.44 (0.015) 85.43 (0.004) 82.77 (0.010) 82.29 (0.050) 83.76 (0.012) 81.88 (0.001) LL 2.53 (0.010) 2.37 (0.015) 2.49 (0.063) 2.49 (0.107) 2.44 (0.013) 2.48 (0.036) Table 3.2 In-distribution performance of all methods in terms of accuracy and log-likelihood (SD in the parenthesis) at MNI for MNIST and CIFAR-10 (maximum for each column highlighted). SparC-IB performs as good as the best performing model in terms of both metrics. Methods SparC-IB Drop-VIB Intel-VIB Fixed K: 1024 ImageNet Acc % 79.71 (0.001) 79.86 (0.000) 79.88 (0.000) 79.88 (0.000) LL 8.48 (0.007) 8.19 (0.002) 8.53 (0.003) 8.52 (0.005) Table 3.3 In-distribution performance of all methods in terms of accuracy and log-likelihood (SD in the parenthesis) at MNI for ImageNet (maximum for each column highlighted). All models are close in terms of both metrics (especially log-likelihood). that SparC-IB compound (1,3) has the highest log-likelihood at MNI among the other approaches. Furthermore, we have found that the best test accuracy for the discrete latent space IB is 97.79% (avg. over 3 seeds) in MNIST, which is lower than SparC-IB. For Imagenet data, we observe that the compound strategy with prior (๐‘Ž, ๐‘) = (2, 2) has the best validation accuracy compared to other SparC-IB choices. Furthermore, the in-distribution performance is at least as good as the fixed-dimensional VIB model (Table 3.3), but it has a slightly lower accuracy compared to the Drop-VIB and Intel-VIB. The test error for Intel-VIB is very high when ๐›ฝ > 0.04. This behavior is due to the fact that the dimension selector in Intel-VIB (Section 6.3 in [21]) is pruning almost all values of the latent allocation vector ๐ด for values of ๐›ฝ > 0.04. 
3.4.2 Out-of-distribution Data We consider three out-of-distribution scenarios to measure the robustness of our approach and compare it with vanilla VIB with fixed latent dimension capacity, as well as with other sparsity- inducing strategies. The first is a white-box attack, in which we systematically introduce shot noise corruptions into the test data [22, 10]. The second is a rotation transform (only for MNIST data). 51 The third is the black-box adversarial attack simulated using the projected gradient descent (PGD) strategy [19]. We use the log-likelihood metric to compare the out-of-distribution performance of the methods ([23]). Comparison in terms of other metrics is included in the Appendix A3.8. 3.4.2.1 Noise Corruption White-box attack or noise corruption is generated by adding shot noise. Following [22], Poisson noise is generated pixel-by-pixel and added to the test images for both MNIST and CIFAR-10. The levels of noise along the horizontal axis of panel (b) of Fig. 3.3 and Fig. 3.4 represent an increasing degree of noise added to the images of the validation set. For our approach and each of the five models that are being compared, we plot the log-likelihood as a function of the level of noise to assess the robustness of these approaches. We find that with both MNIST (panels (b) in Fig. 3.3) and CIFAR-10 (panels (b) in Fig. 3.4), for each of these three metrics, SparC-IB outperforms all other approaches compared. For ImageNet (panel (b) in Fig. A3.5), we find that Drop-VIB has a higher likelihood than our approach, possibly due to a large amount of input information (estimated MI(๐‘‹; ๐‘) = 57.56 for Drop-VIB and = 10.35 for SparC-IB) learned in the latent space. 3.4.2.2 Rotation In this scenario, we evaluate the models trained on MNIST data with increasingly rotated digits following [23], which simulates the scenario of data that are moderately out of distribution to the data used for training the model. We show the results of the experiments in panel (c) in Fig. 3.3. We can see that SparC-IB with the compound distribution prior outperforms the rest of the five models in terms of log-likelihood, we also note that the performance of the Drop-VIB has the lowest in this scenario. 3.4.2.3 Adversarial Robustness Multiple approaches have been proposed in the literature to assess the adversarial robustness of a model [32]. Among them, perhaps the most commonly adopted approach is to test the accuracy in adversarial examples that are generated by adversarially perturbing the test data. This perturbation is in the same spirit as the noise corruption presented in the preceding section, but is designed 52 (a) Test error vs ๐›ฝ value (b) Log likelihood vs noise (c) Log likelihood vs attack radius Figure 3.5 In- and out-of- distribution results on ImageNet. to be more catastrophic by adopting a black-box attack approach that chooses a gradient direction of the image pixels at some loss and then takes a single step in that direction. The projected gradient descent is an example of such an adversarial attack. Following [2], we evaluated the robustness of the model to the PGD attack with 10 iterations. We use the ๐ฟโˆž norm to measure the size of the perturbation, which in this case is a measure of the largest single change in any pixel. Log-likelihood as a function of the perturbed ๐ฟโˆž distance for MNIST (panel (d) in Fig. 3.3), for CIFAR-10 (panel (c) in Fig. 3.4), and for ImageNet (panel (c) in Fig. 
A3.5) show that the SparC-IB approach provides the highest log-likelihood across the attack radius in all three datasets. 3.4.3 Analysis of the Latent Space A key property of SparC-IB is the ability to jointly learn the latent allocation and the dimension of the latent space for each data point. In this section, our aim is to disentangle the information learned in the latent space of the SparC-IB approach by analyzing the posterior distribution of the dimension variable and the information learned in the latent allocation vector for MNIST data. Similar analyses for CIFAR-10 and ImageNet are in the Appendix A3.9. 3.4.3.1 Flexible Latent Space Dimension A distinct feature of the proposed approach is the flexibility enabled by the sparsity-inducing prior for learning a data-dependent dimension distribution. To demonstrate this, for a compound model with (๐‘Ž, ๐‘) = (2, 2) in Fig. 3.1 we show the distribution of posterior modes of dimension distribution across data points, aggregated per MNIST digit. We see that, in fact, each digit, on average, preferred to have a different latent dimension. We further note that the mode values 53 Figure 3.6 Information content plot for different latent dimensions (averaged across seeds). We observe that SparC-IB learns the maximum information in a small dimensional latent space. depicted by the plot for digits 5 and 8 are farther away from the rest of the digits. Note that the pattern in Fig. 3.1 is for a single seed. Although the pattern in dimension distribution modes changes across the seeds, the separation between the classes remains (see A3.9 in the Appendix for details). We also observe a similar separation of the latent dimensionality across classes with CIFAR-10 and ImageNet data (Appendix Fig. A3.7). 3.4.3.2 Analysis of Information Content The SparC-IB prior in 3.5 induces an ordered selection of the dimensions of the latent allocation vector ๐ด๐‘› based on the dimension ๐‘‘๐‘›. In Fig. 3.6, we plotted the estimated mutual information or (cid:156)MI(๐‘; ๐‘Œ ) against the increasing dimension of the latent space (in increments of 5) for all models in (cid:205)๐‘ MNIST. For a given dimension ๐‘‘, (cid:156)MI(๐‘; ๐‘Œ ) = 1 ๐‘›=1 log ๐‘ž(๐‘Œ๐‘›|๐‘๐‘› = ๐œ‡๐‘› (๐‘‹) โ—ฆ ๐›พ(๐‘‘)), where the first ๐‘‘ coordinates of ๐›พ(๐‘‘) are 1s and the rest are 0s. We observe that the SparC-IB model has been able ๐‘ to code the learned information in the first few (โˆผ 15) dimensions of the latent space. In contrast, we notice that the fixed-dimensional VIB models, Drop-VIB, and Intel-VIB require close to the full latent space to encode similar information levels. This characteristic perhaps hinders these models in achieving good robustness performance consistently on all data. However, we believe 54 Figure 3.7 Plot of pixel importance in encoding the latent allocation mean (averaged across seeds) for a sample from the MNIST test data. We observe that SparC-IB learns important features of the latent space in the first few dimension of the latent space. that further investigation is needed to establish this claim. For CIFAR-10, we observed the same characteristics of the information content plot in Fig. 3.6 whereas for ImageNet the information plateaus at a larger dimension than Fig. 3.6 (see Fig. A3.8 in the Appendix). 3.4.3.3 Visualizing the Latent Dimensions In deep neural networks, the conductance of a hidden unit, proposed in [5], is the flow of information from the input through that unit to the predictions. 
In this spirit, we measure the importance scores of individual pixels in the dimensions of the latent mean allocation vector ๐œ‡ for the MNIST data. The computation details have been added in the Appendix A3.9.4. In Fig. 3.7, we plotted these measures for different methods (averaged across seeds) for the first 15 dimensions of ๐œ‡ for a sample from the MNIST test data. We observe that SparC-IB encodes important features in the first few dimensions of the latent space. For Intel-VIB and VIB with K = 32, we see that the pixel information is spread over a large set of dimensions of the latent space. For Drop-VIB, we notice that it learns a lot of information in all the 15 dimensions. Fig. 3.7 also helps explain the information jumps in Fig. 3.6. We notice that the jump in the information content occurs when we include the dimensions that have learned important pixel information, e.g. for SparC-IB dimensions 3 and 8. Note that the Intel-VIB and fixed-dimensional VIB models with high dimensions have many dimensions with very little information about the input. Although Fig. 3.7 exhibits some features of the digit learned by the latent space (e.g., the middle part of the digit has high importance for SparC-IB), in our experiments, we have not been able to extract meaningful features in the latent dimensions that are common across all the digits. In our view, this demands further investigation. 55 3.5 Conclusions In summary, we propose the SparC-IB method in this paper, which models the latent variable and its dimension through a Bayesian spike-and-slab categorical prior and derived a variational lower bound for efficient Bayesian inference. This approach accounts for the full uncertainty in the latent space by learning a joint distribution of the latent variable and the sparsity. We compare our approach with commonly used fixed-dimensional priors, as well as using those sparsity-inducing strategies previously proposed in the literature through experiments on MNIST, CIFAR-10, and ImageNet in both the in-distribution and out-of-distribution scenarios (such as noise corruption, adversarial attacks, and rotation). We find that our approach obtains as good accuracy and robustness as the best-performing model in all the cases, and in some cases it outperforms the other models. This is important because we found that other VIB algorithms considered performed well on a few datasets but have significantly poor performance on the others. In addition, we show that enabling each data to learn their own dimension distribution leads to separation of dimension distribution between output classes, thus substantiating that latent dimension varies class-wise. Furthermore, we find that the SparC-IB approach provides a compact latent space in which it learns important data features in the first few dimensions of the latent space, which is known to lead to superior robustness properties. There are several avenues for future research based on the proposed model SparC-IB. Since the latent dimensionality of the data is modeled through a Bayesian spike-and-slab prior with a categorical spike distribution over the dimension, an interesting avenue for future research could be to find a rich class of hierarchical priors or a non-parametric stick-breaking prior. Another direction might be to explore other approaches to data-driven mutual information estimation to tighten the lower bound of MI(๐‘‹; ๐‘). 
A key aspect of this work is its broader appeal in the field of adaptive selection of model complexity in machine learning. Beyond IB, the proposed sparsity-inducing prior is a promising candidate for the probabilistic node and depth selection of DNN. Our approach for data-specific complexity learning can potentially be applied to have a NN with an optimal number of layers with ordered and sparse information captured in each layer. 56 BIBLIOGRAPHY [1] Ahuja, K., Caballero, E., Zhang, D., Gagnon-Audet, J.-C., Bengio, Y., Mitliagkas, I., and Rish, I. (2021). Invariance principle meets information bottleneck for out-of-distribution generaliza- tion. Advances in Neural Information Processing Systems, 34:3438โ€“3450. [2] Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2016). Deep variational information bottleneck. arXiv preprint arXiv:1612.00410. [3] Burkhardt, S. and Kramer, S. (2019). Decoupling sparsity and smoothness in the Dirichlet variational autoencoder topic model. J. Mach. Learn. Res., 20(131):1โ€“27. [4] Chun, H. and KeleลŸ, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(1):3โ€“25. [5] Dhamdhere, K., Sundararajan, M., and Yan, Q. (2018). How important is a neuron? [6] Fan, T.-H., Chi, T.-C., Rudnicky, A. I., and Ramadge, P. J. (2022). Training discrete deep generative models via gapped straight-through estimator. [7] Fischer, I. (2020). The conditional entropy bottleneck. Entropy, 22(9). [8] Fischer, I. and Alemi, A. A. (2020). Ceb improves model robustness. Entropy, 22(10):1081. [9] George, E. I. and McCulloch, R. E. (1993). Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881โ€“889. [10] Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations (ICLR-2019). [11] Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730 โ€“ 773. [12] Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. [13] Kim, J., Kim, M., Woo, D., and Kim, G. (2021). Drop-bottleneck: Learning discrete compressed representation for noise-robust exploration. arXiv preprint arXiv:2103.12300. [14] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. [15] Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., Melnikov, 57 A., Kliushkina, N., Araya, C., Yan, S., et al. (2020). Captum: A unified and generic model interpretability library for pytorch. arXiv preprint arXiv:2009.07896. [16] Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. (2019). Nonlinear information bottleneck. Entropy, 21(12). [17] Lan, A. S., Waters, A. E., Studer, C., and Baraniuk, R. G. (2014). Sparse factor analysis for learning and content analytics. Journal of Machine Learning Research (JMLR), 15(57):1959โ€“ 2008. [18] Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. [19] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. [20] Mahmoud, H. (2008). Pรณlya urn models. 
Chapman and Hall/CRC. [21] Miao, N., Mathieu, E., Siddharth, N., Teh, Y. W., and Rainforth, T. (2021). On incorporating inductive biases into vaes. arXiv preprint arXiv:2106.13746. [22] Mu, N. and Gilmer, J. (2019). Mnist-c: A robustness benchmark for computer vision. arXiv preprint arXiv:1906.02337. [23] Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J., Lakshmi- narayanan, B., and Snoek, J. (2019). Can you trust your modelโ€™s uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32. [24] Rao, C. R. (1973). Linear statistical inference and its applications, volume 2. Wiley New York. [25] Rodrรญguez Gรกlvez, B., Thobaben, R., and Skoglund, M. (2020). The convex information bottleneck Lagrangian. Entropy, 22(1):98. [26] Singh, R., Ling, J., and Doshi-Velez, F. (2017). Structured variational autoencoders for the Betaโ€“Bernoulli process. In NIPS 2017 Workshop on Advances in Approximate Bayesian Inference. [27] Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning. [28] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv preprint physics/0004057. 58 [29] Tishby, N. and Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1โ€“5. IEEE. [30] Tomczak, J. M. (2022). Deep Generative Modeling. Springer Nature. [31] Tonolini, F., Jensen, B. S., and Murray-Smith, R. (2020). Variational sparse coding. In Uncertainty in Artificial Intelligence, pages 690โ€“700. PMLR. [32] Tramer, F., Carlini, N., Brendel, W., and Madry, A. (2020). On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems, 33:1633โ€“1645. [33] Van Den Oord, A., Vinyals, O., et al. (2017). Neural discrete representation learning. Advances in neural information processing systems, 30. [34] Yang, X., Liu, W., Liu, W., and Tao, D. (2019). A survey on canonical correlation analysis. IEEE Transactions on Knowledge and Data Engineering, 33(6):2349โ€“2368. [35] Yeung, S., Kannan, A., Dauphin, Y., and Fei-Fei, L. (2017). Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643. [36] Yu, X., Yu, S., and Prรญncipe, J. C. (2021). Deep deterministic information bottleneck with In ICASSP 2021-2021 IEEE International Conference on matrix-based entropy functional. Acoustics, Speech and Signal Processing (ICASSP), pages 3160โ€“3164. IEEE. [37] Zhai, P. and Zhang, S. (2021). Adversarial information bottleneck. arXiv preprint arXiv:2103.00381. 59 APPENDIX A3 SUPPLEMENTARY DATA A3.1 Model Architectures and Hyperparameter settings For the proposed SparC-IB model, we need to select the neural architecture to be used for the latent space mean and variance, as well as the dimension encoder that learns categorical probabilities or compound distribution parameters. To learn the parameters of the dimension distribution in both compound and categorical strategies, we split the head of the encoder network into two parts, as depicted in Fig. 3.2. The first part estimates the parameters of the latent allocation variable ๐ด and the second part estimates the parameters of the dimension variable ๐‘‘. 
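To make the split head concrete, the following is a minimal PyTorch sketch of an encoder of this form; the module name, layer sizes, and the flattened-input assumption are illustrative rather than taken from the released SparC-IB code.

```python
import torch
import torch.nn as nn

class SplitHeadEncoder(nn.Module):
    """Encoder whose head is split into a latent-allocation branch and a dimension branch."""
    def __init__(self, in_dim=784, hidden=800, K=100, strategy="categorical"):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Branch 1: mean and log-variance of the K-dimensional latent allocation A.
        self.latent_head = nn.Linear(hidden, 2 * K)
        # Branch 2: parameters of the dimension variable d -- K logits for the
        # categorical strategy, or two scalars (a, b) for the compound strategy.
        self.dim_head = nn.Linear(hidden, K if strategy == "categorical" else 2)

    def forward(self, x):
        h = self.body(x.flatten(1))
        mu, logvar = self.latent_head(h).chunk(2, dim=-1)
        dim_params = self.dim_head(h)
        return mu, logvar, dim_params
```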
For MNIST, we follow [25] and use an MLP encoder with three fully connected layers, the first two of dimension 800 with ReLU activations, and the last layer with dimension 2𝐾 + 2 for the compound strategy and 3𝐾 for the categorical strategy. The decoder consists of two fully connected layers, the first of which has dimension 800, and the second predicts the softmax class probabilities in the last layer. For CIFAR-10, following [36], we adopt a VGG16 encoder and a single-layered neural network as the decoder. We choose the final sequential layer of VGG16 as the bottleneck layer that outputs the parameters of the latent space. For ImageNet, we crop the images at their center to make them 299 × 299 pixels and normalize them to have mean = (0.5, 0.5, 0.5) and standard deviation = (0.5, 0.5, 0.5). We followed the implementation of [2], where we transform the ImageNet data with a pre-trained Inception ResNet V2 ([27]) network without the output layer. Under this transformation, the original ImageNet images reduce to a 1534-dimensional representation, which we used for all our results. Following [2], we use an encoder with two fully connected layers, each with 1024 hidden units, and a single-layer decoder architecture. For comparison with other sparsity-inducing approaches, we chose the two most recent works: the Drop-B model ([13]) and the InteL-VAE ([21]). The Drop-B implementation requires a feature extractor. For all three data sets, we choose the same architecture for the feature extractor as the encoder of SparC-IB until the final layer, which has 𝐾 dimensions. We assume the same decoder architecture for Drop-B as for SparC-IB. Additionally, for Drop-B, the 𝐾 Bernoulli probabilities are trained with the other parameters of the model. For InteL-VAE, the encoder and decoder architectures are chosen to be the same as in SparC-IB for all data sets. In addition, this model requires a dimension selector (DS) network. Following the experiments in [21], we select three fully connected layers for the DS network with ReLU activations in between, where the first two layers have dimension 10 and the last layer has dimension 𝐾. We fix 𝐾 (that is, the prior assumption of dimensionality) to be 100 for MNIST and CIFAR-10 and 1024 for the ImageNet data.

Hyperparameters          | MNIST     | CIFAR-10   | ImageNet
Train set size           | 60,000    | 50,000     | 1,281,167
Validation set size      | 10,000    | 10,000     | 50,000
# epochs                 | 100       | 400        | 200
Training batch size      | 128       | 100        | 2000
Optimizer                | Adam      | SGD        | Adam
Initial learning rate    | 1e-4      | 0.1        | 1e-4
Learning rate drop       | 0.6       | 0.1        | 0.97
Learning rate drop steps | 10 epochs | 100 epochs | 2 epochs
Weight decay             | Not used  | 5e-4       | Not used

Table A3.1 Hyperparameter settings used to model the three data sets.

The workflow of SparC-IB overlaps with the standard VIB when encoding the mean and sigma of the full dimension of the latent variable. Furthermore, SparC-IB encodes categorical probabilities and then draws samples from a categorical distribution. Unlike the reparameterization trick ([14]) for Gaussian variables, there does not exist a differentiable transformation from categorical probabilities to the samples. Therefore, we use the Gumbel-Softmax approximation ([18], [12]) to draw categorical samples. We apply the transformation in 3.5 to the samples and take the element-wise product with the Gaussian samples before passing it to the decoder. Note that there exist other differentiable reparameterizations of discrete samples, e.g., the Gapped Straight-Through (GST) estimator [6].
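As an illustration of this sampling step, the sketch below draws a relaxed categorical sample with the Gumbel-Softmax approximation and uses it to mask the Gaussian latent sample; the reversed-cumulative-sum construction of the ordered mask is one plausible implementation of the transformation in 3.5, and the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def sparse_latent_sample(mu, logvar, dim_logits, tau=0.5):
    # Reparameterized Gaussian sample of the latent allocation A (batch x K).
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    # Relaxed one-hot sample over the K possible dimensions (batch x K).
    one_hot = F.gumbel_softmax(dim_logits, tau=tau, hard=False)
    # Ordered (soft) mask: coordinate k is "on" when the sampled dimension is at least
    # k + 1, i.e. a reversed cumulative sum of the (relaxed) one-hot vector.
    gamma = one_hot.flip(-1).cumsum(-1).flip(-1)
    # Element-wise product gives the sparse latent code passed to the decoder.
    return z * gamma
```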
However, in our experiments, the use of the Gumbel-Softmax approximation led to a lower loss value than that of the GST. Fitting deep learning models involves several key hyperparameters. In Table A3.1, we provide the necessary hyperparameters for training and evaluation of all fitted models on the three data sets.

A3.2 Data Augmentation

For the CIFAR-10 data, we augmented the training data using random transformations. We pad each training set image by 4 pixels on all sides and crop at a random location to return an original-sized image. We flip each training set image horizontally with probability 0.5. Furthermore, we normalize each training and validation set image with mean = (0.4914, 0.4822, 0.4465) and standard deviation = (0.2023, 0.1994, 0.2010).

A3.3 Convergence Checks

For CIFAR-10, we observed overfitting after 100 epochs for the Drop-VIB model, where the validation loss started to increase. Therefore, we saved the model at epoch 100 for our robustness analysis. For the other models in our experiments on MNIST and CIFAR-10, we observed convergence of the train and validation loss, and we have considered the models at the final epoch as the final models. For ImageNet, we saved the model with the lowest validation loss as our final model for all the methods.

A3.4 Evaluation Metrics

We calculated three evaluation metrics for each method in each scenario: test error, log-likelihood, and Brier score. For prediction in all in- and out-of-distribution scenarios, following [8], we have used the mean latent space E(𝑍|𝑋) as input to the decoder for all methods. Note that for SparC-IB we calculate the marginal expectation as
$$E(Z \mid X) = E_{q(d \mid X)}\, E_{q(Z \mid d, X)}(Z) \;\hat{=}\; \frac{1}{J}\sum_{j=1}^{J} E_{q(Z \mid d = d_j, X)}(Z),$$
where 𝑑𝑗 is a sample from 𝑞(𝑑|𝑋). In our experiments, we have fixed 𝐽 = 10.

A3.5 Software and Hardware

We have forked the code base https://github.com/burklight/convex-IB-Lagrangian-PyTorch.git that implements the Convex-IB method ([25]) using PyTorch. The code to run the models used in the experiments can be found in the following repository: https://github.com/AnirbanSamaddar/SparC-IB. For modeling MNIST, we used NVIDIA V100 GPUs, and for the CIFAR-10 and ImageNet experiments, we used NVIDIA A100 GPUs.

Figure A3.1 Information curve on MNIST, CIFAR-10, and ImageNet.

A3.6 Information Curve: Selection of 𝛽

In the IB Lagrangian, the Lagrange multiplier 𝛽 controls the trade-off between the two MI terms. By optimizing the IB objective for different values of 𝛽, we can explore the information curve, which is the plot of (MI(𝑋; 𝑍), MI(𝑍; 𝑌)) in the 2-d plane. Fig. A3.1 shows the information curve on the validation set for the models selected for MNIST, CIFAR-10, and ImageNet for the robustness studies in the main article. For the fixed-𝐾 VIB models, the information curves are similar to Fig. A3.1. Minimum necessary information (MNI) ([7]) is a point in the information plane where MI(𝑋; 𝑍) = MI(𝑍; 𝑌) = H(𝑌), where the entropy is denoted by H(𝑌). For classification tasks, where labels are deterministic given the images, the entropy is H(𝑌) = log2 𝑛𝑐, where 𝑛𝑐 is the number of classes. Therefore, for MNIST and CIFAR-10, H(𝑌) = log2 10 ∼ 3.32, and for ImageNet, H(𝑌) = log2 1000 ∼ 9.97. Therefore, we choose 𝛽 ∼ 0.08 for MNIST, 𝛽 ∼ 0.04 for CIFAR-10, and 𝛽 ∼ 0.02 for ImageNet, which gives us the closest proximity to the MNI.
The points are circled in Fig. A3.1.

A3.7 Selection of Prior Parameters for SparC-IB

Fig. A3.2 shows the prior dimension probabilities considered for modeling the three data sets. These prior probabilities are from the compound distribution (Eq. 3.6) with 𝐾 = 100 (left figure) and 𝐾 = 1024 (right figure).

A3.8 Performance Based on the Evaluation Metrics on Out-of-distribution Data

In the main paper, we have presented the robustness results for the three data sets in terms of the log-likelihood. Here we present the robustness results in terms of the test set error and the Brier score for all the methods across the three data sets.

Figure A3.2 Plot of prior dimension probabilities for choices of (𝑎, 𝑏).

Fig. A3.3 (a)-(e) shows the results for the out-of-distribution MNIST test data. From the figures, we observe that in all the scenarios SparC-IB performs as well as the best-performing model. We further observe that the separation between the models in terms of the test error and the Brier score is smaller than in terms of the log-likelihood, which is our chosen metric in the main paper. Similar results for CIFAR-10 and ImageNet are presented in Fig. A3.4 and Fig. A3.5, respectively. For CIFAR-10 (Fig. A3.4), we observe similar behavior in terms of the test error and the Brier score as for MNIST. For ImageNet (Fig. A3.5), we observe that SparC-IB does better in terms of test error under the 𝐿∞ attacks. However, Drop-VIB seems to perform better than the other approaches for the white-box attacks. This behavior has been discussed in the main paper (Sec. 3.4.2.1).

Figure A3.3 Out-of-distribution performance in terms of the test error and the Brier score on MNIST: (a) test error vs noise, (b) Brier score vs noise, (c) test error vs rotation, (d) Brier score vs rotation, (e) test error vs 𝐿∞ radius. We observe that the SparC-IB approach with the compound strategy and (𝑎, 𝑏) = (2, 2) (red line) performs as well as the best-performing model in all the cases.

Figure A3.4 Out-of-distribution performance in terms of the test error and the Brier score on CIFAR-10: (a) test error vs noise, (b) Brier score vs noise, (c) test error vs 𝐿∞ radius. We observe that the SparC-IB approach with the compound strategy and (𝑎, 𝑏) = (1, 3) (green line) performs as well as the best-performing model in all the cases.

Figure A3.5 Out-of-distribution performance in terms of the test error and the Brier score on ImageNet: (a) test error vs noise, (b) Brier score vs noise, (c) test error vs 𝐿∞ radius. In the white-noise scenario, Drop-VIB performs better than our approach, possibly due to learning more information about 𝑋. However, we observe that the SparC-IB approach outperforms the other models in black-box attacks.

A3.9 Analysis of Latent Space

A3.9.1 Dimension Distribution Mode Plot across Seeds

Fig. A3.6 shows the mode plot for SparC-IB compound (2,2) across 3 seeds on the MNIST data. The overall range of dimensions remains unchanged across the 3 seeds; however, we observe that each digit prefers a different dimension of the latent space.

A3.9.2 The Dimension Distribution Mode Plot for CIFAR-10 and ImageNet

Fig. A3.7 shows the dimension distribution mode plot across classes of CIFAR-10 and ImageNet. We show these plots for the SparC-IB compound (1,3) on CIFAR-10 and the SparC-IB compound (2,2) on ImageNet. In both data sets, we observe that each class prefers a different latent dimension (especially on ImageNet).
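For reference, the mode plots can be reproduced with a computation along the following lines; this is a minimal sketch with hypothetical array names, not the plotting code from the repository.

```python
import numpy as np

def dimension_modes_by_class(dim_probs, labels):
    """dim_probs: (n, K) array of posterior probabilities q(d = k | x_i);
    labels: (n,) array of class labels. Returns {class: array of posterior modes}."""
    modes = dim_probs.argmax(axis=1) + 1          # posterior mode of the dimension (1-based)
    return {c: modes[labels == c] for c in np.unique(labels)}
```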
Figure A3.6 Mode plot for SparC-IB compound (𝑎, 𝑏) = (2, 2) across the 3 seeds on MNIST, panels (a)-(c). We observe separation of the posterior modes between the MNIST digits for all 3 seeds.

Figure A3.7 Plot of the posterior modes of the dimension variable for (a) CIFAR-10 and (b) 50 randomly chosen classes of ImageNet. Both plots show separation between the latent dimensions of the classes chosen by SparC-IB.

A3.9.3 Information Content Plot for CIFAR-10 and ImageNet

Fig. A3.8 shows the estimated MI(𝑍; 𝑌) (expression provided in Sec. 3.4.3.2) against the increasing dimension of the latent space for CIFAR-10 and ImageNet. On CIFAR-10, SparC-IB provides the most compact representation among the models, with the information plateauing within the first few dimensions (∼ 5) of the latent space. For ImageNet, we observe that the information plateaus around dimension 500, which is smaller than for the fixed-dimensional VIB and Intel-VIB models but higher than for the Drop-VIB model. In addition, we note that the behavior of the estimated MI(𝑍; 𝑌) as a function of the dimension is much smoother than for the other two data sets. The reason for such behavior is perhaps the complexity of the ImageNet data, which requires a high-dimensional latent space to encode the necessary information of 𝑋, with each dimension's contribution being small.

Figure A3.8 Plot of the information content vs the dimension of the mean of the encoder for (a) CIFAR-10 and (b) ImageNet. Both plots show that SparC-IB encodes the maximum information in a smaller-dimensional latent space than the other models.

A3.9.4 Calculating Pixel-wise Importance Scores for Dimension Visualization on MNIST

We have used the Captum package [15] to calculate the pixel importance scores to visualize the latent space (Sec. 3.4.3.3) for MNIST. Given an input image 𝑥 and a baseline image 𝑥′, the importance score for the 𝑖-th pixel on the 𝑑-th dimension of the mean of the latent space is calculated using the following expression:
$$\mathrm{Importance\ Score}^{d}_{i}(x) = (x_i - x'_i)\int_{\alpha=0}^{1} \frac{\partial \mu_d\big(\alpha x + (1-\alpha)x'\big)}{\partial x_i}\, d\alpha .$$
In the above expression, 𝜇(.) is the mean vector of the latent space and 𝜇𝑑(.) represents its 𝑑-th coordinate. We used a blank image where every pixel value is 0 as the baseline 𝑥′. We can interpret the score as the sensitivity of the dimensions of 𝜇(𝑥) to a small change in each pixel, integrated over the images that fall on the line given by 𝛼𝑥 + (1 − 𝛼)𝑥′.

CHAPTER 4
PREDICTION OF THE POLYGENIC RISK SCORES USING DEEP LEARNING IN ANCESTRY-DIVERSE GENETIC DATA SETS

4.1 Introduction

Recent years have seen unprecedented increases in the sample size of biomedical data sets. Government initiatives (e.g., UK-Biobank, All of Us) and private organizations (e.g., 23andMe®) have generated genotype information linked to phenotypic and disease data on millions of individuals. These large-scale data sets have led to important advances in genomics research, including the mapping of risk loci (aka Genome-Wide Association (GWA) analysis) and the development of models to predict phenotypes and disease risk using genomic information [11, 18].
In genetic studies, the prediction of a phenotype or disease risk is commonly made by polygenic scores (PGS) which are weighted sums of risk alleles at Single Nucleotide Polymorphism (SNP) that in a GWA study were found to be associated with a phenotype of interest [18]. Despite important advances in PGS prediction, many important challenges persist. One such challenge is the prediction of PGS in ancestry-diverse data sets which may involve heterogeneity of SNP effects between ancestry groups. Most human genomic studies build PGS using linear models with SNPs as the covariates and the phenotype of interest as the response. A PGS is typically derived in two stages: (i) A GWA study is conducted to identify SNPs significantly associated with the outcome of interest using univariate regression, either a penalized or a Bayesian procedure (a form of sure independent screening [7]), then, (ii) The effects of the selected SNPs are estimated using a linear regularized regression [3]. With the increasing sample size in genetic data sets, studies have expanded the set of tools to include machine learning (ML) models for prediction to account for possible interactions between SNPs ([1],[2],[6],[13],[17], [16]). In some studies, ML models have shown gains in prediction accuracy compared to the linear model for some traits ([1],[17],[16]); however, other studies have not shown a consistent superiority of ML over traditional linear models ([2], [13]). The majority of the Genome-Wide Association (GWA) and PGS studies used data from Eu- 68 ropeans [14]. Within homogeneous populations, linear models can capture a very large fraction of the genetic variance, even if the underlying genetic architecture involves complex gene-gene interactions. The reasons why linear models can provide very good approximations to genome- to-phenotype maps that are highly nonlinear have been studied extensively in quantitative genetics [9]. However, the approximation provided by a linear function has validity within the context (ge- netic background) where it was developed. Therefore, linear models often do not generalize well across populations because differences in allele frequencies, linkage disequilibrium patterns, and genetic-by-environment interactions between ancestry groups make the linear effects of SNPs vary between populations. Therefore, recent studies have emphasized utilizing ML models to capture such non-linearities in ancestry-diverse data sets. For instance, Elgart et al. [6] showed that an ensemble method using gradient-boosted regression trees performs better than traditional linear models in modeling PGS for a wide range of traits in a multi-ancestry data set. In genetics, gene-by-gene interactions are referred to as epistatic effects. Although epistatic interactions can occur among distant loci, the literature on epistasis suggests that most gene-gene interactions are likely to happen between loci that are physically proximal in their genome position positions (e.g., between SNPs within the same gene). There are biological and statistical reasons that support the hypothesis that most epistatic interactions happen between SNPs that are physically proximal. For example, a mutation in a regulatory region may modulate the effect of mutations in a coding region. Likewise, two mutations within a gene may have larger effects on the resulting protein structure than a single mutation. 
Additionally, limitations of the genomic data available can lead to apparent epistasis (aka phantom epistasis [5]) even if the underlying causal model is linear. The notion that most SNP-by-SNP interactions may happen between SNPs that are physically proximal may be used to develop ML that are both statistically and computationally efficient. To the best of our knowledge, there are no ML models in the literature which aim to capture these local interaction effects. Therefore, the objective of this study is to develop and investigate ML methods for PGS in the context of ancestry-diverse populations that can capture local epistatic interactions. To attain this 69 goal, we develop ML models with local neural network learners throughout the genome. Using the multi-ancestry Trans-Omics in Precision Medicine (TOPMed) [10], we empirically show that our approach significantly improves the prediction accuracy on small genomic segments throughout the genome. Applying ML methods on a whole-genome scale is computationally challenging. Therefore, in- spired by the encouraging results of the application of ML methods to short chromosome segments, and also on previous research [4] we propose a novel two-step procedure where local regressions are used to learn DNA-phenotype patterns for short chromosome segments (this step is fully paralleliz- able) and then combine the genomic scores of each chromosome segment into a single PGS. The proposed architecture (1) captures local interaction effects, and (2) is tuned to learn highly sparse and heterogenous signals throughout the genome. Using simulations, we benchmark the prediction performance of four specifications obtained by either using a Bayesian Linear Regression (BLR) or Machine Learning (ML) for the first and the second step of the procedure. Our results suggest that combining variable selection in the first step through a BLR followed by ML in the second step can improve prediction accuracy in cross-ancestry prediction relative to BLR or DL alone. This is a baseline study that serves as a step towards formalizing the ML application in genomic prediction. 4.2 Materials and Methods To carry out our simulations we used genotype data from the TOPMed data set which contains whole-genome sequenced genomes that led to the calling of about 40 million SNPs across 23,078 samples from various ancestry groups. We choose this data set because it was derived from whole- genome sequencing (as opposed to genotyping based on SNP arrays), it has a relatively large sample size, and it is ancestry-diverse. Training and Testing sets: We perform a principal component analysis on the SNPs and then apply K-means clustering to categorize the samples into 6 distinct population groups. Appendix Fig. A4.1 shows differences in the sub-populations in the plot of the first 2 principal components. Our focus is on developing methods that can leverage multi-ancestry data to improve the perfor- mance of polygenic prediction in non-Europeans who are severely underrepresented in genomics 70 research. Therefore, to test the cross-ancestry prediction accuracy of PGS, we select a held-out set of samples from sub-population 6 in Fig. A4.1 (which corresponds to individuals primarily of African ancestry) for testing the prediction accuracy and use the remaining samples for training. Simulation set up: Our primary objective is to assess how ML techniques can capture non- linearity in the genotype-phenotype relationship. Most genomic data sets use SNP arrays with โˆผ 1 million SNPs. 
This represents a small fraction of all the SNPs in the human genome. In this context, most of the variants with causal effects are not genotyped. Therefore, in GWA studies and in PGS prediction, the available SNPs act as surrogates for the SNPs with causal effects. Mapping of causal loci and PGS prediction are possible because of linkage disequilibrium (i.e., correlation) between the genotyped SNPs and those with causal effects. We leveraged the fact that TOPMed offers whole-genome sequence data to simulate phenotype data using randomly selected causal variants and then developed prediction models that excluded those causal variants. This strategy allowed us to represent a realistic scenario in which causal variants are not genotyped. In the simulation study, there are many ways to introduce non-linearity in the relationship between the SNPs and the phenotype. The model used to simulate data (i.e., the causal model) may involve group-specific effects which may emerge due to epistasis or genetic-by-environment interactions. However, it is also known that even in cases where the model is linear at the causal level, the use of SNPs that are imperfect surrogates for the genotypes at causal loci may lead to non-linearities in the SNP-phenotype map [5]. Our objective was to benchmark ML methods in a challenging scenario that does not offer a clear advantage to ML models relative to linear PGS. Therefore, in all our simulations, following [5], we assume that the causal model is linear (with effects that are the same across ancestry groups) and then analyze the data excluding the causal variants. In this scenario, non-linearities in the SNP-phenotype map may emerge because of heterogeneity in the correlation between the SNPs used for analysis across ancestry-diverse groups. Let the causal model be defined by

𝒚 = 𝒁𝒃 + 𝒆 (4.1)

Here, 𝒃 = [𝑏1, ..., 𝑏𝑆]′ is the effect-size vector, 𝒁 = [𝒁1, ..., 𝒁𝑆] ∈ R𝑛×𝑆 represents 𝑛 samples of 𝑆 causal loci (randomly selected among the available SNPs), and the errors 𝒆 ∈ R𝑛 are drawn iid from a Gaussian distribution with mean 0 and variance 𝜎2. In our simulations, we tuned the error variance to achieve different scenarios regarding the proportion of variance of the phenotype explained by genetic effects (aka heritability). To develop prediction models, we regressed the simulated phenotype (𝒚 from the causal model Eq. 4.1) on SNPs 𝑿 ∈ R𝑛×𝑝 that were not involved in the causal model (𝑿 does not include any of the variants included in the causal model Eq. 4.1). The instrumental model takes the form

𝒚 = 𝑓(𝑿) + 𝝐 (4.2)

Here, 𝑓(.) is any regression function, and 𝝐 ∈ R𝑛 are model residuals. In ML, 𝑓(.) is non-linear in its inputs; on the other hand, in the linear models that we used to benchmark ML, 𝑓(.) = 𝑿𝜷. Our simulations were based on SNPs in chromosome 1; this chromosome has a length of about 250 mega-base-pairs (mbp) and represents approximately 8.3% of the human genome. We first conducted a simulation based on short chromosome segments (up to 1 mbp) and then extended this to a simulation involving the entire chromosome.

Short-segment simulation: In this simulation, we applied Eq. 4.1 and Eq. 4.2 to short chromosome segments.
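Before describing the replicates in detail, the following minimal Python sketch (made-up names and sizes) illustrates the generating scheme of Eq. 4.1 used in these simulations: causal SNPs are sampled, the error variance is tuned to a target heritability, and the causal variants are then removed from the analysis SNPs of Eq. 4.2.

```python
import numpy as np

def simulate_phenotype(Z_all, n_causal=4, h2=0.2, seed=0):
    """Z_all: (n, p) genotype matrix for one segment (all sequenced SNPs)."""
    rng = np.random.default_rng(seed)
    causal = rng.choice(Z_all.shape[1], size=n_causal, replace=False)
    b = rng.normal(size=n_causal)                  # causal effect sizes
    g = Z_all[:, causal] @ b                       # genetic values Zb
    var_e = np.var(g) * (1.0 - h2) / h2            # error variance for a target h2
    y = g + rng.normal(scale=np.sqrt(var_e), size=Z_all.shape[0])
    X = np.delete(Z_all, causal, axis=1)           # analysis SNPs exclude causal loci
    return y, X, causal
```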
In each Monte Carlo replicate, we sampled a position in chromosome 1 at random and then used the SNPs in a segment around that position, either a short one, 100 kilo-base-pairs (kbp) long, or a longer one (1,000 kbp = 1 mbp), to simulate and analyze the data using Eq. 4.1 and Eq. 4.2, respectively. We tuned the error variance to achieve a proportion of variance explained by the causal loci (aka heritability) of either 0.2 or 0.4. In our data, there are approximately 40 SNPs per 100 kbp and 400 SNPs per mbp. In the simulation involving 100 kbp segments, we assumed that 4 SNPs in the segment had effects and that the remaining 36 SNPs had no effects (i.e., at the causal level there was 90% sparsity). In the simulation involving 1 mbp segments, we assumed that 16 of the 400 SNPs had effects (i.e., 96% sparsity).

Figure 4.1 Map of a chromosome divided into segments, each containing 400 SNPs (∼ 1 mbp). The highlighted segments (in green) each contain 1 causal SNP. There were 20 segments with one causal variant. Data were simulated using the 20 causal variants, according to Eq. 4.1. The causal SNPs were removed to implement the regression of Eq. 4.2.

Chromosome-wide simulation: In a second simulation, we expanded our research to a simulation and analysis involving the entire chromosome 1. Within the human genome, the correlation between SNPs typically decays with physical distance. In most populations, the correlation drops to values close to zero within roughly 1 mbp. Therefore, in the chromosome-wide simulation, we divide chromosome 1 into 250 segments, each of width ∼ 400 SNPs (1 mbp), and randomly select 20 causal SNPs from distant segments. Similar to before, we use Eq. 4.1 and an assigned heritability of 0.2 to generate the phenotype. For illustration, Fig. 4.1 shows a graphical representation of 100,000 SNPs of chromosome 1 divided into segments, each with 400 SNPs. The highlighted segments (in green) each contain 1 causal SNP.

Analysis Methods: In genetics studies, a variety of machine learning models have been previously explored (see [13] for a review). In this study, we chose a feed-forward multi-layer perceptron neural network to model the data since it has shown promise in modeling local interaction patterns [17].

Single chromosome-segment simulation: For the segment-wise analysis (first simulation), we benchmark the performance of the ML model against (1) a Bayesian Linear Regression (BLR) with priors from the spike-slab family and (2) a BLR model with random SNP-by-ancestry-group interaction effects [19] (i.e., a model that allows the SNP effects to vary between ancestry groups). Further details about the analysis methods are provided in Appendix B4.

Whole-chromosome simulation: It is computationally infeasible to fit a single model to the ∼ 100,000 SNPs of chromosome 1. Therefore, as suggested by [8], we apply all models in overlapping segments of 3 mbp (approximately 1200 SNPs). We then shift this window by 1 mbp, creating local regressions consisting of a 1 mbp core and two 1 mbp flanking regions on either side (an index-level sketch of these windows is given below, after Table 4.1).

Figure 4.2 Schematic representation of the two-stage modeling of SNPs from chromosome 1 and a simulated phenotype (𝒚) in the whole-chromosome simulation.

First stage (local) | Second stage   | Label
Linear              | Linear         | BLR + BLR
Linear              | Neural network | BLR + NN
Neural network      | Linear         | NN + BLR
Neural network      | Neural network | NN + NN

Table 4.1 Labels for the models used in the chromosome-wide analysis.
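The sliding-window construction described above can be summarized, at the level of SNP indices, by the following minimal sketch; the function name and the exact flank handling at the chromosome ends are our assumptions.

```python
import numpy as np

def local_windows(p, core=400, flank=400):
    """Return (core_idx, window_idx) pairs covering p SNPs with overlapping windows."""
    windows = []
    for start in range(0, p, core):
        core_idx = np.arange(start, min(start + core, p))          # 1 mbp core
        lo, hi = max(0, start - flank), min(p, start + core + flank)
        windows.append((core_idx, np.arange(lo, hi)))               # core plus flanks
    return windows

# Example: ~100,000 SNPs of chromosome 1 give 250 overlapping ~3 mbp windows.
# segments = local_windows(100_000)
```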
Further details are provided in Appendix B4. We fit the chromosome-wide models using a two-stage procedure: (1) fit local regression models and obtain the predicted values separately on each segment, and (2) combine the scores to make the final prediction. Fig. 4.2 shows a schematic representation of the two-stage modeling where, in the initial stage, the SNPs from each segment are passed through a function 𝑓(.) to predict the scores 𝑠𝑘, 𝑘 = 1, ..., 250. This first step was applied using the sliding-window approach previously described, in parallel for each of the overlapping segments. This step produces a single score for each segment. In the second stage, we combine the scores using another function 𝑔(.) to make the final prediction ˆ𝑦. With the approach in Fig. 4.2, we obtain four models for building the PGS, where at each stage the candidate model (i.e., the choice of 𝑓(.) and 𝑔(.)) is either a BLR model or a neural network. Table 4.1 summarizes the strategies we used for data analysis.

Figure 4.3 (a) Prediction correlation on 100 kbp segments; (b) prediction correlation on 1 mbp segments. The figure shows that (a) ML clearly outperforms the linear models in terms of test set prediction correlation (error bars represent standard errors) in the scenario using 100 kilo-base-pair (kbp) segments and simulated phenotypes, and (b) ML performs similarly to or slightly better than the baselines when applied to larger (1 mbp) segments throughout the genome. BLR: Bayesian linear regression model, BLR-interaction: BLR model with random interaction effects, and NN: neural network.

4.3 Results

Performance on the short-segment simulation: Fig. 4.3 shows the Pearson's correlation coefficient between the predicted and true phenotype on the test data set for the three models. For the simulation involving short segments (100 kbp long, Fig. 4.3 panel (a)), we observe that the ML model clearly outperforms the two linear models. This holds even though the model used to simulate the data was linear. However, for the larger 1 mbp segments (Fig. 4.3 panel (b)), the difference in prediction correlation between the linear models and the neural network was much smaller. Fig. 4.3(b) shows that the ML model performs similarly to or marginally better than the baseline models for ℎ2 = 0.2 and 0.4, respectively. Furthermore, we observe that the linear models in Fig. 4.3(b) show significantly better results than in panel (a), whereas the ML model maintains a similar prediction correlation. For large segments, the prediction accuracy of the linear models improves because longer segments can leverage the long-range LD between SNPs. As one would expect, we observe that a higher ℎ2 (i.e., a higher signal-to-noise ratio) led to better prediction performance for all models.

Figure 4.4 Chromosome-wide prediction accuracy (error bars represent standard errors) shows that local neural network methods do not perform as well as the linear model. The method based on linear models only (BLR+BLR) performs marginally better than the one based solely on NN (NN+NN). Using a linear (variable selection) model in the first stage and a NN in the second stage (BLR+NN) was the best-performing method. BLR + BLR: BLR in both stages, BLR + NN: BLR in the first stage and neural network in the second stage, NN + BLR: neural network in the first stage and BLR in the second stage, and NN + NN: neural network in both stages.

Performance on the chromosome-wide simulation:
Fig. 4.4 shows the prediction correlation (averaged over 50 bootstrap samples of the test data set) of the models in the chromosome-wide simulation. We observe that all models are significantly less accurate than in Fig. 4.3. This was expected because the signal is distributed over many more SNPs (∼ 99% sparsity), thereby posing a greater challenge for the models to detect it. In addition, we observe that the prediction accuracy decreases when the neural network is applied in the first stage (NN + BLR and NN + NN) compared to the linear model (BLR + BLR and BLR + NN). This behavior arises primarily because the simulated signal is sparse (many segments did not have effects, see Fig. 4.1) and neural networks do not perform as well as the BLR model (which uses a sparsity-inducing prior) at separating the segments with signal from the rest. The best prediction accuracy is achieved by using a linear model in the first stage and a neural network in the second stage (the BLR + NN model).

4.4 Discussion

In this study, we proposed strategies for the use of ML approaches for constructing polygenic scores on an ancestry-diverse data set. Over the past decade, machine learning research has seen tremendous growth in all fields of science. However, in Genetics, despite the age of big data ushering in important advancements, the application of ML has seen limited success. With the future increase in data collected on populations from diverse ancestries (e.g., All of Us), we highlight the importance of ML, compared to linear models, in constructing cross-population PGS that achieve better generalization across populations. Modern machine-learning applications in PGS prediction have reported mixed results. However, recent studies ([6],[16]) have shown promise for ML in genomic prediction in the domain of cross-population PGS construction. Although our objective is similar, we show that ML models gain a significant improvement in prediction accuracy over baseline models when applied to local genomic segments (Fig. 4.3(a)). This demonstrates the potential impact of local epistatic interactions [5], which lead to decreased accuracy of the linear model. The chromosome-wide application shows that the prediction accuracy decreases when using the local neural network approach in the first stage of the two-stage modeling. This corroborates the findings of earlier studies [2] which show that the performance of neural networks deteriorates when there are excessive covariates that have no effect on the output. However, using a BLR (for the first stage) followed by a NN to combine effects yielded better results than simply using NNs or linear models. This study provides numerous opportunities for future research in ML applications in genomic prediction. The proposed two-stage approach allows for the application of this method on a whole-genome scale. This is because the first step can be fully parallelized, and the second step operates in a reduced dimension using local polygenic scores inferred in the first stage. As noted earlier, for the two-stage method in the chromosome-wide simulation, the most effective approach was the one using a variable selection model in the initial stage and a NN in the second stage. This outcome presents a chance to create techniques that can produce reliable predictions while being computationally feasible. The initial step, which is computationally expensive, can be done in parallel using linear models that are less expensive and simpler to implement than an ML model.
Therefore, our next objective will be to expand this approach to whole-genome simulations, evaluating it with real data, and, finally developing software that could implement these methods and a user-friendly, computationally efficient framework. One possible future research is to explore machine learning methods that incorporate linear dimension reduction (e.g. random projections [12]) of the input data into their architecture, or neural networks that induce sparsity on the input- output map. In conclusion, this study provides insights into the application of ML in capturing epistatic interactions in modern ancestry-diverse data sets. Our research indicates that ML models have the potential to capture the local non-linearity that occurs as a result of imperfect linkage disequilibrium (LD) between SNPs. As a result, we have developed a two-stage framework, based on local polygenic scores, that is effective in constructing PGS on large chromosome-wide data sets. With the help of standard software and computational resources available to researchers, this framework can be applied genome-wide to construct PGS on underrepresented population groups using ancestry- diverse ultra-high dimensional genetic data sets. 78 BIBLIOGRAPHY [1] Alzoubi, H., Alzubi, R., and Ramzan, N. (2023). Deep learning framework for complex disease risk prediction using genomic variations. Sensors, 23(9):4439. [2] Bellot, P., de Los Campos, G., and Pรฉrez-Enciso, M. (2018). Can deep learning improve genomic prediction of complex human traits? Genetics, 210(3):809โ€“819. [3] Choi, S. W., Mak, T. S.-H., and Oโ€™Reilly, P. F. (2020). Tutorial: a guide to performing polygenic risk score analyses. Nature protocols, 15(9):2759โ€“2772. [4] de Los Campos, G., Grueneberg, A., Funkhouser, S., Pรฉrez-Rodrรญguez, P., and Samaddar, A. (2023). Fine mapping and accurate prediction of complex traits using bayesian variable selection models applied to biobank-size data. European Journal of Human Genetics, 31(3):313โ€“320. [5] de Los Campos, G., Sorensen, D. A., and Toro, M. A. (2019). Imperfect linkage disequilibrium generates phantom epistasis (& perils of big data). G3: Genes, Genomes, Genetics, 9(5):1429โ€“ 1436. [6] Elgart, M., Lyons, G., Romero-Brufau, S., Kurniansyah, N., Brody, J. A., Guo, X., Lin, H. J., Raffield, L., Gao, Y., Chen, H., et al. (2022). Non-linear machine learning models incorporating snps and prs improve polygenic prediction in diverse human populations. Communications Biology, 5(1):856. [7] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849โ€“911. [8] Funkhouser, S. A., Vazquez, A. I., Steibel, J. P., Ernst, C. W., and Los Campos, G. d. (2020). Deciphering sex-specific genetic architectures using local bayesian regressions. Genetics, 215(1):231โ€“241. 32198180[pmid]. [9] Hill, W. G., Goddard, M. E., and Visscher, P. M. (2008). Data and theory point to mainly additive genetic variance for complex traits. PLoS genetics, 4(2):e1000008. [10] Kowalski, M. H., Qian, H., Hou, Z., Rosen, J. D., Tapia, A. L., Shan, Y., Jain, D., Argos, M., Arnett, D. K., Avery, C., et al. (2019). Use of > 100,000 nhlbi trans-omics for precision medicine (topmed) consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed african and hispanic/latino populations. PLoS genetics, 15(12):e1008500. [11] Lewis, C. M. and Vassos, E. (2020). 
Polygenic risk scores: from research tools to clinical instruments. Genome Medicine, 12(1):1–11. [12] Liu, Z., Bhattacharya, S., and Maiti, T. (2022). Variational Bayes ensemble learning neural networks with compressed feature space. IEEE Transactions on Neural Networks and Learning Systems. [13] Lourenço, V. M., Ogutu, J. O., Rodrigues, R. A., and Piepho, H.-P. (2022). Genomic prediction using machine learning: A comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data. bioRxiv, pages 2022–06. [14] Mills, M. C. and Rahal, C. (2019). A scientometric review of genome-wide association studies. Communications Biology, 2(1):9. [15] Pérez, P. and de Los Campos, G. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics, 198(2):483–495. [16] Sigurdsson, A. I., Ravn, K., Winther, O., Lund, O., Brunak, S., Vilhjalmsson, B. J., and Rasmussen, S. (2022). Improved prediction of blood biomarkers using deep learning. medRxiv, pages 2022–10. [17] Sigurdsson, A. I., Westergaard, D., Winther, O., Lund, O., Brunak, S., Vilhjálmsson, B. J., and Rasmussen, S. (2021). Deep integrative models for large-scale human genomics. bioRxiv, pages 2021–06. [18] Torkamani, A., Wineinger, N. E., and Topol, E. J. (2018). The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics, 19(9):581–590. [19] Veturi, Y., de Los Campos, G., Yi, N., Huang, W., Vazquez, A. I., and Kühnel, B. (2019). Modeling heterogeneity in the genetic architecture of ethnically diverse groups using random effect interaction models. Genetics, 211(4):1395–1407.

APPENDIX A4
TOPMED DATA SET

Figure A4.1 Plot of the first two principal components of the SNPs shows clear population stratification.

APPENDIX B4
SUPPLEMENTARY METHODS

B4.1 Local regression

In this section, we demonstrate the local regression approach used to model the whole-chromosome data using a sequence of 1600 SNPs. We divide the whole set into 4 segments, each of size 400 SNPs (∼ 1 mbp). Fig. B4.1 shows the 400 core SNPs and the flanking regions for the four local regression models. In the chromosome-wide application, the local regressions are used to calculate scores for each segment, which are then combined in the second stage. To calculate the scores using the BLR model, following [8], we only use the effects from the core region.

B4.2 Bayesian linear model

Let us assume a linear regression model with 𝒚 as the response and 𝑿 as the covariate matrix, of the form
$$\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2_\epsilon) \qquad \text{(B4.1)}$$
where 𝒚 = (𝑦1, 𝑦2, ..., 𝑦𝑛)′ is a vector of phenotypes, 𝑿 = (𝒙1, ..., 𝒙𝑝) is an 𝑛 × 𝑝 SNP matrix, 𝜷 is the 𝑝-dimensional vector of regression coefficients, 𝝐 is a vector of iid Gaussian error terms, and $\sigma^2_\epsilon > 0$ is the common error variance.

Prior We chose a Bayesian Variable Selection model with iid priors from the spike-and-slab family on the regression coefficients of Eq. B4.1 that includes a point mass at 0 and a Gaussian slab. The formulation of the prior is as follows:
$$p(\beta_j \mid \pi, \sigma^2_b) \overset{iid}{=} \pi\,\mathcal{N}(0, \sigma^2_b) + (1 - \pi)\,\mathbb{1}(\beta_j = 0)$$
$$\pi \sim \mathrm{Beta}(a, b), \qquad \sigma^2_b \sim \chi^{-2}(df_b, S_b), \qquad \sigma^2_\epsilon \sim \chi^{-2}(df_\epsilon, S_\epsilon)$$
Here, 𝜒−2 denotes the scaled-inverse chi-square distribution.

Figure B4.1 The local regression models fitted on a sequence of 1600 SNPs.
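To make the prior concrete, the following is a minimal Python sketch (illustration only, not the BGLR Gibbs sampler) of one draw of the regression coefficients from this spike-and-slab prior; the Beta prior counts follow Table B4.1 below (a/(a+b) = 1/10, a+b = 110), while the degrees of freedom and scale of the variance prior are made-up values.

```python
import numpy as np

def draw_prior_effects(p, a=11, b=99, df_b=5, S_b=1.0, seed=0):
    """One draw of (pi, sigma2_b, beta) from the spike-and-slab prior of B4.2."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(a, b)                           # prior inclusion probability
    # scaled-inverse chi-square draw (one common parameterization): df * S / chi2(df)
    sigma2_b = df_b * S_b / rng.chisquare(df_b)
    included = rng.random(p) < pi                 # spike (0) vs slab (Gaussian) indicator
    beta = np.where(included, rng.normal(0.0, np.sqrt(sigma2_b), size=p), 0.0)
    return beta
```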
Hyperparameters | Values
a/(a+b)         | 1/10
a + b           | 110
iterations      | 10,000
burn-in         | 2,000

Table B4.1 Hyperparameter values for the BLR model.

Hyperparameters We used the R package BGLR [15] to draw samples from the posterior. The Gibbs sampler in BGLR requires important hyperparameters such as the number of iterations and the number of burn-in steps. We set the hyperparameters according to Table B4.1.

B4.3 Bayesian linear model with group-specific interaction

Following [19], we define the Bayesian linear model with group-specific interaction effects. In addition to the main SNP effects, we introduce SNP × population-dummy effects in the model. For simplicity of notation, we describe the model assuming 2 populations:
$$\begin{bmatrix} \boldsymbol{y}_1 \\ \boldsymbol{y}_2 \end{bmatrix} = \begin{bmatrix} \boldsymbol{1}\mu_1 \\ \boldsymbol{1}\mu_2 \end{bmatrix} + \begin{bmatrix} \boldsymbol{X}_1 \\ \boldsymbol{X}_2 \end{bmatrix}\boldsymbol{\beta}_0 + \begin{bmatrix} \boldsymbol{X}_1 & \boldsymbol{0} \\ \boldsymbol{0} & \boldsymbol{X}_2 \end{bmatrix}\begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} + \begin{bmatrix} \boldsymbol{\epsilon}_1 \\ \boldsymbol{\epsilon}_2 \end{bmatrix} \qquad \text{(B4.2)}$$
Here, $\boldsymbol{\beta}_0 = \{\beta_{0j}\}_{j=1}^{p}$ are the main effects and $\boldsymbol{\beta}_i = \{\beta_{ij}\}_{j=1}^{p}$, 𝑖 = 1, 2, are the interaction effects for populations 1 and 2. We assume the spike-and-slab prior described in the previous section on the regression coefficients of Eq. B4.2. The effects are estimated using the BGLR software [15] with the values of the hyperparameters from Table B4.1.

Hyperparameters          | Segments  | Chromosome-wide
Train set size           | 22,778    | 22,578
Test set size            | 300       | 500
# epochs                 | 100       | 100
Training batch size      | 128       | 128
Optimizer                | Adam      | Adam
Initial learning rate    | 1e-4      | 1e-4
Learning rate drop       | 0.6       | 0.6
Learning rate drop steps | 10 epochs | 10 epochs

Table B4.2 Hyperparameter settings used for modeling the local segments and the entire chromosome.

B4.4 Neural network model

For simplicity of notation, we first define a shallow neural network model. Let ˆ𝒚 denote the output of a shallow neural network; then the model is defined as
$$\hat{\boldsymbol{y}} = b_0 + \sum_{j=1}^{K} b_j\,\psi\Big(\gamma_{j0} + \sum_{h=1}^{p} \gamma_{jh}\,\boldsymbol{x}_h\Big) \qquad \text{(B4.3)}$$
Here, the input 𝑿 is passed through one layer with 𝐾 hidden nodes before making the prediction ˆ𝒚. The parameter vectors $\{\gamma_{j0}, \gamma_{jh}\}_{h=1}^{p}$ denote the biases and weights for the 𝑗-th node of the hidden layer, and $\{b_0, b_j\}_{j=1}^{K}$ denote the biases and weights used to combine the output of the hidden layer for the final prediction. The non-linearity is introduced in the network by the activation function 𝜓. Deep neural networks expand shallow neural networks to have more than one hidden layer. We omit the full equation of deep neural networks for brevity.

Model architecture In this study, we use feed-forward multi-layer perceptron (MLP) neural networks. In Sec. 4.3, we have discussed the two purposes of applying the neural networks: (1) local regression on segments, and (2) combining the scores of the local models. For the local models, we considered an encoder-decoder architecture where both the encoder and the decoder are shallow neural networks (Eq. B4.3) with ReLU activations. In all our simulations, we set the number of hidden nodes to 𝐾 = 128. For combining the local regression scores, we used a single and much smaller shallow neural network with 𝐾 = 50.
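The two network roles just described can be sketched in PyTorch as follows; the function names, the exact depth of the encoder-decoder local learner, and the single-score output per segment are our reading of the description above rather than the code used in the experiments.

```python
import torch.nn as nn

def local_learner(p_snps, K=128):
    # Encoder-decoder MLP applied to one ~3 mbp window of SNPs; each block is a
    # shallow network of the form in Eq. B4.3 with psi = ReLU.
    return nn.Sequential(
        nn.Linear(p_snps, K), nn.ReLU(),   # encoder
        nn.Linear(K, K), nn.ReLU(),        # decoder hidden layer
        nn.Linear(K, 1),                   # local genomic score for the segment
    )

def score_combiner(n_segments=250, K=50):
    # Second-stage network g(.) that maps the 250 local scores to the final PGS.
    return nn.Sequential(nn.Linear(n_segments, K), nn.ReLU(), nn.Linear(K, 1))
```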
All models were fitted using early stopping on 10-fold cross-validation data splits of the training set and the ensemble of the 10 models was taken as the final model. 84 Fitting deep learning models involves several key hyperparameters. In Table B4.2, we provide the necessary hyperparameters for the training and evaluation of all fitted models. 85 CHAPTER 5 CONCLUSION This dissertation made significant strides in the field of machine learning, particularly in the realm of supervised learning encompassing both classification and regression tasks. We developed scalable machine learning methods and explored their practical application for feature selection and prediction in large-scale genetics data, resulting in several noteworthy discoveries: Chapter 1 introduced the Bayesian Hierarchical Hypothesis Testing (BHHT) algorithm, which significantly enhanced fine-mapping tasks in Genome-wide association studies. By applying BHHT to identify risk variants for serum-urate traits using data from UKBiobank, the method demonstrated an impressive increase in discoveries by capturing highly correlated SNPs that conventional variant- centered procedures failed to detect. Chapter 2 addressed the improvement of a deep latent variable model, the information bot- tleneck, by introducing the novel sparsity-inducing spike-slab categorical prior (SparC-IB). Ex- periments on MNIST, CIFAR-10, and ImageNet data showcased the superior prediction accuracy and robustness of the proposed approach compared to traditional fixed-dimensional priors and other sparsity induction mechanisms for latent variable models found in the literature. SparC-IBโ€™s ability to enable data-point-specific dimensionality learning further demonstrated the successful separation of latent dimensions between image classes. In Chapter 3, we focused on developing machine learning methods for constructing polygenic scores (PGS) for ancestry-diverse datasets. Using the TOPMed genotype data on short chromosome segments and simulated phenotypes, we show that the local neural network learners improve cross- ancestry polygenic risk scores (PGS) prediction over state-of-the-art Bayesian variable selection models. Inspired by this, we propose a two-step procedure that enables expanding the use of deep learning for whole-genome PGS prediction. Our simulation results show that a two-step procedure that involves inference of linear local scores for individual chromosome segments, later combined into a single PGS using a neural network provides better prediction performance than linear models or models based on neural networks only. 86 In summary, the thesis contributes valuable insights into the realm of machine learning method- ologies for feature selection and prediction in the context of large-scale genetics data. The methods and algorithms that we developed hold the potential to advance genetic research and pave the way for improved precision in understanding complex genetic traits and their associations. 87