FLEXIBLE HIERARCHICAL BAYESIAN MODELING EXTENSIONS TO IMPROVE
WHOLE GENOME PREDICTION AND GENOME WIDE ASSOCIATION ANALYSES
By
Chunyu Chen

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Animal Science - Doctor of Philosophy
2017

ABSTRACT
FLEXIBLE HIERARCHICAL BAYESIAN MODELING EXTENSIONS TO IMPROVE
WHOLE GENOME PREDICTION AND GENOME WIDE ASSOCIATION ANALYSES
By
Chunyu Chen
Whole genome prediction (WGP) has been widely implemented in animal and plant
breeding for genomic selection of economically important traits, having already accelerated
genetic progress for such traits in some species, especially dairy cattle.
Genome wide association (GWA) analysis is used for screening genomic regions that may
include important candidate genes segregating for the trait of interest and is being increasingly
integrated with WGP analysis. Both WGP and GWA typically represent $m \gg n$ problems, as
defined by a large number of single nucleotide polymorphism (SNP) markers (m) and a comparatively
much smaller number of individuals (n). Two broad types of parametric models are typically
considered for these analyses: traditional best linear unbiased prediction approaches based on
SNP marker effects being normally distributed and Bayesian WGP models that allow more
flexible specifications for SNP marker effects based on either heavy-tailed or variable selection
specifications. Bayesian WGP models can achieve higher prediction accuracies than traditional
approaches in many applications if properly tuned; however, their implementation can be
computationally challenging. My dissertation aimed to address some of these emerging
issues in Bayesian WGP models and to provide software tools for real data applications.
In Chapter 2, I developed an expectation maximization (EM) algorithm as a fast alternative to
traditional Markov Chain Monte Carlo (MCMC) for Bayesian WGP models. I proposed EM
implementations for two models, heavy-tailed BayesA and stochastic search and variable
selection (SSVS), adapting the EM algorithm for maximum a posteriori (MAP) inference of SNP
effects and adapting REML-like strategies to estimate key hyperparameters. Using a
comprehensive simulation study and real data analysis, I found that these empirical Bayes
approaches can be quite sensitive to starting values for SNP effects. However, using a
deterministic annealing variant of EM, I obtained hyperparameter estimates and prediction
accuracies comparable to their MCMC counterparts. In Chapter 3, I further assessed the
possibility of using two Bayesian WGP models, BayesA and SSVS, for GWA studies. I also
included a popular GWA analysis (EMMAX) based on the linear mixed model.
In addition to basing inferences on traditional single SNP tests and fixed genomic window tests, I
assessed the merit of tests involving adaptively determined windows based on clustering the genome
into blocks based on linkage disequilibrium. I found that SSVS and BayesA under MCMC and
adaptive window tests led to the best receiver operating characteristic (ROC) curve properties. In Chapter 4, I
extended SSVS to single step SSVS to incorporate phenotypes of non-genotyped individuals and
compared its performance with corresponding models ignoring these genotypes for both WGP
and GWA. I found single step SSVS to be promising for WGP and GWA, particularly for
genetic architectures characterized by a few genes with large effects. In Chapter 5, I combined
many of the developments in Chapters 2 through 4, and beyond, in a unified framework as an
open source R package BATools to implement several different Bayesian models for WGP and
GWA.

To my wife Gefan Li

ACKNOWLEDGMENTS

It has been a great experience to spend five years exploring, and finally coming to know, just the tip of animal
breeding and genetics. This dissertation research could not have been done without the help of
my advisor, committee members, and lab mates and the support of family and friends. I would
like to sincerely thank everyone who helped and supported me through this incredible journey.
Firstly, I would like to thank my Ph.D. advisor, Professor Robert J. Tempelman, for the guidance
and patience during the past five years. Rob is someone you will love once you talk to him. He is
always energetic, passionate and thoughtful. He is a good mentor, professor, and friend. Without
his advice and suggestions, I would not have completed this work.
I also want to thank my committee members Dr. Juan Steibel, Dr. Yuehua Cui, Dr. Nora Bello
and Dr. Qing Lu for their insightful suggestions and inspiring thoughts. I appreciate your help
and suggestions. Special thanks to Dr. Juan Steibel, who greatly inspired me to write
reproducible code through his class and talks and who made the swine data available for my research,
and to Dr. Yuehua Cui, for his introduction to statistical genetics.
Also, thanks to my former and current colleagues in the quantitative genetics lab including
Jose-Luis Gualdron Duarte, Yeni Bernal Rubio, Wenzhao Yang, Heng Wang, Youngfang Lu, Lei
Zhou, Pablo Reeb, Kaitlyn Daza, Yasir Nawaz, Deborah Velez, Ryan Corbett and Scott
Funkhouser for the meaningful discussions and friendship. Special thanks to Jose-Luis Gualdron
Duarte and Yeni Bernal Rubio for their assistance in preparing the swine data.
Finally, I am indebted to my wife and parents, and I'd like to thank them for their love and
support.

TABLE OF CONTENTS

LIST OF TABLES ......................................................................................................................... ix
LIST OF FIGURES ...................................................................................................................... xii
Chapter 1 Introduction .................................................................................................................... 1
Chapter 2 An Integrated Approach to Empirical Bayesian Whole Genome Prediction Modeling  7
2.1 Abstract ............................................................................................................................ 7
2.2 Introduction ...................................................................................................................... 8
2.3 Materials and methods ................................................................................................... 10
2.3.1 The first stage linear WGP model ........................................................................... 10
2.3.2 BayesA EM ............................................................................................................. 11
2.3.3 SSVS EM ................................................................................................................ 14
2.3.4 Hyperparameter estimation ..................................................................................... 17
2.3.4.1 Variance component estimation ....................................................................... 17
2.3.4.2 Estimation of remaining hyperparameters ....................................................... 20
2.4 Data ................................................................................................................................ 21
2.4.1 Simulation Study ..................................................................................................... 21
2.4.2 Loblolly Pine Data .................................................................................................. 22
2.5 Data Analysis ................................................................................................................. 23
2.6 Expectation maximization variable selection (EMVS) .................................................. 24
2.7 Results ............................................................................................................................ 26
2.7.1 Simulation Study ..................................................................................................... 26
2.7.2 Application to Loblolly Pine Data .......................................................................... 32
2.8 Discussion ...................................................................................................................... 34
Chapter 3 Genome Wide Association Analyses Based on Broadly Different Specifications for
Prior Distributions, Genomic Windows, and Estimation Methods............................................... 39
3.1 Abstract .......................................................................................................................... 39
3.2 Introduction .................................................................................................................... 40
3.3 Methods and materials ................................................................................................... 44
3.3.1 The hierarchical linear model .................................................................................. 44
3.3.2 Models ..................................................................................................................... 45
3.3.3 Joint posterior density ............................................................................................. 46
3.3.4 Algorithms............................................................................................................... 46
3.3.4.1 Markov Chain Monte Carlo ............................................................................. 46
3.3.4.2 Maximum a posteriori estimation .................................................................. 47
3.3.5 Conducting Genome Wide Association Analyses .................................................. 48
3.3.5.1 Single SNP marker associations ...................................................................... 48
3.3.5.2 Windows based associations ............................................................................ 52
3.4 Data ................................................................................................................................ 55
3.4.1 Simulation Study ..................................................................................................... 55
3.4.2 MSUPRP data ......................................................................................................... 60
3.5 Results ............................................................................................................................ 61
3.5.1 Simulation Study ..................................................................................................... 61
3.5.2 MSUPRP Data ........................................................................................................ 65
3.6 Discussion ...................................................................................................................... 69
Chapter 4 Hierarchical Whole-Genome Prediction and Genome-Wide Association Modeling
When Some Genotypes Are Missing ............................................................................................ 79
4.1 Abstract .......................................................................................................................... 79
4.2 Introduction .................................................................................................................... 80
4.3 Methods and materials ................................................................................................... 82
4.3.1 The hierarchical linear model .................................................................................. 82
4.3.2 The ssGBLUP model .............................................................................................. 84
4.3.3 The ssSSVS model .................................................................................................. 87
4.3.4 Conducting Genome-Wide Association Analyses .................................................. 88
4.3.4.1 Single SNP marker associations ...................................................................... 88
4.3.4.2 Windows based associations ............................................................................ 89
4.4 Data and Applications Strategies ................................................................................... 90
4.4.1 Genotypes ................................................................................................................ 90
4.4.2 Simulation study...................................................................................................... 91
4.4.3 Dairy consortium data ............................................................................................. 93
4.4.4 Benchmarking analysis for dairy consortium data .................................................. 95
4.4.5 Cross-validation study for dairy data ...................................................................... 95
4.4.6 Software .................................................................................................................. 97
4.5 Results ............................................................................................................................ 97
4.5.1 Simulation Study ..................................................................................................... 97
4.5.2 Dairy Data ............................................................................................................. 100
4.6 Discussion .................................................................................................................... 110
4.7 Summary and Conclusions........................................................................................... 120
Chapter 5 BATools: A Hierarchical Modeling R Package for Genome Prediction and Genome-wide Association Analysis .......................................................................................................... 121
5.1 Abstract ........................................................................................................................ 121
5.2 Introduction .................................................................................................................. 121
5.3 Statistical Models and Algorithms ............................................................................... 126
5.3.1 Priors for marker effects ........................................................................................ 127
5.3.2 Single-step for BayesA/B and SSVS .................................................................... 129
5.3.3 Antedependence implementation .......................................................................... 130
5.3.4 Algorithms............................................................................................................. 131
5.3.5 GWA implementation ........................................................................................... 132
5.4 Data .............................................................................................................................. 133
5.5 Interface and application examples .............................................................................. 134
5.5.1 Example 1: Cross-validation using BRR, BayesA and SSVS .............................. 138
5.5.2 Example 2: GWA using EMMAX and SSVS...................................................... 141
5.5.3 Example 3: Fitting antedependence model for GWA ........................................... 143
5.5.4 Example 4: Fitting single-step model using ssGBLUP, ssBayesA, ssBayesB and
ssSSVS ........................................................................................................................... 147
5.6 Performance and computing time ................................................................................ 149
5.7 Concluding remarks and future developments............................................................. 150
Chapter 6 Conclusions, Discussions and Future Work ............................................................... 153
APPENDICES ............................................................................................................................ 159
Appendix A: Chapter 3 ...................................................................................................... 160
Appendix B: Chapter 4 ...................................................................................................... 184
Appendix C: Chapter 5 ...................................................................................................... 217
REFERENCES ........................................................................................................................... 222

LIST OF TABLES

Table 2.1 Average MCMC and MMAP estimates of hyperparameters as a function of marker
density, starting values and expectation-step (E-step) strategies under a BayesA model. ........... 27
Table 2.2 Average MCMC and MMAP estimates of hyperparameters as a function of marker
density and starting values under a SSVS model.......................................................................... 29
Table 3.1 Overall mean relative (random classifier = 1) partial areas under a receiver operating
characteristic curve up until a false positive rate of 5% (pAUC05) for different methods on single
SNP associations ........................................................................................................................... 61
Table 3.2 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for associations based on genomic windows of length 1Mb. Comparisons are made
within different specifications of shape parameter ($\gamma$) for Gamma distribution of quantitative trait
loci (QTL) and number of QTL (nqtl) ........................................................................................... 62
Table 3.3 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for associations based on genomic windows adaptively chosen by the BALD software
package. Comparisons are made within different specifications of number of quantitative trait
loci (nqtl) ........................................................................................................................................ 63
Table 3.4 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) between different
window sizes within each of EMMAX, MCMC-BayesA, and MCMC-SSVS. ........................... 64
Table 4.1 Number of cows by research station in dairy consortium study ................................... 93
Table 4.2 Cross-validation (25-fold) prediction accuracies for comparing GBLUP and SSVS (all
animals genotyped) in benchmark analysis ................................................................................ 100
Table 4.3 Cross-validation (25-fold) prediction accuracies for GBLUP and SSVS and their
respective single step extensions (ssGBLUP and ssSSVS) on genotyped cows ........................ 101
Table 4.4 Cross-validation (25-fold) prediction accuracies for GBLUP and SSVS and their
respective single step extensions (ssGBLUP and ssSSVS) on non-genotyped cows ................. 101
Table 4.5 Average (n=5 fold for within herds and n=6 fold for across herds) measures of
strength of association (-log10(P-value) using GBLUP or posterior probability using SSVS) for
most significant SNP/genomic region using single-step compared to conventional specifications
on milk fat ................................................................................................................................... 105
Table 4.6 Average (n=5 fold for within herds and n=6 fold for across herds) measures of strength
of association (-log10(P-value) using GBLUP or posterior probability using SSVS) for most
significant SNP/genomic region using single-step compared to conventional specifications on
body weight ................................................................................................................................. 106
Table 4.7 Likelihood ratio test on $H_0: \sigma_\alpha^2 = \sigma_u^2$ for within station study in milk fat ................. 107
Table 4.8 Likelihood ratio test on $H_0: \sigma_\alpha^2 = \sigma_u^2$ for milk fat across station splits where respective
analysis masked genotypes for research station as indicated below ........................................... 108
Table 4.9 Likelihood ratio test on $H_0: \sigma_\alpha^2 = \sigma_u^2$ for milk fat across station splits where respective
analysis masked genotypes on all other research stations except for research station as indicated
below ........................................................................................................................................... 109
Table 4.10 Full results of within station analyses for PPA in SSVS for most significant
SNP/genomic region using single-step compared to conventional specifications on milkfat and
body weight ................................................................................................................................. 114
Table 4.11 Full results of across station analyses for PPA in SSVS for most significant
SNP/genomic region using single-step compared to conventional specifications on milkfat and
body weight ................................................................................................................................. 117
Table 5.1 List of models in BATools and their priors and hyperparameters .............................. 129
Table 5.2 GWA output for different models for single SNP and window based approaches .... 132
Table 5.3 Cross-validation prediction accuracy for BRR, BayesA and SSVS ........................... 141
Table 5.4 Cross-validation prediction accuracy for ssGBLUP, ssBayesA, ssBayesB and ssSSVS
..................................................................................................................................................... 147
Table A.1 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for inferring associations based on non-overlapping genomic windows of length 0.5Mb.
Comparisons are made within different specifications of shape parameter ($\gamma$) for Gamma
distribution of quantitative trait loci (QTL) and number of QTL (nqtl) ...................................... 174
Table A.2 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for inferring associations based on non-overlapping genomic windows of length 2Mb.
Comparisons are made within different specifications of shape parameter ($\gamma$) for Gamma
distribution of quantitative trait loci (QTL) and number of QTL (nqtl) ...................................... 175
Table A.3 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for inferring associations based on non-overlapping genomic windows of length 3Mb.
Comparisons are made within different specifications of shape parameter ($\gamma$) for Gamma
distribution of quantitative trait loci (QTL) and number of QTL (nqtl) ...................................... 176
Table A.4 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
specifications of degrees of freedom hyperparameter (vg = 2.5 versus vg = 5.0) using MCMC-BayesA. Comparisons are made within different specifications of shape parameter ($\gamma$) for
Gamma distribution of quantitative trait loci (QTL) and number of QTL (nqtl) ......................... 177
Table A.5 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different sets of
starting values for SNP effects (MCMC-SSVS vs RRBLUP) for MAP-SSVS. Comparisons are
made within different specifications of number of quantitative trait loci (nqtl) .......................... 178
Table A.6 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different sets of
starting values for SNP effects (MCMC-BayesA vs RRBLUP) for MAP-BayesA. Comparisons
are made within different specifications of number of quantitative trait loci (nqtl) .................... 179
Table A.7 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for inferring associations averaging across all window size determinations (single SNP,
0.5 Mb, 1.0Mb, 2.0Mb, 3.0Mb and adaptive windows) based on two different methods for
inferring posterior probabilities of association: 1) That proposed by Fernando et al., 2014 and 2)
that proposed by Moser et al., 2015. Comparisons are made within different specifications of
shape parameter ($\gamma$) for Gamma distribution of quantitative trait loci (QTL) and number of QTL
(nqtl) ............................................................................................................................................. 180

LIST OF FIGURES

Figure 2.1 MMAP versus MCMC estimates of $\sigma_g^2$ in simulation study (20 replicated datasets
each with 5,000 markers) under BayesA model with MMAP estimates based on different starting
values for g and $\sigma$: Set 1) rrBLUP (g) and REML ($\sigma$), Set 2) MCMC (g and $\sigma$), or Set 3) g = 0
and MCMC ($\sigma$). Top panel of plots (P) pertains to use of E-step based on relative precisions
whereas bottom panel of plots (V) pertains to use of E-step based on relative variances.
Reference line of slope 1 and intercept 0 superimposed............................................................... 28
Figure 2.2 MMAP versus MCMC posterior means of $\sigma_g^2$ (top row) and of $\pi$ (bottom row) in
simulation study (20 replicated datasets each with 5,000 markers) under SSVS model with
MMAP estimates based on different starting values for g and $\theta$: Set 1) rrBLUP (g) and
REML ($\sigma$), Set 2) MCMC (g and $\theta$), and Set 3) g = 0 and MCMC ($\theta$). Reference line of slope 1
and intercept 0 superimposed........................................................................................................ 30
Figure 2.3 Mean accuracies of breeding value prediction for EB inference as a function of
different marker densities (625, 1250, 2500, or 5000 markers) for BayesA (Panel A) and SSVS
(Panel B) in simulation study (20 replicated datasets). e-GBLUP, based on REML ($\sigma$), is the same
for both Panels A) and B) whereas MCMC refers to using fully Bayesian inference under
MCMC for the corresponding model. Other lines pertain to EB inference based on different sets
of starting values or E-step strategies: Set 1) e-rrBLUP (g) and REML ($\sigma$), Set 2) MCMC (g
and $\sigma$), Set 3) g = 0 and MCMC ($\sigma$) with letter suffixes indicating whether the corresponding E-step was based on relative precisions (P) or relative variances (V). Letter codes used to separate
estimates (P<0.05) having different accuracies based on the 5,000 marker analyses .................. 31
Figure 2.4 Average cross-validation accuracies for analysis of loblolly pine data based on
empirical Bayes (EB) inference under BayesA (left cluster) or SSVS (right cluster) models. e-GBLUP, based on REML ($\sigma$), is the same for both clusters whereas MCMC refers to using fully
Bayesian inference for the corresponding model. Other bars pertain to EB inference based on
different sets of starting values or E-step strategies: Set 1) e-rrBLUP (g) and REML ($\sigma$), Set
2) MCMC (g and $\sigma$), or Set 3) g = 0 and MCMC with letter suffixes indicating whether the
corresponding E-step was based on relative precisions (P) or relative variances (V). Letter codes
used to separate estimates (P<0.05) having different cross-validation prediction accuracies within
each cluster.................................................................................................................................... 33
Figure 2.5 DAEMVS regularization plot for e-SSVS analysis of one training dataset analysis
based on loblolly pine data. The x-axes pertain to precision on spike component variance (c at
bottom) and inverse temperatures (t at top) whereas the y-axis denotes SNP effect estimates $\hat{g}$ .. 34
Figure 3.1 Distribution of quantitative trait loci effects under a Gamma distribution for different
specifications of shape (magenta curve $\gamma$ = 0.18, blue curve $\gamma$ = 1.48 and red curve $\gamma$ = 3.00) .... 56
Figure 3.2 Manhattan plots for single SNP analysis on 13th week 10th rib backfat in Duroc
Pietrain F2 cross (n = 922 pigs) based on different methods (Panel A: EMMAX, Panel B:
MCMC-SSVS, Panel C: MCMC-BayesA, Panel D: RRBLUP, Panel E: MAP-BayesA and Panel
F: MAP-SSVS) ............................................................................................................................. 66
Figure 3.3 Manhattan plots for genomic window based associations on 13th week 10th rib backfat
in Duroc Pietrain F2 cross (n = 922 pigs) based on different methods (Panel A: EMMAX, Panel
B: MCMC-SSVS, Panel C: MCMC-BayesA, Panel D: RRBLUP, Panel E: MAP-BayesA and
Panel F: MAP-SSVS) under adaptive window inference. ............................................................ 68
Figure 3.4 Linkage disequilibrium ($r^2$ metric) heatmap for genomic region containing Windows
905 - 909 on Chromosome 6 as adaptively determined by BALD software. Blue dots are starting
and ending points for window 905 whereas purple dots are starting and ending points for window
909. Black dots are the 3 markers at 133.9292Mb, 136.0786Mb and 136.0844Mb that are top 3
SNPs by MCMC-SSVS. The blue oval is used to highlight a pocket of higher $r^2$ measures among SNP
markers in windows 905 and 909. ................................................................................................. 69
Figure 4.1 Illustration of within station partitions P1-P5 for one particular station. 20% of the
cows are marked as non-genotyped with the remaining 80% cows treated as genotyped in each
partition
Figure 4.2 Example of training vs. validation partition for P1 (from Figure 4.1)
Figure 4.3 Boxplots of prediction accuracies of breeding values of genotyped and non-genotyped
cows based on the simulation study of different nqtl of 30, 300 and 3000. Panel A) nqtl=30 for
genotyped cows; Panel B) nqtl=300 for genotyped cows; Panel C) nqtl=3000 for genotyped cows;
Panel D) nqtl=30 for non-genotyped cows; Panel E) nqtl=300 for non-genotyped cows; Panel F)
nqtl=3000 for non-genotyped cows; Methods not sharing the same letter code within each panel
have different mean prediction accuracies (P<0.05).
Figure 4.4 Boxplot of relative pAUC05 for each method on the simulation study of different nqtl
of 30, 300 and 3000. The first row is the relative pAUC05 using single SNP approach and the
second row is the relative pAUC05 using adaptive window approach. Panel A) nqtl=30 for single
SNP; Panel B) nqtl=300 for single SNP; Panel C) nqtl=3000 for single SNP; Panel D) nqtl=30 for
adaptive window; Panel E) nqtl=300 for adaptive window; Panel F) nqtl=3000 for adaptive
window. Methods not sharing the same letter code are significantly different from each other
within each plot (P<0.05).
Figure 4.5 Boxplot of relative pAUC05 for SSVS and ssSSVS on the simulation study with
nqtl=3000 for adaptive window approach based on different specifications for $\pi_\phi$: Panel A)
$\pi_\phi = 0.001$, Panel B) $\pi_\phi = 0.01$ and Panel C) joint MCMC sampling of $\pi_\phi$. Methods not sharing
the same letter code are significantly different from each other within each plot (P<0.05).
Figure 4.6 Manhattan plot for milkfat treating all cows as genotyped in benchmarking study.
Panel A: single SNP inferences for EMMAX; Panel B: adaptive window inferences for
EMMAX; Panel C: single SNP inferences for SSVS; Panel D: adaptive window inferences for
SSVS.
Figure 4.7 Manhattan plot for body weight treating all cows as genotyped in benchmarking
study. Panel A: single SNP inferences for EMMAX; Panel B: adaptive window inferences
for EMMAX; Panel C: single SNP inferences for SSVS; Panel D: adaptive window
inferences for SSVS.
Figure 4.8 Manhattan plot for milkfat masking genotypes of ISU cows using ssEMMAX. Panel A:
single SNP inferences for HETVAR variance; Panel B: adaptive window inferences for
HETVAR variance; Panel C: single SNP inferences for HOMVAR variance; Panel D: adaptive
window inferences for HOMVAR variance.
Figure 4.9 LD (r2 metric) heatmap for chromosome 14 from 1189.341kb to 3059.698kb that
contains all the SNPs and windows selected by EMMAX based on the benchmark. Purple stars
are the SNPs deemed to be significant by EMMAX; blue circles are the starting and
ending SNPs for the windows deemed to be significant by EMMAX; green circles are the starting and
ending SNPs for other windows in the map.
Figure 5.1 Visualization of prior distributions for SNP marker effects in BATools.
Figure 5.2 Loading Pig data included in BATools
Figure 5.3 Basic model setting and fitting for a GBLUP model
Figure 5.4 Full model setting and fitting for GBLUP. Verbose counterpart to Figure 5.3
Figure 5.5 Summary of BATools results
Figure 5.6 Visualization of cross-validation results for BRR, BayesA and SSVS via built-in
BATools function baplot. Black dots are from training and red dots are from validation set.
Figure 5.7 5-fold Cross-validation using BRR, BayesA and SSVS
Figure 5.8 GWA using EMMAX and SSVS
Figure 5.9 Manhattan plot from GWA using the example MSUPRP dataset. Panel A) EMMAX
single SNP approach; Panel B) EMMAX adaptive window approach; Panel C) SSVS single SNP
approach; Panel D) SSVS adaptive window approach.
Figure 5.10 GWA using anteBayesA and anteBayesB
Figure 5.11 Manhattan plot from GWA using the example MSUPRP dataset. Panel A)
anteBayesA single SNP approach; Panel B) anteBayesA adaptive window approach; Panel C)
anteBayesB single SNP approach; Panel D) anteBayesB adaptive window approach; Panel E)
absolute value of association parameter for anteBayesA; Panel F) absolute value of association
parameter for anteBayesB.
Figure 5.12 5-fold Cross-validation using ssGBLUP, ssBayesA, ssBayesB and ssSSVS
Figure 5.13 Computing time in seconds per 1000 iterations for BayesB for sampling all the
marker effects by sample size and the number of markers. The benchmark was performed on a
2.4 GHz Intel Xeon E5-2680v4 CPU using a single core
Figure A.1 Boxplot of window lengths for windows adaptively chosen based on the BALD
software in terms of mega bases (Panel A) and number of SNP markers (Panel B)
Figure A.2 Average ROC curve (10 replicates) for 1Mb versus 10 SNP windows using EMMAX
(Panel A), MAP-SSVS (Panel B), MAP-BayesA (Panel C) and RRBLUP (Panel D) for 30
quantitative trait loci generated from a Gamma distribution with shape parameter 1.48
Figure A.3 Scatterplots of posterior probabilities of association (PPA) for MCMC-SSVS (x-axis)
versus MAP-SSVS (y-axis) for analysis on 13th rib backfat on 922 pigs from the MSUPRP
population based on different starting values for MAP-SSVS: A) RRBLUP and B) MCMC-SSVS.
Figure A.4 Scatterplot of posterior probabilities of association (PPA) based on local false
discovery rates (lFDR) conversions of p-values from EMMAX procedure (y-axis: PPA=1-lFDR)
and MCMC-SSVS (x-axis) on 13th rib backfat on 922 pigs from the MSUPRP population.
Figure B.1 Partition P1: Manhattan plot for milkfat in within station splits of genotyped and non-genotyped animals.
Figure B.2 Partition P2: Manhattan plot for milkfat in within station splits of genotyped and non-genotyped animals.
Figure B.3 Partition P3: Manhattan plot for milkfat in within station splits of genotyped and non-genotyped animals.
Figure B.4 Partition P4: Manhattan plot for milkfat in within station splits of genotyped and non-genotyped animals.
Figure B.5 Partition P5: Manhattan plot for milkfat in within station splits of genotyped and non-genotyped animals.
Figure B.6 Without the genotype of ISU: Manhattan plot for milkfat in across station study.
Figure B.7 Without the genotype of MSU: Manhattan plot for milkfat in across station study.
Figure B.8 Without the genotype of USDFRC: Manhattan plot for milkfat in across station study.
Figure B.9 Without the genotype of UW: Manhattan plot for milkfat in across station study.
Figure B.10 Without the genotype of FL: Manhattan plot for milkfat in across station study.
Figure B.11 Without the genotype of AGIL: Manhattan plot for milkfat in across station study.
Figure B.12 Partition 1: Manhattan plot for body weight in within station study.
Figure B.13 Partition 2: Manhattan plot for body weight in within station study.
Figure B.14 Partition 3: Manhattan plot for body weight in within station study.
Figure B.15 Partition 4: Manhattan plot for body weight in within station study.
Figure B.16 Partition 5: Manhattan plot for body weight in within station study.
Figure B.17 Without the genotype of ISU: Manhattan plot for body weight in across station
study.
Figure B.18 Without the genotype of MSU: Manhattan plot for body weight in across station
study.
Figure B.19 Without the genotype of USDFRC: Manhattan plot for body weight in across station
study.
Figure B.20 Without the genotype of UW: Manhattan plot for body weight in across station
study.
Figure B.21 Without the genotype of FL: Manhattan plot for body weight in across station study.
Figure B.22 Without the genotype of AGIL: Manhattan plot for body weight in across station
study.

Chapter 1 Introduction
Whole genome prediction (WGP) using dense single nucleotide polymorphism (SNP) marker
panels has been increasingly implemented in animal and plant breeding for genetic improvement
of economically important traits. Currently, SNP marker panels with ~50,000 SNP markers are
widely used for most livestock species, and high-density panels with ~770,000 SNP markers are
also available (Meuwissen et al. 2016). In the 1000 bull genome project
(http://www.1000bullgenomes.com/), 28.3 million variants have been obtained on 238 cattle
using whole-genome sequence technology (Daetwyler et al. 2014) with the most recent updated
number of sequenced cattle now at 1147 (http://www.canadacow.ca/). Therefore, WGP is a “big
data” research and application area, albeit one characterized by a relatively small number of
observations (n) compared to the number of predictors or SNP markers (m).
The seminal idea of using WGP for genomic selection (GS) of livestock was developed 16
years ago by Meuwissen et al. (2001) who exposited the use of best linear unbiased prediction
(BLUP) and various Bayesian extensions to include information on SNP marker panels, even
before such panels were developed for livestock! Genetic gain using genomic selection has
doubled in dairy cattle traits compared to the period before the adoption of SNP marker
information for genetic evaluations in 2009 (García-Ruiz et al. 2016) such that genomic selection
is now considered to be a mature technology (Misztal 2016b). Nevertheless, improved
methodologies and software tools for WGP still require further development.
Two broad categories of models for WGP are genomic BLUP (GBLUP) models and
Bayesian models (Meuwissen et al. 2001; de Los Campos et al. 2013; Gianola 2013), both
considered to be critical components of hierarchical linear or multilevel modeling. While
GBLUP or Bayesian ridge regression (BRR) (Hoerl and Kennard 1970) is equivalent to
specifying a Gaussian prior on the SNP marker effects, other Bayesian models are often
specified with more flexible priors on the SNP marker effect to provide differential shrinkage
effects that may be appropriate for various types of genetic architecture; i.e., whether or not a
trait is characterized by few (oligogenic) or many (polygenic) loci. These priors include
heavy-tailed specifications such as a scaled Student-t (Meuwissen et al., 2001), variable selection
specifications such as SSVS or stochastic search and variable selection (George and McCulloch
1993) or hybrids thereof such as a mixture of point mass at zero and scaled-t (BayesB) as
originally proposed by Meuwissen et al. (2001). With greater flexibility in priors, hierarchical
Bayesian models have been shown to provide higher WGP prediction accuracies than
GBLUP/BRR in many applications (de Los Campos et al. 2013).
Unfortunately, implementation of hierarchical Bayesian models with flexible priors typically
imposes intensive computing demands as posterior inference is usually based on simulation-based
Markov chain Monte Carlo (MCMC) techniques. Conversely, GBLUP is relatively fast,
particularly with small n (i.e. n x n matrices are easily inverted) in part because of a recent
equivalent model realization that parameterizes genetic effects in terms of additive genetic
effects rather than SNP marker effects (Stranden and Garrick 2009). Over the past few years,
several Expectation-Maximization (EM) algorithms have been developed to reduce computing time
and partly address computational limitations in hierarchical Bayesian models with flexible
priors (Meuwissen et al. 2009; Shepherd et al. 2010; Karkkainen and Sillanpaa 2012; Sun et al.
2012). These EM implementations sometimes achieve WGP accuracies comparable to, or slightly
lower than, those of their MCMC counterparts. The hyperparameters in these EM algorithms are often
arbitrarily specified (Karkkainen and Sillanpaa 2012; Sun et al. 2012) or sometimes determined by
heritability-based rules (de Los Campos et al. 2013) that have been shown to be suboptimal
compared to formally allowing for their uncertainty (Lehermeier et al. 2013; Yang et al. 2015b).
Furthermore, there does not seem to be an appreciation of the potential influence of starting
values for the SNP marker effects in those EM based approaches with most implementations
choosing zero as starting values. Hence, hyperparameter tuning/estimation and starting values for
EM approaches require further investigation as to their effect on these analytical approximations.
Genome-wide association (GWA) analysis is another important tool to help pinpoint the
regions that contain causal variants or quantitative trait loci (QTL) for complex traits. With the
availability of high density SNP marker panels, the very first GWA result was reported in 2005
(Klein et al. 2005) followed by the first large scale GWA study by the Wellcome Trust Case Control
Consortium (2007). Early GWA studies were based on serial simple linear regression
models on SNP markers, one at a time, without accounting for population structure and
relatedness. Ignoring these features has been demonstrated to result in spurious GWA inferences
(Martin and Eskin 2016). To account for population structure, linear mixed models (LMM)
that treat the marker of inferential interest as fixed and all other marker effects as random
have been proposed (Kang et al. 2008). Since then various computationally efficient
enhancements to this LMM approach have been developed (Kang et al. 2010; Lippert et al.
2011; Zhou and Stephens 2012; Gualdron Duarte et al. 2014). The variance components of these
LMM are typically estimated using restricted maximum likelihood (REML).
Hierarchical Bayesian models in WGP fit all SNP markers as random effects simultaneously
(Meuwissen et al. 2001) and automatically account for population structure just as LMM
with random effects do. However, random effects estimation using Gaussian prior specifications
(equivalent to GBLUP/BRR) tends to overly shrink all marker effects to zero, particularly for
higher marker densities (Hayes 2013) and has been deemed to be too conservative for GWA
(Gualdron Duarte et al. 2014). Therefore, priors such as BayesA and SSVS, which tend to shrink
larger marker effects far less than a Gaussian prior in WGP applications, should be considered
for GWA. GWA studies rely
on SNPs to be correlated or in linkage disequilibrium (LD) with QTLs. In fact, many SNPs are
likely to be in LD with a single QTL (Goddard et al. 2016), and single marker tests suffer from
multicollinearity problems or low statistical power or both (Fernando et al. 2017). For this
reason, GWA studies based on joint tests on all markers within a genomic window/region should
be considered. Currently, however, window lengths tend to be arbitrarily specified (Schmid and
Yang 2008; Moser et al. 2015), and in livestock species and crops, LD may extend for a long
distance (Goddard et al. 2016). Hence, those window selection procedures may separate SNP
markers that should conceptually be grouped in the same window because of high LD between
them. Therefore, additional efforts are required to partition SNPs into windows with less
arbitrary boundaries.
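To make the LD measure discussed above concrete, the following minimal Python sketch computes r2 between two SNPs as the squared Pearson correlation of their 0/1/2 genotype codes. This genotype-based r2 is an assumption made for illustration only (it approximates the haplotype-based r2 and is not necessarily how LD was computed in this dissertation); the function name ld_r2 and the toy genotype vectors are hypothetical.

```python
def ld_r2(snp_a, snp_b):
    """Squared Pearson correlation between two SNP genotype vectors (0/1/2 codes)."""
    n = len(snp_a)
    if n != len(snp_b) or n < 2:
        raise ValueError("genotype vectors must have equal length of at least 2")
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # monomorphic SNP: r2 is undefined
    return cov * cov / (var_a * var_b)


# Toy genotypes on six individuals (hypothetical data).
snp1 = [0, 1, 2, 1, 0, 2]
snp2 = [0, 1, 2, 1, 0, 2]   # identical to snp1: complete LD
snp3 = [2, 1, 0, 1, 2, 0]   # snp1 with the reference allele flipped: still complete LD
snp4 = [1, 0, 1, 2, 1, 1]   # uncorrelated with snp1 in this sample

print(ld_r2(snp1, snp2), ld_r2(snp1, snp3), ld_r2(snp1, snp4))
```

Because r2 is invariant to which allele is counted, snp1 and snp3 are treated as being in complete LD; two such markers would ideally fall in the same window regardless of where a fixed-length boundary happens to lie.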
In large GS programs for animal and plant breeding, an increasingly important problem is that
many if not most individuals to be genetically evaluated do not have genotype information.
Traditionally, genomic evaluation uses deregressed breeding values from pedigree based BLUP
to remove the contribution of relatives that are not related to the study; and then fit WGP models
for genotyped animals in a “two-step” model (VanRaden 2008; Hayes et al. 2009). A single-step
GBLUP (ssGBLUP) approach that combines phenotypes on genotyped and non-genotyped
animals with pedigree information in one regression model (Aguilar et al. 2010) has become
popular for many livestock GS programs (Legarra et al. 2014). Because of extra phenotypic
information that ssGBLUP has combined, many studies have found this procedure to have higher
prediction accuracy than Bayesian models without such information (Lourenco et al. 2013;
Legarra et al. 2014; Vallejo et al. 2016). The ssGBLUP models have also been implemented for
GWA. However, these GWA assessments were not based on formal measures of statistical
significance (Wang et al. 2012; Zhang et al. 2016); therefore, such measures, e.g. P-values, need
to be developed. Recently, Fernando et al. (2014) proposed a framework to implement the
single-step approach in hierarchical Bayesian models that combine information on both
genotyped and non-genotyped individuals. Although studies have shown such models to have
higher WGP accuracies than ssGBLUP in beef cattle for traits controlled by large SNP marker
effects (Lee et al. 2017), neither the GWA performance of Bayesian models that incorporate
non-genotyped animals nor their WGP accuracies in other species have been comprehensively
evaluated.
Currently, several open source software packages are available for WGP or GWA or both,
and most of them are designed for specific models (Endelman 2011; Zhou and Stephens 2012).
The popular BGLR R package (Perez and de los Campos 2014) includes a collection of models
designed for WGP, but it does not focus on GWA features nor does it yet support Bayesian
approaches that incorporate information on non-genotyped animals. It is crucial to have an open
source R package that includes both LMM and Bayesian models for both WGP and GWA,
performs window based GWA, and combines genotype, phenotype and pedigree information on
both genotyped and non-genotyped individuals.
With this in mind, there are four overall objectives in this dissertation to improve the
computational efficiency and accuracy in both WGP and GWA. The first objective is to help
improve the computational efficiency of Bayesian WGP models using EM algorithms and to assess
their ability to estimate hyperparameters and the influence of starting values on WGP accuracies
(Chapter 2). The second objective is to develop a window based approach for traditional LMM
and two Bayesian models, BayesA and SSVS; to examine the potential benefits of using
Bayesian models relative to classical LMM for GWA under a wide range of simulated
architectures; to assess whether the choice of different fixed genomic window sizes versus
window sizes inferred based on LD clustering could impact GWA performance; and to evaluate
the relative merit of EM approaches to MCMC approaches for BayesA and SSVS (Chapter 3).
The third objective is to extend hierarchical Bayesian models and traditional LMM to incorporate
information on non-genotyped animals for both single SNP and window based GWA, to provide
formal statistical significance assessment, and to compare these approaches for both WGP and
GWA (Chapter 4). A capstone objective is to provide all the models and algorithms from the previous
chapters as an efficient and accessible R package (Chapter 5). Finally, I summarize all the
findings throughout the dissertation and provide ideas for future research (Chapter 6).

Chapter 2 An Integrated Approach to Empirical Bayesian Whole Genome Prediction
Modeling
2.1 Abstract
Computational efficiency is an increasing concern for whole genome prediction (WGP) based
on denser genetic marker panels such that algorithms other than Markov Chain Monte Carlo
(MCMC) warrant greater consideration, particularly for hierarchical models that flexibly confer
either heavy-tailed (e.g., BayesA) or stochastic search and variable selection (SSVS) instead of
Gaussian specifications on marker effect distributions. The expectation maximization (EM)
algorithm is one attractive alternative; however, recently proposed hierarchical model
implementations of EM have not addressed formal estimation of underlying hyperparameters
even though their specifications are known to impact WGP accuracy. Furthermore, EM can be
sensitive to starting values. I develop and explore the properties of an empirical Bayes strategy
by conditioning EM implementations of BayesA or SSVS WGP models on marginal modal
estimation of variance components and other key hyperparameters. These empirical Bayes
implementations are compared against their MCMC counterparts for estimation of
hyperparameters and WGP accuracy, both within the context of a simulation study and
application to a loblolly pine dataset. In all cases, starting values were deemed to be important
for EM-based estimates. Starting values based on MCMC posterior means were preferable
whereas those based on setting all marker effects equal to zero generally led to inferior
performance. Nevertheless, a recently proposed regularization procedure was useful in
alleviating the impact of starting values in the EM implementation of the SSVS model, as was
modifying the expectation step in the BayesA model to be based on relative variances rather than
on relative precisions.
2.2 Introduction
Recent developments in genotyping technology have made dense single nucleotide
polymorphism (SNP) genotype marker panels available for many livestock species. Typically,
some of these SNP chips have up to hundreds of thousands or more markers. As these numbers
continue to increase with emerging sequencing technologies (Daetwyler et al. 2014), it is
important to develop statistically and computationally efficient methods that best use these
genotypes to predict genomic estimated breeding values (GEBV) on economically important
traits in whole genome prediction (WGP) models. These models are based on the premise that
for a sufficient number of SNPs, at least some of them should be in close linkage disequilibrium
(LD) with quantitative trait loci (QTL) for economically important traits (Meuwissen et al.
2001).
WGP typically represents an m ≫ n problem, whereby the number (m) of markers used for
prediction is much larger than the number (n) of animals having phenotypes. Two broad
parametric categories for WGP models are genomic or ridge regression best linear unbiased
prediction (RRBLUP) and various hierarchical model extensions of RRBLUP, often satirically
referred to as “Bayesian alphabet” models (Gianola 2013). RRBLUP is based on classical linear
mixed model analyses whereby a common variance component is specified for each random
SNP effect. Variance components can be readily estimated using Restricted Maximum
Likelihood (REML) followed by BLUP of SNP effects conditional on these REML estimates.
This two-stage approach has been characterized as empirical Bayes (EB) (Casella 1985) and so I
might refer to corresponding genomic predictions as empirical RRBLUP (e-RRBLUP). On the
other hand, fully Bayesian inference in hierarchical WGP models is typically conducted using
Markov chain Monte Carlo (MCMC) techniques. In a seminal paper, Meuwissen et al. (2001)
proposed BayesA which hierarchically extends RRBLUP by specifying every SNP effect as
scaled Student t distributed. Since then, other hierarchical variations on heavy-tailed or variable
selection specifications on the SNP effects have been proposed. Such extensions often lead to
higher WGP accuracies compared to RRBLUP analyses, particularly when n is large relative to
the number of QTL and n/m is not too small (Wimmer et al. 2013).
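As a small illustration of RRBLUP with a common variance component, the sketch below treats the variance components as known and ignores fixed effects (both simplifying assumptions, with y taken as already centered). Under those assumptions BLUP of the SNP effects is a ridge solution with penalty lam = sigma_e^2 / sigma_g^2, and it can be obtained either from an m x m system on markers or from an equivalent n x n system on individuals, which is the cheaper route when m ≫ n. All function names and the toy data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def rrblup_marker_form(Z, y, lam):
    """g_hat = (Z'Z + lam * I_m)^{-1} Z'y, an m x m system."""
    n, m = len(Z), len(Z[0])
    lhs = [[sum(Z[i][j] * Z[i][k] for i in range(n)) + (lam if j == k else 0.0)
            for k in range(m)] for j in range(m)]
    rhs = [sum(Z[i][j] * y[i] for i in range(n)) for j in range(m)]
    return solve(lhs, rhs)


def rrblup_individual_form(Z, y, lam):
    """g_hat = Z'(ZZ' + lam * I_n)^{-1} y, an n x n system."""
    n, m = len(Z), len(Z[0])
    lhs = [[sum(Z[i][k] * Z[l][k] for k in range(m)) + (lam if i == l else 0.0)
            for l in range(n)] for i in range(n)]
    alpha = solve(lhs, y)
    return [sum(Z[i][j] * alpha[i] for i in range(n)) for j in range(m)]


# Toy example with n = 3 individuals and m = 5 column-centered markers.
Z = [[1.0, -1.0, 0.0, 1.0, 0.0],
     [-1.0, 0.0, 1.0, 0.0, 1.0],
     [0.0, 1.0, -1.0, -1.0, -1.0]]
y = [1.2, -0.4, -0.8]
lam = 2.0  # assumed ratio sigma_e^2 / sigma_g^2

g_marker = rrblup_marker_form(Z, y, lam)
g_individual = rrblup_individual_form(Z, y, lam)
print(g_marker)
print(g_individual)
```

Both forms return the same SNP effect estimates; in practice the variance ratio would come from REML rather than being fixed in advance.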
MCMC analyses are computationally expensive, especially for WGP models with large m.
Expectation-Maximization (EM) algorithms have been developed to partly address
computational limitations in hierarchical Bayes WGP (e.g., Meuwissen et al. 2009; Hayashi and
Iwata 2010; Shepherd et al. 2010; Sun et al. 2012). These EM implementations have often been
shown to lead to WGP accuracies comparable to their MCMC based counterparts but at a
fraction of the computational cost.
A typically neglected issue in hierarchical WGP modeling is proper tuning/inference of
hyperparameters as subsequent estimates on SNP effects have been observed to be sensitive to
such specifications (Gianola et al. 2009). These hyperparameters are often arbitrarily specified
or sometimes based on heritability-based rules (de Los Campos et al. 2013) that have been
shown to be suboptimal (Lehermeier et al. 2013). Although formal Bayesian inference on
hyperparameters using MCMC is possible (Yi and Xu 2008; de Los Campos et al. 2013; Yang et
al. 2015b), there has been far less discussion on how to estimate hyperparameters in conjunction
with EM-based implementations of hierarchical WGP models; in fact, it has been deemed to be
nearly impossible (Karkkainen and Sillanpaa 2012). Furthermore, it has been implied that
starting values are not important in these implementations with zero being a particularly popular
starting value choice for SNP effects (e.g., Meuwissen et al. 2009; Shepherd et al. 2010;
Karkkainen and Sillanpaa 2012). However, Gianola (2013) has warned that joint posterior
modal estimates may iteratively get trapped at local modes such that the specification of different
starting values for SNP effects, never mind hyperparameter specifications, could have different
implications for WGP accuracy. To partially address that concern in a variable selection model
known as stochastic search and variable selection (SSVS) first introduced by George and
McCulloch (1993), I investigate strategies proposed by Rockova and George (2014) to alleviate
the influence of starting values for determining high posterior probability modes in an EM-based
implementation of SSVS.
My objectives were then to demonstrate an empirical Bayes (EB) strategy to estimate key
hyperparameters as well as to investigate the effects of different starting values on SNP effect
estimates and genomic merit in two widely used hierarchical WGP models, BayesA and SSVS.
In Section 2, I outline this EB strategy for both BayesA and SSVS and describe a simulation
study used to assess the effect of starting values on accuracy of genomic merit prediction for
both models, using MCMC as a positive control method. I also demonstrate application of these
same procedures to a loblolly pine dataset (Resende et al. 2012) illustrating the use of the
regularization procedures proposed by Rockova and George (2014) for SSVS. Results are
described in Section 3 with a concluding discussion provided in Section 4.
2.3 Materials and methods
2.3.1 The first stage linear WGP model
The first stage of a WGP linear mixed model is typically specified as follows:

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \mathbf{z}_i'\mathbf{g} + e_i, \quad i = 1, 2, \ldots, n, \qquad [2.1]$$

where $y_i$ denotes the phenotype on individual $i$ connected to fixed effects $\boldsymbol{\beta}$ via a known
incidence row vector $\mathbf{x}_i'$ and connected to SNP effects $\mathbf{g} = \{g_j\}_{j=1}^{m}$ via known genotypes
$\mathbf{z}_i' = [z_{i1}\; z_{i2}\; z_{i3}\; \cdots\; z_{im}]$ on $m$ SNP markers. Furthermore, I specify $e_i \overset{iid}{\sim} N(0, \sigma_e^2)$. Now
although the SNP covariates are typically provided with values of $z_{ij}^* = 0$, 1, or 2 (i.e., number of
copies of a reference allele at each SNP), they are often recoded in a number of different ways
using, for example, just centering, i.e., $z_{ij} = z_{ij}^* - 2\hat{p}_j$, or centering and scaling, i.e.,
$z_{ij} = \dfrac{z_{ij}^* - 2\hat{p}_j}{\sqrt{2\hat{p}_j(1 - \hat{p}_j)}}$ (de Los Campos et al. 2013), where $\hat{p}_j = \dfrac{1}{2n}\sum_{i=1}^{n} z_{ij}^*$ denotes the estimated
frequency of the reference allele for SNP $j = 1, 2, \ldots, m$. Recoding genotypes in this manner has
been demonstrated to improve algorithmic stability (Stranden and Christensen 2011).
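The recoding just described can be sketched in a few lines of Python. The block assumes complete genotype data coded as 0/1/2 copies of the reference allele, and it assumes that scaling divides by the square root of 2p(1 - p), the usual genotype standard deviation under Hardy-Weinberg proportions; the function name recode_genotypes and the toy matrix are hypothetical.

```python
import math


def recode_genotypes(z_star, scale=False):
    """Center (and optionally scale) an n x m matrix of 0/1/2 allele counts.

    Returns the recoded matrix and the estimated reference allele frequencies.
    """
    n = len(z_star)
    m = len(z_star[0])
    # p_hat[j] = (1 / 2n) * sum_i z*_ij
    p_hat = [sum(row[j] for row in z_star) / (2.0 * n) for j in range(m)]
    recoded = []
    for row in z_star:
        new_row = []
        for j, z in enumerate(row):
            value = z - 2.0 * p_hat[j]          # centering
            if scale:
                sd = math.sqrt(2.0 * p_hat[j] * (1.0 - p_hat[j]))
                if sd == 0.0:
                    raise ValueError("monomorphic SNP cannot be scaled")
                value /= sd                     # scaling
            new_row.append(value)
        recoded.append(new_row)
    return recoded, p_hat


# Toy data: 4 individuals, 2 SNPs (hypothetical).
z_star = [[0, 2], [1, 1], [2, 1], [1, 0]]
centered, p_hat = recode_genotypes(z_star)
scaled, _ = recode_genotypes(z_star, scale=True)
print(p_hat)
print(centered)
print(scaled)
```

After centering, every SNP column sums to zero, so the SNP covariates are orthogonal to an overall intercept in the fixed effects.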
2.3.2 BayesA EM
In BayesA, the following hierarchical priors are typically specified:
$$g_j \sim N(0, \sigma_g^2 \tau_j) \qquad [2.2]$$

having density $p(g_j \mid \sigma_g^2, \tau_j)$ and

$$\tau_j \sim \chi^{-2}(\nu_g, \nu_g) \qquad [2.3]$$

having density $p(\tau_j \mid \nu_g)$, where $\chi^{-2}(\nu_g, \nu_g)$ is used to denote a scaled inverted chi-square distribution with
degrees of freedom and scale parameters both being $\nu_g$ ($\nu_g > 0$) such that $\tau_j = 1$ then defines a
typical value falling between the prior mean $\left(\dfrac{\nu_g}{\nu_g - 2};\ \nu_g > 2\right)$ and prior mode $\left(\dfrac{\nu_g}{\nu_g + 2}\right)$. This
parameterization slightly differs from, but is marginally equivalent to, that provided in the
seminal paper by Meuwissen et al. (2001) who directly specify prior distributions on
$\sigma_{g_j}^2 \sim \sigma_g^2 \tau_j$. That is, marginally, $g_j \mid \nu_g, \sigma_g^2 \sim \int_{\tau_j} N(0, \sigma_g^2 \tau_j)\, \chi^{-2}(\nu_g, \nu_g)\, d\tau_j = t_{\nu_g}(0, \sigma_g^2)$, i.e. a
scaled Student t distribution with degrees of freedom $\nu_g$ and scale parameter $\sigma_g^2$.
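The scale-mixture construction above can be checked numerically. The following Python sketch assumes that a scaled inverted chi-square variate with degrees of freedom and scale both equal to nu_g can be drawn as nu_g divided by a chi-square variate with nu_g degrees of freedom (which gives the prior mean and mode stated above); the function names, the seed, and the choice nu_g = 10 are hypothetical.

```python
import random


def tau_prior_mean(nu):
    """Prior mean of tau_j, defined only for nu > 2."""
    if nu <= 2:
        raise ValueError("prior mean requires nu > 2")
    return nu / (nu - 2.0)


def tau_prior_mode(nu):
    """Prior mode of tau_j."""
    return nu / (nu + 2.0)


def draw_tau(nu, rng):
    """Draw tau = nu / X with X ~ chi-square(nu), i.e. Gamma(shape=nu/2, scale=2)."""
    return nu / rng.gammavariate(nu / 2.0, 2.0)


def draw_marker_effect(nu, sigma2_g, rng):
    """Draw g_j | tau_j ~ N(0, sigma2_g * tau_j); marginally a scaled Student t."""
    tau = draw_tau(nu, rng)
    return rng.gauss(0.0, (sigma2_g * tau) ** 0.5)


rng = random.Random(2017)
nu_g = 10.0
taus = [draw_tau(nu_g, rng) for _ in range(100000)]
mc_mean = sum(taus) / len(taus)

print(tau_prior_mode(nu_g), tau_prior_mean(nu_g), mc_mean)
example_effect = draw_marker_effect(nu_g, sigma2_g=0.04, rng=rng)
```

Smaller values of nu_g widen the gap between the mode and the mean around the typical value of 1, which corresponds to heavier tails for the marginal distribution of the SNP effects.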

Given these specifications, the joint posterior density for BayesA can be derived as follows:

$$p(\boldsymbol{\beta}, \mathbf{g}, \boldsymbol{\tau}, \sigma_e^2, \sigma_g^2, \nu_g \mid \mathbf{y}) \propto \left(\prod_{i=1}^{n} p(y_i \mid \boldsymbol{\beta}, \mathbf{g}, \sigma_e^2)\right) \left(\prod_{j=1}^{m} p(g_j \mid \sigma_g^2, \tau_j)\, p(\tau_j \mid \nu_g)\right) p(\boldsymbol{\beta})\, p(\sigma_g^2)\, p(\sigma_e^2)\, p(\nu_g), \qquad [2.4]$$

where $\mathbf{y} = \{y_i\}_{i=1}^{n}$ and $\boldsymbol{\tau} = \{\tau_j\}_{j=1}^{m}$. Also Equation [2.4] is based on the product of arbitrarily
and independently specified priors $p(\boldsymbol{\beta})$, $p(\sigma_g^2)$, $p(\sigma_e^2)$, and $p(\nu_g)$ on $\boldsymbol{\beta}$, $\sigma_g^2$, $\sigma_e^2$, and $\nu_g$,
respectively. I will assume $p(\boldsymbol{\beta}) \propto 1$ in this paper as $p(\boldsymbol{\beta})$ is typically diffuse, although
extensions to more informative specifications should be obvious.
The EM algorithm is based on computing the expectation of a log likelihood and/or a log joint
posterior density with respect to augmented variables (E-step) followed by maximization of this
expectation with respect to the remaining unknown parameters (M-step). Let us momentarily
assume that each of $p(\sigma_g^2)$, $p(\sigma_e^2)$, and $p(\nu_g)$ is a point mass on a known value for $\sigma_g^2$, $\sigma_e^2$,
and $\nu_g$, respectively, such that Equation [2.4] can be re-expressed as $p(\boldsymbol{\beta}, \mathbf{g}, \boldsymbol{\tau} \mid \sigma_e^2, \sigma_g^2, \nu_g, \mathbf{y})$ to
reflect this conditioning. Taking its logarithm, the corresponding “augmented” conditional log
density (LA) for $\boldsymbol{\beta}$, $\mathbf{g}$ and the augmented variables $\boldsymbol{\tau}$, recognizing that $\log p(\boldsymbol{\beta}) = \text{constant}$, is as
follows:

$$LA = \log p(\boldsymbol{\beta}, \mathbf{g}, \boldsymbol{\tau} \mid \sigma_e^2, \sigma_g^2, \nu_g, \mathbf{y}) = \sum_{i=1}^{n} \log p(y_i \mid \boldsymbol{\beta}, \mathbf{g}, \sigma_e^2) + \sum_{j=1}^{m} \left(\log p(g_j \mid \sigma_g^2, \tau_j) + \log p(\tau_j \mid \nu_g)\right) + \text{constant}. \qquad [2.5]$$

Following Sun et al. (2012), the E-step for $\boldsymbol\tau$ involves evaluating the expectation of Equation [2.5] with respect to $\boldsymbol\tau$. The terms in Equation [2.5] that involve only $\boldsymbol\tau$, drawn from the corresponding components that derive from the logarithms of Equations [2.2] and [2.3], require evaluations of $E_{\tau_j|\cdot}(1/\tau_j)$ and $E_{\tau_j|\cdot}(\log(\tau_j))$. Here $\tau_j|\cdot$ denotes that the expectation is conditional on all other terms and $\mathbf{y}$, i.e.,

$$E_{\tau_j|\cdot}\left(\sum_{j=1}^{m} \left(\log p(g_j \mid \sigma_g^2, \tau_j) + \log p(\tau_j \mid \nu_g)\right)\right) = \sum_{j=1}^{m} \left(-\frac{1}{2}\frac{g_j^2}{\sigma_g^2}\, E_{\tau_j|\cdot}\!\left(\frac{1}{\tau_j}\right) - \left(\frac{\nu_g}{2} + 1\right) E_{\tau_j|\cdot}\left(\log(\tau_j)\right) - \frac{\nu_g}{2}\, E_{\tau_j|\cdot}\!\left(\frac{1}{\tau_j}\right)\right) + \text{constant}. \qquad [2.6]$$

Now, as previously provided by Sun et al. (2012),

$$(\hat\tau_j)^{-1} = E_{\tau_j|\cdot}\!\left(\frac{1}{\tau_j}\right) = \frac{\nu_g + 1}{\dfrac{\hat g_j^2}{\sigma_g^2} + \nu_g} \qquad [2.7]$$

where $\hat g_j$ is $g_j$ evaluated at the M-step (see later) whereas

$$E_{\tau_j|\cdot}\left(\log(\tau_j)\right) = \log\left(0.5\left(\frac{\hat g_j^2}{\sigma_g^2} + \nu_g\right)\right) - \psi\!\left(\frac{\nu_g + 1}{2}\right) \qquad [2.8]$$

where $\psi(\cdot)$ denotes the digamma function, i.e., $\psi(x) = \partial \log \Gamma(x) / \partial x$ for $\Gamma(\cdot)$ being the gamma function. Note that evaluating Equation [2.8] is only required if estimation of $\nu_g$ is desired.
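The two E-step expectations in Equations [2.7] and [2.8] can be sketched numerically as follows. This is an illustrative sketch only, not the software used in this dissertation: the function names are hypothetical, and the `digamma` helper is a generic recurrence-plus-asymptotic-series approximation introduced here because the Python standard library does not provide one.

```python
import math

def digamma(x):
    """Digamma for x > 0: recurrence psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + series

def e_inv_tau(g_hat, sigma2_g, nu_g):
    """Equation [2.7]: E(1/tau_j | .), the relative precision weight."""
    return (nu_g + 1.0) / (g_hat ** 2 / sigma2_g + nu_g)

def e_log_tau(g_hat, sigma2_g, nu_g):
    """Equation [2.8]: E(log tau_j | .), needed only when nu_g is estimated."""
    return math.log(0.5 * (g_hat ** 2 / sigma2_g + nu_g)) - digamma(0.5 * (nu_g + 1.0))

# Illustrative values only: g_hat = 2, sigma2_g = 1, nu_g = 5.
precision_weight = e_inv_tau(2.0, 1.0, 5.0)   # (5 + 1) / (4 + 5)
log_term = e_log_tau(2.0, 1.0, 5.0)
```

A larger estimated effect gives a smaller relative precision weight, and hence less shrinkage of that marker in the subsequent M-step.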
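Subsequently, the joint posterior modes, $\hat{\boldsymbol\beta}$ and $\hat{\mathbf{g}}$, respectively of $\boldsymbol\beta$ and $\mathbf{g}$ are evaluated in the maximization or M-step. This involves solving Henderson's mixed model equations (Sun et al. 2012), i.e.,

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \dfrac{\sigma_e^2}{\sigma_g^2}\hat{\mathbf{D}}^{-1} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol\beta} \\ \hat{\mathbf{g}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix} \qquad [2.9]$$

where $\mathbf{X} = [\mathbf{x}_1\ \mathbf{x}_2\ \cdots\ \mathbf{x}_n]'$, $\mathbf{Z} = [\mathbf{z}_1\ \mathbf{z}_2\ \cdots\ \mathbf{z}_n]'$ and $\hat{\mathbf{D}}^{-1} = \text{diag}\{(\hat\tau_j)^{-1}\}$.

Upon convergence of the E-steps and M-steps, one attains the joint posterior mode of $\boldsymbol\beta$ and $\mathbf{g}$ conditional on $\sigma_g^2$, $\sigma_e^2$, and $\nu_g$.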
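As a toy illustration of the M-step, the mixed model equations in [2.9] can be assembled and solved directly for a very small problem. The sketch below is an assumption-laden example rather than the implementation used here: the dense Gaussian-elimination solver, the function names, and the four-record data set are all hypothetical, and a real analysis with thousands of markers would call for a sparse or iterative solver.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def m_step(X, Z, y, d_inv, sigma2_e, sigma2_g):
    """Build and solve Equation [2.9]; d_inv holds the diagonal of D-hat inverse."""
    W = [xr + zr for xr, zr in zip(X, Z)]          # rows of [X Z]
    p, q = len(X[0]), len(Z[0])
    k = p + q
    lhs = [[sum(w[a] * w[b] for w in W) for b in range(k)] for a in range(k)]
    lam = sigma2_e / sigma2_g                      # variance ratio
    for j in range(q):
        lhs[p + j][p + j] += lam * d_inv[j]        # add (sigma2_e / sigma2_g) * D^-1
    rhs = [sum(w[a] * yi for w, yi in zip(W, y)) for a in range(k)]
    sol = solve(lhs, rhs)
    return sol[:p], sol[p:]

# Hypothetical toy data: intercept only in X, two markers in Z.
X = [[1.0], [1.0], [1.0], [1.0]]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
y = [2.0, 1.0, 3.5, 0.5]
beta_hat, g_hat = m_step(X, Z, y, [1.0, 1.0], 1.0, 1.0)
```

Alternating this solve with the E-step update of `d_inv` until the estimates stabilize gives the conditional joint posterior mode described above.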

As a deviation from this particular EM implementation, Karkkainen and Sillanpaa (2012) proposed that the E-step in a BayesA model be based on evaluating $E_{\sigma_{g_j}^2|\cdot}(\sigma_{g_j}^2) = E_{\tau_j|\cdot}(\sigma_g^2 \tau_j)$ or, equivalently, of

$$\hat\tau_j^{*} = E_{\tau_j|\cdot}(\tau_j) = \frac{\dfrac{\hat g_j^2}{\sigma_g^2} + \nu_g}{\nu_g - 1}$$

(i.e. relative variances) instead of $E_{\tau_j|\cdot}(1/\tau_j)$ (i.e., relative precisions) as in Equation [2.7] even though the latter is implicitly required in a formal E-step for the EM algorithm as per Equation [2.6]. Note that $\hat\tau_j$ and $\hat\tau_j^{*}$ are subtly different, having the same numerators but different denominators. Typically, expectations of augmented variables in EM implementations are taken with respect to their functional forms in augmented log likelihoods or log joint posterior densities (i.e., using $\hat\tau_j$) whereas Karkkainen and Sillanpaa (2012) substitute $\hat\tau_j^{*}$ for $\hat\tau_j$ for the E-step in $\mathbf{D}$ of Equation [2.9]. This warranted further investigation on our part.
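The difference between the two E-step choices can be made concrete with a small numerical sketch. The function names are hypothetical and the inputs are arbitrary illustrative values; the formulas are those of Equation [2.7] and of the relative-variance alternative of Karkkainen and Sillanpaa (2012).

```python
def tau_hat_precision(g_hat, sigma2_g, nu_g):
    """tau-hat implied by Equation [2.7]: the reciprocal of E(1/tau_j | .)."""
    return (g_hat ** 2 / sigma2_g + nu_g) / (nu_g + 1.0)

def tau_hat_variance(g_hat, sigma2_g, nu_g):
    """tau-hat-star: E(tau_j | .), which requires nu_g > 1."""
    return (g_hat ** 2 / sigma2_g + nu_g) / (nu_g - 1.0)

# Same numerator, different denominators (nu_g + 1 versus nu_g - 1).
p_weight = tau_hat_precision(0.5, 1.0, 5.0)
v_weight = tau_hat_variance(0.5, 1.0, 5.0)
ratio = v_weight / p_weight   # (nu_g + 1) / (nu_g - 1), free of g_hat
```

Because the ratio does not depend on the marker effect, the relative-variance version inflates every diagonal element of D by the same factor, which is 1.5 when the degrees of freedom equal 5.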
2.3.3 SSVS EM
I adapt the developments in the previous section by modifying the prior on $\tau_j$ to facilitate variable selection using the SSVS specification first introduced by George and McCulloch (1993) and recently revisited by Rockova and George (2014), i.e.

$$g_j \sim N\left(0,\ \sigma_g^2 \tau_j + \frac{(1 - \tau_j)\,\sigma_g^2}{c}\right) \qquad [2.10]$$

with density $p(g_j \mid \sigma_g^2, \tau_j)$ and whereby $c \gg 1$ such that Equation [2.10] represents a mixture distribution on a "slab" component for effectively non-zero $g_j$, characterized by variance $\sigma_g^2$, and a "spike" component with variance $\sigma_g^2 / c$, the latter thereby absorbing negligible or near-zero $g_j$. Furthermore,

$$\tau_j \sim \text{Bernoulli}(\pi); \quad \tau_j = 0, 1 \qquad [2.11]$$

is the Bernoulli distribution with density $p(\tau_j \mid \pi)$. The joint posterior density for SSVS can then be written as follows:

$$p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau, \sigma_e^2, \sigma_g^2, \pi \mid \mathbf{y}) \propto \left(\prod_{i=1}^{n} p(y_i \mid \boldsymbol\beta, \mathbf{g}, \sigma_e^2)\right)\left(\prod_{j=1}^{m} p(g_j \mid \sigma_g^2, \tau_j)\, p(\tau_j \mid \pi)\right) p(\boldsymbol\beta)\, p(\sigma_g^2)\, p(\sigma_e^2)\, p(\pi). \qquad [2.12]$$

Note that the components $\prod_{i=1}^{n} p(y_i \mid \boldsymbol\beta, \mathbf{g}, \sigma_e^2)$, $p(\boldsymbol\beta)$, $p(\sigma_g^2)$, and $p(\sigma_e^2)$ are defined similarly as before with BayesA except with $p(g_j \mid \sigma_g^2, \tau_j)$ and $p(\tau_j \mid \pi)$ defined as in Equations [2.10] and [2.11], respectively, and $p(\pi)$, as the prior for $\pi$, substituted for $p(\nu_g)$. Let's momentarily assume that the key hyperparameters $\sigma_g^2$, $\sigma_e^2$, and $\pi$ are specified to be known, similar to what I did earlier with BayesA. Then again, with $p(\boldsymbol\beta) \propto 1$, Equation [2.12] can be further condensed to reflect this conditioning:

$$p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau \mid \sigma_e^2, \sigma_g^2, \pi, \mathbf{y}) \propto \prod_{i=1}^{n} p(y_i \mid \boldsymbol\beta, \mathbf{g}, \sigma_e^2) \prod_{j=1}^{m} p(g_j \mid \sigma_g^2, \tau_j)\, p(\tau_j \mid \pi) \qquad [2.13]$$

Taking its logarithm, I denote the corresponding log augmented SSVS joint posterior density as $L_S$ used for inferring $\boldsymbol\beta$ and $\mathbf{g}$ with the augmented variables again being $\boldsymbol\tau$, i.e.,

$$L_S = \log p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau \mid \sigma_e^2, \sigma_g^2, \pi, \mathbf{y}) = \sum_{i=1}^{n} \log p(y_i \mid \boldsymbol\beta, \mathbf{g}, \sigma_e^2) + \sum_{j=1}^{m} \left(\log p(g_j \mid \sigma_g^2, \tau_j) + \log p(\tau_j \mid \pi)\right) \qquad [2.14]$$

Taking only terms that involve $\boldsymbol\tau$ in Equation [2.14], based on the components contributed by Equations [2.10] and [2.11], the E-step requires an evaluation of $E_{\tau_j|\cdot}(\tau_j)$, i.e.,

$$\begin{aligned} E_{\tau_j|\cdot}&\left(\sum_{j=1}^{m} \left(\log p(g_j \mid \sigma_g^2, \tau_j) + \log p(\tau_j \mid \pi)\right)\right) \\ &= \sum_{j=1}^{m} \left(E_{\tau_j|\cdot}\!\left(-\frac{1}{2}\,\frac{g_j^2}{\sigma_g^2\left(\tau_j + \dfrac{1 - \tau_j}{c}\right)}\right) + E_{\tau_j|\cdot}(\tau_j)\log(\pi) + E_{\tau_j|\cdot}(1 - \tau_j)\log(1 - \pi)\right) + \text{constant} \\ &= \sum_{j=1}^{m} \left(-\frac{1}{2}\,\sigma_g^{-2} g_j^2 \left(E_{\tau_j|\cdot}(\tau_j) + \left(1 - E_{\tau_j|\cdot}(\tau_j)\right) c\right) + E_{\tau_j|\cdot}(\tau_j)\log(\pi) + \left(1 - E_{\tau_j|\cdot}(\tau_j)\right)\log(1 - \pi)\right) + \text{constant}. \end{aligned} \qquad [2.15]$$

as also provided by Rockova and George (2014). They further demonstrate that

$$E_{\tau_j|\cdot}(\tau_j) = \hat\tau_j^{**} = \frac{\phi_{\hat g_j}(0, \sigma_g^2)\,\pi}{\phi_{\hat g_j}(0, \sigma_g^2)\,\pi + \phi_{\hat g_j}\!\left(0, \dfrac{\sigma_g^2}{c}\right)(1 - \pi)}. \qquad [2.16]$$

Here, $\phi_x(\mu, \sigma^2)$ denotes, for example, the ordinate of a Gaussian pdf with mean $\mu$ and variance $\sigma^2$ evaluated at $x$, noting that $\hat g_j$ in Equation [2.16] denotes the M-step estimate of $g_j$ for the current iterate. The M-step for providing the joint posterior mode of $\boldsymbol\beta$ and $\mathbf{g}$ is based on the use of the same set of mixed model equations provided in Equation [2.9], except that $\hat{\mathbf{D}}^{-1} = \text{diag}\left(\hat\tau_j^{**} + c\,(1 - \hat\tau_j^{**})\right)$.

For all data analyses in this paper, our default value for c was 1000.
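The SSVS E-step of Equation [2.16] and the resulting diagonal weights for the mixed model equations can be sketched as below. This is a minimal illustration with hypothetical function names and arbitrary inputs, not the code used for the analyses; it evaluates the Gaussian ordinates directly, so extremely large standardized effects could underflow in practice.

```python
import math

def normal_pdf(x, mean, var):
    """Ordinate of a Gaussian pdf with the given mean and variance at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_tau_ssvs(g_hat, sigma2_g, pi, c=1000.0):
    """Equation [2.16]: conditional probability that marker j belongs to the slab."""
    slab = normal_pdf(g_hat, 0.0, sigma2_g) * pi
    spike = normal_pdf(g_hat, 0.0, sigma2_g / c) * (1.0 - pi)
    return slab / (slab + spike)

def d_inv_ssvs(tau_hat, c=1000.0):
    """Diagonal element of D-hat inverse used in the SSVS M-step."""
    return tau_hat + c * (1.0 - tau_hat)

# Illustrative values: a null effect and an effect of one slab standard deviation.
tau_null = e_tau_ssvs(0.0, 1.0, 0.1)
tau_large = e_tau_ssvs(1.0, 1.0, 0.1)
weights = [d_inv_ssvs(tau_null), d_inv_ssvs(tau_large)]
```

Markers assigned to the spike receive a weight near c and are shrunk heavily, whereas markers assigned to the slab receive a weight near 1, as in RRBLUP.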
2.3.4 Hyperparameter estimation
2.3.4.1 Variance component estimation
The EM strategy for inferring upon $\boldsymbol\beta$ and $\mathbf{g}$ as outlined for BayesA and SSVS above is computationally similar to classical mixed model inference, holding constant hyperparameters such as variance components. In fact, BayesA defaults to RRBLUP of $\mathbf{g}$ as $\nu_g \to \infty$ whereas SSVS defaults to RRBLUP of $\mathbf{g}$ with $\pi = 1$ or $c = 1$, noting that genomic predictions or GBLUP of individual genetic merit, $\mathbf{u} = \mathbf{Zg}$, is simply Z(RRBLUP(g)). It should be quickly noted that the term GBLUP is typically reserved for the equivalent mixed effects model whereby one directly solves for $\mathbf{u}$ rather than for $\mathbf{g}$ in order to facilitate computational tractability since generally n << m (Stranden and Garrick 2009); nevertheless, I take the liberty of referring to GBLUP(u) as being a linear function of RRBLUP(g).
I represent the vector of hyperparameters as $\boldsymbol\theta = (\sigma_e^2, \sigma_g^2, \nu_g)$ for BayesA and $\boldsymbol\theta = (\sigma_e^2, \sigma_g^2, \pi)$ for SSVS. I partition $\boldsymbol\theta$ into the variance components $\boldsymbol\sigma = (\sigma_e^2, \sigma_g^2)$ and remaining hyperparameters as $\boldsymbol\theta_{-\sigma}$ such that, for example, $\boldsymbol\theta_{-\sigma} = \nu_g$ in BayesA whereas $\boldsymbol\theta_{-\sigma} = \pi$ in SSVS. Prior to the advent of MCMC, a convincing justification for EB in quantitative genetics was provided by Gianola et al. (1986) who recommended conditioning RRBLUP of $\mathbf{g}$ on REML estimates of $\boldsymbol\sigma$, recognizing that REML($\boldsymbol\sigma$) is equivalent to maximizing the marginal density $p(\boldsymbol\sigma \mid \mathbf{y})$ based on a flat prior for $\boldsymbol\sigma$, i.e., $p(\boldsymbol\sigma) \propto 1$ (Harville 1974). The resulting e-GBLUP estimates (i.e., Z(e-RRBLUP(g)) conditional on REML($\boldsymbol\sigma$)) are typically not practically different from posterior means on $\mathbf{Zg}$ allowing for uncertainty on $\boldsymbol\sigma$, provided that $p(\boldsymbol\sigma \mid \mathbf{y})$ is reasonably symmetric.
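Now prior uncertainty on the augmented variables $\boldsymbol\tau$ for both BayesA and SSVS is driven entirely by $\boldsymbol\theta_{-\sigma}$ given that its prior distribution is written simply as $p(\boldsymbol\tau \mid \boldsymbol\theta_{-\sigma})$, i.e., $\prod_{j=1}^{m} p(\tau_j \mid \nu_g)$ for BayesA and $\prod_{j=1}^{m} p(\tau_j \mid \pi)$ for SSVS. In the previous section, I maximized $p(\boldsymbol\beta, \mathbf{g} \mid \boldsymbol\theta, \mathbf{y})$ with respect to $\boldsymbol\beta$ and $\mathbf{g}$ by first evaluating $E_{\tau_j|\cdot}\left(\log p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau \mid \boldsymbol\theta, \mathbf{y})\right)$ in an E-step followed by maximizing this subsequent expectation with respect to $\boldsymbol\beta$ and $\mathbf{g}$ in a M-step for both of these models. With $\boldsymbol\sigma$ unknown, I start with:

$$p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau, \boldsymbol\sigma \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}) \propto p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau \mid \boldsymbol\sigma, \boldsymbol\theta_{-\sigma}, \mathbf{y})\, p(\boldsymbol\sigma) \qquad [2.17]$$

where $p(\boldsymbol\sigma) = p(\sigma_e^2)\, p(\sigma_g^2)$ such that the marginal posterior density of augmented variables and variance components can be expressed as:

$$p(\boldsymbol\tau, \boldsymbol\sigma \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}) \propto \int_{\boldsymbol\beta} \int_{\mathbf{g}} p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau \mid \boldsymbol\sigma, \boldsymbol\theta_{-\sigma}, \mathbf{y})\, p(\boldsymbol\sigma)\, d\mathbf{g}\, d\boldsymbol\beta \qquad [2.18]$$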

A "REML-like" strategy for inferring $\boldsymbol\sigma$ in BayesA or SSVS within an EM framework would then be to respectively evaluate the conditional expectation of $L_A$ (Equation [2.5]) or $L_S$ (Equation [2.14]) with respect to $\boldsymbol\tau$ as noted earlier, use its antilog to substitute for $p(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau \mid \boldsymbol\sigma, \boldsymbol\theta_{-\sigma}, \mathbf{y})$ in Equation [2.18] and maximize the resulting expression with respect to $\boldsymbol\sigma$ in order to evaluate the joint posterior mode of $p(\boldsymbol\sigma \mid \boldsymbol\theta_{-\sigma}, \mathbf{y})$.

Recall again that with $p(\boldsymbol\sigma) \propto 1$ and the special conditions $\nu_g \to \infty$ for BayesA and $\pi = 1$ or $c = 1$ for SSVS, this strategy defaults to classical REML. The classical log REML function (Searle et al. 1992) can be written as follows:

$$l(\boldsymbol\sigma \mid \mathbf{y}) = -0.5 \log|\mathbf{V}| - 0.5 \log|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}| - 0.5\, \mathbf{y}'\mathbf{P}\mathbf{y} \qquad [2.19]$$

with $\mathbf{V} = \mathbf{ZDZ}'\sigma_g^2 + \mathbf{I}\sigma_e^2$ and $\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-}\mathbf{X}'\mathbf{V}^{-1}$. In typical classical REML specifications involving uncorrelated random effects, $\mathbf{D} = \mathbf{I}$. I modify this expression for our BayesA and SSVS adaptations accordingly as:

$$l(\boldsymbol\tau, \boldsymbol\sigma \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}) = \log\left(p(\boldsymbol\tau, \boldsymbol\sigma \mid \boldsymbol\theta_{-\sigma}, \mathbf{y})\right) = \text{constant} - 0.5 \log|\mathbf{V}| - 0.5 \log|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}| - 0.5\, \mathbf{y}'\mathbf{P}\mathbf{y} + \log p(\boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}) + \log p(\boldsymbol\sigma). \qquad [2.20]$$

Recall for either hierarchical model, $\mathbf{D}$ is a function of $\boldsymbol\tau = \{\tau_j\}$ for which conditional expectations are used to derive $\hat{\mathbf{D}}^{-1} = \text{diag}\{(\hat\tau_j)^{-1}\}$ in BayesA or $\hat{\mathbf{D}}^{-1} = \text{diag}\left(\hat\tau_j^{**} + c\,(1 - \hat\tau_j^{**})\right)$ in SSVS as noted earlier. Evaluating Equation [2.20] at $\hat{\mathbf{D}}^{-1}$ constitutes the only E-step for either model whereas the M-step is based on maximizing this resulting expression with respect to $\boldsymbol\sigma$. I denote the corresponding estimates as marginal modal a posteriori (MMAP) estimates in order to distinguish them from classical REML estimates.
Average Information REML (AIREML) is a particularly attractive hybrid Fisher's scoring/Newton-Raphson algorithm used to obtain REML estimates under classical Gaussian specifications for $\mathbf{g}$ based on the log likelihood of Equation [2.19] (Gilmour et al. 1995; Johnson and Thompson 1995). I adapt this algorithm for our proposed MMAP approach in Equation [2.20] by simply replacing $\boldsymbol\tau$ by $\hat{\boldsymbol\tau}$ from a previous E-step followed by maximizing Equation [2.20] with respect to $\boldsymbol\sigma$ in a M-step evaluated at $\hat{\boldsymbol\tau}$. To account for prior information in $\log p(\boldsymbol\sigma)$, I augment the AIREML first and second derivatives as provided by Johnson and Thompson (1995) with $\frac{\partial}{\partial \boldsymbol\sigma} \log p(\boldsymbol\sigma)$ and $\frac{\partial^2}{\partial \boldsymbol\sigma\, \partial \boldsymbol\sigma'} \log p(\boldsymbol\sigma)$, respectively. For all subsequent analyses in this paper, I use the non-informative prior for components of $\boldsymbol\sigma$ as advocated by Gelman (2006), i.e., $p(\sigma_e^2) \propto (\sigma_e^2)^{-1/2}$ and $p(\sigma_g^2) \propto (\sigma_g^2)^{-1/2}$.
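Assuming the Gelman (2006) prior reads $p(\sigma^2) \propto (\sigma^2)^{-1/2}$ for each variance component, the two prior terms that augment the AIREML derivatives have simple closed forms. The worked step below is written for a generic component $\sigma^2$ (either $\sigma_e^2$ or $\sigma_g^2$); it covers only the prior contribution and does not reproduce the AIREML score or average-information expressions of Johnson and Thompson (1995).

```latex
\log p(\sigma^2) = -\tfrac{1}{2}\log \sigma^2 + \text{constant}
\quad\Longrightarrow\quad
\frac{\partial \log p(\sigma^2)}{\partial \sigma^2} = -\frac{1}{2\sigma^2},
\qquad
\frac{\partial^2 \log p(\sigma^2)}{\partial (\sigma^2)^2} = \frac{1}{2\sigma^4}.
```

Because the two priors are specified independently, the cross-derivative of $\log p(\boldsymbol\sigma)$ between $\sigma_e^2$ and $\sigma_g^2$ is zero, so the prior adds only diagonal terms to the second-derivative matrix.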

2.3.4.2 Estimation of remaining hyperparameters
Suppose that one wishes to estimate $\boldsymbol\theta_{-\sigma}$ (i.e., $\nu_g$ in BayesA or $\pi$ in SSVS) as well so that a prior $p(\boldsymbol\theta_{-\sigma})$ is specified. Then Equation [2.20] could be further extended to additionally infer $\boldsymbol\theta_{-\sigma}$ using the following expression:

$$l(\boldsymbol\tau, \boldsymbol\sigma, \boldsymbol\theta_{-\sigma} \mid \mathbf{y}) = -0.5 \log|\mathbf{V}| - 0.5 \log|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}| - 0.5\, \mathbf{y}'\mathbf{P}\mathbf{y} + \log p(\boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}) + \log p(\boldsymbol\sigma) + \log p(\boldsymbol\theta_{-\sigma}). \qquad [2.21]$$

Note that the "separability" (Rockova and George 2014) of Equation [2.21] into contributions involving $\boldsymbol\sigma$ versus $\boldsymbol\theta_{-\sigma}$, i.e., $\frac{\partial^2}{\partial \boldsymbol\sigma\, \partial \boldsymbol\theta_{-\sigma}'}\, l(\boldsymbol\tau, \boldsymbol\sigma, \boldsymbol\theta_{-\sigma} \mid \mathbf{y}) = 0$, allows independent M-steps for each of these two components of $\boldsymbol\theta$. In fact, I suggest a hybrid algorithmic strategy whereby AIREML-based optimization is used for estimating $\boldsymbol\sigma$ whereas only first derivative information is used for maximizing $p(\boldsymbol\theta \mid \mathbf{y})$ with respect to $\boldsymbol\theta_{-\sigma}$. That is, I propose maximizing Equation [2.21] with respect to $\boldsymbol\theta_{-\sigma}$ by simply setting $\frac{\partial}{\partial \boldsymbol\theta_{-\sigma}}\left(\log p(\boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}) + \log p(\boldsymbol\theta_{-\sigma})\right)$, evaluated at the E-step, equal to 0. For all subsequent analyses in this paper, I considered $p(\nu_g) \propto \dfrac{1}{(1 + \nu_g)^2}$ and $p(\pi)$ to be a Beta(1,10) density, similar to what I've advocated in previous work (Yang and Tempelman 2012; Yang et al. 2015b).
Upon convergence, marginal modal estimates of $\boldsymbol\sigma$ and/or $\boldsymbol\theta_{-\sigma}$ could be "plugged in" to provide EB estimates of $\boldsymbol\beta$ and $\mathbf{g}$ for BayesA as in Section 2.2 or for SSVS as in Section 2.3. I refer to these estimates as e-BayesA and e-SSVS respectively.
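For SSVS, setting the first derivative of the expected Bernoulli log prior on the inclusion indicators plus a Beta(a, b) log prior on the mixing proportion to zero has a closed-form solution. The sketch below shows one reading of that first-derivative M-step; the function name is hypothetical, and the closed form is derived here from the stated Bernoulli and Beta(1,10) specifications rather than quoted from the text.

```python
def update_pi(tau_hat, a=1.0, b=10.0):
    """Mode of sum_j [tau_j log(pi) + (1 - tau_j) log(1 - pi)] + log Beta(pi; a, b).

    Setting the derivative with respect to pi to zero gives
    pi = (sum(tau_hat) + a - 1) / (m + a + b - 2).
    """
    m = len(tau_hat)
    return (sum(tau_hat) + a - 1.0) / (m + a + b - 2.0)

# Illustrative E-step output for 11 markers, one of which is clearly in the slab.
tau_hat_example = [1.0] + [0.0] * 10
pi_new = update_pi(tau_hat_example)   # 1 / (11 + 9)
```

With the Beta(1,10) prior the update reduces to the sum of the expected indicators divided by m + 9, so the prior pulls the estimate toward sparse models.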
2.4 Data
2.4.1 Simulation Study
In order to address the feasibility of our proposed EB approaches defined by MMAP
estimates of hyperparameters and subsequent EM-based estimates of SNP effects under BayesA
and SSVS models, I developed a simulation study to compare those estimates to MCMC based
posterior means as the gold standard. I simulated 20 replicated datasets using the R (R Core
Team 2017) package hypred (Technow 2013). The simulated genome was composed of 5
chromosomes, each of length 1 Morgan and consisting of 10,000 equally spaced loci. Individuals
were randomly mated to generate 400 animals within each of 2000 consecutive generations. The
mutation rate was specified to be 2.5×10^-5 per locus per generation. After Generation 2000,
random matings were used to expand the population size to 2000 individuals in each of
Generations 2001 and 2002. In Generation 2001, I deleted SNP genotypes with a minor allele
frequency (MAF) less than 0.05 and deleted randomly one of any two adjacent SNP loci in
complete LD with each other. I randomly chose 5000 of the remaining SNP to be our markers
(i.e. Z) from which I further randomly selected nqtl = 30 to be quantitative trait loci (QTL). I
simulated allelic substitution effects, gqtl, for these QTL from a reflected gamma distribution with
shape parameter 0.4 and scale parameter 1.66 (Hayes and Goddard 2001; Meuwissen et al. 2001); the corresponding genotypes $\mathbf{Z}_{qtl}$ for these animals were considered to be an $n_{qtl}$-column subset of the SNP genotype matrix $\mathbf{Z}$ such that the cumulative genetic merit or true breeding values ($\mathbf{u}_{TRUE}$) was $\mathbf{Z}_{qtl}\mathbf{g}_{qtl}$. Phenotypes for animals in Generation 2001 and 2002 were generated based on a heritability of 0.5, i.e., such that $\sigma_e^2 = \sigma_u^2 = \text{var}(\mathbf{u}_{TRUE})$. Pairwise LD among the 5000 SNP markers averaged 0.34 across all 20 replicates.
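The phenotype-generation step can be sketched in a few lines. This is a hedged illustration only: it does not use hypred or reproduce the population history, the genotypes are drawn as hypothetical independent 0/1/2 codes, the seed and number of animals are arbitrary, and "reflected gamma" is taken here to mean a gamma-distributed magnitude given a random sign with equal probability.

```python
import math
import random

def reflected_gamma(shape, scale, rng):
    """Gamma magnitude with a random sign (assumed meaning of 'reflected gamma')."""
    magnitude = rng.gammavariate(shape, scale)   # alpha = shape, beta = scale
    return magnitude if rng.random() < 0.5 else -magnitude

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(2001)                        # arbitrary seed
n_animals, n_qtl = 200, 30                       # n_animals is illustrative only
g_qtl = [reflected_gamma(0.4, 1.66, rng) for _ in range(n_qtl)]
z_qtl = [[rng.choice((0, 1, 2)) for _ in range(n_qtl)] for _ in range(n_animals)]

# True breeding values u_TRUE = Z_qtl g_qtl.
u_true = [sum(z * g for z, g in zip(row, g_qtl)) for row in z_qtl]

# Heritability of 0.5: residual variance set equal to the variance of u_TRUE.
sigma2_e = sample_variance(u_true)
y = [u + rng.gauss(0.0, math.sqrt(sigma2_e)) for u in u_true]
```

Setting the residual variance equal to the realized variance of the true breeding values is what fixes the heritability at 0.5 within each replicate.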
In order to assess the effect of different marker densities on prediction accuracy, I selected every 1st, 2nd, 4th, or 8th SNP marker, resulting in 4 different marker densities, i.e., m = 5000, 2500, 1250, and 625 SNP markers, for each of the 20 replicates.
2.4.2 Loblolly Pine Data
Resende et al. (2012) provided a data set involving 4854 SNP genotypes on 926 loblolly pine individuals. Upon excluding SNP with MAF < 0.05 and those showing departure from Hardy-Weinberg equilibrium (P < 10^-4), 2684 SNPs remained. I further standardized elements of $\mathbf{Z}$ using $(z_{ij}^{*} - 2\hat p_j) / \sqrt{2\hat p_j (1 - \hat p_j)}$ as described previously. Although original phenotypes were not publicly available, Resende et al. (2012) provided deregressed EBV (DEBV) for 17 traits; DEBV are often used as proxies for phenotypes when raw data are not readily available. Deregression refers to the process by which the effect of unequal shrinkage on individual EBV based on a pedigreed analysis (i.e., without using genotypic information) is reversed to remove heterogeneity of variation due to unequal information in order to recreate "phenotypes" (see Garrick et al. 2009). I selected one disease resistance trait, absence or presence of rust, previously estimated to have a heritability of 0.21. After discarding data on individuals missing DEBV, 807 individuals remained.
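The genotype standardization can be sketched as follows, assuming raw genotypes coded as 0, 1, or 2 copies of the reference allele, allele frequencies estimated from the sample, and division by the square root of the expected genotypic variance. The function name and the toy genotype matrix are hypothetical.

```python
def standardize_genotypes(z_raw):
    """Center each SNP column by 2p and scale by sqrt(2p(1 - p)).

    Assumes 0/1/2 coding and that no column is monomorphic (0 < p < 1).
    """
    n, m = len(z_raw), len(z_raw[0])
    out = [[0.0] * m for _ in range(n)]
    for j in range(m):
        p = sum(row[j] for row in z_raw) / (2.0 * n)   # estimated allele frequency
        sd = (2.0 * p * (1.0 - p)) ** 0.5
        for i in range(n):
            out[i][j] = (z_raw[i][j] - 2.0 * p) / sd
    return out

# Hypothetical genotypes for 4 individuals at 2 SNPs.
z_raw = [[0, 1], [1, 2], [2, 1], [1, 0]]
z_std = standardize_genotypes(z_raw)
```

After standardization every column has mean zero, which keeps marker effects on a common scale when a single genetic variance is assumed across SNPs.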
To compare our proposed hierarchical EB estimation strategy with MCMC and with
RRBLUP, I randomly split the data into 10 portions with each portion serving once as a test
dataset (81 individuals/dataset) and the remaining 9 portions serving as the training dataset (726
individuals/dataset) in a 10-fold cross-validation study.
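A random 10-fold partition of the 807 individuals can be sketched as below. The seed and function name are arbitrary assumptions; because 807 is not divisible by 10, a split of this kind yields folds of 80 or 81 individuals rather than exactly 81 in every fold.

```python
import random

def k_fold_indices(n, k, seed):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

folds = k_fold_indices(807, 10, seed=1)

# Each fold serves once as the test set; the other nine form the training set.
splits = []
for f, test_idx in enumerate(folds):
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    splits.append((train_idx, test_idx))
```

Every individual appears in exactly one test set, so the ten test-set accuracies can be averaged without any overlap between test data.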
2.5 Data Analysis
All parameters excluding $\nu_g$ in the BayesA model were estimated using MCMC based on 100,000 cycles of burn-in followed by an additional 100,000 cycles, saving every 10th sample for a total of 10,000 saved MCMC cycles. For our proposed EB strategy, MMAP estimation of $\boldsymbol\theta$ in RRBLUP, BayesA, and SSVS was based on a convergence criterion of

$$\frac{[\hat{\boldsymbol\theta}^{(k)} - \hat{\boldsymbol\theta}^{(k-1)}]'[\hat{\boldsymbol\theta}^{(k)} - \hat{\boldsymbol\theta}^{(k-1)}]}{[\hat{\boldsymbol\theta}^{(k)}]'[\hat{\boldsymbol\theta}^{(k)}]} < 10^{-4};$$

after AIREML convergence of variance components, convergence on EM-based solutions to $\mathbf{g}$ was based on the same criterion. Because of the difficulties and slow convergence that I encountered in estimating $\nu_g$ based on our proposed MMAP strategy, $\nu_g$ was held constant at 5 for the BayesA model using both MMAP and MCMC. In our simulation study, the correlation between GEBV and $\mathbf{u}_{TRUE}$ in Generation 2002 was defined as the accuracy of (genomic) prediction whereby GEBV $= \mathbf{Z}\hat{\mathbf{g}}$ for $\hat{\mathbf{g}}$ being the posterior mean of $\mathbf{g}$ using MCMC or $\hat{\mathbf{g}}$ being EB estimates based on BayesA and SSVS and generated from analysis of Generation 2001 data. The RRBLUP model was also considered. Similarly, for the loblolly pine data analysis, I defined the accuracy of (cross-validation) prediction as the correlation between the test data DEBV and its predictions based on estimates derived from the training data.
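The relative-change convergence criterion can be written as a small helper. This is a minimal sketch with hypothetical function names; the tolerance of 1e-4 is the value stated in the text, and the example parameter vectors are arbitrary.

```python
def relative_change(theta_new, theta_old):
    """Squared change between successive iterates relative to the squared norm of the new one."""
    num = sum((a - b) ** 2 for a, b in zip(theta_new, theta_old))
    den = sum(a * a for a in theta_new)
    return num / den

def converged(theta_new, theta_old, tol=1e-4):
    return relative_change(theta_new, theta_old) < tol

# Arbitrary example: (sigma2_e, sigma2_g) at two successive iterations.
done = converged([2.0, 0.001], [2.01, 0.001])
```

Since the criterion is relative to the norm of the whole vector, small components such as the genetic variance per marker contribute little to it when the residual variance is much larger.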
Accuracies of prediction for both the simulated and the loblolly pine data were compared across three different sets of starting values of $\mathbf{g}$ in our EB methods: Set 1) e-RRBLUP of $\mathbf{g}$ conditional on REML($\boldsymbol\sigma$), Set 2) MCMC posterior means of all parameters based on the same model as the corresponding EB approach and Set 3) $\mathbf{g} = \mathbf{0}$ and MCMC posterior means for all other parameters based on the same model as the corresponding EB approach. To help derive properly scaled starting values ($\sigma_{g(0)}^2$) based on REML estimates ($\hat\sigma_{g(REML)}^2$) of $\sigma_g^2$ for Set 1, I used $\sigma_{g(0)}^2 = \dfrac{\nu_g - 2}{\nu_g}\, \hat\sigma_{g(REML)}^2$ for MMAP under e-BayesA and $\sigma_{g(0)}^2 = \dfrac{\hat\sigma_{g(REML)}^2}{\hat\pi}$ for MMAP under e-SSVS with $\hat\pi$ being the posterior mean of $\pi$ from the corresponding MCMC analyses. For e-BayesA, I additionally investigated the effect of basing the E-step on relative precisions (P), i.e., using $\hat\tau_j$, as per Equation [2.7] or on relative variances (V), i.e., using $\hat\tau_j^{*}$, as advocated by Karkkainen and Sillanpaa (2012) for each of the three sets of starting values described above. So, for example, Set 2(P) refers to using MCMC posterior means as starting values with the E-step based on $\hat\tau_j$ whereas Set 2(V) refers to using MCMC posterior means as starting values with the E-step based on $\hat\tau_j^{*}$ for e-BayesA. Since hyperparameters like $\sigma_g^2$ or $\pi$ are not truly well defined in simulation studies that attempt to mimic the LD between markers characteristic of real data (Yang et al. 2015b), our assessment of the relative performance of the different starting value/E-step strategies was based on their agreement with the corresponding MCMC posterior means given that MCMC inferences do not require asymptotic assumptions. A more ideal assessment might be based on cross-validation prediction accuracy, but I deemed that to be prohibitively expensive for a replicated simulation study.
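The rescaling of the REML estimate into Set 1 starting values can be sketched as two one-line helpers. The function names and the numerical inputs are hypothetical; the formulas are the two scalings stated above for e-BayesA and e-SSVS.

```python
def start_sigma2_g_bayesa(sigma2_g_reml, nu_g):
    """Scale parameter implied by a Student t with variance sigma2_g_reml (nu_g > 2)."""
    return (nu_g - 2.0) / nu_g * sigma2_g_reml

def start_sigma2_g_ssvs(sigma2_g_reml, pi_hat):
    """Slab variance implied when only a proportion pi_hat of markers carry the variance."""
    return sigma2_g_reml / pi_hat

# Illustrative inputs: REML estimate of 1.0, nu_g = 5, pi_hat = 0.05.
bayesa_start = start_sigma2_g_bayesa(1.0, 5.0)
ssvs_start = start_sigma2_g_ssvs(1.0, 0.05)
```

Both rescalings keep the implied marginal variance of a marker effect roughly equal to the REML estimate, so the starting point of each model is comparable with RRBLUP.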
2.6 Expectation maximization variable selection (EMVS)
By nature of the SSVS prior, the joint posterior density of SNP effects under a SSVS model will be multimodal, such that EM estimates of $\mathbf{g}$ may converge to local maxima. This was recognized by Rockova and George (2014) who recommended two strategies in tandem to alleviate this problem: 1) a regularization procedure that they label as Expectation Maximization Variable Selection (EMVS) and 2) a deterministic annealing variant of EM (DAEM). EMVS involves gradually changing the values of c from relatively high values (highly peaked spikes on 0 for near-null elements of $\mathbf{g}$) to relatively low values while executing EM. Rockova and George (2014) demonstrated that starting SSVS with a highly peaked spike (high c) and gradually increasing the variance of the spike (i.e., decreasing c) helps absorb negligible estimates of $\mathbf{g}$ in their examples. For the SSVS model characterized in Equation [2.10], Rockova and George (2014) actually specified $\sigma_g^2$ as known, whereas I adaptively estimate it with our proposed MMAP approach as indicated earlier. Although Rockova and George (2014) provided no specific guidelines on how to choose the decreasing gradient set of values of c, I arbitrarily chose decreasing values of c within the set $S_c = \{100000, 60000, 10000, 5000, 2000, 1000\}$, ending with 1000 based on a trial and error assessment on the lowest value of c that maximized cross-validation accuracy on the loblolly pine dataset. Note that e-SSVS estimates of $\boldsymbol\theta$, $\boldsymbol\beta$, and $\mathbf{g}$ based on one value of c are used as the starting values for e-SSVS estimation of $\boldsymbol\theta$, $\boldsymbol\beta$, and $\mathbf{g}$ for the next value of c in $S_c$.
On the other hand, DAEM involves modifying the E-step in Equation [2.16] as follows:

$$\hat\tau_j^{**} = \frac{\left(\phi_{\hat g_j}(0, \sigma_g^2)\,\pi\right)^t}{\left(\phi_{\hat g_j}(0, \sigma_g^2)\,\pi\right)^t + \left(\phi_{\hat g_j}\!\left(0, \dfrac{\sigma_g^2}{c}\right)(1 - \pi)\right)^t} \qquad [2.22]$$

Here 1/t (0 ≤ t ≤ 1) corresponds to a temperature specification on the degree of separation between multiple modes of the estimates; note that Equation [2.22] defaults to [2.16] when t = 1. Rockova and George (2014), following developments provided by Ueda and Nakano (1998), claimed that gradually decreasing the temperature 1/t, starting at t = 0 (i.e., 1/t → ∞), increases the chances of finding the true global maximum. Note that at t = 0, the corresponding estimates from this procedure are effectively equivalent to e-RRBLUP($\mathbf{g}$) and REML($\boldsymbol\sigma$) (i.e. starting values defined in Set 1). Hence, at least as it pertains to $\mathbf{g}$, using t = 0 then corresponds to a situation whereby any local modes in its joint posterior density are completely smoothed away. As 1/t gradually decreases, multiple modes begin to appear. Hence the influence of starting values is weakened by keeping 1/t high during the early stages and gradually decreasing it to 1/t = 1 which directly corresponds to the joint posterior density of interest.
I jointly adapted EMVS and DAEM together, labeled as DAEMVS by Rockova and George (2014), by conducting our e-SSVS strategy over decreasing values of c in $S_c$ within each of the decreasing temperature values (i.e. increasing t) in $S_t = \{0, 0.25, 0.5, 0.75, 1\}$ on each of the 10 loblolly pine training datasets. Within each subsequent specification of t in $S_t$ I also conducted our e-SSVS strategy over increasing spike variance (i.e. decreasing c) in the set $S_c$ such that e-SSVS estimates from each pair of t and c values cycling within t served as the starting values for the next pair of values of t and c.
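The nested DAEMVS schedule with warm starts can be sketched as a simple loop. The values in the two sets come from the text; `fit_essvs` is a hypothetical stand-in for a full e-SSVS fit (MMAP estimation of hyperparameters followed by EM for the SNP effects), and here it only records the order of the calls.

```python
S_T = [0.0, 0.25, 0.5, 0.75, 1.0]                 # increasing t, decreasing temperature
S_C = [100000, 60000, 10000, 5000, 2000, 1000]    # decreasing c, widening spike

def daemvs_schedule():
    """All (t, c) pairs: c cycles through S_C within each value of t."""
    return [(t, c) for t in S_T for c in S_C]

def run_daemvs(fit, start):
    """Run fit over the schedule, passing each result on as the next starting value."""
    state = start
    for t, c in daemvs_schedule():
        state = fit(state, c=c, t=t)
    return state

def fit_essvs(state, c, t):
    """Hypothetical placeholder: a real implementation would return updated estimates."""
    return state + [(t, c)]

history = run_daemvs(fit_essvs, [])
```

The final pair in the schedule is t = 1 with c = 1000, so the last fit targets the joint posterior density of interest under the default spike setting.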

2.7 Results
2.7.1 Simulation Study
Average MMAP estimates, based on the three different sets of starting values and the two different E-step strategies (relative precisions $\hat\tau_j$ vs. relative variances $\hat\tau_j^{*}$) as well as average MCMC estimates of $\boldsymbol\sigma$ across the 20 replicates under the BayesA model are provided in Table 2.1.

Table 2.1 Average MCMC and MMAP estimates of hyperparameters as a function of marker density, starting values and expectation-step (E-step) strategies under a BayesA model.

Average hyperparameter estimates across 20 simulated replicates

Variance     Marker              E-step based on relative precisions (P)   E-step based on relative variances (V)
component    density   MCMC^1    Set 1^2    Set 2^3    Set 3^4             Set 1^2    Set 2^3    Set 3^4
sigma_g^2    625       2.26E-3   3.77E-3    3.77E-3    3.77E-3             2.22E-3    2.22E-3    2.22E-3
             1250      1.43E-3   2.55E-3    2.55E-3    2.55E-3             1.50E-3    1.50E-3    1.50E-3
             2500      6.98E-4   1.34E-3    1.32E-3    1.32E-3             7.75E-4    7.65E-4    7.68E-4
             5000      3.03E-4   6.35E-4    3.62E-4    6.84E-4             3.38E-4    3.29E-4    3.50E-4
sigma_e^2    625       2.68      2.62       2.63       2.63                2.61       2.61       2.61
             1250      2.30      2.23       2.23       2.23                2.21       2.21       2.21
             2500      2.05      2.00       2.00       2.00                1.98       1.98       1.98
             5000      1.89      1.88       1.98       1.84                1.89       1.88       1.88

^1 MCMC: average posterior means. Other columns pertain to MMAP estimates based on relative precisions versus relative variances E-steps and three different sets of starting values: ^2 Set 1) e-BLUP (g) and REML(sigma), ^3 Set 2) MCMC (g and sigma), ^4 Set 3) g = 0 and MCMC (sigma).

By basing the E-step on relative precisions ($(\hat\tau_j)^{-1} = E(\tau_j^{-1})$), the three different sets of starting values lead to virtually identical estimates of $\boldsymbol\sigma$ at m = 625, 1250 and 2500 although the corresponding estimates of $\sigma_g^2$ in all three cases were consistently higher compared to MCMC estimates such that the opposite was observed for $\sigma_e^2$. However, at m = 5000, there was considerable disagreement between the three sets. When the E-step was based on relative variances ($\hat\tau_j^{*} = E(\tau_j)$), MMAP led to average estimates that were much closer to average MCMC estimates for all specifications of m and starting value sets. The correlation between all 5 sets of estimates in Table 2.1 exceeded 0.995 for m = 625, 1250, and 2500, whereas these correlations dropped to 0.95-0.99 at m = 5000 with E-step based on relative precisions (results not reported). A closer look at the behavior of the three different sets of starting values and/or two different E-step strategies versus the corresponding MCMC posterior means for m = 5000 is provided in Figure 2.1. It appears that starting with MCMC-based estimates (Set 2) for MMAP estimation generally led to closer agreement than the other two sets between MMAP estimates and the corresponding MCMC posterior means on $\sigma_g^2$ when the E-step was based on relative precisions. However, when the E-step was based on relative variances, the correspondence between MMAP and MCMC estimates was very close for all three sets of starting values.

Figure 2.1 MMAP versus MCMC estimates of $\sigma_g^2$ in simulation study (20 replicated datasets each with 5,000 markers) under BayesA model with MMAP estimates based on different starting values for $\mathbf{g}$ and $\boldsymbol\sigma$: Set 1) rrBLUP (g) and REML($\boldsymbol\sigma$), Set 2) MCMC (g and $\boldsymbol\sigma$), or Set 3) g = 0 and MCMC ($\boldsymbol\sigma$). Top panel of plots (P) pertains to use of E-step based on relative precisions whereas bottom panel of plots (V) pertains to use of E-step based on relative variances. Reference line of slope 1 and intercept 0 superimposed.
For the SSVS model, average MMAP estimates, based on the different starting values as well as average MCMC estimates of $\boldsymbol\theta = (\sigma_e^2, \sigma_g^2, \pi)$ across the 20 replicates are provided in Table 2.2.

Table 2.2 Average MCMC and MMAP estimates of hyperparameters as a function of marker density and starting values under a SSVS model.

Average hyperparameter estimates across 20 simulated replicates

Hyper-       Marker
parameter    density   MCMC^1    Set 1^2    Set 2^3    Set 3^4
sigma_g^2    625       1.04      0.67       1.08       3.76
             1250      0.70      0.68       0.71       2.68
             2500      0.32      0.47       0.34       1.59
             5000      0.20      0.48       0.21       0.96
sigma_e^2    625       2.74      2.56       2.57       2.70
             1250      2.39      2.11       2.24       2.34
             2500      2.16      2.03       2.08       2.11
             5000      1.96      2.04       1.94       1.96
pi           625       0.10      0.26       0.11       1.07E-3
             1250      0.05      0.16       0.05       4.57E-4
             2500      2.35E-2   0.04       1.85E-2    1.65E-4
             5000      5.74E-3   1.18E-3    3.91E-3    3.10E-5

^1 MCMC: average posterior means. Other columns pertain to MMAP inference based on different sets of starting values: ^2 Set 1) e-BLUP (g) and REML (sigma), ^3 Set 2) MCMC (g and sigma), and ^4 Set 3) g = 0 and MCMC (sigma).
It seemed intuitively apparent that the best MMAP performance (i.e., highest correlation with MCMC estimates) was observed when starting values for $\boldsymbol\theta$ were based on MCMC estimates (Set 2); in fact, the correlation between Set 2 and MCMC estimates of $\sigma_g^2$ exceeded 0.99 for all marker densities. On the other hand, the worst performance involved the Set 3 starting values (i.e. g = 0) where the corresponding correlation never exceeded 0.72 and was sometimes less than 0, although it did increase with marker density. A closer assessment of the relative agreement between MCMC estimates with MMAP estimates of $\sigma_g^2$ and $\pi$ based on the three different sets of starting values and m = 5,000 markers is provided in Figure 2.2. MMAP estimates of $\sigma_g^2$ starting from MCMC estimates (Set 2) most closely aligned with MCMC estimates of posterior means of $\sigma_g^2$. Likewise MMAP estimates of $\pi$ starting at Set 2 agreed best with the corresponding MCMC estimates although MMAP estimates were typically slightly lower. However, MMAP estimates of $\sigma_g^2$ based on the other starting value sets (Sets 1 and 3) appeared to be badly biased upwards relative to the MCMC estimates. To seemingly compensate for that bias, estimates of $\pi$ were badly biased downwards such that with Set 3) starting at g = 0, estimates of $\pi$ rarely deviated from 0.

Figure 2.2 MMAP versus MCMC posterior means of $\sigma_g^2$ (top row) and of $\pi$ (bottom row) in simulation study (20 replicated datasets each with 5,000 markers) under SSVS model with MMAP estimates based on different starting values for $\mathbf{g}$ and $\boldsymbol\theta$: Set 1) rrBLUP (g) and REML($\boldsymbol\sigma$), Set 2) MCMC (g and $\boldsymbol\theta$), and Set 3) g = 0 and MCMC($\boldsymbol\theta$). Reference line of slope 1 and intercept 0 superimposed.
I compared GEBV in Generation 2002 based on estimates of g derived from the analysis of
data on Generation 2001 using e-GBLUP, MCMC, and the various sets of starting values and/or
E-step strategies as they pertain to our EB versions of e-BayesA and e-SSVS in Figure 2.3.

Figure 2.3 Mean accuracies of breeding value prediction for EB inference as a function of
different marker densities (625, 1250, 2500, or 5000 markers) for BayesA (Panel A) and SSVS
(Panel B) in simulation study (20 replicated datasets). e-GBLUP, based on REML ($\sigma$), is the
same for both Panels A and B, whereas MCMC refers to using fully Bayesian inference under
MCMC for the corresponding model. Other lines pertain to EB inference based on different sets
of starting values or E-step strategies: Set 1) e-rrBLUP (g) and REML ($\sigma$), Set 2) MCMC (g
and $\sigma$), Set 3) g = 0 and MCMC ($\sigma$), with letter suffixes indicating whether the
corresponding E-step was based on relative precisions (P) or relative variances (V). Letter codes
are used to separate estimates (P<0.05) having different accuracies based on the 5,000 marker
analyses.
e-GBLUP was always inferior to all other strategies, with the gap in average accuracy
generally increasing with increasing marker density. As further anticipated from our previous
comparisons of MMAP estimates of $\sigma_g^2$ in Table 1 and Figure 2.2, BayesA MCMC posterior
means or e-BayesA estimates of GEBV based on starting values derived from these same
MCMC estimates (Set 2) were generally superior to e-BayesA estimates based on other starting
values (Sets 1 and 3) when the E-step was based on relative precisions (Figure 2.3A). However,
no significant differences in accuracies were apparent between MCMC and e-BayesA estimates
based on any of the three different sets of starting values when the E-step was based on relative
variances. For SSVS (Figure 2.3B), MCMC posterior means and e-SSVS based on Set 2
starting values led to GEBV that were more accurate than GEBV starting at e-RRBLUP (Set 1)
that were, in turn, more accurate than those starting with g = 0 (Set 3).
2.7.2 Application to Loblolly Pine Data
The average cross-validation accuracies for the different models, sets of starting values, and
E-step strategies (e-BayesA) over the 10 different replicates for the loblolly pine data analysis
are summarized in Figure 2.4; results based on the DAEMVS strategies for e-SSVS are also
provided. Consistent with BayesA in the simulation study, MCMC posterior means or e-BayesA
estimates starting at MCMC estimates (Set 2) were generally superior to e-BayesA
estimates based on either Set 1 or Set 3 when the E-step was based on relative precisions, and to
e-GBLUP. Similar conclusions could be drawn when the E-step was based on relative variances,
although there was significant improvement in the e-BayesA accuracies based on Set 1 or 3
starting values. For SSVS, MCMC dominated e-SSVS based on starting value Set 2, although
Set 2 in turn outperformed Sets 1 and 3 as well as e-GBLUP.


Figure 2.4 Average cross-validation accuracies for analysis of loblolly pine data based on
empirical Bayes (EB) inference under BayesA (left cluster) or SSVS (right cluster) models.
e-GBLUP, based on REML ($\sigma$), is the same for both clusters, whereas MCMC refers to using
fully Bayesian inference for the corresponding model. Other bars pertain to EB inference based on
different sets of starting values or E-step strategies: Set 1) e-rrBLUP (g) and REML ($\sigma$),
Set 2) MCMC (g and $\sigma$), or Set 3) g = 0 and MCMC, with letter suffixes indicating whether the
corresponding E-step was based on relative precisions (P) or relative variances (V). Letter codes
are used to separate estimates (P<0.05) having different cross-validation prediction accuracies
within each cluster.
For SSVS, I also evaluated whether DAEMVS could mitigate the impact of starting values for
e-SSVS. Figure 2.5 provides a regularization plot (Rockova and George 2014) for one randomly
chosen training dataset. In this plot, elements of $\hat{g}$ are plotted as a function of pairs of
sequentially chosen values of c in St within t in Sc. Recall that at t = 0, $\hat{g}$ are e-RRBLUP
of g for all values of c, with Figure 2.5 indicating very little spread in elements of $\hat{g}$ at
t = 0. Beyond t = 0.75, it appeared that DAEM (i.e., impact of t) had little influence on the
spread of elements in $\hat{g}$, whereas EMVS (i.e., impact of c) was far more influential. I did
not consider values lower than c = 1000 since I noted that they compromised cross-validation
prediction accuracies (results not reported). Referring back to Figure 2.4, I determined that the
DAEMVS-based e-SSVS estimates of GEBV had a predictive accuracy comparable to MCMC estimates,
with greater (P<0.05) predictive accuracy than e-SSVS based on all other starting value sets.

Figure 2.5 DAEMVS regularization plot for e-SSVS analysis of one training dataset
based on loblolly pine data. The x-axes pertain to precision on spike component variance (c at
bottom) and inverse temperatures (t at top), whereas the y-axis denotes SNP effect estimates
$\hat{g}$.
2.8 Discussion
I have demonstrated that it is possible to develop computationally efficient empirical Bayes
approaches to hierarchical Bayesian WGP models that additionally allow one to infer key
hyperparameters, whether those WGP models are based on heavy-tailed (BayesA) or variable
selection (SSVS) specifications on marker effects g. Our approach is based on a marginal modal
inference procedure for estimating hyperparameters that closely emulates REML. Nevertheless, I
have also demonstrated that the reliability of this EB strategy can critically depend on starting
values. Starting at MCMC posterior means generally led to accuracies in EB estimates that closely
mirrored those of MCMC posterior means, whereas a currently popular strategy based on starting
all SNP effects at 0 appeared to be badly suboptimal.
At any rate, there appeared to be some promising possibilities for partially mitigating the
effects of starting values. Our simulation study demonstrated no evidence of a difference
between the various sets of starting values when the E-step was based on relative variances, as
advocated by Karkkainen and Sillanpaa (2012) instead of relative precisions for e-BayesA
implementations. However, the E-step based on relative variances was not quite as effective at
mitigating the effects of starting values in our application to the loblolly pine data although it did
improve cross-validation prediction accuracy relative to the conventional E-step for starting
value sets not starting at MCMC posterior means for all parameters. I have no theoretical
conjecture for the difference in performance between the two different E-step strategies. The
relative variance strategy was based on a generalized EM (GEM) framework in which
Karkkainen and Sillanpaa (2012) proposed that one merely iteratively computes conditional
means (E-steps) or modes (M-steps) to substitute for random draws from the corresponding full
conditional densities using MCMC. However, their GEM strategy does not suggest which full
conditional parameterizations (e.g., relative variances versus relative precisions) one should
work with, whereas EM typically involves basing the E-step on the functional forms of the
corresponding augmented variables in the joint posterior density. For e-SSVS, starting values
based on MCMC posterior means were also important for maximizing accuracy of converged
GEBV; however, the application of the DAEMVS approach appeared to be rather effective for
minimizing the influence of starting values in the loblolly pine data application such that it was
even superior to e-SSVS starting at MCMC posterior means.
At any rate, it is reasonable to conclude from our results that implementations of e-BayesA or
e-SSVS can lead to more accurate GEBV than the more common e-GBLUP strategy, mirroring
what has been often concluded based on fully Bayesian (i.e. MCMC) implementations (Hayes et
al. 2009; de Los Campos et al. 2013). Given that MCMC estimates lead to better starting values
than, say, g = 0, a practical recommendation might be that all initial analyses be first based on
MCMC followed by computationally efficient regular EB updates at periodic intervals for WGP
programs involving regular updates of phenotypes and genotypes (Wiggans et al. 2011).
However, even this strategy warrants more rigorous study since it has been demonstrated that the
relative importance of certain genomic regions contributing to GEBV in chickens may change
over several generations (Fragomeni et al. 2014), likely because of the gradual breakdown of LD
between SNP markers and QTL.
As indicated previously, there has not been much work addressing inferences on
hyperparameters in EM-based implementations of the Bayesian alphabet models. The strategy
proposed by Karkkainen and Sillanpaa (2012) was based on maximizing the joint posterior
density of all parameters, including hyperparameters; however, their success in estimating these
hyperparameters was limited, particularly for the BayesA model. It has been demonstrated
previously that maximizing the joint posterior density of all parameters in a linear mixed model
can be fraught with difficulties whereby "severe dependencies" can exist between the
components of g and of $\theta$ that hamper efficient estimation of $\theta$ (Harville 1977).
Conversely, marginal posterior estimates (i.e., MMAP) of $\theta$ followed by joint posterior modal
inference of g (and $\beta$) conditional on MMAP($\theta$) is typical of a more stable EB-based
approach to inference with hierarchical models, similar to using REML followed by BLUP
(Robinson 1991). Other researchers have taken yet a completely different approach by treating
elements of $\theta$ as if they were augmented variables whose uncertainty is accounted for by
integrating them out of the joint posterior density whereas SNP-specific variances (i.e., $\tau$)
are considered as parameters to be estimated (Xu 2007; Cai et al. 2011; Huang et al. 2015). Given
that each element of $\tau$ defines the relative variance of a single element of g, I am not sure
that this is particularly advisable; nevertheless, more rigorous comparisons of their approach
with our proposed strategy may be warranted.
I have also asserted, as others previously have (Karkkainen and Sillanpaa 2012; Sun et al.
2012), that computing time is substantially less for EM versus MCMC based implementations of
BayesA; similar arguments naturally hold for SSVS. If hyperparameters ($\theta$) are not to be
estimated, this should be rather intuitive since EM is based on updating the full conditional
means of $\beta$, g, and $\tau$ whereas MCMC is based on drawing random samples from their
corresponding full conditional Gaussian densities that require, in addition to specifications of
these same conditional means, the specification of conditional variances followed by draws from
a random Gaussian generator. Hence the computing time per MCMC cycle should be slightly
greater than per EM iterate if $\theta$ is not considered; furthermore, the number of EM
iterates to reach convergence to a joint posterior mode for $\beta$ and g is presumably less than
the number of cycles needed to facilitate sufficiently reliable MCMC inference. This should be
especially true of analysis updates based on the regular collection of more phenotypes and
genotypes since EM would be programmed to start at the solutions from the most recent update.
Even so, the MCMC versus EM computing comparison is further complicated by the need to
estimate $\theta$ and by the dimensionality of g, amongst other things. I have advocated the use of
a MMAP technique for $\theta$ that closely emulates REML, based on the AIREML algorithm. To
mitigate the dimensionality of g in determining some of the intermediate calculations in AIREML,
most notably the inverse of the coefficient matrix in Equation [2.9], it is possible to
reparameterize the model in terms of the breeding values u = Zg, to mirror current GBLUP
implementations (Stranden and Garrick 2009).
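
The per-iteration cost argument above can be illustrated with a minimal sketch. It assumes a
single-site updating scheme with D and the variance components held fixed, which is only one
possible way to organize such updates and is not claimed to be the implementation used in this
dissertation; the function name `sweep_snp_effects` and the toy data are hypothetical.

```python
import numpy as np

def sweep_snp_effects(y_adj, Z, g, d, sigma2_g, sigma2_e, rng=None):
    """One pass over all SNP effects, updating g in place.

    rng=None  -> set each g[j] to its full conditional mean (EM-type update)
    rng given -> draw each g[j] from its full conditional normal (Gibbs/MCMC)
    y_adj is the phenotype vector already adjusted for fixed effects.
    """
    resid = y_adj - Z @ g
    for j in range(Z.shape[1]):
        zj = Z[:, j]
        lhs = zj @ zj + sigma2_e / (d[j] * sigma2_g)
        resid += zj * g[j]                     # take SNP j out of the current fit
        cond_mean = (zj @ resid) / lhs         # needed by both EM and MCMC
        if rng is None:
            g[j] = cond_mean
        else:
            cond_sd = np.sqrt(sigma2_e / lhs)  # extra step needed only by MCMC
            g[j] = rng.normal(cond_mean, cond_sd)
        resid -= zj * g[j]                     # put the updated SNP j back
    return g

rng = np.random.default_rng(7)
n, m = 40, 10
Z = rng.integers(0, 3, size=(n, m)).astype(float)
Z -= Z.mean(axis=0)                            # centred toy genotypes
y_adj = rng.normal(size=n)
d = np.ones(m)                                 # D = I for simplicity

g_em = np.zeros(m)
for _ in range(200):                           # repeated conditional-mean sweeps
    sweep_snp_effects(y_adj, Z, g_em, d, 0.1, 1.0)

# A single Gibbs sweep from the same starting point, for comparison.
g_draw = sweep_snp_effects(y_adj, Z, np.zeros(m), d, 0.1, 1.0, rng=rng)
```

With D fixed, the conditional-mean sweeps amount to Gauss-Seidel iteration on the mixed model
equations, so `g_em` settles on a single solution, whereas each Gibbs sweep yields only one
random draw that must be averaged over many cycles.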
At any rate, our results imply that researchers thinking about using current EM-based
implementations of Bayesian alphabet models should be cognizant of the potential effect of
starting values and potential remedies, as convergence problems will only intensify with
increasing marker densities.


Chapter 3 Genome Wide Association Analyses Based on Broadly Different Specifications
for Prior Distributions, Genomic Windows, and Estimation Methods
3.1 Abstract
A currently popular strategy (EMMAX) for genome wide association (GWA) analysis infers
association for the specific marker of interest by treating its effect as fixed while treating all
other marker effects as classical Gaussian random effects. It may be more statistically coherent
to specify all markers as sharing the same prior distribution, whether that distribution is
Gaussian, heavy-tailed (BayesA), or has variable selection specifications based on a mixture of,
say, two Gaussian distributions (SSVS). Furthermore, all such GWA inference should be
formally based on posterior probabilities or test statistics as I present here, rather than merely
being based on point estimates. I compared these three broad categories of priors within a
simulation study to investigate the effects of different degrees of skewness for quantitative trait
loci (QTL) effects and numbers of QTL using 43,266 SNP marker genotypes from 922
Duroc-Pietrain F2 cross pigs. Genomic regions were based either on single SNP associations, on
non-overlapping windows of various fixed sizes (0.5 to 3 Mb) or on adaptively determined windows
that cluster the genome into blocks based on linkage disequilibrium (LD). I found that SSVS
and BayesA led to the best receiver operating characteristic curve properties in almost all cases.
I also evaluated approximate maximum a posteriori (MAP) approaches to BayesA and SSVS as
potential computationally feasible alternatives; however, MAP inferences were not promising,
particularly due to their sensitivity to starting values. I determined that it is advantageous to use
variable selection specifications based on adaptively constructed genomic window lengths for
GWA studies.


3.2 Introduction
Recent developments in genotyping technology have made single nucleotide polymorphism
(SNP) genotype marker panels, based on thousands, and now millions, of markers, available for
many livestock species (Wiggans et al. 2013; Kemper et al. 2015). Genome wide association
(GWA) analyses have been increasingly used to help pinpoint regions containing potential
causal variants or quantitative trait loci (QTL) for economically important phenotypes based on
fitting SNP markers as covariates. An increasingly popular inferential approach for GWA is
based on fitting phenotypes as a joint linear function of all markers using mixed-model
procedures such as those invoked in the popular EMMAX procedure (Kang et al. 2010) and
other similar procedures (Lippert et al. 2011; Zhou and Stephens 2012). Jointly accounting for
all SNP effects when inferring upon a specific SNP marker of interest generally improves
precision and power while also accounting for potential population structure (Kang et al. 2008).
Now GWA inferences in EMMAX and related procedures are based on treating the effect of
the SNP marker of interest as fixed with all other marker effects as normally distributed random
effects, noting that this process is repeated in turn for every single marker. These "fixed effects"
hypothesis tests are based on generalized least squares (GLS) inference, with P-values being
subsequently adjusted for the total number of markers or tests. Goddard et al. (2016) have
recently pointed out the paradox with treating markers as fixed for inference but then otherwise
as random to account for population structure for inference on association with other markers.
Random effects modeling with all SNP effects treated as random, including the one of inferential
interest, is synonymous with shrinkage based inference. Shrinkage or posterior inference has
been demonstrated to facilitate reliable inference without any formal requirements for multiple
comparison adjustments (Stephens and Balding 2009; Gelman et al. 2012). However, with SNP
markers treated as identically and independently distributed variables from a Gaussian
distribution, the resulting shrinkage from random effects modeling can be too "hard",
particularly with greater marker densities (Hayes 2013). Consequently, this random effects test
has been deemed to be far too conservative in various applications, as further demonstrated by
Gualdron Duarte et al. (2014).
Prior specifications that are sparser than Gaussian may be more important for GWA since
they more likely better characterize the true genetic architecture of most traits relative to
Gaussian priors (de Los Campos et al. 2013). Sparser specifications have already been
popularized in whole genome prediction (WGP), such as the Student t distribution used in
BayesA (Meuwissen et al. 2001) and stochastic search and variable selection or SSVS (George
and McCulloch 1993; Verbyla et al. 2009). Both specifications generally lead to far less
shrinkage of large effects yet greater shrinkage of small effects compared to a Gaussian prior. In
particular, the use of variable selection procedures facilitates the determination of posterior
probabilities of association (PPA), whose control may be far more effective in maximizing both
sensitivity and specificity of GWA (Fernando et al. 2017) compared to frequentist based
inferences which require adjustments for multiple testing such as with EMMAX. Another
common inferential strategy in GWA is to simply report the percent of variance explained by a
marker or marker region (Fan et al. 2011; Tizioto et al. 2015; Wolc et al. 2016). However, point
estimates of marker effects or percentage of variation explained, by themselves, do not provide
formal evidence of association.
Most sparse prior WGP models have been implemented using Markov chain Monte Carlo
(MCMC), which can be computationally expensive. Approximate analytical approaches based
on the expectation-maximization (EM) algorithm to provide approximate maximum a posteriori
(MAP) estimates of SNP effects have been developed to address computational limitations in
these sparse prior WGP models (Meuwissen et al. 2009; Hayashi and Iwata 2010; Sun et al.
2012; Chen and Tempelman 2015). Strategies for estimating/tuning hyperparameters for MAP
inference have been proposed, including those proposed by Karkkainen and Sillanpaa (2012),
Knürr et al. (2013) and Chen and Tempelman (2015), the latter adapting the average information
restricted maximum likelihood (AIREML) algorithm for estimating hyperparameters in BayesA
and SSVS specifications. These MAP implementations should also be assessed for their efficacy
in GWA studies.
A pragmatic first objective in GWA is to pinpoint narrow genomic regions containing QTL
rather than to specifically identify the QTL themselves, even though the latter is the ultimate
goal. That is, a large number of SNP markers in a region surrounding a typically untyped QTL
might be in high linkage disequilibrium (LD) with the QTL and with each other, thereby
thwarting precise inference on the causal QTL. Different GWA methods may differ in the
number of SNP markers inferred to have an association within a genomic region with, for
example, EMMAX tending to draw associations with more SNP markers in LD with a QTL
compared to use of SSVS (Guan and Stephens 2011; Goddard et al. 2016).
Increasingly, more GWA studies are based on inferences involving joint tests on all of the
SNP markers within a narrow genomic region, recognizing that single SNP marker associations
may be fraught with low statistical power or problems with multicollinearity or both (Fernando et
al. 2017). Some GWA studies have been based on using several arbitrary window sizes based on
either non-overlapping (Wolc et al. 2012; Moser et al. 2015; Wolc et al. 2016) or sliding
windows (Schmid and Yang 2008). Because of the arbitrariness of fixed window sizes, whether
defined by number of SNP markers or by physical length in base pairs, it is possible to split a
large LD block into 2 or more separate windows, thereby making such a division seemingly
suboptimal for GWA. Substantially different window lengths have been used in different
studies. For example, a 5 SNP window was used for GWA based on 51,385 SNP markers in
pigs (Fan et al. 2011), whereas a 250 kilobase (Kb) window was used for 287,854 SNPs from
the Wellcome Trust Case Control Consortium (WTCCC) human data (Wellcome Trust Case
Control Consortium 2007; Moser et al. 2015), and a 1 megabase (Mb) window was used for a
24,425 SNP marker panel in chickens (Wolc et al. 2012). Dehman et al. (2015) recently
proposed an approach to adaptively cluster windows of SNP markers of varying sizes based on
LD relationships. That is, they performed spatially constrained hierarchical clustering of SNPs
by minimizing a distance measure derived from Ward's criterion based on LD $r^2$ between SNP
markers. They surmised that this procedure would estimate a suitable specification of genomic
windows within each chromosome using a modified version of the gap statistic. This method has
been implemented in the R package BALD (Dehman and Neuvial 2015).
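
To make the fixed-window schemes described above concrete, the following is a minimal sketch of
how SNPs might be assigned to non-overlapping windows of a given physical length. The function
name `assign_fixed_windows`, its arguments, and the toy positions are hypothetical illustrations;
they are not taken from any of the cited studies or from the BALD package, and the LD-based
adaptive clustering of Dehman et al. (2015) is not attempted here.

```python
import numpy as np

def assign_fixed_windows(chrom, pos_bp, window_mb=1.0):
    """Label each SNP with a non-overlapping window of fixed physical length.

    chrom  : chromosome label for each SNP
    pos_bp : base-pair position for each SNP
    Returns one integer window id per SNP; a window never spans two chromosomes.
    """
    chrom = np.asarray(chrom)
    pos_bp = np.asarray(pos_bp, dtype=np.int64)
    width = int(window_mb * 1_000_000)
    # Index of the window within its chromosome (0 for the first `width` bp, etc.).
    bin_in_chrom = pos_bp // width
    # Turn each (chromosome, bin) pair into a single integer label.
    keys = np.array([f"{c}:{b}" for c, b in zip(chrom, bin_in_chrom)])
    _, window_id = np.unique(keys, return_inverse=True)
    return window_id

# Toy example: four SNPs on chromosome 1 and one SNP on chromosome 2, 1 Mb windows.
window_id = assign_fixed_windows(
    chrom=[1, 1, 1, 1, 2],
    pos_bp=[100_000, 900_000, 1_200_000, 2_900_000, 50_000],
    window_mb=1.0,
)
```

Because the window boundaries are fixed in advance, two SNPs in strong LD that happen to sit on
either side of a boundary receive different labels, which is the limitation that motivates
LD-based windows.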
I had three primary objectives in this study. One was to examine the potential benefits of
using sparser priors (i.e., BayesA and SSVS) relative to classical (i.e., based on normality)
random effects specifications and strategies for GWA under a wide range of simulated
architectures. A second objective was to assess whether the choice of different fixed genomic
window sizes (specifically 0.5, 1, 2, and 3 Mb), versus adaptively inferred window sizes based
on LD clustering, could impact GWA performance. A final objective was to assess the relative
merit of approximate MAP approaches to theoretically exact yet computationally intensive
MCMC approaches based on sparse prior specifications. Our assessments are based upon SNP
marker genotypes and actual and simulated phenotypes on F2 pigs deriving from a
Duroc-Pietrain cross.


3.3 Methods and materials
3.3.1 The hierarchical linear model
All analyses in this paper are based on a hierarchical linear model which can be characterized by
the classical mixed model specification:

$$y = X\beta + Zg + e \qquad [3.1]$$

Here y is an n x 1 vector of phenotypes, X is a known n x p incidence matrix connecting y to
the p x 1 vector of unknown fixed effects $\beta$, Z is a known n x m matrix of genotypes connecting
y to the m x 1 vector of unknown random SNP marker effects g, and e is the random error vector.
I also assume throughout that $e \sim N(0, I\sigma_e^2)$ whereas
$g \mid \sigma_g^2, D \sim N(0, D\sigma_g^2)$ for D being a
diagonal matrix of augmented data or variables (Chen and Tempelman 2015; Tempelman 2015).
The prior specification on these diagonal elements is used to distinguish each of the competing
models as described later.
For pedagogical reasons, I assume one record per individual although extensions to repeated
records per individual are possible. An equivalent genomic animal (i.e., subject) effects model
(VanRaden 2008) to Equation [3.1] can then be written as:

$$y = X\beta + a + e \qquad [3.2]$$

with a = Zg and all other terms defined previously as in [3.1] such that, conditionally on D,

$$\mathrm{var}(a) = \mathrm{var}(Zg) = Z\,\mathrm{var}(g)\,Z' = ZDZ'\sigma_g^2 \qquad [3.3]$$

If $m \gg n$, it is generally computationally more tractable to work with the linear mixed model
in Equation [3.2], along with the random effects specification in Equation [3.3], and then back
solve for the estimates of g, which are identical to those obtained using a linear mixed model
directly based on Equation [3.1] (Stranden and Garrick 2009).
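
The equivalence between Equations [3.1] and [3.2] can be checked numerically with a small
sketch. It assumes that D and the variance components are known, uses simulated toy genotypes,
and relies on the standard BLUP identity for back-solving; the back-solving details given in
Appendix A may be organized differently, and all variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 30, 200, 1                 # m >> n, as in typical WGP/GWA data
X = np.ones((n, p))                  # overall mean only
Z = rng.integers(0, 3, size=(n, m)).astype(float)   # toy genotypes coded 0/1/2
d = np.ones(m)                       # diagonal of D (D = I here)
sigma2_g, sigma2_e = 0.01, 1.0       # variance components treated as known
y = rng.normal(size=n)

# (a) SNP-effects model [3.1]: solve the (p + m) mixed model equations.
lhs = np.block([
    [X.T @ X / sigma2_e, X.T @ Z / sigma2_e],
    [Z.T @ X / sigma2_e, Z.T @ Z / sigma2_e + np.diag(1.0 / (d * sigma2_g))],
])
rhs = np.concatenate([X.T @ y, Z.T @ y]) / sigma2_e
sol = np.linalg.solve(lhs, rhs)
beta_snp, g_snp = sol[:p], sol[p:]

# (b) Animal model [3.2]-[3.3]: only n x n matrices need to be inverted.
G = (Z * d) @ Z.T * sigma2_g         # var(a) = Z D Z' sigma2_g
V = G + np.eye(n) * sigma2_e         # var(y)
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
resid = Vinv @ (y - X @ beta_gls)
a_hat = G @ resid                    # BLUP of a = Zg
g_back = sigma2_g * d * (Z.T @ resid)   # back-solved BLUP of g

print(np.allclose(g_snp, g_back), np.allclose(Z @ g_back, a_hat))
```

Route (a) solves a system of order p + m, whereas route (b) works with matrices of order n,
which is the computational motivation given above when m greatly exceeds n.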


3.3.2 Models
In the simplest model, which I denote as ridge regression (RR), there is no such data
augmentation (i.e., D = I), such that the elements of g are marginally distributed as independent
normal (de Los Campos et al. 2013). Sparser distributional specifications on g can be
constructed as mixtures of normal densities (Andrews and Mallows 1974) by simply specifying
prior distributions on functions of the diagonal elements of D. Suppose that
$D = \mathrm{diag}\{\tau_j\}_{j=1}^{m}$ with $\tau_j \sim \chi^{-2}(\nu_\tau, \nu_\tau)$; then it
can be demonstrated that, marginally, elements of g are identically
and independently distributed as a scaled Student t with scale parameter $\sigma_g^2$ and degrees
of freedom $\nu_\tau$ (Chen and Tempelman 2015). This model is typically referred to as BayesA
(Meuwissen et al. 2001). Alternatively, if
$D = \mathrm{diag}\{\tau_j + (1 - \tau_j)/c\}_{j=1}^{m}$ where
$\tau_j \sim \mathrm{Bernoulli}(\pi_\tau)$; $\tau_j = 0, 1$ and $c \gg 1$, then the resulting model
is Bayes SSVS in the spirit of George and McCulloch (1993). As a side note, I use c = 1000 for all
SSVS analyses in this paper.
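
The three specifications for D can be compared with a short simulation sketch. It assumes that
$\chi^{-2}(\nu_\tau, \nu_\tau)$ can be sampled as $\nu_\tau$ divided by a chi-square variate with
$\nu_\tau$ degrees of freedom, which is my reading of the parameterization; the function name
`draw_snp_effects` and the default value of `pi` are hypothetical choices made only for
illustration.

```python
import numpy as np

def draw_snp_effects(model, m, sigma2_g, rng, nu=2.5, pi=0.1, c=1000.0):
    """Draw m SNP effects g_j ~ N(0, d_j * sigma2_g) under RR, BayesA or SSVS.

    Returns (g, d), where d holds the diagonal elements of D.
    """
    if model == "RR":
        d = np.ones(m)                          # D = I, no augmentation
    elif model == "BayesA":
        d = nu / rng.chisquare(nu, size=m)      # assumed form of chi^{-2}(nu, nu)
    elif model == "SSVS":
        tau = rng.binomial(1, pi, size=m)       # 1 = slab, 0 = spike
        d = tau + (1 - tau) / c
    else:
        raise ValueError(f"unknown model: {model}")
    g = rng.normal(0.0, np.sqrt(d * sigma2_g))
    return g, d

rng = np.random.default_rng(0)
g_rr, d_rr = draw_snp_effects("RR", 1000, 0.01, rng)
g_ba, d_ba = draw_snp_effects("BayesA", 1000, 0.01, rng)
g_ss, d_ss = draw_snp_effects("SSVS", 1000, 0.01, rng)
```

Comparing histograms of `g_rr`, `g_ba`, and `g_ss` gives a quick visual impression of how the
heavy-tailed and spike-and-slab specifications differ from the Gaussian case.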
As a final stage in each of the competing hierarchical models (RR, BayesA, and SSVS), I
specify convenient conjugate priors wherever possible. For example, I specify scaled inverted
chi-square priors for the variance components; i.e.,

$$p(\sigma_e^2 \mid \nu_e, s_e^2) \propto (\sigma_e^2)^{-\left(\frac{\nu_e}{2}+1\right)} \exp\left(-\frac{\nu_e s_e^2}{2\sigma_e^2}\right) \qquad [3.4]$$

and

$$p(\sigma_g^2 \mid \nu_g, s_g^2) \propto (\sigma_g^2)^{-\left(\frac{\nu_g}{2}+1\right)} \exp\left(-\frac{\nu_g s_g^2}{2\sigma_g^2}\right) \qquad [3.5]$$

whereas I specify a Beta prior on $\pi_\tau$ in SSVS; i.e.,

$$p(\pi_\tau \mid \alpha_0, \beta_0) \propto \pi_\tau^{\alpha_0} (1 - \pi_\tau)^{\beta_0}. \qquad [3.6]$$

As I explain later, I arbitrarily specify $\nu_\tau$ as known ($\nu_\tau$ = 2.5), although
conceptually it could also be estimated (Yang et al. 2015b). I assume throughout that
$p(\beta) \propto 1$ as $p(\beta)$ is typically diffuse, although extensions to more informative
specifications should be obvious. Furthermore, for all analyses in this paper, I specify Gelman's
non-informative prior (Gelman 2006) for $\sigma_e^2$ in Equation [3.4] based on $\nu_e = -1$ and
$s_e^2 = 0$ and for $\sigma_g^2$ in Equation [3.5] based on $\nu_g = -1$ and $s_g^2 = 0$.
Furthermore, as per Yang and Tempelman (2012), I specify $\alpha_0 = 1$ and $\beta_0 = 9$.
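
As a brief check on what these settings imply, the following short derivation substitutes the
stated values into the scaled inverted chi-square form of Equation [3.4]; the change of variable
to the standard deviation is added here only for illustration.

```latex
% Substituting \nu_e = -1 and s_e^2 = 0 into Equation [3.4]:
p(\sigma_e^2 \mid \nu_e = -1, s_e^2 = 0)
  \propto (\sigma_e^2)^{-\left(-\frac{1}{2} + 1\right)} \exp(0)
  = (\sigma_e^2)^{-1/2}

% Change of variable from the variance to the standard deviation \sigma_e:
p(\sigma_e)
  = p(\sigma_e^2) \left| \frac{d\sigma_e^2}{d\sigma_e} \right|
  \propto \sigma_e^{-1} \cdot 2\sigma_e
  \propto 1
```

That is, these hyperparameter values correspond to a flat prior on the standard deviation; the
same argument applies to $\sigma_g^2$ in Equation [3.5].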
3.3.3 Joint posterior density
Given the specifications above, the joint posterior density can be written as:

$$p(\beta, g, \tau, \sigma_e^2, \sigma_g^2, \theta_\tau \mid y) \propto \left( \prod_{i=1}^{n} p(y_i \mid \beta, g, \sigma_e^2) \right) \left( \prod_{j=1}^{m} p(g_j \mid \sigma_g^2, \tau_j)\, p(\tau_j \mid \theta_\tau) \right) p(\beta)\, p(\sigma_g^2 \mid \nu_g, s_g^2)\, p(\sigma_e^2 \mid \nu_e, s_e^2)\, p(\theta_\tau) \qquad [3.7]$$

Note that $p(\tau_j \mid \theta_\tau)$ specifies the $\chi^{-2}(\nu_\tau, \nu_\tau)$ density under
BayesA (i.e., $\theta_\tau \equiv \nu_\tau$) whereas $p(\tau_j \mid \theta_\tau)$ specifies the
$\mathrm{Bernoulli}(\pi_\tau)$ density under SSVS (i.e., $\theta_\tau \equiv \pi_\tau$).
Furthermore, $p(g_j \mid \sigma_g^2, \tau_j)$ is Gaussian with null means under all three competing
models but with variance $\sigma_g^2 \tau_j$ under BayesA and variance
$\sigma_g^2 (\tau_j + (1 - \tau_j)/c)$ under SSVS. For RR, $\tau_j = 1\ \forall j$ such that
$p(g_j \mid \sigma_g^2, \tau_j = 1)$ is Gaussian with common variance $\sigma_g^2\ \forall j$.

3.3.4 Algorithms
3.3.4.1 Markov Chain Monte Carlo
The MCMC sampling strategies that I use here for BayesA are similar to those provided in
Yang and Tempelman (2012) and Yang et al. (2015b). However, since our parameterization is
slightly different, I present the full conditional densities of interest for implementing BayesA in
Appendix A. For similar reasons, I also provide the full conditional densities for SSVS in
Appendix A, as our model differs from the model also labeled as SSVS in the genomic
prediction work of Verbyla et al. (2009) whereas it is virtually identical to the model presented
in the seminal SSVS paper by George and McCulloch (1993).
3.3.4.2 Maximum a posteriori estimation
Complete details on our MAP procedure for both BayesA and SSVS are found in Chen and
Tempelman (2015). Given that our application involved m >> n, I conducted MAP based
inference on an equivalent subject-centric model using Equation [3.2] rather than based on a
SNP effects model as in Equation [3.1]. Details on backsolving from a subject-centric model to
provide estimates of SNP effects are provided in Appendix A. For pedagogical reasons,
however, I work directly from the SNP-effect Model [3.1] in our subsequent developments.
Conditional on D, the posterior variance-covariance matrix of g, or equivalently its prediction
error variance-covariance (PEV) matrix from a frequentist viewpoint, can be written as
$\mathrm{var}(g \mid y, D, \sigma_g^2, \sigma_e^2, \theta_\tau) = C^{gg|D}$. This expression can be
derived from the inverse of the mixed model coefficient matrix as:

$$\begin{bmatrix} C^{\beta\beta|D} & C^{\beta g|D} \\ C^{g\beta|D} & C^{gg|D} \end{bmatrix} = \begin{bmatrix} X'X\sigma_e^{-2} & X'Z\sigma_e^{-2} \\ Z'X\sigma_e^{-2} & Z'Z\sigma_e^{-2} + D^{-1}\sigma_g^{-2} \end{bmatrix}^{-1} \qquad [3.8]$$

That is, $C^{gg|D}$ is the random by random portion of the inverse coefficient matrix in
Henderson's mixed model equations, conditional on D, $\sigma_e^2$, $\sigma_g^2$ and $\theta_\tau$
($\theta_\tau \equiv \nu_\tau$ for BayesA or $\theta_\tau \equiv \pi_\tau$ for SSVS). As noted
earlier, values for hyperparameters such as $\sigma_e^2$, $\sigma_g^2$ and $\theta_\tau$
required for Equation [3.8] can be determined using the REML or marginal maximum likelihood
(MML) estimation strategies as described by Chen and Tempelman (2015), noting that I choose
to fix $\nu_\tau$ in BayesA as indicated earlier.
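
Equation [3.8] can be evaluated directly for a small problem, as in the sketch below. It assumes
that D and the variance components are known and that m is small enough for the full coefficient
matrix to be inverted; for m much greater than n one would instead work through the
subject-centric model. The function name `pev_block` and the toy data are hypothetical.

```python
import numpy as np

def pev_block(X, Z, d, sigma2_g, sigma2_e):
    """Return C^{gg|D}: the g-by-g block of the inverse MME coefficient matrix."""
    p = X.shape[1]
    coef = np.block([
        [X.T @ X, X.T @ Z],
        [Z.T @ X, Z.T @ Z + np.diag(sigma2_e / (d * sigma2_g))],
    ]) / sigma2_e
    return np.linalg.inv(coef)[p:, p:]

rng = np.random.default_rng(3)
n, m = 25, 8
X = np.ones((n, 1))
Z = rng.integers(0, 3, size=(n, m)).astype(float)
tau = np.array([1, 0, 0, 1, 0, 0, 0, 0])     # toy SSVS indicators
d = tau + (1 - tau) / 1000.0                 # diagonal of D with c = 1000
sigma2_g, sigma2_e = 0.05, 1.0

C_gg = pev_block(X, Z, d, sigma2_g, sigma2_e)
post_sd = np.sqrt(np.diag(C_gg))             # conditional posterior SD of each g_j
```

Conditional on D, each diagonal element of `C_gg` cannot exceed the corresponding prior variance
`d[j] * sigma2_g`, so SNPs assigned to the spike component have very small conditional
posterior variances.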


It can be readily demonstrated (Sorensen and Gianola 2002) that asymptotically
$\mathrm{MAP}(g) \approx E(g \mid y)$, whereby MAP(g) can be iteratively determined using EM based
on Newton-Raphson for maximization (M-) steps interwoven with expectation (E-) steps on elements
of D (Chen and Tempelman 2015). Under RR, D = I such that
$C^{gg} = C^{gg|D} \approx \mathrm{var}(g \mid y)$ represents the
posterior variance-covariance matrix of g conditional on $\sigma_e^2$, $\sigma_g^2$ and
$\theta_\tau$. In fact, MAP(g) is synonymous with BLUP(g) under RR. Furthermore, $C^{gg}$ is
synonymous with the g-component of the observed information matrix of the joint conditional
posterior density of $\beta$ and g. This posterior density is formally defined in Equation [3.9].

$$p(\beta, g \mid \sigma_e^2, \sigma_g^2, \theta_\tau, y) \propto \left( \prod_{i=1}^{n} p(y_i \mid \beta, g, \sigma_e^2) \right) \left( \prod_{j=1}^{m} \int_{R_{\tau_j}} p(g_j \mid \sigma_g^2, \tau_j)\, p(\tau_j \mid \theta_\tau)\, d\tau_j \right) \qquad [3.9]$$

With D = I, there is no uncertainty on $\tau_j$ such that the integration in Equation [3.9] is not
necessary, with $C^{gg}$ being directly obtainable for RR using Equation [3.8]. However, for BayesA
and SSVS, uncertainty in D needs to be integrated out as per Equation [3.9]. An indirect strategy
for asymptotically providing $C^{gg}$ for BayesA and SSVS is based on the strategy proposed by
Louis (1982) with details provided in Appendix A. I subsequently use elements of $C^{gg}$ to
asymptotically determine key components of $\mathrm{var}(g \mid y)$ for both single SNP and window
based GWA testing using MAP under all three models, noting again that MAP and BLUP are
synonymous under RR.
3.3.5 Conducting Genome Wide Association Analyses
3.3.5.1 Single SNP marker associations
I subsequently describe how I conducted GWA inference for single SNP associations based
on the algorithms (MCMC vs. MAP) and models (RR, BayesA, and SSVS). With respect to
48

inference on association on SNP j, EMMAX is conceptually based on subsetting out Equation
[3.1] as follows:

y ď˝ XÎ˛ ďŤ z j g j ďŤ Zď­ j g ď­ j ďŤ e

[3.10]

That is, Z is partitioned into column j, $z_j$, being the genotypes for SNP j, and all other
remaining columns in $Z_{-j}$. In EMMAX, $g_j$ is actually treated as fixed whereas $g_{-j}$ is treated as
classically random; i.e., characterized by a Gaussian prior distribution. Writing $W_j = [X \; z_j]$
and $V_{-j} = Z_{-j} Z_{-j}' \sigma_g^2 + I \sigma_e^2$, the generalized least squares (GLS) estimator $\hat{g}_j$ of $g_j$, using all
other markers to account for population structure, is the last element of the product
$(W_j' V_{-j}^{-1} W_j)^{-1} W_j' V_{-j}^{-1} y$. Furthermore, the corresponding standard error $\mathrm{se}(\hat{g}_j)$ is
determined by the square root of the last diagonal element of $(W_j' V_{-j}^{-1} W_j)^{-1}$. The test statistic
or "fixed effects" z-score for the EMMAX test can then be simply written as:

$$z_f = \frac{\hat{g}_j}{\mathrm{se}(\hat{g}_j)}$$

[3.11]

which is assumed to be N(0,1) under $H_0$: $g_j = 0$. The "expedited" approach (Kang et al.
2010) in EMMAX, which I consider in this paper, is based on approximating $V_{-j}$ with
$V = ZZ' \sigma_g^2 + I \sigma_e^2$ for inference of association on all SNP j = 1, 2, ..., m; furthermore, $\sigma_g^2$ and $\sigma_e^2$
are estimated only once using REML in an initial analysis that treats all SNP marker effects as
random. A GWA test for a particular SNP marker j using EMMAX then essentially involves
treating its effect jointly as both fixed and random by replacing $Z_{-j} g_{-j}$ with $Zg$ on the right
side of Equation [3.10], under the presumption that this double counting of $g_j$ as both fixed and
random should be trivial with large m.
A classical shrinkage or random effects test for $H_0$: $g_j = 0$ is based on treating all SNP
effects, including a marker j of particular interest, as having a Gaussian prior such that the point
estimate of the SNP substitution effect is based on fitting Equation [3.1] or, equivalently,
backsolving from fitting Equation [3.3] as demonstrated by Stranden and Garrick (2009) and also
in Appendix A. A corresponding test statistic ($z_r$) can be based on dividing $\tilde{g}_j$, the BLUP of $g_j$,
by the square root of its prediction error variance (PEV) where $\mathrm{PEV}(\tilde{g}_j) = \mathrm{var}(\tilde{g}_j - g_j)$ from a
frequentist perspective. From a Bayesian perspective, the corresponding test statistic can be
interpreted as a posterior z-score (Gelman et al. 2012) since $\tilde{g}_j$ is analogous to a posterior mean
(i.e., $\tilde{g}_j = E(g_j \mid y, \hat{\sigma}_e^2, \hat{\sigma}_g^2) \approx E(g_j \mid y)$) whereas the PEV is analogous to a posterior variance
with $\mathrm{PEV}(\tilde{g}_j) = \mathrm{var}(g_j \mid y, \hat{\sigma}_e^2, \hat{\sigma}_g^2) \approx \mathrm{var}(g_j \mid y)$. I refer to this inference strategy as
RR-BLUP. It is important to indicate, nevertheless, that these RR-BLUP inferences are empirical
Bayesian (Robinson 1991) since these posterior means and variances are typically conditioned
upon REML estimates of $\sigma_e^2$ and $\sigma_u^2$. The posterior z-score (Gelman et al. 2012) can then be
equivalently derived from both frequentist and Bayesian perspectives as indicated in Equation
[3.12].

$$z_r = \frac{\tilde{g}_j}{\sqrt{\mathrm{PEV}(\tilde{g}_j)}} = \frac{E(g_j \mid y)}{\sqrt{\mathrm{var}(g_j \mid y)}}$$

[3.12]

Now Gualdron Duarte et al. (2014), with a proof provided later by Bernal Rubio et al. (2016),
determined that the "fixed effects" or EMMAX z-score, $z_f$ in Equation [3.11], could be
equivalently derived by treating all markers as classically random, but by dividing the
corresponding BLUP $\tilde{g}_j$ for marker j by the square root of its frequentist definition of variance
$\mathrm{var}(\tilde{g}_j)$ as characterized by classical mixed model theory (Searle et al. 1992) in Equation
[3.13].

$$\mathrm{var}(\tilde{g}_j) = \mathrm{var}(g_j) - \mathrm{PEV}(\tilde{g}_j) = \sigma_g^2 - \mathrm{PEV}(\tilde{g}_j)$$

[3.13]

In other words, one can rewrite the fixed effects test in both its frequentist
(numerator = $\tilde{g}_j$) and Bayesian (numerator = $E(g_j \mid y)$) representations as in Equation [3.14].

$$z_f = \frac{\tilde{g}_j}{\sqrt{\sigma_g^2 - \mathrm{PEV}(\tilde{g}_j)}} = \frac{E(g_j \mid y)}{\sqrt{\sigma_g^2 - \mathrm{var}(g_j \mid y)}}$$

[3.14]

Note that the computation using Equation [3.14] is far more tractable than that implied by
Equation [3.11]. That is, Equation [3.14] only requires computing the BLUP of g and its
corresponding PEV in a single step for all m tests, whereas Equation [3.11] implies m
different mixed model analyses, each one in turn explicitly treating a different SNP marker effect
as fixed.
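
The equivalence between Equations [3.11] and [3.14] can be checked numerically. The following is only a minimal sketch, assuming NumPy, known variance components, and the expedited V = ZZ'sigma_g^2 + I sigma_e^2 (all markers included); the function names are hypothetical and are not taken from any software used in this work.

```python
import numpy as np

def emmax_z_by_gls(y, X, Z, sigma2_g, sigma2_e):
    """z_f as in Equation [3.11]: one GLS fit per SNP, with V built from all markers."""
    n, m = Z.shape
    V_inv = np.linalg.inv(Z @ Z.T * sigma2_g + np.eye(n) * sigma2_e)
    z_f = np.empty(m)
    for j in range(m):
        W = np.column_stack([X, Z[:, j]])        # W_j = [X z_j]
        C = np.linalg.inv(W.T @ V_inv @ W)       # (W_j' V^-1 W_j)^-1
        g_hat = (C @ W.T @ V_inv @ y)[-1]        # last element is the SNP effect
        z_f[j] = g_hat / np.sqrt(C[-1, -1])
    return z_f

def z_scores_from_blup(y, X, Z, sigma2_g, sigma2_e):
    """z_f as in Equation [3.14] and z_r as in Equation [3.12], for all SNPs in one pass."""
    n = Z.shape[0]
    V_inv = np.linalg.inv(Z @ Z.T * sigma2_g + np.eye(n) * sigma2_e)
    VX = V_inv @ X
    P = V_inv - VX @ np.linalg.inv(X.T @ VX) @ VX.T        # removes the fixed effects
    g_blup = sigma2_g * (Z.T @ P @ y)                      # BLUP of g
    var_blup = sigma2_g**2 * np.sum(Z * (P @ Z), axis=0)   # diagonal of var(BLUP)
    pev = sigma2_g - var_blup                              # Equation [3.13] rearranged
    z_f = g_blup / np.sqrt(sigma2_g - pev)
    z_r = g_blup / np.sqrt(pev)
    return z_f, z_r
```

The first function loops over m separate GLS fits whereas the second needs a single pass, which illustrates the computational argument made above; in this sketch the two versions of z_f agree to numerical precision.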
I perceive no computationally tractable "fixed effects" test analogous to EMMAX that I could
adapt for MAP based on sparser priors (e.g., BayesA and SSVS). For BayesA, for example, this
would entail treating the marker of interest j as fixed with all other markers treated as scaled
Student t distributed. However, a posterior or random effects z-score test can be constructed
using the MAP estimate of $g_j$ as the numerator and its asymptotic posterior standard error as the
denominator, noting that MAP and the posterior mean of g should approach each other
asymptotically. Details on deriving those asymptotic standard errors (i.e., based on deriving $C^{gg}$)
for use in Equation [3.12] for these sparse prior specifications are provided in Appendix A; I
refer to these two corresponding GWA inference strategies as MAP-BayesA and MAP-SSVS.
For SSVS based single SNP inferences using MCMC, I based inferences on the PPA for SNP
marker j (i.e., PPAj) as in Equation [3.15].

$$\mathrm{PPA}_j = \frac{\sum_{l=1}^{N} \tau_{j(l)}}{N}$$

[3.15]

Here N denotes the number of MCMC cycles saved for posterior inference and $\tau_{j(l)}$ is a
binary draw from the full conditional distribution of $\tau_j$ at MCMC cycle l. I denote this GWA
method as MCMC-SSVS.
Since there is no variable selection inherent with BayesA under MCMC, I based single SNP
inferences on a Bayesian analog to a P-value using

$$\hat{p}_j = 2 \min \left( \frac{\sum_{l=1}^{N} I(g_{j(l)} > 0)}{N}, \; 1 - \frac{\sum_{l=1}^{N} I(g_{j(l)} > 0)}{N} \right)$$

[3.16]

(Bello et al. 2010) where $g_{j(l)}$ is the draw of $g_j$ at MCMC cycle l and the indicator variable
$I(\cdot)$ = 1 if the condition within the argument is true and 0 otherwise. I denote this particular
GWA method as MCMC-BayesA.
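
Both MCMC summaries in Equations [3.15] and [3.16] reduce to simple proportions over the saved draws. A minimal sketch follows, assuming the draws for one marker are already available as plain Python lists; the function names are hypothetical.

```python
def ppa(tau_draws):
    """Equation [3.15]: proportion of saved cycles in which the binary draw equals 1."""
    return sum(tau_draws) / len(tau_draws)

def bayesian_p_value(g_draws):
    """Equation [3.16]: twice the smaller posterior tail area around zero."""
    frac_positive = sum(1 for g in g_draws if g > 0) / len(g_draws)
    return 2 * min(frac_positive, 1 - frac_positive)
```

For example, a marker whose saved effect draws are positive in 75% of cycles would have a Bayesian P-value analog of 2 x 0.25 = 0.5.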

3.3.5.2 Windows based associations
Window-based extensions to all of the above tests were also developed, some building on
results presented above. Suppose that window k, k = 1, 2, 3, ..., K, contains $n_k$ SNP markers such
that Z can be partitioned accordingly into $Z = [Z_1 \; Z_2 \; \cdots \; Z_K]$ with $Z_k$ having $n_k$ columns.
Similarly, the vector g is partitioned accordingly; i.e., $g = [g_1' \; g_2' \; \cdots \; g_K']'$ such that $g_k$ is of
dimension $n_k$ x 1. Recall that I denoted $C^{gg} = \mathrm{PEV}(\tilde{g})$. For the proposed windows-based test,
the key components of $C^{gg}$ can be partitioned into K different blocks along the block diagonal;
i.e., $C_1^{gg}, C_2^{gg}, \ldots, C_K^{gg}$, where $C_k^{gg}$ is of dimension $n_k$ x $n_k$. The extension to a joint "fixed
effects" or EMMAX-like test on the $n_k$ markers in window k involves the following extension of
Equation [3.14].

$$\chi_f^2 = \tilde{g}_k' \left( I_{n_k} \sigma_g^2 - C_k^{gg} \right)^{-1} \tilde{g}_k$$

[3.17]

That is, it can be readily demonstrated, extending results from Bernal Rubio et al. (2016), that
$\chi_f^2$ is chi-square distributed with $n_k$ degrees of freedom under $H_0$: $g_k = 0$. The corresponding
extension to a joint classical "random effects" or RRBLUP test on window k is provided in
Equation [3.18]

$$\chi_r^2 = \tilde{g}_k' \left( C_k^{gg} \right)^{-1} \tilde{g}_k$$

[3.18]

which would also be considered to be chi-square distributed with $n_k$ degrees of freedom under
$H_0$: $g_k = 0$. Similarly, one could use Equation [3.18] to construct the same tests for MAP-BayesA
and MAP-SSVS but basing $C^{gg}$ on the corresponding asymptotic posterior variance-covariance
matrices as derived in Appendix A.
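
The two window statistics in Equations [3.17] and [3.18] are quadratic forms in the window's estimated effects. The sketch below assumes NumPy and that the window's effect estimates and its PEV block are already available; the function name is hypothetical and the chi-square tail probability lookup is deliberately left out.

```python
import numpy as np

def window_chi2(g_k, C_k, sigma2_g):
    """Return (chi2_f, chi2_r, df) for one window.

    g_k      : estimated effects for the n_k markers in the window
    C_k      : n_k x n_k block of the PEV (or asymptotic posterior covariance) matrix
    sigma2_g : SNP effect variance
    """
    g_k = np.asarray(g_k, dtype=float)
    C_k = np.asarray(C_k, dtype=float)
    n_k = g_k.shape[0]
    var_g_k = np.eye(n_k) * sigma2_g - C_k           # frequentist var of the estimates
    chi2_f = g_k @ np.linalg.solve(var_g_k, g_k)     # Equation [3.17]
    chi2_r = g_k @ np.linalg.solve(C_k, g_k)         # Equation [3.18]
    return chi2_f, chi2_r, n_k
```

With a single marker in the window (n_k = 1), the two statistics reduce to the squares of the z-scores in Equations [3.14] and [3.12], respectively.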
For windows based inference using MCMC-SSVS, I simply compute the PPA for window k
(i.e., PPAk) in Equation [3.19], following that presented in Fernando et al. (2017).

$$\mathrm{PPA}_k = \frac{\sum_{l=1}^{N} I\left( \sum_{j=1}^{n_k} \tau_{kj(l)} > 0 \right)}{N}$$

[3.19]

Here, $\tau_{kj(l)}$ defines a binary draw from the full conditional distribution of $\tau_j$ for SNP marker j
located within window k drawn during MCMC cycle l. Note then that $I\left( \sum_{j=1}^{n_k} \tau_{kj(l)} > 0 \right)$ is equal to
1 when any of the draws of $\tau_{kj(l)}$ within window k are equal to 1.
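
Equation [3.19] can be sketched in a few lines, assuming the saved binary draws for the markers in one window are stored as one list per MCMC cycle; the function name is hypothetical.

```python
def window_ppa(tau_draws_by_cycle):
    """Equation [3.19]: proportion of cycles with at least one marker in the window included.

    tau_draws_by_cycle : list of N lists, each holding the n_k binary draws for one cycle
    """
    hits = sum(1 for draws in tau_draws_by_cycle if sum(draws) > 0)
    return hits / len(tau_draws_by_cycle)
```

Because a window counts as included whenever any of its markers is included, the window PPA can never be smaller than the largest single SNP PPA within that window.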
For windows based GWA inference under MCMC-BayesA, I propose inferring upon the
posterior probability that the proportion ($q_k$) of the genetic variance explained by the markers in
a genomic window, relative to the total genetic variance, exceeds a specified threshold, as
proposed by Fernando and Garrick (2013) and determined in the following manner. First note
that the genotypic value that is attributed to a genomic window k is defined as in Equation [3.20].

$$a_k = Z_k g_k$$

[3.20]

Then the variance explained by the window is defined as

$$\sigma_{a_k}^2 = \frac{a_k' a_k}{n} - \left( \frac{1_n' a_k}{n} \right)^2$$

[3.21]

where the variance is taken over the n animals, given that $a_k$ is of dimension n x 1. Similarly,
the total genetic variance, based on $a = Zg$, is computed as

$$\sigma_a^2 = \frac{a' a}{n} - \left( \frac{1_n' a}{n} \right)^2$$

[3.22]

Hence, the proportion of genetic variance that is explained by the markers in window k is defined
as

$$q_k = \frac{\sigma_{a_k}^2}{\sigma_a^2}$$

[3.23]

Suppose that I deem genomic windows that explain more than 1% of the total genetic
variance as being of potential interest. Hence, a variable selection modification of MCMC-BayesA
can simply be based on the proportion of MCMC samples for which the proportion of genetic
variance ($q_k$) for window k exceeds 0.01 (Fernando and Garrick 2013). One advantage of this
approach is that it can be applied to any MCMC analysis based on a model where variable
selection is not explicitly specified.
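
The window variance calculation in Equations [3.20] to [3.23] can be sketched as follows. This is an illustration only, assuming variances are taken across animals, that genotypes are held as a list of rows, and that one vector of marker effects is saved per MCMC cycle; all function names are hypothetical.

```python
def _variance(values):
    """Variance across animals: mean of squares minus squared mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(v * v for v in values) / n - mean ** 2

def window_variance_proportion(Z, g, window_cols):
    """q_k for one MCMC sample g: window variance over total genetic variance."""
    a = [sum(z * gj for z, gj in zip(row, g)) for row in Z]       # a = Zg
    a_k = [sum(row[j] * g[j] for j in window_cols) for row in Z]  # a_k = Z_k g_k
    return _variance(a_k) / _variance(a)

def window_posterior_probability(Z, g_samples, window_cols, threshold=0.01):
    """Proportion of saved samples in which q_k exceeds the threshold."""
    hits = sum(
        1 for g in g_samples
        if window_variance_proportion(Z, g, window_cols) > threshold
    )
    return hits / len(g_samples)
```

The threshold defaults to the 1% value used in the text; any other value is a choice left to the analyst.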

3.4 Data
3.4.1 Simulation Study
In order to compare the various models (RR, BayesA, and SSVS) and algorithms (MAP vs.
MCMC), I simulated data based on the Michigan State University Pig Resource Population
(MSUPRP) raised at the Michigan State University Swine Teaching and Research Farm, East
Lansing, MI (Edwards et al. 2008). I specifically started with the SNP markers chosen for
analysis by Gualdron Duarte et al. (2014) which included 928 Duroc-Pietrain F2 crosses.
Roughly 1/3 of these pigs were directly genotyped using the Illumina Porcine SNP60 beadchip
(60K) whereas the remaining F2 animals were genotyped using a lower density 9K set, with their
genotypes subsequently imputed to the 60K set (Gualdron Duarte et al. 2013). Edits
excluded animals with more than 10% of their SNP markers missing, SNP markers
with more than 10% of animals missing genotypes, and SNPs with
minor allele frequency (MAF) below 0.01 (Gualdron Duarte et al. 2014). Some adjacent
markers were in complete LD with each other. To circumvent multicollinearity issues,
particularly its role in generating multimodality in the MCMC generated posterior densities for
some SNP markers (Calus et al. 2015), I randomly deleted one SNP within each adjacent pair in
complete LD before further analyses. After invoking this edit, 43,266 SNPs
remained. The original data source can be downloaded from
https://msu.edu/~steibelj/JP_files/GBLUP.html.
To simulate different but representative genetic architectures, I generated QTL effects from
three different Gamma densities with demonstrably different values of shape ($\gamma$), ranging from an
effectively oligogenic density ($\gamma$ = 0.18), which specifies relatively few QTL
with large effects, to an effectively polygenic Gaussian density ($\gamma$ = 3.00) where most QTL have
intermediate effects with symmetrically small and large effects on either side. A third
intermediate value ($\gamma$ = 1.48) was also chosen. The gamma density of QTL
effects based on these three different specifications for $\gamma$ is illustrated in Figure 3.1. Note that this
range in $\gamma$ values for QTL effects has been reported for various traits in livestock based on
previous empirical work (Hayes and Goddard 2001).

Figure 3.1 Distribution of quantitative trait loci effects under a Gamma distribution for
different specifications of shape (magenta curve $\gamma$ = 0.18, blue curve $\gamma$ = 1.48 and red curve
$\gamma$ = 3.00)
In addition to the distribution of QTL effects, I conjectured that the number of QTL (nqtl)
may also influence GWA performance such that I considered nqtl = 30, 90, or 300. Hence, I
simulated 10 replicated populations under each of the 3 x 3 = 9 different scenarios pertaining to
the 3 different values for each of $\gamma$ and nqtl. Each of the 90 simulated datasets was based on
the 43,266 SNP marker genotypes on the n = 922 MSUPRP F2 pigs as previously
described. Within each dataset, allelic substitution effects, gqtl, were simulated for each of the
nqtl randomly chosen SNP markers from the corresponding gamma distribution having shape $\gamma$,
with a randomly chosen half of those effects multiplied by -1 as per Meuwissen et al. (2001).
The corresponding genotypes Zqtl for QTL on these animals were then an n x nqtl subset of the
SNP genotype matrix Z such that the cumulative genetic merit or true breeding values were
determined as uTRUE = Zqtl gqtl. Phenotypes for animals were generated based on a heritability of
0.45 as estimated for 13th-week tenth rib backfat from this same dataset. Only the remaining (i.e.,
non-QTL) marker genotypes Z-qtl were used for all simulation study analyses.
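
The simulation of QTL effects and phenotypes described above can be sketched with the Python standard library alone. Several details here are assumptions rather than statements of what was actually done: the Gamma scale parameter (set to 1 by default), the use of the sample variance of the true breeding values to set the residual variance, and all function names.

```python
import random

def simulate_qtl_effects(n_qtl, shape, scale=1.0, rng=random):
    """Draw n_qtl Gamma(shape, scale) effects and flip the sign of a random half."""
    effects = [rng.gammavariate(shape, scale) for _ in range(n_qtl)]
    negative = set(rng.sample(range(n_qtl), n_qtl // 2))
    return [-e if i in negative else e for i, e in enumerate(effects)]

def simulate_phenotypes(Z_qtl, g_qtl, h2=0.45, rng=random):
    """True breeding values u = Z_qtl g_qtl plus Gaussian noise scaled to heritability h2."""
    u = [sum(z * g for z, g in zip(row, g_qtl)) for row in Z_qtl]
    mean_u = sum(u) / len(u)
    var_u = sum((v - mean_u) ** 2 for v in u) / len(u)
    sigma_e = (var_u * (1 - h2) / h2) ** 0.5     # so that var_u / (var_u + sigma_e^2) = h2
    y = [v + rng.gauss(0.0, sigma_e) for v in u]
    return u, y
```

Passing a seeded `random.Random` instance as `rng` makes a replicate reproducible, for example `simulate_qtl_effects(30, 0.18, rng=random.Random(1))`.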
In the simulation study, all parameters excluding $v_\tau$ in BayesA were estimated using both
MCMC and MAP. For MCMC, I ran 200,000 iterations, discarding the first 100,000 iterations as
burn-in and basing inference on saving every 10th of the remaining 100,000 cycles for a total of
10,000 samples from the posterior density. Using MAP, estimation of variance components ($\theta$)
for BayesA and SSVS was based on a convergence criterion of

$$\frac{[\hat{\theta}^{(k)} - \hat{\theta}^{(k-1)}]' [\hat{\theta}^{(k)} - \hat{\theta}^{(k-1)}]}{[\hat{\theta}^{(k)}]' [\hat{\theta}^{(k)}]} < 10^{-6}.$$
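
The convergence criterion above is a relative squared change between successive iterates. A minimal sketch, with a hypothetical function name and the parameter vectors held as plain lists:

```python
def has_converged(theta_new, theta_old, tol=1e-6):
    """Relative squared change between successive estimates, compared with tol."""
    change = sum((a - b) ** 2 for a, b in zip(theta_new, theta_old))
    size = sum(a * a for a in theta_new)
    return change / size < tol
```

Dividing by the squared length of the current estimate makes the rule insensitive to the overall scale of the variance components.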

Based on our previous experience (Chen and Tempelman, 2015), I recognized that the
specification of starting values in MAP-SSVS and MAP-BayesA was important for genomic
prediction accuracy and, hence, likely important for GWA inferences as well. Strategies for
specifying starting values for $\sigma_g^2$, $\sigma_e^2$, g and $\tau$ may pragmatically involve using REML and
RRBLUP inferences as in Chen and Tempelman (2015) since RRBLUP is not computationally
intensive. For MAP-BayesA, starting values were based on REML estimates $\hat{\sigma}_{g(REML)}^2$ and
$\hat{\sigma}_{e(REML)}^2$ using

$$\sigma_{g(0)}^2 = \frac{\nu_g - 2}{\nu_g} \hat{\sigma}_{g(REML)}^2$$

for $\sigma_g^2$, $g_0 = \mathrm{BLUP}(g) = \{ g_{0j} \}_{j=1}^{m}$ for g, and

$$\tau_{(0)j} = \frac{g_{0j}^2 / \sigma_{g(0)}^2 + \nu_g}{\nu_g - 1}$$

for $\tau_j$, j = 1, 2, ..., m, based on the posterior expectation derived from its full
conditional density. For MAP-SSVS, the corresponding starting values were

$$\sigma_{g(0)}^2 = \frac{\hat{\sigma}_{g(REML)}^2}{\pi_0}$$

for $\sigma_g^2$ with the starting value $\pi_{\tau(0)}$ for $\pi_\tau$ based, in turn, on starting values for $\tau_j$ (i.e.,
SNP-specific PPA) which were determined in the following manner. First of all, EMMAX-based
P-values for each SNP were converted to local false discovery rate (lFDR) estimates using the R
package ashr (Stephens 2017). It has been demonstrated that these lFDR estimates, in turn,
can be used to approximate PPA using PPA ≈ 1 - lFDR (Stephens 2017). These approximate PPA
values were then chosen as the starting values for $\tau_j$ in MAP-SSVS. In turn, these starting values
for $\tau_j$ were used to derive the starting value for $\pi_0$ in MAP-SSVS using the posterior expectation
from its full conditional density, i.e.,

$$\pi_0 = \frac{\alpha_0 + \sum_{j=1}^{m} \tau_j}{\alpha_0 + \beta_0 + m}.$$

Upon convergence of variance components using the AIREML procedure outlined in Chen and
Tempelman (2015), convergence of MAP-based solutions to g was based on the same criterion.
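
The starting value formulas above are simple enough to sketch directly. The REML estimates, RRBLUP solutions, and lFDR values are taken as given inputs (the lFDR values would come from the R package ashr, which is not called here); the hyperparameter values and function names are illustrative assumptions.

```python
def bayesa_start(sigma2_g_reml, g_blup, nu_g):
    """Starting values for MAP-BayesA: scaled REML variance and tau_j from its posterior mean."""
    sigma2_g0 = (nu_g - 2) / nu_g * sigma2_g_reml
    tau0 = [(g * g / sigma2_g0 + nu_g) / (nu_g - 1) for g in g_blup]
    return sigma2_g0, tau0

def ssvs_start(sigma2_g_reml, lfdr, alpha0, beta0):
    """Starting values for MAP-SSVS: tau_j approximated by 1 - lFDR, then pi_0 and sigma2_g."""
    tau0 = [1.0 - value for value in lfdr]
    pi0 = (alpha0 + sum(tau0)) / (alpha0 + beta0 + len(tau0))
    sigma2_g0 = sigma2_g_reml / pi0
    return sigma2_g0, pi0, tau0
```

Note that the BayesA formulas require nu_g greater than 2 for the scaled variance to be positive.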
Single SNP marker inferences were based on the procedures outlined previously; i.e., for MAP
by comparing $z_r$ in Equation [3.12] for the random effects tests for RRBLUP, MAP-BayesA, and
MAP-SSVS and $z_f$ for the EMMAX test in Equation [3.11] to a standard normal distribution.
Furthermore, the estimates of PPA and Bayesian P-values provided in Equations [3.15] and
[3.16] were respectively used for GWA under MCMC-SSVS and MCMC-BayesA. Since the
remaining genotypes Z-qtl did not include the simulated QTL, a SNP marker was declared a true
positive if a QTL was located between that marker and its closest SNP neighbor on either side.
Window based inference was based on the procedures outlined previously; i.e., for MAP by
computing $\chi_r^2$ in Equation [3.18] for the random effects tests using RRBLUP, MAP-BayesA, and
MAP-SSVS and $\chi_f^2$ for the fixed effects test in Equation [3.17] under EMMAX. These test
statistics were compared to a chi-square distribution with $n_k$ degrees of freedom. Furthermore,
GWA was based on the posterior probability that $q_k > 0.01$, with $q_k$ as provided in Equation
[3.23], for MCMC-BayesA and on the PPA for MCMC-SSVS as provided in Equation [3.19].
For windows-based inference, four alternative fixed window sizes were chosen: 0.5, 1, 2, or 3
Mb. The genome map used was the Sus Scrofa build 10.2
(http://www.ensembl.org/Sus_scrofa/Info/Index). Also, as per Moser et al. (2015), two different
within-chromosome starting positions (starting at location 0 or 0.25 Mb for window size 0.5 Mb;
starting at 0 or 0.5 Mb for window size 1 Mb; starting at 0 or 1 Mb for window
size 2 Mb; and starting at 0 or 1.5 Mb for window size 3 Mb) for each chromosome
were chosen to partly counteract the chance effect of different LD patterns being associated with
non-overlapping windows. Finally, adaptive window sizes based on clustering SNP by LD $r^2$
were also determined using the BALD R package (Dehman and Neuvial 2015) using the
procedure described by Dehman et al. (2015).
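
Assigning markers to fixed windows with a shifted starting position can be sketched with integer arithmetic on base pair positions. How markers located before a non-zero starting position were handled is not stated above; this sketch assumes they fall into a leading partial window with index -1, and the function name is hypothetical.

```python
def assign_windows(positions_bp, window_bp, offset_bp=0):
    """Window index for each marker on one chromosome.

    Windows are [offset + i*window, offset + (i+1)*window); markers located before
    the offset get index -1 (an assumed leading partial window).
    """
    return [(pos - offset_bp) // window_bp for pos in positions_bp]
```

Running the function twice, once with `offset_bp=0` and once with half the window size, gives the two window configurations over which results are later averaged.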
The relative performance of all methods and models was based on receiver operating
characteristic (ROC) curves. In a ROC curve, the true positive rate (TPR) is plotted against the
false positive rate (FPR) for each competing method (Metz 1978). I was more specifically
interested in the partial area under the curve up until a FPR of 5% (pAUC05) so as to not
include somewhat irrelevant ROC regions with low levels of specificity (Ma et al. 2013). A
perfect classifier would have a pAUC05 of 0.05 x 1 = 0.05 whereas a random classifier would
have a pAUC05 of $0.05^2/2$ = 0.00125. I subsequently rescaled all pAUC05 measures by $0.00125^{-1}$
such that a random classifier is rescaled to a relative pAUC05 = 1 and a perfect classifier to a
relative pAUC05 = 40. I used the R package ROCR
(Sing et al. 2005) to obtain replicate-specific ROC curves and pAUC05 for each of the 10
replicated datasets for each method and window specification within each nqtl and $\gamma$
combination. For each window specification, specific comparisons between methods were based
on using the logarithm of pAUC05 as the response variable in a mixed model ANOVA with
methods, nqtl and $\gamma$ and all of their interactions included as fixed effects and population replicate
(nested within nqtl and $\gamma$) as a random effect blocking factor. For windows-based inferences
based on fixed window sizes, replicate-specific pAUC05 values were averaged over the two
different starting positions as previously noted. Mean log(pAUC05) estimates were
backtransformed (i.e., anti-logged) to the original scale for reporting. Overall marginal means
were separated using Tukey's test whereas comparisons between methods were sliced out using
ANOVA t-tests for each value of nqtl or $\gamma$ if the corresponding interaction between these factors
and methods was significant (P<0.05). I also conjectured that window size might actually
influence the power of detecting QTL using the same method; therefore, I conducted separate
tests comparing pAUC05 for each of the different window sizes, including adaptively chosen
windows based on BALD, separately within each method.
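
The relative pAUC05 measure can be illustrated with a small standalone calculation. The analyses above used the R package ROCR; the following pure Python function is only a stand-in sketch with a hypothetical name, using trapezoidal integration with linear interpolation at the FPR cutoff.

```python
def relative_pauc(scores, labels, max_fpr=0.05):
    """Partial ROC area up to max_fpr, divided by the random classifier area max_fpr^2 / 2.

    scores : larger values mean stronger evidence of association
    labels : 1 for true positives, 0 for true negatives
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i]:
            tp += 1
        else:
            fp += 1
        # Add a ROC point only after the last member of a group of tied scores.
        last_of_tie = rank == len(order) - 1 or scores[order[rank + 1]] != scores[i]
        if last_of_tie:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    area = 0.0
    for k in range(1, len(fpr)):
        x0, x1, y0, y1 = fpr[k - 1], fpr[k], tpr[k - 1], tpr[k]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:                      # clip the segment at the FPR cutoff
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / (max_fpr ** 2 / 2)
```

On this scale a classifier with no discriminating ability scores about 1 and a perfect classifier scores 40, matching the rescaling described above.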
3.4.2 MSUPRP data
I also compared all models and algorithms on 13-week tenth rib backfat (mm) within the
MSUPRP data as per Gualdron Duarte et al. (2014). Sex, contemporary group, and age at
slaughter were treated as fixed effects (i.e., $\beta$). I compared each of the six competing methods,
computing either PPA or P-values in the same manner as in the simulation study. For
MCMC-BayesA and MCMC-SSVS, I ran a total of 1 million MCMC iterations based on 500,000 burn-in
iterations and 500,000 iterations post burn-in, saving every 10th iteration such that posterior
inference was based on 50,000 random draws from the posterior distribution. Since I did not
know the true positions of the causal QTL for this trait, GWA inferences were compared
between the various methods based on PPA for MCMC-BayesA and MCMC-SSVS, P-values
for RRBLUP, MAP-BayesA, and MAP-SSVS, and Bonferroni adjusted P-values for EMMAX.
Note that no adjustment for multiple testing was invoked for P-values determined using the
shrinkage based procedures (RRBLUP, MAP-BayesA, and MAP-SSVS) as per Gelman et al.
(2012) whereas a Bonferroni adjustment based on the number of markers, or number of genomic
windows for windows-based analyses, was invoked for EMMAX.
3.5 Results
3.5.1 Simulation Study
Overall mean comparisons between methods for pAUC05 based on single SNP inferences are
provided in Table 3.1, noting that two-way interactions were not detected (P > 0.05) between
methods with $\gamma$ or with nqtl. There was no evidence of a sizeable difference between any of the
methods given that pAUC05 ranged from 2.52 to 2.77 times that for a random classifier,
although MCMC-BayesA did rank lowest.

Table 3.1 Overall mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods on single SNP associations

Methods        MCMC-SSVS   MCMC-BayesA   EMMAX   MAP-SSVS   MAP-BayesA   RRBLUP
Mean pAUC05    2.61a,b     2.52b         2.77a   2.69a,b    2.76a        2.73a

Values not sharing the same letter have different (P <0.05) relative pAUC05. Mean estimates
based on 10 replicates per each of 9 populations of 3 x 3 factorial on number (30, 90, or 300) of
quantitative trait loci (QTL), and shape parameter (0.18, 1.48, or 3.00) for Gamma distribution of
QTL effects.
For fixed 1Mb window sizes (Table 3.2), the two-way interactions between method and $\gamma$ and
between method and nqtl were both significant (P < 0.0001). Therefore, methods were compared
separately for each different value of $\gamma$ and of nqtl. Nevertheless, MCMC-SSVS and
MCMC-BayesA had the largest pAUC05 (P < 0.05) for each different value of $\gamma$ and of nqtl as well as
overall. EMMAX generally followed MCMC-SSVS and MCMC-BayesA with MAP-SSVS,
MAP-BayesA and RRBLUP being the worst performing methods. Most notably, these latter
three methods generally did worse than a random classifier (i.e., pAUC05 < 1) except for
MAP-SSVS at nqtl = 30.

Table 3.2 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for associations based on genomic windows of length 1Mb. Comparisons are made
within different specifications of shape parameter ($\gamma$) for Gamma distribution of quantitative trait
loci (QTL) and number of QTL (nqtl)

Factor          Level   MCMC-SSVS   MCMC-BayesA   EMMAX   MAP-SSVS   MAP-BayesA   RRBLUP
Shape $\gamma$  0.18    2.82a       2.75a         1.78b   0.74c,*    0.63c,*      0.48d,*
                1.48    4.22a       4.16a         2.54b   0.69c,*    0.38d,*      0.28e,*
                3.00    4.63a       5.01a         2.47b   0.67c,*    0.40d,*      0.24e,*
nqtl            30      6.81a       7.28a         3.63b   1.69c      0.67d,*      0.31e,*
                90      3.89a       3.78a         2.14b   0.61c,*    0.47c,d,*    0.37d,*
                300     2.08a       2.08a         1.42b   0.33c,*    0.31c,*      0.29c,*
Overall                 3.81a       3.86a         2.23b   0.70c,*    0.46d,*      0.32e,*

Values not sharing the same letter within a row have different (P <0.05) relative pAUC05 within
the row. * indicates the corresponding method is not better than a random classifier. Mean
estimates based on 10 replicates per each of 9 populations of 3 x 3 factorial on number (30, 90,
or 300) of quantitative trait loci (QTL), and shape parameter (0.18, 1.48, or 3.00) for Gamma
distribution of QTL effects.
Table 3.3 highlights the comparisons between the various methods using the adaptive window
sizes inferred by BALD. Here, the two-way interaction between method and nqtl was important (P
< 0.05) whereas the two-way interaction between method and $\gamma$ was not; hence, I just compared
different methods within each different value of nqtl. As with the 1Mb window inferences,
MCMC-SSVS and MCMC-BayesA had the highest pAUC05, followed by EMMAX within each
different value of nqtl such that these same rankings were found overall as well. Again, I found
that MAP-SSVS, MAP-BayesA, and RRBLUP had lower pAUC05 compared to a random
classifier except for MAP-SSVS when nqtl = 30.

Table 3.3 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) for different
methods for associations based on genomic windows adaptively chosen by the BALD software
package. Comparisons are made within different specifications of number of quantitative trait
loci (nqtl)

nqtl      MCMC-SSVS   MCMC-BayesA   EMMAX   MAP-SSVS   MAP-BayesA   RRBLUP
30        8.83a       9.03a         3.57b   1.76c      0.87d,*      0.29e,*
90        5.34a       4.98a         2.13b   0.8c,*     0.50d,*      0.38d,*
300       3.89a       3.17a         1.30b   0.66c,*    0.62c,*      0.58c,*
Overall   5.68a       5.22a         2.15b   0.97c,*    0.65d,*      0.40e,*

Values not sharing the same letter within a row have different (P <0.05) relative pAUC05 within
the row. * indicates the corresponding method is not better than a random classifier. Mean
estimates based on 10 replicates per each of 9 populations of 3 x 3 factorial on number (30, 90,
or 300) of quantitative trait loci (QTL), and shape parameter (0.18, 1.48, or 3.00) for Gamma
distribution of QTL effects.
I was also interested in pAUC05 comparisons between different window length
specifications. Recognizing that the interaction between method and window length was
important in the joint analysis involving all simulated datasets, I chose to focus on window
length comparisons separately within each of MCMC-SSVS, MCMC-BayesA, and EMMAX
(Table 3.4), given that all other methods performed worse than a random classifier with windows
based inference. For EMMAX, single SNP inferences had significantly larger pAUC05
compared to inferences based on the longer genomic windows (2 and 3 Mb), with inference based
on adaptively determined windows using BALD and shorter genomic windows (0.5Mb and 1Mb)
being intermediate in their performance. Conversely, for both MCMC-BayesA and
MCMC-SSVS, single SNP inference had the lowest pAUC05 whereas adaptively determined window
selection based on BALD yielded the highest pAUC05, with fixed window inferences being
intermediate in their performance. In fact, the best overall performance was based on using the
two MCMC based methods with adaptively determined windows, with a pAUC05 being over 5
times greater than that of a random classifier.

Table 3.4 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up until a false positive rate of 5% (pAUC05) between different
window sizes within each of EMMAX, MCMC-BayesA, and MCMC-SSVS.

Window       EMMAX     MCMC-BayesA   MCMC-SSVS
Single SNP   2.76a     2.52c         2.61c
0.5Mb        2.43a,b   3.77b         3.65b
1Mb          2.23b,c   3.86b         3.81b
2Mb          1.95c     3.93b         3.94b
3Mb          1.85c     3.93b         4.04b
Adaptive     2.15b,c   5.22a         5.67a

Values not sharing the same letter within a column have different (P <0.05) relative pAUC05
within the column. Mean estimates based on 10 replicates per each of 9 populations of 3 x 3 x 5
factorials on number (30, 90, or 300) of quantitative trait loci (QTL), shape parameter
(0.18, 1.48, or 3.00) for Gamma distribution of QTL effects, and genomic region size (single
SNP, 0.5Mb, 1Mb, 2Mb, 3Mb or adaptively determined) for genome wide association.
3.5.2 MSUPRP Data
Manhattan plots based on single SNP associations for 13-week tenth rib backfat (mm) in
MSUPRP are provided in Figure 3.2. The statistically most significant marker identified by
EMMAX was SNP label ALGA0104402 (P = 2.36e-10) at location 136.0844Mb on
Chromosome 6, marking the same location identified as being most significantly associated with
this trait by Gualdron Duarte et al. (2014). Another 11 nearby statistically significant markers
ranged in location from 132.60Mb to 138.24Mb, with 1 marker (SNP label MARC0035827) at
122.36Mb on Chromosome 6 also being statistically significant using EMMAX. For
MCMC-SSVS, the marker (SNP label ALGA0122657) located at 136.0786Mb on Chromosome 6 had the
highest PPA of 0.487 and was adjacent to the most significant marker ALGA0104402 as
identified by EMMAX. MCMC-SSVS also inferred its second largest PPA = 0.227 with SNP
marker ALGA0104402. Hence, the top 2 SNP markers identified by MCMC-SSVS and
EMMAX were the same, albeit their order of importance was reversed. Although the most
significant single SNP associations were also determined within this same region for each of the
four other methods, none reached statistical significance, except perhaps for
MAP-SSVS which started to approach statistical significance with SNP label MARC0035827
(P=0.08).

Figure 3.2 Manhattan plots for single SNP analysis on 13th week 10th rib backfat in Duroc
Pietrain F2 cross (n = 922 pigs) based on different methods (Panel A: EMMAX, Panel B:
MCMC-SSVS, Panel C: MCMC-BayesA, Panel D: RRBLUP, Panel E: MAP-BayesA and Panel
F: MAP-SSVS)

For windows-based inference, I focused on the adaptively chosen window strategy based on
LD using BALD (Figure 3.3). For EMMAX, the most significant window (P = 9.36e-08) ranged
from 134.17Mb to 134.75Mb on Chromosome 6. Although this region did not contain any
markers that were statistically significant based on single SNP based inferences, it was very
close to a marker (SNP label ASGA0029653) at 134.14Mb that was deemed to be statistically
significant in Figure 3.2. Four other windows on Chromosome 6 were also significant, covering
regions 129.70-131.35Mb, 132.87-134.14Mb, 135.19-136.84Mb and 136.97-137.32Mb. These
windows included some statistically significant or nearly significant markers based on single
SNP inferences in Figure 3.2. Using MCMC-SSVS, the most significant window (Window 909)
covered 135.19-136.84Mb with a PPA = 0.722; this window also contained the most
significant markers based on single SNP inferences using EMMAX and MCMC-SSVS in Figure
3.2. Window 905 had the second highest PPA = 0.477 and ranged in location from
132.87-134.14Mb, with all other windows having smaller PPA (< 0.2). An LD heatmap of the genomic
region containing both windows is provided in Figure 3.4, indicating that some SNP markers in
Window 905 are in relatively high LD with markers in Window 909. These two windows also
had the highest PPA under MCMC-BayesA, being 0.459 and 0.553 respectively. For RRBLUP,
MAP-BayesA and MAP-SSVS, no window was deemed to be statistically significant (P>0.05).

Figure 3.3 Manhattan plots for genomic window based associations on 13th week 10th rib
backfat in Duroc Pietrain F2 cross (n = 922 pigs) based on different methods (Panel A:
EMMAX, Panel B: MCMC-SSVS, Panel C: MCMC-BayesA, Panel D: RRBLUP, Panel E:
MAP-BayesA and Panel F: MAP-SSVS) under adaptive window inference.

Figure 3.4 Linkage disequilibrium ($r^2$ metric) heatmap for genomic region containing
Windows 905 - 909 on Chromosome 6 as adaptively determined by BALD software. Blue dots
are starting and ending points for window 905 whereas purple dots are starting and ending points
for window 909. Black dots are the 3 markers at 133.9292Mb, 136.0786Mb and 136.0844Mb
that are the top 3 SNPs by MCMC-SSVS. The blue oval is used to highlight a pocket of higher $r^2$
measures between SNP markers in windows 905 and 909.
3.6 Discussion
The objectives of our study were multifaceted in that I wished to very broadly address the
impact of a) prior specifications on marker effects, b) single marker associations versus
associations based on different specifications for genomic windows and c) of computationally
tractable but analytical approximations for GWA inference based on sparse priors. Although our
simulation study was based on genotypes derived from a specific population (MSUPRP), a wide
variety of potential genetic architectures were constructed on top of that framework based on
different degrees of skewness of a Gamma distribution via alternative specifications of the shape
parameter ($\gamma$) for QTL effects as well as different numbers (nqtl) of QTL.
Most GWA studies have been conducted using single SNP inferences (Goddard and Hayes
2009; Visscher et al. 2012; Goddard et al. 2016). In this specific context, I determined that the
differences in pAUC05 among all methods were relatively small and unimportant even though
MCMC-BayesA had significantly lower pAUC05 and hence worse GWA performance.
However, for all window-based analyses, MCMC-BayesA and MCMC-SSVS had significantly
greater pAUC05 than all other methods across all combinations of $\gamma$ and nqtl, regardless of
window size and whether these window sizes were fixed or adaptively inferred based on LD
using the BALD software package. Conceptually, MCMC-BayesA might have even
outperformed MCMC-SSVS for windows-based GWA as our comparisons may have been
influenced by the arbitrariness of using 1% as a threshold for percentage of total genetic variance
explained by a window when determining the PPA under MCMC-BayesA. That is, proper
specification of such a threshold is likely to depend on marker density. Admittedly, a BayesB-like
implementation (Meuwissen et al. 2001) could have captured the best features (i.e. variable
selection and heavy-tailed priors) of both BayesA and SSVS. EMMAX typically ranked third
whereas MAP implementations of BayesA and SSVS as well as RRBLUP performed much more poorly
for window-based association. The latter was not too surprising since previously Gualdron
Duarte et al. (2014) also determined that RRBLUP was extremely conservative for GWA in this
same dataset. Furthermore, this liability of RRBLUP has been noted by others including Hayes
(2013). I noted that the median and mean lengths for windows adaptively chosen by BALD
software were 0.59Mb and 0.91Mb (Panel A in Figure A1 in Appendix A), respectively, such
that it was reasonable to expect adaptively chosen windows to lead to a GWA performance
closest to inferences based on either the 0.5Mb or 1Mb fixed window sizes, as I did
observe for the two MCMC based procedures.
What was initially surprising to us was that the pAUC05 for the analytical "shrinkage"-based
procedures, namely RRBLUP, MAP-SSVS and MAP-BayesA, under windows based inference
was often worse than that of a random classifier (i.e. pAUC05<1). This, at first, seemed
counterintuitive to us. Hence, I briefly investigated a scenario where the number of SNP
markers per window was fixed to be 10 rather than basing window sizes on a fixed physical
distance. Basing genomic windows on a fixed number of SNPs has been a strategy also
considered elsewhere (Zhang et al. 2016). In our particular case, the average length of a 10 SNP
window was 0.51 Mb such that one might anticipate that inference based on 10 SNP marker
windows might be comparable to using inference based on fixed 0.5 Mb length windows.
Nevertheless, I determined that 10 SNP window-based inference led to a ROC performance
that was at least as good as that of a random classifier for each of RRBLUP, MAP-SSVS and MAP-BayesA (Figure A2 in Appendix A), contrary to what I observed previously for windows based
on any fixed physical distance. This contrast in pAUC05 performance between fixed physical
distance and fixed number of markers could be explained as follows. For the vast majority of
windows based on either scenario (fixed number of markers or fixed physical distance), the P-values for the chi-square tests of these shrinkage based procedures were very large (i.e., P >
0.85). With inference based on a fixed number of SNP markers per window and random
assignment of QTL to these markers, it was reasonable to expect that the pAUC05 of any of
these procedures should be at least as large as a random classifier. However, with inference
based on fixed physical distance in Mb or even adaptively determined based on LD relationships,
the number of SNP markers and hence the degrees of freedom for each window-specific chi-square test were highly variable, ranging from 1 to 35 with 0.5Mb windows, for example. Hence
regions with few markers are more likely to have smaller P-values than regions with many
markers by nature of a greater penalty incurred with a larger degrees of freedom chi-square test
statistic. Furthermore, lower P-value regions with fewer markers are also less likely to contain a
QTL because of random assignment of QTL to markers throughout the genome such that regions
with the smallest P-values would more likely include a greater than expected number of false
positive results relative to a random classifier.
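The degrees-of-freedom effect described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that shrinkage leaves every window with the same small chi-square statistic regardless of how many markers the window contains; the function name `chi2_sf` and the value of the statistic are hypothetical and are not part of the analyses reported here.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2 start the recurrence
    # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return min(q, 1.0)

# Hypothetical: shrinkage leaves the same small statistic in every window.
shrunken_statistic = 0.05
for n_markers in (1, 5, 10, 35):
    p_value = chi2_sf(shrunken_statistic, n_markers)
    print(f"markers in window = {n_markers:2d}, P-value = {p_value:.6f}")
```

Under this assumption the window with a single marker receives the smallest P-value, so a ranking by P-value would favor sparsely covered windows even though they are the least likely to harbor a randomly assigned QTL.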
One possible strategy to mitigate this problem is to use a likelihood ratio test for the
variance component characterizing the variance attributable to markers within a window. Such a test can be
considered for EMMAX or the MAP based approaches, as then the degrees of freedom for that
test do not depend on the number of markers in that window (Wu et al. 2010; Wang et al.
2013). Gualdron Duarte et al. (2014) present details for such a likelihood ratio test; nevertheless,
this approach requires one to refit the entire model each time that a particular window is being
tested and hence can be computationally challenging.
I specifically determined that adaptive window specifications based on BALD worked best for
both MCMC-BayesA and MCMC-SSVS with significantly higher mean pAUC05 than
inferences based on fixed window lengths or single SNP markers. In fact, there was no evidence
of differences in pAUC05 between GWA associations based on windows of constant sizes
ranging from 0.5 to 3Mb when using either MCMC-BayesA or MCMC-SSVS. Hence adaptive
window clustering based on LD measures seems to be an important factor to consider when
partitioning genomic windows, at least for Bayesian sparse prior specifications.
I have previously established that starting values are important for MAP-SSVS and MAP-BayesA (Chen and Tempelman 2015); in fact, I then demonstrated that starting marker effects at
null values was very suboptimal, even though that is a common strategy for genomic prediction
methods based on the use of the EM algorithm (Meuwissen et al. 2009; Karkkainen and
Sillanpaa 2012). A practical strategy, which I adopted in this study, is to base starting values on
RRBLUP and genomic REML, although I worried as to how
suboptimal that might be, recognizing that MAP estimates are asymptotic, i.e., $\mathrm{MAP}(g) \rightarrow E(g \mid y)$
only as $n \rightarrow \infty$ and such that n >> m. To further assess whether starting values based on
RRBLUP and genomic REML estimates might lead to suboptimal GWA inferences, I also based
starting values for MAP-SSVS and MAP-BayesA on posterior mean estimates derived from their
MCMC counterparts, focusing only, however, on single SNP and adaptive window inference. I
recognize that this would not be a practical MAP strategy as once MCMC based inferences are
obtained, then asymptotic MAP based inferences would not have any extra value. As anticipated
from our previous genomic prediction work (Chen and Tempelman 2015), using MCMC based
starting values for MAP-SSVS led to a larger pAUC05 compared to the use of RRBLUP or
genomic REML starting values except for no evidence of a difference at nqtl = 300 (Table A5 in
Appendix A). However, for adaptively determined windows, even MAP-SSVS inferences based
on MCMC based starting values were no better than a random classifier except for when nqtl =30.
Similar results for comparing different sets of starting values (MCMC-BayesA vs BLUP) for
MAP-BayesA are provided in Table A6 in Appendix A. These supplementary results further
illustrate how precarious the use of MAP based procedures is for Bayesian regression GWA
analyses; again, I believe the sensitivity of MAP to starting values would only be greater
with the use of high density marker panels.
As our GWA inference for MCMC-SSVS was based on PPA (i.e. Prob($\tau_j = 1 \mid y$)), it might
seem reasonable to specify GWA inference for MAP-SSVS in a similar manner, i.e., using the E-step values of $\tau_j$ at convergence as estimates of PPA. However, I noted that these E-step values
uniformly drifted either towards 0 or 1 such that there were never any intermediate estimates of
PPA. A comparison of PPA based on Prob($\tau_j = 1 \mid y$) for MCMC-SSVS versus the E-step
values of $\tau_j$ at convergence on the MSUPRP data is given in Panel A of Figure A3
in Appendix A. Also, recall that the MAP procedure is sensitive to starting values and that
starting values for MAP-SSVS were based on RRBLUP as this might be a pragmatic and
reasonable strategy in most cases. If I had based starting values on, say, their MCMC-SSVS
posterior means, one would notice a different assortment of converged E-step values of $\tau_j$
compared to what I observed with RRBLUP starting values as I demonstrate with the MSUPRP
data in Panel B of Figure A3 (Appendix A).
Recall that for MAP-SSVS, I based starting values for the SNP specific PPA on estimated
local false discovery rates (lFDR) using the R package ashr since there is presumably a close
relationship between them; i.e., $\mathrm{PPA} \approx 1 - \mathrm{lFDR}$ (Stephens 2017). This procedure converts P-values to lFDR such that I based lFDR determinations on the P-values computed under
EMMAX. This raised the question as to whether PPA could be simply based on lFDR
processing of EMMAX P-values. However, upon comparing 1 - lFDR estimated from the
EMMAX P-values to PPA estimated using MCMC-SSVS of the MSUPRP data, it appeared that
there was not generally very good agreement between the two sets of PPA estimates except for
some near-zero PPA and the largest PPA estimated using both procedures (Figure A4 in
Appendix A).
I also wondered if the strategy for computing window-based PPA could be simplified further
from that presented in Fernando et al. (2017) and used in this paper (i.e., Equation [3.19]) to that
suggested by Moser et al. (2015) who simply summed SNP specific PPA (i.e., based on Equation
[3.15]) within a window to determine the window-based PPA. One should anticipate that the
approach of Moser et al. (2015) should lead to higher estimated PPA. I compared the two PPA
determination approaches for pAUC05 in the simulation study and noted that there was a
significant interaction of PPA determination approach with $\gamma$ and nqtl but no significant
interaction involving window size; hence I compared the two strategies within each value of $\gamma$
and nqtl averaged across window length (Table A7 in Appendix A). The only significant
difference in pAUC05 occurred with $\gamma = 3$ and nqtl = 300 for which the approach of Fernando et al.
(2017) led to a higher pAUC05. Nevertheless, since point estimates of pAUC05 were always
larger using the approach from Fernando et al. (2017) I would recommend their approach from
Equation [3.19] for the determination of window-based PPA. An excellent analytical discussion on
control of false positives in GWA using PPA is further provided in Fernando et al. (2017).
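The contrast between the two window-level summaries can be sketched as follows. This is an illustrative sketch only: it assumes that the approach of Fernando et al. (2017) amounts to the proportion of MCMC samples in which at least one SNP in the window has its inclusion indicator equal to 1, and that the approach of Moser et al. (2015) sums the SNP specific PPA within the window. The toy indicator samples and the function names are hypothetical.

```python
def snp_ppa(indicator_samples):
    """SNP specific PPA: proportion of MCMC samples with indicator equal to 1."""
    n_iter = len(indicator_samples)
    n_snp = len(indicator_samples[0])
    return [sum(s[j] for s in indicator_samples) / n_iter for j in range(n_snp)]

def window_ppa_any(indicator_samples, window):
    """Proportion of samples in which at least one SNP in the window is selected."""
    hits = sum(1 for s in indicator_samples if any(s[j] == 1 for j in window))
    return hits / len(indicator_samples)

def window_ppa_sum(indicator_samples, window):
    """Sum of SNP specific PPA across the SNPs in the window."""
    ppa = snp_ppa(indicator_samples)
    return sum(ppa[j] for j in window)

# Hypothetical MCMC output: 4 saved samples (rows) of indicators for 4 SNPs (columns).
samples = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
window = [0, 1]  # the first two SNPs form one window

print("at least one SNP selected:", window_ppa_any(samples, window))
print("sum of SNP specific PPA:  ", window_ppa_sum(samples, window))
```

Because the summed version counts a sample once for every selected SNP in the window, it can never be smaller than the "at least one" version, and it can exceed 1 when SNPs in high LD are selected together.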
I did not estimate $v_\tau$ using the procedures either outlined in Yang et al. (2015b) for MCMC-BayesA or provided in Chen and Tempelman (2015) for MAP-BayesA primarily because of the
extremely poor MCMC mixing for sampling this hyperparameter and its poor convergence in
MAP-BayesA. A typical specification for $v_\tau$ in BayesA is 4 or 5 (Colombani et al. 2013; Perez
and de los Campos 2014). The specification of $v_\tau = 2.5$ that I chose for this paper was based in
part on results from Yang et al. (2015b) and Nadaf et al. (2012) who determined that lower
specifications of this hyperparameter (i.e., heavier tails) could lead to higher genomic prediction accuracies when
using BayesA. To assess this further, I compared MCMC-BayesA using $v_\tau = 2.5$ versus $v_\tau = 5$
for pAUC05 based on the BALD derived adaptive window inference. In general, the use of
$v_\tau = 2.5$ yielded a higher mean pAUC05 than $v_\tau = 5$ except for a non-significant difference at
nqtl = 300 (Table A4). For large scale empirical analyses whereby hyperparameter inference
seems daunting, researchers should consider conducting analyses based on a finite number of
hyperparameter specifications, choosing those specifications that lead to the best cross-validation
prediction accuracy. Similar arguments could be made for choosing the key hyperparameters in
other Bayesian regression models. It is worth noting that even though I ran our MCMC algorithm for 1
million iterations, the mixing of the MCMC chain was still rather poor as it pertained to
inference on other hyperparameters. For example, for MCMC-BayesA, the effective sample size
(ESS) for $\sigma_g^2$ was estimated to be 66.33 whereas for SSVS, the ESS was 61.03 for $\sigma_g^2$ and
53.48 for $\pi_\tau$.
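For readers less familiar with ESS, the following is a minimal sketch of one common estimator, which divides the chain length by an integrated autocorrelation time obtained by summing sample autocorrelations until they first become non-positive. It is not necessarily the estimator used for the values reported above; the function name, the truncation rule and the simulated chains are all assumptions made for illustration.

```python
import random

def effective_sample_size(chain):
    """Chain length divided by 1 + 2 * (sum of leading positive autocorrelations)."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        return float(n)
    tau = 1.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:  # truncate at the first non-positive autocorrelation
            break
        tau += 2.0 * rho
    return n / tau

# Hypothetical chains: independent draws versus a slowly mixing AR(1) process.
random.seed(1)
iid_chain = [random.gauss(0.0, 1.0) for _ in range(2000)]
ar1_chain = [0.0]
for _ in range(1999):
    ar1_chain.append(0.95 * ar1_chain[-1] + random.gauss(0.0, 1.0))

print("ESS, independent draws:", round(effective_sample_size(iid_chain)))
print("ESS, AR(1) with 0.95:  ", round(effective_sample_size(ar1_chain)))
```

A strongly autocorrelated chain of 2000 samples carries the information of far fewer independent draws, which is the same phenomenon that leaves an ESS in the 50 to 70 range after many iterations of a poorly mixing sampler.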
It should be apparent that given that MCMC-SSVS is a natural variable selection model, it
might be favored over MCMC-BayesA which is not a natural variable selection model. Our
strategy for computing the proportion of genetic variance explained by each window and
determining the posterior probability that that percentage exceeds an arbitrary threshold (1% in
our analyses) is based on the strategy presented by Fernando and Garrick (2013). The flexibility
of MCMC modeling allows posterior probabilities (i.e., PPA) of this nature to be computed.
However, one should be wary of the impact of the threshold since it obviously should depend
upon marker density. That is, if the threshold is set too high, then sensitivity is lost. Based on
the results from both the simulation study and the real data analysis, I demonstrated that random effects
modeling can also be a powerful tool for GWA as long as suitable priors, i.e., in our case
sparser priors, are used. Other variable selection implementations popularized in WGP including
BayesB (Meuwissen et al. 2001) or BayesR (Erbe et al. 2012; Moser et al. 2015) could be
considered as well.
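The threshold-based PPA used for MCMC-BayesA windows in the discussion above can be sketched as follows. The sketch assumes that, within each saved MCMC sample, the window's share of genetic variance is the variance across animals of the window-specific genomic values divided by the variance of the genomic values from all markers; the exact computation in Fernando and Garrick (2013) may differ in detail. The toy genotypes, effect samples and function names are hypothetical.

```python
def variance(values):
    """Population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def genomic_values(genotypes, effects, markers):
    """Genomic value of each animal using only the listed marker columns."""
    return [sum(row[j] * effects[j] for j in markers) for row in genotypes]

def window_variance_ppa(genotypes, effect_samples, window, threshold=0.01):
    """Proportion of MCMC samples in which the window explains more than
    `threshold` (as a proportion) of the total genetic variance."""
    all_markers = range(len(genotypes[0]))
    exceed = 0
    for effects in effect_samples:
        total = variance(genomic_values(genotypes, effects, all_markers))
        part = variance(genomic_values(genotypes, effects, window))
        if total > 0.0 and part / total > threshold:
            exceed += 1
    return exceed / len(effect_samples)

# Hypothetical data: 4 animals by 3 markers, and 4 saved samples of marker effects.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
effect_samples = [
    [0.5, 0.0, 0.1],
    [0.0, 0.0, 0.2],
    [0.3, 0.0, 0.0],
    [0.0, 0.4, 0.0],
]
print(window_variance_ppa(genotypes, effect_samples, window=[0]))
```

Raising `threshold` can only lower the resulting PPA, which is the loss of sensitivity noted above when the threshold is set too high.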
Our MSUPRP application was interesting in that I discovered that SNP markers in two
different blocks can be in high LD even when they are not adjacent to each other. However, I
would quickly note that these strange LD patterns may be due to genome assembly errors in the
pig genome (Groenen 2016) with particular issues having been identified in the Chromosome 6
region (Warr et al. 2015) which contained the strongest associations in our study. This may
somewhat complicate strategies for single SNP specific or even window-based inference. I also
recognize that there is a movement towards the use of multi-SNP haplotype modeling which may
improve GWA performance (Cuyabano et al. 2014). Our adaptive window based strategy seems
to improve the performance of GWA relative to single SNP or fixed window length inference
although, conceivably, there may be other better ways to group SNPs. With marker densities
well beyond 50K, the adaptive window strategy might not be viable since it requires the
computation and storage of a matrix of LD r2 values between every pair of SNP markers within a
chromosome before clustering analyses can be used to partition the genome into windows.
Fernando et al. (2017) also suggested that PPA based on Bayesian GWA analyses similar to our
MCMC-SSVS be based on whether non-zero associations were found not only in that marker's
resident window but also in either of the two flanking windows. Their strategy was based on
fixed window sizes such that it may be worthwhile to consider their flanking strategy in the
context of adaptively chosen window sizes. I conjecture that if LD structure is appropriately
used to partition the genome, the use of such flanking windows might not be necessary; however,
this should be a topic for future research. It is also important to note that the comparisons in this
paper are context specific in terms of the genomic LD relationships germane to an F2 cross in
pigs. This cross naturally leads to a higher pairwise LD between adjacent SNP markers than
what might be found in outbred populations, most notably humans. Different LD
patterns would naturally change the relative comparisons between single SNP versus windows
based inferences as well as the relative number and sizes of adaptively chosen windows based on
LD relationships. Hence future investigation of our approaches in other populations is strongly
warranted.
In summary, I found Bayesian variable selection to be a promising strategy for GWA when
combined with window based inference. Nevertheless, it seems prudent that windows
be carefully chosen using rules based on LD information rather than predetermined constant
physical window lengths (in Mb) for genomic regions. Also, recently proposed analytical
approaches for Bayesian regression models should be discouraged for GWA studies.

Chapter 4 Hierarchical Whole-Genome Prediction and Genome-Wide Association
Modeling When Some Genotypes Are Missing
4.1 Abstract
Single step Genomic Best Linear Unbiased Prediction (ssGBLUP) has become increasingly
popular for whole genome prediction (WGP) modeling as it utilizes any available pedigree and
phenotypes on both genotyped and non-genotyped individuals. The WGP accuracy of ssGBLUP
has been previously demonstrated to be higher than or equivalent to conventional Bayesian
regression models. However, most of these assessments have typically not included phenotypes
on non-genotyped individuals in Bayesian regression analyses, making the interpretation of these
comparisons difficult. Recently, ssGBLUP has been increasingly used for genome-wide
association (GWA) studies although there has been no clear guidance on how to determine
formal statistical evidence of association in these analyses. I address this problem as well as
propose a GWA based on a single step adaptation of Bayesian stochastic search and variable
selection (ssSSVS) model that also incorporates phenotypes on non-genotyped animals. Our
study was based on a dataset including 3186 Holstein cows from 6 US research stations using the
USDA-ARS Bovine 60671 SNP marker panel as genotypes. In a simulation study based on the
use of these same genotypes, different numbers of causal variants (nc = 30, 300, or 3000) were
randomly assigned to the markers, masking 50% of cows as non-genotyped, for a trait having a
heritability of 0.25. I determined that ssSSVS had a greater (P<0.05) WGP accuracy than
ssGBLUP with simpler genetic architectures (nc = 30 or nc = 300). Moreover, ssSSVS always
had better (P<0.05) GWA performance than ssGBLUP as based on the partial area under a
receiver operating characteristic curve up until a false positive rate of 5%. In a 25-fold within
station cross-validation study using phenotypes from the dairy consortium, I determined that
ssSSVS had slightly but significantly greater (P<0.05) WGP accuracies for milkfat compared to ssGBLUP
for genotyped individuals, whereas no such differences were detected for body weight. I also
found no significant differences between ssSSVS and ssGBLUP for WGP accuracies for non-genotyped individuals for both traits. Overall, I find ssSSVS to be a promising method for both
WGP and GWA, particularly for genetic architectures characterized by a few genes with large
effects.
4.2 Introduction
Whole genome prediction (WGP) or genomic selection using dense marker panels has been
increasingly implemented in animal and plant breeding and in human genetics studies. Two
broad categories of models have been particularly popular for WGP analyses, namely Genomic
Best Linear Unbiased Prediction (GBLUP) analysis and Bayesian regression or "Bayesian
alphabet" (Gianola 2013) analysis. GBLUP is based on traditional linear mixed model inference
whereby a genomic relationship matrix based on single nucleotide polymorphism (SNP) marker
genotypes is used to specify the correlation between random animal effects (VanRaden 2008).
On the other hand, Bayesian regression models are typically more flexible with distributional
specifications for SNP effects based on heavy tailed prior distributions like a scaled Student t
(i.e., BayesA) (Meuwissen et al. 2001) or variable selection specifications such as stochastic
search and variable selection (SSVS) (George and McCulloch 1993; Chen and Tempelman
2015). These Bayesian regression approaches have been demonstrated to achieve higher WGP
accuracies in many different applications (de Los Campos et al. 2013).
A single-step GBLUP (ssGBLUP) approach, which
combines phenotypes on genotyped animals and on non-genotyped animals with pedigree
information (Aguilar et al. 2010), has been successfully applied to several livestock species
(Chen et al. 2011b; Gray et al. 2012; Lourenco et al. 2015). Because of the additional utilization
of the phenotypic and pedigree information on non-genotyped individuals, ssGBLUP has been
found to have higher WGP accuracies than one popular Bayesian alphabet model, BayesC
(Lourenco et al. 2013; Legarra et al. 2014; Vallejo et al. 2016). However, this comparison might
be unfair as most of these Bayesian analyses have not been extended to use phenotypes on non-genotyped individuals; furthermore, these comparisons may be sensitive to arbitrary
specifications of some key hyperparameters, most notably the proportion of SNP effects deemed
to be non-zero. Recently, Fernando et al. (2014) proposed a single-step approach for Bayesian
alphabet models that combine information on both genotyped and non-genotyped individuals,
later following up on that work with computational strategies for implementation with large
livestock datasets (Fernando et al. 2016).
Genome-wide association (GWA) analysis is a useful tool to identify genomic regions
containing putative causal variants or quantitative trait loci (QTL). Currently, popular tools such
as EMMAX simultaneously fit all markers using the linear mixed model to account for
population structure (Kang et al. 2010). Although EMMAX and somewhat related analyses have been
demonstrated to increase the power of QTL detection compared to single-marker based
regression (Kang et al. 2008), these analyses have not typically utilized phenotypic information
on non-genotyped individuals. Wang et al. (2012) and Zhang et al. (2016) have demonstrated
how to adapt ssGBLUP for GWA; however, their GWA assessments were not based on formal
measures of statistical significance but merely on point estimates of SNP effects or the percentage of
genetic variance explained by sliding windows of SNP markers. This latter development is
particularly important if one is interested in a fair assessment of whether flexible Bayesian
specifications might have merit over ssGBLUP for both WGP and GWA applications when
some phenotyped animals are not genotyped.
GWA studies are increasingly based on joint tests on SNP markers within pre-defined
genomic windows rather than just tests on single SNP markers, as single SNP marker tests may
have low statistical power or be adversely affected by multicollinearity or both (Chen et al. 2017;
Fernando et al. 2017). Most studies have defined their windows using arbitrarily selected window
sizes or number of markers (Wolc et al. 2012; Moser et al. 2015) whereas Dehman et al. (2015)
proposed to adaptively cluster SNP markers into windows of varying size based on LD structure.
Chen et al. (2017) demonstrated that GWA inferences based on an adaptive window approach
enhance the GWA performance of Bayesian models, such as SSVS and BayesA compared to
GWA associations derived from window sizes of arbitrary length or single SNP inferences.
I had several objectives in this study. The first objective was to present a single-step SSVS
(ssSSVS) Bayesian strategy for WGP in conjunction with GWA. A second objective was to
demonstrate formal P-value inference for a single-step extension of EMMAX or GBLUP
(ssGBLUP) that allows use of phenotypes on non-genotyped individuals for GWA based on single
SNP tests as well as on an adaptive window approach. The last objective was to compare the
performance of ssSSVS and ssGBLUP for WGP and GWA in both a simulation study and a real
data analysis.
4.3 Methods and materials
4.3.1 The hierarchical linear model
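Following Fernando et al. (2014), the linear model including genotyped and non-genotyped
individuals can be presented as in Equation [4.1]:

\[
\begin{bmatrix} \mathbf{y}_n \\ \mathbf{y}_g \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}_n \\ \mathbf{X}_g \end{bmatrix} \boldsymbol{\beta}
+
\begin{bmatrix} \mathbf{Z}_n & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_g \end{bmatrix}
\begin{bmatrix} \mathbf{u}_n \\ \mathbf{u}_g \end{bmatrix}
+ \mathbf{e}
\tag{4.1}
\]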
Here $\mathbf{y}_n$ and $\mathbf{y}_g$ are $n_n \times 1$ and $n_g \times 1$ vectors of phenotypes of non-genotyped and genotyped
individuals, respectively. Also, $\boldsymbol{\beta}$ is a vector of fixed environmental effects, with $\mathbf{X}_n$ and $\mathbf{X}_g$ being
the corresponding incidence matrices on the non-genotyped and genotyped individuals,
respectively. Furthermore, $\mathbf{u}_n$ and $\mathbf{u}_g$ are, respectively, $q_n \times 1$ and $q_g \times 1$ vectors of breeding values
of non-genotyped and genotyped individuals, with $\mathbf{Z}_n$ and $\mathbf{Z}_g$ being the corresponding respective
incidence matrices.
Now, breeding values can, in turn, be written as linear functions of SNP effects per Fernando
et al. (2014) in Equation [4.2].

\[
\begin{bmatrix} \mathbf{u}_n \\ \mathbf{u}_g \end{bmatrix}
=
\begin{bmatrix} \hat{\mathbf{T}}_n \boldsymbol{\alpha} + \boldsymbol{\varepsilon} \\ \mathbf{T}_g \boldsymbol{\alpha} \end{bmatrix}
\tag{4.2}
\]

Here $\boldsymbol{\alpha}$ is an $m \times 1$ vector of random SNP marker effects such that $\boldsymbol{\alpha} \sim N(\mathbf{0}, \mathbf{D}\sigma_\alpha^2)$ with $\mathbf{D}$ being a
weighting matrix as described later. Furthermore, $\mathbf{T}_g$ is a standardized genotype matrix for
genotyped individuals such that

\[
\mathbf{T}_g = \frac{\mathbf{M}_g - \mathbf{1}\mathbf{k}'}{\sqrt{\sum_{j=1}^{m} 2 p_j (1 - p_j)}}
\tag{4.3}
\]

where $\mathbf{M}_g$ is the original $n_g \times m$ genotype matrix with elements coded as "0, 1, 2" or the number
of copies of the reference allele of the SNP marker within the corresponding column of $\mathbf{M}_g$.
Furthermore, element j of the $m \times 1$ vector $\mathbf{k}$ is the mean value ($2p_j$) for the corresponding column
of $\mathbf{M}_g$, such that $p_j$ is the allele frequency of the reference allele of SNP marker $j = 1, 2, \ldots, m$
(VanRaden 2008). Conversely, $\hat{\mathbf{T}}_n$ in Equation [4.2] is an "imputed" genotype matrix for the
non-genotyped individuals. As demonstrated by Fernando et al. (2014), $\hat{\mathbf{T}}_n$ can be obtained by solving
$\mathbf{A}^{nn} \hat{\mathbf{T}}_n = -\mathbf{A}^{ng} \mathbf{T}_g$, where $\mathbf{A}^{nn}$ and $\mathbf{A}^{ng}$ are the partitions of $\mathbf{A}^{-1}$ corresponding to non-genotyped by
non-genotyped and non-genotyped by genotyped animals, respectively. Finally, the imputation
residuals $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, (\mathbf{A}^{nn})^{-1}\sigma_u^2)$ in Equation [4.2] are the contributions of pedigree information to
breeding values for non-genotyped animals as demonstrated by Fernando et al. (2014).
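The standardization in Equation [4.3] and the imputation step for non-genotyped animals can be illustrated with a minimal sketch on a toy pedigree. It assumes that the denominator of Equation [4.3] is the square root of the sum of $2p_j(1-p_j)$, so that $\mathbf{T}_g\mathbf{T}_g'$ matches the genomic relationship matrix of VanRaden (2008), and it uses a small hand-written linear solver rather than any particular software. The genotypes, pedigree and function names are hypothetical.

```python
def standardize_genotypes(m_g):
    """Center each column of M_g by 2*p_j and divide by sqrt(sum_j 2 p_j (1 - p_j))."""
    n, m = len(m_g), len(m_g[0])
    p = [sum(row[j] for row in m_g) / (2.0 * n) for j in range(m)]
    scale = sum(2.0 * pj * (1.0 - pj) for pj in p) ** 0.5
    return [[(row[j] - 2.0 * p[j]) / scale for j in range(m)] for row in m_g]

def solve(a, b):
    """Solve a x = b (b may have several columns) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("singular coefficient matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def impute_genotypes(a_nn, a_ng, t_g):
    """Solve A^{nn} T_n_hat = -A^{ng} T_g for the imputed genotype matrix."""
    m = len(t_g[0])
    rhs = [[-sum(a_ng[i][k] * t_g[k][j] for k in range(len(t_g))) for j in range(m)]
           for i in range(len(a_ng))]
    return solve(a_nn, rhs)

# Hypothetical pedigree: genotyped sire, dam and one unrelated animal,
# plus one non-genotyped offspring of the sire and dam. With non-inbred,
# unrelated parents, the offspring row of A^{-1} gives A^{nn} = [[2]]
# and A^{ng} = [[-1, -1, 0]].
m_g = [[0, 1, 2], [2, 1, 0], [2, 2, 1]]  # rows: sire, dam, unrelated animal
t_g = standardize_genotypes(m_g)
t_n_hat = impute_genotypes([[2.0]], [[-1.0, -1.0, 0.0]], t_g)
print(t_n_hat[0])
```

In this simple case the imputed standardized genotype of the offspring is the average of its two parents, which is what one would expect from pedigree information alone.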
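Combining Equations [4.1] and [4.2], a SNP effects model can be written as

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\boldsymbol{\alpha} + \mathbf{U}\boldsymbol{\varepsilon} + \mathbf{e}
\tag{4.4}
\]

where $\mathbf{y} = \begin{bmatrix} \mathbf{y}_n \\ \mathbf{y}_g \end{bmatrix}$, $\mathbf{X} = \begin{bmatrix} \mathbf{X}_n \\ \mathbf{X}_g \end{bmatrix}$, $\mathbf{W} = \begin{bmatrix} \mathbf{W}_n \\ \mathbf{W}_g \end{bmatrix} = \begin{bmatrix} \mathbf{Z}_n \hat{\mathbf{T}}_n \\ \mathbf{Z}_g \mathbf{T}_g \end{bmatrix}$, and $\mathbf{U} = \begin{bmatrix} \mathbf{Z}_n \\ \mathbf{0} \end{bmatrix}$. Then the corresponding
mixed model equations used to compute the BLUE $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$, the BLUP $\hat{\boldsymbol{\alpha}}$ of $\boldsymbol{\alpha}$ and the BLUP $\hat{\boldsymbol{\varepsilon}}$ of $\boldsymbol{\varepsilon}$, as
also illustrated by Fernando et al. (2014), are given in Equation [4.5]:

\[
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} & \mathbf{X}_n'\mathbf{Z}_n \\
\mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W} + \mathbf{D}^{-1}\dfrac{\sigma_e^2}{\sigma_\alpha^2} & \mathbf{W}_n'\mathbf{Z}_n \\
\mathbf{Z}_n'\mathbf{X}_n & \mathbf{Z}_n'\mathbf{W}_n & \mathbf{Z}_n'\mathbf{Z}_n + \mathbf{A}^{nn}\dfrac{\sigma_e^2}{\sigma_u^2}
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\alpha}} \\ \hat{\boldsymbol{\varepsilon}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{W}'\mathbf{y} \\ \mathbf{Z}_n'\mathbf{y}_n \end{bmatrix}
\tag{4.5}
\]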

4.3.2 The ssGBLUP model
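When $\mathbf{D} = \mathbf{I}$, the elements of $\boldsymbol{\alpha}$ are normally, identically and independently distributed,
such that the corresponding analysis is a single-step (ss) adaptation of GBLUP, which I denote as
ssGBLUP. When the total number of animals $q = q_n + q_g$ is considerably smaller than the total
number of SNP markers m, it is convenient to re-parameterize Equation [4.4] further to
Equation [4.6]:

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u}^{(\alpha)} + \mathbf{U}\boldsymbol{\varepsilon} + \mathbf{e}
\tag{4.6}
\]

where $\mathbf{Z} = \begin{bmatrix} \mathbf{Z}_n & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_g \end{bmatrix}$, and $\mathbf{u}^{(\alpha)} = \mathbf{T}\boldsymbol{\alpha} = \begin{bmatrix} \hat{\mathbf{T}}_n \\ \mathbf{T}_g \end{bmatrix} \boldsymbol{\alpha} = \begin{bmatrix} \mathbf{u}_n^{(\alpha)} \\ \mathbf{u}_g^{(\alpha)} \end{bmatrix}$ is the contribution of genotypes to
breeding values, whether based on actual genotypes ($\mathbf{T}_g$) for genotyped animals
or on imputed genotypes ($\hat{\mathbf{T}}_n$) for non-genotyped animals. With $\mathbf{u}^{(\alpha)} = \mathbf{T}\boldsymbol{\alpha}$, then
$\mathrm{var}(\mathbf{u}^{(\alpha)}) = \mathbf{T}\mathbf{T}'\sigma_\alpha^2$ such that the corresponding mixed model equations can be written as follows to
solve for the BLUE of $\boldsymbol{\beta}$ and the BLUP of $\mathbf{u}^{(\alpha)}$ and $\boldsymbol{\varepsilon}$ in Equation [4.7]:

\[
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} & \mathbf{X}'\mathbf{U} \\
\mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + (\mathbf{T}\mathbf{T}')^{-1}\dfrac{\sigma_e^2}{\sigma_\alpha^2} & \mathbf{Z}'\mathbf{U} \\
\mathbf{U}'\mathbf{X} & \mathbf{U}'\mathbf{Z} & \mathbf{U}'\mathbf{U} + \mathbf{A}^{nn}\dfrac{\sigma_e^2}{\sigma_u^2}
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}}^{(\alpha)} \\ \hat{\boldsymbol{\varepsilon}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \\ \mathbf{U}'\mathbf{y} \end{bmatrix}
\tag{4.7}
\]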

Note that $\sigma_\alpha^2$, $\sigma_u^2$, and $\sigma_e^2$ can be readily estimated using REML; in fact, I adopt the average
information restricted maximum likelihood (AIREML) algorithm (Gilmour et al. 1995; Johnson
and Thompson 1995) to estimate the variance components $\sigma_\alpha^2$, $\sigma_u^2$ and $\sigma_e^2$. Subsequently,
$\hat{\boldsymbol{\beta}}$, $\hat{\mathbf{u}}^{(\alpha)}$, and $\hat{\boldsymbol{\varepsilon}}$ can be obtained by solving the mixed model equations (MME) in Equation [4.7] conditional on these REML
estimates. I label this strategy, which separately estimates a common marker variance component
$\sigma_\alpha^2$ from the polygenic variance component $\sigma_u^2$, as the heterogeneous variance (HETVAR)
approach, recognizing that these two variance components could be different from each other in
real data applications if the markers do not capture all of the genetic variability or if highly selected
animals or animals from certain herds or stations are preferentially genotyped, such that $\sigma_\alpha^2 \neq \sigma_u^2$.
Furthermore, estimating the two variance components separately might be a better solution if the
pedigree based relationship matrix and scaled genomic relationship matrix are not completely
compatible with each other (Chen et al. 2011a).
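The HETVAR approach differs from the more traditional ssGBLUP approach described in
Aguilar et al. (2010) where it is implicitly assumed that $\sigma_\alpha^2 = \sigma_u^2$. This specification, which I
label as HOMVAR, simplifies the MME in Equation [4.7] to that in Equation [4.8]:

\[
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} & \mathbf{X}'\mathbf{U} \\
\mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + (\mathbf{T}\mathbf{T}')^{-1}\dfrac{\sigma_e^2}{\sigma_u^2} & \mathbf{Z}'\mathbf{U} \\
\mathbf{U}'\mathbf{X} & \mathbf{U}'\mathbf{Z} & \mathbf{U}'\mathbf{U} + \mathbf{A}^{nn}\dfrac{\sigma_e^2}{\sigma_u^2}
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}}^{(\alpha)} \\ \hat{\boldsymbol{\varepsilon}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \\ \mathbf{U}'\mathbf{y} \end{bmatrix}
\tag{4.8}
\]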

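Similarly, $\sigma_u^2$ and $\sigma_e^2$ can be estimated using REML. Essentially, the MME in Equation
[4.8] is equivalent to the following MME based on $\mathbf{H}^{-1}$ (Aguilar et al. 2010):

\[
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\
\mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{H}^{-1}\lambda
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix};
\quad \lambda = \frac{\sigma_e^2}{\sigma_u^2}
\tag{4.9}
\]

where

\[
\mathbf{H}^{-1} = \mathbf{A}^{-1} +
\begin{bmatrix}
\mathbf{0} & \mathbf{0} \\
\mathbf{0} & (\mathbf{T}_g \mathbf{T}_g')^{-1} - \mathbf{A}_{gg}^{-1}
\end{bmatrix}
\]

If I partition $\mathbf{H}^{-1}$ as $\mathbf{H}^{-1} = \begin{bmatrix} \mathbf{H}^{nn} & \mathbf{H}^{ng} \\ \mathbf{H}^{gn} & \mathbf{H}^{gg} \end{bmatrix}$ into components due to non-genotyped (n) and
genotyped (g) animals, then the MME in Equation [4.9] can be further rewritten as

\[
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}_n'\mathbf{Z}_n & \mathbf{X}_g'\mathbf{Z}_g \\
\mathbf{Z}_n'\mathbf{X}_n & \mathbf{Z}_n'\mathbf{Z}_n + \mathbf{H}^{nn}\lambda & \mathbf{H}^{ng}\lambda \\
\mathbf{Z}_g'\mathbf{X}_g & \mathbf{H}^{gn}\lambda & \mathbf{Z}_g'\mathbf{Z}_g + \mathbf{H}^{gg}\lambda
\end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}}_n \\ \hat{\mathbf{u}}_g \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}_n'\mathbf{y}_n \\ \mathbf{Z}_g'\mathbf{y}_g \end{bmatrix}
\tag{4.10}
\]

Here the off-diagonal blocks involve only $\mathbf{H}^{ng}$ and $\mathbf{H}^{gn}$ because $\mathbf{Z}$ in Equation [4.6] is block diagonal.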

Equations [4.8], [4.9] and [4.10] are based on equivalent linear models (Henderson 1975) and
hence lead to identical estimated breeding values $\hat{\mathbf{u}}_g$ for the genotyped individuals, whereas
REML estimates for $\sigma_u^2$ and $\sigma_e^2$ are also identical. For non-genotyped individuals, the mixed
model equations (MME) in Equation [4.9] estimate the breeding values using imputed genotypes
$\hat{\mathbf{u}}_n^{(\alpha)}$ and pedigree information in $\hat{\boldsymbol{\varepsilon}}$ separately, i.e.
$\hat{\mathbf{u}}_n = \hat{\mathbf{u}}_n^{(\alpha)} + \hat{\boldsymbol{\varepsilon}}$.
To test $H_0: \sigma_\alpha^2 = \sigma_u^2$, one can conduct a likelihood ratio test with $l_0$ being the maximized
restricted log likelihood under the reduced HOMVAR model and $l_1$ being the maximized
restricted log likelihood under the full HETVAR model. Then under $H_0: \sigma_\alpha^2 = \sigma_u^2$, $-2(l_0 - l_1)$
follows a 50:50 mixture of $\chi_1^2$ and $\chi_0^2$ (Stram and Lee 1994).
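As a minimal sketch of how a P-value could be obtained from this 50:50 mixture, the following Python function takes the two maximized restricted log likelihoods and uses only the standard library. The function name `lrt_pvalue_mixture` and the example log likelihood values are hypothetical; the $\chi_0^2$ component is treated as a point mass at zero, and the $\chi_1^2$ tail probability is computed as `erfc(sqrt(x / 2))`.

```python
import math

def lrt_pvalue_mixture(l0, l1):
    """P-value for -2(l0 - l1) under a 50:50 mixture of chi2_0 and chi2_1."""
    stat = max(-2.0 * (l0 - l1), 0.0)   # guard against tiny negative values
    if stat == 0.0:
        return stat, 1.0                # both mixture components give P(X >= 0) = 1
    p_chi2_1 = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-square, 1 df
    return stat, 0.5 * p_chi2_1         # chi2_0 contributes nothing for stat > 0

# Hypothetical log likelihoods for the reduced (HOMVAR) and full (HETVAR) fits.
stat, p_value = lrt_pvalue_mixture(l0=-100.0, l1=-98.5)
```

In effect, the mixture halves the usual one degree of freedom chi-square P-value whenever the test statistic is positive.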
4.3.3 The ssSSVS model
I consider a variable selection specification by writing
$\mathbf{D} = \mathrm{diag}\left\{ \frac{1-\phi_j}{c} + \phi_j \right\}$; $j = 1, 2, \ldots, m$,
where $c \gg 1$ and $p(\phi_j | \pi_\phi) = \pi_\phi^{\phi_j}(1-\pi_\phi)^{1-\phi_j}$, $\phi_j = 0, 1$. The corresponding model is SSVS with
further details provided by Chen et al. (2017). Note that $\pi_\phi$ is the probability that a marker has a
large variance, and subsequently a large effect, with respect to the trait. This specification for $\mathbf{D}$ is
equivalent to assigning the following mixture prior for the SNP marker effects:

$$
p(\alpha_j | \sigma_\alpha^2, \phi_j) =
\frac{1}{\sqrt{2\pi\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)}}
\exp\left(-\frac{\alpha_j^2}{2\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)}\right);
\quad j = 1, 2, \ldots, m
$$

[4.11]
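To make the mixture prior in Equation [4.11] concrete, the sketch below draws indicators and SNP effects from it using only the Python standard library. It is an illustration rather than the sampler used in this chapter: the function name `draw_snp_effects` and the numerical values of `pi_phi`, `sigma2_alpha` and `c` are assumptions chosen for the example.

```python
import math
import random

def draw_snp_effects(m, pi_phi, sigma2_alpha, c, seed=0):
    """Draw (phi_j, alpha_j) pairs from the two-component normal mixture prior."""
    rng = random.Random(seed)
    phi, alpha = [], []
    for _ in range(m):
        phi_j = 1 if rng.random() < pi_phi else 0   # phi_j = 1: large-variance component
        d_j = (1 - phi_j) / c + phi_j               # jth diagonal element of D
        phi.append(phi_j)
        alpha.append(rng.gauss(0.0, math.sqrt(sigma2_alpha * d_j)))
    return phi, alpha

# Hypothetical settings: 1% of markers in the large-variance component, c = 1000.
phi, alpha = draw_snp_effects(m=1000, pi_phi=0.01, sigma2_alpha=0.02, c=1000.0)
```

With $c \gg 1$, markers with $\phi_j = 0$ receive effects shrunk towards zero, whereas markers with $\phi_j = 1$ are drawn with the full variance $\sigma_\alpha^2$.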

Additionally, the prior for $\pi_\phi$ is Beta distributed, i.e.,

$$
p(\pi_\phi | \alpha_0, \beta_0) \propto \pi_\phi^{\alpha_0}(1-\pi_\phi)^{\beta_0}
$$

[4.12]

For the variance components, I specify scaled inverse chi-square distribution priors; i.e.,

$$
p(\sigma_e^2 | \nu_e, s_e^2) \propto (\sigma_e^2)^{-\left(\frac{\nu_e}{2}+1\right)} \exp\left(-\frac{\nu_e s_e^2}{2\sigma_e^2}\right),
$$

[4.13]

$$
p(\sigma_\alpha^2 | \nu_\alpha, s_\alpha^2) \propto (\sigma_\alpha^2)^{-\left(\frac{\nu_\alpha}{2}+1\right)} \exp\left(-\frac{\nu_\alpha s_\alpha^2}{2\sigma_\alpha^2}\right),
$$

[4.14]

and

$$
p(\sigma_u^2 | \nu_u, s_u^2) \propto (\sigma_u^2)^{-\left(\frac{\nu_u}{2}+1\right)} \exp\left(-\frac{\nu_u s_u^2}{2\sigma_u^2}\right).
$$

[4.15]

For all analyses in this paper, I use non-informative priors where the degrees of freedom are
$\nu_e = \nu_\alpha = \nu_u = -1$ and the scales are $s_e^2 = s_\alpha^2 = s_u^2 = 0$ (Gelman 2006).
Given the prior specification above, the joint posterior density is given as follows:

$$
\begin{aligned}
p(\boldsymbol{\beta}, \alpha_1, \alpha_2, \ldots, \alpha_m, \boldsymbol{\varepsilon}, \sigma_u^2, \sigma_\alpha^2, \sigma_e^2 | \mathbf{y}) \propto
& \left( \prod_{i=1}^{n} p(y_i | \boldsymbol{\beta}, \alpha_1, \alpha_2, \ldots, \alpha_m, \boldsymbol{\varepsilon}, \sigma_e^2) \right) \\
& \times \left( \prod_{j=1}^{m} p(\alpha_j | \sigma_\alpha^2, \phi_j)\, p(\phi_j | \pi_\phi) \right)
p(\boldsymbol{\varepsilon} | \sigma_u^2)\, p(\sigma_\alpha^2 | s_\alpha^2, \nu_\alpha)\, p(\sigma_u^2 | s_u^2, \nu_u) \\
& \times p(\sigma_e^2 | \nu_e, s_e^2)\, p(\pi_\phi | \alpha_0, \beta_0)
\end{aligned}
$$

[4.16]

The unknown parameters for ssSSVS can then be sampled from their joint posterior density using
Markov Chain Monte Carlo (MCMC). Details on the MCMC sampling scheme for ssSSVS are
provided in the Supplementary File S1 and for SSVS in Chen et al. (2017).
4.3.4 Conducting Genome-Wide Association Analyses
4.3.4.1 Single SNP marker associations
An efficient strategy for providing formal GWA inference under EMMAX is provided by
Gualdron Duarte et al. (2014) and further described in Chen et al. (2017) with a formal proof
provided in Bernal Rubio et al. (2016). This approach is equivalent to treating the SNP marker
effect of interest as fixed while treating all other SNP effects as random in a generalized least
squares (GLS) approach. The same strategy can be used to derive formal GWA inference under a
single-step modification of EMMAX, which I denote as ssEMMAX. The test statistic for the ssEMMAX
test on SNP marker j can then be simply written as

$$
z_j = \frac{\hat{\alpha}_j}{se(\hat{\alpha}_j)}
$$

[4.17]

Note that SNP effect estimates $\hat{\boldsymbol{\alpha}} = \{\hat{\alpha}_j\}_{j=1}^{m}$ can simply be backsolved from the solutions to
Equation [4.8] using $\hat{\boldsymbol{\alpha}} = \mathbf{T}_g'\mathbf{G}_g^{-1}\hat{\mathbf{u}}_g^{(\alpha)}$ where $\mathbf{G}_g = \mathbf{T}_g\mathbf{T}_g'$ (Stranden and Garrick 2009).
$se(\hat{\alpha}_j)$ is the square root of the jth diagonal element of $var(\hat{\boldsymbol{\alpha}})$, where
$var(\hat{\boldsymbol{\alpha}}) = \mathbf{T}_g'\mathbf{G}_g^{-1}\left(\mathbf{G}_g\sigma_\alpha^2 - \mathbf{C}_{gg}^{uu}\sigma_e^2\right)\mathbf{G}_g^{-1}\mathbf{T}_g$.
Here $\mathbf{C}_{gg}^{uu}$ is essentially the diagonal block of the inverse of the coefficient matrix in Equations [4.7] or [4.8]
corresponding to $\hat{\mathbf{u}}_g^{(\alpha)}$. That is, one can obtain $\mathbf{C}_{gg}^{uu}$ by inverting the coefficient matrix in
Equation [4.7] for HETVAR and Equation [4.8] for HOMVAR. In other words, $\mathbf{C}_{gg}^{uu}\sigma_e^2$ is the
prediction error covariance matrix of $\hat{\mathbf{u}}_g^{(\alpha)}$.
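The backsolving step and the test statistic in Equation [4.17] can be illustrated with a deliberately simplified example. The sketch below assumes that all animals are genotyped, that the only fixed effect is an overall mean, and that the variance components are known; the data, dimensions and variable names are all made up for illustration and do not reproduce the full MME in Equation [4.8].

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 20                                  # toy sizes: 6 animals, 20 SNP markers
T_g = rng.normal(size=(n, m))                 # hypothetical marker covariates
y = rng.normal(size=n)                        # hypothetical phenotypes
sigma2_alpha, sigma2_e = 0.5, 1.0             # variance components assumed known
lam = sigma2_e / sigma2_alpha

G_g = T_g @ T_g.T
G_inv = np.linalg.inv(G_g)
ones = np.ones((n, 1))

# MME for y = 1*mu + u + e with u ~ N(0, G_g * sigma2_alpha).
lhs = np.block([[ones.T @ ones, ones.T],
                [ones, np.eye(n) + G_inv * lam]])
rhs = np.concatenate([ones.T @ y, y])
C = np.linalg.inv(lhs)                        # inverse of the coefficient matrix
u_hat = (C @ rhs)[1:]                         # estimated breeding values
C_uu = C[1:, 1:]                              # block of the inverse corresponding to u

# Backsolve SNP effects and their variances, then form z_j as in Equation [4.17].
alpha_hat = T_g.T @ G_inv @ u_hat
var_alpha = T_g.T @ G_inv @ (G_g * sigma2_alpha - C_uu * sigma2_e) @ G_inv @ T_g
z = alpha_hat / np.sqrt(np.diag(var_alpha))
p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])  # two-sided
```

In this toy setting the backsolved `alpha_hat` coincides with the solutions obtained by fitting the SNP effects directly as random effects, which is the equivalence the backsolving approach relies on.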
For MCMC-based single SNP inferences, I based inferences on the posterior probability of
association (PPA) for SNP marker j (i.e. PPAj):

$$
PPA_j = \frac{\sum_{l=1}^{N} \phi_j^{(l)}}{N}
$$

[4.18]

For SSVS, N denotes the number of MCMC cycles saved for posterior inference and $\phi_j^{(l)}$ is a
binary draw from the full conditional distribution of $\phi_j$ at MCMC cycle l.
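Equation [4.18] is simply the average of the saved indicator draws for each marker. A minimal Python sketch follows, assuming the saved draws are available as a list of cycles, each holding one 0/1 value per marker; the function name `ppa_single_snp` and the toy draws are hypothetical.

```python
def ppa_single_snp(draws):
    """PPA_j = (sum over saved cycles l of phi_j^(l)) / N for every marker j."""
    n_cycles = len(draws)
    n_markers = len(draws[0])
    return [sum(cycle[j] for cycle in draws) / n_cycles for j in range(n_markers)]

# Four saved MCMC cycles for three markers (made-up indicator draws).
draws = [[1, 0, 0],
         [0, 0, 0],
         [0, 1, 0],
         [1, 0, 0]]
ppa = ppa_single_snp(draws)
```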
4.3.4.2 Windows based associations
The window based approach follows what has been developed in Chen et al. (2017). Suppose
that window k = 1, 2, 3, ..., K contains $n_k$ markers such that $\mathbf{T}_g$ can be partitioned accordingly
into $\mathbf{T}_g = [\mathbf{T}_{g1} \; \mathbf{T}_{g2} \; \cdots \; \mathbf{T}_{gK}]$ with $\mathbf{T}_{gk}$ having $n_k$ columns, containing the submatrix
representing the $n_k$ SNP markers for window k. Therefore, the submatrix of $var(\hat{\boldsymbol{\alpha}})$ for window
k is $var(\hat{\boldsymbol{\alpha}}_k) = \mathbf{T}_{gk}'\mathbf{G}_g^{-1}\left(\mathbf{G}_g\sigma_\alpha^2 - \mathbf{C}_{gg}^{uu}\sigma_e^2\right)\mathbf{G}_g^{-1}\mathbf{T}_{gk}$. Similarly, the vector $\boldsymbol{\alpha}$ is partitioned accordingly;
i.e. $\boldsymbol{\alpha} = [\boldsymbol{\alpha}_1' \; \boldsymbol{\alpha}_2' \; \cdots \; \boldsymbol{\alpha}_K']'$ such that $\boldsymbol{\alpha}_k$ is of dimension $n_k \times 1$. The extension to a joint
EMMAX-like test on $n_k$ markers in window k involves the following determination:

$$
\chi_k^2 = \hat{\boldsymbol{\alpha}}_k' \left(var(\hat{\boldsymbol{\alpha}}_k)\right)^{-1} \hat{\boldsymbol{\alpha}}_k
$$

[4.19]

where $\chi_k^2$ is chi-square distributed with $n_k$ degrees of freedom under $H_0: \boldsymbol{\alpha}_k = \mathbf{0}$.
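The joint window test in Equation [4.19] amounts to a quadratic form followed by a chi-square tail probability. The sketch below assumes the SNP effect estimates and their covariance matrix have already been computed; the helper names `chi2_sf` and `window_test` and the toy inputs are hypothetical. To stay self-contained, the tail probability uses the closed-form expression for integer degrees of freedom instead of a statistics library.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Upper tail probability of a chi-square variate with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        terms = [half ** i / math.factorial(i) for i in range(df // 2)]
        return math.exp(-half) * sum(terms)
    terms = [half ** (i - 0.5) / math.gamma(i + 0.5)
             for i in range(1, (df - 1) // 2 + 1)]
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)

def window_test(alpha_hat, var_alpha, idx):
    """Chi-square statistic and P-value for the markers indexed by idx."""
    a_k = alpha_hat[idx]
    v_k = var_alpha[np.ix_(idx, idx)]              # submatrix var(alpha_hat_k)
    stat = float(a_k @ np.linalg.solve(v_k, a_k))  # a_k' V_k^{-1} a_k
    return stat, chi2_sf(stat, len(idx))

# Toy example: three markers with a diagonal covariance matrix, window = markers 0 and 1.
alpha_hat_toy = np.array([1.0, 2.0, 0.5])
var_alpha_toy = np.diag([1.0, 4.0, 0.25])
stat_k, p_k = window_test(alpha_hat_toy, var_alpha_toy, [0, 1])
```

Solving the linear system rather than explicitly inverting $var(\hat{\boldsymbol{\alpha}}_k)$ is a numerical design choice of this sketch, not something prescribed by the text.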
For windows based inference using SSVS under MCMC, I compute the PPA for each
window k (i.e. PPAk) in Equation [4.20] as also presented in Chen et al. (2017) and similar to
Fernando et al. (2017):

$$
PPA_k = \frac{\sum_{l=1}^{N} I\left(\sum_{j=1}^{n_k} \phi_{kj}^{(l)} > 0\right)}{N}
$$

[4.20]

Here, $\phi_{kj}^{(l)}$ defines a binary draw from the full conditional distribution of $\phi_{kj}$ for SNP marker j
located within window k drawn during MCMC cycle l. Note then that $I\left(\sum_{j=1}^{n_k} \phi_{kj}^{(l)} > 0\right)$ is equal to
1 when any of the draws of $\phi_{kj}^{(l)}$ within window k are equal to 1. This simply entails determining
whether any of the SNP markers within region k have an association.
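The window-level PPA in Equation [4.20] can be sketched in a few lines of Python. The layout of the saved draws (one list of 0/1 indicators per saved cycle) and the function name `ppa_window` are assumptions made for illustration.

```python
def ppa_window(draws, windows):
    """Fraction of saved cycles in which at least one marker in a window is 'on'."""
    n_cycles = len(draws)
    return [sum(1 for cycle in draws if any(cycle[j] for j in window)) / n_cycles
            for window in windows]

# Four saved MCMC cycles for three markers; window 1 = markers 0-1, window 2 = marker 2.
window_draws = [[1, 0, 0],
                [0, 0, 0],
                [0, 1, 0],
                [1, 0, 0]]
ppa_k = ppa_window(window_draws, windows=[[0, 1], [2]])
```

Because the indicator is evaluated within each cycle before averaging, the window PPA can exceed every single-marker PPA in that window when the signal moves between neighboring markers across cycles.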
4.4 Data and Applications Strategies
4.4.1 Genotypes
The SNP marker genotypes of 3186 Holstein cows were provided from 6 US research stations
including Iowa State University (ISU), Michigan State University (MSU), the University of
Florida (UF), the University of Wisconsin-Madison (UW), the USDA Dairy Forage Research
Center (USDFRC) in Madison, Wisconsin, and the USDA Animal Genomics and Improvement
Laboratory (AGIL) in Beltsville, MD. Genotypes were obtained using the Illumina® BovineSNP50
Genotyping BeadChip and then imputed and edited as in Lu (2016), which excluded SNPs with
minor allele frequency (MAF) less than 0.05 and SNP markers in complete LD with each other,
leaving 57,347 SNP markers for analysis.
4.4.2 Simulation study
To compare the two broad categories of models of interest, I conducted a simulation study
based on the actual genotypes of the 3186 Holstein cows collected from 6 research stations as
described above. I generated QTL effects $\boldsymbol{\alpha}_{qtl}$ from a Gamma distribution with shape parameter
equal to 0.42 based on average estimates reported by Hayes and Goddard (2001). I conjectured
that the number of QTL might also influence WGP and GWA performance such that I
considered nqtl = 30, 300, or 3000. Here, I simulated 10 replicates for each specification of nqtl,
resulting in 30 different simulated datasets in total. For each dataset, QTL effects, $\boldsymbol{\alpha}_{qtl}$, were
randomly assigned to nqtl SNP markers across the genome with a random half of the effects
multiplied by -1 as per Meuwissen et al. (2001). The corresponding genotypes $\mathbf{M}_{qtl}$ for QTL on
these cows were then an n x nqtl subset of the SNP genotype matrix $\mathbf{M}$ such that the true breeding
values were $\mathbf{u}_{TRUE} = \mathbf{M}_{qtl}\boldsymbol{\alpha}_{qtl}$. Phenotypes for animals were generated based on a heritability of 0.25 as
estimated for milk fat from this same dataset. Only the remaining marker genotypes $\mathbf{M}_{-qtl}$ were
used for the analyses; i.e., QTL genotypes were always masked.
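The simulation steps described above can be sketched as follows. This is an illustration under several assumptions that the text does not specify: the Gamma scale parameter is set to 1, residuals are drawn as independent normal deviates with variance chosen to match the target heritability, and a random 0/1/2 matrix stands in for the real genotypes. The function name `simulate_phenotypes` is hypothetical.

```python
import numpy as np

def simulate_phenotypes(M, n_qtl, h2=0.25, shape=0.42, seed=0):
    """Assign Gamma-distributed QTL effects to random markers and simulate phenotypes."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    qtl_idx = rng.choice(m, size=n_qtl, replace=False)       # markers acting as QTL
    effects = rng.gamma(shape, 1.0, size=n_qtl)              # scale = 1 is an assumption
    flip = rng.choice(n_qtl, size=n_qtl // 2, replace=False)
    effects[flip] *= -1.0                                    # random half made negative
    u_true = M[:, qtl_idx] @ effects                         # true breeding values
    sigma2_e = u_true.var() * (1.0 - h2) / h2                # residual variance for target h2
    y = u_true + rng.normal(0.0, np.sqrt(sigma2_e), size=n)
    M_masked = np.delete(M, qtl_idx, axis=1)                 # QTL genotypes are masked
    return y, u_true, M_masked, qtl_idx, effects

# Stand-in genotypes coded 0/1/2 (the real study used the 3186 cows' SNP genotypes).
M_toy = np.random.default_rng(42).integers(0, 3, size=(200, 50)).astype(float)
y_sim, u_sim, M_rest, qtl_idx, qtl_effects = simulate_phenotypes(M_toy, n_qtl=10)
```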
For analysis, 50% of cows were masked as non-genotyped such that their SNP marker
genotypes were treated as missing. Analyses were conducted using GBLUP, SSVS, ssGBLUP,
and ssSSVS, where GBLUP and SSVS, as previously noted, do not include any phenotypes on
non-genotyped animals. For GBLUP and ssGBLUP, I used AIREML to estimate the
variance components using both HETVAR and HOMVAR specifications with fixed and random
effect estimates obtained by solving the MME provided in Equations [4.7] and [4.8]. For SSVS
and ssSSVS, I ran MCMC for 200,000 iterations in total, discarding the first 100,000 iterations
as burn-in and basing inference on every 10th of the remaining 100,000 cycles for a total of
10,000 samples from the posterior density. It is known that estimating $\pi_\phi$ can be challenging
when nqtl is large (van den Berg et al. 2013), i.e., nqtl = 3000 in our case. Therefore, for nqtl
= 3000, I considered 3 different specifications for $\pi_\phi$: 1) estimating $\pi_\phi$, 2) setting $\pi_\phi = 0.01$, and
3) setting $\pi_\phi = 0.001$.
I defined the prediction accuracy of breeding values for WGP as the correlation between the
estimated breeding values ($\hat{\mathbf{u}}$) and the true breeding values ($\mathbf{u}_{TRUE}$) in our simulation study. For
all individuals, I compared the WGP accuracies of breeding values using GBLUP, SSVS,
ssGBLUP and ssSSVS for each specification of nqtl.
As for GWA, single SNP marker inferences were implemented as described in the Methods
and Materials. P-values for EMMAX and ssEMMAX were based on the z test in Equation
[4.17] whereas SSVS and ssSSVS provided posterior probabilities (i.e., PPA) based on the
MCMC samples in Equation [4.18]. Since the remaining genotypes $\mathbf{M}_{g,-qtl}$ did not include the
simulated QTL, SNP markers were treated as true positives if the QTL were located between
themselves and an adjacent SNP marker.
I also conducted windows based inference based on the EMMAX and ssEMMAX procedures
using P-values based on the chi-square test in Equation [4.19] whereas PPA within a window
using SSVS and ssSSVS were based on randomly drawn MCMC samples as in Equation [4.20].
The length of each window was determined by the BALD R package (Dehman and Neuvial
2015), which adaptively determines window sizes based on LD using the procedure described by
Dehman et al. (2015).

The performance of all methods and models was compared using receiver operating
characteristic (ROC) curves, which plot the true positive rate (TPR) against the false positive
rate (FPR) for each method (Metz 1978). I specifically chose to compare the performance of
different methods using a partial area under the curve up until an FPR of 5% (pAUC05) as also
per Chen et al. (2017). Given that a random classifier has a pAUC05 of $0.05^2/2 = 0.00125$, I
further rescaled all pAUC05 by a factor of $0.00125^{-1}$ (i.e., 800) such that the relative pAUC05 for a random
classifier is 1. An ANOVA blocking on simulated data replicate was used to compare the
different methods (GBLUP, SSVS, ssGBLUP, and ssSSVS) for pAUC05 for each specification
of nqtl.
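The relative pAUC05 described above can be sketched in plain Python. This is not the ROCR-based code used for the analyses; it is a hand-rolled illustration in which the function names `partial_auc` and `relative_pauc05` are hypothetical, tied scores are processed together, and the area is obtained by the trapezoidal rule with linear interpolation at the FPR cutoff.

```python
def partial_auc(scores, labels, max_fpr=0.05):
    """Trapezoidal area under the ROC curve from FPR = 0 up to max_fpr."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Move through all markers tied at the current score before adding a ROC point.
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            if labels[order[j]]:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
        i = j
    area = 0.0
    for k in range(1, len(fpr)):
        x0, x1, y0, y1 = fpr[k - 1], fpr[k], tpr[k - 1], tpr[k]
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:                      # interpolate the segment at the cutoff
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def relative_pauc05(scores, labels):
    """pAUC05 rescaled so that a random classifier scores 1."""
    return partial_auc(scores, labels, 0.05) / (0.05 ** 2 / 2.0)

# Uninformative scores (all tied) give the diagonal ROC curve of a random classifier.
labels_toy = [1] * 5 + [0] * 100
rel_random = relative_pauc05([0.0] * 105, labels_toy)
```

Under this scaling a perfect classifier would reach a relative pAUC05 of 0.05/0.00125 = 40, which gives a sense of the range of the values being compared.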
4.4.3 Dairy consortium data
The phenotypes that I chose for demonstration are the corresponding milk fat yields and body
weights for the 3186 genotyped Holstein cows described earlier. The data were edited as
described in Tempelman et al. (2015) and Lu et al. (2015). The complete breakdown of the number
of genotyped and phenotyped cows for each station is provided in Table 4.1. Four generation
pedigrees on all cows were provided by the USDA-AGIL.
Table 4.1 Number of cows by research station in dairy consortium study
Station^1     Number of cows
ISU           930
UW            780
AGIL          488
UF            377
USDFRC        347
MSU           264
Total         3186
^1 ISU = Iowa State University; USDFRC = USDA Dairy Forages Research Center; AGIL =
USDA Animal Genomics Improvement Laboratory; UF = University of Florida; UW =
University of Wisconsin-Madison; MSU = Michigan State University.
Since our goal was to evaluate the performance of single step models for WGP and GWA, I
randomly masked the genotypes (i.e., as non-genotyped) on a proportion of the cows.
Specifically, cows were randomly partitioned into five equally sized subsamples stratified by
station. Of the 5 subsamples within each station, the genotypes on one single subsample were
masked as non-genotyped whereas the genotypes on the remaining 4 subsamples were kept.
This masking arrangement was repeated 5 times such that each of the 5 subsamples within a
station had masked genotypes for one of the partitions that I label as P1-P5. An illustration of
the partition for one station is given in Figure 4.1. Since P1-P5 were stratified by stations, these
partitions were within-station partitions such that within station GWA assessments of the various
methods were based on P1-P5.
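The within-station masking scheme can be sketched as follows, using only the Python standard library. The function name `make_partitions`, the cow identifiers and the random seed are hypothetical; the station sizes in the example are taken from Table 4.1 for two of the six stations.

```python
import random
from collections import defaultdict

def make_partitions(station_of, n_parts=5, seed=1):
    """Return n_parts sets of cows; partition k masks the genotypes of set k."""
    rng = random.Random(seed)
    by_station = defaultdict(list)
    for cow, station in station_of.items():
        by_station[station].append(cow)
    masked = [set() for _ in range(n_parts)]
    for station in sorted(by_station):
        cows = sorted(by_station[station])
        rng.shuffle(cows)                      # random assignment within station
        for idx, cow in enumerate(cows):
            masked[idx % n_parts].add(cow)     # near-equal subsample sizes per station
    return masked

# Two stations with the cow counts reported in Table 4.1 (identifiers are made up).
station_of = {f"ISU{i}": "ISU" for i in range(930)}
station_of.update({f"MSU{i}": "MSU" for i in range(264)})
masked_sets = make_partitions(station_of)
```

Each cow is masked in exactly one of the five partitions, so roughly 20% of the cows within every station are treated as non-genotyped in any given partition.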
[Figure: partitions P1-P5; legend: Non-genotyped, Genotyped]
Figure 4.1 Illustration of within station partitions P1-P5 for one particular station. 20%
of the cows are marked as non-genotyped with the remaining 80% cows treated as
genotyped in each partition
The prediction accuracy of WGP was evaluated using 5-fold cross-validation (CV) for each
single partition of P1-P5 across herds; i.e., a total of 25 folds. That is, each of the P1- P5
partitions within each station was further subpartitioned into 5 orthogonal subsets such that for
each partition P1-P5, a 5-fold cross-validation of 4 training orthogonal subsets and 1 validation
orthogonal subset included both genotyped and non-genotyped animals. An illustration for one
such partition, say P1, is provided in Figure 4.2.
[Figure: five validation folds; legend: Non-genotyped Training, Non-genotyped Validation,
Genotyped Training, Genotyped Validation]
Figure 4.2 Example of training vs. validation partition for P1 (from Figure 4.1)
4.4.4 Benchmarking analysis for dairy consortium data
To provide benchmark gold standards for all assessments involving animals with missing
genotypes, I conducted baseline WGP and GWA analyses using genotypes and phenotypes on all
animals. For WGP, our benchmark assessments were based on conventional GBLUP and SSVS
analyses using the complete data, i.e. using genotypes on all 3186 cows. Then the same total 25
sub partitions described above were used to assess CV prediction accuracy against this ideal
situation. For GWA, our benchmark analyses were based on EMMAX and SSVS using the entire
dataset with all cows treated as genotyped. GWA results with non-genotyped cows in the
cross-validation study described below were then compared against the benchmark for locations having
strong measures of association (i.e., low P-values or high PPA).
4.4.5 Cross-validation study for dairy data
I specified parity class as fixed effects, a fourth-order polynomial regression on days in milk
(DIM), and the random effects of rations, test dates, and genetics (i.e. $\mathbf{u}^{(\alpha)}$ and $\boldsymbol{\varepsilon}$) in the WGP
model. To save computing time and to stabilize REML convergence of variance components,
variance components for ration effects and test date effects were estimated just once from the
entire dataset using genotype information on all cows. The values for these variance components
were then fixed to those estimates for all subsequent cross-validation comparisons. I separately
compared the WGP CV accuracies of GBLUP, SSVS, ssGBLUP and ssSSVS for genotyped
cows, and the WGP CV accuracies of ssGBLUP and ssSSVS for non-genotyped cows.
For GWA, I examined the performance of different models both within and across station
cross validation studies. I also fitted parity class, ration and test date as fixed effects, and up to a
fourth-order polynomial on DIM as covariates in both situations. I based our 5-fold within
station cross-validation study on partitions P1-P5 as described previously. For the across station
study, I constructed a 6-fold cross-validation study by masking the genotypes from all cows
within one station with genotype information available on cows from all other stations, one
station at a time for a total of 6 folds. I compared the location of peaks (i.e. strongest GWA
associations) of EMMAX, ssEMMAX, SSVS and ssSSVS with corresponding benchmarking
results from EMMAX and SSVS which treated all genotypes as known. In addition, I also
compared the peak strength of association in ssEMMAX versus EMMAX based on -log10(P-value),
and ssSSVS versus SSVS based on PPA for the most significant single SNP or adaptive
window based associations. EMMAX and ssEMMAX were implemented in the same manner as
in the simulation study. SSVS and ssSSVS were also implemented as in the simulation study
except that I fixed $\pi_\phi = 0.0001$ for milkfat and $\pi_\phi = 0.02$ for body weight in all analyses
because of poor mixing when estimating $\pi_\phi$ using the benchmark data. These values for $\pi_\phi$
were chosen from the set of [0.0001, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1] as having the highest
average prediction accuracy from 5-fold cross-validation on the benchmark data in a manner
similar to Lee et al. (2017). All reported WGP and GWA inferences were based on the
HOMVAR specification for variance components for ssGBLUP and ssEMMAX, i.e. $\sigma_\alpha^2 = \sigma_u^2$, due
to computational expedience and fast and stable convergence.
Additional standalone studies were conducted to assess whether the HOMVAR specification
was a better fit than the HETVAR specification. I adopted the likelihood ratio tests using GWA
partitions for milkfat. The analyses were conducted among 5 within station partitions (P1-P5)
and among 6 across station splits of genotyped and non-genotyped animals. I reassessed GWA
under HETVAR when $H_0: \sigma_\alpha^2 = \sigma_u^2$ was rejected. In addition, for the 6 across station splits, I also
tested the hypothesis on the 6 "flipped" partitions, in which only 1 station was treated as
genotyped while the other 5 stations were treated as non-genotyped.
4.4.6 Software
In addition to using BALD and ROCR, I have developed a tool to implement ssSSVS and ssEMMAX
for both WGP and GWA (single SNP and window based), which is included in the BATools R
package (https://github.com/chenchunyu88/BATools).
4.5 Results
4.5.1 Simulation Study
Method specific boxplots of WGP accuracies of EBV across the 10 replicated data sets for
each nqtl are provided in Figure 4.3. The single-step approaches (both ssGBLUP and ssSSVS)
had higher WGP accuracies than their conventional counterparts (GBLUP and SSVS) that
ignored phenotypes on non-genotyped animals. In turn, ssSSVS had higher WGP accuracies
than ssGBLUP except when nqtl =3000, noting that the advantage for ssSSVS was largest for
simpler genetic architectures; i.e., nqtl = 30. For non-genotyped cows, ssSSVS led to a higher
WGP accuracy compared to ssGBLUP when nqtl = 30 and 300 with no evidence of a difference
when nqtl = 3000.

Figure 4.3 Boxplots of prediction accuracies of breeding values of genotyped and non-genotyped
cows based on the simulation study of different nqtl of 30, 300 and 3000. Panel A)
nqtl=30 for genotyped cows; Panel B) nqtl=300 for genotyped cows; Panel C) nqtl=3000 for
genotyped cows; Panel D) nqtl=30 for non-genotyped cows; Panel E) nqtl=300 for non-genotyped
cows; Panel F) nqtl=3000 for non-genotyped cows; Methods not sharing the same letter code
within each panel have different mean prediction accuracies (P<0.05).
Comparisons of relative pAUC05 on GWA performance between the various methods are
provided in Figure 4.4. Using the adaptive window approach, SSVS and ssSSVS outperformed
EMMAX and ssEMMAX. Furthermore, the single-step procedures did not typically lead to a
higher pAUC05 relative to their conventional counterparts. When nqtl=30, there was no evidence
of an advantage for using a single-step approach whereas for nqtl=300 and 3000; only ssEMMAX
had a higher pAUC05 than EMMAX although these differences were very small. Using single
SNP association testing, differences were only detected when nqtl=30, where both EMMAX and
ssEMMAX had higher pAUC05 than SSVS. As for the value for $\pi_\phi$, I found that sampling $\pi_\phi$
or fixing $\pi_\phi = 0.01$ led to equivalent pAUC05 for ssSSVS and SSVS, but fixing $\pi_\phi = 0.001$ led
to ssSSVS having lower pAUC05 than SSVS (P<0.05) as shown in Figure 4.5.
Figure 4.4 Boxplot of relative pAUC05 for each method on the simulation study of different
nqtl of 30, 300 and 3000. The first row is the relative pAUC05 using single SNP approach and the
second row is the relative pAUC05 using adaptive window approach. Panel A) nqtl=30 for single
SNP; Panel B) nqtl=300 for single SNP; Panel C) nqtl=3000 for single SNP; Panel D) nqtl=30 for
adaptive window; Panel E) nqtl=300 for adaptive window; Panel F) nqtl=3000 for adaptive
window. Methods not sharing the same letter code are significantly different from each other
within each plot (P<0.05)

Figure 4.5 Boxplot of relative pAUC05 for SSVS and ssSSVS on the simulation study with
nqtl=3000 for adaptive window approach based on different specifications for $\pi_\phi$: Panel A)
$\pi_\phi = 0.001$, Panel B) $\pi_\phi = 0.01$ and Panel C) joint MCMC sampling of $\pi_\phi$. Methods not sharing
the same letter code are significantly different from each other within each plot (P<0.05).
4.5.2 Dairy Data
The mean cross-validation WGP accuracies based on treating all cows as genotyped are
provided in Table 4.2 for benchmarking purposes. Here, SSVS outperformed GBLUP for milk
fat whereas no difference in WGP accuracy between GBLUP and SSVS was determined for
body weight. At any rate, differences were very small in either case (i.e. less than 2 percentage
points).
Table 4.2 Cross-validation (25-fold) prediction accuracies for comparing GBLUP and SSVS
(all animals genotyped) in benchmark analysis
Trait          GBLUP      SSVS
Milk fat       0.7126a    0.7156b
Body weight    0.7645a    0.7643a
Values not sharing the same letter within a row have different (P <0.05) prediction accuracy.
With genotypes on 20% of the cows being masked, SSVS demonstrated higher (P<0.05)
WGP accuracies compared to GBLUP on genotyped cows for milkfat; similarly, ssSSVS
outperformed ssGBLUP for milkfat (Table 4.3). However, for body weight, no difference in
WGP accuracy was detected between ssSSVS and ssGBLUP whereas SSVS had higher WGP
accuracy than GBLUP (P<0.05). At any rate, single step approaches outperformed their
conventional counterparts for both traits within either model (GBLUP/SSVS) although,
admittedly, differences were small. Nevertheless, no difference in GEBV accuracy was found
between ssSSVS and ssGBLUP for either trait on non-genotyped cows (Table 4.4).
Table 4.3 Cross-validation (25-fold) prediction accuracies for GBLUP and SSVS and their
respective single step extensions (ssGBLUP and ssSSVS) on genotyped cows
Trait          GBLUP      ssGBLUP    SSVS       ssSSVS
Milk fat       0.7037a    0.7102b    0.7081b    0.7123c
Body weight    0.7461a    0.7601c    0.7564b    0.7597c
Values not sharing the same letter within a row have different (P <0.05) prediction accuracies.
Table 4.4 Cross-validation (25-fold) prediction accuracies for GBLUP and SSVS and their
respective single step extensions (ssGBLUP and ssSSVS) on non-genotyped cows
Trait          ssGBLUP    ssSSVS
Milk fat       0.7101a    0.7085a
Body weight    0.7556a    0.7547a
Values not sharing the same letter within a row have different (P <0.05) prediction accuracies.
When genotypes on all cows were used for benchmarking GWA analyses on milkfat, the
highest peak determined by EMMAX (Figure 4.6A) based on single SNP associations was
located at 1801.116kb (SNP ARS-BFGL-NGS-4939) on chromosome 14 within the same region
as the DGAT1 gene located between 1795.425kb and 1804.838kb and known to be a major gene
influencing milk fat yield (Grisart et al. 2002). The peak window identified by EMMAX using
the adaptive window approach ranged from 1189.341kb to 1801.116kb on chromosome 14
whereas 5 other neighboring windows on chromosome 14 (1868.636kb-2084.067kb,
2217.163kb-2239.085kb, 2276.443kb-2674.264kb, 2790.501kb-2909.929kb, and
3029.996kb-3059.698kb) were also deemed to have significant associations with milk fat. For both sets of
GWA cross-validation studies (i.e., within station and across station splits) using EMMAX or
ssEMMAX, the single SNP associations based on all training data had the same peak as the
benchmark analysis. Similarly, for the adaptive window associations, peaks shifted between the
4 windows listed above with the exception of two training datasets: partition P5 in the within
station study, where region 45150.817kb-46093.561kb on chromosome 16 had the strongest
association; and the subset excluding UF data in the across station cross-validation study, for
which another region 44931.986kb-45039.750kb on chromosome 8 had the strongest association
(Figures B5 and B10 in Appendix B).

Figure 4.6 Manhattan plot for milkfat treating all cows as genotyped in benchmarking study.
Panel A: single SNP inferences for EMMAX; Panel B: adaptive window inferences for
EMMAX; Panel C: single SNP inferences for SSVS; Panel D: adaptive window inferences for
SSVS.

Figure 4.7 Manhattan plot for body weight treating all cows as genotyped in benchmarking
study. Panel A: single SNP inferences for EMMAX; Panel B: adaptive window inferences
for EMMAX; Panel C: single SNP inferences for SSVS; Panel D: adaptive window
inferences for SSVS.
To further assess the benefit of adopting single-step extensions of GWA, I compared the
measured strengths of association (i.e., -log10(P-value) or PPA) for the peak SNP or window for
ssEMMAX or ssSSVS versus their conventional counterparts. For milk fat, SNP
ARS-BFGL-NGS-4939 was overwhelmingly the most significant association in the benchmark analysis using
all available genotypes with either EMMAX or SSVS; hence our attention was focused on
ARS-BFGL-NGS-4939 for milk fat. For body weight, I focused on SNP marker
ARS-BFGL-NGS-109285 located at 57589.121kb on chromosome 18 since its inferred strength of association
dominated all other markers. I particularly focused our attention on the gains in -log10(P-value)
or PPA, respectively, using their single step extensions on five separate analyses where the
genotypes on partitions P1-P5 were masked for a within station assessment as well as on the six
separate analyses where the genotypes on each station were masked in turn for the across herd
analysis. For single SNP inferences, I noticed that the mean -log10(P-values) on
ARS-BFGL-NGS-4939 were not different for the within station and across station analyses on milk fat when
ssEMMAX was used instead of EMMAX (Table 4.5); similarly, there was no evidence of such a
difference in mean -log10(P-values) for body weight (Table 4.6). The mean PPA for
ARS-BFGL-NGS-4939 increased from 0.84 for SSVS to 0.91 for ssSSVS for within station analyses
and from 0.69 to 0.88 for across station analyses, but the differences were not deemed significant
(P>0.05); furthermore, the corresponding differences for body weight were rather trivial.
Table 4.5 Average (n=5 fold for within herds and n=6 fold for across herds) measures of
strength of association (-log10 P-value using GBLUP or posterior probability using SSVS) for
most significant SNP/genomic region using single-step compared to conventional specifications
on milk fat
            -log10 P-value                                 Posterior probability
            Single SNP        Adaptive window              Single SNP        Adaptive window
Methods     Within   Across   Within   Across    Methods   Within   Across   Within   Across
EMMAX       8.73a    9.20a    5.67a    5.88a     SSVS      0.84a    0.69a    0.98a    0.75a
ssEMMAX     9.23a    8.94a    6.16a    5.61a     ssSSVS    0.91a    0.88a    0.91a    0.92a
Values not sharing the same letter within a column have different (P <0.05) height for peaks in
the Manhattan plot. For single SNP approach, the reference SNP ARS-BFGL-NGS-4939 is
located at 1801.116kb in chromosome 14 for both ssEMMAX and ssSSVS. For adaptive window
approach, the reference window for ssEMMAX ranges from 1868.636kb to 2084.067kb (3868th
window) in chromosome 14 and the reference window for ssSSVS ranges from 1189.341kb to
1801.116kb (3867th window) in chromosome 14.
Table 4.6 Average (n=5 fold for within herds and n=6 fold for across herds) measures of
strength of association (-log10 P-value using GBLUP or posterior probability using SSVS) for
most significant SNP/genomic region using single-step compared to conventional specifications
on body weight
            -log10 P-value                                 Posterior probability
            Single SNP        Adaptive window              Single SNP        Adaptive window
Methods     Within   Across   Within   Across    Methods   Within   Across   Within    Across
EMMAX       4.70a    4.79a    9.37a    9.93a     SSVS      0.32a    0.34a    0.64a,1   0.67a,2
ssEMMAX     4.88a    4.82a    9.70a    10.04a    ssSSVS    0.32a    0.37a    0.68a,1   0.70a,2
Values not sharing the same letter within a column have different (P <0.05) height for peaks in
the Manhattan plot. Superscripts 1 and 2 denote P=0.08 and P=0.06, respectively, for the difference
between ssSSVS and SSVS. The adaptive window for ssEMMAX/EMMAX ranges from 8551.460kb to 8560.116kb
in chromosome 14; and the adaptive window for ssSSVS/SSVS ranges from 88350.890kb to
88668.261kb in chromosome 6.
I conducted the same comparison based on adaptive window inferences. Based on the
benchmark analyses, different but neighboring peak windows (1189.341kb to 1801.116kb) on
chromosome 14 were determined by EMMAX and SSVS, respectively, as being most significant
for milk fat. Using these as the respective reference regions for the assessment of single step
extensions of these two models, it was again determined that ssEMMAX and EMMAX were not
significantly different from each other whereas large albeit non-significant improvements in
mean PPA were observed for single step extensions of SSVS from 0.75 to 0.92 based on the across
herd partition (Table 4.5). Similarly, for body weight, the peak windows differed between the
two benchmark analyses, being Window 3894 (ranging from 8551.460kb to
8560.116kb on chromosome 14) for EMMAX and Window 3886 (ranging from 7104.148kb to
7342.696kb on chromosome 14) for SSVS. Single step extensions of SSVS for
genomic window associations had slightly higher (but not statistically significant) PPA than
regular SSVS on body weight for both within and across station analyses with P=0.08 and P=0.06,
respectively (Table 4.6). However, for EMMAX, as with single SNP inferences, single step extensions had
little merit for within and across herd splits of genotyped and non-genotyped animals.
I also explored the HETVAR ($\sigma_\alpha^2 \neq \sigma_u^2$) versus HOMVAR ($\sigma_\alpha^2 = \sigma_u^2$) specifications for the
ssGBLUP/ssEMMAX model for both within and across station splits of genotyped and
non-genotyped animals, focusing only on milk fat. As anticipated, with random within station splits
of genotyped and non-genotyped animals, there was no statistical evidence to refute the HOMVAR
specification. However, the HETVAR specification may converge very slowly (i.e., may not converge
after 50 iterations), as it did for Partition P2 in Table 4.7, even after using the HOMVAR
estimates as joint starting values for $\sigma_u^2$ and $\sigma_\alpha^2$. Nevertheless, most partitions did converge
within 10 iterations under the HETVAR analysis.
Table 4.7 Likelihood ratio test on $H_0: \sigma_\alpha^2 = \sigma_u^2$ for within station study in milk fat
Partition    $\sigma_\alpha^2$    $\sigma_u^2$    p-value
P1           0.0174               0.0139          0.35
P2           Did not converge
P3           0.0173               0.0096          0.20
P4           0.0172               0.0167          0.45
P5           0.0171               0.0251          0.20

I conducted the across station comparison of HOMVAR versus HETVAR specifications in
two different ways. Firstly, the genotypes of each station were masked, one station at a time for
6 different analyses with likelihood ratio tests provided in Table 4.8. Most analyses either did
not converge or failed to reject $H_0: \sigma_\alpha^2 = \sigma_u^2$. However, it was rather curious
that the analysis based on masking the genotypes from the ISU station indicated that its
pedigree-based assessment of polygenic variance $\sigma_u^2$ exceeded the genomic variance $\sigma_\alpha^2$ for the other
stations. When masking was "flipped" (i.e. all ISU genotypes were available with all other
genotypes masked), then $\sigma_\alpha^2 > \sigma_u^2$ (Table 4.9), confirming that the difference in genetic
variability between ISU and the other stations was not simply due to a difference in scaling
between genomic and pedigree-based relationship matrices, but rather because of the
heterogeneity of genetic variances across stations. In fact, when the HETVAR specification was
used with the ISU genotypes being masked, stronger measures of association for the top SNP and
genomic windows were determined using single step extensions of EMMAX and SSVS relative
to using the same extensions under the HOMVAR specification (Figure 4.8).
Table 4.8 Likelihood ratio test on $H_0: \sigma_\alpha^2 = \sigma_u^2$ for milk fat across station splits where
respective analysis masked genotypes for research station as indicated below

Station with masked genotypes
(# of cows per station)         $\sigma_\alpha^2$   $\sigma_u^2$   p-value
ISU (930)                       0.014               0.050          2.29e-08
MSU (264)                       Did not converge
USDFRC (347)                    0.016               0.027          0.13
UW (780)                        Did not converge
FL (377)                        0.017               0.006          0.06
AGIL (488)                      0.016               0.018          0.40


Table 4.9 Likelihood ratio test on $H_0: \sigma_\alpha^2 = \sigma_u^2$ for milk fat across station splits where
respective analysis masked genotypes on all other research stations except for research station as
indicated below

Genotype included station
(# of cows per station)         $\sigma_\alpha^2$   $\sigma_u^2$   p-value
ISU (930)                       0.039               0.012          3.81e-06
MSU (264)                       0.010               0.027          0.02
USDFRC (347)                    0.023               0.022          0.48
UW (780)                        0.008               0.032          4.25e-06
FL (377)                        0.010               0.026          0.01
AGIL (488)                      0.019               0.024          0.20


Figure 4.8 Manhattan plots for milk fat when masking the genotypes of ISU cows, using ssEMMAX.
Panel A: single SNP inferences under the HETVAR specification; Panel B: adaptive window
inferences under the HETVAR specification; Panel C: single SNP inferences under the HOMVAR
specification; Panel D: adaptive window inferences under the HOMVAR specification.
4.6 Discussion
The goal of this study was to evaluate the potential merit of single step extensions of GBLUP
and SSVS for WGP and GWA. Based on simulation, the prediction accuracies from using
single-step procedures were generally greater within either class of model, i.e.,
ssGBLUP>GBLUP and ssSSVS>SSVS. For traits controlled by a relatively small number of
QTL (nqtl = 30 and 300), SSVS always had higher WGP accuracy than GBLUP and,
correspondingly, ssSSVS always had higher WGP accuracy than ssGBLUP for genotyped
animals, as demonstrated by simulation. For more complex traits (nqtl = 3000), there was no
evidence that SSVS had a WGP accuracy different from GBLUP, nor ssSSVS from ssGBLUP.
For non-genotyped cows, the choice of ssSSVS versus ssGBLUP appeared to be less important,
with perhaps a small advantage for ssSSVS under simpler genetic architectures (nqtl = 30 and
300) based on the simulation study.
This WGP advantage for single step extensions was also demonstrated by a cross-validation
application to milk fat and body weight data from a dairy cattle consortium, although the
differences there were very small. It might not be too surprising that SSVS also outperformed
GBLUP for milk fat, which is known to be dominated by DGAT1 on chromosome 14 (Grisart et al.
2002), with similar advantages of Bayesian methods having been found for milk fat in previous
studies (Hayes et al. 2010). Body weight may be more complex (i.e., effectively more polygenic)
than milk fat (Pryce et al. 2012), such that there may be less of a distinction between the two
models. Nevertheless, SSVS appeared to have a higher cross-validation WGP accuracy than GBLUP,
whereas there was no such evidence of a difference between ssSSVS and ssGBLUP for body
weight. There appeared to be no such distinction in cross-validation prediction accuracies for
non-genotyped cows based on the analysis of either milk fat or body weight in the dairy
consortium study; however, ssSSVS had higher prediction accuracy for the non-genotyped cows
in our simulation study with nqtl = 30 and 300. The discrepancy between the simulation study and
the real dairy data analysis may be explained by differences between the simulated genetic
architecture and the QTL distribution in the real dataset. Additionally, in the simulation, I
masked 50% of the cows as non-genotyped, whereas in the real dairy dataset about 20% of the
cows were treated as non-genotyped; it is reasonable to expect the single-step approach to have a
greater advantage at a higher ratio of non-genotyped to genotyped animals, because more
phenotypic information is incorporated relative to the regular approach. Larger differences
from using single-step Bayesian approaches might be observed in other datasets or applications,
e.g., Lee et al. (2017). Finally, confirming results previously summarized by Legarra et al.
(2014), our results suggest that ssGBLUP can lead to higher WGP accuracies than conventional
implementations of methods using only genotyped individuals, particularly for more complex
traits. For example, ssGBLUP had a higher WGP accuracy than SSVS (using phenotypes on
genotyped animals only) for body weight.
I also evaluated the merit of single step extensions for GWA. I determined that ssSSVS had
higher pAUC05 than ssEMMAX based on an adaptive window approach to GWA for
each specification of nqtl; however, I detected no such differences for single SNP
associations. Thus, I recommend using ssSSVS for GWA based on adaptively selected
windows. In fact, the use of genomic window associations led to pAUC05 values that were
often multiples of the pAUC05 values derived from single SNP associations, thereby suggesting a
proportionately greater number of true positives using genomic window based associations
up to FPR = 0.05. Somewhat surprisingly, ssEMMAX had significantly better pAUC05
performance than EMMAX for more polygenic cases (nqtl = 300 and 3000) using the adaptive
window approach, but again the differences were very small. For single SNP associations,
ssEMMAX was not significantly different from EMMAX, nor was ssSSVS significantly
different from SSVS, for pAUC05. Our simulation suggested there may be little
advantage of using the single-step approach for GWA for single marker associations, at least for
Bayesian methods.
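To make the pAUC05 comparisons above more concrete, the following is a minimal sketch of a rank-based partial area under the ROC curve, assuming that pAUC05 denotes the area under the ROC curve for false positive rates between 0 and 0.05, computed from the ranking of the association measures (-log10(P-values) or PPA). The function name `partial_auc`, the tie handling (ties are broken arbitrarily), and the toy scores are assumptions for illustration only; any rescaling used in the actual analyses is not reproduced here.

```python
def partial_auc(scores, is_causal, max_fpr=0.05):
    """Area under the step ROC curve for FPR in [0, max_fpr].

    scores    : association measure per marker/window (larger = stronger)
    is_causal : 1 if the marker/window harbors a true QTL, else 0
    """
    n_pos = sum(is_causal)
    n_neg = len(is_causal) - n_pos
    # Rank markers from strongest to weakest association.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    area, tp, fp = 0.0, 0, 0
    for i in order:
        if is_causal[i]:
            tp += 1            # ROC curve moves up
            continue
        left = fp / n_neg      # ROC curve moves right by 1 / n_neg
        fp += 1
        right = min(fp / n_neg, max_fpr)
        if right > left:
            area += (right - left) * tp / n_pos
        if fp / n_neg >= max_fpr:
            break              # nothing beyond max_fpr contributes
    return area


# Toy example: 2 causal markers and 20 null markers (hypothetical values).
scores = [0.9, 0.8] + [0.5 - 0.01 * k for k in range(20)]
labels = [1, 1] + [0] * 20
print(partial_auc(scores, labels))
```

Because only the ordering of the scores enters the calculation, no significance threshold is needed, and the largest attainable value under this definition is `max_fpr` itself.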
Note that pAUC05 determinations are based on the rankings of the -log10(P-values) or PPA
and not on any thresholds for declaring significance. I suspected that the single-step approach
could lead to stronger measures of association than their conventional counterparts. Thus, I
compared -log10(P-values) or PPA values between EMMAX and SSVS and their single step
extensions for milk fat and body weight. For milk fat in the within station split cross-validation
study, the peak SNP of ssSSVS was the same as the peak SNP of SSVS for 2 out of 5 partitions,
in which SSVS and ssSSVS tied at a PPA of 1.00 (Table 4.10). In partitions P3 and P4 (Figures B3
and B4 in Appendix B), ssSSVS had a lower peak for SNP ARS-BFGL-NGS-4939 and
Window 3867 (1189.341kb to 1801.116kb on chromosome 14) than SSVS. This may be
because some of the PPA was distributed to nearby SNPs/regions in high LD with
ARS-BFGL-NGS-4939. For example, in partition P3, SNP ARS-BFGL-NGS-107379, located in the
adjacent Window 3868, had a PPA of 0.181 for ssSSVS (conventional SSVS had a PPA of 0). The LD
heatmap in Figure 4.9 showed that this SNP was in high LD with the most significant SNP,
ARS-BFGL-NGS-4939. A similar situation occurred for conventional SSVS in P5, where the PPA
was distributed to another SNP, ARS-BFGL-NGS-57820, with a PPA of 0.708 (ssSSVS had a PPA
of 0 for this SNP), in the same window and in very high LD with the originally most
significant SNP, ARS-BFGL-NGS-4939. This indicates that the window based approach can
mitigate the multicollinearity issue for SNPs in high LD within the same
window, but for long range LD, the window based approach might still be affected by this
problem. The "flanking" window approach (Fernando et al. 2017) could be applied for
these types of issues.
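The way PPA can be split between SNPs in high LD, while the window-level PPA remains high, can be illustrated with a small sketch. It assumes that the single SNP PPA is the proportion of MCMC samples in which the inclusion indicator of that SNP equals 1, and that the window PPA is the proportion of samples in which at least one SNP in the window is included; the window PPA used in this dissertation may be defined differently. The function names and the indicator samples are hypothetical.

```python
def snp_ppa(indicator_samples):
    """Per-SNP PPA: share of MCMC samples with the SNP's indicator equal to 1."""
    n_samples = len(indicator_samples)
    n_snp = len(indicator_samples[0])
    return [sum(s[j] for s in indicator_samples) / n_samples
            for j in range(n_snp)]


def window_ppa(indicator_samples, window):
    """Window PPA: share of MCMC samples with at least one SNP in `window` included."""
    hits = sum(1 for s in indicator_samples if any(s[j] for j in window))
    return hits / len(indicator_samples)


# Hypothetical chain of 10 samples for 3 SNPs. SNPs 0 and 1 are in high LD,
# so the sampler includes one or the other, but rarely both at once.
samples = [[1, 0, 0]] * 7 + [[0, 1, 0]] * 3

print(snp_ppa(samples))             # PPA is split between SNP 0 and SNP 1
print(window_ppa(samples, [0, 1]))  # the window containing both keeps full support
```

Under these assumptions, neither SNP reaches a PPA of 1 on its own, yet the window containing both does, which mirrors the pattern seen for partition P5 in Table 4.10.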


Table 4.10 Full results of within station analyses for PPA in SSVS for most significant
SNP/genomic region using single-step compared to conventional specifications on milk fat and
body weight

            Milk fat                                 Body weight
            Single SNP        Adaptive window        Single SNP        Adaptive window
Partition   SSVS    ssSSVS    SSVS    ssSSVS         SSVS    ssSSVS    SSVS    ssSSVS
P1          1       1         1       1              0.485   0.424     0.773   0.812
P2          1       1         1       1              0.269   0.314     0.590   0.668
P3          1       0.820     1       0.820          0.244   0.231     0.555   0.673
P4          1       0.753     1       0.753          0.287   0.269     0.634   0.631
P5          0.213   1         0.921   1              0.315   0.342     0.656   0.607

The single SNPs for milk fat and body weight are ARS-BFGL-NGS-4939 and BTB-01412391,
respectively. The adaptive window for milk fat ranges from 1189.341kb to 1801.116kb in
chromosome 14; and the adaptive window for body weight ranges from 7104.14888350.8907342.69688668.261kb in chromosome 14.


Figure 4.9 LD ($r^2$ metric) heatmap for chromosome 14 from 1189.341kb to 3059.698kb, which
contains all the SNPs and windows selected by EMMAX based on the benchmark. Purple stars
mark the SNPs deemed significant by EMMAX; blue circles mark the starting and
ending SNPs of the windows deemed significant by EMMAX; green circles mark the starting and
ending SNPs of the other windows in the map.
For cross-validation assessments based on across station splits of the data, the assessment of
single step extensions for both models appeared to be substantially more complicated. For two
out of the 6 partitions, where the genotypes of stations ISU and USDFRC were each masked in
turn, ssEMMAX gave weaker measures of association for both the top SNP (-log10(P-values) of
8.34 and 8.63) and the top window (-log10(P-values) of 5.36 and 4.77) than EMMAX
(-log10(P-values) of 9.68 and 9.58 for the top SNP, and of 6.99 and 5.58 for the top window),
which simply ignored the phenotypes on the respective cows with masked genotypes (Figure B6
and Figure B8). The reason might be that the HOMVAR
specification did not model the variance components correctly. For the top SNP (Table 4.11),
ssSSVS was observed to have a lower PPA than SSVS for 1 out of 6 partitions (Figure
B9), but this might have been caused by the PPA being distributed between SNP
ARS-BFGL-NGS-57820 with a PPA of 0.207 and SNP ARS-BFGL-NGS-4939 with a PPA of 0.290, which
are in high LD in the same window. For the top window, ssSSVS always had a PPA higher than or
equal to that of SSVS (3 out of 6 partitions tied). In the across station study, SSVS was more
likely to redistribute PPA to other SNPs/regions because the estimated/observed $r^2$ among the
genotyped cows in the cross-validation study might differ from that in the benchmark
population, whereas ssSSVS might have
been more stable because it uses pedigree information to "impute" genotypes for non-genotyped
cows. GWA inferences using ssSSVS may depend heavily upon the kinship between
genotyped and non-genotyped cows.
For body weight, I noticed a slightly higher PPA for ssSSVS than for SSVS when comparing the
top window from the benchmark, for both within and across station analyses (Table
4.6). Furthermore, in Table 4.11, ssSSVS had a higher PPA than SSVS in 5 out of 6 across station
partitions, meaning that ssSSVS tends to do a better job of preserving the PPA of the top window
from the benchmark. However, more reranking of top SNP markers or genomic windows occurred
for single SNP associations using ssEMMAX, and for both single SNP and window based
associations using ssSSVS, relative to the benchmark analyses. The measures of strength of
association were often no different between EMMAX and ssEMMAX for both within and across
station splits of genotyped and non-genotyped animals.
Table 4.11 Full results of across station analyses for PPA in SSVS for most significant
SNP/genomic region using single-step compared to conventional specifications on milk fat and
body weight

            Milk fat                                 Body weight
Excluded    Single SNP        Adaptive window        Single SNP        Adaptive window
herd        SSVS    ssSSVS    SSVS    ssSSVS         SSVS    ssSSVS    SSVS    ssSSVS
ISU         0.426   1         0.442   1              0.275   0.372     0.641   0.672
MSU         0.335   1         0.633   1              0.631   0.648     0.698   0.753
USDFRC      1       1         1       1              0.209   0.184     0.702   0.683
UW          0.401   0.290     0.404   0.505          0.128   0.149     0.372   0.397
FL          1       1         1       1              0.465   0.533     0.762   0.816
AGIL        1       1         1       1              0.312   0.361     0.869   0.884

The single SNPs for milk fat and body weight are ARS-BFGL-NGS-4939 and BTB-01412391,
respectively. The adaptive window for milk fat ranges from 1189.341kb to 1801.116kb in
chromosome 14; and the adaptive window for body weight ranges from 7104.14888350.8907342.69688668.261kb in chromosome 14.
The HOMVAR assumption, which considers the marker based genomic variance and the pedigree
based genetic variance to be the same, is used in almost all current ssGBLUP
applications, whether for genetic evaluations (Legarra et al. 2014) or for GWA (Wang et al.
2012; Zhang et al. 2016). Our analysis, particularly based on across station splits for
cross-validation, suggests that this assumption should be used carefully, as the two variance
components, $\sigma_\alpha^2$ and $\sigma_u^2$, can be quite different (Table 4.8). In addition to excluding one station at a time, I also
conducted likelihood ratio tests for the reverse situation, including one station at a time (i.e.,
excluding 5 stations at a time), with some of the tests again suggesting that $\sigma_\alpha^2$ can be different
from $\sigma_u^2$ (Table 4.9), and with the reversal in magnitude suggesting that the issue pertains to true
heterogeneity of genetic variances across herds rather than differences in scaling between
pedigree versus genomic based relationship matrices. For example, for ISU, the tests treating
it as non-genotyped and treating it as the only genotyped station both suggest that the genetic
variation of ISU is different from that of the other 5 stations. It then seems reasonable to
estimate $\sigma_\alpha^2$ and $\sigma_u^2$ separately in some cases, particularly when some herds contributing data consist exclusively of
genotyped or non-genotyped animals. It also seems reasonable and necessary to consider modeling
herd-specific heterogeneity in both $\sigma_\alpha^2$ and $\sigma_u^2$ jointly with herd-specific heterogeneity in $\sigma_e^2$, as
recently explored by Ou et al. (2016), as WGP and GWA inferences could be quite sensitive to
those specifications.
Currently, the single-step Bayesian models are based on computing the "imputed" genotypes for
non-genotyped animals based on $\mathbf{A}^{nn}\hat{\mathbf{T}}_n = -\mathbf{A}^{ng}\mathbf{T}_g$. This procedure can be both memory and CPU
demanding when the number of non-genotyped animals is large because $\hat{\mathbf{T}}_n$ is not sparse. Fernando et
al. (2016) provided an algorithm that avoids storage and multiplication of $\hat{\mathbf{T}}_n$ such that the
number of non-genotyped animals is less of a concern. Another important advantage of the Bayesian
single-step approach is the flexibility to use any prior to accommodate different genetic
architectures, with extensions to the entire existing "Bayesian Alphabet". Recently, Lee et al. (2017)
applied two single-step Bayesian regression (SSBR) models to Hanwoo beef cattle, in which
they found that SSBR led to higher WGP accuracy than ssGBLUP for traits associated with a small
number of QTL with large effects (similar to our milk fat trait with DGAT1 and nqtl=30 or 300 in
the simulation study), and no disadvantages of SSBR were found for all other traits (similar to our
body weight trait and nqtl=3000 in the simulation study). Therefore, based on their study and our
results, single-step Bayesian models are promising for WGP analysis with different types of
traits.
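The genotype "imputation" step at the start of the preceding paragraph can be sketched on a toy pedigree. This sketch assumes that $\mathbf{A}^{nn}$ and $\mathbf{A}^{ng}$ are the blocks of the inverse of the pedigree-based numerator relationship matrix $\mathbf{A}$ for non-genotyped (n) and genotyped (g) animals, so that solving $\mathbf{A}^{nn}\hat{\mathbf{T}}_n = -\mathbf{A}^{ng}\mathbf{T}_g$ is equivalent to the regression $\hat{\mathbf{T}}_n = \mathbf{A}_{ng}\mathbf{A}_{gg}^{-1}\mathbf{T}_g$. The pedigree and genotype values are hypothetical, and the dense inverse used here is only practical for toy problems; this is not the BATools implementation.

```python
import numpy as np

# Toy pedigree: animals 0 and 1 are unrelated, genotyped parents;
# animal 2 is their non-genotyped offspring.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
g = [0, 1]   # genotyped animals
n = [2]      # non-genotyped animals

# Hypothetical standardized genotypes of the genotyped animals (2 x 3 SNP).
T_g = np.array([[0.8, -0.2, 1.1],
                [-0.4, 0.6, 0.3]])

# Blocks of the inverse relationship matrix.
A_inv = np.linalg.inv(A)
A_nn = A_inv[np.ix_(n, n)]
A_ng = A_inv[np.ix_(n, g)]

# Solve A^{nn} T_n_hat = -A^{ng} T_g for the imputed genotypes.
T_n_hat = np.linalg.solve(A_nn, -A_ng @ T_g)

# Equivalent regression form using blocks of A itself.
T_n_alt = A[np.ix_(n, g)] @ np.linalg.solve(A[np.ix_(g, g)], T_g)

print(T_n_hat)
print(np.allclose(T_n_hat, T_n_alt))
```

For this pedigree the imputed genotype of the offspring is simply the average of its two parents, which is the expected result when both parents are genotyped and unrelated.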
Another important factor to consider in ssSSVS and SSVS is the specification of the
hyperparameter $\pi_\phi$. It is well known that hyperparameter specifications can significantly influence
WGP accuracies (Lehermeier et al. 2013; Yang et al. 2015b). However, such hyperparameters can
be difficult to estimate with a large number of SNP markers using MCMC. Lee et al. (2017)
demonstrated that specifications for hyperparameters like $\pi_\phi$ can be effectively determined using
cross-validation, recognizing that a poorly estimated or misspecified $\pi_\phi$ may lead to WGP accuracy
inferior to that of ssGBLUP for some traits. Previous studies comparing ssGBLUP and Bayesian models for
WGP, such as Lourenco et al. (2013), might be somewhat flawed as the proportion of non-zero
effects (similar to our $\pi_\phi$) in the Bayesian model was arbitrarily set to 0.04 without any prior assessment
by cross-validation or estimation. In the simulation study, the average estimated $\pi_\phi$ was 0.051
(with poor mixing), and a fixed $\pi_\phi$ of 0.01 resulted in higher pAUC05 for both
SSVS and ssSSVS compared to a fixed $\pi_\phi$ of 0.001. Moreover, a fixed $\pi_\phi$ of 0.001 led to non-intuitive
results where ssSSVS had lower pAUC05 than SSVS. Overall, misspecification of $\pi_\phi$ might lead
to inferior GWA results (Figure 4.5). Furthermore, $\pi_\phi$ might also be an important factor
in GWA because the smaller $\pi_\phi$ is, the fewer SNPs/regions will be selected, such that a lower
$\pi_\phi$ is more likely to force a few loci to stand out. Whether or not WGP cross-validation based
determinations of $\pi_\phi$ are optimal for GWA, and how to effectively specify such hyperparameters
for GWA in Bayesian variable selection models, require further study.


Weighted ssGBLUP (WssGBLUP) has been proposed by Wang et al. (2012) and Zhang et al.
(2016), using the proportion of variance explained as an indicator for GWA. I did not consider this
approach for two reasons: 1) it does not provide a method for assessing statistical
significance; 2) the method suffers from convergence difficulties such that WssGBLUP is
typically stopped after a fixed number of iterations, yet the choice of that number of
iterations is quite arbitrary.
4.7 Summary and Conclusions
In conclusion, I determined that ssSSVS has higher WGP accuracy than
ssGBLUP for simpler genetic architectures, i.e., traits controlled by few major genes. The use of
phenotypes on non-genotyped animals is important regardless of model (SSVS or GBLUP) or
genetic architecture (simple or complex). The choice of model seems to be more important than the
use of phenotypes on non-genotyped animals for GWA based on results from our simulation study.
Based on applications to data from a dairy consortium, single-step extensions were
deemed to be more useful for milk fat than for body weight for WGP. Single-step extensions of SSVS for
genomic windows adaptively determined using LD seem to be particularly useful for GWA.


Chapter 5 BATools: A Hierarchical Modeling R Package for Genome Prediction and
Genome-wide Association Analysis
5.1 Abstract
Whole genome prediction (WGP) and genome-wide association (GWA) analyses are being
extensively used in animal breeding and other quantitative genetic applications. Both types of
analyses are typically characterized by high dimensional inference based on thousands of SNP
markers. Bayesian regression methods have been developed to address such problems by
providing shrinkage based inference and variable selection. The BATools R-package
(https://github.com/chenchunyu88/BATools) implements a collection of such Bayesian
regression tools as well as genomic best linear unbiased prediction (GBLUP) for both WGP and
GWA. Features of BATools include the incorporation of phenotypes of non-genotyped
individuals using pedigree information, performing window based GWA, and modeling
correlation between adjacent SNP using a first order antedependence correlation assumption.
Algorithm choices range from Markov Chain Monte Carlo samplers to analytical
approximations based on the use of the EM algorithm along with restricted maximum likelihood
(REML) like estimators of variance components. The software is efficiently implemented
utilizing C/C++ code for the most time-consuming computations. The focus of this article is to
discuss the models in BATools and their usage in real-data analysis.
5.2 Introduction
Whole genome prediction (WGP) utilizing dense single nucleotide polymorphism (SNP)
marker information has been increasingly adopted in animal and plant breeding as an important
tool for genomic selection on economically important traits (de Los Campos et al. 2013). WGP
has transformed traditional best linear unbiased prediction (BLUP) estimates of breeding values
(EBV) based on individual records and pedigree relationships into genomic EBV (GEBV) using
SNP marker panels (Meuwissen et al. 2016). Two broad categories of hierarchical linear
parametric models are available for WGP. One is genomic BLUP (GBLUP), based on a linear
mixed model with a genomic relationship matrix created from SNP markers to specify the
correlation between random individual effects, with restricted maximum likelihood (REML)
being used to estimate the underlying variance components (VanRaden 2008). The other is
hierarchical Bayesian models, which use more flexible prior specifications on SNP marker
effects, e.g., a scaled Student t (BayesA) (Meuwissen et al. 2001), a mixture of a scaled t and a point
mass at zero (BayesB) (Meuwissen et al. 2001), or stochastic search variable selection (SSVS)
(George and McCulloch 1993; Chen and Tempelman 2015), amongst several others. The major
algorithm used for inference in these Bayesian models is Markov Chain Monte Carlo (MCMC),
based almost entirely on the use of the Gibbs sampler (Casella and George 1992).
Genome-wide association (GWA) analysis, on the other hand, is a useful tool to identify SNP
markers or genomic regions that are associated with causal variants or quantitative trait loci
(QTL). Simply put, GWA tests the null hypothesis that a marker or region has no effect
on a trait, with tests run across all SNP markers/genomic regions. Similar to
WGP, GWA has been based on the same two broad categories of models. A popular strategy,
EMMAX, treats the SNP marker of interest as fixed with all other marker effects as random to
account for population structure through traditional GBLUP-like or mixed effects models (Kang et
al. 2010). Additional modifications and computational enhancements have been described elsewhere
(Lippert et al. 2011; Zhou and Stephens 2012; Gualdron Duarte et al. 2014). More recently,
hierarchical Bayesian models have also been implemented for GWA based on posterior probabilities of
association (PPA) (Moser et al. 2015; Fernando et al. 2017). Chen et al. (2017) extended both
types of models for genomic region based GWA.

With more individuals genotyped over the last decade, yet with many if not most individuals
not yet genotyped for various reasons, there has been increasing interest in combining the phenotypic
information from both genotyped and non-genotyped individuals to improve the accuracy of GEBV
and GWA. The single step approach is one such model that utilizes genotype, phenotype, and
pedigree information of both genotyped and non-genotyped individuals in one single model, in
contrast to a previous strategy based on blending pedigree based EBV with GEBV in a "two-step" model
(VanRaden 2008). The original single step WGP analysis was based on the GBLUP
assumptions and is hence known as ssGBLUP (Aguilar et al. 2010). Recently, Fernando et al.
(2014) extended the single step approach to allow for more flexible hierarchical Bayesian modeling
assumptions. A recent study by Lee et al. (2017) and the work in Chapter 4 indicated that single
step Bayesian models inherit the same favorable properties of regular Bayesian WGP models and
can even increase WGP accuracies for traits controlled by a smaller number of QTL. Chapter 4 also
demonstrated that single step Bayesian models had better GWA performance than the single step
EMMAX extension for window based, as opposed to single SNP, inferences.
With either category (BLUP or Bayesian) of model, the marker effects are typically specified
to be independently distributed. Gianola et al. (2003) conjectured that some SNP marker
effects might be spatially correlated within chromosomes. The Bayesian antedependence models
proposed by Yang and Tempelman (2012) extended BayesA and BayesB by modeling
nonstationary spatial correlations between adjacent SNP markers; these models, known respectively as
anteBayesA and anteBayesB, led to higher WGP accuracies with higher LD ($r^2$>0.24)
marker panels. These results have been further corroborated by others in a multiple trait
modeling context (Jiang et al. 2015). Yang and Tempelman (2012) and Tempelman (2015) also
suggested a potential benefit of sharper GWA signals using such models.


In summary, both WGP and GWA analyses are typically characterized by a much larger m
(number of SNP markers) relative to n (number of individuals). This issue has been addressed
with hierarchical linear models based on either traditional mixed model or Bayesian approaches.
It seems important to develop a software package that provides a user-friendly interface for a
variety of different models with full support for both WGP and GWA.
Currently, there are many different software packages for WGP, GWA, or both, but most of
them have a specific focus. BLUPF90 (Misztal et al. 2002) is a collection of software programs
written in FORTRAN for GBLUP models, including the popular single-step GBLUP (ssGBLUP)
that combines phenotypes on genotyped and non-genotyped animals with pedigree
information (Aguilar et al. 2010). Although BLUPF90 is efficient and suitable for analysis of
large datasets (e.g., genomic evaluations with more than 1 million records), it does not allow for
Bayesian analyses, and its weighted single-step approach suffers from convergence issues,
thereby leading to heuristic solutions (Zhang et al. 2016). Furthermore, the GWA inferences in
BLUPF90 programs do not provide formal measures of statistical significance (e.g., P-values)
(Wang et al. 2012; Zhang et al. 2016). Gensel (Fernando and Garrick 2009) is a web-based
platform for the analysis of genomic data that features a large selection of different Bayesian models for
both WGP and GWA, but it is not available for public distribution. BLR (Perez et al. 2010) and
BGLR (Perez and de los Campos 2014) are sister R-packages that implement various Bayesian
and nonparametric models concentrated on WGP. While BGLR provides a large selection of
models and supports different types of traits (continuous or categorical), it does not provide
enough support for GWA. rrBLUP (Endelman 2011) is also an R-package that implements the
GBLUP model for both WGP and GWA with a more user-friendly interface than BLUPF90,
but it does not support Bayesian analyses. synbreed (Wimmer et al. 2012) is an R-package
that provides rich data management and cleaning tools for WGP and GWA in animal and plant
breeding; however, its model fitting is done through other software packages such as BGLR.
GEMMA (Zhou and Stephens 2012) is a group of efficient tools written in C++ that focus on GWA;
although it supports Bayesian WGP, the model is based on one particular prior for marker effects,
such that the options available to users are limited for WGP. JWAS (Cheng et al. 2016) is an
open-source software tool written in Julia (Bezanson et al. 2012) for Bayesian models applied to WGP
and GWA. JWAS provides models such as BayesB and BayesC (Habier et al. 2011) as well as
their single step extensions, and it is currently the only publicly available single step Bayesian
software implementation. However, Julia is a new programming language and does not yet
have a large user community. The package is also not fully documented and does not provide
enough support for GWA in its documentation.
Although these software packages implement a few different types of hierarchical linear
models, there is currently no known open source R (R Core Team 2017) package for WGP/GWA
based on the single step approach, which utilizes genotype, phenotype, and pedigree information of both
genotyped and non-genotyped individuals. I have developed the R-package BATools to implement
such an approach. Along with single-step, BATools also implements some other Bayesian model
extensions/improvements that are not currently publicly available, including Bayesian
antedependence models for spatial correlations between adjacent SNP markers (Yang and
Tempelman 2012), a computationally tractable empirical Bayes approach for BayesA/SSVS
based on the Expectation-Maximization (EM) algorithm, and a window/region based approach for
joint testing of SNP marker effects in GWA using a fast version of EMMAX (Gualdron Duarte
et al. 2014; Bernal Rubio et al. 2016; Chen et al. 2017) and Bayesian extensions for GWA. The
package, which includes a collection of models in a unified framework for genomic data analysis, is
available on Github (https://github.com/chenchunyu88/BATools) and will shortly be available on
CRAN. The objective of this paper is to demonstrate the models, algorithms and data
implemented in the package, to present some example analyses demonstrating some key
specifications, and to provide a benchmark of computing time for the package.
5.3 Statistical Models and Algorithms
The BATools package currently supports the analysis of continuous traits. For both WGP and
GWA, the base model can be presented as:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{T}\boldsymbol{\alpha} + \mathbf{e} \qquad [5.1]$$

with

$$\mathbf{T} = \frac{\mathbf{M} - \mathbf{1}\mathbf{k}'}{\sqrt{\sum_{j=1}^{m} 2 p_j (1 - p_j)}} \qquad [5.2]$$

Here $\mathbf{y}$ is an n x 1 vector of phenotypes, $\mathbf{X}$ is a known n x p incidence matrix connecting $\mathbf{y}$ to
the p x 1 vector of unknown fixed effects and/or covariates $\boldsymbol{\beta}$, $\mathbf{T}$ is a known n x m standardized
matrix of genotypes connecting $\mathbf{y}$ to the m x 1 vector of unknown random SNP marker effects
$\boldsymbol{\alpha}$, and $\mathbf{e}$ is the random error vector. $\mathbf{M}$ is the original n x m genotype matrix with elements
coded as 0, 1, or 2, and $\mathbf{1}$ is an n x 1 vector of ones. Furthermore, element j of the m x 1 vector $\mathbf{k}$ is the mean value ($2p_j$) for the
corresponding column of $\mathbf{M}$, such that $p_j$ is the allele frequency of the reference allele of SNP
marker j = 1, 2, ..., m (VanRaden 2008). Recoding genotypes in this manner has been
demonstrated to improve algorithmic stability (Stranden and Christensen 2011). This model can
also be written as a subject-centric model (Henderson 1985):

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} + \mathbf{e} \qquad [5.3]$$


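Here $\mathbf{u} = \mathbf{T}\boldsymbol{\alpha}$ is the vector of additive genetic effects of the subjects. I also assume throughout that
$\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$. If I assume $\boldsymbol{\alpha} \sim N(\mathbf{0}, \mathbf{I}\sigma_\alpha^2)$, then $\mathbf{u} \sim N(\mathbf{0}, \mathbf{G}\sigma_\alpha^2)$ with $\mathbf{G} = \mathbf{T}\mathbf{T}'$. Then the mixed
model equations (MME) corresponding to the linear models in [5.1] and [5.3] are Equations [5.4]
and [5.5], respectively.

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{T} \\ \mathbf{T}'\mathbf{X} & \mathbf{T}'\mathbf{T} + \mathbf{I}\sigma_e^2\sigma_\alpha^{-2} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\alpha}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{T}'\mathbf{y} \end{bmatrix} \qquad [5.4]$$

and

$$\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{I} \\ \mathbf{I}'\mathbf{X} & \mathbf{I}'\mathbf{I} + \mathbf{G}^{-1}\sigma_e^2\sigma_\alpha^{-2} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\mathbf{u}} \end{bmatrix} = \begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{I}'\mathbf{y} \end{bmatrix} \qquad [5.5]$$

The variance components ($\sigma_\alpha^2$ and $\sigma_e^2$) in MME [5.4] and [5.5] can be estimated using
Average Information REML (AIREML) (Gilmour et al. 1995; Johnson and Thompson 1995),
with solutions for [5.4] and [5.5] often referred to as GBLUP (VanRaden 2008). In fact, models
[5.1] and [5.3] are equivalent ($\hat{\boldsymbol{\alpha}} = \mathbf{T}'\mathbf{G}^{-1}\hat{\mathbf{u}}$), with equation [5.3] being preferred for computing
efficiency when $m \gg n$ (Stranden and Garrick 2009).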
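The equivalence between the marker-centric system [5.4] and the subject-centric system [5.5] can be checked numerically. The sketch below uses NumPy on simulated toy data and treats the variance ratio $\sigma_e^2\sigma_\alpha^{-2}$ as known; the dimensions, allele frequencies, phenotypes, and the helper name `solve_mme` are hypothetical and do not reflect how BATools is implemented. Allele frequencies are taken as known rather than estimated from the sample; centering by sample frequencies would make every column of T sum to zero and G singular, so [5.5] could not be formed as written.

```python
import numpy as np

rng = np.random.default_rng(2017)
n, m = 8, 40                                  # fewer animals than markers (m > n)
p = rng.uniform(0.2, 0.8, size=m)             # assumed known allele frequencies
M = rng.binomial(2, p, size=(n, m)).astype(float)   # genotypes coded 0, 1, 2

# Equation [5.2]: center each column by k_j = 2 p_j, scale by sqrt(sum 2 p_j (1 - p_j)).
T = (M - 2.0 * p) / np.sqrt(np.sum(2.0 * p * (1.0 - p)))
G = T @ T.T                                   # genomic relationship matrix

X = np.ones((n, 1))                           # overall mean as the only fixed effect
y = rng.normal(size=n)                        # toy phenotypes
lam = 2.0                                     # sigma_e^2 / sigma_alpha^2, treated as known


def solve_mme(Z, penalty):
    """Solve [X'X X'Z; Z'X Z'Z + penalty] [beta; random] = [X'y; Z'y]."""
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + penalty]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:X.shape[1]], sol[X.shape[1]:]


beta_a, alpha_hat = solve_mme(T, lam * np.eye(m))    # marker-centric MME [5.4]
G_inv = np.linalg.inv(G)
beta_u, u_hat = solve_mme(np.eye(n), lam * G_inv)    # subject-centric MME [5.5]

same_beta = np.allclose(beta_a, beta_u)
same_u = np.allclose(u_hat, T @ alpha_hat)                   # u_hat = T alpha_hat
back_solved = np.allclose(alpha_hat, T.T @ G_inv @ u_hat)    # alpha_hat = T' G^{-1} u_hat
print(same_beta, same_u, back_solved)
```

In this toy setting [5.4] has m + 1 equations whereas [5.5] has only n + 1, which is the computational argument for the subject-centric form when the number of markers greatly exceeds the number of individuals.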

In a Bayesian context, I instead assign priors to the unknown parameters. The residual variance has a scaled inverse $\chi^2$
prior, i.e., $\chi^{-2}(\sigma_e^2 | v_e, v_e s_e^2)$, with degrees of freedom $v_e = -1$ and scale $s_e^2 = 0$ as the defaults in
BATools, which can be changed by the user. The fixed effects $\boldsymbol{\beta}$ are assigned flat priors.
5.3.1 Priors for marker effects
All hierarchical linear models are based directly on equation [5.1] by assigning structural
priors on SNP marker effects. Different types of Bayesian models differ from each other in the
prior distributions for SNP marker effects $\boldsymbol{\alpha}$; therefore, different prior selections may influence
WGP accuracy and GWA performance since they provide different shrinkage properties for
marker effects. Figure 5.1 provides a visualization of the four types of base priors
implemented for the SNP marker effects in BATools: a Gaussian prior, often referred to as
Bayesian ridge regression (BRR) (Hoerl and Kennard 1970); a scaled-t prior known
as BayesA, which can be written as a normal mixture of scaled inverse $\chi^2$ or Gamma distributions (Meuwissen
et al. 2001); a mixture of a point mass at zero and a scaled-t prior known as BayesB (Meuwissen et
al. 2001); and a mixture of two Gaussian densities known as SSVS (George and McCulloch
1993; Chen et al. 2017). In addition to those prior specifications, I also implemented the
antedependence models, i.e. anteBayesA and anteBayesB, to model the spatially-induced
correlations between adjacent SNP markers within the same chromosome (Yang and Tempelman
2012). Full details of the statistical expressions for these prior distributions are provided in
Table 5.1.
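The shrinkage behaviour of the four base priors can be explored by simulating from them. The base R sketch below is illustrative only: the hyperparameter values (degrees of freedom 5, inclusion probability 0.1, c = 1000) and all object names are assumptions chosen for the example, not BATools defaults.

```r
# Sketch: draws from the four base priors for a SNP marker effect
set.seed(3)
k <- 1e5
s2 <- 1; nu <- 5; pi_a <- 0.1; c_ssvs <- 1000

# BRR: common Gaussian prior
brr <- rnorm(k, 0, sqrt(s2))

# BayesA: normal with marker-specific variance from a scaled inverse chi-square,
# which marginally gives a scaled-t distribution
bayesA <- rnorm(k, 0, sqrt(nu * s2 / rchisq(k, nu)))

# BayesB: point mass at zero with probability 1 - pi_a, scaled-t otherwise
bayesB <- rbinom(k, 1, pi_a) * rnorm(k, 0, sqrt(nu * s2 / rchisq(k, nu)))

# SSVS: mixture of a wide Gaussian (tau = 1) and a narrow one (variance s2 / c)
tau  <- rbinom(k, 1, pi_a)
ssvs <- rnorm(k, 0, sqrt((tau + (1 - tau) / c_ssvs) * s2))

# Sample kurtosis as a simple summary of tail weight (3 for a Gaussian)
kurt <- function(x) mean((x - mean(x))^4) / var(x)^2
round(c(BRR = kurt(brr), BayesA = kurt(bayesA),
        BayesB = kurt(bayesB), SSVS = kurt(ssvs)), 2)
mean(bayesB == 0)   # proportion of markers with exactly zero effect
```

Heavier tails (larger kurtosis) mean less shrinkage of large effects, while the spike components of BayesB and SSVS shrink most markers to zero or close to zero.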

Figure 5.1 Visualization of prior distributions for SNP marker effects in BATools.


Table 5.1 List of models in BATools and their priors and hyperparameters

Model | Marker effect priors | Hyperparameter treatment
BRR | $\alpha_j \sim N(0, \sigma_\alpha^2)$ | $\sigma_\alpha^2 \sim \chi^{-2}(-1, 0)$
BayesA¹, anteBayesA², ssBayesA³ | $\alpha_j \sim N(0, \sigma_{\alpha_j}^2)$; $\sigma_{\alpha_j}^2 \sim \chi^{-2}(v_\alpha, v_\alpha s_\alpha^2)$ | $p(v_\alpha) \propto (1 + v_\alpha)^{-2}$; $s_\alpha^2 \sim \mathrm{Gamma}(\alpha_s, \beta_s)$
BayesB¹, anteBayesB², ssBayesB³ | $\alpha_j \sim N(0, \sigma_{\alpha_j}^2)$; $\sigma_{\alpha_j}^2 = 0$ with probability $1 - \pi_\alpha$, $\sigma_{\alpha_j}^2 \sim \chi^{-2}(v_\alpha, v_\alpha s_\alpha^2)$ with probability $\pi_\alpha$ | $p(v_\alpha) \propto (1 + v_\alpha)^{-2}$; $s_\alpha^2 \sim \mathrm{Gamma}(\alpha_s, \beta_s)$; $\pi_\alpha \sim \mathrm{Beta}(\alpha_\pi, \beta_\pi)$
SSVS⁴, ssSSVS⁵, mapSSVS⁶ | $\alpha_j \sim N(0, (\tau_j + (1 - \tau_j)/c)\sigma_\alpha^2)$; $\tau_j \sim \mathrm{Bernoulli}(\pi_\tau)$; $c \ge 1$ | $\sigma_\alpha^2 \sim \chi^{-2}(-1, 0)$; $\pi_\tau \sim \mathrm{Beta}(\alpha_\pi, \beta_\pi)$
mapBayesA⁶ | $\alpha_j \sim N(0, \sigma_\alpha^2 \tau_j)$; $\tau_j \sim \chi^{-2}(v_\alpha, v_\alpha)$ | $\sigma_\alpha^2 \sim \chi^{-2}(-1, 0)$; $v_\alpha$ fixed
Antedependence² | $t_{j,j-1} \sim N(\mu_t, \sigma_t^2)$ | $\mu_t \sim N(\mu_{t0}, \sigma_{t0}^2)$; $\sigma_t^2 \sim \chi^{-2}(v_t, v_t s_t^2)$

¹ Meuwissen et al. (2001); ² Yang and Tempelman (2012); ³ Fernando et al. (2014) and Chapter 3;
⁴ George and McCulloch (1993); ⁵ Chapter 3; ⁶ Chen and Tempelman (2015) and Chen et al.
(2017).
5.3.2 Single-step for BayesA/B and SSVS
A ssGBLUP approach was originally developed to combine phenotypes on genotyped and
non-genotyped animals with pedigree information (Aguilar et al. 2010) and has been applied to
many livestock species (Legarra et al. 2014). The single-step approach for Bayesian WGP was
first proposed by Fernando et al. (2014) to include genotyped and non-genotyped individuals:


$$\begin{bmatrix} \mathbf{y}_n \\ \mathbf{y}_g \end{bmatrix} = \begin{bmatrix} \mathbf{X}_n \\ \mathbf{X}_g \end{bmatrix}\boldsymbol{\beta} + \begin{bmatrix} \hat{\mathbf{T}}_n\boldsymbol{\alpha} + \boldsymbol{\varepsilon} \\ \mathbf{T}_g\boldsymbol{\alpha} \end{bmatrix} + \mathbf{e}$$ [5.6]

Here the linear model in equation [5.3] is partitioned into non-genotyped and genotyped individuals using
subscripts n and g. Other terms stay the same as in equation [5.1], except that $\hat{\mathbf{T}}_n$ in Equation
[5.6] is an "imputed" genotype matrix for the non-genotyped individuals that can be obtained by
solving $\mathbf{A}^{nn}\hat{\mathbf{T}}_n = -\mathbf{A}^{ng}\mathbf{T}_g$, where $\mathbf{A}^{nn}$ and $\mathbf{A}^{ng}$ are the partitions of $\mathbf{A}^{-1}$ (the inverse of the additive
relationship matrix based on pedigree) corresponding to non-genotyped by non-genotyped and
non-genotyped by genotyped animals, respectively (Fernando et al. 2014). The imputation
residuals $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, (\mathbf{A}^{nn})^{-1}\sigma_u^2)$ account for the contribution of pedigree information to breeding
values for non-genotyped animals (Fernando et al. 2014). In Chapter 4, I also demonstrated how
to apply single-step SSVS (ssSSVS) to a simulated dataset and a USDA dairy consortium dataset.
A similar implementation for updating $\boldsymbol{\varepsilon}$ was used for single-step BayesA/B (ssBayesA/B), while
the remaining parameters were updated in the same way as in conventional BayesA/B. Lee et al. (2017) and my
work in Chapter 4 recently determined that single-step Bayesian models led to better WGP
performance than ssGBLUP for traits controlled by a few QTL with large effects, with no evidence
of a disadvantage for other types of genetic architectures.
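The imputation step can be illustrated on a toy pedigree. The base R sketch below builds the additive relationship matrix with the tabular method, solves the system for the imputed genotypes of one non-genotyped animal, and confirms that the result equals the regression form $\mathbf{A}_{ng}\mathbf{A}_{gg}^{-1}\mathbf{T}_g$. The pedigree, the genotype values and the helper `make_A` are hypothetical and only serve the example; they are not the BATools implementation.

```r
# Tabular method for the additive relationship matrix A
# (parents must appear before their offspring; NA marks an unknown parent)
make_A <- function(sire, dam) {
  n <- length(sire)
  A <- matrix(0, n, n)
  for (i in seq_len(n)) {
    s <- sire[i]; d <- dam[i]
    A[i, i] <- 1 + if (!is.na(s) && !is.na(d)) 0.5 * A[s, d] else 0
    for (j in seq_len(i - 1)) {
      a <- 0
      if (!is.na(s)) a <- a + 0.5 * A[j, s]
      if (!is.na(d)) a <- a + 0.5 * A[j, d]
      A[i, j] <- A[j, i] <- a
    }
  }
  A
}

# Toy pedigree: animals 3 and 4 are full sibs, animal 5 is their offspring
ped <- data.frame(id = 1:5, sire = c(NA, NA, 1, 1, 3), dam = c(NA, NA, 2, 2, 4))
A    <- make_A(ped$sire, ped$dam)
Ainv <- solve(A)

n_idx <- 3               # non-genotyped animal
g_idx <- c(1, 2, 4, 5)   # genotyped animals

set.seed(4)
T_g <- matrix(rnorm(length(g_idx) * 6), length(g_idx), 6)  # stand-in genotypes

# Solve A^{nn} T_n_hat = -A^{ng} T_g for the imputed genotypes
T_n_hat <- solve(Ainv[n_idx, n_idx, drop = FALSE],
                 -Ainv[n_idx, g_idx, drop = FALSE] %*% T_g)

# Equivalent regression form using partitions of A itself
T_n_alt <- A[n_idx, g_idx, drop = FALSE] %*% solve(A[g_idx, g_idx], T_g)
max(abs(T_n_hat - T_n_alt))
```

Working with the partitions of the inverse is attractive in practice because the inverse of the pedigree relationship matrix is sparse, whereas the regression form requires a dense inverse of the genotyped block.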
5.3.3 Antedependence implementation
The antedependence models use a vector of association variables to model serially correlated
SNP markers in a nonstationary manner (Yang and Tempelman 2012). The model extends
equation [5.1] such that

$$\alpha_j = \begin{cases} \delta_1 & \text{if } j = 1 \\ t_{j,j-1}\alpha_{j-1} + \delta_j & \text{if } 2 \le j \le m \end{cases}$$ [5.7]

Here $\delta_j \sim N(0, \sigma_{\delta_j}^2)$, $j = 1, \ldots, m$, and $t_{j,j-1} \sim N(\mu_t, \sigma_t^2)$ is the marker interval-specific
antedependence parameter (Zimmerman and Nunez-Anton 2010) of $\alpha_j$ on $\alpha_{j-1}$. Note that $t_{j,j-1}$
is set to zero at the end of each chromosome. Yang and Tempelman (2012) determined that
anteBayesA and anteBayesB improved WGP accuracy compared to BayesA and BayesB for
populations with high LD levels ($r^2 > 0.24$) and would lead to even greater accuracies with higher
density SNP marker panels. Antedependence models could also have potential benefits in GWA.
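A short simulation shows what the recursion in [5.7] implies for neighbouring marker effects. This is a sketch under assumed values (two chromosomes of equal size, $\mu_t = 0.6$, $\sigma_t = 0.1$, unit innovation variance); none of these settings come from BATools.

```r
# Sketch: simulate marker effects from the first-order antedependence model [5.7]
set.seed(5)
m   <- 2000
chr <- rep(1:2, each = m / 2)            # chromosome of each marker

# t_j[j] plays the role of t_{j,j-1}; it is zeroed for the first marker of each
# chromosome so that effects are not linked across chromosome boundaries
t_j <- rnorm(m, mean = 0.6, sd = 0.1)
t_j[c(1, which(diff(chr) != 0) + 1)] <- 0

delta <- rnorm(m, 0, 1)                  # innovations delta_j
alpha <- numeric(m)
alpha[1] <- delta[1]
for (j in 2:m) alpha[j] <- t_j[j] * alpha[j - 1] + delta[j]

cor(alpha[-1], alpha[-m])                # adjacent effects are positively correlated
alpha[m / 2 + 1] == delta[m / 2 + 1]     # first marker of chromosome 2 restarts the chain
```

Because each interval has its own association parameter, the induced correlation can vary along the genome, which is what makes the specification nonstationary.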
5.3.4 Algorithms
The majority of the models were implemented using MCMC via the Gibbs sampler (Casella and
George 1992) for updating marker effects. Hyperparameters such as $v_\alpha$ and
$s_\alpha^2$ should also be updated in each MCMC iteration to maximize the accuracy of WGP (Yang et al.
2015b; Zhu et al. 2016). However, even though it is possible to estimate these hyperparameters using
MCMC, their poor mixing may require a long MCMC chain for
convergence to the joint posterior density, and mixing is slower still for high
density marker panels, making implementation less practical for real data analysis. Therefore,
BATools adopts a univariate Metropolis-Hastings (UNIMH) algorithm, which substantially
improves mixing of the MCMC chain, instead of the Gibbs sampler for $v_\alpha$ and $s_\alpha^2$ when
both need to be updated in (ante)BayesA/B (Yang et al. 2015b). Nevertheless, with a large
number of SNP markers, even these improvements may still require a significant amount of
computing time. A maximum a posteriori (MAP) approach that analytically estimates the marker
effects and hyperparameters was also implemented for BayesA and SSVS, known as MAP-BayesA
and MAP-SSVS for WGP (Chen and Tempelman 2015). Both MAP-BayesA and MAP-SSVS
require computing time comparable to GBLUP; however, they may lead to slightly lower
WGP accuracy than their MCMC counterparts because of the possibility of converging to a local
maximum.
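The core of the Gibbs step for marker effects can be sketched in a few lines of base R. The example below is a simplified single-site sampler for a Gaussian (BRR-type) prior with both variances held fixed and no fixed effects; it is an illustration written for this text, not the compiled BATools routine, and all names and settings are assumptions. With fixed variances the posterior mean of the marker effects equals the ridge regression solution, which gives a convenient check.

```r
# Sketch: single-site Gibbs sampler for marker effects with fixed variances
set.seed(6)
n <- 100; m <- 20
Tm <- scale(matrix(rbinom(n * m, 2, 0.3), n, m))   # standardized genotypes
alpha_true <- rnorm(m, sd = 0.5)
y <- as.vector(Tm %*% alpha_true + rnorm(n))

sig2_e <- 1; sig2_a <- 0.25
lambda <- sig2_e / sig2_a
tt     <- colSums(Tm^2)          # T_j'T_j for each marker

alpha <- numeric(m)
e     <- y                       # current residuals y - T alpha
niter <- 3000; burn <- 500
post_mean <- numeric(m)

for (it in seq_len(niter)) {
  for (j in seq_len(m)) {
    e   <- e + Tm[, j] * alpha[j]                    # add marker j back in
    cjj <- tt[j] + lambda
    alpha[j] <- rnorm(1, sum(Tm[, j] * e) / cjj, sqrt(sig2_e / cjj))
    e   <- e - Tm[, j] * alpha[j]                    # remove the updated effect
  }
  if (it > burn) post_mean <- post_mean + alpha / (niter - burn)
}

ridge <- as.vector(solve(crossprod(Tm) + diag(lambda, m), crossprod(Tm, y)))
max(abs(post_mean - ridge))      # small, up to Monte Carlo error
```

Keeping the residual vector up to date means that each marker update costs only O(n) operations, so one sweep over all markers costs O(nm).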
5.3.5 GWA implementation
Traditionally, GWA has been based on single SNP tests. GWA studies are increasingly
based on joint tests on SNP markers within pre-defined genomic windows rather than tests
on single SNP markers, as single SNP marker tests may have low statistical power, be adversely
affected by multicollinearity, or both (Chen et al. 2017; Fernando et al. 2017). A recent study by
Chen et al. (2017) illustrated that adaptive windows based on LD (Dehman et al. 2015) could
have better GWA performance than fixed window lengths or the single SNP approach for
Bayesian analyses. BATools provides window based GWA using Bayesian posterior probabilities
for "Bayesian Alphabet" approaches and chi-square tests for MAP based approaches (Chen et
al. 2017; Fernando et al. 2017). Previously, ssGBLUP did not provide formal statistical
evidence of association in GWA analyses, but merely point estimates of SNP effects or the
percentage of genetic variance explained by sliding windows of SNP markers (Wang et al. 2012;
Zhang et al. 2016). Formal tests were provided in Chapter 4 for both single-SNP and window
based approaches to GWA inference. A summary of the GWA output for all models is given in
Table 5.2.
Table 5.2 GWA output for different models for single SNP and window based approaches

Model | Single SNP | Window
BRR | Bayesian p-value¹ | Posterior probability²
BayesA, anteBayesA, ssBayesA | Bayesian p-value¹ | Posterior probability²
BayesB, anteBayesB, ssBayesB | Posterior probability³ | Posterior probability²
SSVS, ssSSVS | Posterior probability³ | Posterior probability²
GBLUP, ssGBLUP (EMMAX) | p-value⁴˒⁵ | p-value³˒⁵
mapBayesA, mapSSVS | p-value³ | p-value³

¹ Bello et al. (2010); ² Fernando et al. (2017); ³ Chen et al. (2017); ⁴ Gualdron Duarte et al. (2014);
⁵ Chapter 4.
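For the window based posterior probabilities, one common construction, in the spirit of Fernando et al. (2017), is the proportion of MCMC samples in which a window explains more than a threshold share of the genetic variance. The base R sketch below uses fabricated "samples" purely as a stand-in for a saved MCMC chain; the helper `win_ppa`, the 1% threshold default and the simulated inputs are assumptions for illustration and may differ from the BATools internals.

```r
# Sketch: window posterior probability from MCMC samples of marker effects
set.seed(7)
n <- 200; m <- 30; S <- 500
Tm  <- scale(matrix(rbinom(n * m, 2, 0.4), n, m))  # standardized genotypes
win <- rep(1:3, each = 10)                         # window id of each SNP

# Stand-in for saved MCMC samples (S x m); marker 5 carries a large effect
samples <- matrix(rnorm(S * m, 0, 0.005), S, m)
samples[, 5] <- rnorm(S, 0.5, 0.05)

win_ppa <- function(samples, Tm, win, threshold = 0.01) {
  sapply(sort(unique(win)), function(w) {
    idx <- which(win == w)
    prop <- apply(samples, 1, function(a) {
      v_win   <- var(as.vector(Tm[, idx, drop = FALSE] %*% a[idx]))
      v_total <- var(as.vector(Tm %*% a))
      v_win / v_total
    })
    mean(prop > threshold)     # share of samples exceeding the threshold
  })
}

ppa <- win_ppa(samples, Tm, win)
ppa    # the first window, which contains marker 5, stands out
```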
5.4 Data
The BATools package comes with a subset of the MSUPRP data used in the gwaR package
(https://github.com/steibelj/gwaR) to demonstrate GWA using the fast version of EMMAX in
Gualdron Duarte et al. (2014), who provided a strategy that fits Equation [5.3] once and
derives tests equivalent to EMMAX (Bernal Rubio et al. 2016). I chose this dataset because it is
small enough to allow a quick demonstration and contains all the phenotype,
pedigree, genomic map and genotype information required for all the models included in
BATools. The original dataset was described in Gualdron Duarte et al. (2014). The subset
contains 176 Duroc-Pietrain F2 crosses that are both phenotyped and genotyped with 20597 SNP
markers. The subset comes as a synbreed data object; I pre-processed the data to create
genomic windows based on LD using the BALD R package (Dehman and Neuvial 2015), and details of
constructing such windows are provided in Figure C.1 in Appendix C. The Pig data contains the
objects listed in Figure 5.2. PigPheno is a data.frame of phenotypes whose first column is the trait
driploss used for demonstration; PigM contains marker genotypes coded as "0, 1, 2"; PigMap is the
genomic map for each SNP with columns chr (chromosome number), pos (position in Mb), and
idw (window id based on BALD; alternatively, the user can create fixed size windows using the set.win
function); PigAlleleFreq is the allele frequency of the genotype coded as "1" from the F0 population;
and PigPed is a data.frame of pedigree with individual ID in the first column, sire ID in the second column
and dam ID in the third column (unknown sires and dams must be NA).
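As an illustration of what a fixed-size window index looks like, the sketch below assigns consecutive SNPs to windows of a given number of markers, restarting at each chromosome. The helper `fixed_windows` is hypothetical and written in base R for this example; it is not the set.win function, whose arguments are documented in the package help, and it assumes the map is already sorted by position within chromosome.

```r
# Hypothetical helper: window ids with `size` SNPs per window, never spanning
# two chromosomes (assumes SNPs are ordered by position within chromosome)
fixed_windows <- function(chr, size = 10) {
  idw <- numeric(length(chr))
  offset <- 0
  for (cc in unique(chr)) {
    k <- which(chr == cc)
    idw[k] <- offset + (seq_along(k) - 1) %/% size + 1
    offset <- max(idw[k])
  }
  idw
}

# Toy map with 25 SNPs on chromosome 1 and 12 SNPs on chromosome 2
map <- data.frame(chr = rep(1:2, c(25, 12)),
                  pos = c(seq(0.1, 2.5, length.out = 25),
                          seq(0.1, 1.2, length.out = 12)))
map$idw <- fixed_windows(map$chr, size = 10)
table(map$idw)   # the last window on each chromosome may hold fewer SNPs
```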


rm(list=ls())
library(BATools)
data(Pig)
ls()
## [1] "PigAlleleFreq" "PigM"          "PigMap"        "PigPed"
## [5] "PigPheno"

Figure 5.2 Loading Pig data included in BATools
5.5 Interface and application examples
BATools is designed to fit the WGP prediction model provided in Equation [5.1] that
includes fixed effects, random genetic effects, and residual effects. Before using BATools, data
files such as genotype, phenotype and pedigree need to be prepared by the user. For animal and
plant breeding, synbreed (Wimmer et al. 2012) can be used for recoding genotypes, imputation,
etc.; plink (Purcell et al. 2007) can also be used for similar tasks. BATools leaves the choice of
data cleaning and management tools to the user as long as the dataset follows a pattern similar
to that described in the "Data" section. Using BATools often consists of three parts: 1)
loading data and setting up the genotype matrix; 2) setting up initial values for variance
components/hyperparameters and options for running the model; 3) model fitting and
comparisons. With data loaded as illustrated in Figure 5.2, the extra steps for fitting the model are
shown in Figure 5.3. To set up the genotype matrix, I can use either a centered or a standardized
genotype matrix for the analysis to help improve algorithmic stability (Stranden and Christensen
2011).

#Standardize genotype matrix with method="s"
#for standardization using equation [5.2]
geno=std_geno(PigM,method="s",freq=PigAlleleFreq)
#Setup initial values for variance component/hyperparameters
#using heritability based rules with h2=0.5
init=set.init(~driploss,data=PigPheno,geno=geno,~id)
#set options
op=set.options(init=init)
#Fitting model
gblup<-baFit(driploss~sex+car_wt,data=PigPheno,geno=geno ,
genoid = ~id,randomFormula = ~age_slg,options = op)

Figure 5.3 Basic model setting and fitting for a GBLUP model
To do that, BATools provides a built-in function called std_geno, which takes three
arguments: geno for the original genotype matrix; method to either standardize as in equation [5.2]
(method="s") or center (method="c", i.e. $\mathbf{T} = \mathbf{M} - \mathbf{1}\mathbf{k}'$) the genotype matrix, with standardization
as the default; and freq for the user supplied reference allele frequency. By default, if freq
is not provided, BATools will compute freq based on geno. The next step is to set up the initial
values for variance components/hyperparameters and options such as the number of
total/maximum iterations, burn-in, screen printout messages and whether certain
hyperparameters are to be estimated, etc. Figure 5.3 demonstrates how to fit a GBLUP model
using the basic default settings, and Figure 5.4 shows the full verbose code equivalent to Figure
5.3.
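For readers who want to see the recoding outside the package, the base R sketch below centers a 0/1/2 genotype matrix and then applies a single common scaling. The centering follows $\mathbf{T} = \mathbf{M} - \mathbf{1}\mathbf{k}'$ with $k_j = 2p_j$ assumed; the scaling by the square root of $\sum_j 2p_j(1-p_j)$ is the usual VanRaden (2008) choice and is only assumed here to correspond to the standardization option, since the exact form of equation [5.2] is not restated in this section. All object names are illustrative.

```r
# Sketch: centering and (assumed) VanRaden-type scaling of a genotype matrix
set.seed(8)
n <- 50; m <- 8
p_true <- runif(m, 0.1, 0.9)
M <- sapply(p_true, function(pj) rbinom(n, 2, pj))   # n x m matrix coded 0/1/2

freq <- colMeans(M) / 2          # allele frequencies computed from the genotypes
k    <- 2 * freq                 # assumed centering constants k_j = 2 p_j

T_c <- sweep(M, 2, k)                                 # centered: T = M - 1k'
T_s <- T_c / sqrt(sum(2 * freq * (1 - freq)))         # common scaling (assumption)

G <- tcrossprod(T_s)             # genomic relationship matrix implied by T_s
round(colMeans(T_c), 12)         # columns are centered at zero
dim(G)
```

Supplying externally estimated frequencies (for example from a founder population) instead of `colMeans(M) / 2` changes only the vector `freq`, which mirrors the role of the freq argument described above.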


#Setup initial values for variance component/hyperparameters
#using heritability based rules for GBLUP using heritability (h2) of 0.5
init=set.init(~driploss,data=PigPheno,geno=geno,~id,h2=0.5,model="GBLUP")
#Default prior for GBLUP is chi^-2(-1,0) for residual and marker variance
priors<-list(nu_e=-1,tau2_e=0,nu_s=-1,tau2_s=0)
#Whether to update variance components
update_para=list(vare=TRUE,scale=TRUE)
#Max iteration for running AIREML is set to 50 by default
run_para=list(maxiter=50)
#set options with model (GBLUP), method (REML), Priors, initial values,update scheme,
#maximum number of iterations, file saving location, and convergence criteria
op<-set.options(model="GBLUP",method="REML",priors=priors,init=init,
update_para=update_para,run_para=run_para,save.at="GBLUP",convcrit=1E-4)
#Fitting model with fixed effect sex, covariates car_wt (carcass weight)
#and non-genetic random effect age_slg (age of slaughter)
gblup<-baFit(driploss~sex+car_wt,data=PigPheno,geno=geno ,genoid = ~id,
randomFormula = ~age_slg,options = op)

Figure 5.4 Full model setting and fitting for GBLUP. Verbose counterpart to Figure 5.3
The set.init function uses the heritability based rules provided in de Los Campos et al. (2013)
with a default heritability (h2) of 0.5; full documentation of the rules can be found in Appendix
C. The set.options function is used to set up all the options including priors, procedural
specifications (number of iterations, burn-in and skip, or maximum number of iterations for REML/MAP),
printout options, etc. In the examples in Figures 5.3 and 5.4, the priors are the default $\chi^{-2}(-1, 0)$ priors
for both residual and marker variance; update_para indicates whether the user wants the
variance components to be estimated/updated; and run_para for GBLUP only has maxiter to
indicate the maximum number of iterations for the AIREML algorithm. Full documentation can
be found using help(set.options), and default values for each method are documented in
Appendix C.
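The idea behind heritability based starting values can be sketched as follows. The version below splits the phenotypic variance according to h2 and divides the genetic part by the summed marker variances, which is how the rule of de los Campos et al. (2013) is commonly applied for a Gaussian marker-effect prior. It is an assumption that set.init follows this exact form; the helper `init_from_h2` and its output names are hypothetical, and the package's own rules are those documented in Appendix C.

```r
# Hypothetical helper: starting values for variance components from h2
init_from_h2 <- function(y, Tm, h2 = 0.5) {
  vy  <- var(y)
  msx <- sum(apply(Tm, 2, var))          # summed variance of the marker covariates
  list(vare      = (1 - h2) * vy,        # residual variance
       varMarker = h2 * vy / msx,        # common marker-effect variance
       msx       = msx)
}

set.seed(9)
n <- 100; m <- 200
Tm <- scale(matrix(rbinom(n * m, 2, 0.5), n, m), scale = FALSE)  # centered genotypes
y  <- rnorm(n, mean = 1, sd = 0.8)

init <- init_from_h2(y, Tm, h2 = 0.5)
init$vare
init$varMarker
init$vare + init$varMarker * init$msx    # recovers var(y) by construction
```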
To fit the model, I can call the baFit function with:
• formula to specify the response trait to the left of the tilde "~" and the corresponding fixed
effects. In Figure 5.4, for example, driploss is the response, sex is a fixed effect fitted as a
factor and car_wt (carcass weight) is a numeric covariate.
• data, a data.frame containing all the phenotypes. It should contain at least two
columns: the trait to be analyzed and a column of individual IDs corresponding to the
rownames of the genotype matrix geno.
• geno, a genotype matrix with individual IDs as rownames.
• genoid, a formula to specify the column of data that contains the individual ID using "~" and
the corresponding column name. Therefore, the genotype and the data records do not have to be
in the same order and BATools will match the IDs. In the event that an ID in this column is not
available in the genotype matrix, that ID will be ignored in the analysis. However, if the single-step
approach is used, the IDs will be matched with the individual ID column in the required
pedigree file.
• randomFormula, a formula to specify the column of data to be treated as a random
effects factor using "~" and the corresponding column name. The random effects factor for this
example is age_slg (age of slaughter).
• options, an object of options created by the set.options function.
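The ID matching described for genoid can be pictured with a few lines of base R. This is a sketch of the general idea using `match` and `%in%` on toy data; the object names are hypothetical and the code is not taken from BATools.

```r
# Sketch: aligning phenotype records with genotype rows by individual ID
pheno <- data.frame(id = c("a3", "a1", "a9", "a2"),
                    driploss = c(1.2, 0.8, 1.5, 1.0),
                    stringsAsFactors = FALSE)
geno <- matrix(c(0, 1, 2,
                 2, 1, 0,
                 1, 1, 1), nrow = 3, byrow = TRUE,
               dimnames = list(c("a1", "a2", "a3"), paste0("snp", 1:3)))

# Records without a genotype (here "a9") are dropped from the analysis
keep       <- pheno$id %in% rownames(geno)
pheno_used <- pheno[keep, ]

# Reorder genotype rows to follow the order of the phenotype records
geno_used <- geno[match(pheno_used$id, rownames(geno)), , drop = FALSE]

pheno_used$id
rownames(geno_used)   # same IDs in the same order as the phenotype records
```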

The baFit function returns an object of class ba, which is basically a list containing
important variables including estimates of fixed effects and covariates (bhat) as in Equation
[5.1], estimates of random SNP marker effects (ahat), estimates of random non-genetic effects
(rhat) and the predicted values of the phenotypes (yhat), as well as other variables such as
hyperparameter/variance component estimates. When the GWA option is enabled with
GWA="SNP" or GWA="Win", it also returns the p-value or posterior probability for each single SNP, or for
each window if specified using idw in the map. This window specification can be created by the
user using BALD or using the set.win function, which creates fixed size windows based on either the
number of SNP markers per window or the window size in Mb (see Box S1). To quickly evaluate
a summary of the estimates, an S3 print function is implemented as in Figure 5.5. I
noticed that the random effect age_slg had a very small variance compared to the SNP marker
variance, so I did not include it in the remaining examples.
#Print out basic results
gblup
## Result of BATools:
##
## estimated fixed effects:
##  (Intercept)         sexM       car_wt
##  0.853443770 -0.158191267  0.005642562
##
## SD
## (Intercept)        sexM      car_wt
##   1.0811347   0.1785045   0.0133068
##
## estimated hyperparameters:
##        vare   varMarker var_age_slg
## 0.377903527 0.218162021 0.001028349

Figure 5.5 Summary of BATools results
Several examples and use cases are also provided: fitting models for cross-validation
(Example 1), GWA using the faster version of EMMAX (Gualdron Duarte et al. 2014) and
Bayesian variable selection (Example 2), fitting antedependence models for GWA (Example 3), and
fitting single-step models for cross-validation (Example 4). The code for model fitting is provided
with the text for GWA and WGP with complete cross-validation analysis. Because I carefully
chose the example dataset, each example in the text was executed on a MacBook Pro (Retina,
Mid 2012) with a 2.3 GHz Intel Core i7 and 8 GB of memory within 2-4 minutes, depending on
the model (e.g. the antedependence models take about 4 minutes). Additional examples for each
model/method can also be found in the demo folder of the BATools package or by typing
help(baFit) in R.
5.5.1 Example 1: Cross-validation using BRR, BayesA and SSVS
This example shows fitting a WGP model for cross-validation using three different methods.
For demonstration purposes, I only run 5,000 MCMC iterations after 5,000 burn-in samples,
saving every 10th sample to compute the posterior means, with niter=10000, burnIn=5000 and
skip=10. Figure 5.7 shows the code for each of the three methods. For cross-validation,
BATools provides createCV to automatically generate random k-fold cross-validation partitions. Then I
set up the initial values for all three models using the heritability based rules described in
Appendix C. Running each model is similar to Figure 5.4 except for some additional settings for
initial values, updating parameters, running parameters and printout options. In the baFit function, the
train argument is used to indicate the column name for cross-validation. baplot can be
used to create plots to visualize the results, as in Figure 5.6.

Figure 5.6 Visualization of cross-validation results for BRR, BayesA and SSVS via the built-in
BATools function baplot. Black dots are from the training set and red dots are from the validation set.

I also completed a quick 5-fold cross-validation using the code shown in Figure 5.7, with the
cross-validation prediction accuracies shown in Table 5.3. In this small example, I did not find any
evidence that the three models differed from each other in cross-validation prediction accuracy. In real
data applications, it is strongly advised to use more than 5 folds (e.g. set k=100 in the
createCV function for 100-fold cross-validation) to improve power.
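The fold structure used in this example can be mimicked in base R as follows. The sketch adds one logical column per fold, named cv1 to cvk to match the train formulas in Figure 5.7, with TRUE marking training records; that convention is inferred from the accuracy calculation in Figure 5.7, which correlates observed and predicted values where train is FALSE. The helper `make_folds` is hypothetical and is not the createCV implementation.

```r
# Hypothetical helper: random k-fold partition stored as logical training columns
make_folds <- function(data, k = 5) {
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  for (i in seq_len(k)) {
    data[[paste0("cv", i)]] <- fold != i   # TRUE = training, FALSE = validation
  }
  data
}

set.seed(1234)
d <- make_folds(data.frame(id = 1:176, y = rnorm(176)), k = 5)

cv_cols <- paste0("cv", 1:5)
colSums(!d[, cv_cols])           # validation set size of each fold
table(rowSums(!d[, cv_cols]))    # every record is in exactly one validation set

# Accuracy for one fold, given hypothetical predictions yhat
yhat <- d$y + rnorm(176)
cor(d$y[!d$cv1], yhat[!d$cv1])
```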


rm(list=ls());library(BATools);data("Pig")
geno=std_geno(PigM,method="s",freq=PigAlleleFreq)
#create cv-folds using createCV function
set.seed(1234)
PigPheno=createCV(~driploss,data = PigPheno,k=5)
# Set up parameters and run cv for Bayesian Ridge regression
init=set.init(~driploss,data=PigPheno,geno=geno,~id,h2=0.5,model="rrBLUP")
run_para=list(niter=10000,burnIn=5000,skip=10)
print_mcmc=list(piter=500) # Print status to screen every 500 iteration
update_para=list(scale=TRUE)
ListcvRR<-list()
for(i in 1:5){
op<-set.options(model="rrBLUP",method="MCMC",init=init,
update_para=update_para,run_para=run_para,print_mcmc=print_mcmc,seed=i)
ListcvRR[[i]]<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options =
op, train=as.formula(paste0("~cv",i)))
}
# Set up parameters and run cv for BayesA
init=set.init(~driploss,data=PigPheno,geno=geno,~id,h2=0.5,model="BayesA")
update_para=list(df=TRUE,scale=TRUE)
ListcvBA<-list()
for(i in 1:5){
op<-set.options(model="BayesA",method="MCMC",init=init,
update_para=update_para,run_para=run_para,print_mcmc=print_mcmc,seed=i)
ListcvBA[[i]]<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options =
op, train=as.formula(paste0("~cv",i)))
}
# Set up parameters and run cv for SSVS
init=set.init(~driploss,data=PigPheno,geno=geno,~id,pi_snp=0.001,h2=0.5,c=1000,model=
"SSVS")
update_para=list(df=FALSE,scale=TRUE,pi=TRUE)
ListcvSSVS<-list()
for(i in 1:5){
op<-set.options(model="SSVS",method="MCMC",init=init,
update_para=update_para,run_para=run_para,print_mcmc=print_mcmc,seed=i)
ListcvSSVS[[i]]<-baFit(driploss~sex,data=PigPheno,geno=geno ,genoid = ~id,options =
op, train=as.formula(paste0("~cv",i)))
}
save(ListcvRR,ListcvBA,ListcvSSVS,file="ex1_5cv.RData")
# Plot the result for estimating "driploss"
par(mfrow=c(1,3))
baplot(ListcvRR[[1]],main="Bayesian Ridge Regression")
baplot(ListcvBA[[1]],main="BayesA")
baplot(ListcvSSVS[[1]],main="SSVS")
# Compute cv accuracies
calc.acc<-function(x) cor(x$y[!x$train],x$yhat[!x$train])
acc<-cbind(sapply(ListcvRR,calc.acc),
sapply(ListcvBA,calc.acc),sapply(ListcvSSVS,calc.acc))
colnames(acc)=c("BRR","BA","SSVS")
apply(acc,2,mean)

Figure 5.7 5-fold Cross-validation using BRR, BayesA and SSVS


Table 5.3 Cross-validation prediction accuracy for BRR, BayesA and SSVS

Model | Average cross-validation accuracy
BRR | 0.415a
BayesA | 0.415a
SSVS | 0.408a

Values not sharing the same letter within a column have different (P < 0.05) prediction accuracy
5.5.2 Example 2: GWA using EMMAX and SSVS
A computationally efficient algorithm for EMMAX has been recently proposed (Gualdron
Duarte et al. 2014; Bernal Rubio et al. 2016; Chen et al. 2017) for GWA. I adapt this strategy, as
does the software gwaR (https://github.com/steibelj/gwaR), with map supplied and the type of
GWA given as either SNP (single SNP) or Win (window based). Note that the window idw (a numeric
indicating the window of each SNP) must be provided for the window based approach (Chen et al.
2017). I use adaptive windows for the example dataset as determined by the R package BALD
(Dehman and Neuvial 2015), with code in Figure C.1. Figure 5.8 shows the code used for GWA
and for creating the Manhattan plots in Figure 5.9. I found that none of the SNPs/windows were statistically
significant in the example dataset using a P-value threshold of 0.05 divided by the number of
SNPs/windows to account for multiple comparisons for EMMAX and a PPA of 0.9 for SSVS.
This is consistent with driploss being a more polygenic trait and may explain the similar
performance between BRR, BayesA and SSVS in WGP in Table 5.3.
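The significance thresholds used here are simple to reproduce. The sketch below computes the Bonferroni-adjusted P-value threshold for the 20597 SNPs in the example dataset and shows how P-values and posterior probabilities would be flagged; the helper name and the toy P-values and PPA values are made up for illustration.

```r
# Bonferroni threshold: nominal level divided by the number of tests
bonf_threshold <- function(n_tests, level = 0.05) level / n_tests

thr_snp <- bonf_threshold(20597)
thr_snp
-log10(thr_snp)        # about 5.6 on the Manhattan plot scale

# Flagging results against the thresholds (toy values, not from the dataset)
pvals <- c(1e-3, 4e-6, 2e-7)
ppa   <- c(0.15, 0.62, 0.95)
pvals < thr_snp        # only P-values below the adjusted threshold are declared significant
ppa > 0.9              # PPA threshold used for SSVS
```

For the window based tests the same function applies with the number of windows in place of the number of SNPs, giving a less stringent threshold because fewer tests are performed.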


rm(list=ls());library(BATools);data(Pig)
geno=std_geno(PigM,method="s",freq=PigAlleleFreq)
init=set.init(driploss~1,data=PigPheno,geno=geno,~id)
op=set.options(init=init)
#Fitting model with GBLUP using default values
#GWA enabled by supplying map and type of GWA
gblup<-baFit(driploss~sex,data=PigPheno,geno=geno ,
genoid = ~id,options = op,map=PigMap,GWA="Win")
#Fitting model with SSVS
init=set.init(~driploss,data=PigPheno,geno=geno,~id,pi_snp=0.001,h2=0.5,c=1000,model=
"SSVS")
run_para=list(niter=10000,burnIn=5000,skip=10);print_mcmc=list(piter=500)
update_para=list(df=FALSE,scale=TRUE,pi=F)
op<-set.options(model="SSVS",method="MCMC",init=init,
update_para=update_para,run_para=run_para,print_mcmc=print_mcmc)
SSVS<-baFit(driploss~sex,data=PigPheno,geno=geno ,
genoid = ~id,options = op, map=PigMap,GWA="Win")
#Create Manhattan plot
par(mfrow=c(2,2))
man_plot_pvalue(gblup,ylim=c(0,6))
man_plot_pvalue(gblup,type="Win",ylim = c(0,6))
man_plot_prob(SSVS,ylim=c(0,1))
man_plot_prob(SSVS,type="Win",ylim=c(0,1))

Figure 5.8 GWA using EMMAX and SSVS


Figure 5.9 Manhattan plot from GWA using the example MSUPRP dataset. Panel A)
EMMAX single SNP approach; Panel B) EMMAX adaptive window approach; Panel C) SSVS
single SNP approach; Panel D) SSVS adaptive window approach.
5.5.3 Example 3: Fitting antedependence model for GWA
In this example, I illustrate fitting the antedependence models anteBayesA and anteBayesB
using BATools. The procedure is no different from fitting a BayesA or
BayesB model except that map must be provided, as in Figure 5.10. Here I demonstrate using
anteBayesA and anteBayesB for GWA. The default initial value for the association parameter t is 0,
with $\mu_t = 0$ and $\sigma_t^2 = 0.5$; these can be specified in the set.init function and will be set to the defaults if
they are NULL. The default priors $\mu_t \sim N(0, 0.01)$ and $\sigma_t^2 \sim \chi^{-2}(-1, 0)$ are those suggested by
Yang and Tempelman (2012) and can be specified using the priors parameter in the set.options
function. For a Bayesian model that does not explicitly involve variable selection, such as BRR,
BayesA and anteBayesA, BATools calculates the Bayesian p-value for the single SNP approach and the
posterior probability that a window explains more than 1% of the total genetic variance. Figure
5.11 provides Manhattan plots for these two models based on executing the code in Figure 5.10.
While I found that anteBayesB had its peak in the same location as SSVS in Figure 5.9 for both the
single SNP and adaptive window based approaches, anteBayesA did not have visible peaks. Yang
and Tempelman (2012) suggested that the association parameter in the antedependence model
might be used for GWA purposes (Panel E and Panel F in Figure 5.11). I found that for
anteBayesB, the peak in the association parameter corresponded to the peak in posterior
probability for single SNPs, while for anteBayesA, the t-distributed prior still provided too much
shrinkage for relatively large marker effects compared to anteBayesB, such that the association
parameter did not provide any signal.


rm(list=ls());library(BATools);data(Pig)
geno=std_geno(PigM,method="s",freq=PigAlleleFreq)
#Set parameters for anteBayesA and fit the model
init=set.init(~driploss,data=PigPheno,geno=geno,~id,df=2.5,
h2=0.5,mut=0,vart=0.5,model="anteBayesA")
run_para=list(niter=10000,burnIn=5000,skip=10);print_mcmc=list(piter=500)
update_para=list(df=F,scale=TRUE,mut=F)
priors=list(mu_m_t=0,sigma2_m_t=0.01,df_var_t=-1,scale_var_t=0)
op<-set.options(model="anteBayesA",method="MCMC",init=init,update_para=update_para,
priors=priors,run_para=run_para,save.at="anteBayesA",print_mcmc=print_mcmc)
anteBA<-baFit(driploss~sex,data=PigPheno,geno=geno,
genoid = ~id,options = op,map=PigMap,GWA="Win")
#Set parameters for anteBayesB and fit the model
init=set.init(~driploss,data=PigPheno,geno=geno,~id,
df=2.5,pi_snp=0.001,h2=0.5,model="anteBayesB")
update_para=list(df=F,scale=TRUE,pi=F,mut=F)
op<-set.options(model="anteBayesB",method="MCMC",init=init,update_para=update_para,
run_para=run_para,save.at="anteBayesB",print_mcmc=print_mcmc)
anteBB<-baFit(driploss~sex,data=PigPheno,geno=geno ,
genoid = ~id,options = op,map=PigMap,GWA="Win")
#Create Manhattan plot
par(mfrow=c(3,2))
man_plot_prob(anteBA,ylim=c(0,6))
man_plot_prob(anteBA,type="Win",ylim = c(0,1))
man_plot_prob(anteBB,ylim=c(0,1))
man_plot_prob(anteBB,type="Win",ylim=c(0,1))
man_plot_assoc(anteBA,ylim=c(0,1))
man_plot_assoc(anteBB,ylim=c(0,1))

Figure 5.10 GWA using anteBayesA and anteBayesB


Figure 5.11 Manhattan plot from GWA using the example MSUPRP dataset. Panel A)
anteBayesA single SNP approach; Panel B) anteBayesA adaptive window approach; Panel C)
anteBayesB single SNP approach; Panel D) anteBayesB adaptive window approach; Panel E)
absolute value of association parameter for anteBayesA; Panel F) absolute value of association
parameter for anteBayesB.


5.5.4 Example 4: Fitting single-step model using ssGBLUP, ssBayesA, ssBayesB and
ssSSVS
In the single-step approach, the only additional requirement is pedigree information (in the form of a data.frame with
three columns: individual ID, sire ID and dam ID); the example dataset
provided is PigPed. To demonstrate an example single-step extension, I artificially masked the
genotypes of some individuals as missing to create genoNew (Figure 5.12). Then I fit the model with
ssGBLUP, ssBayesA, ssBayesB and ssSSVS. The code used to produce Table 5.4 by running 5-fold
cross-validation is provided in Figure 5.12. Although ssBayesA, ssBayesB and ssSSVS had slightly
higher cross-validation prediction accuracies than ssGBLUP, the differences were not significant.
Since the examples are just an illustration of how to use the code on a relatively small sample,
Chapter 4 should be consulted for the complete simulation study and real data analysis for
different types of traits. For ssGBLUP, the default uses homogeneous genetic variance
(ssGBLUPvar = "homVAR"); to use the heterogeneous genetic variance from Chapter 4, call
the set.options function with ssGBLUPvar = "hetVAR", being aware that hetVAR may
not converge.
Table 5.4 Cross-validation prediction accuracy for ssGBLUP, ssBayesA, ssBayesB and
ssSSVS

Model | Average cross-validation accuracy
ssGBLUP | 0.406a
ssBayesA | 0.420a
ssBayesB | 0.421a
ssSSVS | 0.430a

Values not sharing the same letter within a column have different (P < 0.05) prediction accuracy


rm(list=ls());library(BATools);data("Pig")
geno=std_geno(PigM,method="s",freq=PigAlleleFreq)
#Mask some genotype as missing to test single-step approach
set.seed(1001);n=dim(geno)[1];indexng<-sort(sample(1:n,n%/%5))
genoNew=geno[-indexng,]
#create cv-folds using createCV function
set.seed(1234)
PigPheno=createCV(~driploss,data = PigPheno,k=5)
# Set up parameters and run cv ssGBLUP
init=set.init(~driploss,data=PigPheno,geno=genoNew,~id,h2=0.5,model="ssGBLUP")
ListcvGBLUP<-list()
for(i in 1:5){
op<-set.options(model="ssGBLUP",method="REML",init=init,seed=i)
ListcvGBLUP[[i]]<-baFit(driploss~sex,data=PigPheno,geno=genoNew ,genoid = ~id,
ped= PigPed,options = op, train=as.formula(paste0("~cv",i)))
}
# Set up parameters and run cv for ssBayesA
init=set.init(~driploss,data=PigPheno,geno=genoNew,~id,h2=0.5,model="ssBayesA")
run_para=list(niter=10000,burnIn=5000,skip=10);print_mcmc=list(piter=500)
update_para=list(df=TRUE,scale=TRUE);ListcvBA<-list()
for(i in 1:5){
op<-set.options(model="ssBayesA",method="MCMC",init=init,
update_para=update_para,run_para=run_para,print_mcmc=print_mcmc,seed=i)
ListcvBA[[i]]<-baFit(driploss~sex,data=PigPheno,geno=genoNew ,genoid = ~id,
ped= PigPed,options = op, train=as.formula(paste0("~cv",i)))
}
# Set up parameters and run cv for ssBayesB
init=set.init(~driploss,data=PigPheno,geno=genoNew,~id,pi_snp=0.001,model="ssBayesB")
update_para=list(df=TRUE,scale=TRUE,pi=TRUE);ListcvBB<-list()
for(i in 1:5){
op<-set.options(model="ssBayesB",method="MCMC",init=init,
update_para=update_para,run_para=run_para,print_mcmc=print_mcmc,seed=
i)
ListcvBB[[i]]<-baFit(driploss~sex,data=PigPheno,geno=genoNew ,genoid = ~id,
ped= PigPed,options = op, train=as.formula(paste0("~cv",i)))
}
# Set up parameters and run cv for ssSSVS
init=set.init(~driploss,data=PigPheno,geno=genoNew,~id,pi_snp=0.001,
h2=0.5,c=1000,model="ssSSVS")
update_para=list(df=FALSE,scale=TRUE,pi=T); ListcvSSVS<-list()
for(i in 1:5){
op<-set.options(model="ssSSVS",method="MCMC",init=init,
update_para=update_para,run_para=run_para,print_mcmc=print_mcmc,seed=i)
ListcvSSVS[[i]]<-baFit(driploss~sex,data=PigPheno,geno=genoNew ,genoid = ~id,
ped= PigPed,options = op, train=as.formula(paste0("~cv",i)))
}
# Compute cv accuracies
calc.acc<-function(x){
cor(x$y[!x$train],x$yhat[!x$train])
}
acc<-cbind(sapply(ListcvGBLUP,calc.acc),sapply(ListcvBA,calc.acc),
sapply(ListcvBB,calc.acc),sapply(ListcvSSVS,calc.acc))
colnames(acc)=c("ssGBLUP","ssBA","ssBB","ssSSVS")
apply(acc,2,mean)

Figure 5.12 5-fold Cross-validation using ssGBLUP, ssBayesA, ssBayesB and ssSSVS

5.6 Performance and computing time
BATools uses C/C++ and FORTRAN subroutines to optimize its performance on a single core. The
most computationally demanding portion of the code is sampling the SNP marker effects with the
Gibbs sampler. I benchmarked BayesB at five sample sizes (n = 1k, 2k, 3k, 4k and 5k) and five
marker densities (m = 1k, 5k, 20k, 60k and 100k) (Figure 5.13).

Figure 5.13 Computing time in seconds per 1000 iterations for sampling all marker effects in
BayesB, by sample size and number of markers. The benchmark was performed on a single core of
a 2.4 GHz Intel Xeon E5-2680v4 CPU.
The benchmark was run at the Michigan State University High Performance Computing Center
(HPCC) on a single core of a 2.4 GHz Intel Xeon E5-2680v4 CPU. Computing time was affected by
both sample size and marker density and increased almost linearly with each when the other was
held fixed. The most demanding scenario (n=5k and m=100k) took about 14 minutes to run 1000
iterations. I also found that either antedependence specification roughly doubled the computing
time because of the extra step of sampling the association parameters, a vector of roughly the
same length as the vector of SNP marker effects. In a typical analysis running 200k iterations at
n=5k and m=60k, the analysis can be completed in about a day on a single core.
5.7 Concluding remarks and future developments
BATools provides a common user interface for a suite of popular Bayesian models for WGP
and GWA that allow for different shrinkage or variable selection options, model the
association between adjacent SNP markers, and incorporate phenotypes of non-genotyped
individuals via pedigree information. BATools also provides easy-to-use tools for
cross-validation and for visualizing WGP and GWA results. Further extensions, such as
antedependence models for the single-step approach, GxE using Bayesian models (Yang et al.
2015a) and modeling of repeated records, will be made available through updates.
The most computationally intensive part of BATools is the Gibbs sampler for SNP marker
effects. Our approach is to use C/C++ and FORTRAN subroutines to reduce the computing time.
Still, the computing time per thousand iterations increases linearly with marker density or
sample size. Users should also be aware that increasing marker density might lead to poor
mixing in MCMC, so that extra iterations might be required to reach convergence. The Gibbs
sampler for WGP does not appear to be parallelizable across markers because sampling each
marker effect depends on the current values of all other marker effects, whereas calculating
the right-hand side (rhs) of the mixed model equations is parallelizable (Fernando et al. 2016).
This can be achieved using shared memory multiprocessing via OpenMP or GPU computing to
accommodate an increasing number of MCMC samples. Since R does not have good support for GPU
computing and extensive environment setup is required to compile the code, I decided not to
include this feature in the BATools R package. As for OpenMP, proper setup is required to
execute the code truly in parallel; therefore, an OpenMP version of BATools will only be
available on GitHub (https://github.com/chenchunyu88/batools) after it is fully tested, to
allow experienced users to take advantage of parallelization.
With the increasing availability of sequencing data, computing efficiency for WGP and GWA
using Bayesian models will be a major challenge. In the 1000 bull genomes project, 28.3
million variants were identified in 238 cattle (Daetwyler et al. 2014). With this dataset,
BATools would take ~4 hours per 1000 iterations, which is not very efficient. A much bigger
problem is that loading the data into R is itself inefficient because it would take ~160 GB of
memory. To handle this type of dataset, BATools would need to be modified as follows: 1) use
the bigmemory R package (Kane et al. 2013) to load the data into R; 2) use the RcppEigen R
package (Bates and Eddelbuettel 2013) to pre-construct the additive genomic relationship
matrix G; 3) modify BATools to handle the kinship matrix G instead of taking the genotype
matrix directly and to output u in equation [5.3]; 4) write another function to obtain P-values
for single SNPs using RcppEigen based on Chen et al. (2017). With these modifications, BATools
would in effect perform GWA using fast EMMAX. One might then use only, say, the 5% of variants
with the smallest P-values for WGP, so that the computing time per 1000 iterations would be
just under 20 minutes with 1.4 million variants. Even with this approach, Bayesian models
might still need a large number of iterations to converge for 1.4 million variants. Hybrid
approaches that use the EM algorithm to set up starting values for MCMC could effectively skip
burn-in and reduce the total computing time relative to the original MCMC approach (Wang et al.
2016). Overall, WGP and GWA with sequencing data are challenging, but with further improvement,
BATools can efficiently handle high density sequencing data.
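
As an illustration of item 2) above, the following base R sketch accumulates a genomic relationship matrix over blocks of markers so that only one block needs to be processed at a time. It is a minimal sketch under stated assumptions, not the BATools implementation: it assumes G is formed as WW'/m from column-standardized genotypes W, it uses neither bigmemory nor RcppEigen, and the helper name build_G_blockwise is hypothetical.

```r
# Hypothetical helper: accumulate G = W W' / m over blocks of markers.
# Z: n x m genotype matrix coded 0/1/2 (assumed); block: markers per block.
build_G_blockwise <- function(Z, block = 1000) {
  n <- nrow(Z); m <- ncol(Z)
  G <- matrix(0, n, n)
  used <- 0
  for (start in seq(1, m, by = block)) {
    idx <- start:min(start + block - 1, m)
    W <- scale(Z[, idx, drop = FALSE])       # center and scale each marker
    keep <- colSums(is.na(W)) == 0           # drop monomorphic markers (sd = 0)
    W <- W[, keep, drop = FALSE]
    G <- G + tcrossprod(W)                   # add W W' for this block
    used <- used + ncol(W)
  }
  G / used                                   # divide by the number of markers used
}

# Toy usage with simulated genotypes
set.seed(1)
Z <- matrix(rbinom(50 * 2000, 2, 0.3), 50, 2000)
G <- build_G_blockwise(Z, block = 256)
dim(G)
```

With file-backed storage, each block would be read from disk inside the loop instead of being subset from an in-memory matrix; the accumulation step itself would be unchanged.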


Chapter 6 Conclusions, Discussions and Future Work
This dissertation focused on extending existing statistical models and developing software
tools for whole genome prediction (WGP) and genome wide association (GWA) analysis in
animal and plant breeding. These tools have been used, respectively, to accelerate selection for
economically important traits and identify important genomic regions based on high density SNP
marker genotypes. The primary goal of this work was to make hierarchical modeling and
software tools for WGP and GWA more accessible for academic research and industry
applications. This included exploring computationally feasible, albeit approximate, alternative
algorithms (Chapter 2), developing more powerful and more formal GWA strategies for
hierarchical linear models (Chapter 3), extending flexible Bayesian models to allow for
phenotypes on non-genotyped animals in GWA analyses (Chapter 4) and providing the
associated software tools for these and other recent hierarchical linear model developments for
GWA and WGP (Chapter 5).
While preparing this dissertation, I designed the algorithms and methodologies on the
assumption that WGP or GWA inferences were usually m ≫ n problems (m being the number of
markers or covariates and n being the number of observations or animals). With the use of the
EM algorithm in Chapters 2 and 3, I believe that I provided a computationally tractable
alternative to MCMC for more flexible priors (n ≤ 5,000, m > 50,000); however, that assumption
may not hold for some current or future applications. As a matter of fact, some genomic
evaluation programs now have more individuals (> 1 million) than SNP markers (~50,000)
(Fernando et al. 2014; Masuda et al. 2016). In such cases, the EM based approaches in Chapter 2
can still work efficiently with SNP marker effect models. However, in many other applications,
especially in plant breeding programs, m ≫ n might still be true for some time yet since, based
on my personal experience, some companies may have only a few thousand inbred lines
(genetically similar individuals that are bred with each other for uniformity) in their
breeding program, and selection is based more on these lines than on the hybrids (progeny of
two inbred lines). For applications with m ≪ n, two major approaches can be considered. First,
Bayesian models deserve greater consideration as they do not require large matrix inversions
as REML does but rather rely on accelerated MCMC sampling via high performance parallel or GPU
computing (Fernando et al. 2016). Second, a modified GBLUP based on the algorithm for proven
and young (APY) has been developed (Misztal 2016a), in which the breeding values (BVs) of
noncore animals are specified to be an approximate function of only the BVs of core animals.
As a result, the computing cost of inverting the genomic relationship matrix is only cubic in
the number of core animals, which is a significant saving because the computing time depends
only on the number of these selected animals. One interesting development in the ssGBLUP
approaches that incorporate information on non-genotyped animals is that researchers have
attempted to differentially weight marker effects when building the genomic relationship
matrix to improve prediction accuracies, analogous to our EM based approaches (Zhang et al.
2016). Currently, such models also suffer from some of the same convergence issues as EM,
i.e., accuracies are higher in the first few iterations than in later iterations when the
algorithm is close to convergence (Zhang et al. 2016). Comparisons of WGP accuracies between
EM-based approaches and weighted ssGBLUP approaches deserve further investigation. To deal
with the m ≪ n problem, I can use the SNP marker effects model directly. Although deterministic
annealing for MAP-SSVS is a useful tool to help with convergence and avoid local maxima, it is
computationally expensive and therefore not suitable for large data applications.


In our GWA research in both Chapter 3 and Chapter 4, I found that Bayesian variable
selection models are particularly effective when associations are based on genomic windows
defined adaptively from the LD structure. I do realize that universally good performance may
not be guaranteed across populations given that the LD structure can differ between studies.
Therefore, further studies are necessary. I also noticed that WGP and GWA involve different
goals even though they are increasingly based on the same or similar models. EM based
approaches should be discouraged for GWA unless the sensitivity to starting values and the
various convergence issues can be fully resolved. I demonstrated in Chapter 2 that the
Expectation Maximization variable selection (EMVS) strategy of Rockova and George (2014) can
alleviate the starting value issue, but it may be computationally intractable for large scale
applications.
In Chapter 4, I found that ssSSVS, which incorporates information on non-genotyped animals,
can lead to higher estimated posterior probabilities on peak associations compared to SSVS,
particularly for a trait, milk fat, which is known to be heavily controlled by a major gene,
DGAT1. In Chapter 3, the highest single SNP and window posterior probabilities of association
(PPA) were 0.478 and 0.772, respectively, with 922 samples for backfat in swine, whereas in
Chapter 4 the highest single SNP and window PPA were both 100% for milk fat with samples on
3186 dairy cows. I know that different species and different traits are not directly
comparable, but I think it's important to look into sample size requirements and guidance for
hierarchical model GWA.
Another important topic that I cannot avoid discussing is hyperparameter specification. Even
though estimating hyperparameters via MCMC sampling has been highly recommended in previous
WGP research (Yang et al. 2015b), and I did find hyperparameter specification to be important
for GWA inferences in Chapter 3 and Chapter 4, estimating them with MCMC may not always be
viable or may take far too long to generate reliable inferences considering the poor mixing of
some hyperparameters, especially for more polygenic traits. Conceivably, some hyperparameters,
such as the degrees of freedom ν for the t-distribution or the probability of association, π,
in variable selection methods, could be determined by cross-validation as long as the other
hyperparameters can be well estimated using MCMC when ν or π is fixed. Lee et al. (2017)
recently determined hyperparameter values based on the specifications that led to the highest
cross-validation WGP accuracies, which might be a better solution when hyperparameters cannot
be well estimated through MCMC. Knürr et al. (2013) also proposed using many different
hyperparameter values and averaging the resulting predictions to obtain more robust inference.
Hyperparameter specification is admittedly complicated, but WGP accuracies undoubtedly depend
upon proper specification (Wimmer et al. 2013). Therefore, comprehensive guidelines for
hyperparameter tuning are worth further study for GWA and WGP.
The ssGBLUP approaches that incorporate information on non-genotyped animals have become
mainstream for genomic prediction problems (Misztal 2016b). Recent work by Lee et al. (2017)
and my Chapter 4 suggested that Bayesian approaches that also incorporate such information,
i.e. ssSSVS, led to higher accuracies than ssGBLUP when traits are controlled by major genes;
even for polygenic traits, ssSSVS had prediction accuracies equivalent to those of ssGBLUP
because, in extremely polygenic cases, ssSSVS with π = 1 is equivalent to ssGBLUP. However,
these examples were the only two real data applications using Bayesian sparse priors and
focused on relatively small datasets (n<4000). Further research in large populations or
national genomic evaluations seems necessary. In ssSSVS, it's natural to sample the variance
components through a heterogeneous variance specification (i.e. separately estimating the
genetic variation due to marker effects and the genetic variation not accounted for by markers)
without extra computational cost if it is conceivable that the genetic variability differs
between genotyped and non-genotyped animals. For ssGBLUP, however, most applications are based
on a homogeneous genetic variance specification (Legarra et al. 2014). In Chapter 4, I found
that heterogeneous genetic variance specifications could be particularly important. However,
since a heterogeneous genetic variance specification seems to periodically suffer from
convergence issues using AIREML, a single-step Bayesian approach with a Gaussian prior using
MCMC might be a solution. Again, further research on this topic is needed. It is also
conceivable that different herds have not only heterogeneous genetic variation but also
heterogeneous residual variation, so WGP and GWA extensions that already consider
heterogeneous residual variance modeling for different herds (Ou et al. 2016) should be
combined with the developments that I have provided in this thesis.
In Chapter 5, I presented an R package, BATools, for implementing the models discussed in
this dissertation. I extensively tested the package, including using it for Chapters 2-4 and
cross-referencing its results with other software packages; in the meantime, I will continue
to test it on more data sets through our research and industry applications. I also realize
there is always room for improving the computational efficiency. Finally, I designed the
BATools package to be extendable, and many other models on our list, such as
ss-anteBayesA/B, GxE extensions of the Bayesian models and handling of multiple record data,
will be added to the package.
At the time this research was proposed, whole-genome sequence (WGS) data for livestock were
not widely available; even today, they are still available to only a few researchers. Brøndum
et al. (2015) reported a small (2-5%) increase in GEBV prediction reliability. Bayesian models
are expected to improve WGP accuracies using WGS data compared to GBLUP because the genomic
relationships in GBLUP can already be well estimated using high density SNP data (777k),
whereas Bayesian models, which model the marker effects directly, may get additional benefit
from the higher density of WGS (Meuwissen et al. 2016). Since long-range LD may be extensive in
some populations for WGS data, window based inference might still be more appropriate than
single SNP inference in GWA. However, the adaptive window approach used in Chapters 3 and 4
becomes increasingly inefficient since it requires computation of the LD matrix for the entire
chromosome; hence, more efficient methods for clustering variants into windows may need to be
developed. I believe that Bayesian WGP is still valid for WGS data. Since BATools includes both
GWA and WGP, I can slightly modify BATools to handle large WGS data sets without excessive
memory usage through dimension reduction: select the top variants from WGS using LMM based GWA
tools, then use those top variants in our WGP models. Alternatively, I can reduce the number of
random draws from the posterior distribution by developing efficient variable selection methods
that stop sampling some zero effects in the MCMC chain (Moser et al. 2015). Furthermore,
parallelizable versions of Bayesian WGP and GWA based on orthogonal data augmentation should
also be explored to deal with WGS data (Cheng et al. 2017).


APPENDICES


Appendix A: Chapter 3
Implementation Details on Maximum A Posteriori and Monte Carlo Markov Chain
Inferences in BayesA and SSVS models
Expectation (E-) steps and Maximization (M-) steps
Following Chen and Tempelman (2015), the E-step or the expectation of the portion of the log
joint posterior density that is a function of $\tau_j$ is given for MAP-BayesA in Equation [A1]:

\[
\begin{aligned}
& E_{\tau_j|.}\left( \sum_{j=1}^{m} \left( \log p\left(g_j \mid \sigma_g^2, \tau_j\right) + \log p\left(\tau_j \mid \nu_\tau\right) \right) \right) \\
&\quad = \sum_{j=1}^{m} \left( -\frac{1}{2}\frac{g_j^2}{\sigma_g^2}\, E_{\tau_j|.}\left(\frac{1}{\tau_j}\right) - \left(\frac{\nu_\tau}{2}+1\right) E_{\tau_j|.}\left(\log\left(\tau_j\right)\right) - \frac{\nu_\tau}{2}\, E_{\tau_j|.}\left(\frac{1}{\tau_j}\right) \right) + \text{constant}.
\end{aligned}
\]

[A1]

and for MAP-SSVS in Equation [A2]

\[
\begin{aligned}
& E_{\tau_j|.}\left( \sum_{j=1}^{m} \left( \log p\left(g_j \mid \sigma_g^2, \tau_j\right) + \log p\left(\tau_j \mid \pi_\tau\right) \right) \right) \\
&\quad = \sum_{j=1}^{m} \Big( -\frac{1}{2}\sigma_g^{-2} g_j^2 \left( E_{\tau_j|.}\left(\tau_j\right) + \left(1 - E_{\tau_j|.}\left(\tau_j\right)\right) c \right) \\
&\qquad\quad + E_{\tau_j|.}\left(\tau_j\right) \log\left(\pi_\tau\right) + \left(1 - E_{\tau_j|.}\left(\tau_j\right)\right) \log\left(1 - \pi_\tau\right) \Big) + \text{constant}.
\end{aligned}
\]

[A2]

with

\[
E_{\tau_j|.}\left(\frac{1}{\tau_j}\right) = \hat\tau_j^{-1} = \frac{\nu_\tau + 1}{\dfrac{\hat g_j^2}{\sigma_g^2} + \nu_\tau}
\]

for MAP-BayesA and

\[
E_{\tau_j|.}\left(\tau_j\right) = \hat\tau_j = \frac{\phi_{\hat g_j}\left(0, \sigma_g^2\right)\pi_\tau}{\phi_{\hat g_j}\left(0, \sigma_g^2\right)\pi_\tau + \phi_{\hat g_j}\left(0, \dfrac{\sigma_g^2}{c}\right)\left(1 - \pi_\tau\right)}
\]

for MAP-SSVS, where $\phi_x\left(\mu, \sigma^2\right)$ denotes the ordinate of a Gaussian probability
density function with mean $\mu$ and variance $\sigma^2$ evaluated at x.
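
The two conditional expectations above can be evaluated marker by marker. The following base R sketch is only illustrative: the function names estep_bayesA and estep_ssvs and the argument name c_ssvs (standing for c) are hypothetical, and the numerical inputs are arbitrary.

```r
# E-step for MAP-BayesA: E(1 / tau_j) = (nu + 1) / (g_j^2 / sigma2_g + nu)
estep_bayesA <- function(g, sigma2_g, nu) {
  (nu + 1) / (g^2 / sigma2_g + nu)
}

# E-step for MAP-SSVS: E(tau_j), i.e. the weight on the N(0, sigma2_g)
# component relative to the N(0, sigma2_g / c) component
estep_ssvs <- function(g, sigma2_g, pi_tau, c_ssvs) {
  slab  <- dnorm(g, 0, sqrt(sigma2_g)) * pi_tau
  spike <- dnorm(g, 0, sqrt(sigma2_g / c_ssvs)) * (1 - pi_tau)
  slab / (slab + spike)
}

g_hat <- c(0, 0.05, 0.5, 2)                  # arbitrary current SNP effect estimates
estep_bayesA(g_hat, sigma2_g = 0.1, nu = 4)
estep_ssvs(g_hat, sigma2_g = 0.1, pi_tau = 0.01, c_ssvs = 1000)
```

Larger absolute effects receive a smaller E(1/tau_j) in BayesA (less shrinkage) and a larger E(tau_j) in SSVS.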
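A conditional maximization or M-step for $\boldsymbol\beta$ and $\mathbf{g}$ can be determined by solving the
following MME using a SNP-centric model in Equation [A3]

\[
\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{D}^{-1}\sigma_e^2\sigma_g^{-2} \end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol\beta} \\ \hat{\mathbf{g}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{Z}'\mathbf{y} \end{bmatrix}
\]

[A3]

or an animal-centric model in Equation [A4]

\[
\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}' \\ \mathbf{X} & \mathbf{I} + \mathbf{G}^{-1}\sigma_e^2\sigma_g^{-2} \end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol\beta} \\ \hat{\mathbf{a}} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{y} \end{bmatrix}
\]

[A4]

with $\mathbf{G} = \mathbf{Z}\mathbf{D}\mathbf{Z}'$ (Sun et al. 2012), where $\mathbf{D}^{-1} = \mathrm{diag}\left\{\hat\tau_j^{-1}\right\}$ in BayesA or
$\mathbf{D}^{-1} = \mathrm{diag}\left(\hat\tau_j + c\left(1 - \hat\tau_j\right)\right)$ in SSVS.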

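The equivalence of the SNP-centric and animal-centric equations can be checked numerically on a small simulated example. This is a minimal base R sketch with arbitrary simulated inputs; the diagonal of D is filled with arbitrary positive weights rather than actual E-step values, and the covariates in Z are left uncentered so that G = ZDZ' is invertible.

```r
set.seed(42)
n <- 30; m <- 80
Z <- matrix(rnorm(n * m), n, m)      # stand-in for SNP covariates (uncentered)
X <- cbind(1, rbinom(n, 1, 0.5))     # intercept plus one binary covariate
y <- rnorm(n)
d <- runif(m, 0.5, 2)                # diagonal of D (arbitrary positive weights)
lambda <- 2                          # sigma2_e / sigma2_g

# SNP-centric equations [A3]
lhs_g <- rbind(cbind(crossprod(X), crossprod(X, Z)),
               cbind(crossprod(Z, X), crossprod(Z) + diag(lambda / d)))
sol_g <- solve(lhs_g, c(crossprod(X, y), crossprod(Z, y)))
g_hat <- sol_g[-(1:ncol(X))]

# Animal-centric equations [A4] with G = Z D Z'
G <- Z %*% (d * t(Z))
lhs_a <- rbind(cbind(crossprod(X), t(X)),
               cbind(X, diag(n) + solve(G) * lambda))
sol_a <- solve(lhs_a, c(crossprod(X, y), y))
a_hat <- sol_a[-(1:ncol(X))]

# Back-solve SNP effects from animal effects: g_hat = D Z' G^{-1} a_hat
g_back <- d * drop(crossprod(Z, solve(G, a_hat)))
max(abs(g_hat - g_back))             # difference between the two routes
```

The animal-centric system has n + p equations instead of m + p, which is the reason it is preferred when m is much larger than n.
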
Hyperparameter estimation under MAP
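Solutions based on the mixed model equations [A3] and [A4] are conditioned on the variance
components and/or hyperparameters being known. In the classical mixed model literature, those
variance components can be estimated using REML. The vector of hyperparameters is
$\boldsymbol\theta = \left(\sigma_e^2, \sigma_g^2, \nu_\tau\right)$ for BayesA and $\boldsymbol\theta = \left(\sigma_e^2, \sigma_g^2, \pi_\tau\right)$ for SSVS. I partition
$\boldsymbol\theta$ into the variance components $\boldsymbol\sigma = \left(\sigma_e^2, \sigma_g^2\right)$ and the remaining hyperparameters
$\boldsymbol\theta_{-\sigma}$ such that, for example, $\boldsymbol\theta_{-\sigma} = \nu_\tau$ in BayesA whereas $\boldsymbol\theta_{-\sigma} = \pi_\tau$ in SSVS.
The classical log REML function (Searle et al. 1992) can be written as follows:

\[
l\left(\boldsymbol\sigma \mid \mathbf{y}\right) = -0.5\log\left|\mathbf{V}\right| - 0.5\log\left|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right| - 0.5\,\mathbf{y}'\mathbf{P}\mathbf{y}
\]

[A5]

with $\mathbf{V} = \mathbf{Z}\mathbf{D}\mathbf{Z}'\sigma_g^2 + \mathbf{I}\sigma_e^2$ and $\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-}\mathbf{X}'\mathbf{V}^{-1}$. In typical classical
REML specifications involving uncorrelated random effects, D = I. I modify this expression for
our BayesA and SSVS adaptations accordingly as:

\[
\begin{aligned}
l\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right) &= \log\left(p\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right)\right) \\
&= \text{constant} - 0.5\log\left|\mathbf{V}\right| - 0.5\log\left|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right| - 0.5\,\mathbf{y}'\mathbf{P}\mathbf{y} + \log p\left(\boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}\right) + \log p\left(\boldsymbol\sigma\right).
\end{aligned}
\]

[A6]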

Recall that for either hierarchical model, D is a function of $\boldsymbol\tau$, for which conditional
expectations are used to derive $\mathbf{D}^{-1} = \mathrm{diag}\left\{\hat\tau_j^{-1}\right\}$ in BayesA or
$\mathbf{D}^{-1} = \mathrm{diag}\left(\hat\tau_j + c\left(1 - \hat\tau_j\right)\right)$ in SSVS as noted earlier. Upon evaluating Equation [A6] at
$\mathbf{D}^{-1}$, this expression is maximized with respect to $\boldsymbol\sigma$. I again denote the corresponding
estimates as marginal maximum likelihood (MML) estimates in order to distinguish them from
classical REML estimates (Chen and Tempelman 2015).
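
For small problems, the log REML function in Equation [A5] can be evaluated directly, which is a convenient way to check an implementation before moving to AIREML. The base R sketch below is illustrative only: the function name reml_loglik is hypothetical, X is assumed to have full column rank so that an ordinary inverse can replace the generalized inverse, and the prior terms added in Equation [A6] are omitted.

```r
# Log REML function of [A5] for V = Z D Z' sigma2_g + I sigma2_e.
# d holds the diagonal of D; d = rep(1, m) gives the classical D = I case.
reml_loglik <- function(sigma2_e, sigma2_g, y, X, Z, d = rep(1, ncol(Z))) {
  n <- length(y)
  V <- Z %*% (d * t(Z)) * sigma2_g + diag(sigma2_e, n)
  Vinv <- solve(V)
  XtVinvX <- crossprod(X, Vinv %*% X)
  P <- Vinv - Vinv %*% X %*% solve(XtVinvX, crossprod(X, Vinv))
  logdetV <- as.numeric(determinant(V, logarithm = TRUE)$modulus)
  logdetXVX <- as.numeric(determinant(XtVinvX, logarithm = TRUE)$modulus)
  -0.5 * logdetV - 0.5 * logdetXVX - 0.5 * drop(crossprod(y, P %*% y))
}

# Toy usage: compare two candidate values of sigma2_g
set.seed(9)
n <- 15; m <- 6
Z <- matrix(rnorm(n * m), n, m)
X <- cbind(1, rnorm(n))
y <- rnorm(n)
reml_loglik(1, 0.5, y, X, Z)
reml_loglik(1, 0.1, y, X, Z)
```

Forming V and its inverse explicitly costs on the order of n^3 operations, so this direct evaluation is only meant for checking results on small n.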
Average Information REML (AIREML) is a particularly attractive hybrid Fisher's
scoring/Newton-Raphson algorithm used to obtain REML estimates under classical Gaussian
specifications for g based on the log likelihood of Equation [A5] (Gilmour et al. 1995; Johnson
and Thompson 1995). I adapt this algorithm for our proposed MML approach in Equation [A6]
by simply replacing $\boldsymbol\tau$ by $\hat{\boldsymbol\tau}$ from a previous E-step followed by maximizing Equation [A6]
with respect to $\boldsymbol\sigma$ in an M-step evaluated at $\hat{\boldsymbol\tau}$. To account for prior information in
$\log p\left(\boldsymbol\sigma\right)$, I augment the AIREML first and second derivatives as provided by Johnson and
Thompson (1995) with $\dfrac{\partial}{\partial\boldsymbol\sigma}\log p\left(\boldsymbol\sigma\right)$ and $\dfrac{\partial^2}{\partial\boldsymbol\sigma\,\partial\boldsymbol\sigma'}\log p\left(\boldsymbol\sigma\right)$, respectively.

MML algorithm for variance component estimation
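Recall that the classical log REML function (Searle et al. 1992) can be written as follows:

\[
l\left(\boldsymbol\sigma \mid \mathbf{y}\right) = -0.5\log\left|\mathbf{V}\right| - 0.5\log\left|\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right| - 0.5\,\mathbf{y}'\mathbf{P}\mathbf{y}
\]

[A7]

with $\mathbf{V} = \mathbf{Z}\mathbf{D}\mathbf{Z}'\sigma_g^2 + \mathbf{I}\sigma_e^2$ and $\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}$.
The Fisher scoring algorithm for iterate [k] in AIREML for MAP-BayesA and MAP-SSVS
can be specified as follows:

\[
\left(\boldsymbol\sigma^{[k]} - \boldsymbol\sigma^{[k-1]}\right) = \left( E_{\mathbf{y}}\left( -\frac{\partial^2 \log p\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right)}{\partial\boldsymbol\sigma\,\partial\boldsymbol\sigma'} \right)\Bigg|_{\boldsymbol\sigma = \boldsymbol\sigma^{[k-1]}} \right)^{-1} \frac{\partial \log p\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right)}{\partial\boldsymbol\sigma}\Bigg|_{\boldsymbol\sigma = \boldsymbol\sigma^{[k-1]}}
\]

[A8]

where the vector of first derivatives could be determined using Johnson and Thompson (1995)
as: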

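\[
\begin{aligned}
\frac{\partial \log p\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right)}{\partial\sigma_e^2}
&= -\frac{1}{2}\mathrm{tr}\left(\mathbf{P}\right) + \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{P}\mathbf{y} + \frac{\partial \log p\left(\boldsymbol\sigma\right)}{\partial\sigma_e^2} \\
&= -\frac{1}{2}\left( \frac{n - \mathrm{rank}(\mathbf{X})}{\sigma_e^2} - \left(\frac{1}{\sigma_e^2}\right)\left( m - \frac{\mathrm{trace}\left(\mathbf{D}^{-1}\mathbf{C}^{gg|D}\right)}{\sigma_g^2} \right) - \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\left(\sigma_e^2\right)^2} \right) + \frac{\nu_e s_e^2}{2\left(\sigma_e^2\right)^2} - \frac{\nu_e + 2}{2\sigma_e^2}
\end{aligned}
\]

[A9]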

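and

\[
\begin{aligned}
\frac{\partial \log p\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right)}{\partial\sigma_g^2}
&= -\frac{1}{2}\mathrm{tr}\left(\mathbf{P}\mathbf{Z}\mathbf{D}\mathbf{Z}'\right) + \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{Z}\mathbf{D}\mathbf{Z}'\mathbf{P}\mathbf{y} + \frac{\partial \log p\left(\boldsymbol\sigma\right)}{\partial\sigma_g^2} \\
&= -\frac{1}{2}\left( \frac{m}{\sigma_g^2} - \frac{\mathrm{trace}\left(\mathbf{D}^{-1}\mathbf{C}^{gg|D}\right)}{\left(\sigma_g^2\right)^2} - \frac{\hat{\mathbf{g}}'\mathbf{D}^{-1}\hat{\mathbf{g}}}{\left(\sigma_g^2\right)^2} \right) + \frac{\nu_g s_g^2}{2\left(\sigma_g^2\right)^2} - \frac{\nu_g + 2}{2\sigma_g^2}
\end{aligned}
\]

[A10]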

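with $\mathbf{C}^{gg|D}$ defined by Equation [A11]:

\[
\begin{bmatrix} \mathbf{C}^{\beta\beta|D} & \mathbf{C}^{\beta g|D} \\ \mathbf{C}^{g\beta|D} & \mathbf{C}^{gg|D} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{X}\sigma_e^{-2} & \mathbf{X}'\mathbf{Z}\sigma_e^{-2} \\ \mathbf{Z}'\mathbf{X}\sigma_e^{-2} & \mathbf{Z}'\mathbf{Z}\sigma_e^{-2} + \mathbf{D}^{-1}\sigma_g^{-2} \end{bmatrix}^{-1}
\]

[A11]

and $\hat{\mathbf{e}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta} - \mathbf{Z}\hat{\mathbf{g}}$.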
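The second derivatives can also be obtained as described in Johnson and Thompson (1995).
Inverting the coefficient matrix as in Equation [A11] is required to obtain $\mathbf{C}^{gg|D}$; however,
this computation is nearly impossible with more than tens of thousands of markers.
A reasonable strategy if m >> n is to use the animal effects model [A4] and then back-solve
for the SNP effect estimates using $\hat{\mathbf{g}} = \mathbf{D}\mathbf{Z}'\mathbf{G}^{-1}\hat{\mathbf{a}}$. When I adopt the animal effects model,
the corresponding first derivatives are given by:

\[
\begin{aligned}
\frac{\partial \log p\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right)}{\partial\sigma_e^2}
&= -\frac{1}{2}\mathrm{tr}\left(\mathbf{P}\right) + \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{P}\mathbf{y} + \frac{\partial \log p\left(\boldsymbol\sigma\right)}{\partial\sigma_e^2} \\
&= -\frac{1}{2}\left( \frac{n - \mathrm{rank}(\mathbf{X})}{\sigma_e^2} - \left(\frac{1}{\sigma_e^2}\right)\left( n - \frac{\mathrm{trace}\left(\mathbf{G}^{-1}\mathbf{C}^{aa|D}\right)}{\sigma_g^2} \right) - \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\left(\sigma_e^2\right)^2} \right) + \frac{\nu_e s_e^2}{2\left(\sigma_e^2\right)^2} - \frac{\nu_e + 2}{2\sigma_e^2}
\end{aligned}
\]

[A12]

and

\[
\begin{aligned}
\frac{\partial \log p\left(\boldsymbol\sigma, \boldsymbol\tau \mid \boldsymbol\theta_{-\sigma}, \mathbf{y}\right)}{\partial\sigma_g^2}
&= -\frac{1}{2}\mathrm{tr}\left(\mathbf{P}\mathbf{Z}\mathbf{D}\mathbf{Z}'\right) + \frac{1}{2}\mathbf{y}'\mathbf{P}\mathbf{Z}\mathbf{D}\mathbf{Z}'\mathbf{P}\mathbf{y} + \frac{\partial \log p\left(\boldsymbol\sigma\right)}{\partial\sigma_g^2} \\
&= -\frac{1}{2}\left( \frac{n}{\sigma_g^2} - \frac{\mathrm{trace}\left(\mathbf{G}^{-1}\mathbf{C}^{aa|D}\right)}{\left(\sigma_g^2\right)^2} - \frac{\hat{\mathbf{a}}'\mathbf{G}^{-1}\hat{\mathbf{a}}}{\left(\sigma_g^2\right)^2} \right) + \frac{\nu_g s_g^2}{2\left(\sigma_g^2\right)^2} - \frac{\nu_g + 2}{2\sigma_g^2}
\end{aligned}
\]

[A13]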

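with $\mathbf{C}^{aa|D}$ defined as in Equation [A14]

\[
\begin{bmatrix} \mathbf{C}^{\beta\beta|D} & \mathbf{C}^{\beta a|D} \\ \mathbf{C}^{a\beta|D} & \mathbf{C}^{aa|D} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{X}\sigma_e^{-2} & \mathbf{X}'\mathbf{I}\sigma_e^{-2} \\ \mathbf{I}'\mathbf{X}\sigma_e^{-2} & \mathbf{I}'\mathbf{I}\sigma_e^{-2} + \left(\mathbf{Z}\mathbf{D}\mathbf{Z}'\right)^{-1}\sigma_g^{-2} \end{bmatrix}^{-1}
\]

[A14]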

Review of steps for MAP using animal-centric effects model and backtransforming to SNP effects
I summarize our MAP inference strategy as follows.
1. Set initial values for $\hat{\mathbf{g}}_{(0)}$, $\lambda_{(0)} = \sigma^2_{e(0)} / \sigma^2_{g(0)}$ and $t = 1$.
2. Compute

\[
\mathbf{G}_{(t)} = \mathbf{Z}\mathbf{D}_{(t)}\mathbf{Z}'
\]

[A15]

where

\[
\mathbf{D}_{(t)} = \mathrm{diag}\left\{\hat\tau_j\right\}_{(t)} = \mathrm{diag}\left\{ \frac{\dfrac{\hat g^2_{j(t-1)}}{\sigma^2_{g(t-1)}} + \nu_\tau}{\nu_\tau + 1} \right\}
\]

[A16]

for MAP-BayesA and

\[
\mathbf{D}^{-1}_{(t)} = \mathrm{diag}\left( \hat\tau_{j(t)} + c\left(1 - \hat\tau_{j(t)}\right) \right)
\]

[A17]

for MAP-SSVS with

\[
\hat\tau_{j(t)} = \frac{\phi_{\hat g_{j(t-1)}}\left(0, \sigma^2_{g(t-1)}\right)\pi_{\tau(t-1)}}{\phi_{\hat g_{j(t-1)}}\left(0, \sigma^2_{g(t-1)}\right)\pi_{\tau(t-1)} + \phi_{\hat g_{j(t-1)}}\left(0, \dfrac{\sigma^2_{g(t-1)}}{c}\right)\left(1 - \pi_{\tau(t-1)}\right)}
\]

[A18]

3. Obtain $\mathbf{C}^{aa|D}$ using

\[
\begin{bmatrix} \mathbf{C}^{\beta\beta|D} & \mathbf{C}^{\beta a|D} \\ \mathbf{C}^{a\beta|D} & \mathbf{C}^{aa|D} \end{bmatrix}_{(t)}
=
\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{I} \\ \mathbf{I}'\mathbf{X} & \mathbf{I}'\mathbf{I} + \mathbf{G}^{-1}_{(t)}\lambda_{(t-1)} \end{bmatrix}^{-1}\sigma^2_{e(t-1)}
\quad\text{and}\quad
\begin{bmatrix} \hat{\boldsymbol\beta}_{(t)} \\ \hat{\mathbf{a}}_{(t)} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}' \\ \mathbf{X} & \mathbf{I} + \mathbf{G}^{-1}_{(t)}\lambda_{(t-1)} \end{bmatrix}^{-1}
\begin{bmatrix} \mathbf{X}'\mathbf{y} \\ \mathbf{y} \end{bmatrix}
\]

[A19]

4. Compute

\[
\hat{\mathbf{g}}_{(t)} = \mathbf{D}_{(t)}\mathbf{Z}'\mathbf{G}^{-1}_{(t)}\hat{\mathbf{a}}_{(t)}
\]

[A20]

5. Estimate the variance components $\sigma^2_{e(t)}$ and $\sigma^2_{g(t)}$ from the animal effects model using
AIREML. Set $\lambda_{(t)} = \sigma^2_{e(t)} / \sigma^2_{g(t)}$. Increment the iterate number t to t + 1.
6. Repeat Steps 2-5 until convergence.
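
The steps above can be sketched for MAP-BayesA in a few lines of base R. This is a simplified illustration, not the BATools code: Step 5 (AIREML) is replaced by holding the variance components fixed, the function name map_bayesA and all inputs are hypothetical, D is taken as the reciprocal of the E-step expectation E(1/tau_j), and the toy covariates are left uncentered so that G = ZDZ' is invertible.

```r
# Sketch of Steps 1-4 and 6 for MAP-BayesA with sigma2_e and sigma2_g held fixed
# (an assumption of this sketch; Step 5 would re-estimate them by AIREML).
map_bayesA <- function(y, X, Z, sigma2_e, sigma2_g, nu,
                       max_iter = 200, tol = 1e-8) {
  n <- nrow(Z); m <- ncol(Z); p <- ncol(X)
  g <- rep(0, m)                                  # Step 1: starting values
  lambda <- sigma2_e / sigma2_g
  for (iter in seq_len(max_iter)) {
    d <- (g^2 / sigma2_g + nu) / (nu + 1)         # Step 2: D = diag{1 / E(1/tau_j)}
    ZD <- sweep(Z, 2, d, "*")                     # Z D
    G <- tcrossprod(ZD, Z)                        # G = Z D Z'
    Ginv <- solve(G)
    lhs <- rbind(cbind(crossprod(X), t(X)),       # Step 3: animal-centric MME
                 cbind(X, diag(n) + Ginv * lambda))
    sol <- solve(lhs, c(crossprod(X, y), y))
    a <- sol[-seq_len(p)]
    g_new <- drop(crossprod(ZD, Ginv %*% a))      # Step 4: g = D Z' G^{-1} a
    converged <- max(abs(g_new - g)) < tol
    g <- g_new
    if (converged) break                          # Step 6: stop at convergence
  }
  list(beta = sol[seq_len(p)], g = g, iterations = iter)
}

# Toy usage: three large effects among 200 markers
set.seed(7)
n <- 40; m <- 200
Z <- matrix(rnorm(n * m), n, m)
X <- matrix(1, n, 1)
g_true <- c(rep(1, 3), rep(0, m - 3))
y <- drop(2 + Z %*% g_true + rnorm(n))
fit <- map_bayesA(y, X, Z, sigma2_e = 1, sigma2_g = 0.01, nu = 4)
head(order(abs(fit$g), decreasing = TRUE))        # markers with largest estimates
```

Each iteration inverts only n x n matrices, so the cost per iteration is governed by the number of animals rather than the number of markers.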
Asymptotic standard errors of prediction under MAP
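Asymptotic standard errors of prediction can be based on the observed information matrix for
MAP-BayesA and MAP-SSVS:

\[
\left[ -\frac{\partial^2 \log p\left(\boldsymbol\eta \mid \sigma_g^2, \sigma_e^2, \mathbf{y}\right)}{\partial\boldsymbol\eta\,\partial\boldsymbol\eta'} \right]
\]

[A21]

where $\log p\left(\boldsymbol\eta \mid \sigma_g^2, \sigma_e^2, \mathbf{y}\right)$ denotes the log posterior density of $\boldsymbol\eta = \begin{bmatrix} \boldsymbol\beta \\ \mathbf{g} \end{bmatrix}$ conditional on
the variance components but with the uncertainty on $\mathbf{D}^{-1}$ integrated out. Using Louis (1982), I
can derive Expression [A21] for MAP-BayesA in Equation [A22].

\[
\begin{aligned}
-\frac{\partial^2 \log p\left(\boldsymbol\eta \mid \sigma_g^2, \sigma_e^2, \mathbf{y}\right)}{\partial\boldsymbol\eta^2}
&= -\int_{\mathbf{D}^{-1}} \frac{\partial^2 \log p\left(\boldsymbol\eta \mid \sigma_g^2, \sigma_e^2, \mathbf{y}, \mathbf{D}^{-1}\right)}{\partial\boldsymbol\eta^2}\, p\left(\mathbf{D}^{-1} \mid \boldsymbol\eta, \sigma_g^2, \sigma_e^2, \mathbf{y}\right) d\mathbf{D}^{-1} \\
&\quad - \mathrm{var}_{\mathbf{D}^{-1} \mid \boldsymbol\eta, \mathbf{y}, \sigma_g^2, \sigma_e^2}\left\{ \frac{\partial \log p\left(\boldsymbol\eta \mid \sigma_g^2, \sigma_e^2, \mathbf{y}, \mathbf{D}^{-1}\right)}{\partial\boldsymbol\eta} \right\} \\
&= \begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \mathrm{diag}\left(\dfrac{1}{\sigma^2_{g_j}}\right) \end{bmatrix}
- \begin{bmatrix} \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathrm{diag}\left( \dfrac{2 g_j^2}{\nu_\tau + 1}\left(\dfrac{1}{\sigma^2_{g_j}}\right)^2 \right) \end{bmatrix} \\
&= \begin{bmatrix} \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{Z} \\ \mathbf{Z}'\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}'\mathbf{R}^{-1}\mathbf{Z} + \boldsymbol\Delta^{-1} \end{bmatrix}
\end{aligned}
\]

[A22]

where $\sigma^2_{g_j} = \dfrac{g_j^2 + \nu_\tau\sigma_g^2}{\nu_\tau + 1}$ and $\boldsymbol\Delta^{-1} = \mathrm{diag}\left( \dfrac{1}{\sigma^2_{g_j}}\left( 1 - \dfrac{2 g_j^2}{\left(\nu_\tau + 1\right)\sigma^2_{g_j}} \right) \right)$.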

Then I can obtain the asymptotic prediction error (co)variance (PEV) matrix ($\mathbf{C}^{gg}$) of the SNP
effect estimates from the random by random portion of the inverse of Equation [A22]. For
MAP-SSVS, the observed information matrix can be similarly obtained from Equation [A22]
except that:

\[
\boldsymbol\Delta^{-1} = \mathrm{diag}\left( \frac{1}{\sigma_g^2}\left( \hat\tau_j + c\left(1 - \hat\tau_j\right) - \frac{g_j^2\,\hat\tau_j\left(1 - \hat\tau_j\right)\left(1 - c\right)^2}{\sigma_g^2} \right) \right)
\]

[A23]
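
The diagonal elements of the matrix in [A22] and [A23] can be computed marker by marker. The base R sketch below simply transcribes the two expressions as reconstructed above; the function names delta_inv_bayesA and delta_inv_ssvs and the argument name c_ssvs (standing for c) are hypothetical, and the inputs are arbitrary.

```r
# Diagonal of the inverse weight matrix in [A22] for MAP-BayesA
delta_inv_bayesA <- function(g, sigma2_g, nu) {
  s2 <- (g^2 + nu * sigma2_g) / (nu + 1)          # marker-specific variance sigma2_{g_j}
  (1 / s2) * (1 - 2 * g^2 / ((nu + 1) * s2))
}

# Diagonal of the inverse weight matrix in [A23] for MAP-SSVS
delta_inv_ssvs <- function(g, tau_hat, sigma2_g, c_ssvs) {
  expected <- tau_hat + c_ssvs * (1 - tau_hat)    # expected precision weight
  correction <- g^2 * tau_hat * (1 - tau_hat) * (1 - c_ssvs)^2 / sigma2_g
  (expected - correction) / sigma2_g
}

g <- c(0, 0.1, 0.5)
delta_inv_bayesA(g, sigma2_g = 0.5, nu = 4)
delta_inv_ssvs(g, tau_hat = c(0.01, 0.5, 0.99), sigma2_g = 0.5, c_ssvs = 1000)
```

In both functions the second term is the variance correction from Louis (1982), so the returned values are smaller than the expected precision weights alone and should be checked for sign before being inverted.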

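As inverting Expression [A22] is difficult for large m, I base inference on an animal effects
model using

\[
\begin{bmatrix} \mathbf{C}^{\beta\beta} & \mathbf{C}^{\beta a} \\ \mathbf{C}^{a\beta} & \mathbf{C}^{aa} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{I} \\ \mathbf{I}'\mathbf{X} & \mathbf{I}'\mathbf{I} + \mathbf{G}^{*-1}\lambda \end{bmatrix}^{-1}\sigma_e^2
\]

[A24]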

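where $\mathbf{G}^{*-1} = \left(\mathbf{Z}\boldsymbol\Delta\mathbf{Z}'\right)^{-1}$ and $\lambda = \sigma_e^2 / \sigma_g^2$. Note that the prediction error (co)variance
matrix $\mathbf{C}^{gg} = PEV(\hat{\mathbf{g}})$ for the SNP effects can be derived from the prediction error (co)variance
matrix $\mathbf{C}^{aa} = PEV(\hat{\mathbf{a}})$ for the animal effects. By definition,

\[
\mathbf{C}^{gg} = PEV(\hat{\mathbf{g}}) = \mathrm{var}(\mathbf{g} - \hat{\mathbf{g}}) = \mathrm{var}(\mathbf{g}) - \mathrm{var}(\hat{\mathbf{g}})
\]

[A25]

and

\[
\mathbf{C}^{aa} = PEV(\hat{\mathbf{a}}) = \mathrm{var}(\mathbf{a} - \hat{\mathbf{a}}) = \mathrm{var}(\mathbf{a}) - \mathrm{var}(\hat{\mathbf{a}})
\]

[A26]

Noting that $\hat{\mathbf{g}} = \mathbf{D}\mathbf{Z}'\mathbf{G}^{*-1}\hat{\mathbf{a}}$, I then have

\[
\mathrm{var}(\hat{\mathbf{g}}) = \mathrm{var}\left(\mathbf{D}\mathbf{Z}'\mathbf{G}^{*-1}\hat{\mathbf{a}}\right) = \mathbf{D}\mathbf{Z}'\mathbf{G}^{*-1}\,\mathrm{var}(\hat{\mathbf{a}})\,\mathbf{G}^{*-1}\mathbf{Z}\mathbf{D}
\]

[A27]

where

\[
\mathrm{var}(\hat{\mathbf{a}}) = \mathrm{var}(\mathbf{a}) - \mathbf{C}^{aa} = \mathbf{G}^{*}\sigma_g^2 - \mathbf{C}^{aa}
\]

[A28]

Using [A26], [A27] and [A28] in [A25], $\mathbf{C}^{gg}$ can be derived from $\mathbf{C}^{aa}$.

\[
\begin{aligned}
\mathbf{C}^{gg} &= \mathrm{var}(\mathbf{g}) - \mathrm{var}(\hat{\mathbf{g}}) = \mathbf{D}\sigma_g^2 - \mathbf{D}\mathbf{Z}'\mathbf{G}^{*-1}\,\mathrm{var}(\hat{\mathbf{a}})\,\mathbf{G}^{*-1}\mathbf{Z}\mathbf{D} \\
&= \mathbf{D}\sigma_g^2 - \mathbf{D}\mathbf{Z}'\mathbf{G}^{*-1}\left(\mathbf{G}^{*}\sigma_g^2 - \mathbf{C}^{aa}\right)\mathbf{G}^{*-1}\mathbf{Z}\mathbf{D}
\end{aligned}
\]

[A29]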

An important feature of Equation [A29] is that just the diagonals of $\mathbf{C}^{gg}$ (for single SNP
associations) or block diagonals of $\mathbf{C}^{gg}$ (for windows based inference) can be readily computed
without computing all of [A29]. For example, suppose I write $\mathbf{M} = \mathbf{D}\mathbf{Z}'\mathbf{G}^{*-1}$. Hence for $\mathbf{m}_j'$
being row j of M, the corresponding jth diagonal element, $c^{gg}_{j,j}$, of $\mathbf{C}^{gg}$, used to derive either
an EMMAX, RRBLUP, MAP-BayesA or MAP-SSVS test for SNP j, can be determined as a function of a
simple quadratic form; i.e.

\[
c^{gg}_{j,j} = d_{j,j}\sigma_g^2 - \mathbf{m}_j'\left(\mathbf{G}^{*}\sigma_g^2 - \mathbf{C}^{aa}\right)\mathbf{m}_j
\]

[A30]

Similarly, if one conducts windows based inference where $\mathbf{M}_k'$ denotes the subset of rows of
M pertaining to the $n_k$ SNP markers in window k, then the corresponding block diagonal $\mathbf{C}_k^{gg}$ of
$\mathbf{C}^{gg}$ for window k can be written simply:

\[
\mathbf{C}_k^{gg} = \mathbf{D}_k\sigma_g^2 - \mathbf{M}_k'\left(\mathbf{G}^{*}\sigma_g^2 - \mathbf{C}^{aa}\right)\mathbf{M}_k
\]

[A31]

Here $\mathbf{D}_k$ is the diagonal sub-block of D pertaining to the $n_k$ SNPs in window k.
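
The shortcut in Equations [A30] and [A31] can be illustrated on simulated inputs. In this base R sketch, all quantities are arbitrary stand-ins: the diagonals d and delta are random positive weights rather than values from an actual MAP fit, and the variable names are hypothetical.

```r
set.seed(3)
n <- 25; m <- 60
Z <- matrix(rnorm(n * m), n, m)
X <- matrix(1, n, 1)
d <- runif(m, 0.5, 2)                 # diagonal of D (arbitrary here)
delta <- runif(m, 0.5, 2)             # diagonal of the weight matrix in G* (arbitrary here)
sigma2_e <- 1; sigma2_g <- 0.5
lambda <- sigma2_e / sigma2_g

Gs <- Z %*% (delta * t(Z))            # G* as in [A24]
Gs_inv <- solve(Gs)
lhs <- rbind(cbind(crossprod(X), t(X)),
             cbind(X, diag(n) + Gs_inv * lambda))
Caa <- (solve(lhs) * sigma2_e)[-1, -1]            # animal block of [A24]

M <- (d * t(Z)) %*% Gs_inv            # M = D Z' G*^{-1}, an m x n matrix
K <- Gs * sigma2_g - Caa              # G* sigma2_g - C^{aa}

# [A30]: all m diagonal elements as quadratic forms, without the m x m matrix
cgg_diag <- d * sigma2_g - rowSums((M %*% K) * M)

# [A31]: block for one window of markers
win <- 5:9
Cgg_win <- diag(d[win]) * sigma2_g - M[win, ] %*% K %*% t(M[win, ])
```

Only n x n matrices and the m x n matrix M are formed, which is what keeps single SNP and window based tests feasible when m is large.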


Full Conditional Densities (FCD) for Markov Chain Monte Carlo Inference (MCMC) in
BayesA and SSVS models
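Recall the joint posterior density from Equation [3.7], provided again below:

\[
\begin{aligned}
& p\left(\boldsymbol\beta, \mathbf{g}, \boldsymbol\tau, \sigma_e^2, \sigma_g^2, \theta_\tau \mid \mathbf{y}\right) \\
&\quad \propto \left( \prod_{i=1}^{n} p\left(y_i \mid \boldsymbol\beta, \mathbf{g}, \sigma_e^2\right) \right) \left( \prod_{j=1}^{m} p\left(g_j \mid \sigma_g^2, \tau_j\right) p\left(\tau_j \mid \theta_\tau\right) \right) p\left(\boldsymbol\beta\right) p\left(\sigma_g^2 \mid \nu_g, s_g^2\right) p\left(\sigma_e^2 \mid \nu_e, s_e^2\right) p\left(\theta_\tau\right)
\end{aligned}
\]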
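For the fixed effects, suppose the design matrix is n×p and write:

\[
\mathbf{X}_{n \times p} =
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2p} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{np}
\end{bmatrix}
=
\begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \mathbf{x}_3' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}
=
\begin{bmatrix} \mathbf{x}_{.1} & \mathbf{x}_{.2} & \mathbf{x}_{.3} & \cdots & \mathbf{x}_{.p} \end{bmatrix}
\]

[A32]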

Here $\mathbf{x}_{.j}$ is the vector of covariates or dummy variables for element j of the fixed effects.
For the marker effects, suppose the design matrix is n×m and write

\[
\mathbf{Z} =
\begin{bmatrix}
z_{11} & z_{12} & z_{13} & \cdots & z_{1m} \\
z_{21} & z_{22} & z_{23} & \cdots & z_{2m} \\
z_{31} & z_{32} & z_{33} & \cdots & z_{3m} \\
\vdots & \vdots & \vdots & & \vdots \\
z_{n1} & z_{n2} & z_{n3} & \cdots & z_{nm}
\end{bmatrix}
=
\begin{bmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \mathbf{z}_3' \\ \vdots \\ \mathbf{z}_n' \end{bmatrix}
=
\begin{bmatrix} \mathbf{z}_{.1} & \mathbf{z}_{.2} & \mathbf{z}_{.3} & \cdots & \mathbf{z}_{.m} \end{bmatrix}
\]

[A33]

where $\mathbf{z}_{.j}$ is the vector of genotype values for SNP marker j.
The full conditional distributions of all unknown parameters are then outlined below, first
for MCMC-BayesA and then for MCMC-SSVS.
Full conditional densities (FCD) under MCMC-BayesA
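FCD for Fixed Effects

\[ p(\beta_j \mid \text{ELSE}) \sim N\left(\hat{\beta}_j, v_{\beta_j}\right) \tag{A34} \]

with

\[
\begin{aligned}
\hat{\beta}_j
&= \frac{\mathbf{x}_{.j}'\left((\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{g}) + \mathbf{x}_{.j}\beta_j\right)}{\mathbf{x}_{.j}'\mathbf{x}_{.j}}
= \frac{\mathbf{x}_{.j}'\left(\mathbf{e} + \mathbf{x}_{.j}\beta_j\right)}{\mathbf{x}_{.j}'\mathbf{x}_{.j}}
= \frac{\mathbf{x}_{.j}'\mathbf{e} + \mathbf{x}_{.j}'\mathbf{x}_{.j}\beta_j}{\mathbf{x}_{.j}'\mathbf{x}_{.j}} \\
&= \frac{\sum_{i=1}^{n} x_{ij}e_i + \left(\sum_{i=1}^{n} x_{ij}^2\right)\beta_j}{\sum_{i=1}^{n} x_{ij}^2}
\end{aligned}
\]

and $v_{\beta_j} = \sigma_e^2\left(\mathbf{x}_{.j}'\mathbf{x}_{.j}\right)^{-1} = \sigma_e^2\left(\sum_{i=1}^{n} x_{ij}^2\right)^{-1}$.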
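The single-site update in [A34] can be sketched as follows. This is a minimal illustration, not the dissertation's software: the function name and argument layout are assumptions, `e` is taken to hold the current residuals $\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{g}$, and the residuals are adjusted in place so the next update can reuse them.

```python
import random

def sample_beta_j(x_col, e, beta_j, sigma2_e, rng=random):
    """Draw beta_j from N(mean, var) as in [A34] and update residuals e in place."""
    sxx = sum(x * x for x in x_col)                      # x_.j' x_.j
    mean = (sum(x * r for x, r in zip(x_col, e)) + sxx * beta_j) / sxx
    var = sigma2_e / sxx
    new_beta = rng.gauss(mean, var ** 0.5)
    for i, x in enumerate(x_col):                        # e <- e - x_.j (new - old)
        e[i] -= x * (new_beta - beta_j)
    return new_beta

# Toy usage with made-up data: an intercept column and current beta_j = 0.
x_col = [1.0, 1.0, 1.0, 1.0]
e = [2.0, 4.0, 6.0, 8.0]
print(sample_beta_j(x_col, e, 0.0, 1.0, random.Random(1)))
```

Keeping the residual vector current avoids recomputing $\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{g}$ from scratch at every step.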

FCD for Marker Effects

\[
p(g_j \mid \text{ELSE}) \propto \left(\prod_{i=1}^{n} p(y_i \mid \boldsymbol{\beta}, \mathbf{g})\right) p\left(g_j \mid \sigma_g^2, \tau_j\right) \sim N\left(\hat{g}_j, v_{g_j}\right); \quad j = 1, 2, \ldots, m \tag{A35}
\]

where

\[
\begin{aligned}
\hat{g}_j
&= \frac{\mathbf{z}_{.j}'\left((\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{g}) + \mathbf{z}_{.j}g_j\right)}{\mathbf{z}_{.j}'\mathbf{z}_{.j} + \sigma_e^2\left(\sigma_g^2\tau_j\right)^{-1}}
= \frac{\mathbf{z}_{.j}'\left(\mathbf{e} + \mathbf{z}_{.j}g_j\right)}{\mathbf{z}_{.j}'\mathbf{z}_{.j} + \sigma_e^2\left(\sigma_g^2\tau_j\right)^{-1}}
= \frac{\mathbf{z}_{.j}'\mathbf{e} + \mathbf{z}_{.j}'\mathbf{z}_{.j}g_j}{\mathbf{z}_{.j}'\mathbf{z}_{.j} + \sigma_e^2\left(\sigma_g^2\tau_j\right)^{-1}} \\
&= \frac{\sum_{i=1}^{n} z_{ij}e_i + \left(\sum_{i=1}^{n} z_{ij}^2\right)g_j}{\sum_{i=1}^{n} z_{ij}^2 + \sigma_e^2\left(\sigma_g^2\tau_j\right)^{-1}}
\end{aligned}
\]

and

\[ v_{g_j} = \left(\frac{\sum_{i=1}^{n} z_{ij}^2}{\sigma_e^2} + \left(\sigma_g^2\tau_j\right)^{-1}\right)^{-1}. \]

FCD for Marker-Specific Augmented Variables

\[
\begin{aligned}
p(\tau_j \mid \mathbf{y}, \text{ELSE})
&\propto p\left(g_j \mid \tau_j, \sigma_g^2\right) p(\tau_j \mid \nu_\tau) \\
&\propto \left(2\pi\sigma_g^2\tau_j\right)^{-1/2} \exp\left(-\frac{g_j^2}{2\sigma_g^2\tau_j}\right)
\frac{\left(\nu_\tau/2\right)^{\nu_\tau/2}}{\Gamma\left(\nu_\tau/2\right)}\, \tau_j^{-\left(\frac{\nu_\tau}{2}+1\right)} e^{-\frac{\nu_\tau}{2\tau_j}} \\
&\propto \tau_j^{-\left(\frac{\nu_\tau+1}{2}+1\right)} \exp\left(-\frac{1}{2\tau_j}\left(\frac{g_j^2}{\sigma_g^2} + \nu_\tau\right)\right)
\end{aligned}
\tag{A36}
\]

i.e., a scaled inverted chi-square density with degrees of freedom $\nu_\tau + 1$ and scale parameter $g_j^2/\sigma_g^2 + \nu_\tau$.
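One BayesA cycle for a single marker, combining [A35] and [A36], might look like the sketch below. The function names are hypothetical, `e` again denotes the current residual vector, and the scaled inverted chi-square draw follows the convention of the text in which the scale parameter is a sum, so a draw is the scale divided by a chi-square variate.

```python
import random

def scaled_inv_chi2(df, scale, rng):
    """Draw scale / chi2_df; a chi-square with df degrees of freedom is Gamma(df/2, scale=2)."""
    return scale / rng.gammavariate(df / 2.0, 2.0)

def update_marker_bayesa(z_col, e, g_j, tau_j, sigma2_g, sigma2_e, nu_tau, rng):
    """Sample g_j as in [A35], refresh residuals in place, then sample tau_j as in [A36]."""
    szz = sum(z * z for z in z_col)                      # z_.j' z_.j
    ratio = sigma2_e / (sigma2_g * tau_j)                # sigma_e^2 (sigma_g^2 tau_j)^{-1}
    mean = (sum(z * r for z, r in zip(z_col, e)) + szz * g_j) / (szz + ratio)
    var = sigma2_e / (szz + ratio)                       # equals v_gj in the text
    g_new = rng.gauss(mean, var ** 0.5)
    for i, z in enumerate(z_col):
        e[i] -= z * (g_new - g_j)
    tau_new = scaled_inv_chi2(nu_tau + 1.0, g_new ** 2 / sigma2_g + nu_tau, rng)
    return g_new, tau_new

# Toy usage with made-up genotypes and residuals.
rng = random.Random(1)
z_col = [0.0, 1.0, 2.0, 1.0]
e = [0.3, -0.2, 0.5, 0.1]
print(update_marker_bayesa(z_col, e, 0.4, 1.0, 0.01, 1.0, 4.0, rng))
```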

FCD for Genetic Variance Component

\[
\begin{aligned}
p\left(\sigma_g^2 \mid \mathbf{y}, \text{ELSE}\right)
&\propto \prod_{j=1}^{m} p\left(g_j \mid \tau_j, \sigma_g^2\right) p\left(\sigma_g^2 \mid \nu_g, s_g^2\right) \\
&\propto \left(\sigma_g^2\right)^{-\frac{m}{2}} \exp\left(-\frac{1}{2\sigma_g^2}\sum_{j=1}^{m}\frac{g_j^2}{\tau_j}\right)
\frac{\left(\nu_g s_g^2/2\right)^{\nu_g/2}}{\Gamma\left(\nu_g/2\right)} \left(\sigma_g^2\right)^{-\left(\frac{\nu_g}{2}+1\right)} e^{-\frac{\nu_g s_g^2}{2\sigma_g^2}} \\
&\propto \left(\sigma_g^2\right)^{-\left(\frac{m+\nu_g}{2}+1\right)} \exp\left(-\frac{1}{2\sigma_g^2}\left(\sum_{j=1}^{m}\frac{g_j^2}{\tau_j} + \nu_g s_g^2\right)\right)
\end{aligned}
\tag{A37}
\]

i.e., a scaled inverted chi-square density with degrees of freedom $\nu_g + m$ and scale parameter $\sum_{j=1}^{m} g_j^2/\tau_j + \nu_g s_g^2$.

I adopt the Metropolis-Hastings sampling strategy provided by Yang et al. (2015b) to sample
$\sigma_g^2$ with uncertainty on the $\tau_j$'s integrated out:

\[
\begin{aligned}
p\left(\sigma_g^2 \mid \mathbf{y}, \text{ELSE}\right)
&\propto \left(\prod_{j=1}^{m} p\left(g_j \mid \nu_\tau, \sigma_g^2\right)\right) p\left(\sigma_g^2 \mid \alpha_g, \beta_g\right) \\
&\propto \left(\prod_{j=1}^{m} \frac{\Gamma\left(\frac{\nu_\tau+1}{2}\right)}{\Gamma\left(\frac{\nu_\tau}{2}\right)} \left(\frac{1}{\nu_\tau\sigma_g^2}\right)^{1/2} \left(1 + \frac{g_j^2}{\nu_\tau\sigma_g^2}\right)^{-\frac{\nu_\tau+1}{2}}\right) \left(\sigma_g^2\right)^{\alpha_g - 1} e^{-\beta_g\sigma_g^2}
\end{aligned}
\tag{A38}
\]
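A Metropolis-Hastings step targeting [A38] can be sketched as below. The log-density follows [A38] term by term, but the proposal is an assumption: a random walk on the log scale with a hypothetical tuning parameter `step` is used here for illustration and is not claimed to be the proposal of Yang et al. (2015b).

```python
import math
import random

def log_target_sigma2_g(g, sigma2_g, nu_tau, alpha_g, beta_g):
    """Log of [A38] up to an additive constant (Student-t kernels times a Gamma prior kernel)."""
    lp = (alpha_g - 1.0) * math.log(sigma2_g) - beta_g * sigma2_g
    for g_j in g:
        lp += -0.5 * math.log(sigma2_g)
        lp += -0.5 * (nu_tau + 1.0) * math.log1p(g_j ** 2 / (nu_tau * sigma2_g))
    return lp

def mh_step_sigma2_g(g, sigma2_g, nu_tau, alpha_g, beta_g, step, rng):
    """One MH step with a log-scale random-walk proposal (assumed, not from the source)."""
    proposal = sigma2_g * math.exp(step * rng.gauss(0.0, 1.0))
    log_ratio = (log_target_sigma2_g(g, proposal, nu_tau, alpha_g, beta_g)
                 - log_target_sigma2_g(g, sigma2_g, nu_tau, alpha_g, beta_g)
                 + math.log(proposal) - math.log(sigma2_g))   # Jacobian of the log-scale walk
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal, True
    return sigma2_g, False

# Toy usage with made-up marker effects.
rng = random.Random(0)
print(mh_step_sigma2_g([0.1, -0.2, 0.05], 0.5, 4.0, 1.0, 1.0, 0.3, rng))
```

The extra `log(proposal) - log(sigma2_g)` term corrects for the asymmetry of a multiplicative proposal; with a symmetric proposal on the original scale it would be dropped.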

FCD for Residual Variance

\[
\begin{aligned}
p\left(\sigma_e^2 \mid \text{ELSE}\right)
&\propto \left(2\pi\sigma_e^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma_e^2}\sum_{j=1}^{n} e_j^2\right) \left(\sigma_e^2\right)^{-\left(\frac{\nu_e}{2}+1\right)} e^{-\frac{\nu_e s_e^2}{2\sigma_e^2}} \\
&\propto \left(\sigma_e^2\right)^{-\left(\frac{\nu_e+n}{2}+1\right)} \exp\left(-\frac{1}{2\sigma_e^2}\left(\sum_{j=1}^{n} e_j^2 + \nu_e s_e^2\right)\right)
\end{aligned}
\tag{A39}
\]

i.e., $\chi^{-2}\left(\nu_e + n, \sum_{j=1}^{n} e_j^2 + \nu_e s_e^2\right)$ with $e_j = y_j - \mathbf{x}_j'\boldsymbol{\beta} - \mathbf{z}_j'\mathbf{g}$.
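The draw in [A39] reduces to one chi-square variate. The sketch below assumes the same convention as the text (the scale parameter is the sum $\sum_j e_j^2 + \nu_e s_e^2$); the function name is hypothetical.

```python
import random

def sample_sigma2_e(e, nu_e, s2_e, rng):
    """Draw sigma_e^2 from the scaled inverted chi-square in [A39]."""
    df = nu_e + len(e)
    scale = sum(r * r for r in e) + nu_e * s2_e
    return scale / rng.gammavariate(df / 2.0, 2.0)   # chi2_df is Gamma(df/2, scale=2)

# Toy usage with made-up residuals.
print(sample_sigma2_e([0.5, -1.0, 0.2, 0.8], 4.0, 1.0, random.Random(2)))
```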

Full Conditional Densities for MCMC-SSVS
FCD for Fixed effects: same as that for MCMC-BayesA
FCD for Marker Effects

\[
p(g_j \mid \text{ELSE}) \propto \left(\prod_{i=1}^{n} p(y_i \mid \boldsymbol{\beta}, \mathbf{g})\right) p\left(g_j \mid \sigma_g^2, \tau_j\right) \sim N\left(\hat{g}_j, v_{g_j}\right); \quad j = 1, 2, \ldots, m \tag{A40}
\]

where, writing $\omega_j = \sigma_e^2\left(\sigma_g^2\left(\frac{1-\tau_j}{c} + \tau_j\right)\right)^{-1}$ for brevity,

\[
\begin{aligned}
\hat{g}_j
&= \frac{\mathbf{z}_{.j}'\left((\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{Z}\mathbf{g}) + \mathbf{z}_{.j}g_j\right)}{\mathbf{z}_{.j}'\mathbf{z}_{.j} + \omega_j}
= \frac{\mathbf{z}_{.j}'\left(\mathbf{e} + \mathbf{z}_{.j}g_j\right)}{\mathbf{z}_{.j}'\mathbf{z}_{.j} + \omega_j}
= \frac{\mathbf{z}_{.j}'\mathbf{e} + \mathbf{z}_{.j}'\mathbf{z}_{.j}g_j}{\mathbf{z}_{.j}'\mathbf{z}_{.j} + \omega_j} \\
&= \frac{\sum_{i=1}^{n} z_{ij}e_i + \left(\sum_{i=1}^{n} z_{ij}^2\right)g_j}{\sum_{i=1}^{n} z_{ij}^2 + \sigma_e^2\left(\sigma_g^2\left(\frac{1-\tau_j}{c} + \tau_j\right)\right)^{-1}}
\end{aligned}
\]

and

\[ v_{g_j} = \left(\frac{\sum_{i=1}^{n} z_{ij}^2}{\sigma_e^2} + \left(\sigma_g^2\left(\frac{1-\tau_j}{c} + \tau_j\right)\right)^{-1}\right)^{-1}. \]

FCD for Marker-Specific Augmented (i.e., Indicator) Variables $\tau_j$

\[
\begin{aligned}
p(\tau_j \mid \text{ELSE})
&\propto p\left(g_j \mid \sigma_g^2, \tau_j\right) p(\tau_j \mid \pi_\tau) \\
&\propto \frac{1}{\sqrt{2\pi\sigma_g^2\left(\frac{1-\tau_j}{c} + \tau_j\right)}}
\exp\left(-\frac{1}{2}\frac{g_j^2}{\sigma_g^2\left(\frac{1-\tau_j}{c} + \tau_j\right)}\right)
\pi_\tau^{\tau_j}\left(1 - \pi_\tau\right)^{1-\tau_j}
\end{aligned}
\tag{A41}
\]

such that it can be readily determined that

\[ \text{Prob}\left(\tau_j = 1 \mid \text{ELSE except } g_j\right) = \frac{h_1\pi_\tau}{h_0\left(1 - \pi_\tau\right) + h_1\pi_\tau} \tag{A42} \]

where

\[ h_1 = p\left(g_j \mid \sigma_g^2, \tau_j = 1\right) \propto \frac{1}{\sqrt{\sigma_g^2}} \exp\left(-\frac{1}{2}\frac{g_j^2}{\sigma_g^2}\right) \]

and

\[ h_0 = p\left(g_j \mid \sigma_g^2, \tau_j = 0\right) \propto \frac{1}{\sqrt{\sigma_g^2/c}} \exp\left(-\frac{1}{2}\frac{g_j^2}{\sigma_g^2/c}\right) \]

such that

\[ \text{Prob}\left(\tau_j = 1 \mid \text{ELSE except } g_j\right) = \frac{h_1\pi_\tau}{h_0\left(1 - \pi_\tau\right) + h_1\pi_\tau} = \frac{\pi_\tau}{\frac{h_0}{h_1}\left(1 - \pi_\tau\right) + \pi_\tau}; \]

i.e., $\tau_j$ can be drawn from a Bernoulli distribution.
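The Bernoulli probability in [A42] depends on $h_0$ and $h_1$ only through their ratio, so it can be evaluated on the log scale. The sketch below is illustrative; the function names are hypothetical and $c > 1$ is assumed, so that the $\tau_j = 0$ component has the smaller variance $\sigma_g^2/c$.

```python
import math
import random

def log_normal_pdf(x, var):
    """Log density of N(0, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)

def prob_tau_one(g_j, sigma2_g, c, pi_tau):
    """Prob(tau_j = 1) as in [A42], written as pi / ((h0/h1)(1 - pi) + pi)."""
    log_h1 = log_normal_pdf(g_j, sigma2_g)
    log_h0 = log_normal_pdf(g_j, sigma2_g / c)
    odds_against = math.exp(log_h0 - log_h1 + math.log1p(-pi_tau) - math.log(pi_tau))
    return 1.0 / (1.0 + odds_against)

def sample_tau(g_j, sigma2_g, c, pi_tau, rng):
    """Bernoulli draw of the indicator."""
    return 1 if rng.random() < prob_tau_one(g_j, sigma2_g, c, pi_tau) else 0

# Toy usage with made-up values.
print(prob_tau_one(0.0, 1.0, 100.0, 0.5), sample_tau(5.0, 1.0, 100.0, 0.5, random.Random(4)))
```

Working with the ratio avoids underflow of the individual densities when $g_j$ is far from zero.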

FCD for Genetic Variance Component

\[
\begin{aligned}
p\left(\sigma_g^2 \mid \mathbf{y}, \text{ELSE}\right)
&\propto \prod_{j=1}^{m} p\left(g_j \mid \tau_j, \sigma_g^2\right) p\left(\sigma_g^2 \mid \nu_g, s_g^2\right) \\
&\propto \left(\sigma_g^2\right)^{-\frac{m}{2}} \exp\left(-\frac{1}{2\sigma_g^2}\sum_{j=1}^{m}\frac{g_j^2}{\frac{1-\tau_j}{c} + \tau_j}\right)
\left(\sigma_g^2\right)^{-\left(\frac{\nu_g}{2}+1\right)} e^{-\frac{\nu_g s_g^2}{2\sigma_g^2}} \\
&\propto \left(\sigma_g^2\right)^{-\left(\frac{m+\nu_g}{2}+1\right)} \exp\left(-\frac{1}{2\sigma_g^2}\left(\sum_{j=1}^{m}\frac{g_j^2}{\frac{1-\tau_j}{c} + \tau_j} + \nu_g s_g^2\right)\right)
\end{aligned}
\tag{A43}
\]

i.e., a scaled inverted chi-square density with degrees of freedom $\nu_g + m$ and scale parameter $\sum_{j=1}^{m} g_j^2 \big/ \left(\frac{1-\tau_j}{c} + \tau_j\right) + \nu_g s_g^2$.
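The scale parameter in [A43] weights each squared marker effect by its indicator: effects with $\tau_j = 0$ are multiplied by $c$. A minimal sketch, with hypothetical function names and made-up inputs:

```python
import random

def ssvs_scale(g, tau, c, nu_g, s2_g):
    """Scale parameter of [A43]: sum_j g_j^2 / ((1 - tau_j)/c + tau_j) + nu_g * s_g^2."""
    return sum(g_j ** 2 / ((1 - t) / c + t) for g_j, t in zip(g, tau)) + nu_g * s2_g

def sample_sigma2_g_ssvs(g, tau, c, nu_g, s2_g, rng):
    """Draw sigma_g^2 as scale / chi2 with nu_g + m degrees of freedom."""
    df = nu_g + len(g)
    return ssvs_scale(g, tau, c, nu_g, s2_g) / rng.gammavariate(df / 2.0, 2.0)

# Toy usage: one "large variance" and one "small variance" marker.
print(ssvs_scale([1.0, 2.0], [1, 0], 4.0, 4.0, 0.5))
print(sample_sigma2_g_ssvs([1.0, 2.0], [1, 0], 4.0, 4.0, 0.5, random.Random(6)))
```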

FCD for Residual variance: same as that for MCMC-BayesA.

Supplementary Tables and Figures
Table A.1 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up to a false positive rate of 5% (pAUC05) for different
methods for inferring associations based on non-overlapping genomic windows of length 0.5Mb.
Comparisons are made within different specifications of shape parameter (γ) for Gamma
distribution of quantitative trait loci (QTL) effects and number of QTL (nqtl).

Factor    Level     SSVS    BayesA   EMMAX   MAP-SSVS   MAP-BayesA   RRBLUP
Shape γ   0.18      2.71a   2.63a    1.80b   0.79c,*    0.56d,*      0.54d,*
          1.48      4.15a   4.19a    2.73b   0.82c,*    0.54d,*      0.38e,*
          3         4.31a   4.90a    2.92b   0.68c,*    0.42d,*      0.27e,*
nqtl      30        6.54a   6.96a    3.94b   1.52c      0.59d,*      0.34e,*
          90        3.75a   3.73a    2.42b   0.67c,*    0.52c,*      0.41d,*
          300       1.98a   2.08a    1.50b   0.43c,*    0.41c,d,*    0.40c,*
Overall             3.65a   3.78a    2.43b   0.76c,*    0.50d,*      0.38e,*

Values not sharing the same letter within a row have different (P < 0.05) relative pAUC05 within
the row. * indicates the corresponding method is worse than a random classifier (pAUC05 = 1).
Mean estimates based on 10 replicates per each of 9 populations of a 3 x 3 factorial on number
(30, 100, or 300) of markers chosen to be quantitative trait loci (QTL) from the MSUPRP
genotypes, and shape parameter (0.18, 1.48, or 3.00) for Gamma distribution of QTL effects.


Table A.2 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up to a false positive rate of 5% (pAUC05) for different
methods for inferring associations based on non-overlapping genomic windows of length 2Mb.
Comparisons are made within different specifications of shape parameter (γ) for Gamma
distribution of quantitative trait loci (QTL) effects and number of QTL (nqtl).

Factor    Level     SSVS    BayesA   EMMAX   MAP-SSVS   MAP-BayesA   RRBLUP
Shape γ   0.18      2.87a   2.83a    1.67b   0.62c,*    0.65c,*      0.43d,*
          1.48      4.33a   4.23a    2.15b   0.49c,*    0.31d,*      0.24e,*
          3         4.91a   5.06a    2.07b   0.49c,*    0.33d,*      0.21e,*
nqtl      30        7.15a   7.17a    3.12b   1.14c      0.67d,*      0.33e,*
          90        3.72a   3.76a    1.76b   0.44c,*    0.37c,d,*    0.29d,*
          300       2.30a   2.24a    1.36b   0.30c,*    0.27c,*      0.22c,*
Overall             3.94a   3.92a    1.95b   0.53c,*    0.41c,*      0.28d,*

Values not sharing the same letter within a row have different (P < 0.05) relative pAUC05 within
the row. * indicates the corresponding method is worse than a random classifier (pAUC05 = 1).
Mean estimates based on 10 replicates per each of 9 populations of a 3 x 3 factorial on number (nqtl
= 30, 100, or 300) of quantitative trait loci (QTL), and shape parameter (γ = 0.18, 1.48, or 3.00)
for Gamma distribution of QTL effects.


Table A.3 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up to a false positive rate of 5% (pAUC05) for different
methods for inferring associations based on non-overlapping genomic windows of length 3Mb.
Comparisons are made within different specifications of shape parameter (γ) for Gamma
distribution of quantitative trait loci (QTL) effects and number of QTL (nqtl).

Factor    Level     SSVS     BayesA   EMMAX   MAP-SSVS   MAP-BayesA   RRBLUP
Shape γ   0.18      2.91a    2.77a    1.64b   0.73c,*    0.66c,*      0.37d,*
          1.48      4.43a    4.30a    1.98b   0.55c,*    0.49c,*      0.24d,*
          3         5.10a    5.10a    1.95b   0.50c,*    0.43d,*      0.17e,*
nqtl      30        7.52a    7.17a    2.93b   1.16c      0.96c,*      0.34d,*
          90        3.64a    3.72a    1.71b   0.61c,*    0.48c,*      0.24d,*
          300       2.41a    2.28a    1.26b   0.34c,*    0.30c,*      0.16d,*
Overall             4.044a   3.93a    1.85b   0.62c,*    0.52c,*      0.23d,*

Values not sharing the same letter within a row have different (P < 0.05) relative pAUC05 within
the row. * indicates the corresponding method is worse than a random classifier (pAUC05 = 1).
Mean estimates based on 10 replicates per each of 9 populations of a 3 x 3 factorial on number (nqtl
= 30, 100, or 300) of quantitative trait loci (QTL), and shape parameter (γ = 0.18, 1.48, or 3.00)
for Gamma distribution of QTL effects.


Table A.4 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up to a false positive rate of 5% (pAUC05) for different
specifications of degrees of freedom hyperparameter (vg = 2.5 versus vg = 5.0) using
MCMC-BayesA. Comparisons are made within different specifications of shape parameter (γ) for
Gamma distribution of quantitative trait loci (QTL) effects and number of QTL (nqtl).

Factor    Level     vg = 2.5   vg = 5
Shape γ   0.18      3.60a      2.38b
          1.48      5.87a      4.87b
          3         6.74a      4.88b
nqtl      30        9.03a      4.79b
          90        4.98a      3.77b
          300       3.17a      3.14a
Overall             5.22a      3.84b

Values not sharing the same letter within a row have different (P < 0.05) relative pAUC05 within
the row. Mean estimates based on 10 replicates per each of 9 populations of a 3 x 3 factorial on
number (nqtl = 30, 100, or 300) of quantitative trait loci (QTL), and shape parameter (γ = 0.18,
1.48, or 3.00) for Gamma distribution of QTL effects.


Table A.5 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up to a false positive rate of 5% (pAUC05) for different sets of
starting values for SNP effects (MCMC-SSVS vs RRBLUP) for MAP-SSVS. Comparisons are
made within different specifications of number of quantitative trait loci (nqtl).

                                    Starting values for MAP-SSVS
Window specification   Factor       MCMC       RRBLUP
Single SNP             nqtl 30      4.47a      3.79b
                       nqtl 90      2.65a      2.49b
                       nqtl 300     1.72a      1.64a
                       Overall      2.73a      2.49b
1Mb window             Overall      0.94a,*    0.70b,*
Adaptive window        nqtl 30      2.36a      1.76b
                       nqtl 90      0.98a,*    0.80b,*
                       nqtl 300     0.68a,*    0.66a,*
                       Overall      1.16a      0.97b,*

Values not sharing the same letter within a row have different (P < 0.05) relative pAUC05 within
the row. * indicates the corresponding method is not better than a random classifier (pAUC05 = 1).
Mean estimates based on 10 replicates per each of 9 populations of a 3 x 3 factorial on number (nqtl
= 30, 100, or 300) of quantitative trait loci (QTL), and shape parameter (γ = 0.18, 1.48, or 3.00)
for Gamma distribution of QTL effects.


Table A.6 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up to a false positive rate of 5% (pAUC05) for different sets of
starting values for SNP effects (MCMC-BayesA vs RRBLUP) for MAP-BayesA. Comparisons
are made within different specifications of number of quantitative trait loci (nqtl).

                                    Starting values for SNP effects
Window specification   Factor       MCMC-BayesA   RRBLUP
Single SNP             Overall      2.68a         2.76a
1Mb window             Overall      0.75a,*       0.46b,*
Adaptive window        nqtl 30      2.33a         0.87b,*
                       nqtl 90      1.14a         0.50b,*
                       nqtl 300     0.73a,*       0.62a,*
                       Overall      1.25a         0.65b,*

Values not sharing the same letter within a row have different (P < 0.05) relative pAUC05 within
the row. * indicates the corresponding method is not better than a random classifier
(pAUC05 = 1). Mean estimates based on 10 replicates per each of 9 populations of a 3 x 3 factorial
on number (nqtl = 30, 100, or 300) of quantitative trait loci (QTL), and shape parameter (γ = 0.18,
1.48, or 3.00) for Gamma distribution of QTL effects.


Table A.7 Least squares mean relative (random classifier = 1) partial areas under a receiver
operating characteristic curve up to a false positive rate of 5% (pAUC05) for different
methods for inferring associations, averaging across all window size determinations (single SNP,
0.5Mb, 1.0Mb, 2.0Mb, 3.0Mb and adaptive windows), based on two different methods for
inferring posterior probabilities of association (PPA): 1) that proposed by Fernando et al. (2014)
and 2) that proposed by Moser et al. (2015). Comparisons are made within different specifications of
shape parameter (γ) for Gamma distribution of quantitative trait loci (QTL) effects and number of QTL
(nqtl).

                    PPA determination strategy
Factor    Level     Fernando et al. (2014)   Moser et al. (2015)
Shape γ   0.18      3.03a                    3.03a
          1.48      4.60a                    4.43a
          3         5.08a                    4.56b
nqtl      30        7.24a                    7.23a
          90        4.04a                    3.99a
          300       2.42a                    2.11b
Overall             4.14a                    3.94b

Values not sharing the same letter within a row have different (P < 0.05) relative pAUC05 within
the row. Mean estimates based on 10 replicates per each of 9 populations of a 3 x 3 factorial on
number (nqtl = 30, 100, or 300) of quantitative trait loci (QTL), and shape parameter (γ = 0.18,
1.48, or 3.00) for Gamma distribution of QTL effects.


Figure A.1 Boxplot of window lengths for windows adaptively chosen based on the BALD
software in terms of megabases (Panel A) and number of SNP markers (Panel B)

Figure A.2 Average ROC curve (10 replicates) for 1Mb versus 10 SNP windows using
EMMAX (Panel A), MAP-SSVS (Panel B), MAP-BayesA (Panel C) and RRBLUP (Panel D) for
30 quantitative trait loci generated from a Gamma distribution with shape parameter 1.48

Figure A.3 Scatterplots of posterior probabilities of association (PPA) for MCMC-SSVS
(x-axis) versus MAP-SSVS (y-axis) for analysis of 13th rib backfat on 922 pigs from the MSUPRP
population based on different starting values for MAP-SSVS: A) RRBLUP and B) MCMC-SSVS.

Figure A.4 Scatterplot of posterior probabilities of association (PPA) based on local false
discovery rate (lFDR) conversions of p-values from the EMMAX procedure (y-axis: PPA = 1 - lFDR)
and MCMC-SSVS (x-axis) on 13th rib backfat on 922 pigs from the MSUPRP population.

Appendix B: Chapter 4
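Implementation Details for Markov Chain Monte Carlo (MCMC) inferences in ssSSVS

Recall the joint posterior density from Equation [13], provided again below:

\[
\begin{aligned}
p\left(\boldsymbol{\beta}, \alpha_1, \alpha_2, \ldots, \alpha_m, \boldsymbol{\varepsilon}, \sigma_u^2, \sigma_\alpha^2, \sigma_e^2 \mid \mathbf{y}\right)
&\propto \left(\prod_{i=1}^{n} p\left(y_i \mid \boldsymbol{\beta}, \alpha_1, \alpha_2, \ldots, \alpha_m, \boldsymbol{\varepsilon}, \sigma_e^2\right)\right)
\left(\prod_{j=1}^{m} p\left(\alpha_j \mid \sigma_\alpha^2, \phi_j\right) p(\phi_j \mid \pi)\right) \\
&\quad \times p\left(\boldsymbol{\varepsilon} \mid \sigma_u^2\right) p\left(\sigma_\alpha^2 \mid s_\alpha^2, \nu_\alpha\right) p\left(\sigma_u^2 \mid s_u^2, \nu_u\right) p\left(\sigma_e^2 \mid \nu_e, s_e^2\right) p(\pi \mid \alpha_0, \beta_0) \\
&\propto \left(\sigma_e^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma_e^2}\sum_{i=1}^{n}\left(y_i - \mathbf{x}_i'\boldsymbol{\beta} - \mathbf{w}_i'\boldsymbol{\alpha} - \mathbf{u}_i'\boldsymbol{\varepsilon}\right)^2\right) \\
&\quad \times \left(\sigma_u^2\right)^{-q_1/2} \exp\left(-\frac{1}{2\sigma_u^2}\boldsymbol{\varepsilon}'\mathbf{A}^{nn}\boldsymbol{\varepsilon}\right)
\left(\sigma_\alpha^2\right)^{-m/2} \exp\left(-\frac{1}{2\sigma_\alpha^2}\sum_{j=1}^{m}\alpha_j^2\right) \\
&\quad \times \left(\sigma_\alpha^2\right)^{-\left(\frac{\nu_\alpha}{2}+1\right)} e^{-\frac{\nu_\alpha s_\alpha^2}{2\sigma_\alpha^2}}
\left(\sigma_u^2\right)^{-\left(\frac{\nu_u}{2}+1\right)} e^{-\frac{\nu_u s_u^2}{2\sigma_u^2}}
\left(\sigma_e^2\right)^{-\left(\frac{\nu_e}{2}+1\right)} e^{-\frac{\nu_e s_e^2}{2\sigma_e^2}}
\pi^{\alpha_0 - 1}\left(1 - \pi\right)^{\beta_0 - 1}
\end{aligned}
\]

for the MME for equation [4.6].
For the fixed effects $\boldsymbol{\beta}$, suppose the design matrix is $n \times p$ and write

\[
\mathbf{X}_{n \times p} =
\begin{bmatrix}
x_{11} & x_{12} & x_{13} & \cdots & x_{1p} \\
x_{21} & x_{22} & x_{23} & \cdots & x_{2p} \\
x_{31} & x_{32} & x_{33} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & x_{n3} & \cdots & x_{np}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{x}_1' \\ \mathbf{x}_2' \\ \mathbf{x}_3' \\ \vdots \\ \mathbf{x}_n'
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{x}_{.1} & \mathbf{x}_{.2} & \mathbf{x}_{.3} & \cdots & \mathbf{x}_{.p}
\end{bmatrix}
\]

Here $\mathbf{x}_{.j}$ is the vector of covariate/dummy variable values for variable $j$ of the fixed effects.
Similarly, for the marker effects $\boldsymbol{\alpha}$, the design matrix is $n \times m$; write

\[
\mathbf{W} =
\begin{bmatrix}
w_{11} & w_{12} & w_{13} & \cdots & w_{1m} \\
w_{21} & w_{22} & w_{23} & \cdots & w_{2m} \\
w_{31} & w_{32} & w_{33} & \cdots & w_{3m} \\
\vdots & \vdots & \vdots & & \vdots \\
w_{n1} & w_{n2} & w_{n3} & \cdots & w_{nm}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{w}_1' \\ \mathbf{w}_2' \\ \mathbf{w}_3' \\ \vdots \\ \mathbf{w}_n'
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{w}_{.1} & \mathbf{w}_{.2} & \mathbf{w}_{.3} & \cdots & \mathbf{w}_{.m}
\end{bmatrix}
\]

where $\mathbf{w}_{.j}$ is the vector of covariate/dummy variable values for SNP genotype $j$ of the random marker
effects.
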

Finally, for the imputation residual $\boldsymbol{\varepsilon}$, the design matrix is $n_n \times q_n$; write

\[
\mathbf{Z}_n =
\begin{bmatrix}
z_{11} & z_{12} & z_{13} & \cdots & z_{1q_n} \\
z_{21} & z_{22} & z_{23} & \cdots & z_{2q_n} \\
z_{31} & z_{32} & z_{33} & \cdots & z_{3q_n} \\
\vdots & \vdots & \vdots & & \vdots \\
z_{n_n 1} & z_{n_n 2} & z_{n_n 3} & \cdots & z_{n_n q_n}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{z}_{(n)1}' \\ \mathbf{z}_{(n)2}' \\ \mathbf{z}_{(n)3}' \\ \vdots \\ \mathbf{z}_{(n)n_n}'
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{z}_{(n).1} & \mathbf{z}_{(n).2} & \mathbf{z}_{(n).3} & \cdots & \mathbf{z}_{(n).q_n}
\end{bmatrix}
\]

where $\mathbf{z}_{(n).j}$ is the vector of covariate/dummy variable values for ungenotyped animal $j$ of the
imputation residuals.

Full conditional densities (FCD) under MCMC-ssSSVS
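FCD for Fixed Effects

\[ p(\beta_j \mid \text{ELSE}) \sim N\left(\hat{\beta}_j, v_{\beta_j}\right) \]

with

\[
\begin{aligned}
\hat{\beta}_j
&= \frac{\mathbf{x}_{.j}'\left((\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{W}\boldsymbol{\alpha} - \mathbf{U}\boldsymbol{\varepsilon}) + \mathbf{x}_{.j}\beta_j\right)}{\mathbf{x}_{.j}'\mathbf{x}_{.j}}
= \frac{\mathbf{x}_{.j}'\left(\mathbf{e} + \mathbf{x}_{.j}\beta_j\right)}{\mathbf{x}_{.j}'\mathbf{x}_{.j}}
= \frac{\mathbf{x}_{.j}'\mathbf{e} + \mathbf{x}_{.j}'\mathbf{x}_{.j}\beta_j}{\mathbf{x}_{.j}'\mathbf{x}_{.j}} \\
&= \frac{\sum_{i=1}^{n} x_{ij}e_i + \left(\sum_{i=1}^{n} x_{ij}^2\right)\beta_j}{\sum_{i=1}^{n} x_{ij}^2}
\end{aligned}
\]

and $v_{\beta_j} = \sigma_e^2\left(\mathbf{x}_{.j}'\mathbf{x}_{.j}\right)^{-1} = \sigma_e^2\left(\sum_{i=1}^{n} x_{ij}^2\right)^{-1}$.
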

FCD for Marker-Specific Augmented (i.e., Indicator) Variables $\phi_j$
The full conditional density is given by

\[
\begin{aligned}
p(\phi_j \mid \text{ELSE})
&\propto p\left(\alpha_j \mid \sigma_\alpha^2, \phi_j\right) p(\phi_j \mid \pi) \\
&\propto \frac{1}{\sqrt{2\pi\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)}}
\exp\left(-\frac{1}{2}\frac{\alpha_j^2}{\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)}\right)
\pi^{\phi_j}\left(1 - \pi\right)^{1-\phi_j}
\end{aligned}
\]

such that it can be readily determined that

\[ \text{Prob}\left(\phi_j = 1 \mid \text{ELSE except } \alpha_j\right) = \frac{h_1\pi}{h_0\left(1 - \pi\right) + h_1\pi} \]

where

\[ h_1 = p\left(\alpha_j \mid \sigma_\alpha^2, \phi_j = 1\right) \propto \frac{1}{\sqrt{\sigma_\alpha^2}} \exp\left(-\frac{1}{2}\frac{\alpha_j^2}{\sigma_\alpha^2}\right) \]

and

\[ h_0 = p\left(\alpha_j \mid \sigma_\alpha^2, \phi_j = 0\right) \propto \frac{1}{\sqrt{\sigma_\alpha^2/c}} \exp\left(-\frac{1}{2}\frac{\alpha_j^2}{\sigma_\alpha^2/c}\right) \]

such that

\[ \text{Prob}\left(\phi_j = 1 \mid \text{ELSE except } \alpha_j\right) = \frac{h_1\pi}{h_0\left(1 - \pi\right) + h_1\pi} = \frac{\pi}{\frac{h_0}{h_1}\left(1 - \pi\right) + \pi}. \]

In fact, I can use the dnorm function in R to compute $h_0$ and $h_1$ directly. Then $\phi_j$ can be drawn from a Bernoulli
distribution.


FCD for Marker Effects
The full conditional density for the marker effects is given by

\[
p(\alpha_j \mid \text{ELSE}) \propto \left(\prod_{i=1}^{n} p(y_i \mid \boldsymbol{\beta}, \boldsymbol{\alpha})\right) p\left(\alpha_j \mid \sigma_\alpha^2, \phi_j\right) \sim N\left(\hat{\alpha}_j, v_{\alpha_j}\right); \quad j = 1, 2, \ldots, m
\]

where, writing $\omega_j = \sigma_e^2\left(\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)\right)^{-1}$ for brevity,

\[
\begin{aligned}
\hat{\alpha}_j
&= \frac{\mathbf{w}_{.j}'\left((\mathbf{y} - \mathbf{X}\boldsymbol{\beta} - \mathbf{W}\boldsymbol{\alpha} - \mathbf{U}\boldsymbol{\varepsilon}) + \mathbf{w}_{.j}\alpha_j\right)}{\mathbf{w}_{.j}'\mathbf{w}_{.j} + \omega_j}
= \frac{\mathbf{w}_{.j}'\left(\mathbf{e} + \mathbf{w}_{.j}\alpha_j\right)}{\mathbf{w}_{.j}'\mathbf{w}_{.j} + \omega_j}
= \frac{\mathbf{w}_{.j}'\mathbf{e} + \mathbf{w}_{.j}'\mathbf{w}_{.j}\alpha_j}{\mathbf{w}_{.j}'\mathbf{w}_{.j} + \omega_j} \\
&= \frac{\sum_{i=1}^{n} w_{ij}e_i + \left(\sum_{i=1}^{n} w_{ij}^2\right)\alpha_j}{\sum_{i=1}^{n} w_{ij}^2 + \sigma_e^2\left(\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)\right)^{-1}}
\end{aligned}
\]

and

\[ v_{\alpha_j} = \left(\frac{\sum_{i=1}^{n} w_{ij}^2}{\sigma_e^2} + \left(\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)\right)^{-1}\right)^{-1}. \]

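FCD for Imputation Residuals $\boldsymbol{\varepsilon}$
Note that the $\boldsymbol{\varepsilon}$ terms only cross-reference to the non-genotyped animals. That is because

\[ \mathbf{U}\boldsymbol{\varepsilon} = \begin{bmatrix} \mathbf{Z}_n \\ \mathbf{0} \end{bmatrix}\boldsymbol{\varepsilon} = \begin{bmatrix} \mathbf{Z}_n\boldsymbol{\varepsilon} \\ \mathbf{0} \end{bmatrix} \]

Hence, the vector of residuals can be broken down into two components:

\[ \begin{bmatrix} \mathbf{e}_n \\ \mathbf{e}_g \end{bmatrix} = \begin{bmatrix} \mathbf{y}_n - \mathbf{X}_n\boldsymbol{\beta} - \mathbf{W}_n\boldsymbol{\alpha} - \mathbf{Z}_n\boldsymbol{\varepsilon} \\ \mathbf{y}_g - \mathbf{X}_g\boldsymbol{\beta} - \mathbf{W}_g\boldsymbol{\alpha} \end{bmatrix} \]

Also, write the rows of $\mathbf{A}^{nn}$ as follows:

\[ \mathbf{A}^{nn} = \begin{bmatrix} \mathbf{a}_1^{nn\prime} \\ \mathbf{a}_2^{nn\prime} \\ \mathbf{a}_3^{nn\prime} \\ \vdots \\ \mathbf{a}_{q_1}^{nn\prime} \end{bmatrix} \]

Then $p(\varepsilon_k \mid \text{ELSE}) \sim N\left(\hat{\varepsilon}_k, v_{\varepsilon_k}\right)$; $k = 1, 2, \ldots, q_1$, where

\[
\begin{aligned}
\hat{\varepsilon}_k
&= \frac{\mathbf{z}_{(n).k}'\left((\mathbf{y}_n - \mathbf{X}_n\boldsymbol{\beta} - \mathbf{W}_n\boldsymbol{\alpha} - \mathbf{Z}_n\boldsymbol{\varepsilon}) + \mathbf{z}_{(n).k}\varepsilon_k\right) - \lambda_u\mathbf{a}_k^{nn\prime}\boldsymbol{\varepsilon} + \lambda_u a_{kk}^{nn}\varepsilon_k}{\mathbf{z}_{(n).k}'\mathbf{z}_{(n).k} + a_{kk}^{nn}\lambda_u} \\
&= \frac{\mathbf{z}_{(n).k}'\left(\mathbf{e}_n + \mathbf{z}_{(n).k}\varepsilon_k\right) - \lambda_u\mathbf{a}_k^{nn\prime}\boldsymbol{\varepsilon} + \lambda_u a_{kk}^{nn}\varepsilon_k}{\mathbf{z}_{(n).k}'\mathbf{z}_{(n).k} + a_{kk}^{nn}\lambda_u} \\
&= \frac{\mathbf{z}_{(n).k}'\mathbf{e}_n + \mathbf{z}_{(n).k}'\mathbf{z}_{(n).k}\varepsilon_k - \lambda_u\mathbf{a}_k^{nn\prime}\boldsymbol{\varepsilon} + \lambda_u a_{kk}^{nn}\varepsilon_k}{\mathbf{z}_{(n).k}'\mathbf{z}_{(n).k} + a_{kk}^{nn}\lambda_u} \\
&= \frac{\sum_{i=1}^{n_1} z_{(n)ik}e_i + \left(\sum_{i=1}^{n_1} z_{(n)ik}^2\right)\varepsilon_k - \lambda_u\mathbf{a}_k^{nn\prime}\boldsymbol{\varepsilon} + \lambda_u a_{kk}^{nn}\varepsilon_k}{\sum_{i=1}^{n_1} z_{(n)ik}^2 + a_{kk}^{nn}\lambda_u}
\end{aligned}
\]

with $\lambda_u = \sigma_e^2/\sigma_u^2$ and

\[ v_{\varepsilon_k} = \left(\frac{\sum_{i=1}^{n_1} z_{(n)ik}^2}{\sigma_e^2} + a_{kk}^{nn}\sigma_u^{-2}\right)^{-1} \]

Note that $a_{kk}^{nn}$ is element $k,k$ of $\mathbf{A}^{nn}$.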
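The mean and variance of the full conditional density of $\varepsilon_k$ can be computed as sketched below. The function name and argument layout are hypothetical: `A_nn` stands for the matrix written $\mathbf{A}^{nn}$ in the text, `e_n` for the current residuals of the non-genotyped animals, and `z_col` for column $k$ of $\mathbf{Z}_n$.

```python
import random

def eps_k_moments(k, z_col, e_n, eps, A_nn, sigma2_e, sigma2_u):
    """Mean and variance of the FCD of eps_k, with lambda_u = sigma_e^2 / sigma_u^2."""
    lam = sigma2_e / sigma2_u
    szz = sum(z * z for z in z_col)
    # -lambda_u a_k' eps + lambda_u a_kk eps_k reduces to the off-diagonal part of row k.
    off_diag = sum(A_nn[k][l] * eps[l] for l in range(len(eps)) if l != k)
    denom = szz + A_nn[k][k] * lam
    mean = (sum(z * r for z, r in zip(z_col, e_n)) + szz * eps[k] - lam * off_diag) / denom
    var = sigma2_e / denom          # same as (szz/sigma_e^2 + a_kk/sigma_u^2)^{-1}
    return mean, var

def sample_eps_k(k, z_col, e_n, eps, A_nn, sigma2_e, sigma2_u, rng):
    """Draw eps_k from its normal FCD."""
    mean, var = eps_k_moments(k, z_col, e_n, eps, A_nn, sigma2_e, sigma2_u)
    return rng.gauss(mean, var ** 0.5)

# Toy usage with made-up values: 3 non-genotyped records, 2 imputation residuals.
A_nn = [[2.0, -1.0], [-1.0, 2.0]]
print(eps_k_moments(0, [1.0, 0.0, 1.0], [0.5, 0.2, -0.5], [1.0, 2.0], A_nn, 1.0, 0.5))
```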

FCD for Residual Variance $\sigma_e^2$

\[
\begin{aligned}
p\left(\sigma_e^2 \mid \text{ELSE}\right)
&\propto \left(2\pi\sigma_e^2\right)^{-n/2} \exp\left(-\frac{1}{2\sigma_e^2}\sum_{j=1}^{n} e_j^2\right) \left(\sigma_e^2\right)^{-\left(\frac{\nu_e}{2}+1\right)} e^{-\frac{\nu_e s_e^2}{2\sigma_e^2}} \\
&\propto \left(\sigma_e^2\right)^{-\left(\frac{\nu_e+n}{2}+1\right)} \exp\left(-\frac{1}{2\sigma_e^2}\left(\sum_{j=1}^{n} e_j^2 + \nu_e s_e^2\right)\right)
\end{aligned}
\]

where $e_j = y_j - \mathbf{x}_j'\boldsymbol{\beta} - \mathbf{w}_j'\boldsymbol{\alpha} - \mathbf{u}_j'\boldsymbol{\varepsilon}$;
i.e., it is $\chi^{-2}\left(\nu_e + n, \sum_{j=1}^{n} e_j^2 + \nu_e s_e^2\right)$.

FCD for Marker Variance $\sigma_\alpha^2$

\[
\begin{aligned}
p\left(\sigma_\alpha^2 \mid \text{ELSE}\right)
&\propto \left(\prod_{j=1}^{m} p\left(\alpha_j \mid \sigma_\alpha^2, \phi_j\right)\right) p\left(\sigma_\alpha^2 \mid \nu_\alpha, \nu_\alpha s_\alpha^2\right) \\
&\propto \prod_{j=1}^{m}\left(\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)\right)^{-1/2}
\exp\left(-\frac{1}{2}\sum_{j=1}^{m}\frac{\alpha_j^2}{\sigma_\alpha^2\left(\frac{1-\phi_j}{c} + \phi_j\right)}\right)
\left(\sigma_\alpha^2\right)^{-\left(\frac{\nu_\alpha}{2}+1\right)} e^{-\frac{\nu_\alpha s_\alpha^2}{2\sigma_\alpha^2}} \\
&\propto \left(\sigma_\alpha^2\right)^{-m/2}
\exp\left(-\frac{1}{2\sigma_\alpha^2}\sum_{j=1}^{m}\frac{\alpha_j^2}{\frac{1-\phi_j}{c} + \phi_j}\right)
\left(\sigma_\alpha^2\right)^{-\left(\frac{\nu_\alpha}{2}+1\right)} e^{-\frac{\nu_\alpha s_\alpha^2}{2\sigma_\alpha^2}} \\
&\propto \left(\sigma_\alpha^2\right)^{-\left(\frac{\nu_\alpha+m}{2}+1\right)}
\exp\left(-\frac{1}{2\sigma_\alpha^2}\left(\sum_{j=1}^{m}\frac{\alpha_j^2}{\frac{1-\phi_j}{c} + \phi_j} + \nu_\alpha s_\alpha^2\right)\right)
\end{aligned}
\]

The FCD is $\chi^{-2}\left(\nu_\alpha + m, \; \nu_\alpha s_\alpha^2 + \sum_{j=1}^{m} \alpha_j^2 \big/ \left(\frac{1-\phi_j}{c} + \phi_j\right)\right)$.


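FCD for Polygenic Variance $\sigma_u^2$

\[
\begin{aligned}
p\left(\sigma_u^2 \mid \text{ELSE}\right)
&\propto \left(\sigma_u^2\right)^{-q_1/2} \exp\left(-\frac{1}{2\sigma_u^2}\boldsymbol{\varepsilon}'\mathbf{A}^{nn}\boldsymbol{\varepsilon}\right) p\left(\sigma_u^2 \mid \nu_u, s_u^2\right) \\
&\propto \left(2\pi\sigma_u^2\right)^{-q_1/2} \exp\left(-\frac{1}{2\sigma_u^2}\boldsymbol{\varepsilon}'\mathbf{A}^{nn}\boldsymbol{\varepsilon}\right)
\frac{\left(\nu_u s_u^2/2\right)^{\nu_u/2}}{\Gamma\left(\nu_u/2\right)} \left(\sigma_u^2\right)^{-\left(\frac{\nu_u}{2}+1\right)} e^{-\frac{\nu_u s_u^2}{2\sigma_u^2}} \\
&\propto \left(\sigma_u^2\right)^{-\left(\frac{q_1+\nu_u}{2}+1\right)} \exp\left(-\frac{1}{2\sigma_u^2}\left(\boldsymbol{\varepsilon}'\mathbf{A}^{nn}\boldsymbol{\varepsilon} + \nu_u s_u^2\right)\right)
\end{aligned}
\]

Hence, the FCD of $\sigma_u^2$ is a scaled inverse chi-square with degrees of freedom $q_1 + \nu_u$ and
scale $\boldsymbol{\varepsilon}'\mathbf{A}^{nn}\boldsymbol{\varepsilon} + \nu_u s_u^2$, i.e., $\chi^{-2}\left(q_1 + \nu_u, \boldsymbol{\varepsilon}'\mathbf{A}^{nn}\boldsymbol{\varepsilon} + \nu_u s_u^2\right)$.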

FCD for $\pi$
The full conditional density of $\pi$ is given by

\[
p(\pi \mid \text{ELSE}) \propto \left(\prod_{j=1}^{m} p(\phi_j \mid \pi)\right) \pi^{\alpha_0 - 1}\left(1 - \pi\right)^{\beta_0 - 1} \propto \pi^{m_1 + \alpha_0 - 1}\left(1 - \pi\right)^{m - m_1 + \beta_0 - 1}
\]

where $m_1 = \sum_{j=1}^{m}\phi_j$ denotes the number of "large variance" genetic effects as determined in the
current cycle.
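The kernel above is that of a Beta density with parameters $m_1 + \alpha_0$ and $m - m_1 + \beta_0$, so $\pi$ can be drawn directly. A minimal sketch, with a hypothetical function name and made-up indicator values:

```python
import random

def sample_pi(phi, alpha0, beta0, rng):
    """Draw pi from Beta(m1 + alpha0, m - m1 + beta0), where m1 counts indicators equal to 1."""
    m = len(phi)
    m1 = sum(phi)
    return rng.betavariate(m1 + alpha0, m - m1 + beta0)

# Toy usage: 10 of 100 markers currently assigned to the large-variance component.
phi = [1] * 10 + [0] * 90
print(sample_pi(phi, 1.0, 1.0, random.Random(9)))
```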


Figures

Figures B1-B5: Supplementary Manhattan plot figures for within-station splits of genotyped and masked genotyped animals for
within-station partitions P1 (Figure B1), P2 (Figure B2), P3 (Figure B3), P4 (Figure B4), and P5 (Figure B5) for milk fat.
Panel A: Plot of −log10(P-value) versus genomic region for single SNP associations using EMMAX without using phenotypes
on non-genotyped animals; Panel B: Plot of −log10(P-value) versus genomic region for genomic window associations using
EMMAX without using phenotypes on non-genotyped animals; Panel C: Plot of posterior probabilities versus genomic region
for genomic window associations using SSVS without using phenotypes on non-genotyped animals; Panel D: Plot of posterior
probabilities versus genomic region for genomic window associations using SSVS without using phenotypes on non-genotyped
animals; Panel E: Plot of −log10(P-value) versus genomic region for single SNP associations using ssEMMAX using
phenotypes on non-genotyped animals; Panel F: Plot of −log10(P-value) versus genomic region for genomic window
associations using ssEMMAX using phenotypes on non-genotyped animals; Panel G: Plot of posterior probabilities versus
genomic region for genomic window associations using ssSSVS using phenotypes on non-genotyped animals; Panel H: Plot of
posterior probabilities versus genomic region for genomic window associations using ssSSVS using phenotypes on
non-genotyped animals.

Figure B.1 Partition P1: Manhattan plot for milk fat in within-station splits of genotyped and non-genotyped animals.

Figure B.2 Partition P2: Manhattan plot for milk fat in within-station splits of genotyped and non-genotyped animals.

Figure B.3 Partition P3: Manhattan plot for milk fat in within-station splits of genotyped and non-genotyped animals.

Figure B.4 Partition P4: Manhattan plot for milk fat in within-station splits of genotyped and non-genotyped animals.

Figure B.5 Partition P5: Manhattan plot for milk fat in within-station splits of genotyped and non-genotyped animals.

Figures B6-B11: Supplementary Manhattan plot figures for across-station splits for genotyped and masked genotyped animals
with genotype masking on cows from ISU (Figure B6), MSU (Figure B7), USDFRC (Figure B8), UW (Figure B9), FL
(Figure B10), and AGIL (Figure B11) for milk fat. Panel A: Plot of −log10(P-value) versus genomic region for single SNP
associations using EMMAX without using phenotypes on non-genotyped animals; Panel B: Plot of −log10(P-value) versus
genomic region for genomic window associations using EMMAX without using phenotypes on non-genotyped animals; Panel
C: Plot of posterior probabilities versus genomic region for genomic window associations using SSVS without using
phenotypes on non-genotyped animals; Panel D: Plot of posterior probabilities versus genomic region for genomic window
associations using SSVS without using phenotypes on non-genotyped animals; Panel E: Plot of −log10(P-value) versus genomic
region for single SNP associations using ssEMMAX using phenotypes on non-genotyped animals; Panel F: Plot of
−log10(P-value) versus genomic region for genomic window associations using ssEMMAX using phenotypes on non-genotyped animals;
Panel G: Plot of posterior probabilities versus genomic region for genomic window associations using ssSSVS using
phenotypes on non-genotyped animals; Panel H: Plot of posterior probabilities versus genomic region for genomic window
associations using ssSSVS using phenotypes on non-genotyped animals.

Figure B.6 Without the genotypes of ISU: Manhattan plot for milk fat in across-station study.

Figure B.7 Without the genotypes of MSU: Manhattan plot for milk fat in across-station study.

Figure B.8 Without the genotypes of USDFRC: Manhattan plot for milk fat in across-station study.

Figure B.9 Without the genotypes of UW: Manhattan plot for milk fat in across-station study.

Figure B.10 Without the genotypes of FL: Manhattan plot for milk fat in across-station study.

Figure B.11 Without the genotypes of AGIL: Manhattan plot for milk fat in across-station study.

Figures B.12-B.16: Supplementary Manhattan plot figures for within-station splits for genotyped and masked genotyped
animals for within-station partitions P1 (Figure B.12), P2 (Figure B.13), P3 (Figure B.14), P4 (Figure B.15), and P5 (Figure B.16)
for body weight. Panel A: Plot of −log10(P-value) versus genomic region for single SNP associations using EMMAX without
using phenotypes on non-genotyped animals; Panel B: Plot of −log10(P-value) versus genomic region for genomic window
associations using EMMAX without using phenotypes on non-genotyped animals; Panel C: Plot of posterior probabilities
versus genomic region for genomic window associations using SSVS without using phenotypes on non-genotyped animals;
Panel D: Plot of posterior probabilities versus genomic region for genomic window associations using SSVS without using
phenotypes on non-genotyped animals; Panel E: Plot of −log10(P-value) versus genomic region for single SNP associations
using ssEMMAX using phenotypes on non-genotyped animals; Panel F: Plot of −log10(P-value) versus genomic region for
genomic window associations using ssEMMAX using phenotypes on non-genotyped animals; Panel G: Plot of posterior
probabilities versus genomic region for genomic window associations using ssSSVS using phenotypes on non-genotyped
animals; Panel H: Plot of posterior probabilities versus genomic region for genomic window associations using ssSSVS using
phenotypes on non-genotyped animals.

Figure B.12 Partition 1: Manhattan plot for body weight in within station study.

Figure B.13 Partition 2: Manhattan plot for body weight in within station study.

Figure B.14 Partition 3: Manhattan plot for body weight in within station study.

Figure B.15 Partition 4: Manhattan plot for body weight in within station study.

Figure B.16 Partition 5: Manhattan plot for body weight in within station study.

Figures B.17-B.22: Supplementary Manhattan plot figures for across-station splits for genotyped and masked genotyped
animals with genotype masking on cows from ISU (Figure B.17), MSU (Figure B.18), USDFRC (Figure B.19), UW (Figure B.20),
FL (Figure B.21), and AGIL (Figure B.22) for body weight. Panel A: Plot of −log10(P-value) versus genomic region for single
SNP associations using EMMAX without using phenotypes on non-genotyped animals; Panel B: Plot of −log10(P-value) versus
genomic region for genomic window associations using EMMAX without using phenotypes on non-genotyped animals; Panel
C: Plot of posterior probabilities versus genomic region for genomic window associations using SSVS without using
phenotypes on non-genotyped animals; Panel D: Plot of posterior probabilities versus genomic region for genomic window
associations using SSVS without using phenotypes on non-genotyped animals; Panel E: Plot of −log10(P-value) versus genomic
region for single SNP associations using ssEMMAX using phenotypes on non-genotyped animals; Panel F: Plot of −log10(P-value)
versus genomic region for genomic window associations using ssEMMAX using phenotypes on non-genotyped animals;
Panel G: Plot of posterior probabilities versus genomic region for genomic window associations using ssSSVS using
phenotypes on non-genotyped animals; Panel H: Plot of posterior probabilities versus genomic region for genomic window
associations using ssSSVS using phenotypes on non-genotyped animals.

Figure B.17 Without the genotype of ISU: Manhattan plot for body weight in across station study.

Figure B.18 Without the genotype of MSU: Manhattan plot for body weight in across station study.

Figure B.19 Without the genotype of USDFRC: Manhattan plot for body weight in across station study.

Figure B.20 Without the genotype of UW: Manhattan plot for body weight in across station study.

Figure B.21 Without the genotype of FL: Manhattan plot for body weight in across station study.

Figure B.22 Without the genotype of AGIL: Manhattan plot for body weight in across station study.

Appendix C: Chapter 5
Initial value for hyperparameters
In this section, I discuss the initial values for priors based on rules provided by de Los
Campos et al. (2013). To start with, the default heritability $h^2$ is 0.5 and can be changed in
the set.init function. The initial starting value for the residual variance $\sigma_e^2$ is
$\mathrm{var}(y) \times (1 - h^2) \times (v_e + 2) = \mathrm{var}(y) \times (1 - h^2)$, as $v_e = -1$ by default. This setting
holds for all models. For all antedependence models, the default starting values are $\mu_t = 0$ and
$\sigma_t^2 = 0.5$, with priors $\mu_t \sim N(0, 0.01)$ and $\sigma_t^2 \sim \chi^{-2}(-1, 0)$ as suggested by Yang and Tempelman
(2012). Note that all the initial values can be set manually if prior knowledge about the data is
available; see help(set.options) for details.
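
As a minimal sketch of the residual-variance rule above, the following R snippet computes the starting value from a phenotype vector. The function name init_resid_var and the toy phenotypes are hypothetical and are not part of BATools; only the formula var(y) x (1 - h2) x (ve + 2) is taken from the text.

```r
# Hypothetical helper (not a BATools function): starting value for the
# residual variance, var(y) * (1 - h2) * (ve + 2).
init_resid_var <- function(y, h2 = 0.5, ve = -1) {
  var(y) * (1 - h2) * (ve + 2)
}

y <- c(1.2, 0.7, 1.9, 1.4, 0.3, 1.1)    # toy phenotypes, for illustration only

start_default <- init_resid_var(y)            # defaults: h2 = 0.5, ve = -1
start_low_h2  <- init_resid_var(y, h2 = 0.3)  # lower heritability, larger residual share
```

With the default ve = -1 the factor (ve + 2) equals 1, so the starting value is simply the share of the phenotypic variance not attributed to genetics.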

BRR/GBLUP/ssGBLUP
The marker variance is initially set to $\sigma_\alpha^2 = \mathrm{var}(y) \times h^2 / MS_M$, as $v_\alpha = -1$ for BRR/GBLUP,
where $MS_M = n^{-1} \sum_{i=1}^{n} \sum_{j=1}^{m} M_{ij}^2$ is the sum of the sample variances of the columns of the marker
genotype matrix. For ssGBLUP with hetVAR, the initial value for the genetic variance not accounted
for by marker genotypes, $\sigma_u^2$, is also set to $\mathrm{var}(y) \times h^2 / MS_M$, and the same holds for
ssBayesA, ssBayesB and ssSSVS. For ssGBLUP with homVAR,
$\sigma_\alpha^2 = \sigma_u^2 = \mathrm{var}(y) \times h^2 / MS_M$.
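
The BRR/GBLUP starting value can be sketched in R as follows. The simulated genotypes and all object names are illustrative assumptions, and the marker columns are centred before squaring so that the double sum matches the summed column variances (with divisor n rather than n - 1). This is not the BATools source code.

```r
# Illustrative sketch only; object names are hypothetical.
set.seed(1)
n <- 50                                        # number of animals
m <- 200                                       # number of markers
M <- matrix(rbinom(n * m, 2, 0.3), n, m)       # simulated genotypes coded 0/1/2
y <- rnorm(n)                                  # simulated phenotypes

h2  <- 0.5                                     # default heritability
Mc  <- scale(M, center = TRUE, scale = FALSE)  # centre each marker column (assumption)
MSM <- sum(Mc^2) / n                           # n^{-1} * sum_i sum_j M_ij^2

var_a_init <- var(y) * h2 / MSM                # starting marker variance
```

Because MSM grows with the number of markers, the starting marker variance shrinks as more markers share the same genetic variance var(y) x h2.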

BayesA/ssBayesA/anteBayesA/e-BayesA

The starting degrees of freedom parameter $v_\alpha$ is set to 5 by default, and the scale parameter
is then set to $s_\alpha^2 = \mathrm{var}(y) \times h^2 \times (v_\alpha - 2) / v_\alpha / MS_M$; the same value is used for $\sigma_\alpha^2$ in
mapBayesA. The shape parameter $a_s = 0.5$ and rate parameter $\beta_s = 0$ of the Gamma prior correspond to
$\sigma_\alpha^2 \sim \chi^{-2}(-1, 0)$, as suggested by Gelman (2006). For the UNIMH sampling of both $v_\alpha$ and $s_\alpha^2$, the
tuning procedure is the same as suggested by Yang et al. (2015b).
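
One way to read the BayesA scale rule: a scaled inverse chi-square distribution with v degrees of freedom and scale s2 has mean v x s2 / (v - 2) for v > 2, so the chosen scale makes the prior mean of the marker-specific variances equal to var(y) x h2 / MSM. The sketch below checks this arithmetic with arbitrary toy numbers (var_y and MSM are assumed values, not results from this dissertation).

```r
# Toy inputs, for illustration only.
var_y <- 4.0     # phenotypic variance (assumed)
h2    <- 0.5     # default heritability
MSM   <- 250     # sum of marker column variances (assumed)
v_a   <- 5       # default degrees of freedom

# Starting scale parameter for BayesA.
s2_a <- var_y * h2 * (v_a - 2) / v_a / MSM

# Mean of a scaled inverse chi-square(v_a, s2_a) distribution (requires v_a > 2).
prior_mean <- v_a * s2_a / (v_a - 2)
```

Here prior_mean equals var_y * h2 / MSM, which is the BRR/GBLUP starting value for the marker variance.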

BayesB/ssBayesB/anteBayesB
The degrees of freedom $v_\alpha$ is set to 5 by default, and the scale parameter is then set to
$s_\alpha^2 = \mathrm{var}(y) \times h^2 \times (v_\alpha - 2) / v_\alpha / MS_M / \pi_\alpha$, where $\pi_\alpha$ is set to 0.05 initially by default, with a prior
of $\pi_\alpha \sim \mathrm{Beta}(1, 9)$ that has a prior mean of 0.1. The shape parameter $a_s = 0.1$ and rate parameter
$\beta_s = 0.1$ are as in Yang and Tempelman (2012).
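
The extra division by the mixing proportion can be sketched as below, assuming that proportion is the prior probability that a marker has a nonzero effect. Under that assumption, the expected per-marker variance (proportion times the variance of a nonzero effect) again equals var(y) x h2 / MSM. The inputs are toy values and the object names are hypothetical.

```r
# Toy inputs, for illustration only.
var_y <- 4.0     # phenotypic variance (assumed)
h2    <- 0.5     # default heritability
MSM   <- 250     # sum of marker column variances (assumed)
v_a   <- 5       # default degrees of freedom
pi_a  <- 0.05    # default starting proportion of nonzero markers

# Starting scale parameter for BayesB.
s2_b <- var_y * h2 * (v_a - 2) / v_a / MSM / pi_a

# Expected variance of one marker effect: zero with probability 1 - pi_a,
# scaled inverse chi-square mean with probability pi_a.
expected_var <- pi_a * v_a * s2_b / (v_a - 2)

# Mean of a Beta(a, b) prior is a / (a + b).
beta_mean <- 1 / (1 + 9)
```

Note that the starting value 0.05 is not the prior mean: Beta(1, 9) has mean 0.1, as stated in the text.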

SSVS/ssSSVS/e-SSVS
The initial variance component is $\sigma_\alpha^2 = \mathrm{var}(y) \times h^2 \times c / MS_M / (c + (1 - \pi_\tau) \times (1 - c))$, where $\pi_\tau$ is
initialized as 0.05 as in BayesB, with a prior of $\pi_\tau \sim \mathrm{Beta}(1, 9)$ and $c = 1000$ by default.
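
The SSVS rule can be checked numerically. The sketch assumes the usual two-component mixture in which a marker effect has the large variance with probability pi and that variance divided by c otherwise. Since c + (1 - pi)(1 - c) = 1 - pi + pi x c, the starting value makes the mixture-averaged variance equal var(y) x h2 / MSM. Inputs are toy values and the object names are hypothetical.

```r
# Toy inputs, for illustration only.
var_y   <- 4.0    # phenotypic variance (assumed)
h2      <- 0.5    # default heritability
MSM     <- 250    # sum of marker column variances (assumed)
pi_t    <- 0.05   # default starting mixing proportion
c_ratio <- 1000   # default ratio between the two component variances

# Starting value for the large-variance component.
var_a <- var_y * h2 * c_ratio / MSM / (c_ratio + (1 - pi_t) * (1 - c_ratio))

# Average variance over the two mixture components (assumed mixture form).
avg_var <- pi_t * var_a + (1 - pi_t) * var_a / c_ratio
```

With these defaults the large-variance component starts at roughly 20 times the average per-marker variance, while the small component is 1000 times smaller than the large one.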


# The `BALD` package is available at http://www.math-evry.cnrs.fr/logiciels/bald.
# Linux or Mac is recommended since `BALD` is not available on CRAN.
# On Windows, `RTools` needs to be pre-installed from
# https://cran.r-project.org/bin/windows/Rtools/ before installing BALD.
# Install the R packages required by BALD
source("https://bioconductor.org/biocLite.R")
biocLite("chopsticks")
biocLite("snpStats")
biocLite("ROC")
install.packages(c("LDheatmap","quadrupen", "ROC", "grplasso","snpStats"))
# Download and install BALD using the following commands
system("wget http://www.math-evry.cnrs.fr/_media/logiciels/bald_0.2.1.tar.gz")
system("R CMD INSTALL bald_0.2.1.tar.gz")
# Then load the BALD package and the Pig data
library(BALD)
library(BATools)
data(Pig)
map=PigMap
# geno is assumed to be the marker genotype matrix (animals in rows, SNPs in
# columns, in the same order as the rows of map); it is not created in this listing
# Split the genotypes by chromosome because adaptive windows are computed by chromosome
chrs=list()
for(i in 1:max(map$chr)){
  ii=which(map$chr==i)
  chrs[[i]]=geno[,ii]
}
# Then create adaptive windows for each chromosome
# This will take 6-7 hours to run in serial
# It is suggested to run it in parallel for each chromosome on a computing cluster
adaptiveWindows=list()
for(i in 1:length(chrs)){
  Z=chrs[[i]]+1                # genotype codes shifted by 1 before clustering
  p=dim(Z)[2]                  # number of SNPs on this chromosome
  gapS <- gapStatistic(Z, min.nc=2, max.nc=p-1, B=50)
  print(gapS$best.k)           # number of windows selected by the gap statistic
  adaptiveWindows[[i]] <- cutree(gapS$tree, gapS$best.k)
}
# Finally, compute a genome-wide window id for each SNP and add it to the map
idw<-adaptiveWindows[[1]]
for(i in 2:length(adaptiveWindows)){
  tmp<-max(idw)                # largest window id used so far
  idw<-c(idw,adaptiveWindows[[i]]+tmp)
}
map$idw=idw
# For fixed-size windows, use the set.win function in BATools
# For example, to create 1 Mb windows, simply run
map<-set.win(map = map,len=1,unit = "Mb")
# To create 5-SNP windows, simply run
map<-set.win(map = map,len = 5,unit="count")
Figure C.1 Example of creating adaptive windows using BALD and fixed-size windows for the
Pig data
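
The window bookkeeping in Figure C.1 (window ids numbered within each chromosome, then offset so that they are unique genome-wide) can be illustrated in base R without BALD or BATools. This is a sketch on a toy map with hypothetical column names chr and pos, assumed to be sorted by chromosome and position. It is not the implementation of set.win.

```r
# Toy map: 7 SNPs on chromosome 1 and 4 SNPs on chromosome 2 (illustration only).
map <- data.frame(chr = rep(1:2, times = c(7, 4)),
                  pos = c(1:7, 1:4))
snp_per_win <- 5   # window size in number of SNPs

# Window index within each chromosome: SNPs 1-5 -> 1, SNPs 6-10 -> 2, ...
win_in_chr <- ave(seq_len(nrow(map)), map$chr,
                  FUN = function(i) ceiling(seq_along(i) / snp_per_win))

# Offset each chromosome by the number of windows on the preceding chromosomes,
# mirroring the max(idw) update in the listing.
chr_levels  <- sort(unique(map$chr))
win_per_chr <- tapply(win_in_chr, map$chr, max)
offsets     <- unname(c(0, head(cumsum(win_per_chr), -1)))

map$idw <- win_in_chr + offsets[match(map$chr, chr_levels)]
```

Windows never span two chromosomes in this scheme, so the last window on a chromosome may contain fewer than snp_per_win SNPs.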

REFERENCES


Aguilar, I., I. Misztal, D. L. Johnson, A. Legarra, S. Tsuruta et al., 2010 Hot topic: a unified
approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of
Holstein final score. J Dairy Sci 93: 743-752.
Andrews, D. F., and C. L. Mallows, 1974 Scale mixtures of normal distributions. J R Stat Soc
Series B Methodol 36: 99-102.
Bates, D., and D. Eddelbuettel, 2013 Fast and Elegant Numerical Linear Algebra Using the
RcppEigen Package. Journal of Statistical Software; Vol 1, Issue 5 (2013).
Bello, N. M., J. P. Steibel and R. J. Tempelman, 2010 Hierarchical Bayesian modeling of
random and residual variance-covariance matrices in bivariate mixed effects models. Biom J 52:
297-313.
Bernal Rubio, Y. L., J. L. Gualdron Duarte, R. O. Bates, C. W. Ernst, D. Nonneman et al., 2016
Meta-analysis of genome-wide association from genomic prediction models. Anim Genet 47: 36-48.
Bezanson, J., S. Karpinski, V. B. Shah and A. Edelman, 2012 Julia: A fast dynamic language for
technical computing, pp. arXiv preprint arXiv:1209.5145.
Brøndum, R. F., G. Su, L. Janss, G. Sahana, B. Guldbrandtsen et al., 2015 Quantitative trait loci
markers derived from whole genome sequence data increases the reliability of genomic
prediction. J Dairy Sci 98: 4107-4116.
Cai, X., A. Huang and S. Xu, 2011 Fast empirical Bayesian LASSO for multiple quantitative
trait locus mapping. BMC bioinformatics 12: 211.
Calus, M. P., J. Vandenplas and J. Ten Napel, 2015 Ever-growing data sets pose (new)
challenges to genomic prediction models. J Anim Breed Genet 132: 407-408.
Casella, G., 1985 An Introduction to Empirical Bayes Analysis. The American Statistician 39:
83-87.
Casella, G., and E. I. George, 1992 Explaining the Gibbs Sampler. The American Statistician 46:
167-174.
Chen, C., J. P. Steibel and R. J. Tempelman, 2017 Genome-Wide Association Analyses Based on
Broadly Different Specifications for Prior Distributions, Genomic Windows, and Estimation
Methods. Genetics 206: 1791.
Chen, C., and R. J. Tempelman, 2015 An integrated approach to empirical Bayesian whole
genome prediction modeling. JABES 20: 491-511.

Chen, C. Y., I. Misztal, I. Aguilar, A. Legarra and W. M. Muir, 2011a Effect of different
genomic relationship matrices on accuracy and scale. J Anim Sci 89: 2673-2679.
Chen, C. Y., I. Misztal, I. Aguilar, S. Tsuruta, T. H. E. Meuwissen et al., 2011b Genome-wide
marker-assisted selection combining all pedigree phenotypic information with genotypic data in
one step: An example using broiler chickens. J Anim Sci 89: 23-28.
Cheng, H., R. Fernando and D. Garrick, 2017 Parallel Computing to Speed up Whole-Genome
Bayesian Regression Analyses Using Orthogonal Data Augmentation. bioRxiv.
Cheng, H., D. Garrick and R. Fernando, 2016 JWAS: Julia implementation of whole-genome
analyses software using univariate and multivariate Bayesian mixed effects model, pp.
Colombani, C., A. Legarra, S. Fritz, F. Guillaume, P. Croiseau et al., 2013 Application of
Bayesian least absolute shrinkage and selection operator (LASSO) and BayesCpi methods for
genomic selection in French Holstein and Montbeliarde breeds. J Dairy Sci 96: 575-591.
Cuyabano, B. C., G. Su and M. S. Lund, 2014 Genomic prediction of genetic merit using LD-based haplotypes in the Nordic Holstein population. BMC Genomics 15: 1171.
Daetwyler, H. D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen et al., 2014 Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle.
Nat Genet 46: 858-865.
de Los Campos, G., J. M. Hickey, R. Pong-Wong, H. D. Daetwyler and M. P. Calus, 2013
Whole-genome regression and prediction methods applied to plant and animal breeding.
Genetics 193: 327-345.
Dehman, A., C. Ambroise and P. Neuvial, 2015 Performance of a blockwise approach in variable
selection using linkage disequilibrium information. BMC bioinformatics 16: 148.
Dehman, A., and P. Neuvial, 2015 BALD: Blockwise Approach using Linkage Disequilibrium
information. R package version 0.2.1.
Edwards, D. B., C. W. Ernst, N. E. Raney, M. E. Doumit, M. D. Hoge et al., 2008 Quantitative
trait locus mapping in an F2 Duroc x Pietrain resource population: II. Carcass and meat quality
traits. J Anim Sci 86: 254-266.
Endelman, J. B., 2011 Ridge Regression and Other Kernels for Genomic Selection with R
Package rrBLUP. Plant Genome-Us 4: 250-255.
Erbe, M., B. J. Hayes, L. K. Matukumalli, S. Goswami, P. J. Bowman et al., 2012 Improving
accuracy of genomic predictions within and between dairy cattle breeds with imputed high-density single nucleotide polymorphism panels. J Dairy Sci 95: 4114-4129.
Fan, B., S. K. Onteru, Z. Q. Du, D. J. Garrick, K. J. Stalder et al., 2011 Genome-wide association
study identifies Loci for body composition and structural soundness traits in pigs. PloS one 6:
e14726.
Fernando, R., and D. Garrick, 2013 Bayesian Methods Applied to GWAS, pp. 237-274 in
Genome-Wide Association Studies and Genomic Prediction, edited by C. Gondro, J. van der
Werf and B. Hayes. Humana Press.
Fernando, R., A. Toosi, A. Wolc, D. Garrick and J. Dekkers, 2017 Application of Whole-Genome Prediction Methods for Genome-Wide Association Studies: A Bayesian Approach.
Journal of Agricultural, Biological and Environmental Statistics 22: 172-193.
Fernando, R. L., H. Cheng, B. L. Golden and D. J. Garrick, 2016 Computational strategies for
alternative single-step Bayesian regression models with large numbers of genotyped and non-genotyped animals. Genetics Selection Evolution 48: 96.
Fernando, R. L., J. C. M. Dekkers and D. J. Garrick, 2014 A class of Bayesian methods to
combine large numbers of genotyped and non-genotyped animals for whole-genome analyses.
Genetics Selection Evolution 46.
Fernando, R. L., and D. J. Garrick, 2009 in GenSel - user manual.
Fragomeni, B. d. O., I. Misztal, D. L. Lourenco, I. Aguilar, R. Okimoto et al., 2014 Changes in
variance explained by top SNP windows over generations for three traits in broiler chicken.
Frontiers in Genetics 5: 332.
García-Ruiz, A., J. B. Cole, P. M. VanRaden, G. R. Wiggans, F. J. Ruiz-López et al., 2016
Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as
a result of genomic selection. Proceedings of the National Academy of Sciences of the United
States of America 113: E3995-E4004.
Garrick, D. J., J. F. Taylor and R. L. Fernando, 2009 Deregressing estimated breeding values and
weighting information for genomic regression analyses. Genetics Selection Evolution 41: 55.
Gelman, A., 2006 Prior distributions for variance parameters in hierarchical models (Comment
on an Article by Browne and Draper). Bayesian Anal 1: 515-533.
Gelman, A., J. Hill and M. Yajima, 2012 Why I (Usually) don't have to worry about multiple
comparisons. J Res Educ Effectiveness 5: 189-211.
George, E. I., and R. E. McCulloch, 1993 Variable selection via Gibbs sampling. J Amer Statist
Assoc 88: 881 - 889.
Gianola, D., 2013 Priors in whole-genome regression: the bayesian alphabet returns. Genetics
194: 573-596.
Gianola, D., G. de los Campos, W. G. Hill, E. Manfredi and R. Fernando, 2009 Additive Genetic
Variability and the Bayesian Alphabet. Genetics 183: 347-363.
Gianola, D., J. L. Foulley and R. Fernando, 1986 Prediction of breeding values when variances
are not known. Genetics, Selection, Evolution 18: 485-498.

Gianola, D., M. Perez-Enciso and M. A. Toro, 2003 On marker-assisted prediction of genetic
value: Beyond the ridge. Genetics 163: 347-365.
Gilmour, A. R., R. Thompson and B. R. Cullis, 1995 Average information REML: An efficient
algorithm for variance parameter estimation in linear mixed models. Biometrics 51: 1440-1450.
Goddard, M. E., and B. J. Hayes, 2009 Mapping genes for complex traits in domestic animals
and their use in breeding programmes. Nature reviews. Genetics 10: 381-391.
Goddard, M. E., K. E. Kemper, I. M. MacLeod, A. J. Chamberlain and B. J. Hayes, 2016
Genetics of complex traits: prediction of phenotype, identification of causal polymorphisms and
genetic architecture. P Roy Soc B-Biol Sci 283.
Gray, K. A., J. P. Cassady, Y. J. Huang and C. Maltecca, 2012 Effectiveness of genomic
prediction on milk flow traits in dairy cattle. Genetics Selection Evolution 44.
Grisart, B., W. Coppieters, F. Farnir, L. Karim, C. Ford et al., 2002 Positional candidate cloning
of a QTL in dairy cattle: Identification of a missense mutation in the bovine DGAT1 gene with
major effect on milk yield and composition. Genome Res 12: 222-231.
Groenen, M. A., 2016 A decade of pig genome sequencing: a window on pig domestication and
evolution. Genetics, selection, evolution : GSE 48: 23.
Gualdron Duarte, J. L., R. O. Bates, C. W. Ernst, N. E. Raney, R. J. Cantet et al., 2013 Genotype
imputation accuracy in a F2 pig population using high density and low density SNP panels. BMC
genetics 14: 38.
Gualdron Duarte, J. L., R. J. Cantet, R. O. Bates, C. W. Ernst, N. E. Raney et al., 2014 Rapid
screening for phenotype-genotype associations by linear transformations of genomic evaluations.
BMC bioinformatics 15: 246.
Guan, Y., and M. Stephens, 2011 Bayesian variable selection regression for genome-wide
association studies and other large-scale problems. Ann Appl Stat 5: 1780-1815.
Habier, D., R. L. Fernando, K. Kizilkaya and D. J. Garrick, 2011 Extension of the bayesian
alphabet for genomic selection. BMC bioinformatics 12: 186.
Harville, D. A., 1974 Bayesian inference for variance components using only error contrasts.
Biometrika 61: 383-385.
Harville, D. A., 1977 Maximum Likelihood Approaches to Variance Component Estimation and
to Related Problems. Journal of the American Statistical Association 72: 320-338.
Hayashi, T., and H. Iwata, 2010 EM algorithm for Bayesian estimation of genomic breeding
values. BMC genetics 11: 3.

Hayes, B., 2013 Overview of statistical methods for genome-wide association Studies (GWAS),
pp. 149-169 in Genome-Wide Association Studies and Genomic Prediction, edited by C. Gondro,
J. van der Werf and B. Hayes. Humana Press.
Hayes, B., and M. E. Goddard, 2001 The distribution of the effects of genes affecting
quantitative traits in livestock. Genetics Selection Evolution 33: 209-229.
Hayes, B. J., P. J. Bowman, A. J. Chamberlain and M. E. Goddard, 2009 Invited review:
Genomic selection in dairy cattle: Progress and challenges. J Dairy Sci 92.
Hayes, B. J., J. Pryce, A. J. Chamberlain, P. J. Bowman and M. E. Goddard, 2010 Genetic
architecture of complex traits and accuracy of genomic prediction: coat colour, milk-fat
percentage, and type in Holstein cattle as contrasting model traits. Plos Genet 6: e1001139.
Henderson, C. R., 1975 Best Linear Unbiased Estimation and Prediction under a Selection
Model. Biometrics 31: 423-447.
Henderson, C. R., 1985 Equivalent Linear Models to Reduce Computations. J Dairy Sci 68:
2267-2277.
Hoerl, A. E., and R. W. Kennard, 1970 Ridge Regression - Biased Estimation for Nonorthogonal
Problems. Technometrics 12: 55-&.
Huang, A., S. Xu and X. Cai, 2015 Empirical Bayesian elastic net for multiple quantitative trait
locus mapping. Heredity 114: 107-115.
Jiang, J., Q. Zhang, L. Ma, J. Li, Z. Wang et al., 2015 Joint prediction of multiple quantitative
traits using a Bayesian multivariate antedependence model. Heredity 115: 29-36.
Johnson, D. L., and R. Thompson, 1995 Restricted maximum likelihood estimation of variance
components for univariate animal models using sparse matrix techniques and average
information. J Dairy Sci 78: 449-456.
Kane, M., J. W. Emerson and S. Weston, 2013 Scalable Strategies for Computing with Massive
Data. Journal of Statistical Software; Vol 1, Issue 14 (2013).
Kang, H. M., J. H. Sul, S. K. Service, N. A. Zaitlen, S. Y. Kong et al., 2010 Variance component
model to account for sample structure in genome-wide association studies. Nat Genet 42: 348-354.
Kang, H. M., N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman et al., 2008 Efficient control
of population structure in model organism association mapping. Genetics 178: 1709-1723.
Karkkainen, H. P., and M. J. Sillanpaa, 2012 Back to basics for Bayesian model building in
genomic selection. Genetics 191: 969-987.
Kemper, K. E., C. M. Reich, P. J. Bowman, C. J. Vander Jagt, A. J. Chamberlain et al., 2015
Improved precision of QTL mapping using a nonlinear Bayesian method in a multi-breed
population leads to greater accuracy of across-breed genomic predictions. Genetics, selection,
evolution : GSE 47: 29.
Klein, R. J., C. Zeiss, E. Y. Chew, J.-Y. Tsai, R. S. Sackler et al., 2005 Complement Factor H
Polymorphism in Age-Related Macular Degeneration. Science (New York, N.Y.) 308: 385-389.
Knürr, T., E. Läärä and M. J. Sillanpää, 2013 Impact of prior specifications in a shrinkage-inducing Bayesian model for quantitative trait mapping and genomic prediction. Genetics,
selection, evolution : GSE 45: 24-24.
Lee, J., H. Cheng, D. Garrick, B. Golden, J. Dekkers et al., 2017 Comparison of alternative
approaches to single-trait genomic prediction using genotyped and non-genotyped Hanwoo beef
cattle. Genetics Selection Evolution 49: 2.
Legarra, A., O. F. Christensen, I. Aguilar and I. Misztal, 2014 Single Step, a general approach
for genomic selection. Livest Sci 166: 54-65.
Lehermeier, C., V. Wimmer, T. Albrecht, H. J. Auinger, D. Gianola et al., 2013 Sensitivity to
prior specification in Bayesian genome-based prediction models. Statistical applications in
genetics and molecular biology 12: 375-391.
Lippert, C., J. Listgarten, Y. Liu, C. M. Kadie, R. I. Davidson et al., 2011 FaST linear mixed
models for genome-wide association studies. Nat Methods 8: 833-835.
Louis, T. A., 1982 Finding the observed information matrix when using the EM algorithm. J R
Stat Soc Series B Methodol 44: 226-233.
Lourenco, D. A. L., I. Misztal, H. Wang, I. Aguilar, S. Tsuruta et al., 2013 Prediction accuracy
for a simulated maternally affected trait of beef cattle using different genomic evaluation models.
J Anim Sci 91: 4090-4098.
Lourenco, D. A. L., S. Tsuruta, B. O. Fragomeni, Y. Masuda, I. Aguilar et al., 2015 Genetic
evaluation using single-step genomic best linear unbiased predictor in American Angus. J Anim
Sci 93: 2653-2662.
Lu, Y., 2016 Quantitative Genetic and Genomic Modeling of Feed Efficiency in Dairy Cattle,
pp. in Animal Science. Michigan State University.
Lu, Y., M. J. Vandehaar, D. M. Spurlock, K. A. Weigel, L. E. Armentano et al., 2015 An
alternative approach to modeling genetic merit of feed efficiency in dairy cattle. J Dairy Sci 98:
6535-6551.
Ma, H., A. I. Bandos, H. E. Rockette and D. Gur, 2013 On use of partial area under the ROC
curve for evaluation of diagnostic performance. Statistics in Medicine 32: 3449-3458.
Martin, L. S., and E. Eskin, 2016 Review: Population Structure in Genetic Studies: Confounding
Factors and Mixed Models. bioRxiv.

Masuda, Y., I. Misztal, S. Tsuruta, A. Legarra, I. Aguilar et al., 2016 Implementation of genomic
recursions in single-step genomic best linear unbiased predictor for US Holsteins with a large
number of genotyped animals. J Dairy Sci 99: 1968-1974.
Metz, C. E., 1978 Basic Principles of Roc Analysis. Semin Nucl Med 8: 283-298.
Meuwissen, T., B. Hayes and M. Goddard, 2016 Genomic selection: A paradigm shift in animal
breeding. Animal Frontiers 6: 6-14.
Meuwissen, T. H., T. R. Solberg, R. Shepherd and J. A. Woolliams, 2009 A fast algorithm for
BayesB type of prediction of genome-wide estimates of genetic value. Genetics, selection,
evolution : GSE 41: 2.
Meuwissen, T. H. E., B. J. Hayes and M. E. Goddard, 2001 Prediction of total genetic value
using genome-wide dense marker maps. Genetics 157: 1819-1829.
Misztal, I., 2016a Inexpensive Computation of the Inverse of the Genomic Relationship Matrix
in Populations with Small Effective Population Size. Genetics 202: 401-409.
Misztal, I., 2016b Is genomic selection now a mature technology? J. Anim. Breed. Genet. 133:
81-82.
Misztal, I., S. Tsuruta, T. Strabel, B. Auvray, T. Druet et al., 2002 BLUPF90 and related
programs (BGF90) in 7th world congress on genetics applied to livestock production,
Montpellier.
Moser, G., S. H. Lee, B. J. Hayes, M. E. Goddard, N. R. Wray et al., 2015 Simultaneous
discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model.
Plos Genet 11: e1004969.
Nadaf, J., V. Riggio, T. P. Yu and R. Pong-Wong, 2012 Effect of the prior distribution of SNP
effects on the estimation of total breeding value. BMC proceedings 6 Suppl 2: S6.
Ou, Z., R. J. Tempelman, J. P. Steibel, C. W. Ernst, R. O. Bates et al., 2016 Genomic Prediction
Accounting for Residual Heteroskedasticity. G3: Genes|Genomes|Genetics 6: 1.
Perez, P., and G. de los Campos, 2014 Genome-wide regression and prediction with the BGLR
statistical package. Genetics 198: 483-495.
Perez, P., G. de Los Campos, J. Crossa and D. Gianola, 2010 Genomic-Enabled Prediction Based
on Molecular Markers and Pedigree Using the Bayesian Linear Regression Package in R. 3: 106-116.
Pryce, J. E., J. Arias, P. J. Bowman, S. R. Davis, K. A. Macdonald et al., 2012 Accuracy of
genomic predictions of residual feed intake and 250-day body weight in growing heifers using
625,000 single nucleotide polymorphism markers. J Dairy Sci 95: 2108-2119.

Purcell, S., B. Neale, K. Todd-Brown, L. Thomas, M. A. Ferreira et al., 2007 PLINK: a tool set
for whole-genome association and population-based linkage analyses. Am J Hum Genet 81: 559-575.
R Core Team, 2017 R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria.
Resende, M. F. R., P. Munoz, M. D. V. Resende, D. J. Garrick, R. L. Fernando et al., 2012
Accuracy of Genomic Selection Methods in a Standard Data Set of Loblolly Pine (Pinus taeda
L.). Genetics 190: 1503-1510.
Robinson, G. K., 1991 That BLUP is a good thing: the estimation of random effects. Statist. Sci.
6: 15-32.
Rockova, V., and E. I. George, 2014 EMVS: The EM Approach to Bayesian Variable Selection.
Journal of the American Statistical Association 109: 828-846.
Schmid, K., and Z. Yang, 2008 The trouble with sliding windows and the selective pressure in
BRCA1. PloS one 3: e3746.
Searle, S. R., G. Casella and C. E. McCulloch, 1992 Variance components. Wiley, New York.
Shepherd, R. K., T. H. Meuwissen and J. A. Woolliams, 2010 Genomic selection and complex
trait prediction using a fast EM algorithm applied to genome-wide markers. BMC bioinformatics
11: 529.
Sing, T., O. Sander, N. Beerenwinkel and T. Lengauer, 2005 ROCR: visualizing classifier
performance in R. Bioinformatics 21: 3940-3941.
Sorensen, D., and D. Gianola, 2002 Likelihood, Bayesian, and MCMC methods in quantitative
genetics. Springer-Verlag, New York.
Stephens, M., 2017 False discovery rates: a new deal. Biostatistics 18: 275-294.
Stephens, M., and D. J. Balding, 2009 Bayesian statistical methods for genetic association
studies. Nat Rev Genet 10: 681-690.
Stram, D. O., and J. W. Lee, 1994 Variance Components Testing in the Longitudinal Mixed
Effects Model. Biometrics 50: 1171-1177.
Stranden, I., and O. F. Christensen, 2011 Allele coding in genomic evaluation. Genetics,
Selection, Evolution 43: 25.
Stranden, I., and D. J. Garrick, 2009 Technical note: Derivation of equivalent computing
algorithms for genomic predictions and reliabilities of animal merit. J Dairy Sci 92: 2971-2975.
Sun, X., L. Qu, D. J. Garrick, J. C. Dekkers and R. L. Fernando, 2012 A fast EM algorithm for
BayesA-like prediction of genomic breeding values. PloS one 7: e49157.
Technow, F., 2013 Simulation of genomic data in applied genetics. R package version 0.4., pp.
Tempelman, R. J., 2015 Statistical and computational challenges in whole genome prediction
and genome-wide association analyses for plant and animal breeding. JABES 20: 442-466.
Tempelman, R. J., D. M. Spurlock, M. Coffey, R. F. Veerkamp, L. E. Armentano et al., 2015
Heterogeneity in genetic and nongenetic variation and energy sink relationships for residual feed
intake across research stations and countries. J Dairy Sci 98: 2013-2026.
Tizioto, P. C., J. F. Taylor, J. E. Decker, C. F. Gromboni, M. A. Mudadu et al., 2015 Detection
of quantitative trait loci for mineral content of Nelore longissimus dorsi muscle. Genetics,
selection, evolution : GSE 47: 15.
Ueda, N., and R. Nakano, 1998 Deterministic annealing EM algorithm. Neural Networks 11:
271-282.
Vallejo, R. L., T. D. Leeds, B. O. Fragomeni, G. Gao, A. G. Hernandez et al., 2016 Evaluation of
Genome-Enabled Selection for Bacterial Cold Water Disease Resistance Using Progeny
Performance Data in Rainbow Trout: Insights on Genotyping Methods and Genomic Prediction
Models. Front Genet 7: 96.
van den Berg, I., S. Fritz and D. Boichard, 2013 QTL fine mapping with Bayes C(pi): a
simulation study. Genetics Selection Evolution 45.
VanRaden, P. M., 2008 Efficient methods to compute genomic predictions. J Dairy Sci 91:
4414-4423.
Verbyla, K., B. Hayes, P. Bowman and M. Goddard, 2009 Accuracy of genomic selection using
stochastic search variable selection in Australian Holstein Friesian dairy cattle. Genet Res 91:
307 - 311.
Visscher, Peter M., Matthew A. Brown, Mark I. McCarthy and J. Yang, 2012 Five Years of
GWAS Discovery. The American Journal of Human Genetics 90: 7-24.
Wang, H., I. Misztal, I. Aguilar, A. Legarra and W. M. Muir, 2012 Genome-wide association
mapping including phenotypes from relatives without genotypes. Genetics research 94: 73-83.
Wang, T., Y.-P. P. Chen, P. J. Bowman, M. E. Goddard and B. J. Hayes, 2016 A hybrid
expectation maximisation and MCMC sampling algorithm to implement Bayesian mixture model
based genomic prediction and QTL mapping. Bmc Genomics 17: 744.
Wang, X., N. J. Morris, X. Zhu and R. C. Elston, 2013 A variance component based multi-marker association test using family and unrelated data. BMC genetics 14: 17.
Warr, A., C. Robert, D. Hume, A. L. Archibald, N. Deeb et al., 2015 Identification of Low-Confidence Regions in the Pig Reference Genome (Sscrofa 10.2). Frontiers in Genetics 6.

Wellcome Trust Case Control Consortium, 2007 Genome-wide association study of 14,000 cases
of seven common diseases and 3,000 shared controls. Nature 447: 661-678.
Wiggans, G. R., T. A. Cooper, C. P. Van Tassell, T. S. Sonstegard and E. B. Simpson, 2013
Technical note: Characteristics and use of the Illumina BovineLD and GeneSeek Genomic
Profiler low-density bead chips for genomic evaluation. J Dairy Sci 96: 1258-1263.
Wiggans, G. R., P. M. VanRaden and T. A. Cooper, 2011 The genomic evaluation system in the
United States: Past, present, future. J Dairy Sci 94: 3202-3211.
Wimmer, V., T. Albrecht, H. J. Auinger and C. C. Schon, 2012 synbreed: a framework for the
analysis of genomic prediction data using R. Bioinformatics 28: 2086-2087.
Wimmer, V., C. Lehermeier, T. Albrecht, H.-J. Auinger, Y. Wang et al., 2013 Genome-Wide
Prediction of Traits with Different Genetic Architecture Through Efficient Variable Selection.
Genetics 195: 573-587.
Wolc, A., J. Arango, P. Settar, J. E. Fulton, N. P. O'Sullivan et al., 2016 Mixture models detect
large effect QTL better than GBLUP and result in more accurate and persistent predictions. J
Anim Sci Biotechnol 7: 7.
Wolc, A., J. Arango, P. Settar, J. E. Fulton, N. P. O'Sullivan et al., 2012 Genome-wide
association analysis and genetic architecture of egg weight and egg uniformity in layer chickens.
Anim Genet 43 Suppl 1: 87-96.
Wu, M. C., P. Kraft, M. P. Epstein, D. M. Taylor, S. J. Chanock et al., 2010 Powerful SNP-set
analysis for case-control genome-wide association studies. Am J Hum Genet 86: 929-942.
Xu, S., 2007 An Empirical Bayes Method for Estimating Epistatic Effects of Quantitative Trait
Loci. Biometrics 63: 513-521.
Yang, W., C. Chen, J. P. Steibel, C. W. Ernst, R. O. Bates et al., 2015a A comparison of
alternative random regression and reaction norm models for whole genome predictions. J Anim
Sci 93: 2678-2692.
Yang, W., C. Chen and R. J. Tempelman, 2015b Improving the computational efficiency of fully
Bayes inference and assessing the effect of misspecification of hyperparameters in whole-genome prediction models. Genetics, selection, evolution : GSE 47: 13.
Yang, W., and R. J. Tempelman, 2012 A Bayesian antedependence model for whole genome
prediction. Genetics 190: 1491-1501.
Yi, N., and S. Xu, 2008 Bayesian LASSO for quantitative trait loci mapping. Genetics 179:
1045-1055.
Zhang, X., D. Lourenco, I. Aguilar, A. Legarra and I. Misztal, 2016 Weighting Strategies for
Single-Step Genomic BLUP: An Iterative Approach for Accurate Calculation of GEBV and
GWAS. Front Genet 7: 151.
Zhou, X., and M. Stephens, 2012 Genome-wide efficient mixed-model analysis for association
studies. Nat Genet 44: 821-U136.
Zhu, B., M. Zhu, J. Jiang, H. Niu, Y. Wang et al., 2016 The Impact of Variable Degrees of
Freedom and Scale Parameters in Bayesian Methods for Genomic Prediction in Chinese
Simmental Beef Cattle. PloS one 11: e0154118.
Zimmerman, D. L., and V. A. Nunez-Anton, 2010 Antedependence models for longitudinal data,
pp. xvii, 270 p. in Monographs on statistics and applied probability 112. Chapman & Hall/CRC,,
Boca Raton, FL.
