LEVERAGING LOCAL GENETIC INFORMATION IN HIGH-DIMENSIONAL BAYESIAN REGRESSION: METHODS AND COMPUTATION TOOLS

By

Alexa S. Lupi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Biostatistics – Doctor of Philosophy

2024

ABSTRACT

I present three projects that propose computationally efficient Bayesian methods and applications for analyzing high-dimensional genetic data, focusing on incorporating local single-nucleotide polymorphism (SNP) information, such as linkage disequilibrium, to elucidate genetic variability across the genome. Genome-wide information offers valuable insights, but its vast scale presents significant statistical, computational, and interpretation challenges. Focusing on local genomic segments can help address these challenges by providing a more refined approach to understanding genetic variation, particularly across different ancestry groups.

In Chapter 1, I propose an approach to map the contribution of short chromosome segments to the genetic correlation between traits. While genome-wide genetic correlations offer an overall estimate for comorbid traits, they mask local regions whose genetic correlations run in opposing directions, making it difficult to untangle the overall strength of the relationship. Chapter 1 addresses this limitation by estimating local genetic correlations. Hyperuricemia/gout and chronic kidney disease are comorbid conditions for which the biological roots of the comorbidity remain unknown. Utilizing a novel approach, I disentangled the shared genetic regions contributing to both conditions. The results presented in this chapter validate several previously suggested pleiotropic loci and identify new ones, with about a third showing genetic correlation estimates opposite to the overall correlation.

Chapter 2 focuses on estimating the portability of local polygenic scores (PGS) in cross-ancestry prediction.
The vast majority of genetic data comes from individuals of European ancestry. As a result, many investigators attempt cross-ancestry prediction, utilizing European data to predict the risk of disease/traits among underrepresented non-European ancestries. In most cases, cross-ancestry prediction remains more accurate than within-ancestry prediction due to limitations imposed by non-European sample sizes, but its accuracy is still low. This shortcoming is largely due to differences in allele frequencies and linkage disequilibrium patterns between ancestry groups, as well as genetic-by-environmental interactions involving environmental exposures that are not independent of ancestry. In this study, I propose a method, MC-ANOVA, to estimate the relative accuracy loss in cross-ancestry prediction due to local linkage disequilibrium and allele frequency differences. I implemented the proposed algorithm and developed maps of the relative accuracy of cross-ancestry prediction for four non-European ancestry groups. Furthermore, I developed an interactive R Shiny app that can be used to visualize the results obtained in each portability map. My findings revealed significant variability in the portability of local PGS across genomic regions, reflecting varying degrees of genetic similarity between ancestries across regions. This study highlights the potential for improving cross-ancestry predictions by taking local genetic differences into account.

The advent of big data has had a remarkable impact on PGS prediction accuracy. Sample size affects both the power to detect significant associations between SNPs and phenotypes and the accuracy of SNP effect estimates. For homogeneous populations, PGS prediction accuracy grows monotonically with sample size. However, when using multi-ancestry data, the relative proportion of each ancestry group can greatly impact prediction accuracy.
Therefore, in Chapter 3, using data from individuals of European ancestry from the UK Biobank and African ancestry from All of Us, I investigate how sample size and the relative proportion of each ancestry group influence PGS prediction accuracy. This study sheds light on the relative benefits of increasing within- and across-ancestry sample sizes in cross-ancestry genetic predictions through empirical results, ultimately highlighting the importance of prioritizing the collection of non-European ancestry data.

TABLE OF CONTENTS

INTRODUCTION
  REFERENCES
CHAPTER 1: Local genetic covariance between serum urate and kidney function estimated with Bayesian multitrait models
  REFERENCES
  APPENDIX A: Chapter 1
CHAPTER 2: Mapping the relative accuracy of cross-ancestry prediction
  REFERENCES
  APPENDIX B: Chapter 2
CHAPTER 3: The impact of sample size and the relative proportion of ancestry group on cross-ancestry prediction accuracy
  REFERENCES
  APPENDIX C: Chapter 3
CONCLUSION

INTRODUCTION

In recent years, statistical genetics has obtained unprecedented access to vast datasets such as the UK Biobank, with nearly half a million participants with genotypes (at millions of single-nucleotide polymorphisms [SNPs]) and thousands of phenotypes and disease records. The ever-increasing sample sizes in genetic data have significantly improved the statistical power of genome-wide association studies (GWAS), leading to the publication of thousands of results [1]. However, this increase in data is accompanied by an increase in statistical and computational challenges. While advancements in statistical methods have allowed for the evaluation of complex genome-wide models, as sample sizes grow it becomes increasingly less efficient and feasible to analyze hundreds of thousands of SNPs using standard techniques. Common approaches, such as single-SNP methods, fail to incorporate the linkage disequilibrium (LD) that exists between flanking variants. Additionally, due to variation across the genome, models attempting to use whole-genome information can mask important differences between chromosome segments [2]. In this dissertation, I propose methods to estimate important genetic parameters for short chromosome segments.

In Chapter 1, I propose an approach to map the contribution of short chromosome segments to the correlation between traits. Using this methodology and data from the UK Biobank, I report estimates of the (local) genetic correlation between serum urate and estimated glomerular filtration rate.
The results presented in Chapter 1 validate several previously suggested pleiotropic loci and identify new ones, with about a third showing genetic correlation estimates opposite to the overall correlation.

The accuracy of polygenic score (PGS) prediction for European (EUR)-ancestry individuals has improved with the increased statistical power of larger sample sizes [3,4]. However, the vast overrepresentation of EUR ancestry in GWAS datasets (approximately 80% [5]) leads to poor cross-ancestry prediction accuracy. This is particularly true for more distant ancestry groups, such as African (AF) [5–16]. The poor portability of EUR-derived PGS in cross-ancestry prediction has been primarily attributed to differences in allele frequencies and LD, among other factors [8,9,17]. I hypothesize that, owing to varying levels of LD and allele frequency differences between ancestry groups, the portability of local PGS varies substantially over the genome, with some regions having high portability of SNP effects between ancestries and others exhibiting very poor portability in cross-ancestry prediction. Therefore, in Chapter 2, I propose a methodology to map the portability of local PGS between ancestry groups. The methodology uses a Monte Carlo approach to map both the within-ancestry loss of accuracy (due to imperfect LD between markers and causal loci) and the loss of accuracy in cross-ancestry prediction attributable to differences in allele frequencies and LD between ancestry groups. I used the proposed methodology and data from the UK Biobank to generate maps of the relative accuracy of local PGS in cross-ancestry prediction for several non-EUR ancestry groups.

Finally, in Chapter 3, building on the investigations in cross-ancestry PGS prediction accuracy, I investigate the impact of sample size and of the proportion of data from different ancestry groups on PGS prediction accuracy.
In this study, I used data from individuals of EUR ancestry from the UK Biobank (n~250,000) [18] and AF ancestry data from All of Us (n~50,000) [19]. The results emphasize the importance of investing in the collection of non-EUR data.

REFERENCES

1. Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res. 2023 Jan 6;51(D1):D977–85.
2. Shi H, Mancuso N, Spendlove S, Pasaniuc B. Local Genetic Correlation Gives Insights into the Shared Genetic Architecture of Complex Traits. The American Journal of Human Genetics. 2017 Nov;101(5):737–51.
3. Lello L, Avery SG, Tellier L, Vazquez AI, de los Campos G, Hsu SDH. Accurate Genomic Prediction of Human Height. Genetics. 2018;210(2):477–97.
4. Kim H, Grueneberg A, Vazquez AI, Hsu S, de los Campos G. Will Big Data Close the Missing Heritability Gap? Genetics. 2017;207(3):1135–45.
5. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019 Apr;51(4):584–91.
6. Dikilitas O, Schaid DJ, Kosel ML, Carroll RJ, Chute CG, Denny JA, et al. Predictive Utility of Polygenic Risk Scores for Coronary Heart Disease in Three Major Racial and Ethnic Groups. Am J Hum Genet. 2020 May 7;106(5):707–16.
7. Scutari M, Mackay I, Balding D. Using Genetic Distance to Infer the Accuracy of Genomic Prediction. Hickey JM, editor. PLoS Genet. 2016 Sep 2;12(9):e1006288.
8. Wang Y, Guo J, Ni G, Yang J, Visscher PM, Yengo L. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat Commun. 2020 Jul 31;11(1):3865.
9. Privé F, Aschard H, Carmi S, Folkersen L, Hoggart C, O’Reilly PF, et al. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am J Hum Genet. 2022 Jan 6;109(1):12–23.
10. Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. The American Journal of Human Genetics. 2015 Oct;97(4):576–92.
11. Belsky DW, Moffitt TE, Sugden K, Williams B, Houts R, McCarthy J, et al. Development and evaluation of a genetic risk score for obesity. Biodemography Soc Biol. 2013;59(1):85–100.
12. Domingue BW, Belsky D, Conley D, Harris KM, Boardman JD. Polygenic Influence on Educational Attainment: New evidence from The National Longitudinal Study of Adolescent to Adult Health. AERA Open. 2015;1(3):1–13.
13. Lee JJ, Wedow R, Okbay A, Kong E, Maghzian O, Zacher M, et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat Genet. 2018 Jul 23;50(8):1112–21.
14. Vassos E, Di Forti M, Coleman J, Iyegbe C, Prata D, Euesden J, et al. An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis. Biol Psychiatry. 2017 Mar 15;81(6):470–7.
15. Li Z, Chen J, Yu H, He L, Xu Y, Zhang D, et al. Genome-wide association analysis identifies 30 new susceptibility loci for schizophrenia. Nat Genet. 2017 Nov;49(11):1576–83.
16. Martin AR, Gignoux CR, Walters RK, Wojcik GL, Neale BM, Gravel S, et al. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. The American Journal of Human Genetics. 2017 Apr;100(4):635–49.
17. Lupi AS, Vazquez AI, de los Campos G. Mapping the relative accuracy of cross-ancestry prediction. Nat Commun. 2024; accepted 2024.
18. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018 Oct;562(7726):203–9.
19. The All of Us Research Program Investigators. The “All of Us” Research Program. N Engl J Med. 2019 Aug 15;381(7):668–76.
CHAPTER 1: Local genetic covariance between serum urate and kidney function estimated with Bayesian multitrait models

This chapter is from a published manuscript: Alexa S. Lupi, Nicholas A. Sumpter, Megan P. Leask, Justin O’Sullivan, Tayaza Fadason, Gustavo de los Campos, Tony R. Merriman, Richard J. Reynolds, Ana I. Vazquez. Local genetic covariance between serum urate and kidney function estimated with Bayesian multitrait models. G3 Genes|Genomes|Genetics, Volume 12, Issue 9, September 2022, jkac158, https://doi.org/10.1093/g3journal/jkac158

Abstract

Hyperuricemia (serum urate >6.8 mg/dl) is associated with several cardiometabolic and renal diseases, such as gout and chronic kidney disease. Previous studies have examined the shared genetic basis of chronic kidney disease and hyperuricemia in humans either using single-variant tests or estimating whole-genome genetic correlations between the traits. Individual variants typically explain a small fraction of the genetic correlation between traits, so single-variant tests lack the power to map pleiotropic loci at available sample sizes. Alternatively, whole-genome estimates of genetic correlation indicate a moderate correlation between these traits. While useful to explain the comorbidity of these traits, whole-genome genetic correlation estimates do not shed light on what regions may be implicated in the shared genetic basis of traits. Therefore, to fill the gap between these two approaches, we used local Bayesian multitrait models to estimate the genetic covariance between a marker for chronic kidney disease (estimated glomerular filtration rate) and serum urate in specific genomic regions. We identified 134 overlapping linkage disequilibrium windows with statistically significant covariance estimates, 49 of which had positive directionalities and 85 negative directionalities, the latter being consistent with that of the overall genetic covariance.
The 134 significant windows condensed to 64 genetically distinct shared loci, which validated 17 previously identified shared loci with consistent directionality and revealed 22 novel pleiotropic genes. Finally, to examine potential biological mechanisms for these shared loci, we identified a subset of the genomic windows that are associated with gene expression using colocalization analyses. The regions identified by our local Bayesian multitrait model approach may help explain the association between chronic kidney disease and hyperuricemia.

Introduction

Chronic kidney disease (CKD) carries significant global health and economic burden [1,2]. CKD stages three to five manifest as decreased renal function and are defined by elevated serum creatinine (sCr) or estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m². Hyperuricemia is defined by serum urate (sU) concentration >6.8 mg/dL, and deteriorating renal function contributes to it [3]. Hyperuricemia has several comorbidities associated with it, including CKD and gout [3–5]. Among people with hyperuricemia, there is a higher prevalence of CKD, and among patients with CKD, sU concentrations are higher [6,7].

Genome-wide analyses have demonstrated that the association observed between eGFR and serum urate has a genetic basis. Tin et al. carried out a large-sample trans-ethnic genome-wide association study (GWAS) of sU and, through cross-trait linkage disequilibrium (LD) score regression, obtained an estimate of overall genetic correlation between eGFR and sU of -0.26 (standard error of 0.04) [8]. This was one of the largest negative correlations with sU out of 748 traits analyzed [8].
Reynolds et al., using two large family-based datasets and Bayesian whole-genome regressions, obtained global genetic correlations between sCr (which has a direct inverse relationship to eGFR, hence the directionality difference between the estimates) and sU of 0.20 (95% credibility region (CR): 0.07, 0.33) in one dataset and 0.25 (95% CR: 0.07, 0.41) in the other [9]. While these estimates contribute to dissecting biological causes of the observed comorbidities, the shared pleiotropic genomic regions and underlying biological mechanisms are only reliably discovered by estimating local genetic covariances [10].

GWAS of sU and eGFR have identified numerous loci associated with each phenotype separately. A recent study comparing large GWAS of these traits identified 36 shared loci [11]. However, the GWAS methods used to detect the shared signals are based on the marginal association of individual single-nucleotide polymorphisms (SNPs) with phenotypes, thus not accounting for LD between SNPs. Our method improves over post-analysis of GWAS summary statistics by estimating neighboring SNP effects concomitantly. Incorporating local LD to estimate genetic effects in a tightly segregating chromosomal segment has been previously suggested to account for the correlation between SNPs [12–14]. Additionally, our methodology implements a multi-trait model, so we obtain direct genetic covariance estimates.

In this study, we aimed to characterize the common genetic basis for CKD (eGFR) and hyperuricemia (sU levels) by identifying pleiotropic genomic regions. To achieve this goal, we identified the local regions contributing to genetic variances and covariances across the whole genome [14]. We used Bayesian multi-trait models to estimate the genetic (co)variances. SNP effects were estimated in large DNA regions and genetic variances and covariances were calculated from the posterior means per LD window.
We identified 64 unique local genetic regions with significant local genetic covariance, including previously implicated and novel shared loci.

Materials and Methods

Participants

This study was based on 333,542 Caucasian participants from the UK Biobank. Participants missing serum urate or serum creatinine for both of their two visits were excluded from the analysis. We excluded close relatives using a relatedness threshold of 0.1, with relatedness estimated using the R package BGData [15] (see details in the Supplementary Methods).

Genotypes and phenotypes

The UK Biobank used the custom UK Biobank Axiom™ Array by Affymetrix to genotype study participants [16]. Quality control involved removing SNPs that had a minor allele frequency less than 1% or a missing call rate greater than 5%, resulting in 607,490 autosomal (chromosomes 1-22) SNPs [17]. Serum urate and sCr data were obtained from the first visit. For the small number of participants (0.28%) that did not have phenotype data of interest collected at the first visit, we retrieved data from the second visit. sCr was used to define eGFR; details can be found in the Supplementary Methods. For both eGFR and sU, we took a log transformation to normalize their distributions and preadjusted by age, sex, and the first five SNP-derived principal components using ordinary least squares.

Local Bayesian multi-trait models

We estimated local (co)variances by fitting Bayesian models to chromosomal segments with a non-overlapping core of 1,000 contiguous SNPs (between 3-4 Mbp depending on the region). We included two overlapping flanking regions, each consisting of 250 SNPs, one to each side of the core. The SNPs in the flanking regions were included to account for the effects of SNPs that were outside of the core region but possibly in LD with SNPs in the core segment. Whole-genome regressions have been used to fit several markers concomitantly (e.g., Vazquez et al. [18]). However, biobank data imposes computational restrictions due to its large dimensions.
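As a rough illustration of the segment layout described above, the following sketch enumerates non-overlapping cores of 1,000 SNPs together with their 250-SNP flanking buffers. It assumes SNPs are indexed from 0 within a single chromosome and that buffers are simply truncated at the chromosome ends; the function name `core_segments` and the truncation rule are assumptions made for this example, not details taken from the analysis.

```python
def core_segments(n_snps, core=1000, flank=250):
    """Return (fit_start, core_start, core_end, fit_end) tuples of half-open SNP indices.

    The model for a segment would be fitted to SNPs fit_start..fit_end,
    while only SNPs core_start..core_end belong to the non-overlapping core.
    """
    segments = []
    for core_start in range(0, n_snps, core):
        core_end = min(core_start + core, n_snps)
        fit_start = max(0, core_start - flank)    # left buffer, truncated at the start
        fit_end = min(n_snps, core_end + flank)   # right buffer, truncated at the end
        segments.append((fit_start, core_start, core_end, fit_end))
    return segments


# Toy chromosome with 2,500 SNPs: three cores, the last one shorter.
segs = core_segments(2500)
for seg in segs:
    print(seg)
```

Under these assumptions an interior segment covers 1,500 SNPs (1,000 core plus 250 on each side), which matches the dimension of the genotype matrix used in the model.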
In the context of a single trait, local Bayesian conditional regressions have been employed to deal with the computational burden (Funkhouser et al. [14]). In their study, the authors investigated sex differences in genetic effects in single-trait models. Here, we utilized the idea of conditional regressions in large chunks of DNA with flanking regions in the context of a multi-trait Bayesian model. This provides posterior estimates of variances and covariances between traits to find pleiotropic regions.

The linear model used had the form $\mathbf{Y} = \mathbf{1}\boldsymbol{\mu}' + \mathbf{X}\boldsymbol{\beta} + \mathbf{E}$, where $\mathbf{Y}_{n \times 2}$ is a matrix containing the pre-adjusted phenotypes, $\boldsymbol{\mu}_{2 \times 1}$ is a vector of trait-specific intercepts, $\mathbf{X}_{n \times 1500}$ is a SNP-genotype matrix (1,000 core SNPs plus 250 flanking SNPs to each side), $\boldsymbol{\beta}_{1500 \times 2}$ is a matrix of SNP effects, and $\mathbf{E}_{n \times 2}$ is a matrix of error terms. The error terms were assumed to be IID multivariate normal with a mean of zero and covariance $\mathrm{Var}(\boldsymbol{\varepsilon}_i) = \mathbf{R}_{2 \times 2}$, where $\boldsymbol{\varepsilon}_i$ is the $i$th row of $\mathbf{E}$. We used IID priors with a point mass at zero and a bivariate Gaussian slab with a mean of zero and (co)variance matrix $\boldsymbol{\Sigma}_{2 \times 2}$. The extent of shrinkage and variable selection was influenced by three groups of parameters: $\mathbf{R}$, $\boldsymbol{\Sigma}$, and the prior proportion of non-zero effects, $\boldsymbol{\pi}$. For a two-trait model, $\boldsymbol{\pi} = \{\pi_1, \pi_2\}$ represents the prior probability of non-zero effects for traits 1 and 2 (sU and eGFR), respectively. We treated the $\{\mathbf{R}, \boldsymbol{\Sigma}, \boldsymbol{\pi}\}$ parameters as unknown and assigned Inverse-Wishart priors to the (co)variance matrices and Beta priors to the prior probabilities of non-zero effects. We used the Multitrait function from the BGLR R package, available on CRAN [19], to generate 5,000 samples from the posterior distribution for each chromosomal segment. We filtered the samples of the SNP effects using a burn-in of 250 samples and a thinning interval of 10, thus retaining 475 samples for further inference.
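The count of retained samples follows directly from the burn-in and thinning settings. The short sketch below reproduces that arithmetic; the helper name `retained_iterations` and the 0-based indexing are assumptions made for illustration, and the sampler used in the study may index its saved draws differently.

```python
def retained_iterations(n_samples, burn_in, thin):
    """Indices (0-based) of the draws kept after discarding the burn-in and thinning."""
    return list(range(burn_in, n_samples, thin))


# 5,000 draws, the first 250 discarded, then every 10th draw kept.
kept = retained_iterations(n_samples=5000, burn_in=250, thin=10)
print(len(kept))  # (5000 - 250) / 10 = 475
```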
Defining local LD-based windows

After we obtained the model estimates, for each core segment SNP we defined an LD window that contained correlated, neighboring SNPs using an overlapping sliding technique [13,14]. Within each LD window, we collected the corresponding estimated effects and computed (co)variance estimates (described below). For each seed SNP $x_{ij}$ ($i = 1, \ldots, n$ individuals and $j = 1, \ldots, p$ core segment SNPs) coming from the core segment, we sequentially identified SNPs in both directions ($x_{ij^*}$) surrounding the seed SNP and included them in window $j$ if $\mathrm{Corr}(x_{ij}, x_{ij^*}) \geq 0.1$. In a simplified example, if SNP $x_{ij}$ had an adequate pairwise correlation with 2 SNPs to the left and 1 SNP to the right, the window for that SNP would be defined as the set of SNPs $\{x_{i,j-2}, x_{i,j-1}, x_{ij}, x_{i,j+1}\}$. That is, $\mathrm{Corr}(x_{ij}, x_{i,j-1}) \geq 0.1$, $\mathrm{Corr}(x_{ij}, x_{i,j-2}) \geq 0.1$, and $\mathrm{Corr}(x_{ij}, x_{i,j+1}) \geq 0.1$. Our definition of an LD sliding window also allowed one SNP in the sequential process to fail this correlation criterion, to accommodate a brief loss of LD or minor mapping errors; such a SNP was still included in the LD window. In the previous example, if $\mathrm{Corr}(x_{ij}, x_{i,j-1}) < 0.1$ and $\mathrm{Corr}(x_{ij}, x_{i,j-2}) \geq 0.1$, then the set would still include both $x_{i,j-2}$ and $x_{i,j-1}$. The LD window ended when two consecutive SNPs did not meet the criterion described above. The LD windows could include flanking buffer SNPs, but buffer SNPs were never used as seeds to define an LD window.

Local (co)variances

For each LD window, we computed the local variances for traits 1 and 2 and the local covariances using $V_{w1s} = \mathrm{Var}(\mathbf{X}_w \boldsymbol{\beta}_{w1s})$, $V_{w2s} = \mathrm{Var}(\mathbf{X}_w \boldsymbol{\beta}_{w2s})$, and $\mathrm{Cov}_{ws} = \mathrm{Cov}(\mathbf{X}_w \boldsymbol{\beta}_{w1s}, \mathbf{X}_w \boldsymbol{\beta}_{w2s})$. Here, $\mathbf{X}_w$ is the matrix containing the genotypes of the SNPs in the $w$th window, and $\boldsymbol{\beta}_{w1s}$ and $\boldsymbol{\beta}_{w2s}$ are the samples of effects of those SNPs for traits 1 and 2 collected at the $s$th iteration of the sampler. This generated samples from the posterior distribution of the local (co)variances, which we used to produce posterior mean estimates (by averaging across the samples from the posterior distribution), estimate posterior standard deviations, and obtain 95% posterior CRs. As discussed in Lehermeier et al. [20], this approach accounts for the contribution of local LD to genetic (co)variances and, by averaging over samples from the posterior distribution, for uncertainty about SNP effects.

Gene expression/eQTL analysis

A colocalization analysis was performed between GWAS significant markers for sU and sCr and the publicly available eQTL data from GTEx V8 [21]. The R package COLOC was used, which implements a Bayesian test that analyzes a single genomic region and identifies LD patterns in that locus using SNP summary statistics and the associated minor allele frequencies. The lead variant for both sCr and sU was used at each significant covariance window with a surrounding 500 kb buffer in the GTEx database. The Contextualizing Developmental SNPs using 3D Information algorithm [22,23] was modified to identify long-distance regulatory relationships for the lead sU and sCr variants at each significant covariance window within a 500 kb region. eQTL data for variants +/- 500 kb of the lead variant were also extracted from GTEx, and COLOC was then used to assess whether the significant cis- and trans-eQTL identified were colocalized with sCr and sU signals. An eQTL was determined to be colocalized if the COLOC H4 (posterior probability of colocalization (PPC)) was at least 0.5 for both traits and at least 0.8 for one of the two traits, according to Giambartolomei et al. [21].
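The LD-window rule and the local (co)variance computation described in the two preceding subsections can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code used in the study: it reads the tolerance rule as "an isolated failing SNP is kept, two consecutive failures end the window", applies the threshold to the signed Pearson correlation exactly as written in the text, and uses sample (n - 1) denominators. The function names `ld_window` and `local_cov_samples` are hypothetical.

```python
from statistics import mean


def corr(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / (sxx * syy) ** 0.5


def ld_window(snps, seed, threshold=0.1):
    """snps: list of genotype columns (one list per SNP). Returns sorted window indices."""
    window = [seed]
    for step in (-1, 1):                  # extend to the left, then to the right
        skipped = []                      # SNPs that failed since the last passing SNP
        j = seed + step
        while 0 <= j < len(snps) and len(skipped) < 2:
            if corr(snps[seed], snps[j]) >= threshold:
                window += skipped + [j]   # an isolated failing SNP is kept
                skipped = []
            else:
                skipped.append(j)         # two failures in a row end the window
            j += step
    return sorted(window)


def cov(x, y):
    """Sample covariance across individuals."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def local_cov_samples(X_w, beta1, beta2):
    """X_w: rows are individuals, columns are window SNPs.
    beta1, beta2: one effect vector per retained iteration, for traits 1 and 2."""
    draws = []
    for b1, b2 in zip(beta1, beta2):
        g1 = [sum(x * b for x, b in zip(row, b1)) for row in X_w]  # X_w beta_w1s
        g2 = [sum(x * b for x, b in zip(row, b2)) for row in X_w]  # X_w beta_w2s
        draws.append(cov(g1, g2))
    return draws


# Toy data: "ld" columns copy the seed, "flat" columns are uncorrelated with it.
ld, flat = [0, 1, 2, 0, 1, 2], [1, 1, 1, 1, 1, 1]
window = ld_window([flat, ld, flat, ld, ld, flat, flat, ld], seed=3)

X_w = [[0, 1], [1, 0], [2, 1], [1, 2]]
draws = local_cov_samples(X_w, [[1.0, 0.0], [1.0, 0.0]], [[-1.0, 0.0], [1.0, 0.0]])
posterior_mean = mean(draws)              # average over retained iterations
```

In the toy example the SNP immediately left of the seed fails the criterion but is retained because the next SNP passes, mirroring the worked example in the text. A 95% CR would be obtained from the 2.5% and 97.5% quantiles of `draws` once a realistic number of iterations is available.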
Validation

We performed a validation analysis with the related Caucasian UK Biobank cohort, consisting of 57,370 subjects not missing sU or eGFR phenotypes. The genotyping array used for this cohort is the same as that used for the discovery analysis cohort. The validation analysis repeated the estimation procedures described above, and the sliding LD windows used were identical to those used in the discovery set.

Results

This study was based on 333,542 distantly related white participants, of whom 53.7% were female, with an average age of 56.9 ± 8.0 years. The average sCr level was 0.8 ± 0.2 mg/dL (the average ± standard error), the average eGFR was 144.2 ± 56.0 ml/min/1.73 m², and the average sU level was 5.2 ± 1.3 mg/dL. Two percent (2.0%) of the individuals had an ICD10 diagnosis or self-diagnosis of gout, 12.4% had hyperuricemia, 0.5% had CKD, and 0.3% had hyperuricemia and CKD.

We analyzed the markers (sU and eGFR) using a sequence of Bayesian multi-trait models where the markers were regressed on contiguous SNPs in a large chromosomal segment (core) plus overlapping flanking buffers. We collected the samples from the posterior distribution of effects for each core segment and used these samples to estimate the local variances for each marker (Figure 1) and the local covariances between the markers (Figure 2). The (co)variances were estimated within 511,828 overlapping LD windows (small, non-independent contiguous chromosomal regions).

Figure 1: The variance estimates of overlapping LD windows. (a) Variance estimates multiplied by 1E4 for sU concentrations and (b) for eGFR.

Figure 2: The covariance estimates of overlapping LD windows. Windows are selectively annotated with the gene name of the mid-point SNP of that window.
Windows that contained SNPs in loci associated with known eGFR genes are highlighted in dark green, windows that contained SNPs in genes associated with sU are highlighted in blue, and windows that contained SNPs in genes associated with both sU and eGFR (from comparing GWAS, Leask et al., 2020 [11]) are highlighted in bright green. Windows significant for genetic covariance are highlighted in red. The covariance estimates were multiplied by 1E4.

We found 134 LD windows with covariance estimates that had a 95% CR excluding zero (Figure 2; Table A1). The number of SNPs in the significant LD windows ranged from one to 56, and the median number of SNPs per window was 6.0 (22 kbp on average, excluding 12 single-SNP windows). Interestingly, although the global correlation between sU and eGFR is negative [8,9], 49 of the 134 significant windows showed positive genetic covariance directionality, and the remaining 85 were negative.

The 134 significant LD windows often included the same variants and mapped to identical GWAS loci, so we collapsed the 134 windows to 64 unique loci that possessed genetic covariance signal between eGFR and sU (Table A2 and Supplementary Methods). The top 25 distinct loci implicated by the significant windows in terms of covariance magnitude are listed in Table 1. A graphical representation of the top significant loci is presented in Figure 3.

Figure 3: The top 25 shared loci and their covariance estimates with corresponding 95% CRs. The top 25 distinct loci from LD genomic regions with CRs not including zero. The window size indicates the number of SNPs in each window. The covariance estimates and CRs were multiplied by 1E4.

Table 1: The top 25 magnitude genomic windows significant for covariance between sU and eGFR with their chromosome, annotated gene name, number of SNPs and first and last SNP names, estimated covariance [95% CR], and colocalized genes.
Chromosome | Annotated Gene Name | Number of SNPs in the Window | First to Last SNP | Estimated Covariance [95% CR]a
2 | CPS1 | 1 | rs1047891 | 6.42 [5.45, 7.65]
2 | LRP2 | 6 | rs41268683-rs2075252 | 4.58 [2.61, 6.4]
2 | NRBP1/IFT172/FNDC4/GCKR | 16 | Affx-19857019-rs1260333 | 10.3 [8.43, 12]
6 | SLC17A1/SLC17A3/SLC17A2 | 56 | rs1165196-rs9467632 | 4.87 [0.863, 8.61]
10 | A1CF | 7 | rs12413118-rs61856594 | 4.64 [3.74, 5.66]
17 | BCAS3 | 7 | rs9904048-rs9895661 | 2.34 [1.38, 3.19]
19 | SLC7A9/CEP89 | 16 | rs78676942-rs11668957 | 3.84 [1.85, 5.2]
2 | LOC105373585 | 7 | rs11122800-rs35932591 | -4.19 [-5.58, -2.57]
2 | HOXD13/HOXD12/HOXD10 | 5 | rs847153-rs711818 | -2.86 [-4.14, -1.84]
2 | KCNS3 | 7 | rs9789415-rs11688124 | -2.42 [-3.19, -1.59]
3 | SLC15A2/ILDR1 | 9 | rs2049330-rs6438689 | -2.02 [-3.12, -1.03]
6 | VEGFA | 1 | rs881858 | -6.85 [-8.61, -5.48]
6 | TTBK1/SLC22A7/CRIP3 | 20 | rs2651206-rs2242416 | -2.24 [-3.31, -1.27]
7 | UNCX | 13 | rs6950388-rs1880301 | -6.94 [-8.56, -5.18]
7 | LOC730338 | 5 | rs700752-rs12537178 | -2.31 [-3.89, -0.944]
8 | STC1 | 6 | rs62502212-rs1705690 | -5.83 [-7.38, -4.46]
11 | OVOL1 | 7 | rs4014195-rs36008241 | -5.59 [-8.13, -3.29]
11 | DCDC1 | 10 | rs963837-rs10767873 | -12.7 [-14.9, -10.7]
12 | R3HDM2/INHBC/INHBE | 7 | rs73115999-rs507562 | -5.13 [-6.49, -3.72]
13 | DACH1 | 5 | rs7981995-rs626277 | -1.98 [-2.73, -1.39]
15 | NRG4 | 1 | rs8024155 | -2.82 [-4.29, -1.42]
15 | IGF1R | 4 | rs907808-rs12437561 | -2.68 [-3.75, -1.52]
16 | UMOD/PDILT | 9 | rs1123670-rs12917707 | -2.52 [-3.77, -1.32]
16 | LOC105371257 | 1 | rs12927956 | -2.25 [-3.24, -1.5]
20 | CYP24A1 | 4 | rs4809954-rs2616278 | -2.12 [-2.9, -1.24]

Colocalized Genes: NRBP1 A1CF CRHBP, SH3GL2 SLC7A9, CLDND2 SLC15A2, CD86 SETD1A SETD1A PALM2, PSMD11 RP11-38H17.1 PCNX3, MAP3K11, SCYL1, RP-11-770G2.2, OVOL1, KRT8P26 KMT2A, R3HDM2, SFXN5 MAN2C1, PARD3 IGF1R, NRCAM, TRAPPC10 ACSM1, DNAH3

a Estimates and CRs were multiplied by 1E4 for readability.
Gene expression/eQTL analysis

We used COLOC21 and expression data from the Genotype-Tissue Expression (GTEx) project (v8)24 to identify candidate causal genes at significant local genetic covariance windows between sU and eGFR. Twenty-six of the 64 distinct significant shared loci (40.6%) were shown to modify the expression of candidate causal genes colocalized with the covariance signals (Table A3). Of note are TRIM6 and L3MBTL3 in cis, which are genes that have a significant covariance signal and a colocalized eQTL that is expressed in the kidney.

Validation

In the related white UK Biobank validation cohort, twelve LD windows were significant for genetic covariance between sU and eGFR (Table A1). All twelve were also significant in the main analysis, with consistent directionality. The twelve windows condensed to five distinct loci (Table A2), meaning that five of the 64 significant distinct loci from the main analysis were also significant in this validation. The related cohort (n=57,370) is 82.8% smaller than the unrelated cohort used in the discovery set (n=333,542), so our validation analysis was underpowered relative to the main analysis.

Discussion

The goal of this study was to infer the shared genetic architecture of sU (causal for gout) and eGFR (a marker for CKD). Our results highlight genes that may be involved in the observed relationship between the traits. In this study, we estimated local genetic (co)variances between sU and eGFR and identified regions with pleiotropy. This study was based on the large-scale UK Biobank and formal statistical inference from local Bayesian multi-trait models. Our results demonstrated that genetic covariance between eGFR and sU was widespread across the genome. Our method identified 64 distinct LD windows with shared genetic effects between eGFR and sU, the majority of which had negative genetic covariance estimates.
We identified 22 distinct shared loci that are, to our knowledge, novel, with significant local genetic covariance for sU and eGFR, including MMP11/SMARCB1, ADH1B, MIP/GLS2, ENG/AK1, EPB41L5, KIAA1199, CELSR2, SOS2, KCNS3, TET2, SMLR1/EPB41L2, GLIS1, KIAA1683/JUND, and METTL10/FAM175B. Furthermore, 14 distinct loci identified were previously known to be associated with only one of the two traits, demonstrating that the set of loci contributing to both traits is substantially larger than previously thought. These loci are partially responsible for the comorbidity between hyperuricemia/gout and CKD.

One advantage of the local method that we present here is that it facilitates the identification of genomic windows with opposite signs to the overall negative genetic correlation between eGFR and sU. Of the significant shared loci, about two-thirds showed negative local genetic covariance estimates. This is consistent with the overall genetic covariance directionality8,9, indicating that they either contribute to worsening kidney function (decreasing eGFR or increasing sCr) and increasing sU, or vice versa. Interestingly, there were 21 distinct significant shared loci with positive local genetic covariance estimates (about one-third). Positive covariance indicates that the genomic region contributes either to increasing sU and improved kidney function or to decreasing sU and worsening kidney function. Two of the loci with a significant positive signal, GCKR and CPS1, are mainly expressed in the liver, and one, LRP2, is mainly expressed in the kidney24. One novel shared locus identified in this study consisted of the genes SLC17A1, SLC17A3, and SLC17A2. This large window on chromosome six (56 SNPs, Table 1) had a strong, significant positive covariance signal, and SLC17A1 and SLC17A3 are urate transporters that have both been linked to gout25.
The opposite signs of locus-specific genetic covariances are indicative of distinct physiological processes governing the phenotypic expression of urate and eGFR. The loci with positive covariance in particular are excellent candidates for discovering functional mechanisms that simultaneously increase sU and improve kidney function.

Urate transporters SLC2A9 and ABCG2 have the largest GWAS effect sizes for sU, accounting for 4-5% of the variance in sU8,26–29. However, no windows in SLC2A9 or ABCG2 had a 95% CR for local genetic covariance that excluded zero. Our results suggest that windows in both the SLC2A9 and ABCG2 loci are associated with sU levels only and are not pleiotropic regions for sU and eGFR. A similar phenomenon is observed with the eGFR gene SHROOM3: none of the windows containing SNPs in SHROOM3 were significant for local genetic covariance. This illustrates that the loci driving the genetic correlation between these two traits are not necessarily the leading GWAS hits.

Previous research investigating pleiotropic genetic loci between serum urate and eGFR has implicated loci as shared if signals of association obtained from marginal single-marker regressions (e.g., GWAS) for both traits are colocalized11. Leask et al.11 recently compared overlapping loci between two large GWAS, one of sU and the other of kidney function8,30, and found 36 independent colocalized loci. Our results validate 20 of these 36 loci, and all but three loci (DACH1, CPS1, and INS-IGF2) had covariance directionality that matched the directionality of effects found by Leask et al.11.

Our covariance approach may have direct implications for assessing causal relationships between exposures using Mendelian randomization (MR). Pleiotropic genetic variants violate the assumptions of univariate MR; however, they are useful in multivariable MR, which can simultaneously assess the causal effects of multiple risk factors on an outcome31.
For example, genetic variants from SLC2A9 and ABCG2 may be valid instrumental variables to use in MR to test for a causal effect of sU on CKD, whereas the loci listed in Table A1 would not. In fact, SLC22A11 has previously been identified as a pleiotropic variant that may improve kidney function through its activity in raising urate levels28. MR has previously been used to show that serum urate is not causal for CKD32; however, Jordan et al. noted significant pleiotropy in the genetic variants used in their study, which they attempted to counter using MR techniques robust to pleiotropy. Of the 26 SNPs used by Jordan et al., rs1260326 (GCKR) and rs17050272 (LINC01101) were identified by us as shared, and rs1165151 and rs3741414 were located within one of our significant pleiotropic regions but were not in our genotyping platform.

Our eQTL analysis of the windows significant for local genetic covariance uncovered numerous genes of interest, such as SLC7A9, which encodes a solute transporter largely expressed in the small intestine; A1CF, which encodes a protein involved in apolipoprotein B synthesis in the liver; and TRIM6, which encodes an E3 ubiquitin ligase involved in interferon gamma signaling and innate immune response with high expression levels in the kidney24. The genes uncovered from the eQTL analysis will be particularly interesting for future study, as they will likely aid our understanding of the relationship between kidney function and sU.

Through our approach of obtaining local genetic (co)variance estimates from Bayesian multi-trait models in very large datasets, we have uncovered twenty-two novel shared genetic regions for sU and eGFR. The approach presented in this paper was applied in the context of sU and eGFR, but it could be applied to any pair of traits. While our discovery set sample size is excellent, we lack a dataset of a similar size for the validation. Some, but not all, regions were validated.
The local shared genomic regions we have uncovered in this study can provide insight into the relationship between hyperuricemia/gout and CKD, elucidating the biological mechanisms underlying the traits. This will help further understanding of the genetic basis of hyperuricemia/gout and CKD.

Data Availability

All data used are secondary and are held in public repositories. This study utilized deidentified data from the UK Biobank, where genotype and phenotype data are available to researchers upon registration. The protocol and consent were approved by the UK Biobank’s Research Ethics Committee, and the study was conducted under application number 15326. For eQTL analysis, cis- and trans-eQTL data were downloaded from the GTEx V8 portal (Carithers and Moore 2015). Supplemental material is available at G3 online. UK Biobank: https://www.ukbiobank.ac.uk/.

REFERENCES

1. Bikbov, B. et al. Global, regional, and national burden of chronic kidney disease, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet 395, 709–733 (2020).
2. Hill, N. R. et al. Global Prevalence of Chronic Kidney Disease – A Systematic Review and Meta-Analysis. PLoS ONE 11, e0158765 (2016).
3. Sun, M. et al. Untangling the complex relationships between incident gout risk, serum urate, and its comorbidities. Arthritis Res Ther 20, 90 (2018).
4. Singh, G., Lingala, B. & Mithal, A. Gout and hyperuricaemia in the USA: prevalence and trends. Rheumatology (Oxford) 58, 2177–2180 (2019).
5. Clarson, L. E. et al. Increased risk of vascular disease associated with gout: a retrospective, matched cohort study in the UK clinical practice research datalink. Ann Rheum Dis 74, 642–647 (2015).
6. Jing, J. et al. Genetics of serum urate concentrations and gout in a high-risk population, patients with chronic kidney disease. Sci Rep 8, 13184 (2018).
7. Zhu, Y., Pandya, B. J. & Choi, H. K. Comorbidities of Gout and Hyperuricemia in the US General Population: NHANES 2007-2008. The American Journal of Medicine 125, 679-687.e1 (2012).
8. Tin, A. et al. Target genes, variants, tissues and transcriptional pathways influencing human serum urate levels. Nat Genet 51, 1459–1474 (2019).
9. Reynolds, R. J. et al. Genetic correlations between traits associated with hyperuricemia, gout, and comorbidities. Eur J Hum Genet (2021) doi:10.1038/s41431-021-00830-z.
10. Shi, H., Mancuso, N., Spendlove, S. & Pasaniuc, B. Local Genetic Correlation Gives Insights into the Shared Genetic Architecture of Complex Traits. The American Journal of Human Genetics 101, 737–751 (2017).
11. Leask, M. P. et al. The Shared Genetic Basis of Hyperuricemia, Gout, and Kidney Function. Seminars in Nephrology 40, 586–599 (2020).
12. Vilhjálmsson, B. J. et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. The American Journal of Human Genetics 97, 576–592 (2015).
13. Fernando, R., Toosi, A., Wolc, A., Garrick, D. & Dekkers, J. Application of Whole-Genome Prediction Methods for Genome-Wide Association Studies: A Bayesian Approach. JABES 22, 172–193 (2017).
14. Funkhouser, S. A., Vazquez, A. I., Steibel, J. P., Ernst, C. W. & los Campos, G. de. Deciphering Sex-Specific Genetic Architectures Using Local Bayesian Regressions. Genetics 215, 231–241 (2020).
15. Grueneberg, A. & de los Campos, G. BGData - A Suite of R Packages for Genomic Analysis with Big Data. G3 9, 1377–1383 (2019).
16. Affymetrix. Genetic data: Detailed genetic data on half a million people. http://www.ukbiobank.ac.uk/scientists-3/uk-biobank-axiom-array/ (2021).
17. Kim, H., Grueneberg, A., Vazquez, A. I., Hsu, S. & de los Campos, G. Will Big Data Close the Missing Heritability Gap? Genetics 207, 1135–1145 (2017).
18. Vazquez, A. I. et al. A comprehensive genetic approach for improving prediction of skin cancer risk in humans. Genetics 192, 1493–1502 (2012).
19. Pérez, P. & de los Campos, G. Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495 (2014).
20. Lehermeier, C., de Los Campos, G., Wimmer, V. & Schön, C.-C. Genomic variance estimates: With or without disequilibrium covariances? J Anim Breed Genet 134, 232–241 (2017).
21. Giambartolomei, C. et al. Bayesian Test for Colocalisation between Pairs of Genetic Association Studies Using Summary Statistics. PLoS Genet 10, e1004383 (2014).
22. Fadason, T., Schierding, W., Lumley, T. & O’Sullivan, J. M. Chromatin interactions and expression quantitative trait loci reveal genetic drivers of multimorbidities. Nat Commun 9, 5198 (2018).
23. Genome3d/codes3d-v2. Genome3d (2019).
24. Carithers, L. J. & Moore, H. M. The Genotype-Tissue Expression (GTEx) Project. Biopreservation and Biobanking 13, 307–308 (2015).
25. Reimer, R. J. SLC17: a functionally diverse family of organic anion transporters. Mol Aspects Med 34, 350–359 (2013).
26. Johnson, R. J. et al. Hyperuricemia, Acute and Chronic Kidney Disease, Hypertension, and Cardiovascular Disease: Report of a Scientific Workshop Organized by the National Kidney Foundation. Am. J. Kidney Dis. 71, 851–865 (2018).
27. Major, T. J., Dalbeth, N., Stahl, E. A. & Merriman, T. R. An update on the genetics of hyperuricaemia and gout. Nat Rev Rheumatol 14, 341–353 (2018).
28. Hughes, K., Flynn, T., de Zoysa, J., Dalbeth, N. & Merriman, T. R. Mendelian randomization analysis associates increased serum urate, due to genetic variation in uric acid transporters, with improved renal function. Kidney Int. 85, 344–351 (2014).
29. Yang, Q. et al. Multiple genetic loci influence serum urate levels and their relationship with gout and cardiovascular disease risk factors. Circ Cardiovasc Genet 3, 523–530 (2010).
30. Wuttke, M. & Köttgen, A. Insights into kidney diseases from genome-wide association studies. Nat Rev Nephrol 12, 549–562 (2016).
31. Burgess, S. & Thompson, S. G. Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects. Am J Epidemiol 181, 251–260 (2015).
32. Jordan, D. M. et al. No causal effects of serum urate levels on the risk of chronic kidney disease: A Mendelian randomization study. PLoS Med. 16, e1002725 (2019).
33. Levey, A. S. et al. A new equation to estimate glomerular filtration rate. Ann. Intern. Med. 150, 604–612 (2009).
34. Coresh, J. et al. Calibration and random variation of the serum creatinine assay as critical elements of using equations to estimate glomerular filtration rate. American Journal of Kidney Diseases 39, 920–929 (2002).

APPENDIX A: Chapter 1

Supplementary Tables A1-A3 for Chapter 1 can be found with the publication: https://academic.oup.com/g3journal/article/12/9/jkac158/6649732#supplementary-data.

Supplementary Methods for Lupi et al. 2022

Identification of distantly related samples

We used the R package BGData15 to compute the expected proportion of allele sharing among UK Biobank individuals with the additive genomic relationship matrix G,

$\mathbf{G} = \dfrac{\mathbf{Z}\mathbf{Z}'}{\operatorname{tr}(\mathbf{Z}\mathbf{Z}')/n}$,

where Z is a matrix of centered genotypes. That is, $Z_{ij} = x_{ij} - 2p_j$, where $x_{ij}$ is the number of copies of the reference allele at the jth locus of the ith individual and $p_j$ is the frequency of the reference allele at the jth locus. In a homogeneous sample, the off-diagonal entry $g_{ij}$ of G (where i ≠ j, and i and j now both index subjects) can be considered an estimate of the relatedness between subjects i and j. If $g_{ij} \geq 0.1$, they were excluded from the sample.

Phenotypes

eGFR is an indicator of renal function and was used to ascertain CKD. In this study, we defined eGFR using the abbreviated Modification of Diet in Renal Disease (MDRD) equation, which uses fewer variables than others yet performs just as well33, with a modification to include a calibration factor to correct for the variability of sCr measures across laboratories and time34:

$\mathrm{eGFR} = 186.3 \times (\mathrm{sCr} - 0.24)^{-1.154} \times \mathrm{Age}^{-0.203} \times (0.742 \text{ if Female})$.
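To make the exponents in the eGFR equation above concrete, the calculation can be sketched in Python. This is an illustrative helper, not the code used in the study: the function name `egfr_mdrd`, its argument names, and the units (sCr in mg/dL, age in years) are assumptions introduced here.

```python
def egfr_mdrd(scr, age, female):
    """Abbreviated MDRD eGFR with the 0.24 sCr calibration offset.

    Assumptions (not stated in the text): scr is serum creatinine in mg/dL
    and age is in years.
    """
    if scr <= 0.24:
        raise ValueError("calibrated sCr must be positive (scr > 0.24)")
    if age <= 0:
        raise ValueError("age must be positive")
    # 186.3 x (sCr - 0.24)^-1.154 x Age^-0.203
    value = 186.3 * (scr - 0.24) ** -1.154 * age ** -0.203
    # The 0.742 factor applies to female participants only.
    return value * 0.742 if female else value


if __name__ == "__main__":
    # Hypothetical participant: sCr = 1.24 mg/dL, 50 years old.
    print(round(egfr_mdrd(1.24, 50, female=False), 1))
    print(round(egfr_mdrd(1.24, 50, female=True), 1))
```

Because both exponents are negative, the estimate falls as either serum creatinine or age increases.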
Defining distinct loci

We condensed our 134 significant windows to 64 distinct, non-overlapping regions. To determine which significant window would represent each region, we first checked whether a window’s base pair position overlapped with that of a neighboring window. If the windows overlapped, we kept whichever window had the most SNPs. If the number of SNPs in the windows was equal, we kept the first of the two. This iterative process ended once there were no overlapping neighboring significant windows.

CHAPTER 2: Mapping the relative accuracy of cross-ancestry prediction

This chapter is from a manuscript accepted for publication: Alexa S. Lupi, Ana I. Vazquez, Gustavo de los Campos. Mapping the Relative Accuracy of Cross-Ancestry Prediction. Accepted for publication in Nature Communications, 09/11/2024.

Abstract

The overwhelming majority of participants in genome-wide association studies (GWAS) have European (EUR) ancestry, and polygenic scores (PGS) derived from EURs often perform poorly in non-EURs. Previous studies suggest that between-ancestry differences in allele frequencies and linkage disequilibrium are significant contributors to the poor portability of PGS in cross-ancestry prediction. We hypothesize that the portability of (local) PGS varies significantly over the genome. Therefore, we develop a method, MC-ANOVA, to estimate the loss of accuracy in cross-ancestry prediction attributable to allele frequency and linkage disequilibrium differences between ancestries. Using data from the UK Biobank, we develop PGS relative accuracy (RA) maps quantifying the local portability of EUR-derived PGS in non-EUR ancestries. We report substantial variability in RA along the genome, suggesting that even in ancestries with low overall RA of EUR-derived effects (e.g., African), there are regions with high RA.
We substantiate our findings using six complex traits, which show that EUR-derived effects from regions where MC-ANOVA predicts high RA also have high empirical RA in real PGS. We provide software implementing MC-ANOVA and RA maps for several non-EUR ancestries. These maps can be used to interpret similarities and differences in GWAS results between groups and to improve cross-ancestry prediction.

Introduction

In the last fifteen years, thousands of genome-wide association studies (GWAS) have been published1. Increasingly, single nucleotide polymorphisms (SNPs) that these studies reported to be associated with specific phenotypes or disease outcomes are used to build polygenic scores (PGS). The availability of biobank-sized data has led to unprecedented improvements in PGS prediction accuracy2,3. However, the overwhelming majority of participants in GWAS (approximately 80%) are of European (EUR) descent4, leading to issues with generalizability and exacerbating existing health disparities. Consistently, studies across various traits/diseases and target ancestry groups have shown that PGS derived with data from EURs have poor predictive performance when used to predict among individuals of non-EUR ancestry (African [AF] in particular)4–15.

Several factors can contribute to the poor portability of PGS across ancestries. At causal loci, unaccounted gene-by-gene (GG) and genetic-by-environment (GE) interactions can lead to ancestry differences in the additive effects of causal alleles. Furthermore, differences across ancestry groups in allele frequencies and linkage disequilibrium (LD) patterns can lead to heterogeneity in marker effects even for loci without such heterogeneity at causal loci16. The relative contribution of GG, GE, allele frequency differences, and LD differences to the poor portability of PGS remains largely unknown and can be expected to vary across traits and ancestries.
However, several studies suggest that allele frequency and LD differences between ancestries are significant factors contributing to the poor portability of PGS, possibly explaining up to 75% of the empirical loss of accuracy (LOA) in cross-ancestry prediction7,8,17,18.

Many studies have investigated the portability of PGS across ancestries from a whole-genome perspective7,8. However, no previous study has quantified how the portability of local PGS varies over the genome and how this information can be used to identify genomic regions of low and high relative accuracy (RA, the ratio of cross-ancestry to within-ancestry variance explained and functions thereof) between ancestral groups. We hypothesize that the degree of allele frequency and LD differences between ancestries (and therefore the local portability and RA of PGS) varies along the genome. Therefore, we developed an algorithm, Monte Carlo ANOVA (MC-ANOVA), to map the RA of local linear functions of SNP genotypes.

In this work, we apply the MC-ANOVA method to data from the UK Biobank19 and the ARIC (Atherosclerosis Risk in Communities) study20 to generate portability maps of the local RA of PGS between EUR and non-EUR ancestry groups. Using PGS for six quantitative traits (height, high-density lipoprotein [HDL], low-density lipoprotein [LDL], serum urate, body mass index [BMI], and serum glucose), we show that the portability maps we develop are predictive of the empirical local RA of EUR-derived PGS for the prediction of the same traits in African (AF), Caribbean (CR), East Asian (EA), and South Asian (SA) ancestry groups. We illustrate how the RA maps we develop can be used, together with GWAS results, to improve prediction accuracy in underrepresented ancestry groups. Our study is accompanied by the software needed to develop RA maps for other ancestries or data sets.
Results

The MC-ANOVA method estimates the impact of differences in allele frequencies and LD patterns between ancestries on the local relative accuracy (RA, and functions thereof) of PGS. To define RA, let us consider a scenario where the same causal additive model holds in two ancestry groups:

$y_i = \mathbf{z}_i'\boldsymbol{\alpha} + \varepsilon_i$ [1]

where $y_i$ ($i = 1, \dots, n$ is an index for subjects) is a phenotype, $\mathbf{z}_i$ is the (centered) vector of SNP genotypes at causal loci (QTL), and $\boldsymbol{\alpha}$ is the vector of effects. Now, let us consider an instrumental model where phenotypes are regressed on SNPs that may not necessarily have a causal effect (markers):

$y_i = \mathbf{x}_i'\boldsymbol{\beta} + e_i$ [2]

where $\mathbf{x}_i$ is a vector of (centered) SNP genotypes at markers. For a single marker-QTL pair j, the (population) marker effect is defined as:

$\beta_j = \dfrac{\operatorname{Cov}(x_{ij}, z_{ij})}{\operatorname{Var}(x_{ij})}\,\alpha_j$ [3]

where $\operatorname{Var}(x_{ij})$ is the marker variance and $\operatorname{Cov}(x_{ij}, z_{ij})$ is the marker-QTL covariance (both scalars). Extending this to a multilocus model21, the vector of population marker effects is defined as:

$\boldsymbol{\beta} = \boldsymbol{\Sigma}_x^{-1}\boldsymbol{\Sigma}_{xz}\boldsymbol{\alpha}$ [4]

where $\boldsymbol{\Sigma}_x$ is the covariance matrix of marker genotypes and $\boldsymbol{\Sigma}_{xz}$ is the covariance matrix between marker and QTL genotypes.
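Expression [4] can be read as the population least-squares projection of the phenotype onto the markers. The short derivation below is a sketch under one added assumption that is ours rather than the text's, namely that the residual $\varepsilon_i$ in [1] is uncorrelated with the marker genotypes. Here $\boldsymbol{\Sigma}_x$ and $\boldsymbol{\Sigma}_{xz}$ denote the marker covariance matrix and the marker-QTL covariance matrix from [4].

```latex
\begin{aligned}
\boldsymbol{\beta}
  &= \arg\min_{\mathbf{b}} \; \mathrm{E}\!\left[(y_i - \mathbf{x}_i'\mathbf{b})^2\right]
   = \operatorname{Var}(\mathbf{x}_i)^{-1}\operatorname{Cov}(\mathbf{x}_i, y_i) \\
  &= \boldsymbol{\Sigma}_x^{-1}\operatorname{Cov}(\mathbf{x}_i,\; \mathbf{z}_i'\boldsymbol{\alpha} + \varepsilon_i)
   = \boldsymbol{\Sigma}_x^{-1}\boldsymbol{\Sigma}_{xz}\boldsymbol{\alpha}.
\end{aligned}
```

With a single marker and a single QTL, both matrices are scalars and the last expression reduces to [3].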
Within-ancestry R-squared: Within an ancestry group, the maximum proportion of variance of the genetic values that can be explained by a regression on SNPs (assuming SNP effects are known with certainty) depends on the extent of LD between the SNPs used in [1] and those in [2], specifically (see the Supplementary Methods for a step-by-step derivation of [5]):

$R^2 = \operatorname{Corr}(\mathbf{x}_i'\boldsymbol{\beta}, \mathbf{z}_i'\boldsymbol{\alpha})^2 = \dfrac{\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{zx}\boldsymbol{\Sigma}_x^{-1}\boldsymbol{\Sigma}_{xz}\boldsymbol{\alpha}}{\boldsymbol{\alpha}'\boldsymbol{\Sigma}_z\boldsymbol{\alpha}}$ [5]

Under perfect LD between markers and QTLs (something that will occur if the causal loci are genotyped or are perfectly predicted by markers), [5] would be equal to one. However, if there is imperfect LD between markers and QTLs, $R^2$ would be less than one. Thus, the R-squared in [5] captures the impact of imperfect LD between markers and QTL on the proportion of variance at causal loci that can be explained by a regression on SNPs22 within a population.

Cross-ancestry R-squared: An R-squared similar to [5] can be derived for cross-ancestry prediction by using marker effects from one ancestry (ancestry 1, $\boldsymbol{\beta}_1$ as in [4]) to predict genetic scores in a different ancestry group (ancestry 2). Thus, introducing ancestry group notation, we can define the cross-ancestry R-squared as:

$R^2_{1\to 2} = \operatorname{Corr}(\mathbf{x}_{i_2}'\boldsymbol{\beta}_1, \mathbf{z}_{i_2}'\boldsymbol{\alpha})^2$ [6]

where $\mathbf{x}_{i_2}$ and $\mathbf{z}_{i_2}$ are the marker and QTL genotype vectors of an individual from the target ancestry (ancestry 2), $\boldsymbol{\beta}_1$ is the vector of marker effects from ancestry 1, and $\boldsymbol{\alpha}$ is the vector of QTL effects in the target ancestry.
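For reference, [6] can be written out in terms of (co)variance matrices computed in the target ancestry. The expansion below follows directly from the definition of a squared correlation between two linear scores. The superscripts $(1)$ and $(2)$, marking the ancestry in which each matrix is computed, are notation introduced here for illustration and may differ from that used in the Supplementary Methods.

```latex
R^2_{1\to 2}
  = \frac{\left(\boldsymbol{\beta}_1'\,\boldsymbol{\Sigma}_{xz}^{(2)}\,\boldsymbol{\alpha}\right)^2}
         {\left(\boldsymbol{\beta}_1'\,\boldsymbol{\Sigma}_{x}^{(2)}\,\boldsymbol{\beta}_1\right)
          \left(\boldsymbol{\alpha}'\,\boldsymbol{\Sigma}_{z}^{(2)}\,\boldsymbol{\alpha}\right)},
\qquad
\boldsymbol{\beta}_1 = \left(\boldsymbol{\Sigma}_{x}^{(1)}\right)^{-1}\boldsymbol{\Sigma}_{xz}^{(1)}\,\boldsymbol{\alpha}.
```

When the two ancestries share the same (co)variance matrices, substituting $\boldsymbol{\beta}_1$ into the ratio cancels one factor of $\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{zx}\boldsymbol{\Sigma}_x^{-1}\boldsymbol{\Sigma}_{xz}\boldsymbol{\alpha}$ and returns the within-ancestry expression [5].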
The Supplementary Methods present a step-by-step derivation of [5] and [6], expressing the within- and cross-ancestry R-squared parameters as a function of the (co)variance matrices of alleles at markers and QTL and of the QTL effects.

It is important to highlight that the R-squared parameters defined above (expressions [5] and [6], as well as the expressions presented in the Supplementary Methods) are not directly comparable to the empirical PGS R-squared values commonly reported in the literature, because empirical PGS R-squared values quantify the proportion of variance of a phenotype that can be explained by a PGS (as such, their upper limit is the genomic heritability). The R-squared parameters defined above capture the proportion of genetic (not phenotypic) variance at causal loci that can be explained by regression on SNPs (as such, the upper limit for [5] and [6] is one; this will happen under perfect LD between markers and causal variants).

Relative accuracy: Following Wang et al. (2020)7, we define the RA of a PGS as:

$\mathrm{RA} = \dfrac{R^2_{1\to 2}}{R^2_{1\to 1}}$ [7]

where $R^2_{1\to 1}$ is a within-ancestry R-squared (i.e., the proportion of variance at causal loci that can be explained by regression on markers within an ancestry group [5]), and $R^2_{1\to 2}$ is a cross-ancestry R-squared [6]. Under the assumption that the effects of the causal loci are the same in both ancestries and in the absence of allele frequency or LD differences between ancestries, $\boldsymbol{\beta}_1 = \boldsymbol{\beta}_2$. In this case, the RA will equal one. However, if there are allele frequency or LD differences between ancestries and imperfect LD between markers and causal variants (QTLs), $\boldsymbol{\beta}_1 \neq \boldsymbol{\beta}_2$ and the RA will be less than one. Thus, the RA captures the proportion of the reduction in PGS prediction R-squared attributable to allele frequency and LD differences between ancestries.
Monte Carlo Analysis of Variance (MC-ANOVA): Estimating the R-squared parameters ([5] and [6]) and the RA ([7]) requires knowledge of the QTL positions and effects ($\boldsymbol{\alpha}$), which are both unknown. Therefore, we propose a Monte Carlo (MC) algorithm (Figure 4) that, for a given chromosome segment, estimates the distribution of these R-squared values by computing R-squared values over possible configurations of marker and causal loci and their effects. The algorithm is an extension of a method we proposed previously23 to estimate the proportion of variance of a high-dimensional set explained by a regression on another high-dimensional set (in our case, the QTL by the SNPs). Additional details of the MC-ANOVA algorithm can be found in the Methods.

Figure 4: A representation of the MC-ANOVA algorithm. MC-ANOVA uses genetic data from two or more ancestry groups (here, to illustrate, we consider European [EUR] and African [AF] ancestry) to estimate the proportion of variance at causal loci explained by EUR-derived marker effects in testing data from EUR and non-EUR (e.g., AF) ancestry groups. To estimate the relative accuracy (RA) for a given chromosome segment (e.g., all loci in a ten Kbp segment), MC-ANOVA assumes that the same additive genetic model ($g_{i*} = \mathbf{z}_{i*}'\boldsymbol{\alpha}$, * = EUR or AF) holds in both ancestry groups. Within a short chromosome segment, for one Monte Carlo replicate (MC rep), we sample quantitative trait locus (QTL) positions (e.g., three; $\mathbf{z}_{i*}$ for i = 1,...,n) at random. The remaining SNPs in the segment plus those in short flanking regions form a marker genotype vector $\mathbf{x}_i$. After sampling QTL effects ($\boldsymbol{\alpha}$) from a standard normal distribution, N(0,1), genetic scores are computed as $g_{i*} = \mathbf{z}_{i*}'\boldsymbol{\alpha}$.
Marker effects are derived from the EUR ancestry group using $\boldsymbol{\beta}_{EUR} = \operatorname{Var}(\mathbf{x}_{i,EUR})^{-1}\operatorname{Cov}(\mathbf{x}_{i,EUR}, \mathbf{z}_{i,EUR}')\boldsymbol{\alpha}$, where Var is the variance and Cov is the covariance, and these effects are used to obtain local marker scores for both ancestry groups ($S_{i*} = \mathbf{x}_{i*}'\boldsymbol{\beta}_{EUR}$). The squared correlations (Cor) between genetic ($g_{i*}$) and marker ($S_{i*}$) scores are used to derive cross-ancestry and within-ancestry R-squared (R-sq.) values, and the RA is computed as the ratio between the two. This procedure is repeated a large number of times for each segment, resampling QTL positions and their effects every time. For each segment, the R-squared and RA values are averaged across MC replicates. The procedure is applied to each chromosome segment.

Maps of the relative accuracy of European-derived PGS in non-Europeans

We used the MC-ANOVA method to develop maps of the RA of EUR-derived marker effects in non-EUR ancestry groups from the UK Biobank. We developed RA maps using SNPs from the UK Biobank arrays (~610,000 SNPs, with a minor-allele frequency threshold of 1%) as well as using ~1.3 million HapMap SNPs (with a minor-allele frequency threshold of 0.1%) that were present in the imputed UK genotypes (see Methods for further details on the QC and filtering steps). To develop each of these portability maps, we partitioned the genome into short nonoverlapping segments that were at least ten Kbp long and had at least ten SNPs. We chose to use short chromosome segments to capture the proportion of variance at causal loci that can be explained (in both within- and cross-ancestry prediction) by SNPs that are physically close to causal variants. The average segment was 45 Kbp long (containing 12 core SNPs) in the case of the map derived using SNPs from the UK Biobank arrays and 22 Kbp long (containing 13 core SNPs) in the case of the map using HapMap variants.
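As a rough illustration of the Monte Carlo procedure summarized in Figure 4, applied to one such segment, the following Python sketch runs the replicate loop on simulated toy genotypes. It is not the authors' software. The function names, the toy LD simulator, the omission of flanking regions, and the use of the training sample itself (rather than a separate EUR testing set) for the within-ancestry R-squared are all simplifying assumptions made here.

```python
import random


def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def r_squared(u, v):
    """Squared correlation between two score vectors."""
    return cov(u, v) ** 2 / (cov(u, u) * cov(v, v))


def score(columns, effects):
    """Linear score per individual: sum over SNPs of genotype * effect."""
    return [sum(g * e for g, e in zip(row, effects)) for row in zip(*columns)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def simulate_genotypes(n, n_snps, freq, copy_prob, seed):
    """Toy 0/1/2 genotypes (one list per SNP) with LD between neighbours."""
    rng = random.Random(seed)
    cols = [[0] * n for _ in range(n_snps)]
    for i in range(n):
        for _ in range(2):  # two haplotypes per individual
            allele = int(rng.random() < freq)
            for j in range(n_snps):
                if j > 0 and rng.random() >= copy_prob:
                    allele = int(rng.random() < freq)  # break LD with SNP j-1
                cols[j][i] += allele
    return cols


def mc_anova_segment(train, test, n_qtl=2, n_rep=100, seed=1):
    """Mean within R-sq., cross R-sq., and RA over Monte Carlo replicates."""
    rng = random.Random(seed)
    within, cross, ra = [], [], []
    for _ in range(n_rep):
        qtl = rng.sample(range(len(train)), n_qtl)       # random QTL positions
        markers = [j for j in range(len(train)) if j not in qtl]
        alpha = [rng.gauss(0.0, 1.0) for _ in qtl]       # QTL effects ~ N(0, 1)
        g_train = score([train[j] for j in qtl], alpha)  # genetic scores
        g_test = score([test[j] for j in qtl], alpha)
        x_train = [train[j] for j in markers]
        x_test = [test[j] for j in markers]
        # beta = Var(x)^-1 Cov(x, z'alpha), estimated in the training ancestry
        var_x = [[cov(a, b) for b in x_train] for a in x_train]
        beta = solve(var_x, [cov(a, g_train) for a in x_train])
        r2_within = r_squared(score(x_train, beta), g_train)
        r2_cross = r_squared(score(x_test, beta), g_test)
        within.append(r2_within)
        cross.append(r2_cross)
        ra.append(r2_cross / r2_within)
    return tuple(sum(v) / n_rep for v in (within, cross, ra))


if __name__ == "__main__":
    # Two hypothetical ancestries that differ in allele frequency and LD.
    eur = simulate_genotypes(300, 10, freq=0.3, copy_prob=0.9, seed=2024)
    other = simulate_genotypes(300, 10, freq=0.5, copy_prob=0.6, seed=2025)
    print(mc_anova_segment(eur, other))
```

When the same genotypes are passed as both training and target data, the within- and cross-ancestry R-squared values coincide in every replicate, so the sketch returns an RA of one. A real application would also need to guard against monomorphic SNPs and collinear markers, which this toy version does not handle.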
Though the base-pair length differed, the average and median number of SNPs per segment in each map were very similar. The results from the map developed using SNPs from the UK Biobank arrays are presented in the main body of the article, and those based on HapMap variants are provided as supplementary data. Whenever pertinent, we discuss the differences between the two maps.

The derivation of marker effects (Figure 4) and within-ancestry R-squared [5] used the genotypes of 230,000 distantly related EUR ancestry individuals from the UK Biobank. To estimate the cross-ancestry R-squared [6] and the RA [7], we used data from the UK Biobank of individuals of African (AF), Caribbean (CR), East Asian (EA), and South Asian (SA) ancestry (Table 2 and Figure B1). Further details about sample selection and SNP QC are offered in the Methods section.

An interactive R Shiny app that displays RA estimates (from the UK Biobank arrays or HapMap variants) for user-specified genome positions (or SNP IDs) was created and is available via an R package and also on a website (see Supplementary Notes for more information). In addition, the portability map based on the UK Biobank arrays is provided in Supplementary Data 1 and the portability map based on the HapMap variants is provided in Supplementary Data 2.

Table 2: Average R-squared and relative accuracy (RA) by testing set (based on SNPs from the UK Biobank arrays).
Ancestry Group | Sample Size | Fst with EUR | R-squared ($R^2_{1\to 2}$)* | Relative Accuracy ($R^2_{1\to 2}/R^2_{1\to 1}$) | Standard Error of RA*** | Variance in RA Across Segments***
European (EUR) | 230,000 | --- | 0.648** | 1.000 | --- | ---
African (AF) | 3,083 | 0.120 | 0.182 | 0.268 | 0.016 | 0.033
Caribbean (CR) | 3,343 | 0.102 | 0.228 | 0.340 | 0.017 | 0.030
East Asian (EA) | 1,329 | 0.095 | 0.379 | 0.564 | 0.030 | 0.043
South Asian (SA) | 7,919 | 0.022 | 0.506 | 0.771 | 0.017 | 0.016

* Subscript 1 always indicates an EUR training or testing set; 2 indicates non-EUR testing; ** $R^2_{1\to 1}$; *** Median

MC-ANOVA predicts low relative accuracy of PGS between ancestry groups

Averaged over the genome, the within-EUR R-squared, $R^2_{1\to 1}$ [5], was 0.65. This suggests that within-EUR ancestry SNPs from the UK Biobank arrays could explain roughly two-thirds of the genetic variance at ungenotyped causal loci that have a similar allele frequency distribution to the SNPs in the UK Biobank arrays (Table 2). The cross-ancestry R-squared [6] estimates were much lower, ranging from 0.182 (AF) to 0.506 (SA), which resulted in RA estimates ranging from 0.268 (AF) to 0.771 (SA). As expected, the RA was inversely related to the genetic distance between the testing ancestry and the EUR group (Table 2 and Figure B1). For example, the AF ancestry group had the highest Fst24 with the EUR group (0.120) and the lowest whole-genome RA (0.268), while the SA group had the lowest Fst (0.022) and the highest RA (0.771).

The estimated R-squared values were significantly higher when the map was produced using HapMap variants (Figure B2 and Table B1). The variance of RA between segments was slightly smaller when the map was produced with the HapMap variants (Table 2 and Table B1). The increase in RA with the HapMap variant-based map was expected, given that this map had twice as many SNPs as the one using array SNPs.
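As a quick arithmetic check on Table 2, the ratio of the genome-wide average R-squared values can be compared with the reported RA. The sketch below uses only numbers copied from Table 2, and the variable and function names are ours. The two quantities need not coincide, because the RA is computed and averaged at the segment level rather than by dividing one genome-wide average by another. This is our reading of the procedure, stated here as an assumption.

```python
# Values copied from Table 2 (UK Biobank array SNPs).
R2_WITHIN_EUR = 0.648
TABLE2 = {  # ancestry: (cross-ancestry R-squared, reported RA)
    "AF": (0.182, 0.268),
    "CR": (0.228, 0.340),
    "EA": (0.379, 0.564),
    "SA": (0.506, 0.771),
}


def ratio_of_averages(r2_cross, r2_within=R2_WITHIN_EUR):
    """Ratio of genome-wide average R-squared values (not the reported RA)."""
    return r2_cross / r2_within


if __name__ == "__main__":
    for ancestry, (r2_cross, reported_ra) in TABLE2.items():
        print(ancestry, round(ratio_of_averages(r2_cross), 3), reported_ra)
```

In all four groups the ratio of averages is slightly above the reported RA, and both quantities rank the ancestries in the same order (AF, CR, EA, SA).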
Predicted versus empirical RA

We used data from distantly related EURs from the UK Biobank and a Bayesian shrinkage variable-selection prediction method (BayesC25) to develop PGS for six complex traits: height, HDL, LDL, serum urate, BMI, and serum glucose (see Methods for details about the phenotypes and methods used to derive the PGS). Using testing data from the EUR and non-EUR ancestry groups (Table B2), we estimated the empirical prediction R-squared for each trait, the corresponding empirical RA (i.e., the ratio of the PGS R-squared in non-EURs relative to the PGS R-squared in EURs), and the loss of accuracy (LOA) attributable to allele frequency and LD differences between ancestries7,

$$\text{LOA \%} = \frac{1 - \text{predicted RA}}{1 - \text{empirical RA}} \times 100,$$

where the predicted RA is defined in [7]. For most traits, the empirical RA estimates were smaller than the predicted RA (Figure 5 for UK Biobank arrays and Figure B3 for HapMap variants). This is expected because the MC-ANOVA-predicted RA captures the LOA attributable to allele frequency and LD differences, which together are only one source of LOA. In general, for any given trait, ancestries with higher predicted RA also had higher empirical RA (Figure 5). This suggests that, as noted earlier by Wang et al.7, allele frequency and LD differences between ancestries are a substantial factor affecting the portability of PGS and that the MC-ANOVA estimates capture that. For most traits and ancestry groups, allele frequency and LD differences alone explained more than 50% of the empirical LOA. However, for glucose, the proportion of reduction in accuracy explained by allele frequency and LD differences was smaller.
This could suggest that differences in the genetic architecture (including both heritability and polygenicity) of traits between ancestries and G×E interactions may play a more important role in glucose than in the other traits evaluated. For example, height is highly heritable and highly polygenic, and BMI is also highly polygenic and moderately heritable. On the other hand, glucose has a moderately low heritability and is less polygenic than height or BMI26–28.

Figure 5: Predicted relative accuracy versus empirical RA with UK Biobank array SNPs. MC-ANOVA predicted relative accuracy (RA) versus empirical RA of European (EUR)-derived polygenic scores when used to predict phenotypes of individuals of non-EUR ancestry (AF, CR, EA, and SA denote African, Caribbean, East Asian, and South Asian ancestry, respectively). Each panel displays a different phenotype (height, high-density lipoprotein [HDL], serum urate, low-density lipoprotein [LDL], body mass index [BMI], and glucose). The loss of accuracy (LOA, %) attributable to allele frequency and LD differences between ancestries is shown on top of each bar set. A standard error bar is shown for each mean RA estimate; derivation details are in the Supplementary Methods, and details for the empirical RA are in the Methods. The sample sizes used to derive the standard errors are in Table B2. These results are based on SNPs from the UK Biobank arrays; see Figure B3 for results obtained using HapMap SNPs.

We compared our PGS-predicted RA and LOA estimates with those reported by Wang et al. (2020)7, who developed a method to predict RA and LOA for specific PGS. Overall, except for LDL, our results were similar to those published by Wang et al. in terms of both RA and LOA (Figure B4), although, unlike Wang et al.’s method, MC-ANOVA does not use trait-specific SNP effect estimates.
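As a small worked example of the LOA percentage used in this section (the share of the empirical accuracy loss that is accounted for by the predicted loss), the sketch below evaluates the formula for one pair of values. The function name and the input numbers are hypothetical and chosen only for illustration; they are not results from this study.

```python
def loa_percent(predicted_ra: float, empirical_ra: float) -> float:
    """LOA % = (1 - predicted RA) / (1 - empirical RA) * 100."""
    if empirical_ra >= 1.0:
        # No empirical loss of accuracy, so the ratio is undefined.
        raise ValueError("empirical RA must be below 1 for LOA to be defined")
    return (1.0 - predicted_ra) / (1.0 - empirical_ra) * 100.0


# Hypothetical inputs: predicted RA of 0.40 and empirical RA of 0.20.
# The predicted loss is 0.60 and the empirical loss is 0.80.
example_loa = loa_percent(predicted_ra=0.40, empirical_ra=0.20)
print(f"LOA attributable to allele frequency and LD differences: {example_loa:.1f}%")
```

With these hypothetical inputs, three quarters of the empirical loss would be attributed to allele frequency and LD differences.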
The (local) relative accuracy of PGS varies along the genome

The results presented above were based on the estimated R-squared and RA averaged across the genome or the segments of the genome represented in a PGS. However, in line with our main hypothesis, we found sizable variability in cross-ancestry R-squared [6] and RA between chromosome segments (Figure 6 and Figure B5 [UK Biobank arrays]; Figure B6 and Figure B7 [HapMap variants]), suggesting that even for ancestries with a low overall RA (e.g., AF), there are still chromosome segments with high RA and portability of EUR-derived PGS. The distribution of the within-EUR R-squared [5] values was symmetric; however, for ancestries with a strong African ancestry influence, the distribution of the cross-ancestry R-squared [6] was heavily right-skewed, with most of the chromosome segments having a low cross-ancestry R-squared.

Figure 6: Within- and cross-ancestry R-squared distributions based on UK Biobank array SNPs. Distribution of the cross-ancestry R-squared (R-sq.) versus the within-European (EUR) R-squared for the African (AF), Caribbean (CR), East Asian (EA), and South Asian (SA) ancestry groups obtained when using SNPs from the UK Biobank arrays (see Figure B6 for results based on HapMap SNPs). Each panel displays a different non-EUR ancestry group. Each point represents a small chromosome segment (45 Kbp) and a histogram of the distribution of the points is also shown along each axis. Each subplot has dashed gray lines at the 10th, 50th, and 90th percentiles of the distribution and a red dashed 45-degree reference line (slope of one and intercept at zero). There is a white point at the intersection of the within-ancestry R-squared median and the cross-ancestry R-squared median.

The estimates presented in Figure 6 correspond to average results across MC runs of the MC-ANOVA algorithm.
In our maps, we also provide the standard deviation (SD) of the distribution of the R-squared and RA parameters across MC replicates, along with the standard error of the means (Supplementary Methods). The median standard error of the cross-ancestry R-squared [6] was 8.0% of the point estimates, and that of the within-ancestry R-squared [5] was 1.7%. To illustrate the uncertainty associated with the reported R-squared estimates, we sampled 100 segments for each ancestry group and displayed the within- and cross-ancestry R-squared point estimates with their corresponding standard error bars in Figure B8.

MC-ANOVA estimates are predictive of the local RA of empirical PGS

The results shown in Figure 6 suggest that in any ancestry group, but particularly for those that are more genetically distant from the EUR ancestry, the predicted cross-ancestry R-squared and RA vary substantially over the genome. To evaluate whether MC-ANOVA estimates are predictive of the local RA of real PGS, we first grouped SNPs into sets according to their MC-ANOVA predicted cross-ancestry R-squared [6] and used this to define four portability groups: Very Low, Low, Medium, and High (Table 3 for AF; Table B3 for CR, EA, and SA). Then, we decomposed the trait-specific PGS into subscores, each using the SNPs in a predicted portability group. Finally, we computed the correlation between each subscore and the corresponding adjusted phenotype in testing sets for EUR and non-EUR, as well as the difference in the correlations of within- and cross-ancestry PGS prediction.

Table 3: Estimated relative accuracy (RA) of the SNP segments across the genome grouped by their estimated portability in terms of cross-ancestry R-squared ($R^2_{1\to 2}$ for 1 = EUR and 2 = AF testing set). Results were obtained using SNPs from the UK Biobank arrays.
| Testing Group | Portability Group | Quantile Group | $R^2_{1\to 2}$ Cutoff Range | Number of SNPs | Average $R^2_{1\to 1}$ | Average $R^2_{1\to 2}$ | Average RA ($R^2_{1\to 2}/R^2_{1\to 1}$) |
|---|---|---|---|---|---|---|---|
| African (AF) | High | (0.8,1] | (0.26,0.97] | 122,135 | 0.751 | 0.400 | 0.529 |
| African (AF) | Medium | (0.6,0.8] | (0.18,0.26] | 122,131 | 0.674 | 0.215 | 0.323 |
| African (AF) | Low | (0.5,0.6] | (0.15,0.18] | 61,065 | 0.646 | 0.162 | 0.255 |
| African (AF) | Very Low | [0,0.5] | [0,0.15] | 305,352 | 0.597 | 0.086 | 0.144 |

(See Table B3 for other ancestry groups.)

For most traits, we observed that the difference in empirical PGS correlation (non-EUR PGS correlation subtracted from EUR PGS correlation) decreased as the predicted portability of the SNP set increased (Figure 7 for AF; Figure B9 for CR, EA, and SA). For instance, for individuals of AF ancestry, the difference in the within- and cross-ancestry PGS and phenotype correlations for height ranged from 0.30 for the Very Low portability group of SNP segments to just 0.06 for the High portability group of SNP segments (top-left panel in Figure 7). Similar patterns were observed for the other traits (and ancestry groups; Figure B9). For serum urate and HDL cholesterol, there was near-perfect portability of PGS between EUR and AF for SNPs in the High portability group. Furthermore, the LOA attributable to allele frequency and LD differences estimated within each SNP portability group was lowest in the High portability group for most traits and ancestry groups (Figure B10). For example, in the AF group, the LOA for height was just 9.2% in the High portability group, but 88.3% in the Very Low portability group. This indicates that MC-ANOVA-predicted portability is predictive of the empirical RA and LOA of chromosome segments.

Figure 7: The difference between polygenic score prediction correlation by SNP portability group based on UK Biobank array SNPs.
The vertical axis represents the difference between the within- and cross-ancestry polygenic score prediction correlations of European (EUR) derived polygenic scores (PGS) for SNP groups with Very Low, Low, Medium, and High MC-ANOVA predicted portability ($R^2_{1\to 2}$ groupings, Table 3) by trait. Each panel displays a different phenotype (height, high-density lipoprotein [HDL], serum urate, low-density lipoprotein [LDL], body mass index [BMI], and glucose). A positive difference in PGS prediction correlation indicates that the PGS of the SNP set had a higher prediction correlation in EUR (within-ancestry prediction) than in individuals of African (AF, cross-ancestry prediction) ancestry. The number of SNPs entering each PGS is annotated toward the bottom of each subplot. A standard error bar for each prediction correlation difference is shown and details for the calculation can be found in the Methods. The gray vertical bars are the simulated null distribution (mean +/- standard error of 2,000 iterations) for the correlation difference, where SNPs were assigned to portability groups completely at random, maintaining the number of SNPs in each subgroup. The sample sizes for the simulated null distribution are in Table B2. See Figure B9 for results for other ancestry groups (Caribbean, East Asian, and South Asian) and Figure B11 for results based on HapMap SNPs.

Using HapMap SNPs did not notably improve PGS local portability over using the called genotypes set (Figure B11). Overall, the validation results obtained with the HapMap-based map were similar to the ones reported for the map based on SNPs of the UK Biobank arrays; however, the grouping of SNPs based on the HapMap-based map was not as effective at reducing the empirical difference in prediction correlation between EUR and non-EUR ancestry groups as with the map based on SNPs from the UK Biobank arrays (Figure 7, and Figures B9 and B11).
We believe this may partially reflect possible artifacts induced by the use of imputed SNPs, which may lead to upwardly biased estimates of RA.

To benchmark the results of Figure 7, we performed a similar analysis to that presented in Figure 7, Figure B9, and Figure B11, classifying SNPs into portability groups using Fst24 and Wang et al.’s RA method7 (Figure B12). Overall, MC-ANOVA was considerably more effective at identifying SNP sets with varying levels of portability than Fst or Wang et al.’s RA. Fst was very poor at predicting the RA of trait-specific local PGS, and Wang et al.’s RA was only effective at detecting SNP sets with different RAs for height (Figure B12). In contrast, both the High and Medium portability groups based on MC-ANOVA were different from the simulated null for height, and the High portability group based on MC-ANOVA was different from the simulated null for HDL, serum urate, and BMI (Figure 7).

Genomic regions with high RA are enriched for GWAS hits and high SNP density

We investigated whether the MC-ANOVA estimates of R-squared and RA were associated with the presence of GWAS hits (p value < 5e-8; Table B4) in the EUR ancestry. We found that genomic regions with higher MC-ANOVA R-squared values were highly enriched for GWAS hits for all the traits investigated (Figure 8) and tended to have higher marker density (which, in turn, leads to higher LD between markers and causal variants). However, for segments with similar marker density to each other, the R-squared estimates were relatively uniformly distributed across the entire range (Figure B13), especially for the EA and SA ancestry groups. This suggests that high marker density is a necessary but not sufficient condition to achieve high MC-ANOVA R-squared values.

Figure 8: The proportion of UK Biobank array SNPs that were significantly associated with a trait for SNP groups with Very Low, Low, Medium, and High MC-ANOVA predicted portability.
The y-axes give the proportion of SNPs for which a European (EUR)-based genome-wide association study (GWAS) p value (based on a two-sided test of a t-statistic, with the null hypothesis that the SNP effect is zero) was less than 5e-8 within each portability group (x-axes). Each panel displays a different phenotype (height, high-density lipoprotein [HDL], serum urate, low-density lipoprotein [LDL], body mass index [BMI], and glucose). For the EUR testing set, the grouping was based on the within-ancestry R-squared [5]; for the non-EUR testing sets (African [AF], Caribbean [CR], East Asian [EA], and South Asian [SA]), it was based on the cross-ancestry R-squared [6]. The number of SNPs is noted above each bar and is based on SNPs from the UK Biobank arrays.

Using RA to improve cross-ancestry prediction of transfer learning algorithms

To demonstrate how RA maps can be used to improve cross-ancestry PGS prediction accuracy, we evaluated PGS informed by the RA maps in the context of transfer learning. Gradient Descent with Early Stopping (GD-ES) is a widely employed technique for transfer learning (TL) in various machine learning algorithms. Recently, Zhao et al.29 introduced the application of GD-ES in constructing PGS for cross-ancestry prediction. This approach uses EUR-derived SNP effect estimates as initial values for a GD-ES algorithm that updates these estimates by iterating on data from the non-EUR target population. In GD-ES, a learning rate parameter is used to control the strength of the updates. In Zhao et al.29, the learning rate was the same for all SNPs in the PGS. We took this concept one step further by using the cross-ancestry RA maps to inform the learning rate of the gradient descent algorithm, making it SNP-specific (see Methods). Specifically, we allowed for stronger learning rates for SNPs in regions with low predicted portability and weaker learning rates for SNPs with high cross-ancestry portability.
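The idea of RA-informed learning rates can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the squared-error loss, the patience-based stopping rule, the toy data, and in particular the mapping `0.1 * (1 - r2_cross)` from cross-ancestry R-squared to learning rate are hypothetical choices made only so that low-portability SNPs receive larger updates.

```python
import numpy as np


def gd_es(X_tr, y_tr, X_val, y_val, beta_init, lr, max_iter=500, patience=10):
    """Gradient descent on squared-error loss with early stopping.

    `lr` may be a scalar (same rate for all SNPs) or a vector (SNP-specific).
    Returns the coefficients with the lowest validation MSE seen so far.
    """
    beta = np.asarray(beta_init, dtype=float).copy()
    best_beta = beta.copy()
    best_mse = float(np.mean((y_val - X_val @ beta) ** 2))
    stall, n = 0, X_tr.shape[0]
    for _ in range(max_iter):
        grad = -X_tr.T @ (y_tr - X_tr @ beta) / n
        beta = beta - lr * grad  # element-wise when lr is a vector
        mse = float(np.mean((y_val - X_val @ beta) ** 2))
        if mse < best_mse:
            best_beta, best_mse, stall = beta.copy(), mse, 0
        else:
            stall += 1
            if stall >= patience:  # stop once validation error stops improving
                break
    return best_beta, best_mse


# Toy data standing in for a non-EUR target sample (illustration only).
rng = np.random.default_rng(1)
n, p = 400, 20
X = rng.standard_normal((n, p))
beta_eur = 0.3 * rng.standard_normal(p)        # stand-in for EUR-derived effects
r2_cross = rng.uniform(0.05, 0.95, size=p)     # stand-in for map estimates [6]
beta_target = beta_eur + 0.3 * (1.0 - r2_cross) * rng.standard_normal(p)
y = X @ beta_target + rng.standard_normal(n)
X_tr, y_tr, X_val, y_val = X[:300], y[:300], X[300:], y[300:]

mse_start = float(np.mean((y_val - X_val @ beta_eur) ** 2))
lr_snp = 0.1 * (1.0 - r2_cross)                # hypothetical RA-informed rates
_, mse_fixed = gd_es(X_tr, y_tr, X_val, y_val, beta_eur, lr=0.05)
beta_ra, mse_ra = gd_es(X_tr, y_tr, X_val, y_val, beta_eur, lr=lr_snp)
print(f"start={mse_start:.3f} fixed={mse_fixed:.3f} RA-informed={mse_ra:.3f}")
```

Because the function returns the best coefficients seen on the validation set, neither variant can end with a higher validation error than the EUR-derived starting values; which variant performs better on real data is an empirical question.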
We applied this approach to develop PGS for non-EUR ancestry groups from the UK Biobank, using EUR-derived effects as initial values. Our preliminary results (Table B8) suggest that using RA-informed learning rates can improve cross-ancestry prediction accuracy over using a fixed learning rate in most traits evaluated for prediction in an external testing set (see Methods). The improvement is particularly clear in the CR and AF ancestry groups (Table B8).

External validation

The results presented thus far were entirely based on UK Biobank data. Prediction across cohorts poses additional challenges (e.g., the use of different SNP arrays and G×E factors). Therefore, to assess the performance of MC-ANOVA in an external validation, we conducted an evaluation using data from the Atherosclerosis Risk in Communities (ARIC) study20. The validation involved 9,628 European American (AEA) and 3,130 African American (AAA) participants from the ARIC study. For these analyses, we utilized a set of 795,613 SNPs that were common between the genotypes of the ARIC study and the imputed genotypes from the UK Biobank19. The AEA group from the ARIC study served as a within-ancestry (cross-data set) testing set, while the AAA group from the ARIC study served as a cross-ancestry (and cross-data set) testing set. We evaluated global RA and LOA, as well as local PGS, using the predicted portability groups derived from the MC-ANOVA R-squared estimates [6] for height, serum urate, and BMI. The whole PGS empirical RA estimates were higher than those of the within data set (UK Biobank only) analysis for height (approximately 0.35) and BMI (approximately 0.25), and the predicted RA estimates were correspondingly higher as well. The whole PGS LOA attributable to allele frequency and LD differences across height, serum urate, and BMI was approximately 60% (Figure B14a), which is similar to what we estimated using the UK Biobank data.
The assessment of empirical correlation difference (UK Biobank EUR → ARIC AEA minus UK Biobank EUR → ARIC AAA) within SNP sets grouped by MC-ANOVA portability estimates validated the results for height, as the empirical correlation difference deviated from the simulated null distribution in the High portability group (Figure B14b).

Discussion

Previous studies suggest that between-ancestry differences in allele frequencies and LD patterns are a major factor contributing to the loss of accuracy (LOA) in cross-ancestry PGS prediction6–8. For instance, Privé et al.8 showed that the portability of PGS between ancestry groups worsens with the genetic distance between the groups, and Wang et al.7 reported that much of the LOA in prediction from European (EUR) to African (AF) ancestry could be attributed to allele frequency and LD differences. However, no previous study has investigated whether the relative accuracy (RA) of cross-ancestry PGS varies along the genome. To address this knowledge gap, we developed a novel approach (MC-ANOVA) to estimate the RA of short chromosome segments. MC-ANOVA estimates the RA of randomly generated linear functions of genotypes within each chromosome segment, making MC-ANOVA a trait-agnostic method that is solely based on genome information. The methodology can be used to map regions of high and low (local) PGS portability between two or more ancestry groups. We applied MC-ANOVA to UK Biobank data to generate maps (with a mapping resolution of ~45 Kbp) of the maximum expected RA when EUR-derived SNP effects are used to predict phenotypes or disease risk of non-EURs, including individuals of AF, Caribbean (CR), East Asian (EA), and South Asian (SA) descent. Finally, we validated these RA maps by quantifying the empirical RA of real PGS for SNP sets with High, Medium, Low, and Very Low MC-ANOVA predicted portability for prediction within and across data sets.
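The validation step summarized above (grouping SNPs by predicted portability and splitting a PGS into group-specific subscores) can be sketched as follows. The quantile cutoffs (0.5, 0.6, 0.8) follow the quantile groups listed in Table 3; the function names, the toy inputs, and the use of per-SNP R-squared values are illustrative assumptions rather than the code used in this study.

```python
import numpy as np

LABELS = np.array(["Very Low", "Low", "Medium", "High"])


def portability_groups(r2_cross, probs=(0.5, 0.6, 0.8)):
    """Assign each SNP to a portability group by quantiles of its predicted R-squared.

    Intervals are right-closed: [0, q50], (q50, q60], (q60, q80], (q80, 1].
    """
    cuts = np.quantile(r2_cross, probs)
    idx = np.searchsorted(cuts, r2_cross, side="left")
    return LABELS[idx], cuts


def pgs_subscores(X, beta, groups):
    """Split a PGS (X @ beta) into one subscore per portability group."""
    return {g: X[:, groups == g] @ beta[groups == g] for g in LABELS}


# Toy inputs (illustration only): 1,000 SNPs and 50 individuals.
rng = np.random.default_rng(7)
r2_cross = rng.uniform(0.0, 1.0, size=1000)              # stand-in for map estimates [6]
X = rng.integers(0, 3, size=(50, 1000)).astype(float)    # 0/1/2 genotype codes
beta = 0.05 * rng.standard_normal(1000)                  # stand-in for EUR-derived effects

groups, cuts = portability_groups(r2_cross)
subscores = pgs_subscores(X, beta, groups)
counts = {g: int(np.sum(groups == g)) for g in LABELS}
print(counts)
```

The subscores sum to the full PGS, so each one can be correlated with the adjusted phenotype separately in the within- and cross-ancestry testing sets.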
Genome differentiation between populations has been a focus of population genetics for more than seven decades. The Fst24 metric quantifies differentiation in allele frequencies. MC-ANOVA and Wang et al.’s RA method7 capture both differences in allele frequencies and LD patterns, with the key difference being that Wang et al.’s RA method accounts for pairwise LD and MC-ANOVA uses a multilocus regression approach that accounts for the full patterns of conditional linear dependence/independence of loci within a segment and does not require assuming that causal variants are independent. Additionally, unlike Wang et al.’s method, MC-ANOVA is trait-agnostic in that it does not use SNP effect estimates. This makes MC-ANOVA suitable to develop RA maps that can be used with any trait. We benchmarked MC-ANOVA against Fst and Wang et al.’s RA metric in terms of the ability of the methods to identify SNPs with Very Low, Low, Medium, and High portability. In the benchmark analysis, MC-ANOVA convincingly outperformed both Fst and Wang et al.’s RA method across traits and ancestry groups (Figure B12).

Consistent with previously reported LOA estimates7, we found that, on average, allele frequency and LD differences between ancestries explained approximately half of the LOA genome-wide in the EA and SA ancestry groups and approximately two-thirds in the AF and CR groups. As expected, for the average chromosome segment, MC-ANOVA predicts lower RA for groups more genetically distant (e.g., EUR→AF or EUR→CR) relative to genetically closer groups (e.g., EUR→EA or EUR→SA). These results support the literature that allele frequency and LD differences between ancestries significantly affect the RA of PGS across ancestries. However, we also found significant variability in RA across chromosome segments. Indeed, even for the more genetically distant groups (e.g., EUR→AF), we found many segments with high predicted RA.
This is important because it suggests that there are many regions of the genome for which results from large EUR GWAS may be portable to non-EUR ancestries, which has the potential for improving cross-ancestry prediction.

MC-ANOVA estimates capture the components of LOA attributable to differences in allele frequencies and LD between ancestry groups, which together are only one of the factors affecting the RA of PGS in cross-ancestry prediction. Therefore, MC-ANOVA-predicted RA should be considered the maximum RA that one could achieve in cross-ancestry prediction, under the implicit assumption that causal variants are being tagged by SNPs within ~45 Kbp. The gap between the predicted and empirical RA varied between traits. For example, among the traits we considered, the gap between the MC-ANOVA predicted RA and the empirical RA appeared to be largest for glucose (Figure 5), a trait that is likely to be more affected by G×E interactions with exposures (e.g., diet, lifestyle, and exercise) that can be correlated with ancestry. Likewise, the ability of MC-ANOVA RA maps to identify regions of high and low RA varied between traits (Figure 7). For traits with an extremely polygenic genetic architecture (e.g., height and BMI26,27), MC-ANOVA appeared to be more predictive of the empirical difference in the PGS prediction correlation between the EUR and non-EUR groups than for traits such as glucose. This is expected because MC-ANOVA estimates the RA of linear functions averaging over many possible randomly drawn linear combinations of SNP and QTL genotypes.

The MC-ANOVA algorithm is controlled by a few parameters, including the segment size, the number of causal variants within the segment, the number of SNPs in the flanking regions, and the distribution from which causal variant effects are drawn.
The RA maps that we present in this study are based on small (~45 Kbp) segments, each containing three causal variants (which are randomly chosen in each MC replicate) and ten SNPs in each of the flanking regions of the segment. We chose these parameters to achieve a relatively fine mapping resolution for segments that may hold more than one causal variant. To assess the robustness of our results with respect to the parameter values chosen, we performed sensitivity analyses first varying the number of causal variants in the segment, then varying the flank size for a given QTL and segment size, and finally changing the distribution used to sample effects from Gaussian to Gamma (Figures B15a, B15b, and B16, respectively). Overall, in all sensitivity analyses, we found that the distribution of the RA measures, as well as the genomic regions where RA peaks, were reasonably robust to the parameters of the MC-ANOVA algorithm, except in cases involving just one causal variant or no flanking SNPs. In these two cases, we observed a systematic reduction in R-squared parameters and RA (Figures B15c and B15d).

The RAs of the map developed with UK Biobank array SNPs (~610,000 SNPs) were smaller (and the variance in RA was higher) than those estimated using twice as many HapMap variants (~1.3 million SNPs). This can be attributed to the higher marker density of the HapMap variant set and the stronger LD among those variants compared to those of the UK Biobank array. The higher LD among variants in the HapMap variant set was both a consequence of the higher marker density and of a distribution of the minor allele frequency (MAF) that was symmetric and with a mode near 0.24. On the other hand, the distribution of the MAF in the array set had an enrichment in the lower MAF, which would impose limits on the maximum LD30.
Furthermore, correlated imputation errors (which may result from a tendency to impute genotypes from certain haplotypes) may lead to a spurious increase in LD among imputed variants. Overall, the global MC-ANOVA predicted relative accuracy was more similar to the empirical relative accuracy with the UK Biobank array-based map (Figure 5) than the HapMap-based map (Figure B3). Furthermore, the UK Biobank array-based RA map was slightly better than the HapMap-based map at predicting the empirical differences between the within- and cross-ancestry PGS prediction correlation (compare Figure 7 and Figure B9 with Figure B11). Therefore, for PGS with SNPs within the allele frequency spectrum represented in the UK Biobank arrays, we recommend using the map based on UK Biobank array variants. Nevertheless, both maps are made available with this article.

When comparing RA estimates with GWAS results, we found that regions with high predicted portability are highly enriched for GWAS hits. This is expected because RA is expected to be high in regions with strong and long-spanning LD and, at the same time, high LD among variants also increases the power to detect associations when causal variants are not genotyped. Furthermore, selection can lead to higher LD for loci with large effects on fitness traits31,32. A good example of the overlap of high RA in regions that have been detected to be associated with many traits, including many fitness traits, appears on chromosome six between 25.84 and 33.29 Mbp (Figure B5), which had the largest cross-ancestry R-squared [6] values in all four non-EUR ancestry groups. This peak closely overlaps with the major histocompatibility complex (MHC) region33.
An abundance of literature has established that the MHC region includes numerous loci (e.g., human leukocyte antigen [HLA] genes) associated with many traits and diseases, particularly autoimmune diseases (e.g., nephropathy), infections, cancers, and psychiatric conditions (e.g., autism and schizophrenia)1,33–37. The MHC region is also known to be highly polymorphic, has high gene density, and has very strong LD33,34,38. Interestingly, for all four ancestry groups, the majority of the genes with the highest predicted portability were within chromosome six and the MHC region (Tables B5-B7).

An important question is whether the RA maps that we developed can be used to improve PGS prediction accuracy for groups that are underrepresented in GWA studies. For example, in the construction of PGS for cross-ancestry prediction, one could filter out SNPs that are in regions with very low predicted RA. However, in our maps, there were almost no segments with negative cross-ancestry correlation estimates. Therefore, we do not expect that removing SNPs based on their low RA would result in improved cross-ancestry PGS prediction. Another possibility is to use cross-ancestry predicted R-squared [6] estimates to inform transfer learning (TL) algorithms used to develop PGS for non-EUR ancestry groups. We found that using cross-ancestry predicted R-squared [6] to inform learning rates in a GD-ES29 algorithm resulted in improvements in PGS prediction accuracy compared to an algorithm that used a fixed learning rate, thus demonstrating an important practical application of the RA maps developed in this study.

In conclusion, we developed and validated a method to map the RA of short chromosome segments and used data from the UK Biobank and the ARIC study cohorts to develop RA maps for several ancestry groups.
These maps can provide valuable information for explaining GWAS replication (or lack thereof) across ancestry groups and can help in prioritizing variants for the development of PGS for cross-ancestry prediction. Together with the methods and results presented in this study, we provide software that can be used to generate RA maps for other data sets and ancestry groups and share the maps of RA through an R-package and a web interface.

Methods

Data

In this study, we used data from the UK Biobank and the ARIC study cohorts. For model training, we leveraged the large sample size of Europeans (EUR) from the UK Biobank. We conducted an internal validation using testing data from EUR and non-Europeans from the UK Biobank and an external validation using data from European Americans and African Americans from the ARIC study.

UK Biobank cohort. We used distantly related individuals (defined as individuals with a within-ancestry genomic relationship < 0.05) from the UK Biobank. We randomly split the 236,698 distantly related EUR ancestry individuals into a training set of size 230,000 and a testing set of 6,698. Additionally, UK Biobank testing sets included individuals of African ([AF], n=3,083), Caribbean ([CR], n=3,343), East Asian ([EA], n=1,329), and South Asian ([SA], n=7,919) ancestry (Table 2). Ancestral groups were defined by the UK Biobank self-reported Ethnic background (Data-Field 2100039), but individuals were only included in each ancestry group if they passed the UK Biobank’s Sample QC (Resource 53139), were not excluded from kinship inference, were included in phasing, and were not identified as outliers in heterozygosity and missing rates. Samples were also excluded if they withdrew from the study, if they had a mismatch of reported and genetic sex, if they were missing all six phenotypes of interest (described below), or if they were related to other samples with relatedness ≥ 0.05.
Relatedness was determined using genomic relationship matrices ($\mathbf{G} = \dfrac{\mathbf{Z}\mathbf{Z}'}{\operatorname{tr}(\mathbf{Z}\mathbf{Z}')/n}$, where $\mathbf{Z}$ is the centered genotype matrix) computed within an ancestry group.

The ARIC study cohort. An external validation utilized the ARIC study, consisting of a European American (AEA) testing set of 9,628 and an African American (AAA) testing set of 3,130 based on self-reported race, which is highly concordant with the ancestry group defined based on SNP-derived principal component analysis16. The previously described EUR training set from the UK Biobank was used as the training set again for this external validation.

UK Biobank genotypes. For analysis involving the SNPs from the UK Biobank arrays, we used 610,791 genotyped SNPs from the UK Biobank Affymetrix array19 in autosomal chromosomes. SNPs with a minor allele frequency of <1% or a missing call rate >5% overall (all ancestry groups combined) were excluded, and monomorphic SNPs in a particular ancestry group were excluded from analyses involving that group (108 for AF and 47,390 for EA). The base pair positions provided are based on GRCh3719. The HapMap SNP set used was based on the intersection of the Northern and Western European ancestry HapMap 340,41 SNPs and the UK Biobank imputed SNP genotypes19. A total of 1,297,917 SNPs with a quality score >0.7, a minor allele frequency in the full dataset cohort ≥ 0.1%, and not monomorphic in either the EUR or non-EUR cohorts were retained for analysis.

The ARIC study genotypes. For analysis involving model training in the UK Biobank and model validation in the ARIC study, we identified a common set of 795,613 autosomal chromosome SNPs between the ARIC study genotyped SNPs and the UK Biobank imputed set (excluding multiallelic variants)19. SNPs were excluded if they were monomorphic in either the UK Biobank EUR training set or one of the testing sets from the ARIC study.
We checked for consistency of the genotyped strand and the reference alleles. SNP effects (estimated in the UK Biobank) for SNPs with different reference alleles in the UK Biobank and the ARIC study were multiplied by -1 before PGS were computed in the ARIC study cohorts.

Mapping the relative accuracy (RA) of cross-ancestry PGS prediction

MC-ANOVA method. MC-ANOVA uses genomic data from two or more ancestry groups (here, we use 1 = EUR and 2 = AF groups to illustrate). The goal is to estimate the proportion of variance (R-squared) at causal loci that can be explained by EUR-derived marker effects in testing data from EUR ($R^2_{1\to1} = \mathrm{Corr}(\mathbf{x}_{i_1}'\boldsymbol{\beta}_1, \mathbf{z}_{i_1}'\boldsymbol{\alpha})^2$ [5]) and AF ($R^2_{1\to2} = \mathrm{Corr}(\mathbf{x}_{i_2}'\boldsymbol{\beta}_1, \mathbf{z}_{i_2}'\boldsymbol{\alpha})^2$ [6]) ancestries. Here, $\mathbf{z}_{i_*}$ and $\mathbf{x}_{i_*}$ are genotypes at causal variants and markers (including markers in the core and flanking regions, Figure 1) of group * (* = 1 or 2), respectively, $\boldsymbol{\alpha}$ is the vector of QTL effects (which are assumed to be the same in both groups), and $\boldsymbol{\beta}_1$ is the vector of marker effects in group 1. The relative accuracy (RA) ratio is then defined and computed as $\mathrm{RA} = R^2_{1\to2}/R^2_{1\to1}$. For a chromosome segment, MC-ANOVA estimates RA by quantifying the portability of randomly generated linear functions of SNP genotypes within short chromosome segments. We have previously shown that for general settings, the MC-ANOVA algorithm provides unbiased estimates of [5]²³.

RA maps. To develop our RA maps, we first grouped SNPs into disjoint segments.
For each chromosome, we partitioned the SNPs into nonoverlapping segments of at least ten Kbp with a minimum of ten core SNPs per segment, leading to 52,956 segments for the SNPs from the UK Biobank arrays and 100,311 segments for the SNPs from the HapMap variants. The average SNP segment was 45 (22) Kbp long and contained 12 (13) core SNPs for the UK Biobank array SNPs (HapMap variants). The code used to define the SNP segments for the RA maps can be found at https://github.com/lupiA/MCANOVA (Supplementary Notes).

For each segment and Monte Carlo (MC) replicate, we sampled three QTL positions at random ($\mathbf{z}_{i_*}$). The remaining SNPs in the segment plus 20 flanking SNPs (ten for each flanking region) were used as markers ($\mathbf{x}_{i_*}$). QTL effects were sampled from IID standard normal distributions. For the sensitivity analysis shown in Figure B16, QTL effects were sampled from IID Gamma distributions with a shape parameter equal to 1.5 and a rate parameter equal to one. We computed genetic scores for the causal model for individuals from ancestries 1 and 2 using $g_{i_1} = \mathbf{z}_{i_1}'\boldsymbol{\alpha}$ and $g_{i_2} = \mathbf{z}_{i_2}'\boldsymbol{\alpha}$. Marker effects in ancestry group 1 were computed as $\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{X}_1 + \mathbf{I}k)^{-1}\mathbf{X}_1'\mathbf{Z}_1\boldsymbol{\alpha}$, where k = 1e-8 was a small constant added to the diagonal of $\mathbf{X}_1'\mathbf{X}_1$ to avoid numerical problems. For short chromosome segments, the resulting marker effect estimates ($\hat{\boldsymbol{\beta}}_1$) are almost identical to the true population effects ($\boldsymbol{\beta}_1$) because the response used to derive $\hat{\boldsymbol{\beta}}_1$ ($g_{i_1} = \mathbf{z}_{i_1}'\boldsymbol{\alpha}$) is not affected by errors and the sample size used vastly exceeded the number of markers.
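The marker-effect step of a single MC replicate can be illustrated with a minimal Python sketch. This is not the MCANOVA package code: the simulated genotypes, the crude way LD is induced, the toy sizes, and the helper names (`solve`, `center`, `ridge_marker_effects`, `genotype`) are all assumptions made for illustration. It only shows the ridge-type solution with k = 1e-8 applied to an error-free genetic score.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def center(columns):
    """Center each genotype column at zero."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        out.append([v - mean for v in col])
    return out

def ridge_marker_effects(x_cols, g, k=1e-8):
    """beta_hat = (X'X + I k)^{-1} X'g, where g = Z alpha is the causal genetic score."""
    p = len(x_cols)
    xtx = [[sum(u * v for u, v in zip(x_cols[a], x_cols[b])) + (k if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    xtg = [sum(u * v for u, v in zip(x_cols[a], g)) for a in range(p)]
    return solve(xtx, xtg)

rng = random.Random(1)
n, n_qtl, n_markers = 300, 3, 8  # toy sizes (assumed), far smaller than the real data

def genotype(freq):
    """Draw a genotype (0, 1, or 2 copies of the allele) at the given frequency."""
    return sum(rng.random() < freq for _ in range(2))

z_raw = [[genotype(0.3) for _ in range(n)] for _ in range(n_qtl)]
# Each marker copies a QTL genotype 80% of the time: a crude, assumed stand-in for LD.
x_raw = [[z_raw[j % n_qtl][i] if rng.random() < 0.8 else genotype(0.3) for i in range(n)]
         for j in range(n_markers)]
z_cols, x_cols = center(z_raw), center(x_raw)

alpha = [rng.gauss(0.0, 1.0) for _ in range(n_qtl)]  # IID standard normal QTL effects
g = [sum(alpha[q] * z_cols[q][i] for q in range(n_qtl)) for i in range(n)]
beta_hat = ridge_marker_effects(x_cols, g)
fitted = [sum(beta_hat[j] * x_cols[j][i] for j in range(n_markers)) for i in range(n)]
```

Because k is tiny, the fitted scores satisfy the least-squares normal equations almost exactly. The sketch stops at the marker effects; evaluating the within- and cross-ancestry R-squared would additionally require genotypes that were not used to derive `beta_hat`.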
For each MC replicate, we estimated the within and across R-squared parameters ([5] and [6]) using data not used to derive marker effects by squaring the correlation of the marker and QTL predictions: $R^2_{1\to1} = \mathrm{Corr}(\mathbf{x}_{i_1}'\boldsymbol{\beta}_1, \mathbf{z}_{i_1}'\boldsymbol{\alpha})^2$ [5] and $R^2_{1\to2} = \mathrm{Corr}(\mathbf{x}_{i_2}'\boldsymbol{\beta}_1, \mathbf{z}_{i_2}'\boldsymbol{\alpha})^2$ [6]. For each segment, we conducted 300 MC replicates (each time resampling QTL positions and their effects) and reported the average (across MC replicates) R-squared and RA values in the RA maps. A visual representation of the MC-ANOVA estimation algorithm can be found in Figure 1.

MC-ANOVA sensitivity analysis. To demonstrate the robustness of MC-ANOVA to its main parameters, we re-estimated the RA maps in the AF UK Biobank cohort, first varying the number of QTLs sampled for a given segment (one, two, three, four, five, and six QTLs per segment). Second, we varied the number of flanking SNPs to each side of the segment to be included in the MC-ANOVA estimation (zero, five, ten, 15, 20, and 30 flanking SNPs to each side). These were both evaluated in the chromosome segments discussed above.

Phenotype preprocessing

UK Biobank phenotypes. We evaluated six phenotypes in the UK Biobank cohort (Table B2): height, HDL, serum urate, LDL, BMI, and serum glucose. Each phenotype was preadjusted using an ordinary least squares (OLS) regression including sex, age, the first five genotyped principal components, center, and batch. We used records from the first or, when the first instance was missing, the second visit. Serum urate was log-transformed before preadjustment.

The ARIC study phenotypes. We evaluated three phenotypes that were common between the ARIC study and those evaluated in our main UK Biobank-based analyses: height, serum urate, and BMI.
The ARIC study phenotypes were preadjusted within each ancestry group using OLS regressions including sex and age. Serum urate was log-transformed before preadjustment. The ARIC study subjects were removed from the PGS analyses if they were missing the phenotype of interest, sex, or age.

Relative accuracy map validation for real traits

GWAS. For each preadjusted phenotype (Table B2), we conducted a GWAS in the training set described above, that is, the distantly related individuals of EUR ancestry (n=230,000) from the UK Biobank (Table B4). Each GWAS (a single marker regression) was carried out using the R package BGData⁴² (the rayOLS option). The test uses a t-statistic for the null hypothesis that the SNP effect is zero (a two-sided test). The GWAS p values were used as a filtering step for the subsequent PGS: a SNP was included in the PGS if it had a p value < 1e-5. Note that when referring to a GWAS hit, as in Figure 8, we used the standard cutoff of p value < 5e-8 for consistency with other literature.

SNP effects for polygenic scores (PGS) using real data. For each phenotype, effects ($\hat{\mathbf{b}}_1$) for the GWAS-filtered SNPs were estimated with a Bayesian shrinkage variable-selection method (BayesC²⁵, a mixture prior consisting of a point mass at zero and a Gaussian slab). These models were fit using the BLRXy function from the R package BGLR⁴³, which generates posterior samples using a Gibbs sampler⁴⁴. We estimated SNP effects using 50,000 posterior samples collected using five MCMC chains. SNP effects were averaged over the chains.

PGS prediction. For each phenotype, we computed PGS for each subject in each testing set (ancestry group 1 = EUR and 2 = AF, CR, EA, or SA) using $\hat{y}_{i_*} = \mathbf{x}_{i_*}'\hat{\mathbf{b}}_1$, where * denotes group 1 or 2.
The PGS prediction correlation was then defined as $\mathrm{Corr}(\hat{y}_{i_*}, y_{i_*})$, where $y_{i_*}$ is the adjusted phenotype of the ith subject of the corresponding testing group. The empirical RA was then defined as $\mathrm{RA} = \mathrm{Corr}(\hat{y}_{i_2}, y_{i_2})^2 / \mathrm{Corr}(\hat{y}_{i_1}, y_{i_1})^2$, where the numerator is the squared PGS correlation for a cross-ancestry PGS (e.g., 2 = AF, CR, EA, or SA), and the denominator is that for within-ancestry (1 = EUR). Comparing this empirical RA to the MC-ANOVA predicted RA, $\mathrm{RA} = R^2_{1\to2}/R^2_{1\to1}$ [7], we can also define the loss of accuracy⁷ (LOA) percentage attributable to allele frequency and LD differences between ancestries: $\mathrm{LOA}\,\% = \frac{1 - \text{predicted RA}}{1 - \text{empirical RA}} \times 100$.

Standard error estimates. We obtained approximate standard error estimates for the PGS correlation coefficients, $\mathrm{Corr}(\hat{y}_{i_*}, y_{i_*})$, using $\sqrt{[1 - \mathrm{Corr}(\hat{y}_{i_*}, y_{i_*})^2]/(n_* - 2)}$, where $n_*$ is the sample size of the given testing set (* = 1 or 2). The standard error of the correlation difference between two ancestries (e.g., 1 = EUR and 2 = AF), $\mathrm{Corr}(\hat{y}_{i_1}, y_{i_1}) - \mathrm{Corr}(\hat{y}_{i_2}, y_{i_2})$, was computed as $\sqrt{\mathrm{SE}_1^2 + \mathrm{SE}_2^2}$.
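As a worked illustration of these quantities, the following Python sketch evaluates the empirical RA, the LOA percentage, and the approximate standard errors. The correlations and the predicted RA are hypothetical values chosen for illustration; only the two sample sizes (6,698 EUR and 3,083 AF testing individuals) come from the text, and the function names are not from the MCANOVA package.

```python
import math

def corr_se(r, n):
    """Approximate SE of a correlation: sqrt((1 - r^2) / (n - 2))."""
    return math.sqrt((1 - r ** 2) / (n - 2))

def diff_se(se1, se2):
    """SE of the difference between two independent correlations."""
    return math.sqrt(se1 ** 2 + se2 ** 2)

def empirical_ra(r_cross, r_within):
    """Ratio of squared PGS correlations: cross-ancestry over within-ancestry."""
    return r_cross ** 2 / r_within ** 2

def loa_percent(predicted_ra, emp_ra):
    """Share of the accuracy loss attributed to allele frequency and LD differences."""
    return (1 - predicted_ra) / (1 - emp_ra) * 100

r_within, n_within = 0.50, 6698   # hypothetical EUR correlation, EUR testing size
r_cross, n_cross = 0.25, 3083     # hypothetical AF correlation, AF testing size
predicted_ra = 0.70               # hypothetical MC-ANOVA predicted RA

emp_ra = empirical_ra(r_cross, r_within)       # 0.0625 / 0.25 = 0.25
loa = loa_percent(predicted_ra, emp_ra)        # 0.30 / 0.75 * 100 = 40
se_diff = diff_se(corr_se(r_within, n_within), corr_se(r_cross, n_cross))
```

With these made-up inputs, a quarter of the within-ancestry accuracy is retained and 40% of the loss would be attributed to allele frequency and LD differences.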
Following Wang et al.⁷, the standard error for the empirical RA was computed as $\mathrm{SE}(\text{empirical RA}) = \sqrt{(\text{empirical RA})^2 \times \left( \frac{4\,[1 - \mathrm{Corr}(\hat{y}_{i_1}, y_{i_1})^2]}{n_1 \times \mathrm{Corr}(\hat{y}_{i_1}, y_{i_1})^2} + \frac{4\,[1 - \mathrm{Corr}(\hat{y}_{i_2}, y_{i_2})^2]}{n_2 \times \mathrm{Corr}(\hat{y}_{i_2}, y_{i_2})^2} \right)}$. A similar method was used to obtain standard errors for the predicted RA, with the addition of an MC error component. More details can be found in the Supplementary Methods.

PGS subscores. To validate the MC-ANOVA method, we computed four PGS subscores for each trait and ancestry group based on the MC-ANOVA cross-ancestry R-squared estimates [6] from the RA maps. For one ancestry group and trait, the High PGS subscore consisted of the SNPs in the PGS that were in the top 20% (above the 80th percentile) of $R^2_{1\to2}$ [6]. Similarly, the Medium subscore consisted of the SNPs between the 60th and 80th percentiles, the Low subscore of those between the 50th and 60th percentiles, and the Very Low subscore of those in the bottom 50%. The PGS correlations described above, $\mathrm{Corr}(\hat{y}_{i_2}, y_{i_2})$, were then computed within each of those SNP sets. Note that in Table B4, the trait-specific proportion of variance explained by the EUR ancestry-derived PGS was computed from the overall PGS R-squared (using all PGS SNPs).
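The percentile-based grouping used for the PGS subscores can be sketched as follows. This is an illustrative Python sketch rather than the code used in the study: the function name, the toy SNP identifiers and R-squared values, and the handling of ties and of SNPs falling exactly on a cutoff are assumptions.

```python
def assign_subscores(r2_by_snp):
    """Map each SNP to a portability group from the rank of its cross-ancestry R-squared."""
    ordered = sorted(r2_by_snp, key=r2_by_snp.get)  # SNP ids in ascending R-squared order
    n = len(ordered)
    groups = {}
    for rank, snp in enumerate(ordered):
        # Integer comparisons avoid floating-point issues at the percentile cutoffs.
        if 10 * rank >= 8 * n:
            groups[snp] = "High"        # top 20%
        elif 10 * rank >= 6 * n:
            groups[snp] = "Medium"      # 60th to 80th percentile
        elif 10 * rank >= 5 * n:
            groups[snp] = "Low"         # 50th to 60th percentile
        else:
            groups[snp] = "Very Low"    # bottom 50%
    return groups

# Hypothetical cross-ancestry R-squared values for ten PGS SNPs.
example = {f"snp{i}": i / 10 for i in range(10)}
groups = assign_subscores(example)
counts = {label: list(groups.values()).count(label)
          for label in ("High", "Medium", "Low", "Very Low")}
```

Each subscore would then be computed by summing the effect-weighted genotypes over the SNPs in one group only.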
In the benchmark analysis described next, PGS subscores were computed in the same way as for MC-ANOVA, with SNP sets for PGS subscores based on the quantiles of the respective method’s RA map (Fst or Wang et al.’s RA). To obtain a simulated null distribution for the expected correlation difference based on the number of SNPs included in each PGS in Figure 7 and Figures B9, B11, B12, and B14b, we permuted the grouping labels over 2,000 iterations for each trait and ancestry group and estimated the PGS correlation difference between EUR and non-EUR within each permuted grouping.

Benchmarks

We benchmarked MC-ANOVA against Fst²⁴ and the RA method described in Wang et al., 2020⁷. Both benchmark methods were evaluated in the same SNP segments used to build the cross-ancestry RA maps for MC-ANOVA (defined based on a minimum length of ten Kbp and at least ten SNPs), which yielded an RA map for each benchmark method as well.

Fixation index (Fst). Derived from Wright’s F-statistic, Fst²⁴ has been the traditional metric used in population genetics to quantify genome differentiation in terms of allele frequency differences between populations. For a given locus, Fst expresses the between-population genetic variance as a proportion of the total variance, such that a value of zero corresponds to no differentiation between the populations.
We computed the Fst for the qth window as the average Fst of all core SNPs in that segment, where the Fst for a single SNP is:

$$F_{st} = \frac{\bar{p}(1-\bar{p}) - \left[\frac{n_1}{n_1+n_2}\,p_1(1-p_1) + \frac{n_2}{n_1+n_2}\,p_2(1-p_2)\right]}{\bar{p}(1-\bar{p})}, \qquad \bar{p} = p_1\frac{n_1}{n_1+n_2} + p_2\frac{n_2}{n_1+n_2} \quad [8]$$

where $p_*$ is the minor allele frequency and $n_*$ is the sample size for population *.

Wang et al. RA method. The second RA method was described by Wang et al., 2020⁷ to quantify the proportion of prediction accuracy loss across ancestries attributable to allele frequency and LD differences. We modified Wang et al.’s method to make it trait-invariant.
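For reference, the single-SNP Fst in [8] and its segment average can be sketched in Python as below. The function names and the allele frequencies are hypothetical, and the sketch assumes that SNPs monomorphic in the pooled sample have already been removed, since they would make the denominator zero.

```python
def fst_snp(p1, p2, n1, n2):
    """Single-SNP Fst: (total variance - weighted within variance) / total variance."""
    w1 = n1 / (n1 + n2)
    w2 = n2 / (n1 + n2)
    p_bar = w1 * p1 + w2 * p2                         # sample-size-weighted frequency
    total = p_bar * (1 - p_bar)                       # variance in the pooled sample
    within = w1 * p1 * (1 - p1) + w2 * p2 * (1 - p2)  # weighted within-group variance
    return (total - within) / total                   # undefined if total == 0

def fst_segment(freqs_1, freqs_2, n1, n2):
    """Segment-level Fst: the average single-SNP Fst over the core SNPs."""
    values = [fst_snp(p1, p2, n1, n2) for p1, p2 in zip(freqs_1, freqs_2)]
    return sum(values) / len(values)

# Hypothetical allele frequencies for three core SNPs in two groups of equal size.
segment_fst = fst_segment([0.2, 0.3, 0.0], [0.4, 0.3, 1.0], n1=100, n2=100)
```

In this toy segment the second SNP has identical frequencies in both groups and contributes zero, while the third is fixed for opposite alleles and contributes one.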
For each core SNP j (i.e., those in a chromosome segment, excluding the SNPs in flanking regions) in a single segment (see the section ‘Mapping the relative accuracy (RA) of cross-ancestry PGS prediction’ for segment details), we identified the SNPs in pairwise LD with it ($R^2 \geq .45$) among the SNPs in the core or flanking (buffer) regions of that segment. The local RA of Wang et al. for SNP j was then defined by:

$$\left( \frac{r_{1,j}\, r_{2,j} \sqrt{\dfrac{p_{2,j}(1-p_{2,j})}{p_{1,j}(1-p_{1,j})}}}{r_{1,j}^2} \right)^{2} \times \frac{p_{1,j}(1-p_{1,j})}{p_{2,j}(1-p_{2,j})}. \quad [9]$$

Here, $p_{*,j}$ is the allele frequency for the jth SNP, and $r_{*,j}$ is the mean correlation between the jth SNP and the SNPs in pairwise LD with it, for ancestry group * = 1,2 (for this analysis 1 = EUR and 2 = AF). The overall RA estimated for a segment by Wang et al. is the average of [9] over each core SNP j in the segment.

Validation in the ARIC Study

RA maps were developed using the UK Biobank EUR training set and the data from the AEA and AAA participants from the ARIC study for external validation. For this validation, the MC-ANOVA procedure was carried out as described above for the UK Biobank (a minimum segment length of ten Kbp and at least ten SNPs, and three QTL), and 65,525 nonoverlapping SNP segments (an average of 36 Kbp and containing 12 core SNPs) were defined for the RA maps. Global predicted RA, empirical RA, and LOA were estimated for height, serum urate, and BMI. First, portability measures (cross-ancestry R-squared [6] and predicted RA [7]) were estimated within each segment with MC-ANOVA.
In this case, predicted RA is defined as $R^2_{\mathrm{EUR}\to\mathrm{AAA}} / R^2_{\mathrm{EUR}\to\mathrm{AEA}}$ [7], where EUR is the UK Biobank EUR training set, AAA is the ARIC study African American testing set, and AEA is the ARIC study European American testing set.

Similarly, the global PGS (using all SNPs meeting the GWAS p value threshold of 1e-5) was evaluated for each trait. The same procedure as above was used to estimate SNP effects (see ‘SNP effects for polygenic scores (PGS) using real data’), which are derived from the UK Biobank EUR training set: $\hat{\mathbf{b}}_1$. Then, the PGS prediction is $\hat{y}_{i_*} = \mathbf{x}_{i_*}'\hat{\mathbf{b}}_1$, for * = 1 or 2 now denoting either AEA (within-ancestry) or AAA (cross-ancestry), respectively. The PGS correlation calculation, $\mathrm{Corr}(\hat{y}_{i_*}, y_{i_*})$, was also the same as above for the UK Biobank (* = 1 [AEA] or 2 [AAA]; see ‘PGS prediction’). Thus, empirical RA in this case was computed as $\mathrm{Corr}(\hat{y}_{i_2}, y_{i_2})^2 / \mathrm{Corr}(\hat{y}_{i_1}, y_{i_1})^2$. When validating the RA maps with PGS subscores based on SNP groups defined by the RA maps, the correlation difference was computed as $\mathrm{Corr}(\hat{y}_{i_1}, y_{i_1})^2 - \mathrm{Corr}(\hat{y}_{i_2}, y_{i_2})^2$, and the portability groupings were based on the same $R^2_{1\to2}$ [6] quantiles as for the UK Biobank (see ‘PGS subscores’).
Integrating RA maps into a gradient descent algorithm

Gradient descent with early stopping (GD-ES) is an approach commonly used for transfer learning (TL) in machine learning algorithms. Recently, Zhao et al.²⁹ proposed using GD-ES to build PGS for cross-ancestry prediction. In Zhao’s GD-ES algorithm, effects are estimated by minimizing a residual sum of squares evaluated in a data set (D2) from a target population (e.g., African ancestry), using an iterative procedure that uses an external estimator ($\hat{\boldsymbol{\beta}}_1$, derived from D1 of, e.g., European ancestry) as the initial value. Thus, GD-ES produces a sequence of estimates, $\{\tilde{\boldsymbol{\beta}}_2^{(0)}, \tilde{\boldsymbol{\beta}}_2^{(1)}, \ldots, \tilde{\boldsymbol{\beta}}_2^{(s)}\}$, starting with $\tilde{\boldsymbol{\beta}}_2^{(0)} = \hat{\boldsymbol{\beta}}_1$ (pure cross-ancestry prediction) and moving toward the solution that one would obtain only using D2 ($\hat{\boldsymbol{\beta}}_2$) after s iterations. Early stopping of the GD algorithm renders estimates that are a compromise between $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ and have been shown to improve cross-ancestry PGS prediction compared to using either a purely external ($\hat{\boldsymbol{\beta}}_1$) or a purely internal ($\hat{\boldsymbol{\beta}}_2$) estimate²⁹.

We extended this approach by allowing for a SNP-specific learning rate (LR) that is based on MC-ANOVA relative accuracy estimates. In a GD algorithm, coefficients are updated one at a time using $\beta_{2j}^{\mathrm{new}} = \beta_{2j}^{\mathrm{current}} - \mathrm{LR} \times dL/d\beta_{2j}$, where LR is a learning rate parameter (controlling how fast the algorithm moves in the direction that minimizes the loss function, in our case the residual sum of squares loss function) and $dL/d\beta_{2j}$ is the gradient of the loss function with respect to the jth coefficient of $\boldsymbol{\beta}_2$.
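The update rule and the role of early stopping can be illustrated with a small Python sketch. It is not the GD.R implementation from the repository: the loss is assumed to be the residual sum of squares scaled by 1/(2n), a single fixed learning rate is used, the stopping point is chosen on one hold-out set rather than by cross-validation, and all data and names are made up.

```python
def gd_path(X, y, beta_init, lr, n_iter):
    """Coordinate-wise gradient descent on RSS/(2n); returns the estimate after each pass."""
    n, p = len(X), len(beta_init)
    beta = list(beta_init)
    resid = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
    path = [list(beta)]                      # iteration 0 is the external estimate
    for _ in range(n_iter):
        for j in range(p):
            grad = -sum(row[j] * e for row, e in zip(X, resid)) / n
            step = lr * grad
            beta[j] -= step                  # beta_new = beta_current - LR * dL/dbeta_j
            for i, row in enumerate(X):
                resid[i] += row[j] * step    # keep residuals in sync with beta
        path.append(list(beta))
    return path

def rss(X, y, beta):
    """Residual sum of squares of the predictions X beta."""
    return sum((yi - sum(x * b for x, b in zip(row, beta))) ** 2 for row, yi in zip(X, y))

def early_stop(path, X_val, y_val):
    """Index of the iteration with the smallest validation RSS."""
    losses = [rss(X_val, y_val, b) for b in path]
    return min(range(len(path)), key=losses.__getitem__)

# Toy target-population data whose least-squares solution is [2, -1].
X_train = [[1, 0], [0, 1], [1, 0], [0, 1]]
y_train = [2.0, -1.0, 2.0, -1.0]
beta_external = [1.0, 0.0]                   # stands in for the EUR-derived effects
path = gd_path(X_train, y_train, beta_external, lr=0.1, n_iter=50)

# Hold-out data that favor a compromise between the external and internal solutions.
best = early_stop(path, [[1, 0], [0, 1]], [1.5, -0.5])
```

The training loss decreases at every pass, while the validation loss is smallest at an intermediate iteration, which is the compromise that early stopping exploits.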
In Zhao et al.²⁹, the same LR was used for all SNPs. We modified the algorithm by introducing an adaptive (SNP-specific) LR: $\mathrm{LR}_j = 0.01 \times e^{-3 R^2_{1\to2,j}}$, where $R^2_{1\to2,j}$ is the estimate presented in equation [6]. With this approach, a SNP with a high MC-ANOVA cross-ancestry R-squared estimate will have a low learning rate, staying closer to the initial external estimate ($\hat{\beta}_{1j}$), and a SNP with a low $R^2_{1\to2,j}$ will have a higher learning rate, thus moving further away from the EUR-derived estimated effect.

For the EUR ancestry group effects, $\hat{\boldsymbol{\beta}}_1$, we used the same PGS effects ($\hat{\mathbf{b}}_1$) described above (see the Methods section ‘SNP effects for polygenic scores (PGS) using real data’). These were then employed as initial values in a gradient descent algorithm run on data from either AF, SA, or CR ancestry from the UK Biobank (the EA group was excluded due to the small sample size for this group). To obtain an unbiased estimate of the out-of-sample R-squared, we split the data into training and testing sets (testing n = 300). We then conducted a five-fold cross-validation within the training data to select the optimal number of iterations of the GD algorithm (which acts as the parameter controlling how much effects are shrunk towards the initial values). Then, we ran the GD algorithm with that number of iterations on the entire training data and used the resulting effects to predict in the excluded testing data. This was repeated 50 times, each time with a different random partition of training and testing. The average results for the 18 trait-ancestry group combinations are reported in Table B8. The adaptive (SNP-specific) learning rate was compared to using a fixed learning rate, which was the mean of the adaptive learning rate for each trait-ancestry group pair.
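The mapping from a cross-ancestry R-squared estimate to a SNP-specific learning rate, and the matching fixed-rate comparator, can be written in a few lines of Python. The function and argument names and the three R-squared values are hypothetical; the constants 0.01 and 3 are those given in the text.

```python
import math

def adaptive_lr(r2_cross, base=0.01, decay=3.0):
    """SNP-specific learning rate: base * exp(-decay * cross-ancestry R-squared)."""
    return base * math.exp(-decay * r2_cross)

r2 = [0.05, 0.40, 0.90]                  # hypothetical R-squared values for three SNPs
lrs = [adaptive_lr(v) for v in r2]       # higher R-squared gives a smaller learning rate
fixed_lr = sum(lrs) / len(lrs)           # comparator: mean of the adaptive rates
```

A SNP with an R-squared near zero is updated at close to the base rate of 0.01, whereas a SNP with an R-squared near one is updated roughly twenty times more slowly.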
Additionally, for each trait-ancestry pair, we computed the percentage of training-testing partitions for which the prediction R-squared of the adaptive learning rate method was higher than that of the fixed learning rate (Table B8), excluding partitions that had identical R-squared. The R code implementing the GD algorithm is included in the GD.R function in the GitHub repository https://github.com/lupiA/MCANOVA.

Genetic distance

The genetic distance reported between the ancestry groups in Table 2 and Figure B1 was computed as the overall (genome-wide) Fst²⁴ between pairwise ancestries using PLINK (v1.90b6.24)⁴⁵: --fst --within. We used a random sample of 20,000 individuals from the EUR ancestry group.

Data availability

The relative accuracy maps generated in this study have been deposited in the Zenodo database at https://doi.org/10.5281/zenodo.13769713 and are provided as Supplementary Data. The GWAS summary statistics are available through Zenodo at https://doi.org/10.5281/zenodo.13785877. The UK Biobank data is available under restricted access and access can be obtained by applying at https://www.ukbiobank.ac.uk/. The ARIC Study data is available from dbGaP (https://www.ncbi.nlm.nih.gov/gap/) under accession code phs000280.v3.p1. The raw UK Biobank and the ARIC study data are protected and are not available due to data privacy laws. The protocol and consent were approved by the UK Biobank’s Research Ethics Committee and were conducted under the application number 15326. Use of the ARIC study data was approved by Michigan State University's Institutional Review Board under Study ID LEGACY15-745. Source data for Figures are provided with this paper.
Code availability

The software presented and described in this study (the MC-ANOVA algorithm, a function to obtain the chromosome segments, the portability maps, and an interactive Shiny App), along with examples of how to use the MC-ANOVA algorithm, can be found in an R package described at and installable from https://github.com/lupiA/MCANOVA (Zenodo: https://doi.org/10.5281/zenodo.13769713). An identical web-based Shiny app is also available at https://lupia.github.io/Cross-Ancestry-Portability/ (Zenodo: https://doi.org/10.5281/zenodo.13769723); it runs slower than the R package app but does not require R software or package installation.

Acknowledgements

Data from the UK Biobank was acquired from application number 15326 and data from the ARIC study was acquired through dbGaP under accession code phs000280.v3.p1 and project number 9191. We would like to thank the participants and those who developed the UK Biobank and ARIC data sets, as well as Michigan State University and the Institute for Cyber-Enabled Research at Michigan State University for providing funding and computing resources, respectively. We also thank Wen Huang for the comments provided when A.L. presented the preliminary results of this study. The authors received funding from NIH grants R01DK119836 (A.L., G.D.L.C., and A.V.), R03HG011674 (A.L., G.D.L.C., and A.V.), and R01HG013794 (A.L., G.D.L.C., and A.V.).

Author Contributions Statement

Conceptualization: A.L., G.D.L.C., and A.V.; Methodology: A.L. and G.D.L.C.; Software Development: A.L. and G.D.L.C.; Investigation: A.L. and G.D.L.C.; Writing – Original Draft: A.L.; Writing – Review & Editing: A.L., G.D.L.C., and A.V.; Project Administration and Supervision: A.V. and G.D.L.C.; Funding Acquisition: A.L., G.D.L.C., and A.V.

Competing Interests Statement

No authors have any competing interests to declare.

REFERENCES

1. Sollis, E. et al. The NHGRI-EBI GWAS catalog: knowledgebase and deposition resource. Nucleic Acids Res.
51, D977–D985 (2023). 2. Lello, L. et al. Accurate genomic prediction of human height. Genetics 210, 477–497 (2018). 3. Kim, H., Grueneberg, A., Vazquez, A. I., Hsu, S. & de los Campos, G. Will big data close the missing heritability gap? Genetics 207, 1135–1145 (2017). 4. Martin, A. R. et al. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat. Genet. 51, 584–591 (2019). 5. Dikilitas, O. et al. Predictive utility of polygenic risk scores for coronary heart disease in three major racial and ethnic groups. Am. J. Hum. Genet. 106, 707–716 (2020). 6. Scutari, M., Mackay, I. & Balding, D. Using genetic distance to infer the accuracy of genomic prediction. PLOS Genet. 12, e1006288 (2016). 7. Wang, Y. et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 11, 3865 (2020). 8. Privé, F. et al. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am. J. Hum. Genet. 109, 12–23 (2022). 9. Vilhjálmsson, B. J. et al. Modeling linkage disequilibrium increases accuracy of polygenic risk scores. Am. J. Hum. Genet. 97, 576–592 (2015). 10. Belsky, D. W. et al. Development and evaluation of a genetic risk score for obesity. Biodemography Soc. Biol. 59, 85–100 (2013). 11. Domingue, B. W., Belsky, D., Conley, D., Harris, K. M. & Boardman, J. D. Polygenic influence on educational attainment: new evidence from The National Longitudinal Study of Adolescent to Adult Health. AERA Open 1, 1–13 (2015). 12. Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50, 1112–1121 (2018). 13. Vassos, E. et al. An examination of polygenic score risk prediction in individuals with first-episode psychosis. Biol. Psychiatry 81, 470–477 (2017). 14. Li, Z. et al.
Genome-wide association analysis identifies 30 new susceptibility loci for schizophrenia. Nat. Genet. 49, 1576–1583 (2017). 15. Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017). 16. Veturi, Y. et al. Modeling heterogeneity in the genetic architecture of ethnically diverse groups using random effect interaction models. Genetics 211, 1395–1407 (2019). 17. Cavazos, T. B. & Witte, J. S. Inclusion of variants discovered from diverse populations improves polygenic risk score transferability. Hum. Genet. Genomics Adv. 2, 100017 (2021). 18. Hou, K. et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet. 55, 549–558 (2023). 19. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018). 20. The ARIC investigators. The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. Am. J. Epidemiol. 129, 687–702 (1989). 21. de los Campos, G., Sorensen, D. & Gianola, D. Genomic heritability: what is it? PLOS Genet. 11, e1005048 (2015). 22. de los Campos, G., Vazquez, A. I., Fernando, R., Klimentidis, Y. C. & Sorensen, D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLOS Genet. 9, e1003608 (2013). 23. de los Campos, G. et al. ANOVA-HD: Analysis of variance when both input and output layers are high-dimensional. PLOS One 15, e0243251 (2020). 24. Wright, S. The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19, 395–420 (1965). 25. Habier, D., Fernando, R. L., Kizilkaya, K. & Garrick, D. J. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12, 186 (2011). 26. Zeng, J. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 50, 746–753 (2018). 27.
Zhang, Y., Qi, G., Park, J.-H. & Chatterjee, N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 50, 1318–1326 (2018). 28. Qiao, Z. et al. Estimation and implications of the genetic architecture of fasting and non-fasting blood glucose. Nat. Commun. 14, 451 (2023). 29. Zhao, Z., Fritsche, L. G., Smith, J. A., Mukherjee, B. & Lee, S. The construction of cross-population polygenic risk scores using transfer learning. Am. J. Hum. Genet. 109, 1998–2008 (2022). 30. VanLiere, J. M. & Rosenberg, N. A. Mathematical properties of the measure of linkage disequilibrium. Theor. Popul. Biol. 74, 130–137 (2008). 31. Bulmer, M. G. The effect of selection on genetic variability. Am. Nat. 105, 201–211 (1971). 32. Slatkin, M. Linkage disequilibrium – understanding the evolutionary past and mapping the medical future. Nat. Rev. Genet. 9, 477–485 (2008). 33. Trowsdale, J. & Knight, J. C. Major histocompatibility complex genomics and human disease. Annu. Rev. Genomics Hum. Genet. 14, 301–323 (2013). 34. Dendrou, C. A., Petersen, J., Rossjohn, J. & Fugger, L. HLA variation and disease. Nat. Rev. Immunol. 18, 325–339 (2018). 35. The International Schizophrenia Consortium. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009). 36. The Autism Spectrum Disorders Working Group of The Psychiatric Genomics Consortium. Meta-analysis of GWAS of over 16,000 individuals with autism spectrum disorder highlights a novel locus at 10q24.32 and a significant overlap with schizophrenia. Mol. Autism 8, 21 (2017). 37. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014). 38. Matzaraki, V., Kumar, V., Wijmenga, C. & Zhernakova, A. The MHC locus and genetic susceptibility to autoimmune and infectious diseases. Genome Biol. 18, 76 (2017). 39.
UK Biobank - UK Biobank. https://www.ukbiobank.ac.uk/. 40. International HapMap 3 Consortium et al. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010). 41. HapMap 3. Broad Institute https://www.broadinstitute.org/medical-and-population-genetics/hapmap-3 (2008). 42. Grueneberg, A. & de los Campos, G. BGData - A suite of R packages for genomic analysis with big data. G3: Genes, Genomes, Genetics 9, 1377–1383 (2019). 43. Pérez, P. & de los Campos, G. Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495 (2014). 44. Geman, S. & Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984). 45. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).

APPENDIX B: Chapter 2 Supplementary Materials

Mapping the relative accuracy of cross-ancestry prediction

Alexa S Lupi¹,²,*, Ana I Vazquez¹,², and Gustavo de los Campos¹,²,³,*

¹ Department of Epidemiology and Biostatistics, Michigan State University (MSU), East Lansing, Michigan 48824, United States.
² Institute for Quantitative Health Science and Engineering, Systems Biology, MSU.
³ Department of Statistics and Probability, MSU.

Supplementary Methods

Derivation of the within- and cross-ancestry R-squared parameters

In this note we present a step-by-step derivation of the within- and cross-ancestry R-squared parameters.
The note expands on what is presented in the main text and shows that these parameters (and functions thereof, such as the relative accuracy) are functions of (i) allele frequencies (which impact the variance of genotypes at each locus), (ii) linkage-disequilibrium patterns (which impact the correlation of genotypes at markers and causal loci), and (iii) the effects of causal alleles (which in MC-ANOVA we integrate out by averaging over MC replicates).

Under the framework described in the Results section, for ancestry group 1 the causal model (expression [1] in the main text) is:

$$y_{i1} = \mathbf{z}_{i1}'\boldsymbol{\alpha} + \varepsilon_{i1} \tag{SE1}$$

where $y_{i1}$ is the phenotype of the $i$th individual (from ancestry group 1), $\mathbf{z}_{i1}$ is the vector of QTL genotypes of the same individual, and $\boldsymbol{\alpha}$ is the vector of QTL effects. To isolate the effects of LD and allele frequency differences between ancestry groups in cross-ancestry prediction, MC-ANOVA assumes that the same causal model holds in both ancestries; hence, $\boldsymbol{\alpha}$ does not have a population subscript, and for ancestry group 2 (SE1) becomes $y_{i2} = \mathbf{z}_{i2}'\boldsymbol{\alpha} + \varepsilon_{i2}$.

The instrumental model (expression [2] in the main text) for ancestry group 1 can be written as:

$$y_{i1} = \mathbf{x}_{i1}'\boldsymbol{\beta}_1 + \varepsilon_{i1} \tag{SE2}$$

where $\mathbf{x}_{i1}$ is the vector of marker (SNP) genotypes for individual $i$ from ancestry group 1.
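Points (i) and (ii) above can be made concrete with a single marker and a single QTL. The following is a minimal sketch, not part of MC-ANOVA itself: it assumes Hardy-Weinberg equilibrium and random mating (so a genotype is the sum of two independently drawn haplotypes), and the haplotype frequencies, the function name `genotype_moments`, and the two "groups" are hypothetical values chosen only for illustration. Under those assumptions the genotype variance at a locus is 2p(1-p) and the genotype covariance between two loci is 2D, where D is the usual LD coefficient.

```python
# Toy example: one marker (alleles A/a) and one QTL (alleles B/b).
# Assumption: Hardy-Weinberg equilibrium and random mating, so each genotype
# is the sum of two independently drawn haplotypes.

def genotype_moments(hap_freqs):
    """Return Var(x), Var(z), Cov(x, z) and the LD r^2 from haplotype frequencies.

    hap_freqs maps the haplotypes "AB", "Ab", "aB", "ab" to frequencies that
    sum to one; x counts allele A at the marker, z counts allele B at the QTL.
    """
    p_a = hap_freqs["AB"] + hap_freqs["Ab"]  # frequency of A at the marker
    p_b = hap_freqs["AB"] + hap_freqs["aB"]  # frequency of B at the QTL
    d = hap_freqs["AB"] - p_a * p_b          # LD coefficient D
    var_x = 2 * p_a * (1 - p_a)              # allele frequency drives the variance
    var_z = 2 * p_b * (1 - p_b)
    cov_xz = 2 * d                           # LD drives the covariance
    r2 = cov_xz ** 2 / (var_x * var_z)       # squared genotype correlation
    return var_x, var_z, cov_xz, r2

# Hypothetical haplotype frequencies for two ancestry groups (illustrative only).
group1 = {"AB": 0.4, "Ab": 0.1, "aB": 0.1, "ab": 0.4}
group2 = {"AB": 0.2, "Ab": 0.1, "aB": 0.3, "ab": 0.4}

moments1 = genotype_moments(group1)
moments2 = genotype_moments(group2)
for name, m in (("group 1", moments1), ("group 2", moments2)):
    print(f"{name}: Var(x)={m[0]:.3f} Var(z)={m[1]:.3f} Cov(x,z)={m[2]:.3f} r2={m[3]:.3f}")
```

In this toy setting the two groups share the same QTL allele frequency but differ in marker allele frequency and in LD, which is enough to change both the marker variance and the marker-QTL covariance that enter the derivation.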
Assuming the causal model of (SE1) and that the errors in (SE2) are uncorrelated with the markers, the marker effects in population 1 are

$$\boldsymbol{\beta}_1 = \mathrm{Var}(\mathbf{x}_{i1})^{-1}\,\mathrm{Cov}(\mathbf{x}_{i1}, \mathbf{z}_{i1}')\,\boldsymbol{\alpha} = \boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha} \tag{SE3}$$

where $\boldsymbol{\Sigma}_{x_1}$ is the covariance matrix among markers and $\boldsymbol{\Sigma}_{x_1 z_1}$ is the covariance matrix between the markers and the QTL, both in ancestry group 1; these matrices are functions of allele frequencies and LD patterns in population 1. Therefore, the squared correlation between the true genetic values ($\mathbf{z}_{i1}'\boldsymbol{\alpha}$) and the marker-predicted genetic scores ($\mathbf{x}_{i1}'\boldsymbol{\beta}_1$) in population 1 (expression [5] in the main text) is:

$$\mathrm{Corr}(\mathbf{x}_{i1}'\boldsymbol{\beta}_1, \mathbf{z}_{i1}'\boldsymbol{\alpha})^2 = \frac{[\boldsymbol{\beta}_1'\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha}]^2}{[\boldsymbol{\beta}_1'\boldsymbol{\Sigma}_{x_1}\boldsymbol{\beta}_1][\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1}\boldsymbol{\alpha}]} \tag{SE4}$$

Using $\boldsymbol{\beta}_1 = \boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha}$ (SE3) in (SE4) we get:

$$\mathrm{Corr}(\mathbf{x}_{i1}'\boldsymbol{\beta}_1, \mathbf{z}_{i1}'\boldsymbol{\alpha})^2 = \frac{[\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha}]^2}{[\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha}][\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1}\boldsymbol{\alpha}]} = \frac{\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha}}{\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1}\boldsymbol{\alpha}} \tag{SE5}$$

Conceptually, this result can be understood by considering the QTL ($\mathbf{Z}_1$) and the markers ($\mathbf{X}_1$) to have a joint multivariate distribution (for instance, multivariate normal) with covariance

$$\boldsymbol{\Sigma}_1 = \begin{bmatrix} \boldsymbol{\Sigma}_{z_1} & \boldsymbol{\Sigma}_{z_1 x_1} \\ \boldsymbol{\Sigma}_{x_1 z_1} & \boldsymbol{\Sigma}_{x_1} \end{bmatrix}.$$

Then the conditional covariance of the QTL given the markers, $\mathrm{Cov}(\mathbf{Z}_1 \mid \mathbf{X}_1)$, is known to be the Schur complement of $\boldsymbol{\Sigma}_{x_1}$, which is $\boldsymbol{\Sigma}_{z_1} - \boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}$. Thus, the term $\boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}$ captures the variances and covariances of the QTL that are explained by regression on the markers. Therefore, we define the within-ancestry R-squared as the squared correlation in (SE5):

$$R^2_{1\to1} = \mathrm{Corr}(\mathbf{x}_{i1}'\boldsymbol{\beta}_1, \mathbf{z}_{i1}'\boldsymbol{\alpha})^2 = \frac{\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha}}{\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1}\boldsymbol{\alpha}} \tag{SE6}$$
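The marker effects in (SE3) and the within-ancestry R-squared in (SE6) can be sketched numerically for a toy segment with two markers and two QTL. This is only an illustration under stated assumptions: the covariance matrices and QTL effects below are made-up numbers (not estimates from any data set), the helper names (`solve`, `matvec`, `dot`, `within_r2`) are hypothetical, and a small Gauss-Jordan solver stands in for a linear algebra library so the block runs with the standard library alone.

```python
# Sketch of (SE3) and (SE6) for a toy segment: 2 markers, 2 QTL.
# All matrices and effects are made-up illustrative values.

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(u, v))

def within_r2(sigma_x, sigma_xz, sigma_z, alpha):
    """Return (beta_1, R^2_{1->1}) following (SE3) and (SE6)."""
    c = matvec(sigma_xz, alpha)                 # Sigma_{x1 z1} alpha
    beta = solve(sigma_x, c)                    # (SE3): Sigma_{x1}^{-1} Sigma_{x1 z1} alpha
    explained = dot(c, beta)                    # alpha' S_zx S_x^{-1} S_xz alpha
    total = dot(alpha, matvec(sigma_z, alpha))  # alpha' Sigma_{z1} alpha
    return beta, explained / total

sigma_x1 = [[0.5, 0.2], [0.2, 0.4]]    # covariance among markers
sigma_x1z1 = [[0.3, 0.1], [0.1, 0.2]]  # rows: markers, columns: QTL
sigma_z1 = [[0.5, 0.1], [0.1, 0.4]]    # covariance among QTL
alpha = [1.0, 0.5]                     # hypothetical QTL effects

beta1, r2_11 = within_r2(sigma_x1, sigma_x1z1, sigma_z1, alpha)
print("beta_1 =", beta1)
print("within-ancestry R-squared =", r2_11)
```

In MC-ANOVA the QTL effects are not fixed as they are here; as noted in the text, they are integrated out by averaging over Monte Carlo replicates, which in a sketch like this would correspond to repeating the calculation over many sampled `alpha` vectors.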
Expression (SE6) is equivalent to expression [5] in the main text, with ancestry group indices added here.

To derive the cross-ancestry correlation, we define the following (co)variance matrices for ancestry group 2:

$$\mathrm{Var}(\mathbf{x}_{i2}) = \boldsymbol{\Sigma}_{x_2}, \quad \mathrm{Cov}(\mathbf{x}_{i2}, \mathbf{z}_{i2}') = \boldsymbol{\Sigma}_{x_2 z_2}, \quad \text{and} \quad \mathrm{Var}(\mathbf{z}_{i2}) = \boldsymbol{\Sigma}_{z_2}. \tag{SE7}$$

Using the marker effects from ancestry group 1 to predict the genetic values in ancestry group 2 ($\mathbf{z}_{i2}'\boldsymbol{\alpha}$), the cross-ancestry R-squared is:

$$\mathrm{Corr}(\mathbf{x}_{i2}'\boldsymbol{\beta}_1, \mathbf{z}_{i2}'\boldsymbol{\alpha})^2 = \frac{[\boldsymbol{\beta}_1'\,\mathrm{Cov}(\mathbf{x}_{i2}, \mathbf{z}_{i2}')\,\boldsymbol{\alpha}]^2}{\mathrm{Var}(\mathbf{x}_{i2}'\boldsymbol{\beta}_1)\,\mathrm{Var}(\mathbf{z}_{i2}'\boldsymbol{\alpha})} = \frac{[\boldsymbol{\beta}_1'\boldsymbol{\Sigma}_{x_2 z_2}\boldsymbol{\alpha}]^2}{[\boldsymbol{\beta}_1'\boldsymbol{\Sigma}_{x_2}\boldsymbol{\beta}_1][\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_2}\boldsymbol{\alpha}]} \tag{SE8}$$

Replacing $\boldsymbol{\beta}_1$ with the right-hand side of (SE3) we get:

$$R^2_{1\to2} = \mathrm{Corr}(\mathbf{x}_{i2}'\boldsymbol{\beta}_1, \mathbf{z}_{i2}'\boldsymbol{\alpha})^2 = \frac{[\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_2 z_2}\boldsymbol{\alpha}]^2}{[\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_1 x_1}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_2}\boldsymbol{\Sigma}_{x_1}^{-1}\boldsymbol{\Sigma}_{x_1 z_1}\boldsymbol{\alpha}][\boldsymbol{\alpha}'\boldsymbol{\Sigma}_{z_2}\boldsymbol{\alpha}]} \tag{SE9}$$

It is interesting to compare the quadratic forms involved in the within- and cross-ancestry R-squared parameters (expressions SE6 and SE9). If the variances of genotypes at individual loci and the LD patterns are the same in both ancestry groups (i.e., if $\boldsymbol{\Sigma}_{z_1} = \boldsymbol{\Sigma}_{z_2}$, $\boldsymbol{\Sigma}_{x_1} = \boldsymbol{\Sigma}_{x_2}$, and $\boldsymbol{\Sigma}_{x_1 z_1} = \boldsymbol{\Sigma}_{x_2 z_2}$), the two R-squared values are identical.

The variance of MC-ANOVA predicted relative accuracies

Following Wang et al.¹, the variance of a ratio is approximately:

$$\mathrm{Var}\!\left(\frac{x}{y}\right) \approx \left(\frac{E(x)}{E(y)}\right)^2 \left[\frac{\mathrm{Var}(x)}{E(x)^2} + \frac{\mathrm{Var}(y)}{E(y)^2} - 2\,\frac{\mathrm{Cov}(x, y)}{E(x)E(y)}\right]. \tag{SE10}$$

For notation purposes, let the relative accuracy (RA, [7]) be denoted as:

$$\mathrm{RA} = \frac{R^2_{1\to2}}{R^2_{1\to1}} = \frac{R_2^2}{R_1^2}. \tag{SE11}$$

Plugging (SE11) into the general formula in (SE10), we obtain:

$$\mathrm{Var}\!\left(\frac{R_2^2}{R_1^2}\right) \approx \left(\frac{E(R_2^2)}{E(R_1^2)}\right)^2 \left[\frac{\mathrm{Var}(R_2^2)}{E(R_2^2)^2} + \frac{\mathrm{Var}(R_1^2)}{E(R_1^2)^2} - 2\,\frac{\mathrm{Cov}(R_2^2, R_1^2)}{E(R_2^2)E(R_1^2)}\right]. \tag{SE12}$$

Replacing the expected value with the MC-ANOVA estimate, $E(R_*^2) = R_*^2$ ($* = 1, 2$), and assuming $\mathrm{Cov}(R_2^2, R_1^2) = 0$ since the ancestry cohorts are independent of one another, we get:

$$\mathrm{Var}\!\left(\frac{R_2^2}{R_1^2}\right) \approx \left(\frac{R_2^2}{R_1^2}\right)^2 \left[\frac{\mathrm{Var}(R_2^2)}{R_2^4} + \frac{\mathrm{Var}(R_1^2)}{R_1^4}\right]. \tag{SE13}$$
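The cross-ancestry R-squared (SE8), the relative accuracy (SE11), and the ratio-variance approximation (SE13) above can be sketched together on a toy segment. As before, this is an illustration under stated assumptions rather than the MC-ANOVA implementation: every matrix, effect, and variance below is a made-up number, the function names are hypothetical, and the variances passed to `ra_variance` are placeholders for whatever estimates of Var(R-squared) are available.

```python
# Sketch of (SE3), (SE8), (SE11) and (SE13) with made-up inputs.

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def r2_for_target(beta1, sigma_x, sigma_xz, sigma_z, alpha):
    """(SE8): squared correlation in a target group using group-1 marker effects.

    Passing group-1 matrices gives the within-ancestry value (SE4).
    """
    num = dot(beta1, matvec(sigma_xz, alpha)) ** 2
    den = dot(beta1, matvec(sigma_x, beta1)) * dot(alpha, matvec(sigma_z, alpha))
    return num / den

def ra_variance(r2_cross, r2_within, var_cross, var_within):
    """(SE13): variance of the ratio, assuming zero covariance between groups."""
    ra = r2_cross / r2_within
    return ra ** 2 * (var_cross / r2_cross ** 2 + var_within / r2_within ** 2)

alpha = [1.0, 0.5]  # hypothetical QTL effects shared by both groups
# Group 1 (training) and group 2 (testing) covariance matrices, illustrative only.
sx1, sxz1, sz1 = [[0.5, 0.2], [0.2, 0.4]], [[0.3, 0.1], [0.1, 0.2]], [[0.5, 0.1], [0.1, 0.4]]
sx2, sxz2, sz2 = [[0.4, 0.1], [0.1, 0.5]], [[0.2, 0.1], [0.1, 0.1]], [[0.4, 0.1], [0.1, 0.5]]

beta1 = solve(sx1, matvec(sxz1, alpha))                # (SE3)
r2_within = r2_for_target(beta1, sx1, sxz1, sz1, alpha)
r2_cross = r2_for_target(beta1, sx2, sxz2, sz2, alpha)
ra = r2_cross / r2_within                              # (SE11)
# Placeholder variances of the two R-squared estimates (assumed values).
se_ra = ra_variance(r2_cross, r2_within, 4e-4, 1e-4) ** 0.5
print(f"R2_within={r2_within:.4f} R2_cross={r2_cross:.4f} RA={ra:.4f} SE(RA)={se_ra:.4f}")
```

Calling `r2_for_target` with the group-1 matrices in both roles reproduces the within-ancestry value, which mirrors the remark in the text that the two R-squared parameters coincide when the (co)variance matrices are identical across groups.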
For the $\mathrm{Var}(R_*^2)$ terms in (SE13), we must consider two sources of uncertainty: the sampling variance of the estimator (resulting from the use of a finite sample size) and the Monte Carlo error. Therefore,

$$\mathrm{Var}(R_*^2) = \mathrm{MC\_variance}(R_*^2) + \left(\frac{4}{n_*}\right) R_*^2 (1 - R_*^2), \tag{SE14}$$

where the MC_variance component is the variance of the estimate over the 300 Monte Carlo replications, and the sampling variance component comes from the same Taylor series-based derivation used in the empirical RA variance approximation (see ‘Standard error estimates’ in Methods). Thus:

$$\mathrm{Var}\!\left(\frac{R_2^2}{R_1^2}\right) \approx \left(\frac{R_2^2}{R_1^2}\right)^2 \left[\frac{\mathrm{MC\_variance}(R_2^2) + \left(\frac{4}{n_2}\right) R_2^2 (1 - R_2^2)}{R_2^4} + \frac{\mathrm{MC\_variance}(R_1^2) + \left(\frac{4}{n_1}\right) R_1^2 (1 - R_1^2)}{R_1^4}\right]. \tag{SE15}$$

The standard error bars presented in the portability maps are the square root of (SE15).

Supplementary Data

Supplementary Figures

Figure B1: Loadings in the first two SNP-derived principal components (PC) colored by ancestry. The inner and outer ellipses represent the 68th and 99th percentiles of the PC loadings of each ancestry (European [EUR], African [AF], Caribbean [CR], East Asian [EA], and South Asian [SA]).
A random sample of 3,000 EUR ancestry individuals was selected for plotting.

Figure B2: Comparing the UK Biobank and HapMap SNP set estimates. The cross-ancestry (European [EUR] to non-EUR) R-squared [6] distributions for each SNP set (HapMap variants compared to UK Biobank arrays) and each ancestry group (African, Caribbean, East Asian, and South Asian). Each panel displays a different non-EUR ancestry group. The bottom line of each box represents the first quartile, the next line is the median, and the top line is the third quartile. Vertical lines extend from the first (third) quartile to the minimum (maximum), and outliers are represented by blue points.

Figure B3: MC-ANOVA predicted relative accuracy (RA) versus empirical RA using SNPs from the HapMap variants. Predicted compared to empirical RA of European (EUR)-derived polygenic scores when used to predict phenotypes of individuals of non-EUR ancestry (AF, CR, EA, and SA denote African, Caribbean, East Asian, and South Asian ancestry). Each panel displays a different phenotype. The loss of accuracy (LOA, %) attributable to allele frequency and LD differences between ancestries is shown on top of each bar set. A standard error bar is shown for each mean RA estimate (derivation details for the predicted RA are in the Supplementary Methods and details for the empirical RA are in the Methods). The sample sizes used to derive the standard errors are in Table B2. See Figure 5 for results based on SNPs from the UK Biobank arrays.

Figure B4: Polygenic score relative accuracy and loss of accuracy by method. [Panels: a, relative accuracy; b, loss of accuracy (LOA) by trait.] (a) Predicted and empirical relative accuracy (RA) for four traits (height, high-density lipoprotein [HDL], low-density lipoprotein [LDL], and body mass index [BMI]) in the African (AF) ancestry group (compared to European).
(b) Loss of accuracy (LOA) explained by genome differentiation for four traits in the AF group by method: MC-ANOVA (using SNPs from the UK Biobank arrays) and the values reported in Wang et al. (2020)¹. The sample sizes used to derive the standard errors for the MC-ANOVA mean RA are in Table B2.

Figure B5: Cross-ancestry MC-ANOVA predicted R-squared by chromosome and position based on SNPs from the UK Biobank arrays. Each dot represents the estimated $R^2_{1\to2}$ [6] for a chromosome segment (with an average length of 45 Kbp) by the ancestry group of the testing data: AF=African (a), CR=Caribbean (b), EA=East Asian (c), and SA=South Asian (d). Ancestry group 1 is European (EUR). The green line is the 80th percentile value of $R^2_{1\to2}$, the blue short dashed line is the 60th percentile, and the red long dashed line is the 50th percentile. See Figure B7 for results based on HapMap SNPs.

Figure B6: Within- and cross-ancestry R-squared distributions based on HapMap SNPs. Distribution of the cross-ancestry R-squared (R-sq.) versus the within-European (EUR) R-squared for the African, Caribbean, East Asian, and South Asian (AF, CR, EA, and SA, respectively) ancestry groups. Each panel displays a different non-EUR ancestry group. Each point represents a small chromosome segment (23 Kbp on average). Each subplot has dashed gray lines at the 10th, 50th, and 90th percentiles of the distribution and a red dashed 45-degree reference line (slope of one and intercept at zero). There is a white point at the intersection of the within-ancestry R-squared median and the cross-ancestry R-squared median. See Figure 6 for results based on SNPs from the UK Biobank arrays.
Figure B7: Cross-ancestry MC-ANOVA predicted R-squared (R-sq.) by chromosome and position based on SNPs from the HapMap variants. Each dot represents the estimated $R^2_{1\to2}$ [6] for a chromosome segment from the HapMap SNP set (with an average length of 23 Kbp) by the ancestry group of the testing data: AF=African (a), CR=Caribbean (b), EA=East Asian (c), and SA=South Asian (d). Ancestry group 1 is European (EUR). The green line is the 80th percentile value of $R^2_{1\to2}$, the blue short dashed line is the 60th percentile, and the red long dashed line is the 50th percentile. See Figure B5 for results based on SNPs from the UK Biobank arrays.

Figure B8: Distribution of the cross-ancestry R-squared (R-sq.) versus the within-European (EUR) R-squared for the African (AF), Caribbean (CR), East Asian (EA), and South Asian (SA) ancestry groups. Each point represents a small chromosome segment (45 Kbp) from the UK Biobank arrays. Each panel displays a different non-European (EUR) ancestry group. Five hundred segments were randomly sampled and plotted for each ancestry group. The standard error bars for each R-squared (R-sq.) point estimate are shown with the cross bars. The sample sizes used to derive the standard errors are in Table B2.

Figure B9: Difference between within- and cross-ancestry polygenic score prediction correlation of European (EUR) derived polygenic scores by ancestry and SNP portability group based on SNPs from the UK Biobank arrays.
The vertical axis represents the difference between the within- and cross-ancestry polygenic score prediction correlation for SNP groups with Very Low, Low, Medium, and High MC-ANOVA predicted portability ($R^2_{1\to2}$ groupings, Table 3) by trait (height, high-density lipoprotein [HDL], serum urate, low-density lipoprotein [LDL], body mass index [BMI], and glucose) and ancestry group (CR=Caribbean [a], EA=East Asian [b], and SA=South Asian [c]). A positive difference in PGS prediction correlation indicates that the PGS of the SNP set had a higher prediction correlation in EURs (within-ancestry prediction) than in individuals of CR, EA, or SA (cross-ancestry prediction) ancestry. The number of SNPs entering each PGS is annotated toward the bottom of each subplot. A standard error bar for each prediction correlation difference is shown and details for the calculation can be found in the Methods. The gray vertical bars are the simulated null distribution (mean ± standard error) for the correlation difference, where SNPs were assigned to portability groups completely at random, maintaining the number of SNPs in each subgroup. The sample sizes for the simulated null distribution are in Table B2.

Figure B10: Predicted and empirical relative accuracies (RA) by SNP portability group by trait and ancestry group based on SNPs from the UK Biobank arrays. MC-ANOVA predicted relative accuracy (RA) and empirical RA of European (EUR)-derived polygenic scores when used to predict phenotypes of individuals of non-EUR ancestry (AF, CR, EA, and SA denote African, Caribbean, East Asian, and South Asian ancestry) by SNP portability group for six traits (height, high-density lipoprotein [HDL], serum urate [SU], low-density lipoprotein [LDL], body mass index [BMI], and glucose). Each panel displays a different phenotype-ancestry group combination.
The loss of accuracy (LOA, %) attributable to genome differentiation is shown on top of each bar set. A standard error bar is shown for each mean RA estimate (derivation details are in the Methods). The sample sizes used to derive the standard errors are in Table B2.

Figure B11: Validation plots for the HapMap SNP set. The difference between polygenic score prediction correlations by HapMap SNP portability group. The vertical axis represents the difference between the within- and cross-ancestry polygenic score prediction correlations of European (EUR) derived polygenic scores for SNP groups with Very Low, Low, Medium, and High MC-ANOVA predicted portability ($R^2_{1\to2}$ groupings) by trait (height, high-density lipoprotein [HDL], serum urate, low-density lipoprotein [LDL], body mass index [BMI], and glucose) and ancestry group (AF=African [a], CR=Caribbean [b], EA=East Asian [c], and SA=South Asian [d]). A positive difference in PGS prediction correlation indicates that the PGS of the SNP set had a higher prediction correlation in EURs (within-ancestry prediction) than in individuals of AF, CR, EA, or SA (cross-ancestry prediction) ancestry. The number of SNPs entering each PGS is annotated toward the bottom of each subplot. A standard error bar for each prediction correlation difference is shown and details for the calculation can be found in the Methods. The gray vertical bars are the simulated null distribution (mean ± standard error) for the correlation difference, where SNPs were assigned to portability groups completely at random, maintaining the number of SNPs in each subgroup. The sample sizes for the simulated null distribution are in Table B2. See Figure 7 and Figure B9 for results based on SNPs from the UK Biobank arrays.
Figure B12: Difference between within- and cross-ancestry polygenic score prediction correlation of European (EUR) derived polygenic scores by ancestry and SNP portability groups based on different methods (using SNPs from the UK Biobank arrays). [Panels: a, Fst² compared to MC-ANOVA; b, Wang et al.’s¹ RA compared to MC-ANOVA. Legends: SNP portability group and RA method; panel columns by ancestry group (African, Caribbean, East Asian, South Asian).] The vertical axis represents the difference between the within- and cross-ancestry polygenic score prediction correlation for SNP groups with Very Low, Low, Medium, and High predicted portability determined from different methods (Fst² vs. MC-ANOVA [a] and Wang et al. RA¹ vs. MC-ANOVA [b]) by trait (height, high-density lipoprotein [HDL], serum urate, low-density lipoprotein [LDL], body mass index [BMI], and glucose) and ancestry group (AF=African, CR=Caribbean, EA=East Asian, and SA=South Asian). A positive difference in PGS prediction correlation indicates that the PGS of the SNP set had a higher prediction correlation in EURs (within-ancestry prediction) than in individuals of AF, CR, EA, or SA (cross-ancestry prediction) ancestry. Within (a) and (b), the panels are first grouped by ancestry group and then by trait. A standard error bar for each prediction correlation difference is shown and details for the calculation can be found in the Methods. The gray vertical bars are the simulated null distribution (mean ± standard error) for the correlation difference, where SNPs were assigned to portability groups completely at random, maintaining the number of SNPs in each subgroup. The sample sizes for the simulated null distribution are in Table B2.
Figure B13: Predicted cross-ancestry R-squared [6] by number of SNPs in the chromosome segment (including the core and the SNPs in the flanking regions) by ancestry of the testing data (AF=African, CR=Caribbean, EA=East Asian, and SA=South Asian). Each panel displays a different ancestry group. The results are based on SNPs from the UK Biobank arrays. For European (EUR) the plot displays the within-ancestry R-squared parameter [5].

Figure B14: PGS relative accuracies (RA) and correlation differences in the ARIC study data set. (a) MC-ANOVA predicted relative accuracy (RA) versus empirical RA of UK Biobank European (EUR)-derived polygenic scores (PGS) when used to predict phenotypes of individuals from the ARIC study. The within-ancestry testing group is the ARIC study European Americans (AEA) and the cross-ancestry group is the ARIC study African Americans (AAA). A standard error (SE) bar is shown for each mean empirical RA estimate. The loss of accuracy (LOA, %) attributable to genome differentiation is shown on top of each bar set. (b) The vertical axis represents the difference between the within- and cross-ancestry polygenic score prediction correlation for SNP groups with Very Low, Low, Medium, and High MC-ANOVA predicted portability ($R^2_{1\to2}$ groupings) by trait (height, serum urate, and body mass index [BMI]). A positive difference in PGS prediction correlation indicates that the PGS of the SNP set had a higher prediction correlation in AEA (within-ancestry prediction) relative to AAA (cross-ancestry prediction) ancestry. The number of SNPs entering each PGS is annotated toward the bottom of each subplot. A standard error bar for each prediction correlation difference is shown and details for the calculation can be found in the Methods.
The gray vertical bars are the simulated null distribution (mean ± standard error) for the correlation difference, where SNPs were assigned to portability groups completely at random, maintaining the number of SNPs in each subgroup. For both a and b, the sample sizes for the SE bars and the simulated null are n=9,628 AEA and n=3,130 AAA for height, n=9,627 AEA and n=3,046 AAA for serum urate, and n=9,625 AEA and n=3,127 AAA for BMI.

Figure B15: (a, b) Cross-ancestry R-squared by chromosome and position for varying numbers of causal variants in the segment and numbers of SNPs in the flanking regions (all results based on SNPs from the UK Biobank arrays). Each dot represents the estimated $R^2_{1\to2}$ [6] for a chromosome segment for the AF=African ancestry group by (a) the number of sampled QTL and (b) the number of SNPs included in the flanking regions to each side of the chromosome segment. (c, d) Cross-ancestry R-squared (R-sq.) from the baseline model (three QTL and ten flanking SNPs) subtracted from the model varying either the number of causal variants in the segment or the number of SNPs in the flank (based on SNPs from the UK Biobank arrays). Each histogram shows the distribution of the difference in $R^2_{1\to2}$ [6] between the sensitivity model and the baseline model (sensitivity minus baseline) for the AF=African ancestry group by (c) the number of sampled QTL and (d) the number of SNPs included in the flanking regions to each side of the chromosome segment. There is a vertical red line at R-squared equals zero. [Panels: a, varying the number of QTL per segment; b, varying the number of flanking SNPs to the sides of each segment; c, varying the number of QTL per segment compared to the baseline method (3 QTL). Histogram axes: count versus R-sq. (EUR→AF) difference from baseline.]
[Figure B15, panel d: varying the flanking SNPs to the sides of each segment compared to the baseline method (3 QTL). Histogram axes: count versus R-sq. (EUR→AF) difference from baseline method.]

Figure B16: Cross- and within-ancestry R-squared for different causal variant effect distributions based on SNPs from the UK Biobank arrays. The MC-ANOVA cross-ancestry R-squared (R-sq.) estimates for the African (AF) ancestry group (a, EUR → AF) and the within-ancestry (European [EUR]) R-squared estimates (b, EUR → EUR) when drawing causal variant effects from a normal distribution (shown in the main results) compared to a gamma distribution with a shape parameter of 1.5 and rate parameter of one. The pairwise Pearson correlation is noted for each subplot.

Supplementary Tables

Table B1: Average R-squared and relative accuracy (RA) by testing set using HapMap SNPs.

| Ancestry Group | Sample Size | R-squared ($R^2_{1\to2}$)* | Relative Accuracy ($R^2_{1\to2}/R^2_{1\to1}$) | Standard Error of the RA*** | Variance in RA Across Segments*** |
|---|---|---|---|---|---|
| European (EUR) | 230,000 | 0.926** | 1.000 | | |
| African (AF) | 3,083 | 0.596 | 0.638 | 0.021 | 0.030 |
| Caribbean (CR) | 3,343 | 0.629 | 0.674 | 0.020 | 0.026 |
| East Asian (EA) | 1,329 | 0.814 | 0.875 | 0.022 | 0.012 |
| South Asian (SA) | 7,919 | 0.868 | 0.935 | 0.010 | 0.003 |

* Subscript 1 always indicates an EUR training or testing set; 2 indicates non-EUR testing; ** $R^2_{1\to1}$; *** Median.

Table B2: Descriptive statistics by ancestry group in the UK Biobank data set. Continuous variables are reported as the mean ± standard deviation and are followed by the number of samples missing in parentheses.
| Variable | Units | European (EUR) Training | EUR Testing | South Asian (SA) | East Asian (EA) | Caribbean (CR) | African (AF) |
|---|---|---|---|---|---|---|---|
| Total Sample Size | | 230,000 | 6,698 | 7,919 | 1,329 | 3,343 | 3,083 |
| Female | % | 52.8 | 52.4 | 45.6 | 62.8 | 62.7 | 48.5 |
| Age | years | 56.8 ± 8.0 | 57.0 ± 7.9 | 53.2 ± 8.5 | 52.4 ± 7.6 | 52.8 ± 8.1 | 50.8 ± 7.9 |
| Height | cm | 169.1 ± 9.2 | 169.2 ± 9.3 | 164.4 ± 8.9 (305) | 162.0 ± 7.7 (25) | 167.3 ± 8.6 (51) | 167.7 ± 8.6 (66) |
| HDL | mmol/L | 1.5 ± 0.4 | 1.5 ± 0.4 | 1.3 ± 0.3 (1,053) | 1.5 ± 0.4 (172) | 1.5 ± 0.4 (428) | 1.4 ± 0.4 (406) |
| Serum Urate | umol/L | 309.8 ± 80.1 | 310.5 ± 79.8 | 318.7 ± 79.8 (410) | 311 ± 76.9 (62) | 305.5 ± 81.7 (183) | 318.7 ± 80.5 (207) |
| LDL | mmol/L | 3.6 ± 0.9 | 3.6 ± 0.9 | 3.3 ± 0.9 (419) | 3.4 ± 0.8 (61) | 3.3 ± 0.8 (187) | 3.2 ± 0.8 (209) |
| BMI | kg/m2 | 27.4 ± 4.7 | 27.4 ± 4.7 | 27.1 ± 4.4 (169) | 24.1 ± 3.4 (8) | 29.3 ± 5.5 (50) | 29.6 ± 5.1 (51) |
| Serum Glucose | mmol/L | 5.1 ± 1.2 | 5.1 ± 1.2 | 5.4 ± 1.9 (1,048) | 5.1 ± 1.0 (172) | 5.2 ± 1.6 (434) | 5.1 ± 1.5 (407) |

Table B3: Estimated relative accuracy (RA) of SNP windows grouped by the estimated cross-ancestry R-squared ($R^2_{1\to2}$ for 1=European [EUR] and 2=testing set) for the Caribbean (CR), East Asian (EA), and South Asian (SA) ancestry groups using SNPs from the UK Biobank array.
| Testing Group | Portability Group | Quantile Group | $R^2_{1\to2}$ Cutoff Range | Number of SNPs | Average $R^2_{1\to1}$ | Average $R^2_{1\to2}$ | Average RA ($R^2_{1\to2}/R^2_{1\to1}$) |
|---|---|---|---|---|---|---|---|
| Caribbean (CR) | High | (0.8,1] | (0.31,0.98] | 122,158 | 0.752 | 0.447 | 0.592 |
| Caribbean (CR) | Medium | (0.6,0.8] | (0.23,0.31] | 122,157 | 0.675 | 0.266 | 0.400 |
| Caribbean (CR) | Low | (0.5,0.6] | (0.20,0.23] | 61,073 | 0.645 | 0.211 | 0.334 |
| Caribbean (CR) | Very Low | [0,0.5] | [0,0.20] | 305,403 | 0.596 | 0.128 | 0.216 |
| East Asian (EA) | High | (0.8,1] | (0.52,0.98] | 112,673 | 0.771 | 0.642 | 0.835 |
| East Asian (EA) | Medium | (0.6,0.8] | (0.41,0.52] | 112,685 | 0.685 | 0.460 | 0.678 |
| East Asian (EA) | Low | (0.5,0.6] | (0.36,0.41] | 56,332 | 0.652 | 0.384 | 0.596 |
| East Asian (EA) | Very Low | [0,0.5] | [0,0.36] | 281,711 | 0.592 | 0.240 | 0.405 |
| South Asian (SA) | High | (0.8,1] | (0.62,0.98] | 122,156 | 0.784 | 0.712 | 0.908 |
| South Asian (SA) | Medium | (0.6,0.8] | (0.53,0.62] | 122,151 | 0.694 | 0.575 | 0.831 |
| South Asian (SA) | Low | (0.5,0.6] | (0.50,0.53] | 61,082 | 0.656 | 0.516 | 0.791 |
| South Asian (SA) | Very Low | [0,0.5] | [0.04,0.50] | 305,402 | 0.573 | 0.395 | 0.689 |

Table B4: The number of SNPs that were selected for each trait (height, high-density lipoprotein [HDL], serum urate, low-density lipoprotein [LDL], body mass index [BMI], and glucose) by the threshold used for the p-value in the GWAS (based on a two-sided test of a t-statistic, with the null hypothesis that the SNP effect is zero), and the proportion of variance of the (adjusted) phenotype explained by the European (EUR)-derived PGS in testing data, by ancestry group (African [AF], Caribbean [CR], East Asian [EA], South Asian [SA], ARIC European American [AEA], and ARIC African American [AAA]) using SNPs from the UK Biobank array.
| Variable | # SNPs (p<1e-5) | # SNPs for GWAS (p<5e-8) | Proportion of Variance Explained in EUR (%) | in AF (%) | in CR (%) | in EA (%) | in SA (%) | in ARIC AEA (%) | in ARIC AAA (%) |
|---|---|---|---|---|---|---|---|---|---|
| Height | 11,675 | 6,907 | 27.4 | 3.9 | 6.7 | 9.7 | 15.0 | 22.0 | 7.6 |
| HDL | 3,609 | 1,967 | 18.0 | 6.6 | 7.5 | 9.3 | 13.2 | -- | -- |
| Serum Urate | 3,151 | 1,751 | 11.4 | 5.1 | 4.6 | 6.2 | 8.8 | 9.0 | 2.7 |
| LDL | 2,272 | 1,210 | 10.1 | 6.0 | 7.9 | 5.4 | 4.1 | -- | -- |
| BMI | 2,371 | 830 | 3.8 | 0.3 | 0.9 | 0.9 | 2.8 | 3.6 | 0.9 |
| Glucose | 938 | 338 | 1.9 | 0.3 | 0.4 | 0.5 | 0.8 | -- | -- |

Table B5: The average cross-ancestry R-squared, $R^2_{1\to2}$ [6], by chromosome (Chr) and ancestry group, and the number of annotated genes for each using SNPs from the UK Biobank array.

| African (AF): Chr | Average $R^2_{1\to2}$ | # Genes | Caribbean (CR): Chr | Average $R^2_{1\to2}$ | # Genes | East Asian (EA): Chr | Average $R^2_{1\to2}$ | # Genes | South Asian (SA): Chr | Average $R^2_{1\to2}$ | # Genes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.335 | 1060 | 6 | 0.379 | 1060 | 6 | 0.515 | 1041 | 6 | 0.605 | 1060 |
| 11 | 0.241 | 1303 | 11 | 0.292 | 1304 | 19 | 0.464 | 1351 | 17 | 0.571 | 1125 |
| 19 | 0.236 | 1406 | 16 | 0.286 | 802 | 17 | 0.455 | 1081 | 19 | 0.571 | 1406 |
| 16 | 0.233 | 802 | 17 | 0.281 | 1125 | 11 | 0.45 | 1254 | 22 | 0.569 | 463 |
| 17 | 0.232 | 1125 | 19 | 0.28 | 1406 | 21 | 0.448 | 237 | 11 | 0.565 | 1304 |
| 5 | 0.222 | 876 | 5 | 0.272 | 876 | 22 | 0.442 | 456 | 16 | 0.558 | 802 |
| 15 | 0.22 | 607 | 7 | 0.271 | 932 | 16 | 0.432 | 756 | 21 | 0.554 | 244 |
| 7 | 0.219 | 932 | 3 | 0.27 | 1122 | 1 | 0.43 | 1935 | 15 | 0.551 | 607 |
| 3 | 0.218 | 1122 | 22 | 0.268 | 463 | 5 | 0.428 | 847 | 1 | 0.547 | 2031 |
| 1 | 0.215 | 2031 | 15 | 0.268 | 607 | 15 | 0.421 | 591 | 2 | 0.546 | 1278 |
| 22 | 0.215 | 463 | 1 | 0.267 | 2031 | 7 | 0.418 | 910 | 7 | 0.544 | 932 |
| 2 | 0.212 | 1278 | 2 | 0.262 | 1278 | 2 | 0.417 | 1230 | 5 | 0.543 | 876 |
| 12 | 0.21 | 1053 | 12 | 0.26 | 1058 | 3 | 0.415 | 1068 | 20 | 0.54 | 564 |
| 10 | 0.21 | 776 | 20 | 0.259 | 564 | 10 | 0.415 | 762 | 12 | 0.537 | 1058 |
| 21 | 0.208 | 244 | 10 | 0.258 | 776 | 4 | 0.411 | 777 | 14 | 0.536 | 650 |
| 20 | 0.208 | 564 | 21 | 0.254 | 244 | 14 | 0.411 | 622 | 3 | 0.536 | 1122 |
| 14 | 0.207 | 650 | 9 | 0.254 | 768 | 20 | 0.41 | 545 | 10 | 0.536 | 776 |
| 9 | 0.206 | 768 | 4 | 0.252 | 802 | 12 | 0.405 | 1006 | 9 | 0.533 | 768 |
| 4 | 0.205 | 802 | 14 | 0.251 | 650 | 9 | 0.403 | 739 | 4 | 0.527 | 802 |
| 18 | 0.193 | 312 | 18 | 0.237 | 312 | 18 | 0.395 | 308 | 18 | 0.521 | 312 |
| 13 | 0.188 | 383 | 8 | 0.237 | 715 | 13 | 0.389 | 373 | 13 | 0.519 | 383 |
| 8 | 0.186 | 715 | 13 | 0.236 | 383 | 8 | 0.387 | 695 | 8 | 0.51 | 715 |

Table B6: The top fifteen most portable annotated genes (largest $R^2_{1\to2}$ [6]) for each ancestry group and the associated chromosome (Chr) and number of SNPs in each gene using SNPs from the UK Biobank array. The gene that was common between all ancestry groups is noted with an asterisk.

| Rank | African (AF): Gene | Chr | # SNPs | Caribbean (CR): Gene | Chr | # SNPs | East Asian (EA): Gene | Chr | # SNPs | South Asian (SA): Gene | Chr | # SNPs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HIST1H2AD | 6 | 1 | TCF19 | 6 | 13 | LOC100129195 | 6 | 4 | ZBTB22 | 6 | 4 |
| 2 | HLA-DQB1 | 6 | 84 | HLA-DQB1 | 6 | 84 | ZSCAN16 | 6 | 5 | HIST1H1T | 6 | 7 |
| 3 | TCF19 | 6 | 13 | HLA-F-AS1* | 6 | 48 | HLA-F-AS1* | 6 | 48 | OR51M1 | 11 | 6 |
| 4 | CCHCR1 | 6 | 58 | CCHCR1 | 6 | 58 | ZFP57 | 6 | 60 | LOC100129195 | 6 | 4 |
| 5 | HLA-DRB1 | 6 | 83 | HLA-DRB1 | 6 | 83 | ZNFX1 | 20 | 7 | ZSCAN16 | 6 | 5 |
| 6 | HLA-F-AS1* | 6 | 48 | HCG4 | 6 | 6 | HIST1H1T | 6 | 7 | B3GALT4 | 6 | 1 |
| 7 | LINC00116 | 2 | 7 | LOC554223 | 6 | 17 | HLA-F | 6 | 33 | PFDN6 | 6 | 1 |
| 8 | HLA-DOB | 6 | 20 | HIST1H2AD | 6 | 1 | ZBTB22 | 6 | 4 | WDR46 | 6 | 12 |
| 9 | SFTA2 | 6 | 7 | OR2B3 | 6 | 7 | B3GALT4 | 6 | 1 | ZFP57 | 6 | 60 |
| 10 | BAG6 | 6 | 21 | HLA-DOB | 6 | 20 | PFDN6 | 6 | 1 | HLA-F | 6 | 33 |
| 11 | HCG4 | 6 | 6 | SFTA2 | 6 | 7 | WDR46 | 6 | 12 | LOC553103 | 5 | 2 |
| 12 | LOC554223 | 6 | 17 | LINC00116 | 2 | 7 | BTNL2 | 6 | 36 | ADH1A | 4 | 4 |
| 13 | OR2B3 | 6 | 7 | BTNL2 | 6 | 36 | OR2B3 | 6 | 7 | HLA-DPB2 | 6 | 69 |
| 14 | BTNL2 | 6 | 36 | BAG6 | 6 | 21 | OR51M1 | 11 | 6 | HLA-F-AS1* | 6 | 48 |
| 15 | HLA-DQA2 | 6 | 43 | BRD2 | 6 | 19 | SFTA2 | 6 | 7 | HIST1H2BG | 6 | 25 |

Table B7: From the top fifteen most portable genes from chromosome six from any ancestry group (African [AF], Caribbean [CR], East Asian [EA], and South Asian [SA]) using SNPs from the UK Biobank arrays, the 26 unique genes are grouped by base pair (BP) position (within 50 Kbp) only or base pair position as well as functional class. The three groups based on proximity as well as class were the HIST genes, the HLA-F/V genes, and the HLA-D genes.
Genes                                                          BP position          Ancestry groups
H1-6 (HIST1H1T), H2BC8 (HIST1H2BG), H2AC7 (HIST1H2AD)          26106237-26216656    AF, CR, EA, SA
ZSCAN16-AS1 (LOC100129195), ZSCAN16                            28092306-28103691    EA, SA
OR2B3                                                          29045632-29054923    AF, CR, EA
ZFP57, HLA-F, HLA-F-AS1, HCG4, HLA-V (LOC554223)               29640785-29768123    AF, CR, EA, SA
SFTA2                                                          30899163-30900150    AF, CR, EA
CCHCR1, TCF19                                                  31108829-31130078    AF, CR
BAG6                                                           31606813-31619576    AF, CR
BTNL2                                                          32361762-32374640    AF, CR, EA
HLA-DRB1, HLA-DQB1, HLA-DQA2, HLA-DOB, HLA-DPB2                32542638-33101602    AF, CR, SA
BRD2                                                           32938199-32948804    CR
B3GALT4, WDR46, PFDN6, ZBTB22                                  33245868-33283766    EA, SA

Table B8: Prediction correlation for each trait (height, high-density lipoprotein [HDL], serum urate [SU], low-density lipoprotein [LDL], body mass index [BMI], and glucose) averaged over 50 replications in an external testing set (n-testing=300) from a cross-ancestry gradient descent algorithm for each ancestry group (AF=African, CR=Caribbean, and SA=South Asian) when using an adaptive learning rate based on relative accuracy compared to a fixed learning rate (LR).

Columns: Ancestry Group; Trait; Average Prediction Correlation (R-squared) with Fixed Learning Rate (LR); Average Prediction Correlation (R-squared) with Adaptive LR; % Change in R-squared (Fixed to Adaptive LR); % of Testing Sets in Which Using an Adaptive LR Improved Prediction R-squared*.

Ancestry Group     Trait     Fixed LR          Adaptive LR      % Change   % Improved*
African (AF)       Height    0.207 (0.043)     0.213 (0.045)    5.97       74.0
                   HDL       0.248 (0.062)     0.255 (0.065)    5.64       88.0
                   SU        0.203 (0.041)     0.206 (0.042)    2.34       56.0
                   LDL       0.227 (0.052)     0.229 (0.052)    1.70       72.0
                   BMI       0.059 (0.003)     0.059 (0.003)    -0.71      54.2*
                   Glucose   0.021 (0.0004)    0.025 (0.001)    43.77      51.0*
Caribbean (CR)     Height    0.246 (0.061)     0.250 (0.062)    3.13       70.0
                   HDL       0.245 (0.060)     0.249 (0.062)    3.13       78.0
                   SU        0.170 (0.029)     0.178 (0.032)    9.16       79.5*
                   LDL       0.262 (0.068)     0.264 (0.069)    1.48       73.8*
                   BMI       0.083 (0.007)     0.084 (0.007)    0.95       50.0*
                   Glucose   0.059 (0.003)     0.060 (0.004)    3.66       66.7*
South Asian (SA)   Height    0.345 (0.119)     0.346 (0.120)    0.40       64.0
                   HDL       0.319 (0.102)     0.319 (0.102)    -0.15      46.9*
                   SU        0.266 (0.071)     0.266 (0.071)    0          All replications were equal
                   LDL       0.177 (0.031)     0.179 (0.032)    2.13       64.0
                   BMI       0.160 (0.026)     0.163 (0.027)    3.02       70.0
                   Glucose   0.086 (0.007)     0.088 (0.008)    6.54       62.0

* Percentage excludes training-testing partitions for which the adaptive and fixed R-squared were identical, which happened whenever the optimal number of iterations was zero or very large, in which cases varying learning rates do not affect estimates.

Supplementary Notes

Supplementary Note 1: Portability map availability and use.

The portability maps for SNPs from the UK Biobank arrays as well as SNPs from HapMap 3 variants are accessible in three ways: 1) as Supplementary Data, 2) interactively through a Shiny app hosted on a webpage, and 3) via an R package (downloadable as data objects as well as interactively through a Shiny app).

1. Supplementary Data

The two portability maps can be downloaded directly as Supplementary Data 1 and Supplementary Data 2 (UK Biobank arrays and HapMap variants, respectively).

2. Webpage Shiny app

A Shiny app graphical interface was created to provide portability map information based on user-provided base pair positions (or genes or RS IDs). It is available at: https://lupia.github.io/Cross-Ancestry-Portability/. This version of the Shiny app will run slower than the identical version accessible through the R package described next. However, this web-based version does not require R software or packages.

3. R package MCANOVA: data objects and Shiny app

The maps are available in an R package, detailed on GitHub here: https://github.com/lupiA/MCANOVA/blob/main/README.md. The MCANOVA R package provides the portability maps in two ways. First, they are directly usable as data objects (see Examples, i) once the MCANOVA package is installed. Second, we have created an interactive Shiny app (see Examples, ii) in which users can input base pair positions (or genes or RS IDs) to obtain the relative accuracy estimates and other portability information for those regions from the maps. Additionally, the MCANOVA package provides a function implementing the MC-ANOVA method to estimate relative accuracy (see Examples, iv) and functions to obtain the small chromosome segments (see Examples, iii) used in this paper.
Installation

To install the `MCANOVA` package in R, first install the `remotes` package:

install.packages("remotes")
library(remotes)

Then install the package:

install_github("lupiA/MCANOVA")
library(MCANOVA)

Examples

After installation is complete:

i) To load the portability maps into an R session as data objects:

data(MAP_UKB)
data(MAP_HAPMAP)

ii) Launching the Shiny app to interactively access the portability map information:

PGS_portability_app()

iii) Creating chromosome segments of a minimum base pair length and size (using a small example map):

data(geno_map_example)
minSNPs <- 10
minBP <- 10e3
MAP_example <- geno_map_example
MAP_example$segments <- getSegments(MAP_example$base_pair_position,
                                    chr = MAP_example$chromosome,
                                    minBPSize = minBP,
                                    minSize = minSNPs,
                                    verbose = TRUE)

iv) Running MC-ANOVA

# install.packages("BGData")
library(BGData)

# Set seed
set.seed(12345)

# Generate genotypes (100 subjects and 500 SNPs)
n <- 100
p <- 500
X <- matrix(sample(0:2, n * p, replace = TRUE), ncol = p)
data(geno_map_example)
colnames(X) <- geno_map_example$SNPs

# Create chromosome segments (as in example iii)
minSNPs <- 10
minBP <- 10e3
MAP_example <- geno_map_example
MAP_example$segments <- getSegments(MAP_example$base_pair_position,
                                    chr = MAP_example$chromosome,
                                    minBPSize = minBP,
                                    minSize = minSNPs,
                                    verbose = TRUE)

# Assign ancestry IDs (80% to ancestry 1, 20% to ancestry 2)
n_1 <- round(0.8 * n)
n_2 <- round(0.2 * n)
ancestry <- rep(c("Group_1", "Group_2"), times = c(n_1, n_2))
rownames(X) <- ancestry

# Initialize portability estimates
MAP_example$correlation_within <- NA
MAP_example$correlation_across <- NA
MAP_example$R_squared_within <- NA
MAP_example$R_squared_across <- NA

# Set parameters for MC-ANOVA
lambda <- 1e-8
nRep <- 300
nQTL <- 3

# Loop over segments and run MC-ANOVA
# For whole genome applications, this can be run in parallel with one job
# per segment in a High-Performance Computing Cluster
for (i in min(MAP_example$segments):max(MAP_example$segments)) {
  # SNPs in the current segment (core) plus flanking SNPs on each side
  core <- which(MAP_example$segments == i)
  flank_size <- 10
  chunk_start <- max(min(core) - flank_size, 1)
  chunk_end <- min(max(core) + flank_size, nrow(MAP_example))
  chunk <- chunk_start:chunk_end
  isCore <- chunk %in% core

  # Genotypes of each ancestry group for the chunk
  X_1 <- X[rownames(X) == "Group_1", chunk]
  X_2 <- X[rownames(X) == "Group_2", chunk]

  # Run MC-ANOVA
  out <- MC_ANOVA(X = X_1, X2 = X_2, core = which(isCore),
                  lambda = lambda, nQTL = nQTL, nRep = nRep)

  # Extract portability estimates
  MAP_example$correlation_within[chunk[isCore]] <- out[1, 1]
  MAP_example$correlation_across[chunk[isCore]] <- out[2, 1]
  MAP_example$R_squared_within[chunk[isCore]] <- out[1, 1]^2
  MAP_example$R_squared_across[chunk[isCore]] <- out[2, 1]^2
}

# Relative accuracy: cross-ancestry R-squared relative to within-ancestry R-squared
RA <- MAP_example$R_squared_across / MAP_example$R_squared_within

REFERENCES

1. Wang, Y. et al. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat. Commun. 11, 3865 (2020).
2. Wright, S. The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19, 395–420 (1965).

CHAPTER 3: The impact of sample size and the relative proportion of ancestry group on cross-ancestry prediction accuracy

Introduction

Over the past two decades, there has been a large increase in the publication of Genome-Wide Association Studies (GWAS), with initial studies relying on cohorts of a few thousand participants1. These early investigations identified numerous loci linked to various human traits and diseases. However, there was a lack of replication between studies, which highlighted the need for larger sample sizes to better detect associations between single nucleotide polymorphisms (SNPs) and phenotypes, especially for SNPs with small effects and rare variants. Consequently, numerous GWAS were conducted by consortia, which meta-analyzed summary statistics from multiple cohorts, revealing many novel findings.
Despite these advancements, consortia faced limitations such as reliance on summary statistics, inconsistent phenotype definitions, and a focus on single health issues. The establishment of biobanks housing hundreds of thousands of individual phenotype-genotype records increased sample sizes drastically and overcame some of these limitations. The advent of Big Data in genomics allowed for a more accurate identification of quantitative trait loci (QTL), and thus significantly enhanced our ability to predict complex traits and disease risk2,3. Polygenic scores (PGS) are a common method to estimate the genetic predisposition of an individual to a disease or trait. Now that genotyping platforms have become sufficiently dense, and with the availability of methods that can be used to impute several millions of variants, the overarching limiting factor of prediction accuracy in PGS is sample size2,4–7. Within-population PGS prediction accuracy is affected by three main factors. First, the trait heritability imposes an upper bound on PGS prediction accuracy. Theoretically, we could achieve a PGS prediction R-squared equal to the trait heritability if we knew all the causal variants (and were able to genotype them) and their effects without error8,9. However, for complex traits, knowing all causal variants is nearly impossible. Therefore, PGS rely on using SNPs that are in linkage disequilibrium (LD) with causal variants. Thus, a second factor affecting PGS prediction accuracy is the strength of LD between causal variants and the SNPs used to build a PGS9-11. This depends on trait heritability, marker density, and sample size because these three factors affect the power to detect associations between SNPs and phenotypes. Third, because PGS are built from SNP effect estimates, PGS prediction performance depends on the accuracy of those estimates9-12.
Additionally, for cross-ancestry PGS prediction, the portability of SNP effects between ancestry groups also affects PGS prediction performance13,14. There is comparatively poor prediction performance and replication of PGS when applied across ancestries, particularly between more genetically distant ancestry groups, such as European (EU) and African (AF)15–20. Genomic differences between ancestry groups in allele frequencies, LD patterns, and LD strength are the primary factors contributing to poor PGS prediction accuracy13,14,20,21. Additional factors affecting prediction accuracy are genetic-by-genetic interactions, genetic-by-environment interactions, and effect size differences; however, previous literature suggests that causal variants and their effect sizes are mostly shared between ancestry groups13,23–26. Nevertheless, cross-ancestry (EU to non-EU) prediction remains a necessity due to the lack of statistical power from the small sample sizes available for within-non-EU prediction and the extreme overrepresentation of EU ancestry groups in genetic data. Recently, studies have found that including even a small number of non-EU individuals (the target ancestry group) in the training data, or incorporating non-EU summary statistics into the PGS construction, can improve prediction20,27–32. We hypothesize that in cross-ancestry prediction (e.g., EU to AF), as training sample size increases, there is more statistical power, typically resulting in more SNPs entering each PGS and ultimately increasing prediction accuracy. However, we anticipate that the gains in prediction accuracy from increasing the training sample size of EU versus AF are not equivalent, and that increasing the sample size of AF will yield a bigger gain in prediction accuracy than the same increase in sample size of EU.
Additionally, we expect that increasing the training sample sizes will improve SNP effect estimation precision and that the portability of SNP effects between ancestry groups is an additional factor affecting cross-ancestry prediction accuracy. In this study, using EU ancestry data from the UK Biobank33 and AF ancestry data from the All of Us platform34, we evaluate how several factors influence cross-ancestry prediction accuracy (EU to AF). We focus on three primary factors: SNP selection and the strength of LD between markers and QTL, SNP effect estimate precision, and SNP effect portability across ancestry groups. Additionally, we estimate the relative contribution to the prediction accuracy of additional cross-ancestry samples compared to within-ancestry samples, hypothesizing that they are not a one-to-one equivalent. Our analysis provides insight into the need for prioritizing non-EU data collection and explores the main bottlenecks in cross-ancestry prediction accuracy.

Materials

UK Biobank cohort

We selected distantly related individuals of European (EU) ancestry from the UK Biobank who had complete data on height, sex, and age. Participants were between 18 and 75 years old, were not excluded from kinship inference, were included in phasing, and were not identified as outliers in heterozygosity and missing rates. Following Lupi et al., 202414, samples were excluded if they withdrew from the study, “if they had a mismatch of reported and genetic sex, or if they were related to other samples with relatedness 0.05. Relatedness was determined using genomic relationship matrices ($\mathbf{G} = \mathbf{Z}\mathbf{Z}' / (\mathrm{tr}(\mathbf{Z}\mathbf{Z}')/n)$, where Z is the centered genotype matrix) computed within an ancestry group.”

All of Us cohorts

We selected distantly related African (AF) ancestry individuals from the All of Us cohort34 with complete data on height, sex, and age (18-75 years old).
Relatedness and ancestry were both defined by Controlled Tier data provided by the platform (relatedness-based kinship scores and predicted ancestry35). The principal components (PCs) used for this cohort were also supplied by the platform as Controlled Tier data.

Methods

Study overview

One of the main factors limiting PGS prediction accuracy is sample size. To evaluate how different factors, some affected by training sample size, impact cross-ancestry PGS prediction accuracy, we used European (EU) data from the UK Biobank (UKB) and the unprecedentedly large African (AF) ancestry data from the All of Us (AoU) platform to evaluate different scenarios for constructing PGS. For each scenario, we constructed PGS for height at varying AF and EU training set sample sizes (used for effect estimation) and evaluated the prediction in the same two testing sets every time: AF ($TST_{AF}$, $n_{TST_{AF}}$ = 9,078) and EU ($TST_{EU}$, $n_{TST_{EU}}$ = 10,000). The scenarios differed by: (1) varying both the training sample sizes and the number of SNPs selected (a typical PGS), (2) fixing the number of SNPs but varying the training sample sizes (isolating how sample size affects effect estimation), and (3) incorporating SNP effect portability, by comparing PGS consisting of SNPs estimated to be more portable across ancestry groups to PGS consisting of SNPs estimated to be less portable. The AF training ($TRN_{AF}$) sample size for effect estimation ranged from $n_{TRN_{AF}}$ = 0 to 40,000 and the EU training ($TRN_{EU}$) sample size ranged from $n_{TRN_{EU}}$ = 0 to 250,000, with a grid of eight additional sample sizes in between for each ancestry group (more details can be found in the cohort descriptions in the Methods section).

Scenario 1: A typical PGS (SNP filtering and estimation depend on sample size).
To examine the impact of both QTL signal detection and SNP effect estimation, which are dependent on sample size, we varied the training sample sizes used for both SNP filtering and SNP effect estimation. We evaluated each PGS with a standard approach in which SNPs entering into each PGS were selected based on meta-analysis p-values (p-value < 1e-4) obtained by combining the single-marker GWAS of each training ancestry group (AF and EU) at the given training sample sizes (more details can be found in the section ‘Genome-wide association study (GWAS)’ in Methods).

Scenario 2: Isolating effect estimation (fixing the SNP set).

Next, to distinguish the impact of effect estimation on prediction accuracy from the impact of the number of SNPs selected, we evaluated each PGS with a predetermined SNP set. Thus, in this scenario, the training sample size used for effect estimation varied but the SNP set was fixed (p=5,234 SNPs, filtered from a meta-GWAS from $n_{TRN_{AF}}$ = 25,000 and $n_{TRN_{EU}}$ = 100,000). An additional fixed SNP set was filtered from the meta-GWAS of AF ($n_{TRN_{AF}}$ = 25,000) and EU ($n_{TRN_{EU}}$ = 25,000).

Scenario 3: SNP portability.

Previous literature has suggested that some regions of the genome will be portable across ancestry groups in PGS and others will not14. Therefore, to examine SNP portability (more details follow in the ‘SNP selection’ section of Methods), we split each PGS into two separate genomic scores. The first PGS included the most portable SNPs across ancestry groups, i.e., the SNPs in the top 20% of MC-ANOVA predicted cross-ancestry R-squared (most portable SNPs)14, and the second consisted of the SNPs in the bottom 20% (least portable SNPs).
Design

Out of the 270,859 selected EU individuals, a random sample of $n_{TST_{EU}}$ = 10,000 individuals was designated as the EU testing set ($TST_{EU}$), and then nine distinct nonempty training sets ($TRN_{EU}$) of varying sample sizes were randomly drawn, giving, together with the empty set, the grid $n_{TRN_{EU}}$ = {0, 5,000, 10,000, 25,000, 50,000, 75,000, 100,000, 150,000, 200,000, 250,000}. The training sets, $TRN_{EU}$, did not include any individuals from the testing set, $TST_{EU}$. Additionally, smaller training sets were subsets of the larger training sets.

A random sample of $n_{TST_{AF}}$ = 9,078 individuals was selected to be the AF testing set ($TST_{AF}$), and nine distinct nonempty training sets ($TRN_{AF}$) of varying sample sizes were randomly drawn, giving, together with the empty set, the grid $n_{TRN_{AF}}$ = {0, 5,000, 7,500, 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000}. The training sets, $TRN_{AF}$, did not include any individuals from the testing set, $TST_{AF}$. Additionally, smaller training sets were subsets of the larger training sets.

Genotypes

Since this analysis involved combining data from two cohorts, UKB and AoU, we used the intersection between the AoU genotyped SNPs and the UKB imputed SNPs. SNPs were excluded from this set if they had a minor allele frequency of less than 0.01 or missingness of over 0.1 in either dataset (among the full sample set), resulting in p = 522,170 SNPs retained for analysis. If a sample subset contained a missing SNP genotype, it was imputed with the mean.

Phenotypes

For the AF cohort, the height measurement selected was the one closest to 60 years old for each individual, and outliers for height were removed, defined as values larger or smaller than the median ± three times the interquartile range (the range of the middle 50% of height values).
For the EU cohort, the height measurement selected was from the first instance or the second if the first was missing. Some steps in this study involved preadjusting the height phenotype (e.g., the effect estimation for the PGS). For this, the residuals from an ordinary least squares (OLS) regression of height on sex, age, and the first five genotype-derived principal components were used as the adjusted phenotype. The EU OLS regression also included batch and center.

Genome-wide association study (GWAS)

A GWAS for height, including sex, age, and the first five SNP-derived PCs as additional covariates, was evaluated for each sample size set and for each data cohort (EU and AF) using PLINK36. That is, a single marker regression for $TRN_*$, where ‘*’ = EU or AF, was evaluated as:

$\mathbf{y}_* = \mathbf{SNP}_{*,j}\,\beta_{*,j} + \mathbf{Z}_*\boldsymbol{\alpha}_* + \mathbf{e}_*$   [10]

for j = 1...p SNPs, where $\mathbf{y}_*$ is the height vector for the ‘*’ = EU or AF ancestry group and $\mathbf{SNP}_{*,j}$ is the vector of the number of allele copies for the jth SNP and the ‘*’ = EU or AF ancestry group. $\mathbf{Z}_* \in \mathbb{R}^{n_{TRN_*} \times 7}$ is a predictor matrix, consisting of sex, age, and PC1 to PC5 (the first five genotype-derived principal components).
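The single-marker regression in [10] can be illustrated with a minimal sketch in base R on simulated data. This is not the PLINK analysis used in the study; the object names (`snp`, `covars`, `single_marker_gwas`) and all simulated values are hypothetical and chosen only for illustration.

```r
# Minimal sketch of the single-marker regression in [10] on simulated data.
# Assumption: base R lm() stands in for PLINK; all names and values are made up.
set.seed(1)
n <- 500   # individuals
p <- 20    # SNPs

# Allele counts coded 0/1/2
snp <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n)

# Covariates Z: sex, age, and five principal components (simulated here)
pcs <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("PC", 1:5)))
covars <- data.frame(sex = rbinom(n, 1, 0.5), age = runif(n, 18, 75), pcs)

# Simulated height: only SNP 3 has a non-zero effect
y <- 170 + 3 * snp[, 3] + 6 * covars$sex + rnorm(n, sd = 5)

# One regression per SNP, each including all covariates
single_marker_gwas <- function(y, snp, covars) {
  res <- t(sapply(seq_len(ncol(snp)), function(j) {
    dat <- data.frame(y = y, snp_j = snp[, j], covars)
    fit <- lm(y ~ ., data = dat)
    coef(summary(fit))["snp_j", c("Estimate", "Std. Error", "Pr(>|t|)")]
  }))
  colnames(res) <- c("beta", "se", "p")
  as.data.frame(res)
}

gwas <- single_marker_gwas(y, snp, covars)
head(gwas[order(gwas$p), ])
```

The effect estimates (`beta`) and standard errors (`se`) returned per SNP are the quantities that an ancestry-specific GWAS passes on to a meta-analysis.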
To obtain GWAS p-values that considered both data cohorts (EU and AF), we combined the ancestry group GWAS SNP effects ($\hat{\beta}_{EU,j}$ and $\hat{\beta}_{AF,j}$) estimated in [10] to obtain a meta-analysis-based estimate, $\hat{\beta}_{META,j}$, for each SNP (j = 1...p)37:

$\hat{\beta}_{META,j} = \dfrac{w_{EU,j}\,\hat{\beta}_{EU,j} + w_{AF,j}\,\hat{\beta}_{AF,j}}{w_{EU,j} + w_{AF,j}}$   [11]

where $w_{EU,j} = \dfrac{1}{SE(\hat{\beta}_{EU,j})^2}$ and $w_{AF,j} = \dfrac{1}{SE(\hat{\beta}_{AF,j})^2}$. The standard error of the meta-estimator is $SE(\hat{\beta}_{META,j}) = \sqrt{\dfrac{1}{w_{EU,j} + w_{AF,j}}}$ and the meta-test statistic for each SNP (j = 1...p) was defined as:

$Z_{META,j} = \dfrac{\hat{\beta}_{META,j}}{SE(\hat{\beta}_{META,j})}$.   [12]

The final meta-GWAS p-value was then defined to be twice the area under a standard normal distribution from negative infinity to the negative absolute value of $Z_{META,j}$ [12].

PGS and prediction accuracy calculation

SNP selection.
The primary difference between the PGS evaluated in each of the scenarios described above was in the SNPs entering into each PGS. In all three scenarios, in the cases where $n_{TRN_{EU}}$ = 0 or $n_{TRN_{AF}}$ = 0, the single-cohort GWAS p-value was used instead of the meta-GWAS p-value. In Scenario 1 (‘A typical PGS’), the number of SNPs varied based on the training sample sizes $n_{TRN_{EU}}$ and $n_{TRN_{AF}}$. For each sample size combination, the SNPs were selected for the PGS if the meta-GWAS p-value [12], if applicable, was less than 1e-4. In Scenario 2 (‘Isolating effect estimation’), SNPs entering into each PGS were selected if they had a p-value < 1e-4 based on the meta-GWAS [12], if applicable, using the sample sets 1) $n_{TRN_{EU}}$ = 100,000 and $n_{TRN_{AF}}$ = 25,000 to select p=5,234 SNPs and 2) $n_{TRN_{AF}}$ = 25,000 and $n_{TRN_{EU}}$ = 25,000 to select p=817 SNPs. In Scenario 3 (‘SNP portability’), the SNPs entering into each PGS were selected if they had a p-value < 1e-2 from the meta-GWAS, if applicable. Then, the PGS for each sample size combination was split into two sub-PGS. Since the SNPs were subset into two PGS for each sample size combination, the threshold for selection was relaxed to allow for an adequate number of SNPs in each PGS. One PGS sub-score was based on the SNPs with the top 20% of predicted portability, and the second was based on the SNPs with the lowest 20% of predicted portability.
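The inverse-variance-weighted meta-analysis in [11] and [12], followed by p-value thresholding for SNP selection, can be sketched in a few lines of R. This is an illustrative sketch only: the function name `meta_gwas` and the toy summary statistics are hypothetical and are not taken from the study.

```r
# Minimal sketch of the meta-analysis in [11]-[12] and p-value-based SNP selection.
# Assumption: meta_gwas() and the toy numbers below are made up for illustration.
meta_gwas <- function(beta_eu, se_eu, beta_af, se_af) {
  w_eu <- 1 / se_eu^2                       # inverse-variance weights
  w_af <- 1 / se_af^2
  beta_meta <- (w_eu * beta_eu + w_af * beta_af) / (w_eu + w_af)   # [11]
  se_meta <- sqrt(1 / (w_eu + w_af))        # standard error of the meta-estimate
  z_meta <- beta_meta / se_meta             # [12]
  p_meta <- 2 * pnorm(-abs(z_meta))         # two-sided normal p-value
  data.frame(beta = beta_meta, se = se_meta, z = z_meta, p = p_meta)
}

# Toy ancestry-specific GWAS results for three SNPs
beta_eu <- c(0.50, 0.02, -0.30); se_eu <- c(0.05, 0.05, 0.10)
beta_af <- c(0.40, -0.01, -0.10); se_af <- c(0.10, 0.10, 0.20)

meta <- meta_gwas(beta_eu, se_eu, beta_af, se_af)

# SNP selection at the two thresholds used in the scenarios
selected_strict  <- which(meta$p < 1e-4)   # Scenarios 1 and 2
selected_relaxed <- which(meta$p < 1e-2)   # Scenario 3
```

In this toy example the third SNP has a meta p-value between the two thresholds, so it enters only under the relaxed Scenario 3 threshold, which shows how relaxing the cutoff enlarges the SNP set before it is split by portability.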
Portability was defined by the MC-ANOVA14 predicted cross-ancestry R-squared for a small chromosome segment based on UKB (EU to AF) array data:

$R^2_{EU \to AF} = \mathrm{Corr}\left(\mathbf{x}'_{i_{AF}}\boldsymbol{\beta}_{EU},\; \mathbf{z}'_{i_{AF}}\boldsymbol{\alpha}\right)^2$,

where $\mathbf{z}_{i_{AF}}$ is the (centered) vector of SNP genotypes at causal loci (QTL) for the AF ancestry group, $\mathbf{x}_{i_{AF}}$ is the (centered) vector of SNP genotypes at markers for the AF ancestry group, $\boldsymbol{\alpha}$ is the vector of QTL effects, and $\boldsymbol{\beta}_{EU} = \boldsymbol{\Sigma}_{X_{EU}}^{-1}\boldsymbol{\Sigma}_{X_{EU}Z_{EU}}\boldsymbol{\alpha}$ are the EU ancestry group (population) marker effects. The portability estimates used were obtained from the portability maps provided by Lupi et al., 202414.

SNP effects.

Once the SNP set was determined, every PGS was based on summary statistics from the AF and EU training cohorts. To jointly estimate effects for J PGS SNPs, $\hat{\mathbf{b}}$, we fit a Bayesian ridge regression model using a Markov Chain Monte Carlo (MCMC) Gibbs sampler algorithm:

$\mathbf{y}_* = \mathbf{1}\mu_* + \sum_{j=1}^{J} \mathbf{SNP}_{*,j}\, b_{*,j} + \mathbf{e}_*$   [13]

where, for the ‘*’ = EU or AF ancestry group, $\mathbf{y}_*$ is the preadjusted vector of height, $\mu_*$ is an intercept, $\mathbf{SNP}_{*,j}$ is the vector of the number of allele copies for the jth SNP, and $\mathbf{e}_* = \{e_{1*}, \dots, e_{n*}\}$ are independent normal residuals. The residual variance has a scaled inverse-chi squared prior and the shrinkage parameter lambda ($\lambda_*$) in the Bayesian ridge regression prior was kept fixed for each ancestry group. The ridge estimator is $\hat{\mathbf{b}}_* = (\mathbf{X}'_*\mathbf{X}_* + \lambda_*\mathbf{I})^{-1}\mathbf{X}'_*\mathbf{y}_*$, where $\mathbf{X}_*$ is the design matrix of the intercept and SNPs.
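Because the ridge estimator depends on the data only through $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$, it can be computed from summary statistics. The sketch below illustrates this with simulated data for two cohorts. It uses the closed-form ridge solution rather than the MCMC Gibbs sampler used in the study, it drops the intercept by centering genotypes and phenotypes within each cohort, and it assumes that cohort summary statistics are pooled by simple addition; the function `sim_cohort`, the value of `lambda`, and all simulated quantities are hypothetical.

```r
# Minimal sketch: closed-form ridge estimates from pooled summary statistics.
# Assumptions: simulated data, centered X and y (no intercept), additive pooling
# of cohort summary statistics, and an arbitrary fixed lambda.
set.seed(2)
sim_cohort <- function(n, p, b) {
  X <- scale(matrix(rbinom(n * p, 2, 0.3), n, p), center = TRUE, scale = FALSE)
  y <- as.vector(X %*% b + rnorm(n))
  list(X = X, y = y - mean(y))
}

p <- 10
b_true <- c(0.5, -0.5, rep(0, p - 2))
eu <- sim_cohort(400, p, b_true)   # larger cohort
af <- sim_cohort(100, p, b_true)   # smaller cohort

# Per-cohort summary statistics, summed across cohorts
XtX <- crossprod(eu$X) + crossprod(af$X)
Xty <- crossprod(eu$X, eu$y) + crossprod(af$X, af$y)

lambda <- 10   # hypothetical fixed shrinkage parameter
b_hat <- solve(XtX + lambda * diag(p), Xty)

# Check: identical to the ridge solution from the stacked individual-level data
X_all <- rbind(eu$X, af$X)
y_all <- c(eu$y, af$y)
b_check <- solve(crossprod(X_all) + lambda * diag(p), crossprod(X_all, y_all))
all.equal(as.vector(b_hat), as.vector(b_check))
```

Working with $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$ keeps the cost of estimation independent of the number of individuals once the summary statistics have been computed, which matters when the training set is re-sized many times.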
The prior mean, $\mu_b$, was defined as an unweighted average across the two training cohorts:

$\mu_b = \dfrac{\sum_{i=1}^{n_{TRN_{AF}}} y_{TRN_{AF},i} + \sum_{i=1}^{n_{TRN_{EU}}} y_{TRN_{EU},i}}{n_{TRN_{AF}} + n_{TRN_{EU}}}$,   [14]

and the prior variance, since the two cohorts are independent, was defined simply as the unweighted combined variance:

$\sigma_b^2 = \dfrac{(\mathbf{y}'\mathbf{y})_{TRN_{AF}} + (\mathbf{y}'\mathbf{y})_{TRN_{EU}}}{n_{TRN_{AF}} + n_{TRN_{EU}}} - \mu_b^2$,   [15]

where $\mu_b$ is the expression defined in [14], $\mathbf{y}_{TRN_{AF}}$ and $\mathbf{y}_{TRN_{EU}}$ are the AF and EU preadjusted height vectors, respectively, and $n_{TRN_{AF}}$ and $n_{TRN_{EU}}$ are the ancestry group-specific training set sample sizes.

We ran the BLR algorithm with 50,000 MC iterations and used a burn-in of 2,000 using the ‘BLRCross’ function from the R package BGLR38. This function takes summary statistics as inputs rather than the traditional inputs of an incidence matrix $\mathbf{X}$ and phenotype vector.
That is, for p SNPs entering into a PGS, the model involved the summary statistics $\mathbf{X}'\mathbf{X}_{TRN_*}$, $\mathbf{X}'\mathbf{y}_{TRN_*}$, $\mathbf{y}'\mathbf{y}_{TRN_*}$, and $n_{TRN_*}$, where, for the ancestry group subscript (* = AF or EU), $n_{TRN_*}$ is the sample size, $\mathbf{X}_{TRN_*} \in \mathbb{R}^{n \times p}$ is the centered and imputed matrix of genotypes coded as 0, 1, or 2 (the count of the reference allele at each SNP), and $\mathbf{y}_{TRN_*} \in \mathbb{R}^{n \times 1}$ is the preadjusted vector of the height phenotype. The training sets were then combined additively (without weights) into one set of summary statistics, and the combined summary statistics were entered into the BLR algorithm described above.

Prediction accuracy estimation.

To evaluate each PGS in a testing set, we used summary statistics from each test set (for AF, $\mathbf{X}'\mathbf{X}_{TST_{AF}}$, $\mathbf{X}'\mathbf{y}_{TST_{AF}}$, and $\mathbf{y}'\mathbf{y}_{TST_{AF}}$, and for EU, $\mathbf{X}'\mathbf{X}_{TST_{EU}}$, $\mathbf{X}'\mathbf{y}_{TST_{EU}}$, and $\mathbf{y}'\mathbf{y}_{TST_{EU}}$) and the estimated SNP effects, $\hat{\mathbf{b}}$, to estimate the prediction correlation for each ancestry group, $R_{TST_{AF}}$ and $R_{TST_{EU}}$ (for ease of notation we will drop the ancestry group subscript here, with the understanding that $TST$ either equals $TST_{AF}$ or $TST_{EU}$):

$R_{TST} = \mathrm{Corr}(\hat{\mathbf{y}}_{TST}, \mathbf{y}_{TST}) = \dfrac{\mathrm{Cov}(\hat{\mathbf{y}}_{TST}, \mathbf{y}_{TST})}{\sqrt{\mathrm{Var}(\mathbf{y}_{TST})\,\mathrm{Var}(\hat{\mathbf{y}}_{TST})}} = \dfrac{\frac{1}{n_{TST}}\hat{\mathbf{y}}'_{TST}\mathbf{y}_{TST} - \mu_{\hat{y}_{TST}}\,\mu_{y_{TST}}}{\frac{1}{n_{TST}}\sqrt{\mathbf{y}'\mathbf{y}_{TST}\;\hat{\mathbf{b}}'\mathbf{X}'\mathbf{X}_{TST}\hat{\mathbf{b}}}}$.

Since $\hat{\mathbf{y}}_{TST} = \mathbf{X}_{TST}\hat{\mathbf{b}}$ and since $\mathbf{y}_{TST}$ is centered around zero, $\mu_{y_{TST}} = 0$, the factors of $1/n_{TST}$ cancel and the expression reduces to:

$R_{TST} = \dfrac{\hat{\mathbf{b}}'\mathbf{X}'\mathbf{y}_{TST}}{\sqrt{\mathbf{y}'\mathbf{y}_{TST}\;\hat{\mathbf{b}}'\mathbf{X}'\mathbf{X}_{TST}\hat{\mathbf{b}}}}$.   [16]

The prediction R-squared, $R^2_{TST}$, is the prediction correlation squared.

Results

In this study, we explored how several factors, particularly sample size and the ancestry group of the training set, affected cross-ancestry prediction accuracy. We evaluated different scenarios of PGS, varying the size of the training set consisting of two ancestry groups: EU ancestry data from the UKB and AF ancestry data from AoU. Age, sex, and height were well-balanced across the sample size sets and datasets (Table C1). The average height across the UKB sets was 169.0 ± 9.2 cm and was 169.4 ± 9.8 cm across the AoU sets. The AoU samples were younger on average (48.8 ± 13.4 years old) and more female (56.3%) than the UKB sets (56.8 ± 8.0 years old and 53.3% female).
Scenario 1: A typical PGS (SNP filtering and estimation depend on sample size)

In this scenario, the training set was composed of varying sample sizes of AF and EU ($n_{TRN_{AF}}$ and $n_{TRN_{EU}}$), and was used for both SNP selection and SNP effect estimation. Figure 9a shows the prediction accuracy in the AF testing set, $TST_{AF}$, while Figure 9b shows the prediction accuracy in the EU testing set, $TST_{EU}$. In the AF testing set (Figure 9a), the pure within-ancestry prediction (EU $n_{TRN_{EU}}$ = 0) is the first column of results and the pure cross-ancestry prediction (AF $n_{TRN_{AF}}$ = 0) is the last row of results. The prediction correlation increased as the training sample size increased for both the pure cross-ancestry and pure within-ancestry prediction (Figure 9a). The maximum prediction correlation achieved with pure within-ancestry prediction was 0.19, at the maximum sample size explored ($n_{TRN_{AF}}$ = 40,000). This was, as expected, larger than the maximum prediction correlation achieved with pure cross-ancestry prediction ($R_{TST_{AF}}$ = 0.15). The maximum pure cross-ancestry prediction correlation, reached at the largest EU training set ($n_{TRN_{EU}}$ = 250,000), was approximately equivalent to the pure within-ancestry prediction correlation at $n_{TRN_{AF}}$ = 25,000 (Figure 9a), implying that, when comparing strictly cross-ancestry and within-ancestry PGS, ten EU individuals were required for every one AF individual.
The 10:1 (EU:AF) relationship was not linear though (Figure 9a), as the ancestry ratio was 1.3:1 to achieve a prediction correlation of about half as much ($R_{TST_{AF}}$ = 0.08), and the ratio was 5.6:1 to achieve a prediction correlation of about twice as much ($R_{TST_{AF}}$ = 0.29). The number of SNPs selected for each PGS varied depending on the training sample sizes used for the meta-GWAS, and, interestingly, the SNP sets filtered from AF only ($n_{TRN_{EU}}$ = 0) tended to be smaller than the SNP sets filtered from EU only ($n_{TRN_{AF}}$ = 0). For example, when $n_{TRN_{EU}}$ = 0 but $n_{TRN_{AF}}$ = 25,000, 510 SNPs were selected. However, when this was reversed such that $n_{TRN_{AF}}$ = 0 but $n_{TRN_{EU}}$ = 25,000, 770 SNPs were selected. This could be because the EU ancestry group tends to have more LD compared to the AF ancestry group.

a. Scenario 1: African testing set, $TST_{AF}$

Figure 9: PGS varying the training sample sizes used to filter SNPs and estimate effects. The prediction correlation, $R_{TST}$, for height using different PGS for: (a) the African testing set, $TST_{AF}$, and (b) the European testing set, $TST_{EU}$, for different combinations of $TRN_{AF}$ and $TRN_{EU}$ sample sizes used for both SNP filtering and SNP effect estimation. SNPs entering into each PGS are based on the training sample size combinations ($n_{TRN_{AF}}$ and $n_{TRN_{EU}}$), and the number of SNPs, p, for each PGS is shown in parentheses.
For $TST_{AF}$, when there are no AF in the training ($n_{TRN_{AF}}$ = 0), this is pure cross-ancestry prediction, and when there are no EU in the training ($n_{TRN_{EU}}$ = 0), this is pure within-ancestry prediction (and vice versa for $TST_{EU}$).

Figure 9 (cont’d)
b. Scenario 1: European testing set, $TST_{EU}$

When training the PGS using a combination of AF and EU data, the prediction correlation when testing in the AF group, $TST_{AF}$, tended to increase as both training sample sizes increased (Figure 9a), such that the maximum prediction correlation was achieved with the maximum sample sizes of both (AF $n_{TRN_{AF}}$ = 40,000 and EU $n_{TRN_{EU}}$ = 250,000). As seen when comparing the pure within-ancestry and cross-ancestry PGS, more EU individuals than AF individuals were required in the training data to obtain equivalent prediction accuracy. For example, four sample size combinations in Figure 9a achieved a prediction correlation of 0.27: AF $n_{TRN_{AF}}$ = 35,000 and 40,000 with EU $n_{TRN_{EU}}$ = 100,000 (2.7:1), AF $n_{TRN_{AF}}$ = 30,000 and EU $n_{TRN_{EU}}$ = 150,000 (5:1), and AF $n_{TRN_{AF}}$ = 25,000 and EU $n_{TRN_{EU}}$ = 250,000 (10:1). This means that, at this prediction accuracy level, additional EU individuals were worth increasingly less than additional within-ancestry (AF) individuals.
Conversely, we observed from the results in Figure 9b that, when testing in EU ($TST_{EU}$), adding AF samples did not have much impact on prediction accuracy; rather, the prediction accuracy increased as the EU sample size increased. Figure 10 shows contour lines over the prediction correlations when testing in AF ($TST_{AF}$), identifying the sample sizes required of each training ancestry to achieve equivalent prediction correlation. As the EU sample size ($n_{TRN_{EU}}$) increased, the contour lines started to level off, showing the relative decrease of information added after about $n_{TRN_{EU}}$ = 100,000 EU individuals. This leveling off suggests that beyond some threshold for sample size, increasing the size of the non-target ancestry group in the training data (e.g., more EU individuals) contributes less to enhancing prediction accuracy, and there are diminishing returns from additional data of this type. Additionally, the larger negative slopes at smaller EU sample sizes mean that, while additional AF individuals increased prediction correlation more than the equivalent number of EU individuals, the two were closer to a 1:1 equivalency (slope closer to negative one) than at the higher EU sample sizes (where the slope becomes closer to zero).

Figure 10: Prediction correlation contour lines. Contour lines of predictive correlations (shown in red) over the estimated prediction correlation, $R_{TST}$, for height using different PGS for the AF testing set, $TST_{AF}$. The axes show different combinations of AF and EU training set sample sizes ($n_{TRN_{AF}}$ and $n_{TRN_{EU}}$). The training sample sizes were used for both SNP filtering and SNP effect estimation.
Scenario 2: Isolating effect estimation (fixing the SNP set)

To explore how SNP effect estimation accuracy affects PGS accuracy, we evaluated each PGS with a fixed SNP set of p = 5,234 SNPs (Figure 11) but estimated the SNP effects in different training sample sizes of EU and AF. In Figure 11a (testing in AF, $TST_{AF}$), as the EU sample size increased relative to the AF sample size, the prediction accuracy decreased since the SNP effect estimates converged toward the EU-specific estimates. The highest prediction accuracy was with the largest sample size of AF and a small number of EU ($n_{TRN_{AF}}$ = 40,000 and $n_{TRN_{EU}}$ = 10,000; Figure 11a). This is because, at larger EU sample sizes, the EU data substantially outnumber the AF data and dominate the effect estimates. For testing in both AF and EU (Figures 11a and 11b, respectively), as expected, pure cross-ancestry prediction (training in one ancestry group and testing in the other ancestry group) generally had the lowest prediction accuracy. When testing in the EU group (Figure 11b), the highest accuracy was among the PGS with the largest sample size of EU. The range of prediction correlation estimates (Figure 11a) within the AF testing set, $TST_{AF}$, was narrow compared to the range in Scenario 1 (Figure 9a). Excluding pure within- and cross-ancestry cases, for the AF testing set, $TST_{AF}$, the range of prediction correlation was 0.19–0.28. Similarly, for the EU testing set, $TST_{EU}$, the range of prediction correlation was 0.33–0.46. In Scenario 1, both testing sets ($TST_{AF}$ and $TST_{EU}$) had a range starting from nearly zero and a larger maximum.
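The pull of pooled effect estimates toward the EU-specific values can be illustrated with a toy single-SNP calculation. This is a sketch under stated assumptions, not the BLR model used in the study: it uses ordinary least squares on additively combined summary statistics, a standardized genotype (so that x'x is approximately n in each group), and hypothetical marker effects of 0.5 in AF and 1.0 in EU chosen only for illustration.

```python
def pooled_effect(n_af, b_af, n_eu, b_eu):
    """Single-SNP least-squares effect from additively pooled summary statistics.

    With a standardized genotype, x'x is about n and x'y is about n * b in
    each group, so the pooled estimate (x'y_AF + x'y_EU) / (x'x_AF + x'x_EU)
    is a sample-size-weighted average of the group-specific effects.
    """
    xtx = n_af + n_eu
    xty = n_af * b_af + n_eu * b_eu
    return xty / xtx

b_af, b_eu = 0.5, 1.0          # hypothetical group-specific marker effects
n_af = 40_000                  # largest AF training size in the study
eu_sizes = [10_000, 50_000, 100_000, 250_000]

pooled = [pooled_effect(n_af, b_af, n_eu, b_eu) for n_eu in eu_sizes]
for n_eu, b in zip(eu_sizes, pooled):
    print(f"n_EU = {n_eu:>7,d}  pooled effect = {b:.3f}")
```

In this simplified setting the pooled estimate moves steadily from the AF value toward the EU value as the EU training size grows, which mirrors the qualitative pattern described for Figure 11a. Shrinkage in a Bayesian model and LD among many SNPs would change the numbers but not the direction of the pull.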
When comparing the fixed SNP set combinations in Figure 11a to another fixed SNP set of p = 817 SNPs (Figure C1a), the prediction correlations with p = 5,234 SNPs were typically higher (although six cases when $n_{TRN_{AF}}$ = 0 were smaller), which is what we hypothesized for PGS selecting a larger number of SNPs. The pattern of the correlation estimates was the same, in that the highest correlation estimates were obtained at smaller values of $n_{TRN_{EU}}$ rather than at the maximum.

a. Scenario 2: African testing set, $TST_{AF}$

Figure 11: PGS fixing the SNP set but varying the training sample sizes used to estimate SNP effects. The prediction correlation, $R_{TST}$, for height using different PGS for: (a) the African testing set, $TST_{AF}$, and (b) the European testing set, $TST_{EU}$, for different combinations of $TRN_{AF}$ and $TRN_{EU}$ sample sizes used for effect estimation. The SNPs entering into each PGS are the same (p = 5,234 SNPs), and the number is noted in parentheses. For $TST_{AF}$, when there are no AF in the training ($n_{TRN_{AF}}$ = 0), this is pure cross-ancestry prediction, and when there are no EU in the training ($n_{TRN_{EU}}$ = 0), this is pure within-ancestry prediction (and vice versa for $TST_{EU}$). The scale of the prediction correlation is based on the values from Figure 9a for Figure 11a (and from Figure 9b for Figure 11b) to allow for straightforward comparison between the plots.

Figure 11 (cont’d)
b. Scenario 2: European testing set, $TST_{EU}$

Scenario 3: SNP portability

In Scenario 3 we used cross-ancestry PGS portability estimates from the Relative Accuracy maps derived from the UK Biobank arrays, available in Lupi et al., 2024¹⁴ (see ‘SNP selection’ in Methods for details on portability estimates). To build each PGS for height we partitioned the SNPs into two groups using a p < 1e-2 inclusion level. The two groups were the top 20th percentile of portable SNPs and the bottom 20th percentile of portable SNPs. Similar to Scenario 1, the training sample sizes ($n_{TRN_{AF}}$ and $n_{TRN_{EU}}$) for this scenario determined both SNP selection and SNP effect estimation. Figure 12a shows the prediction accuracy when testing in AF ($TST_{AF}$) for 99 combinations of training sample sizes (different sizes and ancestries) by SNP portability. Figure 12b shows the prediction accuracies for the same settings but testing in EU ($TST_{EU}$). Figure 12a shows that even among the pure within-ancestry PGS, which did not involve any EU data, the PGS involving the SNPs predicted to be most portable across ancestry groups had higher prediction correlation compared to the PGS involving the SNPs predicted to be the least portable. Interestingly, for the most portable SNPs testing in AF ($TST_{AF}$), the maximum prediction correlation achieved was only 0.18, and was with the maximum training sample sizes used for both SNP filtering and estimation ($n_{TRN_{AF}}$ = 40,000 and $n_{TRN_{EU}}$ = 250,000).
For the least portable SNPs testing in AF ($TST_{AF}$), the maximum prediction correlation achieved was only 0.16, and was also with the maximum training sample sizes used for both SNP filtering and estimation ($n_{TRN_{AF}}$ = 40,000 and $n_{TRN_{EU}}$ = 250,000). While this sample size combination had fewer SNPs compared to that in Figure 9a (about 9,080 SNPs versus 15,039 SNPs, respectively), which used the same sample sizes for effect estimation but differed in the SNP set, other combinations had a larger number of SNPs entering into the PGS but still had poorer prediction accuracy. One example of this is for $n_{TRN_{AF}}$ = 25,000 and $n_{TRN_{EU}}$ = 25,000, in which the typical PGS (Figure 9a) had 817 SNPs and a prediction correlation of 0.20, while the most portable SNP-based PGS (Figure 12a) had 2,091 SNPs and a prediction correlation of 0.14 (0.12 for the least portable SNP-based PGS).

a. Scenario 3: African testing set, $TST_{AF}$ (panels: Least Portable SNPs; Most Portable SNPs)
b. Scenario 3: European testing set, $TST_{EU}$ (panels: Least Portable SNPs; Most Portable SNPs)

Figure 12: PGS subset by SNP portability. The prediction correlation, $R_{TST}$, for height using different PGS for: (a) AF ($TST_{AF}$) and (b) EU ($TST_{EU}$) when using the top 20% most portable SNPs (based on MC-ANOVA’s cross-ancestry R-squared¹⁴) compared to the bottom 20% (least portable) SNPs by training set sample size (AF [$TRN_{AF}$] and EU [$TRN_{EU}$]). The scale of the prediction correlation is based on the values from Figure 9a for Figure 12a (and from Figure 9b for Figure 12b) to allow for straightforward comparison between the plots.
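The comparison between the most and least portable SNP sets is summarized in Figure C2 as a percent increase in prediction R-squared, that is, the difference in squared prediction correlations relative to the least portable value. A minimal sketch of that arithmetic follows, using the rounded correlations quoted above for $n_{TRN_{AF}}$ = $n_{TRN_{EU}}$ = 25,000 (0.14 and 0.12); the function name is hypothetical, and because the inputs are rounded the result is only approximate.

```python
def pct_increase_r2(r_most, r_least):
    """Percent increase in prediction R-squared, most vs. least portable SNPs.

    Inputs are prediction correlations; they are squared before comparison.
    """
    r2_most = r_most ** 2
    r2_least = r_least ** 2
    return 100.0 * (r2_most - r2_least) / r2_least

# Rounded correlations quoted in the text for n_TRN_AF = n_TRN_EU = 25,000
example = pct_increase_r2(0.14, 0.12)
print(f"{example:.1f}%")
```

With these rounded inputs the increase is roughly 36%. A modest gap in correlation therefore translates into a much larger relative gap in R-squared, which helps in reading the large percentages reported for Figure C2.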
Comparing the most portable PGS to the least in the AF testing set, $TST_{AF}$ (Figure C2a), the pure within-ancestry PGS had an average (median) increase in prediction R-squared (prediction correlation squared) of 103.8% (103.1%), the pure cross-ancestry PGS had an average (median) increase of 119.1% (83.7%), and the PGS involving both EU and AF in training had an average (median) increase of 78.8% (49.1%). In the EU testing set, $TST_{EU}$ (Figure C2b), the most portable SNP set tended to have higher prediction R-squared compared to the least portable SNPs. There was an average (median) increase of 158.5% (27.1%) across all PGS combinations (including pure within- and cross-ancestry PGS). Our prediction correlations for Scenario 3 (Figure 12) are generally lower than in Scenario 1 (Figure 9). Of the cases when Scenario 1 had fewer SNPs than Scenario 3 (59 out of 99), only 13 had a smaller prediction correlation in Scenario 1. The thirteen combinations with lower prediction correlation and a lower number of SNPs selected were all when $n_{TRN_{AF}}$ ≤ 15,000 and $n_{TRN_{EU}}$ ≤ 25,000. Since Scenario 3 represents the genomic regions with high portability instead of being genome-wide, if the SNPs with higher LD (portability) are located near one another, they could be picking up the same QTL signal, leading to less diversity in the signal being picked up by markers. Indeed, in Table C2, when defining AF peaks (unique local chromosome regions consisting of GWAS-significant hits in LD picking up the same QTL signal) from the two portability-based SNP hit sets (GWAS p-value < 1e-2), the SNP set consisting of the highly portable SNPs condensed to 511 peaks on average across sample size cases (on average 9.3% of the total SNPs) and the SNPs with low portability condensed to 500 peaks (8.6% of the total SNPs).
Conversely, the Scenario 1 SNPs condensed to 2,015 peaks on average (on average 50.2% of the total SNPs).

Discussion

In this study, we postulate that training sample size and ancestry group are the major limitations in cross-ancestry prediction accuracy. Previous studies have shown that the sample size used to train models affects prediction accuracy by influencing both the identification of phenotype-associated SNPs selected for inclusion in the PGS and the precision of SNP effect estimates⁸⁻¹². To shed light on this problem, we evaluated polygenic scores (PGS) under various scenarios using both within- and cross-ancestry training data from the UK Biobank and All of Us. Using individuals from multiple ancestry groups is important in identifying genomic regions with QTL signal. However, we found that cross-ancestry prediction did not produce accurate marker effect estimates across ancestries. When the training sample size of the target ancestry group (AF) was very small (or zero), the estimated effects were poor and prediction accuracy among AF was low. However, when the AF ancestry group had a certain training sample size (e.g., at 15,000 AF samples in Scenario 2), it was more accurate to use a smaller European (EU) training sample than was available, since increasing the number of EU samples far beyond the number of AF samples caused the EU data to dominate the effect estimation, resulting in poorer prediction accuracy when testing among AF ancestry. These results might suggest that the QTL, or QTL effects, are not the same across ancestry groups, but Hou et al., 2023³⁹ suggest that causal variants tend to be similar across ancestry groups. Our results may align with this if different LD patterns exist between SNP markers and QTL across ancestry groups, as these variations could alter the QTL effects captured by markers in each group. Indeed, previous studies have shown that there are LD and allele frequency differences between ancestry groups¹³,¹⁴,¹⁹.
Thus, the QTL effect that we may capture in a marker in one ancestry group may not exist in the other. SNP portability describes the transferability of a PGS across ancestry groups based on how similar the (local) genetic regions in the PGS are across ancestry groups. As shown by Lupi et al.¹⁴, the higher the portability, the better conserved the LD and allele frequency in the region between groups and the more portable a cross-ancestry PGS will be. Classifying SNPs based on their portability across ancestries showed that SNPs in higher portability regions improved predictive accuracy in PGS compared to SNPs with lower portability (Scenario 3). This was true even in within-ancestry predictions. This suggests that cross-ancestry SNP portability is a valuable tool for identifying regions that are more suitable for prediction across ancestries and those that are less suitable. Classifying SNPs by portability also demonstrated that the number of SNPs included in the PGS does not necessarily translate to better selection or identification of QTL or QTL markers. The SNP filtering threshold was less conservative (larger) in Scenario 3 compared to Scenario 1, yet under Scenario 3, some of the PGS yielded lower prediction correlation estimates (among $TST_{AF}$) even when more SNPs were selected. Since the portability-based SNP sets included substantially fewer GWAS peaks than the typical PGS, there was less variation in the QTL signal picked up by the portability-based SNP sets. This is similar to the observation of Kim et al., 2017², who compared SNP selection based on LD blocks to selecting the top SNPs independent of LD blocks and found that, for a small number of SNPs, the top SNP method underperformed because the top SNPs clustered in a few regions and thus had poor genome coverage.
Our findings demonstrate the inefficiency of using cross-ancestry data (EU) compared to within-ancestry data (AF) for PGS predictions in AF individuals, similar to that found by Lehmann et al., 2023⁶. Our results suggest that a significantly larger number of EU individuals is required to achieve the same prediction accuracy as a smaller number of AF individuals. This inefficiency is particularly pronounced at higher levels of prediction accuracy, where the required EU sample size increases disproportionately compared to the AF sample size required. This non-linear relationship indicates diminishing returns from adding more cross-ancestry data beyond a certain point. Yet, our results highlight the need for further increasing non-EU data collection, since PGS involving both within- and cross-ancestry data still greatly improved upon the pure within-ancestry prediction at the limited sample sizes available of AF individuals.

Our study has some limitations. First, this study used data from different cohorts, and the cohorts used different genotyping platforms. Therefore, we obtained a common set of SNPs with sufficient genome coverage (calls for All of Us and the UK Biobank imputed SNP set). This is a limitation since SNP imputation can induce artifacts related to the reference panels used for imputation, which are often EU-dominant. Additionally, imputed SNPs have a higher marker density and higher LD compared to genotyped SNPs. Another limitation is that this study only evaluated height. Thus, our results are not necessarily representative of other traits with different heritability, polygenicity, and genetic architecture. Lehmann et al., 2023⁶, found that for cross-ancestry prediction, the optimal training strategy, e.g., the sample sizes of each ancestry (EU and non-EU), varied substantially depending on the trait. Nevertheless, their conclusion that additional non-EU genomic data collection is critical is consistent with our findings.
These findings have important implications for genomic research and the development of PGS. First, they highlight the necessity of increasing sample sizes for non-European ancestry groups to achieve more accurate prediction. Second, our findings show the value of cross-ancestry information borrowing to identify genomic regions with QTL signals. Finally, they highlight the limitations in estimating effects across ancestry groups. Overall, by highlighting the limitations and inefficiencies in using cross-ancestry data, our findings advocate for prioritizing the continuation of ongoing efforts in collecting data for underrepresented ancestry groups.

REFERENCES

1. GWAS Catalog [Internet]. [cited 2023 Apr 20]. Available from: https://www.ebi.ac.uk/gwas/
2. Kim H, Grueneberg A, Vazquez AI, Hsu S, de los Campos G. Will Big Data Close the Missing Heritability Gap? Genetics. 2017;207(3):1135–45.
3. Lello L, Avery SG, Tellier L, Vazquez AI, de los Campos G, Hsu SDH. Accurate Genomic Prediction of Human Height. Genetics. 2018;210(2):477–97.
4. de los Campos G, Vazquez AI, Hsu S, Lello L. Complex-Trait Prediction in the Era of Big Data. Trends Genet. 2018;34(10):746–54.
5. Albiñana C, Zhu Z, Schork AJ, Ingason A, Aschard H, Brikell I, et al. Multi-PGS enhances polygenic prediction by combining 937 polygenic scores. Nat Commun. 2023 Aug 5;14(1):4702.
6. Lehmann B, Mackintosh M, McVean G, Holmes C. Optimal strategies for learning multi-ancestry polygenic scores vary across traits. Nat Commun. 2023 Jul 7;14(1):4023.
7. Maier RM, Zhu Z, Lee SH, Trzaskowski M, Ruderfer DM, Stahl EA, et al. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nat Commun. 2018 Mar 7;9(1):989.
8. de los Campos G, Sorensen D, Gianola D. Genomic Heritability: What Is It? Barsh GS, editor. PLoS Genet. 2015 May 5;11(5):e1005048.
9. Goddard M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009 Jun;136(2):245–57.
10. Goddard ME, Wray NR, Verbyla K, Visscher PM. Estimating Effects and Making Predictions from Genome-Wide Marker Data. Statist Sci. 2009 Nov 1;24(4):517–29.
11. Goddard ME, Hayes BJ, Meuwissen THE. Using the genomic relationship matrix to predict the accuracy of genomic selection. J Anim Breed Genet. 2011 Dec;128(6):409–21.
12. Daetwyler HD, Villanueva B, Woolliams JA. Accuracy of Predicting the Genetic Risk of Disease Using a Genome-Wide Approach. Weedon MN, editor. PLoS ONE. 2008 Oct 14;3(10):e3395.
13. Wang Y, Guo J, Ni G, Yang J, Visscher PM, Yengo L. Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations. Nat Commun. 2020 Jul 31;11(1):3865.
14. Lupi AS, Vazquez AI, de los Campos G. Mapping the relative accuracy of cross-ancestry prediction. Nat Commun. 2024; accepted 2024.
15. Sirugo G, Williams SM, Tishkoff SA. The Missing Diversity in Human Genetic Studies. Cell. 2019 Mar;177(1):26–31.
16. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. Clinical use of current polygenic risk scores may exacerbate health disparities. Nat Genet. 2019 Apr;51(4):584–91.
17. Duncan L, Shen H, Gelaye B, Meijsen J, Ressler K, Feldman M, et al. Analysis of polygenic risk score usage and performance in diverse human populations. Nat Commun. 2019 Jul 25;10(1):3328.
18. Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. The American Journal of Human Genetics. 2015 Oct;97(4):576–92.
19. Privé F, Aschard H, Carmi S, Folkersen L, Hoggart C, O’Reilly PF, et al. Portability of 245 polygenic scores when derived from the UK Biobank and applied to 9 ancestry groups from the same cohort. Am J Hum Genet. 2022 Jan 6;109(1):12–23.
20. Zhao Z, Fritsche LG, Smith JA, Mukherjee B, Lee S. The construction of cross-population polygenic risk scores using transfer learning. Am J Hum Genet. 2022 Nov 3;109(11):1998–2008.
21. Cavazos TB, Witte JS. Inclusion of variants discovered from diverse populations improves polygenic risk score transferability. Human Genetics and Genomics Advances. 2021 Jan;2(1):100017.
23. Guo J, Bakshi A, Wang Y, Jiang L, Yengo L, Goddard ME, et al. Quantifying genetic heterogeneity between continental populations for human height and body mass index. Sci Rep. 2021 Mar 4;11(1):5240.
24. Ding Y, Hou K, Xu Z, Pimplaskar A, Petter E, Boulier K, et al. Polygenic scoring accuracy varies across the genetic ancestry continuum. Nature. 2023 Jun 22;618(7966):774–81.
25. Shi H, Burch KS, Johnson R, Freund MK, Kichaev G, Mancuso N, et al. Localizing Components of Shared Transethnic Genetic Architecture of Complex Traits from GWAS Summary Data. The American Journal of Human Genetics. 2020 Jun;106(6):805–17.
26. Hu S, Ferreira LAF, Shi S, Hellenthal G, Marchini J, Lawson DJ, et al. Leveraging fine-scale population structure reveals conservation in genetic effect sizes between human populations across a range of human phenotypes [Internet]. 2023 [cited 2024 Aug 26]. Available from: http://biorxiv.org/lookup/doi/10.1101/2023.08.08.552281
27. Majara L, Kalungi A, Koen N, Tsuo K, Wang Y, Gupta R, et al. Low and differential polygenic score generalizability among African populations due largely to genetic diversity. Human Genetics and Genomics Advances. 2023 Apr;4(2):100184.
28. Wang Y, Namba S, Lopera E, Kerminen S, Tsuo K, Läll K, et al. Global Biobank analyses provide lessons for developing polygenic risk scores across diverse cohorts. Cell Genomics. 2023 Jan;3(1):100241.
29. Ruan Y, Lin YF, Feng YCA, Chen CY, Lam M, Guo Z, et al. Improving polygenic prediction in ancestrally diverse populations. Nat Genet. 2022 May;54(5):573–80.
30. Márquez-Luna C, Loh PR, South Asian Type 2 Diabetes (SAT2D) Consortium, SIGMA Type 2 Diabetes Consortium, Price AL. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genet Epidemiol. 2017 Dec;41(8):811–23.
31. Hoggart CJ, Choi SW, García-González J, Souaiaia T, Preuss M, O’Reilly PF. BridgePRS leverages shared genetic effects across ancestries to increase polygenic risk score portability. Nat Genet. 2024 Jan;56(1):180–6.
32. Mester R, Hou K, Ding Y, Meeks G, Burch KS, Bhattacharya A, et al. Impact of cross-ancestry genetic architecture on GWASs in admixed populations. The American Journal of Human Genetics. 2023 Jun;110(6):927–39.
33. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018 Oct;562(7726):203–9.
34. The All of Us Research Program Investigators. The “All of Us” Research Program. N Engl J Med. 2019 Aug 15;381(7):668–76.
35. Controlled CDR Directory [Internet]. User Support. 2024 [cited 2024 Oct 18]. Available from: https://support.researchallofus.org/hc/en-us/articles/4616869437204-Controlled-CDR-Directory
36. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaSci. 2015 Dec;4(1):7.
37. Lee CH, Cook S, Lee JS, Han B. Comparison of Two Meta-Analysis Methods: Inverse-Variance-Weighted Average and Weighted Sum of Z-Scores. Genomics Inform. 2016;14(4):173.
38. Pérez P, de los Campos G. Genome-wide regression and prediction with the BGLR statistical package. Genetics. 2014 Oct;198(2):483–95.
39. Hou K, Ding Y, Xu Z, Wu Y, Bhattacharya A, Mester R, et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat Genet. 2023 Mar 20;55:549–58.

APPENDIX C: Chapter 3 Supplementary Figures

a. Scenario 2: African testing set, $TST_{AF}$, with a p = 817 SNP set

Figure C1: PGS fixing the SNP set but varying the training sample sizes used to estimate SNP effects.
The prediction correlation, $R_{TST}$, for height using different PGS for: (a) the African testing set, $TST_{AF}$, and (b) the European testing set, $TST_{EU}$, for different combinations of $TRN_{AF}$ and $TRN_{EU}$ sample sizes used for effect estimation. The SNPs entering into each PGS are the same (p = 817 SNPs), and the number is noted in parentheses. For $TST_{AF}$, when there are no AF in the training ($n_{TRN_{AF}}$ = 0), this is pure cross-ancestry prediction, and when there are no EU in the training ($n_{TRN_{EU}}$ = 0), this is pure within-ancestry prediction (and vice versa for $TST_{EU}$). The scale of the prediction correlation is based on the values from Figure 9a for Figure 11a (and from Figure 9b for Figure 11b) to allow for straightforward comparison between the plots.

Figure C1 (cont’d)
b. Scenario 2: European testing set, $TST_{EU}$, with a p = 817 SNP set

a. Scenario 3: African testing set, $TST_{AF}$

Figure C2: The percent increase in prediction R-squared, $\left(\frac{R^2_{TST}(\text{most portable}) - R^2_{TST}(\text{least portable})}{R^2_{TST}(\text{least portable})}\right) \times 100\%$¹⁴, for height using different PGS for: (a) AF ($TST_{AF}$) and (b) EU ($TST_{EU}$) when using the top 20% most portable SNPs (based on MC-ANOVA’s cross-ancestry R-squared¹⁴) compared to the bottom 20% (least portable) SNPs by training set sample size (AF [$TRN_{AF}$] and EU [$TRN_{EU}$]) used for SNP filtering and estimation.

Figure C2 (cont’d)
b. Scenario 3: European testing set, $TST_{EU}$

Supplementary Tables

Table C1: Descriptive statistics of the European (EU) and African (AF) training and testing sets. Continuous traits are described by the mean plus or minus one standard deviation.
| Ancestry Group | Sample Size | Female (%) | Age (years) | Height (cm) |
|---|---|---|---|---|
| European (EU) | n=5,000 | 53.7 | 56.9 ± 8.0 | 168.9 ± 9.2 |
| European (EU) | n=10,000 | 53.2 | 56.8 ± 8.0 | 169.0 ± 9.1 |
| European (EU) | n=25,000 | 53.1 | 56.7 ± 8.0 | 169.1 ± 9.2 |
| European (EU) | n=50,000 | 53.3 | 56.8 ± 8.0 | 169.0 ± 9.2 |
| European (EU) | n=75,000 | 53.2 | 56.8 ± 8.0 | 169.1 ± 9.2 |
| European (EU) | n=100,000 | 53.1 | 56.8 ± 8.0 | 169.1 ± 9.2 |
| European (EU) | n=150,000 | 53.2 | 56.8 ± 8.0 | 169.1 ± 9.2 |
| European (EU) | n=200,000 | 53.3 | 56.8 ± 8.0 | 169.1 ± 9.2 |
| European (EU) | n=250,000 | 53.2 | 56.8 ± 8.0 | 169.1 ± 9.2 |
| European (EU) | Testing Set (n=10,000) | 54.1 | 56.7 ± 8.0 | 168.9 ± 9.2 |
| African (AF) | n=5,000 | 55.7 | 48.7 ± 13.4 | 169.5 ± 9.8 |
| African (AF) | n=7,500 | 55.7 | 48.7 ± 13.4 | 169.5 ± 9.8 |
| African (AF) | n=10,000 | 55.9 | 48.6 ± 13.5 | 169.4 ± 9.9 |
| African (AF) | n=15,000 | 56.5 | 48.7 ± 13.5 | 169.4 ± 9.8 |
| African (AF) | n=20,000 | 56.6 | 48.8 ± 13.5 | 169.4 ± 9.8 |
| African (AF) | n=25,000 | 56.6 | 48.8 ± 13.4 | 169.3 ± 9.7 |
| African (AF) | n=30,000 | 56.6 | 48.9 ± 13.4 | 169.4 ± 9.7 |
| African (AF) | n=35,000 | 56.7 | 48.9 ± 13.4 | 169.3 ± 9.8 |
| African (AF) | n=40,000 | 56.7 | 48.9 ± 13.4 | 169.3 ± 9.8 |
| African (AF) | Testing Set (n=9,078) | 56.0 | 49.1 ± 13.4 | 169.5 ± 9.7 |

Table C2: The number of AF peaks at an R-squared threshold of 0.1 for each combination of training sample sizes (EU and AF). SNP set refers to the scenario under which the SNPs were selected. ‘1e-4’ is the p-value cutoff used in Scenario 1, and ‘1e-2 Low’ and ‘1e-2 High’ are the p-value cutoffs and portability sets used in Scenario 3.
| AF Sample Size | EU Sample Size | SNP Set | # of Peaks | Total # of SNPs | # of Peaks out of # of SNPs (%) |
|---|---|---|---|---|---|
| 5000 | 0 | 1e-4 | 72 | 75 | 96.0 |
| 7500 | 0 | 1e-4 | 76 | 87 | 87.4 |
| 10000 | 0 | 1e-4 | 102 | 119 | 85.7 |
| 15000 | 0 | 1e-4 | 147 | 210 | 70.0 |
| 20000 | 0 | 1e-4 | 205 | 374 | 54.8 |
| 25000 | 0 | 1e-4 | 273 | 509 | 53.6 |
| 30000 | 0 | 1e-4 | 351 | 680 | 51.6 |
| 35000 | 0 | 1e-4 | 413 | 843 | 49.0 |
| 40000 | 0 | 1e-4 | 513 | 1058 | 48.5 |
| 5000 | 5000 | 1e-4 | 70 | 81 | 86.4 |
| 7500 | 5000 | 1e-4 | 74 | 92 | 80.4 |
| 10000 | 5000 | 1e-4 | 83 | 106 | 78.3 |
| 15000 | 5000 | 1e-4 | 123 | 156 | 78.8 |
| 20000 | 5000 | 1e-4 | 170 | 260 | 65.4 |
| 25000 | 5000 | 1e-4 | 238 | 380 | 62.6 |
| 30000 | 5000 | 1e-4 | 282 | 466 | 60.5 |
| 35000 | 5000 | 1e-4 | 346 | 610 | 56.7 |
| 40000 | 5000 | 1e-4 | 455 | 834 | 54.6 |
| 5000 | 10000 | 1e-4 | 97 | 143 | 67.8 |
| 7500 | 10000 | 1e-4 | 100 | 134 | 74.6 |
| 10000 | 10000 | 1e-4 | 109 | 153 | 71.2 |
| 15000 | 10000 | 1e-4 | 155 | 211 | 73.5 |
| 20000 | 10000 | 1e-4 | 180 | 277 | 65.0 |
| 25000 | 10000 | 1e-4 | 233 | 374 | 62.3 |
| 30000 | 10000 | 1e-4 | 289 | 466 | 62.0 |
| 35000 | 10000 | 1e-4 | 332 | 567 | 58.6 |
| 40000 | 10000 | 1e-4 | 424 | 735 | 57.7 |
| 5000 | 25000 | 1e-4 | 345 | 710 | 48.6 |
| 7500 | 25000 | 1e-4 | 366 | 716 | 51.1 |
| 10000 | 25000 | 1e-4 | 353 | 684 | 51.6 |
| 15000 | 25000 | 1e-4 | 369 | 714 | 51.7 |
| 20000 | 25000 | 1e-4 | 405 | 770 | 52.6 |
| 25000 | 25000 | 1e-4 | 433 | 816 | 53.1 |
| 30000 | 25000 | 1e-4 | 475 | 895 | 53.1 |
| 35000 | 25000 | 1e-4 | 507 | 994 | 51.0 |
| 40000 | 25000 | 1e-4 | 553 | 1084 | 51.0 |
| 5000 | 50000 | 1e-4 | 886 | 2100 | 42.2 |
| 7500 | 50000 | 1e-4 | 878 | 2036 | 43.1 |
| 10000 | 50000 | 1e-4 | 879 | 1998 | 44.0 |
| 15000 | 50000 | 1e-4 | 896 | 1984 | 45.2 |
| 20000 | 50000 | 1e-4 | 908 | 1974 | 46.0 |
| 25000 | 50000 | 1e-4 | 921 | 1984 | 46.4 |
| 30000 | 50000 | 1e-4 | 956 | 2026 | 47.2 |
| 35000 | 50000 | 1e-4 | 977 | 2029 | 48.2 |
| 40000 | 50000 | 1e-4 | 1022 | 2079 | 49.2 |
| 5000 | 75000 | 1e-4 | 1519 | 3670 | 41.4 |
| 7500 | 75000 | 1e-4 | 1524 | 3649 | 41.8 |
| 10000 | 75000 | 1e-4 | 1517 | 3635 | 41.7 |
| 15000 | 75000 | 1e-4 | 1513 | 3588 | 42.2 |
| 20000 | 75000 | 1e-4 | 1506 | 3556 | 42.4 |
| 25000 | 75000 | 1e-4 | 1512 | 3535 | 42.8 |
| 30000 | 75000 | 1e-4 | 1537 | 3510 | 43.8 |
| 35000 | 75000 | 1e-4 | 1547 | 3509 | 44.1 |
| 40000 | 75000 | 1e-4 | 1570 | 3508 | 44.8 |
| 5000 | 100000 | 1e-4 | 2228 | 5625 | 39.6 |
| 7500 | 100000 | 1e-4 | 2211 | 5510 | 40.1 |
| 10000 | 100000 | 1e-4 | 2188 | 5445 | 40.2 |
| 15000 | 100000 | 1e-4 | 2144 | 5301 | 40.4 |
| 20000 | 100000 | 1e-4 | 2136 | 5253 | 40.7 |
| 25000 | 100000 | 1e-4 | 2134 | 5233 | 40.8 |
| 30000 | 100000 | 1e-4 | 2150 | 5195 | 41.4 |
| 35000 | 100000 | 1e-4 | 2154 | 5140 | 41.9 |
| 40000 | 100000 | 1e-4 | 2165 | 5112 | 42.4 |
| 5000 | 150000 | 1e-4 | 3569 | 9138 | 39.1 |
| 7500 | 150000 | 1e-4 | 3533 | 9055 | 39.0 |
| 10000 | 150000 | 1e-4 | 3530 | 9016 | 39.2 |
| 15000 | 150000 | 1e-4 | 3478 | 8914 | 39.0 |
| 20000 | 150000 | 1e-4 | 3510 | 8760 | 40.1 |
| 25000 | 150000 | 1e-4 | 3463 | 8680 | 39.9 |
| 30000 | 150000 | 1e-4 | 3414 | 8536 | 40.0 |
| 35000 | 150000 | 1e-4 | 3422 | 8512 | 40.2 |
| 40000 | 150000 | 1e-4 | 3433 | 8478 | 40.5 |
| 5000 | 200000 | 1e-4 | 4880 | 12364 | 39.5 |
| 7500 | 200000 | 1e-4 | 4849 | 12253 | 39.6 |
| 10000 | 200000 | 1e-4 | 4816 | 12171 | 39.6 |
| 15000 | 200000 | 1e-4 | 4794 | 12061 | 39.7 |
| 20000 | 200000 | 1e-4 | 4795 | 12006 | 39.9 |
| 25000 | 200000 | 1e-4 | 4805 | 11938 | 40.2 |
| 30000 | 200000 | 1e-4 | 4784 | 11864 | 40.3 |
| 35000 | 200000 | 1e-4 | 4768 | 11775 | 40.5 |
| 40000 | 200000 | 1e-4 | 4794 | 11802 | 40.6 |
| 5000 | 250000 | 1e-4 | 6236 | 15719 | 39.7 |
| 7500 | 250000 | 1e-4 | 6209 | 15662 | 39.6 |
| 10000 | 250000 | 1e-4 | 6206 | 15586 | 39.8 |
| 15000 | 250000 | 1e-4 | 6172 | 15418 | 40.0 |
| 20000 | 250000 | 1e-4 | 6167 | 15371 | 40.1 |
| 25000 | 250000 | 1e-4 | 6104 | 15233 | 40.1 |
| 30000 | 250000 | 1e-4 | 6109 | 15177 | 40.3 |
| 35000 | 250000 | 1e-4 | 6116 | 15153 | 40.4 |
| 40000 | 250000 | 1e-4 | 6085 | 15038 | 40.5 |
| 5000 | 0 | 1e-2 Low | 18 | 1055 | 1.7 |
| 7500 | 0 | 1e-2 Low | 22 | 1190 | 1.8 |
| 10000 | 0 | 1e-2 Low | 25 | 1256 | 2.0 |
| 15000 | 0 | 1e-2 Low | 27 | 1443 | 1.9 |
| 20000 | 0 | 1e-2 Low | 58 | 1708 | 3.4 |
| 25000 | 0 | 1e-2 Low | 69 | 1954 | 3.5 |
| 30000 | 0 | 1e-2 Low | 81 | 2153 | 3.8 |
| 35000 | 0 | 1e-2 Low | 105 | 2305 | 4.6 |
| 40000 | 0 | 1e-2 Low | 115 | 2548 | 4.5 |
| 5000 | 5000 | 1e-2 Low | 16 | 1132 | 1.4 |
| 7500 | 5000 | 1e-2 Low | 15 | 1193 | 1.3 |
| 10000 | 5000 | 1e-2 Low | 14 | 1230 | 1.1 |
| 15000 | 5000 | 1e-2 Low | 16 | 1368 | 1.2 |
| 20000 | 5000 | 1e-2 Low | 38 | 1546 | 2.5 |
| 25000 | 5000 | 1e-2 Low | 57 | 1744 | 3.3 |
| 30000 | 5000 | 1e-2 Low | 71 | 1911 | 3.7 |
| 35000 | 5000 | 1e-2 Low | 81 | 2096 | 3.9 |
| 40000 | 5000 | 1e-2 Low | 115 | 2299 | 5.0 |
| 5000 | 10000 | 1e-2 Low | 15 | 1248 | 1.2 |
| 7500 | 10000 | 1e-2 Low | 12 | 1293 | 0.9 |
| 10000 | 10000 | 1e-2 Low | 15 | 1312 | 1.1 |
| 15000 | 10000 | 1e-2 Low | 24 | 1423 | 1.7 |
| 20000 | 10000 | 1e-2 Low | 33 | 1529 | 2.2 |
| 25000 | 10000 | 1e-2 Low | 44 | 1694 | 2.6 |
| 30000 | 10000 | 1e-2 Low | 61 | 1873 | 3.3 |
| 35000 | 10000 | 1e-2 Low | 73 | 2019 | 3.6 |
| 40000 | 10000 | 1e-2 Low | 96 | 2211 | 4.3 |
| 5000 | 25000 | 1e-2 Low | 69 | 1892 | 3.6 |
| 7500 | 25000 | 1e-2 Low | 73 | 1877 | 3.9 |
| 10000 | 25000 | 1e-2 Low | 71 | 1865 | 3.8 |
| 15000 | 25000 | 1e-2 Low | 77 | 1908 | 4.0 |
| 20000 | 25000 | 1e-2 Low | 88 | 2004 | 4.4 |
| 25000 | 25000 | 1e-2 Low | 89 | 2090 | 4.3 |
| 30000 | 25000 | 1e-2 Low | 88 | 2164 | 4.1 |
| 35000 | 25000 | 1e-2 Low | 104 | 2247 | 4.6 |
| 40000 | 25000 | 1e-2 Low | 104 | 2401 | 4.3 |
| 5000 | 50000 | 1e-2 Low | 191 | 2913 | 6.6 |
| 7500 | 50000 | 1e-2 Low | 195 | 2884 | 6.8 |
| 10000 | 50000 | 1e-2 Low | 195 | 2852 | 6.8 |
| 15000 | 50000 | 1e-2 Low | 202 | 2835 | 7.1 |
| 20000 | 50000 | 1e-2 Low | 200 | 2874 | 7.0 |
| 25000 | 50000 | 1e-2 Low | 196 | 2906 | 6.7 |
| 30000 | 50000 | 1e-2 Low | 196 | 2962 | 6.6 |
| 35000 | 50000 | 1e-2 Low | 213 | 3002 | 7.1 |
| 40000 | 50000 | 1e-2 Low | 218 | 3100 | 7.0 |
| 5000 | 75000 | 1e-2 Low | 372 | 3899 | 9.5 |
| 7500 | 75000 | 1e-2 Low | 375 | 3873 | 9.7 |
| 10000 | 75000 | 1e-2 Low | 375 | 3856 | 9.7 |
| 15000 | 75000 | 1e-2 Low | 367 | 3818 | 9.6 |
| 20000 | 75000 | 1e-2 Low | 354 | 3836 | 9.2 |
| 25000 | 75000 | 1e-2 Low | 359 | 3815 | 9.4 |
| 30000 | 75000 | 1e-2 Low | 368 | 3856 | 9.5 |
| 35000 | 75000 | 1e-2 Low | 368 | 3886 | 9.5 |
| 40000 | 75000 | 1e-2 Low | 370 | 3930 | 9.4 |
| 5000 | 100000 | 1e-2 Low | 543 | 4856 | 11.2 |
| 7500 | 100000 | 1e-2 Low | 535 | 4827 | 11.1 |
| 10000 | 100000 | 1e-2 Low | 536 | 4796 | 11.2 |
| 15000 | 100000 | 1e-2 Low | 550 | 4769 | 11.5 |
| 20000 | 100000 | 1e-2 Low | 536 | 4762 | 11.3 |
| 25000 | 100000 | 1e-2 Low | 546 | 4768 | 11.5 |
| 30000 | 100000 | 1e-2 Low | 541 | 4784 | 11.3 |
| 35000 | 100000 | 1e-2 Low | 537 | 4776 | 11.2 |
| 40000 | 100000 | 1e-2 Low | 529 | 4797 | 11.0 |
| 5000 | 150000 | 1e-2 Low | 884 | 6516 | 13.6 |
| 7500 | 150000 | 1e-2 Low | 874 | 6496 | 13.5 |
| 10000 | 150000 | 1e-2 Low | 884 | 6463 | 13.7 |
| 15000 | 150000 | 1e-2 Low | 867 | 6415 | 13.5 |
| 20000 | 150000 | 1e-2 Low | 844 | 6389 | 13.2 |
| 25000 | 150000 | 1e-2 Low | 859 | 6353 | 13.5 |
| 30000 | 150000 | 1e-2 Low | 857 | 6362 | 13.5 |
| 35000 | 150000 | 1e-2 Low | 849 | 6360 | 13.3 |
| 40000 | 150000 | 1e-2 Low | 853 | 6368 | 13.4 |
| 5000 | 200000 | 1e-2 Low | 1236 | 7909 | 15.6 |
| 7500 | 200000 | 1e-2 Low | 1237 | 7862 | 15.7 |
| 10000 | 200000 | 1e-2 Low | 1220 | 7840 | 15.6 |
| 15000 | 200000 | 1e-2 Low | 1219 | 7794 | 15.6 |
| 20000 | 200000 | 1e-2 Low | 1214 | 7759 | 15.6 |
| 25000 | 200000 | 1e-2 Low | 1210 | 7718 | 15.7 |
| 30000 | 200000 | 1e-2 Low | 1191 | 7720 | 15.4 |
| 35000 | 200000 | 1e-2 Low | 1175 | 7692 | 15.3 |
| 40000 | 200000 | 1e-2 Low | 1220 | 7716 | 15.8 |
| 5000 | 250000 | 1e-2 Low | 1606 | 9268 | 17.3 |
| 7500 | 250000 | 1e-2 Low | 1609 | 9235 | 17.4 |
| 10000 | 250000 | 1e-2 Low | 1594 | 9203 | 17.3 |
| 15000 | 250000 | 1e-2 Low | 1574 | 9181 | 17.1 |
| 20000 | 250000 | 1e-2 Low | 1594 | 9128 | 17.5 |
| 25000 | 250000 | 1e-2 Low | 1587 | 9113 | 17.4 |
| 30000 | 250000 | 1e-2 Low | 1577 | 9077 | 17.4 |
| 35000 | 250000 | 1e-2 Low | 1571 | 9061 | 17.3 |
| 40000 | 250000 | 1e-2 Low | 1574 | 9078 | 17.3 |
| 5000 | 0 | 1e-2 High | 18 | 1055 | 1.7 |
| 7500 | 0 | 1e-2 High | 15 | 1188 | 1.3 |
| 10000 | 0 | 1e-2 High | 18 | 1255 | 1.4 |
| 15000 | 0 | 1e-2 High | 37 | 1443 | 2.6 |
| 20000 | 0 | 1e-2 High | 50 | 1706 | 2.9 |
| 25000 | 0 | 1e-2 High | 62 | 1953 | 3.2 |
| 30000 | 0 | 1e-2 High | 91 | 2153 | 4.2 |
| 35000 | 0 | 1e-2 High | 97 | 2305 | 4.2 |
| 40000 | 0 | 1e-2 High | 131 | 2548 | 5.1 |
| 5000 | 5000 | 1e-2 High | 9 | 1132 | 0.8 |
| 7500 | 5000 | 1e-2 High | 14 | 1194 | 1.2 |
| 10000 | 5000 | 1e-2 High | 22 | 1230 | 1.8 |
| 15000 | 5000 | 1e-2 High | 36 | 1368 | 2.6 |
| 20000 | 5000 | 1e-2 High | 46 | 1545 | 3.0 |
| 25000 | 5000 | 1e-2 High | 59 | 1743 | 3.4 |
| 30000 | 5000 | 1e-2 High | 66 | 1912 | 3.5 |
| 35000 | 5000 | 1e-2 High | 91 | 2094 | 4.3 |
| 40000 | 5000 | 1e-2 High | 118 | 2299 | 5.1 |
| 5000 | 10000 | 1e-2 High | 27 | 1248 | 2.2 |
| 7500 | 10000 | 1e-2 High | 27 | 1292 | 2.1 |
| 10000 | 10000 | 1e-2 High | 34 | 1313 | 2.6 |
| 15000 | 10000 | 1e-2 High | 45 | 1423 | 3.2 |
| 20000 | 10000 | 1e-2 High | 57 | 1529 | 3.7 |
| 25000 | 10000 | 1e-2 High | 73 | 1700 | 4.3 |
| 30000 | 10000 | 1e-2 High | 87 | 1873 | 4.6 |
| 35000 | 10000 | 1e-2 High | 91 | 2019 | 4.5 |
| 40000 | 10000 | 1e-2 High | 118 | 2210 | 5.3 |
| 5000 | 25000 | 1e-2 High | 116 | 1892 | 6.1 |
| 7500 | 25000 | 1e-2 High | 123 | 1875 | 6.6 |
| 10000 | 25000 | 1e-2 High | 117 | 1865 | 6.3 |
| 15000 | 25000 | 1e-2 High | 123 | 1905 | 6.5 |
| 20000 | 25000 | 1e-2 High | 131 | 2004 | 6.5 |
| 25000 | 25000 | 1e-2 High | 143 | 2090 | 6.8 |
| 30000 | 25000 | 1e-2 High | 151 | 2164 | 7.0 |
| 35000 | 25000 | 1e-2 High | 157 | 2247 | 7.0 |
| 40000 | 25000 | 1e-2 High | 163 | 2402 | 6.8 |
| 5000 | 50000 | 1e-2 High | 252 | 2917 | 8.6 |
| 7500 | 50000 | 1e-2 High | 254 | 2883 | 8.8 |
| 10000 | 50000 | 1e-2 High | 268 | 2852 | 9.4 |
| 15000 | 50000 | 1e-2 High | 262 | 2833 | 9.2 |
| 20000 | 50000 | 1e-2 High | 266 | 2875 | 9.3 |
| 25000 | 50000 | 1e-2 High | 265 | 2905 | 9.1 |
| 30000 | 50000 | 1e-2 High | 262 | 2964 | 8.8 |
| 35000 | 50000 | 1e-2 High | 269 | 3001 | 9.0 |
| 40000 | 50000 | 1e-2 High | 281 | 3100 | 9.1 |
| 5000 | 75000 | 1e-2 High | 405 | 3899 | 10.4 |
| 7500 | 75000 | 1e-2 High | 415 | 3872 | 10.7 |
| 10000 | 75000 | 1e-2 High | 414 | 3856 | 10.7 |
| 15000 | 75000 | 1e-2 High | 403 | 3819 | 10.6 |
| 20000 | 75000 | 1e-2 High | 410 | 3836 | 10.7 |
| 25000 | 75000 | 1e-2 High | 409 | 3815 | 10.7 |
| 30000 | 75000 | 1e-2 High | 396 | 3856 | 10.3 |
| 35000 | 75000 | 1e-2 High | 404 | 3885 | 10.4 |
| 40000 | 75000 | 1e-2 High | 420 | 3930 | 10.7 |
| 5000 | 100000 | 1e-2 High | 593 | 4856 | 12.2 |
| 7500 | 100000 | 1e-2 High | 593 | 4828 | 12.3 |
| 10000 | 100000 | 1e-2 High | 583 | 4797 | 12.2 |
| 15000 | 100000 | 1e-2 High | 566 | 4768 | 11.9 |
| 20000 | 100000 | 1e-2 High | 587 | 4759 | 12.3 |
| 25000 | 100000 | 1e-2 High | 576 | 4768 | 12.1 |
| 30000 | 100000 | 1e-2 High | 575 | 4784 | 12.0 |
| 35000 | 100000 | 1e-2 High | 569 | 4779 | 11.9 |
| 40000 | 100000 | 1e-2 High | 566 | 4797 | 11.8 |
| 5000 | 150000 | 1e-2 High | 913 | 6516 | 14.0 |
| 7500 | 150000 | 1e-2 High | 917 | 6499 | 14.1 |
| 10000 | 150000 | 1e-2 High | 901 | 6463 | 13.9 |
| 15000 | 150000 | 1e-2 High | 895 | 6414 | 14.0 |
| 20000 | 150000 | 1e-2 High | 897 | 6391 | 14.0 |
| 25000 | 150000 | 1e-2 High | 870 | 6354 | 13.7 |
| 30000 | 150000 | 1e-2 High | 861 | 6362 | 13.5 |
| 35000 | 150000 | 1e-2 High | 861 | 6360 | 13.5 |
| 40000 | 150000 | 1e-2 High | 868 | 6369 | 13.6 |
| 5000 | 200000 | 1e-2 High | 1204 | 7902 | 15.2 |
| 7500 | 200000 | 1e-2 High | 1187 | 7861 | 15.1 |
| 10000 | 200000 | 1e-2 High | 1206 | 7838 | 15.4 |
| 15000 | 200000 | 1e-2 High | 1177 | 7794 | 15.1 |
| 20000 | 200000 | 1e-2 High | 1186 | 7760 | 15.3 |
| 25000 | 200000 | 1e-2 High | 1187 | 7716 | 15.4 |
| 30000 | 200000 | 1e-2 High | 1181 | 7719 | 15.3 |
| 35000 | 200000 | 1e-2 High | 1190 | 7690 | 15.5 |
| 40000 | 200000 | 1e-2 High | 1185 | 7714 | 15.4 |
| 5000 | 250000 | 1e-2 High | 1485 | 9272 | 16.0 |
| 7500 | 250000 | 1e-2 High | 1504 | 9237 | 16.3 |
| 10000 | 250000 | 1e-2 High | 1494 | 9208 | 16.2 |
| 15000 | 250000 | 1e-2 High | 1474 | 9179 | 16.1 |
| 20000 | 250000 | 1e-2 High | 1476 | 9126 | 16.2 |
| 25000 | 250000 | 1e-2 High | 1454 | 9104 | 16.0 |
| 30000 | 250000 | 1e-2 High | 1462 | 9075 | 16.1 |
| 35000 | 250000 | 1e-2 High | 1470 | 9074 | 16.2 |
| 40000 | 250000 | 1e-2 High | 1451 | 9079 | 16.0 |

CONCLUSION

In this dissertation, I discuss three projects that address challenges in the analysis and prediction of high-dimensional genetic data, focusing on utilizing local genetic information and improving the accuracy of models across underrepresented ancestry groups. One common approach in genomics is genome-wide analysis. However, genome-wide analyses pose challenges, such as heavy computational burdens and difficulties of interpretation. Performing the analysis locally instead can overcome some of these challenges. Both Chapters 1 and 2 present approaches that leverage local information, such as linkage disequilibrium.
Chapter 1 estimates local genetic covariances within segments in linkage disequilibrium, identifying segments whose directionality opposes the overall genetic correlation and would typically be masked in genome-wide correlation analyses. In the context of cross-ancestry prediction, the second study develops the MC-ANOVA method, which estimates the loss of prediction accuracy due to (local) differences in linkage disequilibrium and allele frequencies between ancestry groups. The study highlights the significant variability in prediction accuracy across local SNP segments, identifying some segments that are portable in PGS across ancestry groups and other segments that are not portable. Chapter 2 also highlights limitations in the available non-European data and the importance of continuing ongoing efforts to collect non-European genomic data. Thus, Chapter 3 explores the impact of sample size on cross-ancestry prediction accuracy by meta-analyzing data from the UK Biobank and All of Us to investigate how varying European and African ancestry training sample sizes affect prediction in African ancestry. The findings further demonstrate the importance of cross-ancestry sample sizes in improving prediction accuracy among underrepresented ancestry groups and emphasize the need for increased sample sizes of non-Europeans.

Collectively, these projects contribute important methodological advancements and computational tools in the field of statistical genetics, particularly in the context of leveraging local genetic and ancestry information.