INTRODUCINGSPARSITYINTOSELECTIONINDEXMETHODOLOGYWITH APPLICATIONSTOHIGH-THROUGHPUTPHENOTYPINGANDGENOMIC PREDICTION By MarcoAntonioLopezCruz ADISSERTATION Submittedto MichiganStateUniversity inpartialful˝llmentoftherequirements forthedegreeof PlantBreeding,GeneticsandBiotechnologyCropandSoilSciencesDoctorofPhilosophy 2020 ABSTRACT INTRODUCINGSPARSITYINTOSELECTIONINDEXMETHODOLOGYWITH APPLICATIONSTOHIGH-THROUGHPUTPHENOTYPINGANDGENOMIC PREDICTION By MarcoAntonioLopezCruz Researchinplantandanimalbreedinghasbeenlargelyfocusedonthedevelopmentofmethods foramoree˚cientselectionbyalteringthefactorsthata˙ectgeneticprogress:selectionintensity, selectionaccuracy,geneticvariance,andlengthofthebreedingcycle.Mostofthebreedinge˙orts havebeenprimarilytowardsincreasingselectionaccuracyandreducingthebreedingcycle. Genomicselectionhasbeensuccessfullyadoptedbymanypublicandprivatebreedingorgani- zations.Overyears,theseinstitutionshavedevelopedandaccumulatedlargevolumesofgenomic datalinkedtophenotypesfrommultiplepopulationsandmultiplegenerations.Thisdataabun- danceo˙erstheopportunitytorevolutionizegeneticresearch.However,thesedatasetsarealso increasinglyheterogeneous,withmanysubpopulationsandmultiplegenerationsrepresentedinthe data.Thistranslatesintopotentiallyheterogeneousallelefrequenciesanddi˙erentLDpatterns, thusleadingtoSNP-e˙ectheterogeneity. Genomicselectionmethodsweredevelopedwithreferencetohomogeneouspopulationsin whichSNP-e˙ectsareassumedconstantacrossthewholepopulation.Thesemethodsarenot necessarilyoptimalforthecontemporaryavailabledatasetsformodeltraining.Therefore,a˝rst focusofthisdissertationisondevelopingnovelmethodsthatcanleveragethelarge-scaleofmodern datasetswhilecopingwiththeheterogeneityandcomplexityofthistypeofdata. Inrecentyears,therehavealsobeenimportantadvancesinhigh-throughputphenotyping (HTP)technologiesthatcangeneratelargevolumesofdataatmultipletime-pointsofacrop. Examplesofthisincludehyper-spectralimagingtechnologiesthatcancapturethere˛ectanceof electromagneticpowerbycropsatpotentiallythousandsofwavelengths.TheintegrationofHTP ingeneticevaluationsrepresentsagreatopportunitytofurtheradvanceplantbreeding;however, thehigh-dimensionalnatureofHTPdataposesimportantchallenges.Therefore,asecondfocusof thisdissertationisonthedevelopmentofanovelapproachtoe˚cientlyincorporateHTPdatafor breedingvaluesprediction. Thus,thisdissertationaimstocontributenovelmethodsthatcanimprovetheaccuracyof genomicpredictionbyoptimizingtheuseoflarge,potentiallyheterogeneous,genomicdatasets andbyenablingtheintegrationofHTPdata.Wepresentanovelstatisticalapproachthatcombines thestandardselectionindexmethodologywithvariable-selectionmethodscommonlyusedin machinelearningandstatistics,anddevelopedsoftwaretoimplementthemethod.Ourapproach o˙erssolutionstobothgenomicselectionwithpotentiallyhighlyheterogeneousgenomicdatasets, andtheintegrationofHTPingeneticevaluations. Dedicatedtomymotherandtothememoryofmyfather. iv ACKNOWLEDGEMENTS IwanttoexpressmyspecialgratitudetoDr.GustavodelosCamposforhismentoringduringthe curseofmydoctorate,forallhisvaluableacademicandpersonaladvice,andfortheconstructive suggestionsduringtheplanninganddevelopmentofthisresearch. Iwanttothankmygraduatecommittee,Dr.EricOlson,Dr.GustavodelosCampos,Dr.David Douches,andDr.DechunWangforacceptingbeingpartofmyPh.D.committeeandfortheir adviceandassistance. IwanttoexpressmygratitudetoDr.JoseCrossaandDr.SusanneDreisigackerfromthe InternationalMaizeandWheatImprovementCenter(CIMMYT)totrustmeandencouragemeto pursueaPh.D.,andfortheirsupportingettingfundingformystudies. Io˙ermyspecialacknowledgmenttoDr.PaulinoPerezforbeingmy˝rstcontactwiththe ˝eldofStatisticalGeneticsandhissupportinthedevelopmentofthesoftwaregeneratedfromthis research.IwishtothanklabmembersoftheQuantGengroupandotherfriendsImetatMSUfor theirsupportandformakinganimpactonmylifeduringmyjourneyatMSU. IwanttoexpressmyacknowledgmentstotheMonsanto'sBeachell-BorlaugInternationalSchol- arshipProgram(MBBISP)forsponsoringmeduringthe˝rstfouryearsofmydoctorate.Likewise, IextendmygratitudetotheMSUGraduateSchoolforprovidingmethedissertationcompletion fellowshipinthelastsemesterofmyPh.D.IwouldalsoliketothankDr.JasonandDr.DanaLily, andMr.ChrisandMrs.JudithRossmanforgrantingmethegraduatestudentfunds. Lastly,Iwouldliketothankmyfamily,myparentsAmparoCruzandAntonioLopez,andmy siblingsJuan,Vlady,andLuis,foralltheirunconditionalsupporttomakethisdreampossible. v TABLEOFCONTENTS LISTOFTABLES ....................................... ix LISTOFFIGURES ....................................... xi CHAPTER1INTRODUCTION ............................... 1 1.1Chapter2:Incorporatinghyper-spectralimagedataintoselectionindicesfor breedingvalueprediction...............................4 1.2Chapters3and4:Improvingtheaccuracyofgenomicpredictionusingsparse selectionindices....................................6 CHAPTER2REGULARIZEDSELECTIONINDICESFORBREEDINGVALUEPRE- DICTIONUSINGHYPER-SPECTRALIMAGEDATA ........... 8 2.1Abstract........................................9 2.2Introduction......................................9 2.3Results.........................................11 2.3.1Regularizedselectionindices.........................11 2.3.1.1Reduced-rankselectionindices..................12 2.3.1.2Penalizedselectionindices.....................12 2.3.2Accuracyofindirectselection........................14 2.3.3Regularizedselectionindicesforwheatgrainyieldusinghyper-spectral imagedata...................................14 2.3.4Regularizationimprovestheheritabilityandtheaccuracyoftheindex...15 2.3.5Usingdatafrommultipletime-pointsfurtherimprovesselectionaccuracy.18 2.3.6L1-penalizationleadstosparseselectionindices...............18 2.3.7Comparisonwithphenotypicprediction...................19 2.4Discussion.......................................20 2.4.1IntegrationofPSIandPC-SIintogeneticevaluations............22 2.4.2Impactoftheuseofhigh-throughputphenotypesinbreedingprograms...24 2.4.3Regularizedselectionindicescanalsobeavaluabletoolingeneticresearch24 2.5Methods........................................25 2.5.1Standardselectionindex...........................25 2.5.2Reduced-rankselectionindex.........................26 2.5.3Penalizedselectionindices..........................26 2.5.4Data......................................28 2.5.5Phenotypepre-processing...........................28 2.5.6Heritabilityestimation............................29 2.5.7Training-testingpartitions..........................30 2.5.8Estimationofphenotypicandgeneticparameters..............31 2.5.9Estimationoftheaccuracyofindirectselection...............31 2.5.10Software....................................31 2.5.11Materialsanddataavailability........................32 vi 2.6Acknowledgments...................................32 CHAPTER3OPTIMALBREEDINGVALUEPREDICTIONUSINGASPARSEAM- ILINDEX .................................. 33 3.1Abstract........................................34 3.2Introduction......................................34 3.3MaterialsandMethods................................36 3.3.1SparseSelectionIndexMethodology.....................37 3.3.2Data......................................38 3.3.3Analyses....................................39 3.3.4Software....................................41 3.3.5Dataavailability................................41 3.4Results.........................................41 3.4.1Sparsityimprovespredictionaccuracy....................41 3.4.2Usinganinternalcross-validationtoachieveoptimalsparsity........43 3.4.3SparseSelectionIndicesbuildsubject-speci˝ctrainingsets.........45 3.4.4Genomicrelationshipsandweightsinstandardandsparseselectionindices47 3.5Discussion.......................................48 3.6Acknowledgments...................................52 CHAPTER4GENOMICPREDICTIONINMULTI-GENERATIONALMAIZEHY- BRIDSUSINGSPARSEKERNELMODELS ................ 53 4.1Abstract........................................54 4.2Introduction......................................54 4.3MaterialsandMethods................................57 4.3.1Genotypesandphenotypicdata........................57 4.3.2Phenotypespre-processing..........................58 4.3.3Genomicselectionmodels..........................59 4.3.4Variancecomponents.............................62 4.3.5Assessmentofpredictionaccuracy......................63 4.3.6Software....................................63 4.4Results.........................................65 4.4.1PredictionaccuracycomparisonofG-BLUPandK-BLUPmodels.....65 4.4.2E˙ectofsparsityonpredictionaccuracy...................66 4.4.3Automatictraining-sampleselection.....................70 4.5Discussion.......................................73 CHAPTER5CONCLUDINGREMARKSANDFUTUREDIRECTIONS ......... 76 APPENDICES ......................................... 77 APPENDIXASUPPLEMENTARYFIGURESANDTABLESFROMCHAP- TER2 ................................. 78 APPENDIXBSUPPLEMENTARYMATERIALFROMCHAPTER3 ....... 89 APPENDIXCSUPPLEMENTARYFIGURESANDTABLESFROMCHAP- TER4 ................................. 106 vii BIBLIOGRAPHY ........................................ 123 viii LISTOFTABLES Table2.1:Averagegrainyieldandheritabilitybyenvironmentalcondition..........15 Table2.2:Accuracyandrelativee˚ciencyofindirectselectionofanL1-penalizedSI usingdatafromoneandninetime-points......................18 Table3.1:Predictionaccuracy(averageacross100partitions)achievedbysparsese- lectionindices(SSIs)andbytheG-BLUP(standardSI),bydatasetand environmentalcondition...............................44 Table4.1:Trainingset(TS)compositionusedineachpredictionscenario.(Thepredic- tionsetwasthesameforalltrainingscenariosandconsistedof612randomly chosenindividualsfrom2019)............................64 Table4.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including15%ofsubjectsfromthe2019cycle),GY-OPTtrait-environment combination.....................................67 Table4.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including15%ofsubjectsfromthe2019cycle),PH-OPTtrait-environment combination.....................................69 TableA.1:Accuracyofindirectselection(averageover100training-testingpartitions) forbestphenotypicprediction(principalcomponents(PCR),L1-penalized prediction(L1-Phen),RNDVI,andGNDVI)andforbestgenotypicprediction (standardSI,optimalPC-SI,L1-PSI,andL2-PSI).................87 TableB.1:Numberofavailableobservations,averagegrainyield,andheritabilityby environmentalconditionfortheWheat-largedataset................96 TableB.2:Numberofavailableobservations,averagegrainyield,andheritabilityby environmentalconditionfortheWheat-599dataset................96 TableB.3:Maximumpredictionaccuracy(averageacross100partitions)achievedbythe SSIfordi˙erentvaluesoftheparameter ofanElastic-Net-typeSSI,by environmentalconditionfortheWheat-largedataset................97 TableB.4:Maximumpredictionaccuracy(averageacross100partitions)achievedbythe SSIfordi˙erentvaluesoftheparameter ofanElastic-Net-typeSSI,by environmentalconditionfortheWheat-599dataset................98 ix TableC.1:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,and10%ofsubjectsfromthe2019cycle),traitGY, environmentOPT..................................115 TableC.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitGY, environmentDRT..................................117 TableC.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,and10%ofsubjectsfromthe2019cycle),traitPH, environmentOPT..................................119 TableC.4:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitPH, environmentDRT..................................121 x LISTOFFIGURES Figure2.1:Predictionofthegeneticmeritforgrainyieldusinghyper-spectralcropimage data.(A)Dataconsistsofhyper-spectralre˛ectancedata( x i )andphenotypic measurementsofthetargettrait( y i ,e.g.,grainyield).(B)Asubsetofthedata (thetrainingset)isusedtoderivethecoe˚cientsofaselectionindex.(C) Thesecoe˚cientsarethenappliedtoimagedataofindividualsinthetesting settoderivetheindex( I i )foreachindividual.Thepredictiveabilityofthe indexisassessedbycalculatingtheaccuracyofindirectselection( Acc ¹Iº ) inthetestingset...................................13 Figure2.2:AccuracyofindirectselectionofregularizedSIsanditscomponents.Square rootheritability(green),geneticcorrelation(orange),andaccuracyofindirect selection(purple,allaveragedover100training-testingpartitions),versusthe numberofpredictorsusedtobuildtheindex:(A)numberofactivebands inthecaseoftheL1-PSI,or(B)numberofPCsinthePC-SI.Eachpanel representsoneenvironment(latesttime-point)...................16 Figure2.3:Accuracyofindirectselectionachievedbyastandard(SI)andbyregularized (PC-SIandL1-PSI)selectionindices.Thelinesprovidetheaverageaccuracy over100training-testingpartitions.Verticallinesrepresenta95%con˝dence intervalfortheaverage.Thehorizontalaxisgivesthetime-pointatwhich imageswerecollectedandareexpressedinbothdaysaftersowing(DAS)and stages(VEG=vegetative,GF=grain˝lling,MAT=maturity)............17 Figure2.4:Heatmapofregressioncoe˚cientsforL1-penalizedselectionindices.Sepa- rateindiceswerederivedforeachenvironmentusingmultitime-pointdata. DAS=daysaftersowing,VEG,GF,MATrepresentvegetative,grain-˝lling andmaturitystages,respectively.Thebottomcolor-barshowsthelightcolor associatedwitheachwavebandinthevisiblespectrum( 750nm);blackwas usedtorepresentthenear-infraredspectrum(wavelength>750nm).......19 Figure3.1:Predictionaccuracy(averageacross100trn-tstpartitions)oftheSSIversus the(average)numberofpredictorsintrainingsetsupportingtheSSIofeach individualintestingset(x-axis).Genomic-BLUP(bluerightmostpoint) appearsasaspecialcaseofanSSI.Eachpanelrepresentsoneenvironment withindataset.(A)Wheat-largedataset.(B)Wheat-599dataset.Vertical barsrepresenta95%con˝denceintervalfortheaverage..............42 xi Figure3.2:Predictionaccuracyoftheoptimalsparseselectionindex(SSI)versusthatof theG-BLUP.Eachpointrepresentsatrn-tstpartition(atotalof100partitions wereimplemented),thepointshapeandcolorrepresentenvironments.(A) Wheat-largedataset.(B)Wheat-599dataset.Thevalueof intheSSI wasestimatedusing105-foldcross-validationsconductedwithinthetraining data.Inparenthesis,bythelegend,isthep-valueforthetwo-sidedSign (binomial)testforwithin-environmentdi˙erencesinaccuracybetweenthe SSIandtheG-BLUP................................43 Figure3.3:Distributionofthenumberoftrainingsupportpoints( n sup )inoptimalsparse selectionindices(resultsobtainedover100trn-tstpartitions; n trn =sizeofthe trainingdataset),byenvironmentalcondition,Wheat-largedataset........45 Figure3.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)and thecorrespondingsupportpoints(green).Greypointsrepresentgenotypes thatdidnotcontributetothepredictionofthegeneticvalueofthegenotypein yellow.Allpanelsrepresentsolutionsfortheenvironment EHT ,Wheat-large dataset.......................................46 Figure3.5:(A)Weights( ij )ofastandardSI(G-BLUP)andoftheoptimalsparse selectionindex(SSI)versusthegenomicrelationship( g ij ).(B)Proportionof weightsintheSSIthatwerezero(non-active)andnon-zero(supportpoints); Wheat-largedataset,environment EHT ......................48 Figure4.1:(A)First3principalcomponentsoftheadditivegenomicrelationshipsmatrix, G .Pointsrepresentindividualsthatarecolorseparatedaccordingtocycle (2017,2018,or2019).(B)Heatmapofthegenomicrelationshipsmatrix.....64 Figure4.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthe datafromthe2017,2018,or2017+2018cyclesalone(top-leftpanel),orin combinationwithaproportion(5%,10%,15%)ofthedatafromthe2019 cycle.Thepredictionsetconsistedof612genotypesfromthe2019cycle thatwerenotusedformodeltraining.Modelswiththesameletterwithin panelindicatenosigni˝cantdi˙erencefromeachother( = 0 : 05 ,ANOVA followedbyTukeytest).GY-OPTtrait-environmentcombination.........66 Figure4.3:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(hori- zontalaxis)versusthepredictionaccuracyofallothermodels(verticalaxis ofeachpanel).(B)Predictionaccuracyofthestandard*-BLUPmodel(hori- zontalaxis)versusthepredictionaccuracyofitssparseversion(verticalaxis), bytypeofkernelusedinpanels.Eachpointrepresentatraining-testingpar- titionwithineachtrainingsetcomposition.Coloredpointsabove(below)the 45degreelinerepresentcasesforwhichonemodeloutperformedtheother model.P:p-valueforthetest(fromANOVA)fordi˙erencesinaccuracy betweenthetwomodels.TraitGY,environmentOPT...............68 xii Figure4.4:Heatmapofthecoe˚cientsintheHatmatrix( ~ B ¹ º G )ofthesparseG- BLUPmodelforonetraining-prediction(TS-PS)partitioninthepredictionof n PS = 612 individualsfrom2019using n TS = 2427 individuals(2017+2018 plus15%ofthe2019set).Predictedindividualsarepresentedincolumnsand trainingindividualsarepresentedinrowsseparatedbycycleandnumberof individualsinparentheses.Thevalueof wasobtainedbycross-validation. Eachcolumnrepresentsvaluesofthevector ~ b ¹ º i G = f ~ b ij g , j = 1 ;:::; 2427 (Equation(4.6)).Individualsnocontributingtothepredictionhaveacoe˚- cient ~ b ij = 0 representedingreycolor.Individualswithanon-zerocoe˚cient areshowninayellow-bluelogarithmscale(intheoriginalscale,yellowindi- cateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environment combination.....................................71 Figure4.5:Proportionofthetrainingindividualsfromeachcyclethatcontributedtothe predictionofthe612testinggenotypesfrom2019,usingsparsemodelswith di˙erentrelationshipmatrices(horizontalaxis): G , K 1 , K 2 ,or K A .The trainingsetwascomposedbyindividualsfrom2017( n = 901 )and2018 ( n = 1417 )alone(top-leftpanel)orincombinationwithaproportion(5%, 10%,15%)ofthedatafromthe2019cycle.GY-OPTtrait-environment combination.....................................72 FigureA.1:Box-plotofgrainyieldphenotypicrecordsbyenvironmentalcondition. n ˇ 3200 observationswithinenvironment.SD:standarddeviation..........79 FigureA.2:Lightre˛ectancepatternsasfunctionofthewavelength.Eachlinerepresents themean(across n ˇ 3200 observations)re˛ectanceforeachwaveband, withintime-point(˛ightdate).Withineachenvironment,meanswerescaled toliewithin0and1bydividingthembythemaximumaverage..........80 FigureA.3:AccuracyofindirectselectionofL1-PSIanditscomponents.Squareroot heritability,geneticcorrelationandaccuracyofindirectselection,allaveraged over100training-testingpartitionsversusthenumberofbandsenteringin theindex;bytime-point(DAS=daysaftersowing,Stage:VEG=vegetative, GF=grain˝lling,orMAT=maturity)withinenvironment.............81 FigureA.4:AccuracyofindirectselectionofL2-PSIanditscomponents.Squareroot heritability,geneticcorrelationandaccuracyofindirectselection,allaver- agedover100training-testingpartitionsversusthepenalizationparameter ( ,logarithmscale)usedtobuildtheindex;bytime-point(DAS=daysafter sowing,Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)within environment.....................................82 xiii FigureA.5:AccuracyofindirectselectionofPC-SIanditscomponents.Squareroother- itability,geneticcorrelationandaccuracyofindirectselection,allaveraged over100training-testingpartitionsversusthenumberofprincipalcompo- nentsusedtobuildtheindex;bytime-point(DAS=daysaftersowing,Stage: VEG=vegetative,GF=grain˝lling,orMAT=maturity)withinenvironment....83 FigureA.6:Squarerootofheritabilityofthestandard(SI),oftheregularized(PC-SIand L1-PSI)selectionindices,andoftheRNDVI.Thelinesprovidetheaverage squarerootheritabilityover100training-testingpartitions.Verticallines representa95%CIfortheaverage.Thehorizontalaxisgivethetime-point atwhichimageswerecollectedandareexpressedinbothdaysaftersowing (DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity)......84 FigureA.7:Geneticcorrelationbetweengrainyieldandall:thestandard(SI),thereg- ularized(PC-SIandL1-PSI)selectionindices,andtheRNDVI.Thelines providetheaveragegeneticcorrelationover100training-testingpartitions. Verticallinesrepresenta95%CIfortheaverage.Thehorizontalaxisgivethe time-pointatwhichimageswerecollectedandareexpressedinbothdaysafter sowing(DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity)..85 FigureA.8:Phenotypic,genetic,andenvironmentalcovariances(absolutevalue)between wavebandsandgrainyield.'D':discrepancybetweenphenotypicandgenetic covariancesasmeasuredbythesumoftheabsolutedi˙erences;bytime- point(DAS:daysaftersowing,Stage:VEG=vegetative,GF=grain˝lling, MAT=maturity)withinenvironment........................86 FigureB.1:Toptwoprincipalcomponentsofthegenomicrelationshipmatrix, G ,for eachdataset.Eachpointrepresentindividuals.(A)Wheat-599dataset.(B) Wheat-largedataset.Individualsarecolor-groupedbythecycle(sowing- harvestyear).....................................90 FigureB.2:Boxplotofgrainyieldphenotypicrecords(intonha 1 )byenvironmental conditionforbothWheat-599andWheat-largedatasets.SDstandarddeviation.91 FigureB.3:Distributionofthenumberoftrainingsupportpoints( n sup )inoptimalsparse selectionindices(resultsobtainedover100trn-tstpartitions; n trn =sizeofthe trainingdataset),byenvironmentalcondition,Wheat-599dataset........92 FigureB.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)and thecorrespondingsupportpoints(green).Greypointsrepresentgenotypes thatdidnotcontributetothepredictionofthegeneticvalueofthegenotype inyellow.Allpanelsrepresentsolutionsfortheenvironment1,Wheat-599 dataset.......................................93 xiv FigureB.5:(leftandcenter)Weights( ij )ofastandardSI(G-BLUP)andoftheoptimal sparseselectionindex(SSI)versusthegenomicrelationship( g ij ),and(right) proportionofweightsintheSSIthatbelongedtoeitherthesupportingor non-activesets,bygenomic-relationship;byenvironment,Wheat-largedataset.94 FigureB.6:(leftandcenter)Weights( ij )ofastandardSI(G-BLUP)andoftheoptimal sparseselectionindex(SSI)versusthegenomicrelationship( g ij ),and(right) proportionofweightsintheSSIthatbelongedtoeitherthesupportingor non-activesets,bygenomic-relationship;byenvironment,Wheat-599dataset..95 FigureC.1:Genomicrelationships( G ij )versuskernelrelationships( K ij )ofindividuals incycle2019withthoseincycles2017(left)and2018(right). G i j and K i j arethe ij th elementof G and K ¹ º ,respectively.(A) K 1 = K ¹ 0 : 2 º .(B) K 2 = K ¹ 1 º .(C) K 3 = K ¹ 5 º .............................107 FigureC.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthe datafromthe2017,2018,or2017+2018cyclesalone(top-leftpanel),orin combinationwithaproportion(5%,10%,15%)ofthedatafromthe2019 cycle.Thepredictionsetconsistedof612genotypesfromthe2019cycle thatwerenotusedformodeltraining.Modelswiththesameletterwithin panelindicatenosigni˝cantdi˙erencefromeachother( = 0 : 05 ,ANOVA followedbyTukeytest).GY-DRTtrait-environmentcombination.........108 FigureC.3:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthe datafromthe2017,2018,or2017+2018cyclesalone(top-leftpanel),orin combinationwithaproportion(5%,10%,15%)ofthedatafromthe2019 cycle.Thepredictionsetconsistedof612genotypesfromthe2019cycle thatwerenotusedformodeltraining.Modelswiththesameletterwithin panelindicatenosigni˝cantdi˙erencefromeachother( = 0 : 05 ,ANOVA followedbyTukeytest).TraitPH(OPTandDRTenvironments).........109 FigureC.4:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(hori- zontalaxis)versusthepredictionaccuracyofallothermodels(verticalaxis ofeachpanel).(B)Predictionaccuracyofthestandard*-BLUPmodel(hori- zontalaxis)versusthepredictionaccuracyofitssparseversion(verticalaxis), bytypeofkernelusedinpanels.Eachpointrepresentatraining-testingpar- titionwithineachtrainingsetcomposition.Coloredpointsabove(below)the 45degreelinerepresentcasesforwhichonemodeloutperformedtheother model.P:p-valueforthetest(fromANOVA)fordi˙erencesinaccuracy betweenthetwomodels.GY-DRTtrait-environmentcombination........110 xv FigureC.5:(left)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(hor- izontalaxis)versusthepredictionaccuracyofallothermodels(verticalaxis ofeachpanel).(right)Predictionaccuracyofthestandard*-BLUPmodel (horizontalaxis)versusthepredictionaccuracyofitssparseversion(vertical axis),bytypeofkernelusedinpanels.Eachpointrepresentatraining-testing partitionwithineachtrainingsetcomposition.Coloredpointsabove(be- low)the45degreelinerepresentcasesforwhichonemodeloutperformed theothermodel.P:p-valueforthetest(fromANOVA)fordi˙erencesin accuracybetweenthetwomodels.TraitPH(OPTandDRTenvironments)....111 FigureC.6:Heatmapofthecoe˚cientsintheHatmatrix( ~ B ¹ º K )ofthesparseKA- BLUPmodelforonetraining-prediction(TS-PS)partitioninthepredictionof n PS = 612 individualsfrom2019using n TS = 2427 individuals(2017+2018 plus15%ofthe2019set).Predictedindividualsarepresentedincolumnsand trainingindividualsarepresentedinrowsseparatedbycycleandnumberof individualsinparentheses.Thevalueof wasobtainedbycross-validation. Eachcolumnrepresentsvaluesofthevector ~ b ¹ º i K = f ~ b ij g , j = 1 ;:::; 2427 (Equation(4.6)).Individualsnocontributingtothepredictionhaveacoe˚- cient ~ b ij = 0 representedingreycolor.Individualswithanon-zerocoe˚cient areshowninayellow-bluelogarithmscale(intheoriginalscale,yellowindi- cateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environment combination.....................................112 FigureC.7:Proportionofthetrainingindividualsfromeachcyclethatcontributedtothe predictionofthe612testinggenotypesfrom2019,usingsparsemodelswith di˙erentrelationshipmatrices(horizontalaxis): G , K 1 , K 2 ,or K A .The trainingsetwascomposedbyindividualsfrom2017( n = 901 )and2018 ( n = 1417 )alone(top-leftpanel)orincombinationwithaproportion(5%, 10%,15%)ofthedatafromthe2019cycle.GY-DRTtrait-environment combination.....................................113 FigureC.8:Proportionofthetrainingindividualsfromeachcyclethatcontributedto thepredictionofthe612testinggenotypesfrom2019,usingsparsemodels withdi˙erentrelationshipmatrices(horizontalaxis): G , K 1 , K 2 ,or K A . Thetrainingsetwascomposedbyindividualsfrom2017( n = 901 )and 2018( n = 1417 )alone(top-leftpanels)orincombinationwithaproportion (5%,10%,15%)ofthedatafromthe2019cycle.TraitPH(OPTandDRT environments)....................................114 xvi CHAPTER1 INTRODUCTION Plantbreedingbeganthousandsofyearsagowhenhumansstarteddomesticatingwildspecies turningthemintocrops(Tangetal.,2010).Domesticationoccurredwhenthemostpreferableplants wereselectedtobepropagatedforapurpose(e.g.,foodrequirement).Thetransformationfrom shatteringtonon-shatteringcereals(e.g.,riceandwheat)tofacilitateharvest,andthedeveloping ofthecultivatedmaizefromitswildrelative(teosinte)aretwoimportantexamplesofplants' domestication.Inthenineteenthcentury,hybridizationexperimentsandthediscoveryofthe rulesofinheritancebyGregorMendelandDarwin'stheoryofnaturalselectionshedlightonthe foundationsofadaptationbymeansofnaturalorarti˝cialselection. Massphenotypicselection hasbeenextremelysuccessfulinproducingmoderncultivarsand hybridsofgreatagronomicpotential(Jiang,2013).Theseadvancesrequirednotonlyselectionfor yieldpotentialbutalsoformorerobustplants;thedevelopmentofthehigh-yieldingsemi-dwarf wheatandricevarietiesduringtheRevisaclearexampleofhowplantarchitecture andplanphysiologyneedstobeadaptedtoimproveagronomicperformance. Theexpectedrateofgeneticprogressfromselectionisdeterminedbytheinterplayingoffour factors:selectionintensity,selectionaccuracy,geneticvariance,andlengthofthebreedingcycle. Despiteofthegreatprogressachievedbymeansofdirectphenotypicselection,thistechnologyhas severallimitations:( i )selectionaccuracyisboundedbytraitheritability,( ii )extensive(andexpen- sive)phenotypictestingisrequiredtoachievehighselectionintensity,and( iii )theopportunities toshortenthebreedingcyclearelimited.Inthelastcentury,researchinplantandanimalbreeding hasbeenlargelyfocusedonthedevelopmentoftechnologiesthatcanimproveselectionbyaltering thefourfactorsthata˙ectgeneticprogress. Moreaccuratepredictionsofbreedingvalues(BV)canbeobtainedusingstatisticalmethods (e.g.,selectionindices,BestLinearUnbiasedPrediction,BLUP)thatsmooth-outenvironmental componentsofinter-individualdi˙erencesinphenotypesandenableborrowingofinformation 1 betweenrelatedindividuals. SelectionIndex(SI)methodology beganbypredictingBVsobtainedbycombiningindividual performancedataandprogenyevaluations(Lush,1935).Smith(1936)andHazel(1943)generalized ideasusedinprogenytestingtoamoregeneralframework(SI)whichusesallinformativedata.A phenotypicmeasurementisinformativeabouttheBVofaselectioncandidateifitisgenetically correlatedwiththeBV.Informativephenotypicmeasurementsincludetherecordsoftheselection goalontheselectioncandidateorinrelativesoftheselectioncandidate,andmeasurementsoftraits geneticallycorrelatedwiththeselectionobjective.Inaselectionindex,thetotalgeneticvalueof anindividualispredictedusingaweightedsumoftheavailablephenotypicmeasurements.The weightsoneachofthemeasuredphenotypesdependongeneticandphenotypic(co)variances. Henderson'sBLUP :Bythemiddleofthetwentycentury,C.R.Henderson(1963)introduced theconceptoftheBestLinearUnbiasedPredictor(BLUP)ofthebreedingvalues.BLUPsof BVsareobtainedusingmixedmodelsthatincorporategeneticrelationshipsamongallevaluated individuals.Henderson'sBLUPcanbeshowntobeequivalenttoaSI;however,theBLUP methodologyprovidesanadequateframeworkforsimultaneousmodelingofgeneticandnon- genetice˙ects(e.g.,location,year,year-location,block).BLUPalsoprovidesaframeworkforthe estimationofgeneticandenvironmental(co)varianceparameters. SelectionindexandpedigreeBLUPbecomethemethodofchoiceforBVpredictioninthe secondhalfofthetwentycenturyandthebeginningofthecurrentcentury.However,pedigree informationpresentssomelimitations;forexample,pedigreeandancestraldatacannotpredict inter-individualdi˙erencesingeneticvaluesbetweenmembersofabi-parentalfamily. Thedevelopmentofmolecularmarkers(e.g.,DNAmarkers)hasbene˝tedplantsciencein manyaspects,includinggermplasmcharacterization,geneintrogression,andthedevelopmentof DNA-assistedprediction/selectionmethods(Xu&Crouch,2008). Marker-assistedselection (MAS)reliesontheidenti˝cationandvalidationofDNAmarkers predictive(i.e.,tightlylinked)ofquantitativetraitloci(QTL)genotypes.MAScanincrease selectionaccuracyandmayenableearlyscreening(e.g.,attheseedlingstage).MAShasbeenan 2 e˚cienttoolinthedetectionandvalidationintheimprovementofqualitativetraits(e.g.,disease resistanceandgrainquality)invariouscrops(e.g.,Williametal.,2007;Xuetal.,2009;Miahetal., 2013).However,mostofthetraitsofagronomicperformance(e.g.,grainyield,oilconcentration) area˙ectedbyalargenumberofsmall-e˙ectloci.Detectingmarker-QTLassociationsforcomplex traitsrequiresanexceedinglylargersamplesize(Melchingeretal.,1998)andspeci˝cdesigns aimingtomaximizepower(e.g.,Yuetal.,2008;Zhuetal.,2008).Thedi˚cultyofmapping small-e˙ectQTLlimitedtheadoptionofMASfortheimprovementoftraitsa˙ectedbyalarge numberofQTLs. GenomicSelection(GS) :Thecontinueddevelopmentofgenotypingandsequencingtechnolo- gieshasledtoasteadydecreaseingenotypingcosts.Inthe˝rstdecadeofthe21stcentury, arrays,includingtensofthousandsofSNPs,becomeavailableformostagriculturalspecies.These genotypingarrayso˙eredtheopportunitytotrack,vialinkagedisequilibriumwithcausalvariants, geneticssignalsdistributedovertheentiregenomes.Inalandmarkpublication,Meuwissenetal. (2001)proposedusingalargenumberofSNPsforbreedingvalueprediction.Genomicselection exploitsmulti-locuslinkagedisequilibrium(LD)withthecausalvariants;withenoughmarker density,GScanpotentiallycaptureallgeneticsignals(He˙neretal.,2009). Genomicselectionhasbeensuccessfullyadoptedbymanypublicandprivatebreedingorgani- zations.SamplesizehasbeenrecognizedasoneofthemainfactorslimitingtheaccuracyofGS (Daetwyleretal.,2008;Goddard,2009;delosCamposetal.,2013a).Earlyimplementationsof GSinplantbreedingwerebasedonrelativelysmalltrainingdatasets.However,overyears,private andpublicorganizationshaveaccumulatedlargevolumesofgenomicdatalinkedtophenotypes frommultiplepopulationsandmultiplegenerations.Theverylargesamplesizeofthesedatasets impliesthathighlycomplexgenomicpredictionmodelscannowbeaccuratelycalibrated.However, thesedatasetsarealsoincreasinglyheterogeneous,withmanysubpopulationsandmultiplegener- ationsrepresentedinthedata.Thistranslatesintopotentiallyheterogeneousallelefrequenciesand di˙erentLDpatterns,thusleadingtoSNP-e˙ectheterogeneity. GSmethodsweredevelopedwithreferencetohomogeneouspopulationsinwhichSNP-e˙ects 3 canbeassumedconstantacrosssubsetsofthedatasets;thesemethodsarenotnecessarilyoptimalfor thetypeofdatasetsavailableformodeltraining.Therefore,animportantfocusofthisdissertation isondevelopingnovelmethodsthatcanleveragethelargesamplesizeofmoderngenomicdata setswhilecopingwiththechallengesposedbytheinherentheterogeneityandcomplexityofthese datasets. Theadventofhigh-throughputphenotyping(HTP) :Inrecentyears,therehasbeenanimpor- tantimprovementinHTPtechnologies.Modernphenotypingplatformscangeneratelargevolumes ofdataatmultipletime-pointsofacrop(Montesetal.,2007).Examplesofthisincludetheuseof hyper-spectralimagingtechnologieswhichcancapturetheabsorbanceofelectromagneticpower bycropsatpotentiallythousandsofwavelengths.TheintegrationofHTPingeneticevaluations representsagreatopportunitytofurtheradvanceplantbreeding;however,thehigh-dimensional natureofHTPdataposesimportantchallenges.Therefore,asecondfocusofthisdissertationison thedevelopmentofanovelapproachforBVpredictionusingHTPdata. Thus,theoverallaimofthisdissertationistocontributenovelmethodsthatcanimprovethe accuracyofgenomicpredictionbyoptimizingtheuseoflarge,potentiallyheterogeneous,genomic datasetsandbyenablingtheintegrationofHTPdata.Toachievethesegoals,wedevelopedanovel statisticalapproachthatcombinesthestandardSImethodologywithsparsity-inducingmethods commonlyusedinthe˝eldofstatisticsandmachinelearning.Theprocedurethatwedevelop, whichwenamedassparseselectionindex(SSI),o˙erssolutionstobothGSwithpotentiallyhighly heterogeneousgenomicdatasets,andtheintegrationofHTPingeneticevaluations. 1.1Chapter2:Incorporatinghyper-spectralimagedataintoselectionin- dicesforbreedingvalueprediction TheuseofHTPtechnologiescanenablescreeningalargernumberofgenotypesovermany environmentsandlocations,thuso˙eringopportunitiestoincreaseselectionintensity.Furthermore, indirectselectionbasedonphenotypescollectedearlyinthegrowingcyclecanleadtoareduction inthelengthofthebreedingcycleofperennials.Therefore,theintegrationofHTPinbreeding 4 evaluationscanleadtoanincreaseofgeneticprogressbyeitherenablingamoreintensiveand/or fasterselection. Mostoftheresearche˙ortsoncropimaginghavefocusedondevelopingmethodstopredict phenotypesfromHTPdata.Forinstance,vegetationindices(VI)derivedfromspectradata(e.g., NDVI;Tucker,1979),havebeenusedtopredictgreenbiomass,leafarea,chlorophyllcontent, andyieldinwheatandmaizeunder˝eldconditions(e.g.,Babaretal.,2006;Garrigaetal.,2017; Haboudaneetal.,2002;Rutkoskietal.,2016);andtodetectabiotic(e.g.,Roemeretal.,2012)and abioticstress(e.g.,Mahleinetal.,2012)incrops.VIsusinginformationfromareducednumber ofwavelengthsinthespectrum.Morerecently,researchershaveconsidereddevelopingmethods thatusewhole-spectradata.Oneapproachconsistsofusingdimension-reductiontechniques(e.g., principalcomponents,PC,andpartialleastsquares,PLSregression)topredictagronomictraits, e.g.,biomass(e.g.,Hansen&Schjoerring,2003)orgrainyieldinwheat(e.g.,Ferrioetal.,2005; Hernandezetal.,2015)andinmaize(e.g.,Weberetal.,2012;Aguateetal.,2017).Another approachconsistsofintroducingwhole-spectradataintoBayesianorpenalizedregression;this approachhasbeenusedtopredictmilkcomponentsfrommilk-spectradata(Ferraginaetal.,2015) andgrainyieldinwheat(e.g.,Montesinos-Lópezetal.,2017)andinmaize(e.g.,Aguateetal., 2017)fromhyper-spectralcropimaging. Theapproachesdescribedintheprecedingparagrapharewell-suitedforphenotypicprediction butcanbesub-optimalforselectionbecausethebestpredictorofaphenotypeisnotnecessarily thebestpredictorofthegeneticvalue. Therefore,toaddressthelimitationsofexistingmethods,in Chapter2 wepresentanovel methodologytodeveloppenalizedandreduced-rankSIsusinghigh-dimensionalphenotypes.Our approachintegratesintoauni˝edframeworkstandardSImethodologywithmethodsusedinhigh- dimensionalregressionthatcanpreventover-˝tting.Weevaluatetheproposedmethodologyusing amulti-environmentwheatdataset( n ˘ 3 ; 200 )containinghyper-spectral( p = 2 ; 250 wavebands) andgrainyieldinformation.Ourresultsshowthatpenalizedandreduced-rankSIso˙erimproved selectionaccuracy( ˘ 10 40 %gain)relativetothestandard(i.e.,non-regularized)SI. 5 1.2Chapters3and4:Improvingtheaccuracyofgenomicpredictionusing sparseselectionindices Thesetwochaptersfurtherdevelopthepenalizedselectionindexmethodologyintroducedin Chapter2todevelopmethodstoimprovethepredictionaccuracyofGS.Thecontextthatmotivates thedevelopmentofthesemethodsisthatoflargeandpotentiallyheterogeneousgenomicdatasets forwhichtheassumptionofe˙ect-homogeneitymaynothold. Thechallengesposedbytheavailabilityoflarge,potentiallyheterogeneous,datasetshavebeen recognizedinrecentyears,andtherehavebeenseveralattemptstodevelopmethodstoconfrontthese challenges.A˝rstlineofresearchinvolvesusingmodelsthatincludeinteractionsbetweenSNPs andgeneticgroups.Inthesemodels,SNP-e˙ectse˙ectsarerepresentedasthesumofamaine˙ect plusadeviationthatisgroup-speci˝c(e.g.,Schulz-Streecketal.,2012;delosCamposetal.,2015; Veturietal.,2019).Asimilar(insomecasesstatisticallyequivalent)approachistousemultivariate modelsinwhichtheSNPe˙ectsareassumedtobedi˙erentbutcorrelatedbetweendi˙erentgenetic groups(e.g.,Olsonetal.,2012;Lehermeieretal.,2015).Thesemethodsarewell-suitedfordata setsinwhichindividualsclusterintode˝nedgroupswhichmaybeknowninadvance(e.g.,breeds, families,pools)ormaybeinferredusingstatisticalmethods(e.g.,STRUCTURE;Pritchardetal., 2000).Themethodsjustdescribedhaveshownpromise;however,theyarenotwellsuitedfor datasetsinwhichgeneticheterogeneityexhibitsmorecomplex/crypticpatternsandthuscannotbe reducedtotheclusteringofindividualsintowell-de˝neddistinctgroups. Anotherlineofresearchconsistsofidentifying,foragivenpredictionsetasubsetofthetraining datathatmaybeoptimaltopredictbreedingvaluesoftheselectioncandidates.Thisapproachis motivatedbythefactthatgeneticsimilarities(e.g.,familyrelationships)haveagreatimpacton predictionaccuracy(Habieretal.,2010).Indeed,severalstudieshaveshownthattheaccuracy ofGScanbeverylowwhensubjectsinthetrainingsetaregeneticallydistanttothoseinthe predictionset(Clarketal.,2012;Lorenz&Smith,2015).Furthermore,somestudies(delos Camposetal.,2013b;Wolcetal.,2016)suggestthatpredictionaccuracycanbereducedwhen individualsdistantlyrelatedtothoseofthepredictionsetarealsousedformodeltraining.Basedon 6 theseobservations,severalstudiesseektodevelopmethodsfortrainingsetoptimization/designing (e.g.,Clarketal.,2012;Jacobsonetal.,2014;Lehermeieretal.,2014;Lorenz&Smith,2015). Researchinthisareahasfocusedoncomparingvariousoptimizationcriteria(e.g.,Rincentetal., 2012;Akdemir&Isidro-Sanchez,2019;Rothetal.,2020).However,alltheseapproachesassume thatasingletrainingsetisoptimalforallthesubjectsinthepredictionset;thisassumptionmaynot betruebecauseeachcandidateofselectionmaydrawusefulinformationfromdi˙erenttraining datapoints. Therefore,toovercomethelimitationsoftheexistingmethods,in Chapter3 wepresenta genomicpredictionapproachthatidenti˝es,foreachindividualinthepredictionset,anoptimal trainingset(i.e.,asetofsupportpoints).Ourapproachisbasedonasparseselectionindex (SSI)whichintegratesSImethodologywithsparsity-inducingmethods(i.e.,anL1-penalization). TheSSIcanbeseenasasparseversionofthepopulategenomic-BLUP(G-BLUP,VanRaden, 2008).WedevelopedsoftwarethatimplementsSSIandevaluatedthemethodologyusingtwo multi-environmentalwheatdatasets,arelativelysmall( n = 599 )highlystructuredoneandavery large( n ˘ 29 ; 000 )datasetthathasarelativelymorecrypticstructure.Inbothcases,wefound that,comparedwiththeG-BLUP,theSSIcanyieldanimprovementinpredictionaccuracyofabout 5-10%. Finally, Chapter4 presentsanapplicationoftheSSImethodologytoaverylarge( n = 3 ; 039 ) multi-generationdouble-haploid(DH)maizedatasetcomprisinggenotype,grainyield,andplant heightrecords.Here,weappliedtheSSImethodologyusingadditivegenomicrelationshipsand alsousinga(non-linear)Gaussiankernel(K-BLUP).LiketheSSI,theGaussiankernelcanbe tunedtomaximizetheborrowingofinformationbetweencloselyrelatedindividuals.Thesparse andnon-sparseK-BLUPmodelsperformedsimilarlywithupto28%ofgaininpredictionaccuracy comparedwithG-BLUPbasedonadditiverelationships.Insomecases,thesparseK-BLUP outperformedthenon-sparseK-BLUP. 7 CHAPTER2 REGULARIZEDSELECTIONINDICESFORBREEDINGVALUEPREDICTION USINGHYPER-SPECTRALIMAGEDATA [Materialpublishedin:Lopez-Cruz etal. 2020. Scienti˝cReports 10(8195)] MarcoLopez-Cruz 1 ,EricOlson 1 ,GabrielRovere 2 ; 3 ; 4 ,JoseCrossa 6 ,SusanneDreisigacker 6 , SuchismitaMondal 6 ,RaviSingh 6 ,andGustavodelosCampos 3 ; 4 ; 5 ; 1 DepartmentofPlant,SoilandMicrobialSciences,MichiganStateUniversity,USA 2 DepartmentofAnimalScience,MichiganStateUniversity,USA 3 DepartmentofEpidemiologyandBiostatistics,MichiganStateUniversity,USA 4 InstituteforQuantitativeHealthScienceandEngineering,MichiganStateUniversity,USA 5 DepartmentofStatisticsandProbability,MichiganStateUniversity,USA 6 InternationalMaizeandWheatImprovementCenter(CIMMYT),Mexico CorrespondenceandrequestshouldbeaddressedtoG.D.L.C.(e-mail:gustavoc@msu.edu). 8 2.1Abstract High-throughputphenotyping(HTP)technologiescanproducedataonthousandsofphenotypes perunitbeingmonitored.Thesedatacanbeusedtobreedforeconomicallyandenvironmentally relevanttraits(e.g.,droughttolerance);however,incorporatinghigh-dimensionalphenotypesin geneticanalysesandinbreedingschemesposesimportantstatisticalandcomputationalchallenges. Toaddressthisproblem,wedevelopedregularizedselectionindices;themethodologyintegrates techniquescommonlyusedinhigh-dimensionalphenotypicregressions(includingpenalization andrank-reductionapproaches)intotheselectionindex(SI)framework.Usingextensivedatafrom CIMMYT's(InternationalMaizeandWheatImprovementCenter)wheatbreedingprogramwe showthatregularizedSIsderivedfromhyper-spectraldatao˙erconsistentlyhigheraccuracyfor grainyieldthanthoseachievedbystandardSIs,andbyvegetationindicescommonlyusedtopredict agronomictraits.RegularizedSIso˙erane˙ectiveapproachtoleverageHTPdatathatisroutinely generatedinagriculture;themethodologycanalsobeusedtoconductgeneticstudiesusinghigh- dimensionalphenotypesthatareoftencollectedinhumansandmodelorganismsincludingbody imagesandwhole-genomegeneexpressionpro˝les. 2.2Introduction High-throughputphenotyping(HTP)technologieshavebeenadoptedatafastpaceinagri- culture;applicationsrangefromtheuseofHTPinhighlycontrolledenvironments(e.g.,growth chambers(Nageletal.,2012))toextensiveHTPusingsensingdevicesmountedonaerial(e.g., hyper-spectralcamerasmountedonaerialvehicles(Araus&Cairns,2014))andterrestrialequip- mentsuchastractorsandcombineharvesters(Montesetal.,2006).Modernagriculturalproduction systemsuseHTPdatatooptimizemanagementpractices(Whiteetal.,2012),forecastagricultural outputs(Ferrioetal.,2005)andtoassessthequality(e.g.,proteincontent)ofagriculturalcom- modities(Spielbaueretal.,2009).HTPdatacanalsobeavaluableinputforbreedingprograms. Forinstance,extensiveHTPmayenableanexpansionofgenetictestingthatcanleadtohigher 9 intensityofselectionandfastergeneticprogress.Moreover,HTPdatamayo˙eropportunitiesto improvetraitssuchasdroughttolerancethatareotherwisedi˚culttomeasureandbreedfor. Sensorscangeneratedataonhundredsorthousandsofphenotypesperunitbeingmonitored. Forexample,hyper-spectralcamerascangeneratere˛ectanceofelectromagneticpowerathundreds ofwavelengthsinthevisibleandinfraredspectrum.Thesemeasurementscanbeconsideredas indicatorphenotypesthatcanbeusedtopredictothertraits.Anextensivebodyofresearchdeals withtheuseHTPdatatopredictphenotypessuchasgrainyield(Ferrioetal.,2005;Garriga etal.,2017;Weberetal.,2012;Aguateetal.,2017),drymatter(Montesetal.,2006),oiland proteincontent(Garnsworthyetal.,2000;Oblathetal.,2016).However,therehasbeenmuchless researchonhowtointegrateHTPdataingeneticstudiesandinbreedingschemes.Ingenetics,the problemofpredictingthegeneticmeritofatargettraitgivenasetofcorrelatedphenotypeswas ˝rstaddressedbySmith(1936)andHazel(1943)whointroducedtheconceptofselectionindex (SI)inplantandanimalbreeding,respectively. ASIseekstoimproveatargettrait y i (e.g.,grainyield)usinginformationfromanothersetof measuredtraits(e.g.,hyper-spectralimagedata).AlinearSIisaweightedsumofthemeasured phenotypeswithweightsderivedtomaximizethecorrelationbetweenthegeneticmeritforthe selectiontargetandtheSI.Thus,theSImethodologyo˙ersanaturalframeworkforintegrating HTPdataintobreedingdecisions.However,whenthemeasuredphenotypeishigh-dimensional, thenaïveapplicationoftheSIcanleadtoover˝ttingandsub-optimalaccuracyofindirectselection. Toaddressthisproblem,wedevelopedregularizedselectionindices(includingpenalizedand reduced-rankmethods)thataretailoredtoachieveaccuratepredictionofgeneticvaluesusinghigh- dimensionalphenotypes.TheproposedmethodologyintegratesintotheSIframeworkmethods oftenusedtopreventover˝ttinginhigh-dimensionalphenotypicregressions(Hastieetal.,2009). Usingextensivemulti-environmentcropimagingdatafromCIMMYT'swheatbreedingprogram weshowthatregularizedSIso˙erimprovedaccuracyofindirectselectioninbothoptimaland stressenvironments. 10 2.3Results Aselectionindexisalinearcombinationof p measuredphenotypes, x i = ¹ x i 1 ;:::; x ip º 0 ,ofthe form I i = x 0 i ,where = ¹ 1 :::; p º 0 isavectorofregressioncoe˚cientswhoseentriesde˝nethe weightsofeachofthemeasuredphenotypesintheSI.InastandardSItheweightsarederivedby minimizingtheexpectedsquareddeviationbetweenthegeneticmeritfortheselectiongoal( g y i , e.g.,thegeneticmeritforgrainyieldofthe i th genotype)andtheindex,thatis: ^ = argmin 1 2 E g y i x 0 i 2 (2.1) Thesolutiontothisoptimizationproblemis(see Methods section): ^ = P 1 x G x ; y ; (2.2) where G x ; y = E ¹ x i º isavectorcontainingthegeneticcovariancesbetweentheselectionobjective ( y i )andeachofthemeasuredtraits( x i ),and P x isthe(population)phenotypicvariance-covariance matrixofthemeasuredphenotypes,thatis, P x = E ¹ x i x 0 i º .Thus,astandardSItakestheform I i = x 0 i P 1 x G x ; y .ThetheoryunderlyingthederivationofSIsandresponsetoindirectselectionis wellestablished(Bulmer,1985;Falconer&Mackay,1996). TheSIisbyconstructionthebestlinearpredictor(BLP)ofthegeneticmeritfortheselection goal;thispropertyholdswhen G x ; y and P x areknown.However,whenthenumberofmeasured phenotypesislarge,errorsintheestimationof P x and G x ; y mayleadtoover˝ttingandsub-optimal accuracyofindirectselection. 2.3.1Regularizedselectionindices Reduced-rank(e.g.,principalcomponentsmethods)andpenalizedregression(Hastieetal.,2009) aretwoapproachescommonlyusedtoconfrontover˝ttinginhigh-dimensionalregressionproblems. Thesemethodologiesweredevelopedforregressionproblemsinvolvinganobservablephenotype ( y i ).IntheSI,theresponse( g y i )isunobservable;however,thesameprinciplesthatareappliedin phenotypicreduced-rankandpenalizedregressionscanbeintegratedintotheSIframework. 11 2.3.1.1Reduced-rankselectionindices Inprincipalcomponents(PC)regression,theresponseisregressedonareducednumber( q < p )of PCsextractedfromasetofpredictors( x i );thesameconceptcanbeusedtoderiveareduced-rank SI.Forinstance,onecanextractareducednumberofPCsfromacropimageandtheresulting PCscanbeusedas'measuredtraits'inEquation(2.1).ThesolutionofEquation(2.1)willrender estimatesoftheregressioncoe˚cientsforthePCs,whichcanbetransformedbacktocoe˚cients applicabletothemeasuredtraits(see Methods ).Thus,areduced-rankSI(referredtoasPC-SI)can bederivedfollowingthesesteps:( i )extract,usingthesingularvaluedecomposition, q PCsfrom thematrixcontainingthemeasuredphenotypes,( ii )estimatethegeneticcovariancesbetweenthe ˝rst q PCsandtheselectionobjective,( iii )usetheseestimated(co)variancestoderivecoe˚cients associatedwiththetop q PCs;˝nally,( i v )transformthesecoe˚cientsintocoe˚cientsforthe measuredphenotypes.Thisprocesscanbedoneusing q = 1 ; 2 ;:::; p PCs( q = p rendersthe standardSI).Forthesequenceofestimates( ^ ¹ 1 º ; ^ ¹ 2 º ;:::; ^ ¹ p º ),onecanevaluatetheaccuracyof indirectselectionoftheresultingSIandanoptimalrankforthePC-SIcanbechosentomaximize theaccuracyofindirectselection. 2.3.1.2Penalizedselectionindices Inapenalizedregression,regularizationisachievedbyincludingintheobjectivefunctionapenalty onmodelcomplexity.InthecontextofaSI,wehave ^ = argmin 1 2 E g y i x 0 i 2 + J ¹ º ; (2.3) where isapenaltyparameter( = 0 yieldsthecoe˚cientsforthestandardSI)and J ¹ º isapenalty function.CommonlyusedpenaltiesincludetheL2( jj jj 2 2 = Í p j = 1 2 j )andL1( jj jj 1 = Í p j = 1 j j j ) norms(Fu,1998),oraweightedsumofthetwo(Zou&Hastie,2005).Using J ¹ º = 1 š 2 Í p j = 1 2 j inEquation(2.3)rendersaRidge-regression-typePSI(RR-PSI,see Methods ): ^ L 2 = ¹ P x + I º 1 G x ; y ; 12 where I isa p p identitymatrix.TheRR-PSI(referredtoastheL2-PSI)yieldsshrunkenestimates oftheregressioncoe˚cients. Inmanyapplications,variableselection(i.e.,aSIthatisafunctionofasubsetofthemeasured phenotypes)maybedesirable.ThispropertycanbeobtainedusingpenaltiesinvolvingtheL1- norm,eitheralone, J ¹ º = Í p j = 1 j j j (LASSO(Tibshirani,1996)),orincombinationwiththe L2-norm, J ¹ º = 1 š 2 ¹ 1 º Í p j = 1 2 j + Í p j = 1 j j j (elastic-net(Zou&Hastie,2005)).Unlikethe L2-PSI,theLASSOandelastic-netSIs(hereinafterreferredtoasL1-PSIandEN-PSI,respectively) donothaveaclosed-formsolution(Hastieetal.,2009).However,solutionsforPSIsinvolvingan L1-penaltycanbeobtainedusingiterativeproceduressuchasthecoordinatedescent(Friedman etal.,2007)andtheleastangleregression(Efronetal.,2004)(LARS)algorithms(see Methods ). AswiththePC-SI,anoptimalPSIcanbeobtainedbychoosingthevaluesoftheregularizing parameters( ; )thatmaximizetheaccuracyofindirectselection. Figure2.1:Predictionofthegeneticmeritforgrainyieldusinghyper-spectralcropimagedata.(A) Dataconsistsofhyper-spectralre˛ectancedata( x i )andphenotypicmeasurementsofthetargettrait ( y i ,e.g.,grainyield).(B)Asubsetofthedata(thetrainingset)isusedtoderivethecoe˚cientsof aselectionindex.(C)Thesecoe˚cientsarethenappliedtoimagedataofindividualsinthetesting settoderivetheindex( I i )foreachindividual.Thepredictiveabilityoftheindexisassessedby calculatingtheaccuracyofindirectselection( Acc ¹Iº )inthetestingset. 13 2.3.2Accuracyofindirectselection Indirectselectionaccuracyisde˝nedasthecorrelationbetweentheindexusedtorankgenotypes andthegeneticmeritoftheselectionobjective,thatis, Acc ¹Iº = cor ¹I i ; g y i º .Thisparameteris equaltotheproductofthesquarerootoftheheritabilityoftheSI( h I )timesthegeneticcorrelation betweentheSIandtheselectiontarget, cor ¹ g I i ; g y i º (Falconer&Mackay,1996).Toavoid estimationbias Acc ¹Iº mustbeestimatedusingdatathatwasnotusedtoderivethecoe˚cientsof theindex(Figure2.1);therefore,intheapplicationpresentedbelowwe:( i )partitionedthedatainto trainingandtestingsets,( ii )derivedthecoe˚cientsoftheSIinthetrainingset,( iii )appliedthese coe˚cientstoimagedataofthetestingset( I i = x 0 i ),and( i v )estimated h I , cor ¹ g I i ; g y i º ,and Acc ¹Iº inthetestingset.Furthermore,wequanti˝edthee˚ciencyofindirectselectionrelativeto massphenotypicselection(RE)using RE = h I h y cor ¹ g I i ; g y i º (Falconer&Mackay,1996). 2.3.3Regularizedselectionindicesforwheatgrainyieldusinghyper-spectralimagedata Weappliedthemethodologydescribedintheprevioussectiontodata( n = 3 ; 276 )fromthe CIMMYT'sGlobalWheatProgramconsistingofgrainyield(tonha 1 )andhyper-spectralimage data.ThedatawerecollectedatCIMMYT'sexperimentalstationinCiudadObregon,Sonora, Mexico(27 20'N,109 54'W,38masl)from39yieldtrialsinwhichatotalof1,092genotypes weretested.RainfallinObregonisverylimited;therefore,fourdi˙erentenvironmentswere generatedrepresentingacombinationofplantingmethods( Flat or Bed ),controlledirrigation (minimal,2or5irrigations),andplantingdates(optimumorearly-heat).Asexpected,average yielddecreasedasdroughtstressintensityincreased(seeTable2.1andSupplementaryFigureA.1 forboxplotsofyieldbyenvironment). Imagedatawascollectedusinganinfraredandanhyper-spectralcameraandconsistedofre- ˛ectanceofelectromagneticpowerat250wavebandswithinthevisibleandnear-infraredspectrums (392-850nm).Separateimageswerecollectedat9time-pointscoveringvegetative(VEG),grain ˝lling(GF),andmaturity(MAT)stagesofthecrop(seeSupplementaryFigureA.2).Grainyield 14 andimagedatawerepre-adjustedusingmixed-e˙ectsmodelthataccountedforgenotype,trial, replicate,andsub-block(see Methods section). Table2.1:Averagegrainyieldandheritabilitybyenvironmentalcondition. PlantingconditionsNumberof Abbreviation Average(SD) Heritability(SD) DateSystemirrigationsYield Optimum FlatMinimalFlat-Drought2.06(0.58)0.83(0.016) Bed 2Bed-2IR3.67(0.43)0.66(0.032) 5Bed-5IR6.11(0.61)0.43(0.025) Early5Bed-EHeat6.43(0.73)0.61(0.018) SD:standarddeviation. 2.3.4Regularizationimprovestheheritabilityandtheaccuracyoftheindex Toassessthee˙ectofregularizationontheaccuracyofindirectselectionwe˝ttedanL1-PSIovera gridofvaluesoftheregularizationparameter( ¹ 1 º > ¹ 2 º >:::> 0 inEquation(2.3),using = 0 rendersastandardSI).Foreachofthesolutions( ^ ¹ ¹ 1 º º ; ^ ¹ ¹ 2 º º ;::: )weestimatedtheheritability oftheresultingindexandthegeneticcorrelationbetweentheindexandtheselectiontarget,and fromthoseestimateswederivedtheaccuracyofindirectselection.Thesameapproachwasused toevaluatetheaccuracyofindirectselectionofPC-SIswithavaryingnumber( 1 ; 2 ;::: )ofPCs. We˝rst˝ttedPSIsandPC-SIsusingdatafromasingletime-point;theresultsfromthelatest time-point(correspondingtoMATorlateGFstagesdependingontheenvironment)arepresented inFigure2.2(seeSupplementaryFiguresA.3toA.5forothertime-points).Theheritabilityofthe L1-PSI(Figure2.2A)decreasedasmorebandsbecameactiveintheindex.Likewise,theheritability ofPC-SI(Figure2.2B)decreasedwiththenumberofPCsused.However,thegeneticcorrelation increasedaseithermorebandsbecomeactiveintheL1-PSIormorePCswereusedinthePC-SI. Consequently,themaximumaccuracyofindirectselectionwasachievedwithaSIofintermediate complexity(withanywherebetween20and60ofthe250bandsbeingactiveintheL1-PSI,and between20-60PCsinthePC-SI).Resultsforothertime-pointsandenvironments(Supplementary FiguresA.3toA.5)exhibitedsimilarpatternswithsomedi˙erencesbetweenenvironments.The 15 accuracyofindirectselectionoftheoptimalL1-PSIwasalwaysclosetothatoftheoptimalPC-SI andthatoftheoptimalL2-PSI(SupplementaryTableA.1).Importantly,inallcasestheaccuracyof indirectselectionoftheoptimalregularizedSIswasconsiderablyhigherthanthatofthestandard SI,whichistheonecorrespondingto250activebandsor250PCs(i.e.,theright-mostresultsin theplotsinFigure2.2). Figure2.2:AccuracyofindirectselectionofregularizedSIsanditscomponents.Squareroot heritability(green),geneticcorrelation(orange),andaccuracyofindirectselection(purple,all averagedover100training-testingpartitions),versusthenumberofpredictorsusedtobuildthe index:(A)numberofactivebandsinthecaseoftheL1-PSI,or(B)numberofPCsinthePC-SI. Eachpanelrepresentsoneenvironment(latesttime-point). Figure2.3displaystheaccuracyofindirectselectionacrosstime-pointsfortheoptimal(i.e., theonewiththehighestaccuracyofindirectselection)L1-PSIandPC-SI.Forcomparisonwe alsodisplayintheplottheaccuracyofindirectselectionachievedbyastandardSI(ingreen). Theestimated95%con˝denceintervalsoftheaccuracyoftheregularizedSIs(eitherPC-SIor L1-PSI)areallabove(anddonotoverlap)withthecon˝denceintervalsfortheaccuracyofthe standardSI,exceptforasingletime-point(57DASinenvironment Bed-2IR ).ResultsfromTukey's HonestSigni˝canceDi˙erencecon˝rmedthattheaccuracyoftheregularizedSIsisstatistically di˙erent(higher)thanthestandardSIata5%ofsigni˝cance(SupplementaryTableA.1)forall 16 butone(57DASinenvironment Bed-2IR )time-point/environment.Regularizationincreasedthe selectionaccuracyacrosstime-pointsandenvironments.RegularizedSIs(eitherPC-SIorL1-PSI) hadanaccuracyofindirectselectionthatwasinaverage10-40%higherthantheaccuracyachieved byastandardSI.Thesegainsinaccuracywerestrongerintheoptimalenvironment( Bed-5IR withamedianof36%)andsmallerinthestressedenvironments( Flat-Drought and Bed-EHeat withamedianof16%).Interestingly,therewerenosizabledi˙erencesbetweentheaccuracyof indirectselectionachievedwiththeoptimalL1-PSIandthatoftheoptimalPC-SI.Comparedwitha standardSI,regularizedSIshadhigherheritability(SupplementaryFigureA.6);thiswasachieved withoutcompromisingthegeneticcorrelation(SupplementaryFigureA.7),thusleadingtoahigher accuracyofindirectselectionachievedbyeitherpenalizationorrank-reductionstrategies. Figure2.3:Accuracyofindirectselectionachievedbyastandard(SI)andbyregularized(PC-SI andL1-PSI)selectionindices.Thelinesprovidetheaverageaccuracyover100training-testing partitions.Verticallinesrepresenta95%con˝denceintervalfortheaverage.Thehorizontalaxis givesthetime-pointatwhichimageswerecollectedandareexpressedinbothdaysaftersowing (DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity). 17 2.3.5Usingdatafrommultipletime-pointsfurtherimprovesselectionaccuracy Theresultspresentedabovewerebasedondatafromasingletime-point.Wealsogenerated selectionindicesusingdatafrommultipletime-points(inthiscase, x i wasavectorcontaining 2,250traits,correspondingto250wavebandsmeasuredateachof9time-points).Integratingdata frommultipletime-pointsfurtherincreasedtheaccuracyofL1-PSIbyamarginthatrangedfrom1 to8pointsonthecorrelationscale(Table2.2).Thegainsinselectionaccuracyobtainedusingdata frommultipletime-pointsweremoreevidentinenvironmentswithloweraccuracy;similarresults wereobtainedforthePC-SIandL2-PSI(SupplementaryTableA.1). Table2.2:Accuracyandrelativee˚ciencyofindirectselectionofanL1-penalizedSIusingdata fromoneandninetime-points. Accuracy(SD)RelativeE˚ciency(SD) EnvironmentBestsingleNinetime-pointsBestsingleNinetime-points time-point*combinedtime-point*combined Flat-Drought0.69(0.05)0.70(0.05)0.74(0.05)0.75(0.05) Bed-2IR0.46(0.04)0.54(0.03)0.57(0.05)0.67(0.04) Bed-5IR0.47(0.06)0.55(0.05)0.72(0.08)0.83(0.08) Bed-EHeat0.68(0.04)0.71(0.04)0.88(0.05)0.91(0.04) Valuesarepresentedasanaverageacross100training-testingpartitions.SD:standarddeviation.*Foreach environmentweincludethetime-pointthatgavethehighestaccuracyofselection(seeFigure2.3forother time-points). 2.3.6L1-penalizationleadstosparseselectionindices Figure2.4showsaheatmapforthesolutionsoftheoptimalL1-PSIthatintegrateddatafrom the9time-points.Eachpanelrepresentsanenvironment,horizontalbandsrepresenttime-points. Withineachtime-pointwavebandsnotenteringinthesolutionareingreyandnon-zerocoe˚cients arerepresentedinayellow-redscale(redindicateslargeabsolute-valuecoe˚cients).Thewell- irrigatedenvironments( Bed-5IR and Bed-EHeat )hadconsiderablysparserindiceswithonlya reducednumberofwavebandsinthesolutions;theseweremostlylocatedintheviolet,blueand redregionsofthespectrum.Instressedenvironments( Flat-Drought and Bed-2IR )therewerealso 18 afewwavebandsinthegreenandinfraredregionsthatwereactive.Inalltheindices,therewere wavebandsfromseveraltime-pointsthatwereactiveintheoptimalsolution,suggestingthatdata frombothearlyandlatephenologicalstagesareinformativeaboutthegeneticmeritforgrainyield. Figure2.4:Heatmapofregressioncoe˚cientsforL1-penalizedselectionindices.Separateindices werederivedforeachenvironmentusingmultitime-pointdata.DAS=daysaftersowing,VEG,GF, MATrepresentvegetative,grain-˝llingandmaturitystages,respectively.Thebottomcolor-bar showsthelightcolorassociatedwitheachwavebandinthevisiblespectrum( 750nm);blackwas usedtorepresentthenear-infraredspectrum(wavelength>750nm). 2.3.7Comparisonwithphenotypicprediction WecomparedtheaccuracyofindirectselectionofthePSIandPC-SIwithvegetationindices andpenalizedphenotypicprediction.Vegetationindicesareoftenusedtopredictyield(Tattaris etal.,2016),biomass,andchlorophyllcontent(Babaretal.,2006;Haboudaneetal.,2002).We consideredtwovegetationindices:theRedandGreenNormalizedDi˙erenceVegetationIndices 19 (RNDVI(Tucker,1979)andGNDVI(Gitelsonetal.,1996)respectively).Foreachoftheseindices weestimatedthegeneticcorrelationwithgrainyield,aswellastheirheritabilityandaccuracy ofindirectselection(SupplementaryTableA.1).Overall,theaccuracyofindirectselectionof theGNDVIandRNDVIwaslowerthantheoneachievedwithaPSI(theaveragedi˙erencein accuracybetweenRNDVIandtheL1-PSIvariedbyenvironmentfrom0.02to0.14pointsin correlation,SupplementaryTableA.1,infavoroftheL1-PSI).TheheritabilityoftheGNDVI andRNDVIwassimilarandsuperiorinsomecasestothatoftheL1-PSI(SupplementaryFigure A.6);however,thegeneticcorrelationbetweenthevegetationindicesandgrainyieldwas(inmost time-pointsandenvironments)lowerthanthegeneticcorrelationbetweentheL1-PSIandgrain yield(SupplementaryFigureA.7).Thus,themaindriverofthedi˙erenceinaccuracybetweenthe L1-PSIandthevegetationindiceswasthedi˙erenceingeneticcorrelation. Wealso˝ttedL1-penalizedphenotypicprediction(L1-Phen)andcomparedtheaccuracyof indirectselectionofthesephenotypicpredictionmethodswiththatofpenalizedSIs.Overall,the L1-PhenachievedanaccuracyofindirectselectionveryclosetothatoftheL1-PSI(Supplementary TableA.1);however,inafewenvironmentsatsometime-points,theL1-PSIachievedahigher accuracyofindirectselectionthanthephenotypicprediction. 2.4Discussion High-throughputphenotypinghasbeenextensivelyadoptedinagriculturalresearchandcom- mercialproduction.ExtractinginterpretableinformationfromHTPdataposesimportantstatistical challenges.Theclearmajorityofresearchinthisareahasfocusedoncalibratingequationsto predictphenotypes(e.g.,totalbiomass,grainyield)usingHTPdataasinputs.Thisapproachis well-suitedforphenotypicprediction;however,thesameapproachcanbesub-optimalforselection becausethebestpredictorofaphenotypeisnotalwaysthebestpredictorofthegeneticmeritof thesametrait. Thebest(linear)phenotypicpredictoristhesumofthebestlinearpredictorofthegeneticmerit ( g y )plusthebestlinearpredictoroftheenvironmentalterm( " y ),thatis, E ¹ y j x º = E ¹ g y j x º + E ¹ " y j x º . 20 The˝rstterm, E ¹ g y j x º ,istheSIanditis,byconstruction,maximallycorrelatedwiththegenetic merit.Thesecondterm, E ¹ " y j x º ,isrelevantforphenotypicpredictionbutrepresentsnoisewhen theproblemisthatofselectingthebestgenotypes. Selectionindicesexploitgeneticcovariances,whilephenotypicpredictionreliesonphenotypic covariancesbetweentheselectiontargetandthemeasuredphenotype(e.g.,cropimaging).Thus,the twomethodsyielddi˙erentresultswheneverthepatternsofphenotypiccorrelationsaresu˚ciently di˙erentfromthepatternsofgeneticcorrelations.Inourdataset,environmentalconditions werehighlycontrolled,withrelativelylowun-controlledwithin-trialvariabilityinenvironmental conditions.Consequently,thepatternsofphenotypicandgeneticcorrelationswereverysimilar (seeSupplementaryFigureA.8).Thiswastrueformanytime-pointsandenvironmentsbutnotin others(e.g.,80,85and93DASin Flat-Drought ,and90and98DASin Bed-2IR );itwasexactlyin thosetime-pointsandenvironmentsthattheL1-PSIachievedhigheraccuracyofindirectselection thantheL1-Phenmethod(SupplementaryTableA.1). AstandardSI(Equation(2.1))is,byconstruction,maximallycorrelatedwiththegenetic meritoftheselectionobjective.Thisoptimalitypropertyholdswhenthegeneticandphenotypic (co)variancematricesthatareneededtoderivethecoe˚cientsoftheSI(seeEquation(2.2))are knownwithouterror.However,whenthemeasuredphenotypeishigh-dimensional,estimation errorsinthephenotypic(co)variancematrix( P x ),aswellasinthegeneticcovariances( G x ; y ),can makethestandardSIsub-optimal.Ourempiricalresultscon˝rmthis:standardSIsover-˝ttedthe data;thisleadstoaSIwithlowheritabilityandlowaccuracyofindirectselection. Topreventover˝tting,weconsideredintegratingideascommonlyusedinhigh-dimensional regressionintotheSImethodology.Ourempiricalresultsshowthatregularizationconsistently improvestheaccuracyofindirectselectionrelativetostandardSIs.Weveri˝edthisforvarious environmentalconditionsandforcropimagingdatacollectedat9di˙erenttime-points.The optimalPSIandtheoptimalPC-SIachievedalmostthesameaccuracyofindirectselectionforall theenvironmentsandtime-points,suggestingthateithertypeofregularizationcanbee˙ective. Reduced-rankselectionindicesareappealingbecauseafterdimensionreductiontheproblem 21 ofderivingaSIistrivialandcanbedealtwithmethodscommonlyusedtoderivestandardSIs. Moreover,afterHTPhasbeenreducedtoafewderived-traits(saythetop10PCs),thesetraits canbeintegratedintogeneticevaluations(eitherpedigree-based(Henderson&Quaas,1976)or genomic-enabled(Meuwissenetal.,2001))usingstandardmulti-traitmodels. Principalcomponents-basedmethodshavebeenconsideredbeforeintheanalysisofFourier- transformedinfrared(FTIR)spectraderivedfrommilksamples.Forinstance,Soyeurtetal. (2010)usedareducednumberofFTIR-derivedPCstoestimatevariancecomponentsforselection objectives(e.g.,fatorproteincontentinmilk).Buildinguponthisidea,Dagnachewetal.(2013) suggestedpredictingthegeneticmeritformilkfattyacidsusingFTIR-derivedPCsas'traits'ina geneticevaluation.However,whenmappingfromgeneticpredictionsofPC-lodgingsontogenetic predictionsfortheselectionobjectivetheauthorsusedcoe˚cientsderivedfromaphenotypic (partialleastsquares)regression.Thisdoesnotguaranteethattheresultingindexismaximally correlatedwiththegeneticmeritoftheselectiontarget.ThepenalizedandPC-SIpresentedinthis studyaddressthatproblembyusingcoe˚cientsthatarederivedusinggenetic(andnotphenotypic) covariances. AdisadvantageofthePC-SIisthatthemethodologydoesnotnaturallyprovidevariable selection,afeaturethatmaybedesirablewhenthemeasuredphenotypeishigh-dimensional. Penalizedselectionindicescanperformvariableselectionbasedongeneticcovariances.While thederivationofaPSIisabitmorechallengingthanthatofthePC-SI,thecomputationalburden involvedinthederivationofaPSIisnotextremelyhigh. 2.4.1IntegrationofPSIandPC-SIintogeneticevaluations TheSIsconsideredherepredictgeneticmeritforaselectiontargetfromasetoftraitsmeasured onanindividual( I i = x 0 i );suchindicesexploitborrowingofinformationbetweentraitswithin anindividual.Borrowingofinformationbetweenindividualsincreasesselectionaccuracy;we envisiontwowaysinwhichregularizedSIscanbeintegratedintopedigreeorgenomic-based geneticevaluations. 22 Onepossibilityistouseatwo-stepsapproachwhereasinthe˝rststepaPSIoraPC-SIis usedtopredictthegeneticmeritusingwithin-individualinformation.Thisstepcanbeconsidered asataskwherepatternsattributabletogeneticcovariancesareextractedandthoseattributable toenvironmentalcovariancesaresmoothed-out.Then,inasecondstep,theresultingindex-data fI 1 ;:::; I n g couldbeusedasatraitinageneticevaluation. OurstudyshowsthattheuseofaregularizedSIleadstoaderived-phenotypethathashigher geneticaccuracythanstandardSIs,andthatofbestphenotypicprediction.Inprinciple,usinga moreaccuratephenotypeshouldleadtoahigheraccuracyofthepredictedbreedingvaluesinthe secondstep.However,furtherstudiesareneededtodeterminewhetherthegainsinaccuracyat theleveloftheHTP-derivedphenotypewillfullytranslateintoahigheraccuracyofthepredicted breedingvaluesinatwo-stepsprocedure. Aone-stepapproachisconceptuallyfeasibleandstatisticallymoree˚cientasito˙ersthe possibilityofconsideringcorrelationsbetweentraits,relationshipsbetweengenotypes,andthe e˙ectsofnon-geneticfactorsjointly;however,theimplementationoftheone-stepapproachusing high-dimensionalphenotypescanbecomputationallychallenging.Toimplementaone-stepap- proach,theoptimizationproblemofEquation(2.3)canbemodi˝edbyreplacing x i ,thevectorwith themeasuredphenotypesonthe i th individual,withavector x = ¹ x 0 1 ; x 0 2 ;:::; x 0 n º 0 thatcontainsall theavailableHTPdata(measuredonallnindividuals);afterexpandingthesquarederrorlossand takingexpectationsweget ^ i = argmin i 1 2 E g 2 y i 0 i G g x + 1 2 0 i P x i + J ¹ i º where G g x isa pn 1 vectorofgeneticcovariancesincludingbetween-traits-within-individual (co)variancesandbetween-subjectscovariances.Instandardgeneticmodels, G g x takesaKronecker form G g x = A i G x ; y ,where A i aregenetic(eitherDNA-orpedigree-derived)relationshipsbetween thecandidateforselectionandeachoftheindividualsinthetrainingset,and G x ; y is,asbefore,a vectorofgeneticcovariancesbetweentheselectionobjectiveandthemeasuredtraits( x ).Likewise, P x isa pn pn phenotypic(co)variancematrix.Estimating P x wouldrequireestimatingallthe geneticandenvironmentalcovariancesamongthemeasuredtraits. 23 2.4.2Impactoftheuseofhigh-throughputphenotypesinbreedingprograms Accordingtobreeders'equation(Falconer&Mackay,1996;Lush,1937),therateofgeneticgain fromselectionisdirectlyproportionaltoselectionaccuracyandselectionintensity.Thus,relativeto theuseofstandardSIs,theuseofregularizedSIsisexpectedtoincreaseselectiongainsbyafactor equaltothegainsobservedinaccuracy,thatisbetween10%and40%.Relativetomassphenotypic selection,thePSIshade˚ciencies,RE,rangingfrom60%to90%;therefore,relativetodirect phenotypicselection,selectiondecisionsbasedonPSIderivedfromimagesareexpectedtoyield lowergeneticgainsthantheonesthatcouldbeachievedviadirectmassselection.However,the useofHTPtechnologies(e.g.,cropmonitoringusinghyper-spectralcamerasmountedindrones) mayenabletheexpansionofthenumberofgenotypestested/measuredaswellasthenumberof locationswherethosegenotypesaretested.Thiscouldleadtoanincreaseinselectionintensity whichwillinturnincreaseselectiongains.Forinstance,iftheuseofHTPenablesdoublingthe numberofgenotypestested,theincreaseinselectiongainsthatcouldbeachievedwithHTPmay rangefrom20%(inthecasewherethePSIhasREof60%)to80%(forthetraits/environments withREof90%). Thediscussionintheprecedingparagraphisentirelybasedonbreeders'equation,whichdoes notcontemplatethelong-termimpactsofselectioningeneticdiversity.Amoreaccurateand moreintensiveselectionmaya˙ectdiversityandlong-termresponsetoselection.Toaddressthis problem,attentiontodiversitywillbeneededwithregularizedSIsaswithanyotherselection criteria. 2.4.3Regularizedselectionindicescanalsobeavaluabletoolingeneticresearch High-dimensionalphenotypesarealsobecomingincreasinglyavailableingeneticstudiesinvolving humansubjectsandmodelorganisms.Performinggeneticstudies(e.g.,genome-wideassociation analyses)onhigh-dimensionalphenotypesischallengingandtheburdenofmultipletestingacross hundredsorthousandsofphenotypes(e.g.,RNA-abundanceacrossthousandsofgenes)may criticallycompromisepower.ThePSIandPC-SIpresentedinthisstudycouldbeusedtoextract 24 geneticpatternsfromhigh-dimensionalphenotypedatasuchasbrainimagingorwhole-genome geneexpressionpro˝lesandthesepatternscanthenbeusedastraitsingeneticstudies. Conclusion .Weproposedtwonovelmethodsforpredictingthegeneticmeritforselectionob- jectivesfromhigh-dimensionalphenotypes.Thesephenotypesarebecomingincreasinglyavailable astheadoptionofHTPincropandanimalproductionincreases.Theproposedmethodsintegrate regularizationprocedurescommonlyusedinhigh-dimensionalregressionsintotheSImethodology. Regularizationpreventsover˝ttingandincreasestheaccuracyoftheindex.Themethodsproposed herecanbeusedtoextractgeneticpatternsfromalmostanykindofhigh-dimensionalphenotype, includingnotonlyHTPdataemerginginagriculturebutalsohigh-dimensionalphenotypesthat emergeingeneticstudiesinvolvinghumansubjectsandmodelorganisms. 2.5Methods 2.5.1Standardselectionindex TheweightsonaSIarederivedasthesolutiontotheoptimizationproblemofEquation(2.1): ^ = argmin 1 2 E g y i x 0 i 2 Theright-handsidecanbeexpressedas E ¹ g y i x 0 i º 2 = E ¹ g 2 y i º 2 E ¹ g y i x i º 0 + 0 E ¹ x i x 0 i º . The˝rstterm, E ¹ g 2 y i º ,doesnotinvolve ;therefore,itcanbedroppedfromtheobjectivefunction. Furthermore,if x i hasnullmean,andassumingthattheenvironmentale˙ectson x i areorthogonal to g y i ,then E ¹ g y i x i º = G x ; y isavectorcontainingthegeneticcovariancesbetweentheselection targetandeachofthemeasuredphenotypes.Likewise, E ¹ x i x 0 i º = P x isthephenotypic(co)variance matrixof x i .Therefore,theprobleminEquation(2.1)canbewrittenas ^ = argmin G 0 x ; y + 1 2 0 P x : Di˙erentiatingtheright-handsidewithrespecttovector andsettingthederivativesequalto zeroleadstothe˝rstorderconditions: P x ^ = G x ; y ;therefore, ^ = P 1 x G x ; y : 25 2.5.2Reduced-rankselectionindex Recallthatthesingularvaluedecompositionofareal-valuedmatrix, X = » x 1 ; x 2 ;:::; x n ¼ 0 (indi- vidualsinrows,phenotypesincolumns)takestheform X = UDV 0 ,where U = » u 1 ;:::; u p ¼ isthe matrixcontainingtheleft-singularvectorsthatspantherow-spaceof X , V = » v 1 ;:::; v p ¼ isthe matrixwiththeright-singularvectors,and D = dia g ¹ d 1 ;:::; d p º isadiagonalmatrixwithpositive orzeroelements.ThePCs W = XV = UD arelinearcombinationsofthemeasuredphenotypes.A reduced-rankregressionusesthe˝rst q PCs( ~ W = » w 1 ;:::; w q ¼ , q p )as"measuredphenotypes" intheSI: ^ ¹ q º = argmin 1 2 E g y i ~ w 0 i ¹ q º 2 ; where ~ w i isavectorcontainingthescoresforthe i th observationonthe˝rst q PCs.Thesolutionto theoptimizationproblemtakestheform ^ ¹ q º = P 1 ~ w G ~ w ; y ,where P ~ w isthephenotypic(co)variance matrixofthe˝rst q PCsand G ~ w ; y isavectorcontainingthegeneticcovariancesbetweeneachof thetop q PCsandtheselectionobjective.Sincetheleft-singularvectorsareorthonormal(i.e., u 0 j u j = 1 and u 0 j u k = 0 ,for j , k ),then W 0 W = D 2 = dia g ¹ d 2 1 ;:::; d 2 p º .Hence,amethod-of- momentsestimateofthephenotypic(co)variancematrixof ~ W containsonlythe˝rst q elements ~ D 2 = dia g ¹ d 2 1 ;:::; d 2 q º ;thisis ^ P ~ w = 1 n 1 ~ D 2 Using ^ P ~ w makesthecoe˚cientsofthePCstobeproportionaltothegeneticcovariancebetween eachofthePCsandtheselectionobjective: ^ ¹ q º = ¹ n 1 º¹ ~ D 2 º 1 G ~ w ; y .Thissolutioncanbemapped tocoe˚cientsforthemeasuredtraitsusing ^ ¹ q º = ¹ n 1 º ~ V ¹ ~ D 2 º 1 G ~ w ; y ,where ~ V isthematrix containingonlythe˝rst q right-singularvectors. 2.5.3Penalizedselectionindices TheobjectivefunctionofthepenalizedSIisgivenbyEquation(2.3).HereweconsideredPSIs usingeitherL1orL2-normsoracombinationofthetwo. 26 L2-PSI .UsinganL2-normaspenalty, J ¹ º = 1 š 2 Í p j = 1 2 j = 1 š 2 0 ,inEquation(2.3)leads tothefollowingoptimizationproblem: ^ L 2 = argmin 1 2 E g y i x 0 i 2 + 1 2 0 Therefore: ^ L 2 = argmin G 0 x ; y + 1 2 0 P x + 1 2 0 Thesecondandthirdright-handsidetermscanbecombinedtoobtain: ^ L 2 = argmin G 0 x ; y + 1 2 0 ¹ P x + I º ; where I isa p p identitymatrix.Di˙erentiatingwithrespectto andsettingthederivativesequal tozero,weobtainthe˝rst-orderconditions: ¹ P x + I º ^ L 2 = G x ; y ;therefore: ^ L 2 = ¹ P x + I º 1 G x ; y EN-PSI .Thecoe˚cientsfortheelastic-netfamilyareobtainedbyconsideringanobjective functionasinEquation(2.3),with J ¹ º = 1 š 2 ¹ 1 º Í p j = 1 2 j + Í p j = 1 j j j ;therefore, ^ EN = argmin 2 6 6 6 6 4 G 0 x ; y + 1 2 0 P x + 1 2 0 + 1 2 ¹ 1 º p Õ j = 1 2 j + p Õ j = 1 j j j 3 7 7 7 7 5 : TheL1-PSIandL2-PSIareparticularcasescorrespondingto = 1 and = 0 ,respectively. When = 0 thesolutionhasaclosed-form(seeL2-PSIabove).If > 0 ,noclosed-formsolution exists;however,asolutioncanbeobtainedusingthesameiterativealgorithmsthatareusedtosolve elastic-netregressions(e.g.,LARSandcoordinatedescent(Hastieetal.,2009)).Thesealgorithms canbeimplementedeitherby"partialresiduals"orusing"covarianceupdates"(Friedmanetal., 2010).Inourcase,theobjectivefunctionisentirelybasedon(co)varianceterms.Theobjects P x and G x ; y enterintheobjectivefunctionofthePSIinthesamewaythat X 0 X and X 0 y enterina standardelastic-netregression.Therefore,toobtainsolutions,weimplementedthestandardLARS algorithm(e.g.,Hastieetal.,2009)entirelybasedoncovarianceupdates. 27 2.5.4Data Thedatasetconsistsof 1 ; 092 inbredwheatlinesgroupedinto39trialsandgrownduringthe2013- 2014seasonattheNormanBorlaugexperimentalresearchstationinCiudadObregon,Sonora, Mexico.Eachtrialconsistedof28breedinglinesthatwerearrangedinanalpha-latticedesign withthreereplicatesandsixsub-blocks.Thetrialsweregrowninfourdi˙erentenvironments: Flat-Drought (sowingin˛atwithirrigationof180mmthroughdripsystem), Bed-2IR (sowingin bedwith2irrigationsapproximately250mm), Bed-EHeat (bedsowing30daysbeforeoptimal plantingdatewith5normalirrigationsapproximately500mm),and Bed-5IR (bedsowingwith5 normalirrigations).In2013,allthetrialswereplantedbymid-November(optimalplantingdate), onthe 21 st ( Bed-2IR and Bed-5IR )andonthe 26 th for Flat-Drought .Trialsfor Bed-EHeat were plantedonOctober 30 th .Grainyield(tonha 1 ,totalplotyieldaftermaturity)wasrecorded. Re˛ectancedatawerecollectedfromthe˝eldsusingbothinfrared(A600seriesInfrared camera,FLIR,Wilsonville,OR)andhyper-spectral(A-series,Micro-Hyperspec,VNIRHeadwall Photonics,Fitchburg,MA)camerasmountedonaPiperPA-16Clipperaircrafton9di˙erentdates (time-points)betweenJanuary 10 th andMarch 27 th ,2014.Duringeach˛ight,datafrom p = 250 wavebandsrangingfrom392to850nmwerecollectedforeachpixelinthepictures.UsingArcMap software(ESRI,CA),theaveragere˛ectanceofallthepixelswithineachgeo-referencedtrialplot wascalculatedandreportedasasingledata-pointforeachgenotypeforeachband.Daystoheading wererecordedasthenumberofdaysfromthedateofsowing/˝rstirrigationuntil50%ofspike emergenceineachplot.Headingofabout50-80%ofthetotalnumberofplotswasusedascriterion todistinguishbetweenvegetative(VEG)andgrain˝lling(GF)stages.Thecropwasconsideredto beatmaturity(MAT)stagewhentheaverageRNDVIdecreasedto ˘ 0 : 4 . 2.5.5Phenotypepre-processing Withineachenvironment,grainyieldphenotypicrecordswerepre-adjustedby˝ttingthefollowing mixedmodel, y jklm = + g j + t k + r l ¹ k º + b m ¹ kl º + " jklm 28 where y jklm isthegrainyieldphenotypevalueforthe j th genotype, k th trial, l th replicate(within trial), m th sub-block(withintrialandreplicate), istheoverallmeanand g j , t k , r l ¹ k º ,and b m ¹ kl º arethegenotype,trial,replicate,andsub-blocke˙ects,respectively(allassumedtoberandom) and " jklm isanerrorterm.Randome˙ectswereassumedtobeindependentlyandidentically distributed( iid )normalwithnullmeanande˙ect-speci˝cvariances.Likewise,theerrorterms wereassumedtobeiidwithnullmeanandcommonerrorvariance. Grainyielddatawerepre-adjustedbysubtractingfromthephenotypicrecord( y jklm )themean ( ^ )plusBLUPsoftrial,replicate,andsub-blocke˙ects;thisis y jklm = y jklm ^ ^ t k ^ r l ¹ k º ^ b m ¹ kl º = ^ g j + ^ " jklm (2.4) Re˛ectancedatawaspre-adjustedby˝ttingtheabovemodel,usingre˛ectanceatindividual bandsasphenotypeexpandedwiththeinclusionofatime-pointe˙ect.Separatemodelswere˝tted toeachofthewavebands.Aswithgrainyield,re˛ectancedatawerepre-adjustedbysubtracting fromthemeasuredre˛ectancetheestimatedmeanandpredictedtime-point,trial,replicate,and sub-blocke˙ects. Forqualitycontrol,pre-adjustedgrainyieldandre˛ectancephenotypeswereremovedforthose grainyieldscoreslyingbeyond3timestheinter-quantileregionfromthe0.25and0.75quantiles. Afterpre-adjusting,allphenotypeswerestandardized(tohaveunitvariance);foreaseofexposition, hereinafterwerefertotheadjusted-scaledphenotypes(includinggrainyieldandtheimagedata) simplyasphenotypes. 2.5.6Heritabilityestimation Afterpre-adjustingstandardization,weanalyzedphenotypesusingamixedmodeloftheform y ij = g j + " ij (2.5) where y ij isthephenotypeforthe i th observation( i hereisasingleindexforindices k , l ,and m inEquation(2.4))ofthe j th genotype;thegeneticvaluesare g j iid ˘ N ¹ 0 ;˙ 2 g y º ,where ˙ 2 g y isthe 29 geneticvariance;andtheenvironmentaltermsare " ij iid ˘ N ¹ 0 ;˙ 2 " y º .Plot-basisheritabilitywas calculatedfromvariancecomponentsestimatesusing h 2 y = ˙ 2 g y ˙ 2 g y + ˙ 2 " y : 2.5.7Training-testingpartitions Thedatasetcontainsinformationfrom39trialswith84observationseach.Toassesstheaccuracyof indirectselection,werandomlyassignedcompletetrialstotestingsets.Thetrainingsetcomprised allthedatafromthetrialsnotassignedtothetestingset.Thisapproachguaranteesthatnodata fromasingletrialispresentinbothtrainingandtestingsets.Thisapproachaimsatrepresenting asituationwhereonehascalibratedthecoe˚cientsoftheindexusinghistoricaltrialsandapply thesecoe˚cientstoimagedataoffuturetrials.Asimilarvalidationschemehasbeenused(using herd-year-seasongroupsinsteadoftrials)invalidationproblemsinpreviousstudiesinvolvingmilk spectradata(Ferraginaetal.,2015).Ineachtraining-testingpartition,outofthe29trialsavailable, 26trials( n trn ˇ 2 ; 184 observations)wererandomlyassignedtothetrainingset,andthedata fromtheremaining13trials( n tst ˇ 1 ; 092 )wasusedfortestingset.Theregressioncoe˚cients oftheindices(the 'sforthestandardSI,PSI,andPC-SI)werecalculatedusinggrainyieldand re˛ectancedataofthetrainingset.Estimatesofthecoe˚cientsandre˛ectancedatawerethen usedtocalculatetheSI, I ij = x 0 ij ^ ,foreachobservation i inthetestingset( i = 1 ;:::; n tst ).The heritabilityoftheindexandthegeneticcorrelationbetweentheindexandtheselectiongoalwere estimatedinthetestingset. Thetraining-testingproceduredescribedabovewasrepeated100timesbyrandomlyassigning trialstotrainingandtestingsets.Fromtheseanalyses,wereportedthemeanofheritability,genetic correlation,andselectionaccuracy;andtheirstandarddeviationacrosstraining-testingpartitions. 30 2.5.8Estimationofphenotypicandgeneticparameters Thepopulationphenotypic(co)variancematrix P x wasestimatedwithinthetrainingsetusingthe unbiasedsample(co)variancematrixgivenby ^ P x = 1 n 1 Í n trn i = 1 ¹ x i x º¹ x i x º 0 ,where x isthe vectorcontainingthesamplemeanofeachwaveband.Sincere˛ectancedataarecenteredand standardized,thisreducesto ^ P x = 1 n 1 X 0 X ,where X = » x 1 ; x 2 ;:::; x n ¼ 0 isthematrixcontainingall measuredtraitsinthetrainingset. Thegeneticcovariance( G x j ; y )betweengrainyieldandthe j th measuredtrait( j = 1 ;:::; p )was estimatedusingasequenceofunivariategeneticmodelsasinEquation(2.5).We˝ttedthatmodel withgrainyieldphenotypesasresponse,thenforeachofthere˛ectancebandsandthenforthesum ofgrainyieldandeachofthebands.Thegeneticcovariancesbetweenthebandsandgrainyield werethenestimatedusing ^ G y ; x j = 1 2 ^ ˙ 2 g y + x j ^ ˙ 2 g y ^ ˙ 2 g x j where ^ ˙ 2 g y , ^ ˙ 2 g x j and ^ ˙ 2 g y + x j aretheestimatedgeneticvariancesforgrainyield,the j th band,and thesumofgrainyieldandthe j th band,respectively. 2.5.9Estimationoftheaccuracyofindirectselection Toassesstheaccuracyofindirectselectionweappliedtheregressioncoe˚cientsderivedinthe trainingsettoimagedatafromthetestingsettoderive I ij = x 0 ij ^ .Then,usingamixedmodel analysislikethatdescribedintheprevioussectionweestimatedtheheritabilityoftheSI( h 2 I ), theheritabilityofgrainyield( h 2 y ),andthegeneticcorrelationbetweentheSIandgrainyield ( cor ¹ g I i ; g y i º ).Fromtheseestimates,wederivedtheaccuracyofindirectselection, Acc ¹Iº = h I cor ¹ g I i ; g y i º ,andtherelativee˚ciency, RE = h I h y cor ¹ g I i ; g y i º . 2.5.10Software AlltheaforementionedanalyseswereimplementedintheRsoftwareenvironment(RCoreTeam, 2019),version3.5.1.Linearmixedmodelswereimplementedusingthe"lmer"functionfromthe 31 LME4(Batesetal.,2015)R-package.ThesoftwarethatimplementstheLARSandcoordinate descentalgorithmsareavailablethroughtheSFSIR-package(https://github.com/MarcooLopez/ SFSI). 2.5.11Materialsanddataavailability ThedatausedinthisstudyarepubliclyavailablebyCIMMYT(https://www.cimmyt.org/)who ownsallrightsinthedata.ThedataisalsoincludedintheSFSIR-package.TheR-scriptsneeded toperformtheanalysespresentedinthisstudycanbefoundinthedocumentationoftheSFSI R-package. 2.6Acknowledgments WeacknowledgeCIMMYT'sGlobalWheatProgramthatprovidedbothexperimental˝eld andHTPdatausedinthiswork.M.L.C.wassupportedbytheMonsanto'sBeachell-Borlaug InternationalScholarshipProgram(MBBISP). 32 CHAPTER3 OPTIMALBREEDINGVALUEPREDICTIONUSINGASPARSEAMILINDEX [Materialsubmittedto: Genetics (conditionalaccepted,revisionpending)] MarcoLopez-Cruz 1 andGustavodelosCampos 2 ; 3 ; 4 ; 1 DepartmentofPlant,SoilandMicrobialSciences,MichiganStateUniversity,USA 2 DepartmentofEpidemiologyandBiostatistics,MichiganStateUniversity,USA 3 InstituteforQuantitativeHealthScienceandEngineering,MichiganStateUniversity,USA 4 DepartmentofStatisticsandProbability,MichiganStateUniversity,USA CorrespondenceandrequestshouldbeaddressedtoG.D.L.C.(e-mail:gustavoc@msu.edu). 33 3.1Abstract GenomicpredictionusesDNAsequencesandphenotypestopredictgeneticvalues.Inho- mogeneouspopulations,theoryindicatesthattheaccuracyofgenomicpredictionincreaseswith samplesize.However,di˙erencesinallelefrequenciesandinlinkagedisequilibriumpatterns canleadtoheterogeneityinSNPe˙ects.Inthiscontext,calibratinggenomicpredictionsusinga large,potentiallyheterogeneous,trainingdatamaynotleadtooptimalpredictionaccuracy.Some studiestriedtoaddressthissamplesize/homogeneitytrade-o˙bydesigningalgorithmstoidentify anoptimaltrainingset;however,thisapproachassumesthatasingletrainingdatasetisoptimum forallindividualsinthepredictionset.Here,weproposeanapproachthatidenti˝es,foreach individualinthepredictionset,asubsetfromthetrainingdata(i.e.,asetofsupportpoints)from whichpredictionsarederived.Themethodologythatwepropose(whichwelabelSparseSelection Index,SSI)integratestraditionalSelectionIndexmethodologywithsparsity-inducingtechniques commonlyusedinhigh-dimensionalregressionsettings.Thesparsityoftheresultingindexiscon- trolledbyaregularizationparameter( );theG-BLUP(thepredictionmethodmostcommonlyused inplantandanimalbreeding)appearsasaspecialcasewhichhappenwhen = 0 .Inthisstudy,we presentthemethodologyanddemonstrate,usingtwowheatdatasets(averylargemulti-generation breedingpanelandasmaller,highly-structured,dataset)withphenotypescollectedintendi˙erent environments,thattheSSIcanachievesigni˝cant(anywherebetween5-10%)gainsinprediction accuracyrelativetotheG-BLUP. 3.2Introduction Selectiondecisionsinplantandanimalbreedingrelyonthepredictedgeneticmeritofselection candidates.Earlypredictionmethodswerebasedeitheronphenotypesmeasuredinthesame selectioncandidatesoronprogenytesting(e.g.,Lush,1935).Thesemethodswerelaterextended intoselectionindices(Hazel,1943;Smith,1936)thatcanuseinformationfromvarioussources ofcorrelateddata,includingsecondarytraitsmeasuredonthesameindividual,measurementsof 34 thesamephenotypecollectedfromrelatives,andcombinationsthereof(Lush,1948).Henderson (1950)furtherextendedthemethodologybydevelopingmixed-modelsthatcaninclude˝xedand randome˙ects. TheBestLinearUnbiasedPredictor(BLUP)predictsbreedingvaluesbyborrowing(i.e.,aver- aging)informationfrommultiplesourcesofcorrelateddata.Pedigreesoftentracebackalimited numberofgenerations(e.g.,5);therefore,inmostanimalandplantbreedingdatasets,pedigree relationshipsoftende˝neandborrowingofinformationspanwithinthescopeofeach family.However,thisisnotthecaseingenomic-BLUP(G-BLUP;VanRaden,2008)because genomicrelationshipsarenotsparseaspedigree-derivedrelationships. Inthelasttwodecades,genomicprediction(i.e.,genomicselection,GS;Meuwissenetal., 2001)hasbecomethemethodofchoiceforbreedingvalueprediction.GSmodelspredictbreeding valuesusinggenome-widemarkersandrelyinthemulti-locuslinkagedisequilibrium(LD)between SNPsandquantitativetraitloci(QTL).However,itisalsowell-establishedthatfamilyrelationships andpopulationstructurecontributetotheaccuracyofgenomicprediction(Habieretal.,2007).Ina Genomicrelationshipmatrix(VanRaden,2007)allindividualsarerelatedtosomeextent;therefore, everytrainingdatapointcontributestothepredictionofeachindividualinthetestingset. Genomicpredictionmodelswereoriginallydevelopedwithreferencetoahomogeneouspopu- lationinwhichmarkere˙ectsareassumedtobethesameacrosssubgroupsofthedata.However, severalfactors,includingimperfectLDbetweenmarkersandQTLandnon-additivee˙ectscou- pledwithpopulationstructureandadmixturecanmakemarkere˙ectsvaryacrosssubgroupsin thesample(delosCamposetal.,2015;Pritchard&Donnelly,2001).Allthesefactorscanmake thegenomicrelationshipsderivedfrommarkersinaccuratepredictorsofthegenomicrelationships realizedatcausalloci(e.g.,delosCamposetal.,2013b).Therefore,theaccuracyofG-BLUPmay besub-optimalwhenthetrainingdataconsistsofheterogeneousgroups(e.g.,multiplefamiliesor multiplestrainsorbreeds)orevenmulti-generationdatainwhichLDpatternsmayvaryacross distantgenerations. SeveralauthorshaverecognizedtheneedtomodelheterogeneousSNP-e˙ectsinthecontextof 35 multi-breed(e.g.,Hayesetal.,2009)andstructured(e.g.,delosCamposetal.,2015)data.Mostof theexistingmethodsmodelgroup-speci˝ce˙ectsusingeithermultivariateGaussianmodels(e.g., Olsonetal.,2012;Schulz-Streecketal.,2012)orinteractionmodels(e.g.,delosCamposetal., 2015;Isidroetal.,2015;Veturietal.,2019).However,theseapproachescanbedi˚culttousein thepresenceofcrypticgenetic-heterogeneitypatternswherenocleargroupscanbediscerned. Anotherlineofresearchseekstoidentifyanoptimaltrainingsetforagivenpredictionset. Theseoptimaltrainingsetsoftenconsistofindividualsthatarecloselyrelatedtotheindividualsin thepredictionset,i.e.,thecandidatesofselection(Rincentetal.,2012;Akdemiretal.,2015;Isidro etal.,2015;Pszczola&Calus,2016;Akdemir&Isidro-Sanchez,2019).However,thesemethods assumethatasingletrainingsetisoptimalforalltheindividualsinthepredictionsetwhichisnot necessarilythecase.Therefore,inthisstudy,wefocusondevelopingagenomicpredictionmethod thatwillidentify,foreachindividualinapredictionsetanoptimaltrainingset(i.e.,asetofsupport points).Ourapproachachievesthisgoalbyintegratingsparsity(byaddinganL1-penalty)intoa selectionindex(SI)problem,werefertothemethodasasparseselectionindex(SSI). 3.3MaterialsandMethods Astandardselectionindex( I i )predictsthebreedingvalueofanindividual( u i )usingalinear combinationofthetrainingphenotypes( y = ¹ y 1 ;:::; y n º 0 ): I i = 0 i y = Í n j = 1 ij y j .Here,pheno- typesareassumedtobecenteredandcorrectedbynon-genetice˙ects(e.g.,experimentandblock e˙ects),and i = ¹ i 1 ;:::; in º 0 isavectorofweightswhichareobtainedasthesolutiontothe followingoptimizationproblem: ^ i = argmin i 1 2 E u i 0 i y 2 Theright-handsideoftheaboveproblemexpandsto E ¹ u 2 i º + 0 i E ¹ yy 0 º i 2 E ¹ y u i º 0 i . Assumingthatgenetic( u i )andnon-genetice˙ects( " i )areindependent,eachwithmeanzeroand (co)variancematricesvar ¹ u º = ˙ 2 u G andvar ¹ " º = ˙ 2 " I ,wehavethat E ¹ u i 0 i y º 2 = ˙ 2 u + 0 i P i 2 ˙ 2 u G 0 i i ,where ˙ 2 u isageneticvarianceparameter, P = ˙ 2 u G + ˙ 2 " I isthephenotypic(co)variance 36 matrixofthetrainingphenotypes,and G i isavectorcontainingthegeneticrelationshipsbetween the i th subjectofthepredictionsetandeachofthesubjectsinthetrainingdata.Since ˙ 2 u doesnot dependon i ,theaforementionedoptimizationproblemcanbereducedto ^ i = argmin i 1 2 0 i ¹ G + 0 I º i G 0 i (3.1) where 0 = ˙ 2 " ˙ 2 u = 1 h 2 h 2 istheratiooftheerrortothegeneticvariance,whichcanbeexpressedin termsoftheheritability, h 2 .Thesolutiontotheaboveproblemcanbeshowntobe ^ i = ¹ G + 0 I º 1 G i (3.2) Thevector ^ i canbeshowntobethe i th rowoftheHatmatrixoftheBLUPsofthegenetic valuesoftheindividualsinthepredictionset(seeAppendixB.1foraproof),thus,dependingon whether G isapedigree-orgenomic-derivedrelationshipmatrix,thestandardSIisequivalenttoa pedigree-(Henderson,1963)orgenomic-BLUP,respectively. When G isapedigree-basedrelationshipmatrixtheo˙-diagonalentriescorrespondingtopairs ofsubjectsnotconnectedthroughthepedigreeareequaltozero.Inthatcase,someoftheentries of ^ i canalsobeequaltozerowhichimpliesthatthecorrespondingpredictedbreedingvalue ( ^ I i = ^ 0 i y )drawsinformationfromasubsetofthetrainingdata.However,when G isagenomic relationshiptypicallynoneoftheo˙-diagonalsareequaltozero;therefore,noneoftheentriesof ^ i willbeexactlyequaltozero.Thisimpliesthatalltheobservationsinthetrainingsetcontribute tosomeextenttopredictthebreedingvaluesofalltheindividualsinthepredictionset. 3.3.1SparseSelectionIndexMethodology Asnotedearlier,thereareseveralreasons(e.g.,imperfectLD,e˙ectheterogeneity)whyborrowing ofinformationbetweendistantlyrelatedindividualsmayhaveadetrimentale˙ectonpredictionac- curacy.Therefore,toachievesparsity(andpossiblydi˙erentialshrinkageonthe ^ i )weconsidered addinganL1-penaltytotheobjectivefunctioninEquation(3.1);therefore, ~ i = argmin i 2 6 6 6 6 4 1 2 0 i ¹ G + 0 I º i G 0 i + n Õ j = 1 j ij j 3 7 7 7 7 5 (3.3) 37 Theaboveoptimizationproblemdoesnothaveaclosed-formsolution;however,solutionscan beobtainedusingaCoordinateDescentalgorithmverysimilartotheoneusedtosolveLASSO problems(seeLopez-Cruzetal.,2020).Theregularizationparameter controlshowsparse ~ i willbe;thisparameterisalsoexpectedtoa˙ecttheaccuracyoftheindex.Therefore,anoptimal valueof canbefoundbymaximizingtheaccuracyoftheresultingindex. 3.3.2Data Weusedtwowheatbreedingdatasetstoevaluateandtocomparethepredictionperformanceof standardandsparseselectionindices.The˝rstdataset(Wheat-large)isamulti-generationwheat breedingdatasetofaverylargesamplesize( n ˘ 29 ; 000 ).Thesecondoneis(Wheat-599)isa small,structureddata(seeSupplementaryFigureB.1). TheWheat-largedatasetisfromCIMMYT'sGlobalWheatProgramanditincludesphenotype datafrom 58 ; 798 wheatlinesthatwereevaluatedduring˝veyears(2009-2013)attheCIMMYT's experimentalstationinCiudadObregon,Mexico.Lineswereevaluatedundersixenvironmental conditions( B2I , B5I , MEL , LHT , DRB , EHT )representingacombinationofplantingsystem(bed vs˛at,thelaterreferredtoasmelgas),numberofirrigations(2,5irrigationsordripirrigation), andsowingdate(optimum,lateorearlyplanting).Eachyear,grainyieldtrialswereestablishedin an -latticedesignwiththreereplicatesintoincompleteblocks.Moisture-standardizedgrainyield (tonha 1 )wasmeasuredateachplot.Weusedmixed-e˙ectsmodelswitha(`˝xed')interceptand therandome˙ectsofthetrial,block(withintrial)andreplicate(withintrial)toderiveleast-square meansbylineandenvironmentalcondition.Separatemixedmodelswere˝ttedtodatafromeach ofthesimulatedenvironments.Theaveragegrainyieldinthisdatasetvariedfrom2.72to7.12ton ha 1 (seeSupplementaryFigureB.2Bforboxplotsofgrainyield)andtheheritabilityofsingle- plotrecordsvariedbetween0.23and0.57(seeSupplementaryTableB.1forasummaryofthe data).Onlyasubsetof 29 ; 484 genotypeswasgenotypedusingaGBS(Genotyping-by-sequencing) technologythatyielded 42 ; 706 SNPs.WeremovedSNPswithmorethan70%ofmissingvalues andthosewithminorallelefrequencylowerthan5%.Afterapplyingthese˝lters,weretained 38 9,045SNPs.ThemissingvaluesateachSNPwereimputedasthemeanoftheobservedSNPdata acrossgenotypes.ThedatasethasbeenpreviouslydescribedandanalyzedbyPérez-Rodríguez etal.(2017). TheWheat-599datasetisalsofromCIMMYT'sGlobalWheatProgramanditiscomprised ofgrainyieldandgenotypedatafor599historicalinbredlinesderivedalong25years.Lineswere evaluatedintheEliteSpringWheatYieldTrials(ESWYT)thatweregroupedintofourdi˙erent mega-environments( Env1 ,..., Env4 ).Theavailablephenotypicvaluesareleast-squaremeans fromtworeplicates.Theaveragegrainyieldinthisdatasetrangedfrom3.23to5.14tonha 1 (seeSupplementaryFigureB.2Aforboxplotsofgrainyielddata)withheritabilityestimatesfor theleast-squaremeansrangingbetween0.43to0.50(seeSupplementaryTableB.2).Eachofthe lineswasgenotypedfor 1 ; 279 diversityarraytechnology(DArT)markers.Thedatasetisavailable withtheBGLRR-package(Perez&delosCampos,2014)andhasbeendescribedandanalyzedby previousauthors(e.g.,delosCamposetal.,2009b;Crossaetal.,2010). 3.3.3Analyses Foreachdataset,wecomputedagenomicrelationshipmatrix G using(centeredandstandardized) markerinformation, X = f x im g ,as G = ZZ 0 š p ,where p isthenumberofmarkersand Z = f¹ x im x m ºš sd x m ºg isthematrixofcenteredandstandardizedmarkersobtainedbysubtracting fromeachmarkerentrythemeanofeachcolumn( x m )followedbyscalingbythestandarddeviation ofthecolumn( sd x m ).Theresultingmatrixhasanaverageofthediagonalelementsequalto1. Toquantifythepredictionaccuracyofeachoftheindices,wedividedeachdatasetintotraining (trn)andtesting(tst)setsbyrandomlyassigning30%(70%)ofthedatapointstotesting(training). Predictionswerederivedby˝rstusingEquation(3.2): ^ i = ¹ G + 0 I º 1 G i (forthestandardSI)andEquation(3.3): ~ ¹ º i = argmin i 2 6 6 6 6 4 1 2 0 i ¹ G + 0 I º i G 0 i + n Õ j = 1 j ij j 3 7 7 7 7 5 39 (fortheSSI),with G = G trn representingthegenomicmatrixofthetrainingdatapoints(i.e.,with dimensions n trn n trn ,where n trn = 0 : 7 n ),and G i = G trn ; tst ¹ i º beingthevectorcontainingthe genomicrelationshipsbetweenthe i th data-pointofthetestingset,witheachoftheindividuals assignedtothetrainingset(i.e.,thedimensionsof G i are n trn 1 ).Thiswasrepeatedforeach individualinthetestingset( i = 1 ;:::; n tst ,where n tst = 0 : 3 n ).Subsequently,predictionsforeach individualwereobtainedusing ^ I i = ^ 0 i y trn (forthestandardSI)and ^ I i = ~ ¹ º 0 i y trn (fortheSSI) where y trn isa n trn 1 vectorwiththeadjusted-centeredphenotypesofthetrainingset. TheimplementationoftheSIrequiresheritabilityestimates.Wederivedthoseby˝ttinga G-BLUPmodeloftheform y i = + u i + " i with " i iid ˘ N ¹ 0 ;˙ 2 " º and u ˘ N ¹ 0 ;˙ 2 u G º .The modelwas˝ttedusingtherrBLUPR-package,separatemodelswere˝ttedtograinyieldwithin eachenvironmentineachdatasetwithinthetrainingset.Wethenusedthevarianceparameters estimatestoderive h 2 = ˙ 2 u š¹ ˙ 2 u + ˙ 2 " º forgrainyield. Predictionaccuracy( ˆ )wasmeasuredwiththecorrelationbetweenthephenotypeandthe index,dividedbythesquare-rootofthetraitheritabilityofthetrait, ˆ = Acc ¹ ^ Iº = cor ¹ ^ I i ; y i ºš h (Dekkers,2007). FortheSSI,weestimatedaccuracyoveragridofvaluesoftheregularizationparameter ( = 0 < ¹ 1 º < ¹ 2 º < < max )where max = max i f j G i j dia g ¹ G º + 0 g .Here max istheminimum valueof thatyieldsanSSIwithnoactivepredictors,and = 0 givestheweightsofthestandard SI.FollowingFriedmanetal.(2010)weusedagridofvaluesevenlyspacedinthelogarithmscale withatotalof100values.Thus,foreachvalueof inthegrid,wehadanestimateoftheresulting accuracyoftheSSI.Thiswasusedtopro˝leaccuracyasafunctionoftheregularizationparameter andalsotochooseanoptimalvalueof . Todetermineanoptimalvalueof weimplementedacalibrationanalysisusingdatafromthe trainingdataonly.Speci˝cally,foreachtrainingset,weconductedaninternalcross-validation (CV)asfollows:( i )Thetrainingdatasetwaspartitionedinto k subsets.( ii )SSIswerederived overagridofvaluesof usingdatafrom k 1 foldsfortrainingandthedatainthe k th foldas testing(i.e.,fortheestimationofaccuracy,seethepreviousparagraph).( iii )Theresultingcurves 40 pro˝lingaccuracy( ˆ )byvaluesof wereusedtoidentifythevalueof ( ^ c v )thatmaximized accuracy.( i v )Finally,weusedallthedatafromthetrainingsettoderive I i ¹ ^ c v º andevaluatedthe accuracy( ˆ )oftheresultingindexintheleft-outdatafromthetestingset. 3.3.4Software AlltheanalyseswereperformedintheRenvironment-language(RCoreTeam,2019)version 3.5.TheheritabilityofeachofthetraitswasestimatedusingtherrBLUPR-package(Endelman, 2011).SparseSIswerederivedusingtheSSIfunctionfromtheSFSIR-packagethatimplements theCoordinateDescentalgorithmdescribedinLopez-Cruzetal.(2020).Thepackageisaided byggplot2(Hadley,2016)andparallel(RCoreTeam,2019)packagestovisualizeresultsandto speedcomputation.ThispackageisavailablethroughtheGitHubrepositoryathttps://github.com/ MarcooLopez/SFSI.ScriptsillustratingtheuseofthispackageusingtheWheat-599datasetare presentedintheAppendixB.2. 3.3.5Dataavailability BothphenotypicandmarkerdatafortheWheat-largedatasetcanbedownloadedfromCIM- MYT'srepositoryathttp://genomics.cimmyt.org/wheat_50k/PG/(accessedSep14th,2020).The Wheat-599datasetcanbedownloadedfromtheBGLRR-packagebycallingAll supplemental˝guresandtablesarecontainedintheAppendixB.Allsupplemental˝lesareavailable atFigshareathttps://˝gshare.com/s/ce2258d168b16a86454d. 3.4Results 3.4.1Sparsityimprovespredictionaccuracy Weassessedthee˙ectofsparsityontheaccuracy,by˝ttingtheSSIfor100valuesof ( 0 < ¹ 1 º < ¹ 2 º < < max ;thevalue = 0 producesthecoe˚cientsofthestandardSIorG-BLUP).The results(averagedover100trn-tstpartitions)areshowninFigure3.1.Thenumberofsupportpoints 41 (i.e.,thenumberoftrainingdatapointscontributingtotheprediction)was,asexpected,inversely proportionalto ;therefore,tofacilitateinterpretation,thex-axisofFigure3.1isdisplayedasthe averagenumberofsupportpoints,whichismoremeaningfulthanthe values.Theaccuracyof theG-BLUPisalsoshownattherightmostsideoftheplotwhosenumberofsupportpointsisequal tothesizeofthetrainingdataset.Intermediatevaluesof ledtosparseindicesthat,inmostcases, achievedhigherpredictionaccuracythanthatoftheG-BLUP(shadedareainFigure3.1). Figure3.1:Predictionaccuracy(averageacross100trn-tstpartitions)oftheSSIversusthe(average) numberofpredictorsintrainingsetsupportingtheSSIofeachindividualintestingset(x-axis). Genomic-BLUP(bluerightmostpoint)appearsasaspecialcaseofanSSI.Eachpanelrepresents oneenvironmentwithindataset.(A)Wheat-largedataset.(B)Wheat-599dataset.Verticalbars representa95%con˝denceintervalfortheaverage. 42 Themaximumaccuracyintheenvironment EHT (seeFigure3.1A)wasobtainedwitha penalizationthatleadstoasparseindexwithanaverageof120supportpoints( n sup ).This predictivesetofindividualsrepresentsaround8%ofthetotaltrainingset( n trn = 1 ; 428 )available forprediction. Forthesmalldataset(Figure3.1B),thesamepatterncanbeobservedinallenvironments, exceptforenvironment2.ThiscaseshowsthattheSSIdoesnotalwaysoutperformtheG-BLUP; however,theSSIachievesthepredictionaccuracyoftheG-BLUPwithasmallersupportset ( n sup ˇ 151 outof419). Figure3.2:Predictionaccuracyoftheoptimalsparseselectionindex(SSI)versusthatofthe G-BLUP.Eachpointrepresentsatrn-tstpartition(atotalof100partitionswereimplemented),the pointshapeandcolorrepresentenvironments.(A)Wheat-largedataset.(B)Wheat-599dataset. Thevalueof intheSSIwasestimatedusing105-foldcross-validationsconductedwithinthe trainingdata.Inparenthesis,bythelegend,isthep-valueforthetwo-sidedSign(binomial)testfor within-environmentdi˙erencesinaccuracybetweentheSSIandtheG-BLUP. 3.4.2Usinganinternalcross-validationtoachieveoptimalsparsity TheresultsinFigure3.1suggestthatonecan˝ndavalueof thatleadstoanindexwitha predictiveperformanceasleastashigh(andinmostcaseshigher)astheG-BLUP.However,to obtainanunbiasedestimateofthemaximumaccuracythatonecouldachievewithanSSI,one 43 shouldnotusedatafromthetestingsettoselecttheoptimalvalueof .Therefore,werepeated theanalysesdescribedabove,thistimeperformingthegridsearchforanoptimalvalueof by implementing105-foldCVswithineachtrainingdataset.ThisCVwasusedtochooseanoptimal valueof ( ^ c v ).Then,wesolvedtheSSIusing ^ c v andallthetraininggenotypes,andevaluated theaccuracyof I i ¹ ^ c v º inatestingsetthatwasnotusedtochoose ^ c v .Thiswasrepeatedfor100 trn-tstpartitions.Figure3.2showstheaccuracyof I i ¹ ^ c v º versusthatoftheG-BLUP,eachpoint intheplotrepresentsatrn-tstpartition. Table3.1:Predictionaccuracy(averageacross100partitions)achievedbysparseselectionindices (SSIs)andbytheG-BLUP(standardSI),bydatasetandenvironmentalcondition. Environment n tst n trn Method c v a n sup b Accuracy(SD) c Counts d Wheat-large B2I1,1202,612 G-BLUP0.00002,6120.617(0.031)b 97 SSI0.01354340.648(0.031)a B5I8,84220,631 G-BLUP0.000020,6310.555(0.010)b 100 SSI0.01071,4700.609(0.009)a MEL1,3213,082 G-BLUP0.00003,0820.600(0.045)b 99 SSI0.01315240.661(0.046)a LHT1,3223,082 G-BLUP0.00003,0820.669(0.024)b 99 SSI0.01683800.709(0.025)a DRB1,1292,634 G-BLUP0.00002,6340.629(0.035)b 98 SSI0.03221360.675(0.037)a EHT6121,428 G-BLUP0.00001,4280.614(0.049)b 94 SSI0.03011780.649(0.047)a Wheat-599 Env1180419 G-BLUP0.00004190.721(0.070)b 87 SSI0.0413780.760(0.067)a Env2180419 G-BLUP0.00004190.702(0.087)a 41 SSI0.01232540.692(0.085)a Env3180419 G-BLUP0.00004190.585(0.101)a 53 SSI0.0613840.586(0.093)a Env4180419 G-BLUP0.00004190.663(0.082)b 87 SSI0.0617540.714(0.075)a SD:Standarddeviationacrossthe100trn-tstpartitions.G-BLUPmodelcorrespondstoanSSIwhere = 0 . a Averagevalueof estimatedbycross-validatingthetrainingset. b Averagenumberofindividuals intrainingsetsupportingthepredictionofindividualsfromtestingset. c Modelswiththesameletterarenot signi˝cantlydi˙erentfromothers(ANOVAfollowedbyTukey'sHSDtest,5%signi˝cancelevel). d Number oftimes(outofthe100partitions)thattheSSIoutperformedtheG-BLUPinpredictionaccuracy. 44 IntheWheat-largedataset,theoptimalSSIoutperformedtheG-BLUPin94%ofthecases (Table3.1).Forthisdataset,theSSIo˙eredsigni˝cant(accordingtoTukey'sHonestSigni˝cance Di˙erencetest,HSD)gainsinaccuraciesacrossenvironments.Thesegainsrangefrom5%(inthe environmentB2I)to10%(intheenvironment MELGAS )inthecorrelationmetric. SimilarpatternswereobservedwiththeWheat-599dataset.InEnvironments1and4the SSIoutperformedtheG-BLUPinmorethan80%ofthetrn-tstpartitions(Table3.1),withgains inaccuracyrangingfrom5-7%.However,inEnvironments2and3,therewerenostatistically signi˝cantgainsinaccuracy(seeTable3.1). Figure3.3:Distributionofthenumberoftrainingsupportpoints( n sup )inoptimalsparseselec- tionindices(resultsobtainedover100trn-tstpartitions; n trn =sizeofthetrainingdataset),by environmentalcondition,Wheat-largedataset. 3.4.3SparseSelectionIndicesbuildsubject-speci˝ctrainingsets Foreachindividualinthepredictionset,anSSIyieldsasetofsupportpointsinthetrainingset consistingoftheindexofallthenon-zeroentriesof ~ ¹ º i .Figure3.3showsthedistribution(across 100trn-tstpartitions)ofthenumberofsupportpoints( n sup )for ^ c v foreachoftheenvironments 45 oftheWheat-largedataset.At ^ c v , n sup rangesfrom30to ˇ 5 ; 000 .In3oftheenvironments( B2l , MELGAS ,and LHT )theaveragenumberofsupportpointswas n sup ˇ 450 ,thatis ˘ 15 20 %of thesizeofthetrainingset.Inenvironment B5l ,theproportionofactivetrainingsupportpointswas ˘ 5 10 %.Ontheotherhand,inenvironment EHT predictionsreliedonanaverageof n sup ˇ 178 (of 1 ; 428 )individualsfromtraining(Figure3.3).SimilarpatternswerealsoobservedintheWheat- 599data(SupplementaryFigureB.3);forinstance,testingphenotypesfromenvironment1were optimallypredictedinaveragewith n sup ˇ 78 (of419);however,therelativesparsity( n sup š n trn ) wassmallerintheWheat-largedataset(5-17%)thanintheWheat-599dataset(12-60%). Figure3.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)andthe correspondingsupportpoints(green).Greypointsrepresentgenotypesthatdidnotcontributeto thepredictionofthegeneticvalueofthegenotypeinyellow.Allpanelsrepresentsolutionsforthe environment EHT ,Wheat-largedataset. 46 Figure3.4shows(forselectedtestinggenotypes)thecoordinatesonthe 1 st and 2 nd PCof boththepredictionpoint(yellowcircle)andthetraininggenotypes.Activetraininggenotypes arerepresentedinagreencircle,andthosenon-active(i.e.,withzeroweightintheindex)are representedingrey.Insomecases,thesupportsetincludestraininggenotypesthatarenearby (accordingtothecoordinatesonthe 1 st 2PCs)thepredictionpoint.However,inothercases, thesupportsetspannedoutsidetheofwherethepredictionpointresides.This suggeststhattheSSIdoesnotnecessarilychoosetrainingpointswithintheclusters.Asimilarplot fortheWheat-599datasetispresentedinSupplementaryFigureB.4. 3.4.4Genomicrelationshipsandweightsinstandardandsparseselectionindices Figure3.5Ashowsthecoe˚cientsoftheG-BLUPandoftheSSI(i.e.,the ij 'sderivedfrom Equations(3.2)and(3.3),respectively)versusthegenomicrelationship( g ij ,the ij entryof G ). InFigure3.5A,the ij 'swerederivedforatraining-testingpartitionwith˝xedheritabilityand chosenbyCVconductedwithinthetrainingset,forenvironmentEHTfromtheWheat-largedata set.TheweightsusedbytheG-BLUPare,asexpected,alldi˙erentfromzeroandarepositively associatedwiththegenomicrelationships(i.e.,onaverage,traininggenotypescloselyrelatedto genotypesinthepredictionsetreceivehigherweightontheindex).However,thepointsdonotfall overaperfectlinebecausetheweightgiventoeachofthetrainingpointsdependsnotonlyonthe relationshipbetweenthetrainingpointandthepredictionpointbutalsoontherelationshipsamong trainingpoints.Ontheotherhand,asexpected,theSSIzero-outsmostoftheweights.Interestingly, theSSIseemstozero-outmostoftheweightsthatareinthetopleftandlower-rightquadrants(i.e., pointsthathadanegative(positive)relationshipandintheG-BLUPgotpositive(negative)weight, comparebothplotsinFigure3.5A).PanelBinFigure3.5showstheproportionofcoe˚cients thatarezeroed-outbylevelofgenomicrelationship.Mostofthecoe˚cientscorrespondingto traininggenotypeswithrelationshipswithpredictionpointsbetween-0.1and0.1arezeroed-out; theproportionofcoe˚cientsthatarezeroeddecreasesrapidlyas g ij increases;however,the decreaseseemstobefasterfortheWheat-largedatasetthanfortheWheat-599(Supplementary 47 FigureB.6).Interestingly,theproportionofcoe˚cientszeroed-outalsodecreasesfor`negative' genomicrelationships,suggestingthattheSSIdoesnotusea`local'supportset;instead,theSSI seemstouseinformativesupportpoints.Atleastinthecontextofa`linear'kernelastheoneused here,anegativepriorcorrelation(i.e., g ij < 0 )canbeinformative.Thepatternsobservedinother environmentsoftheWheat-largedatasetandinthefourenvironmentsoftheWheat-599dataset wereconceptuallysimilartotheonespresentedinFigure3.5(seeSupplementaryFiguresB.5and B.6). Figure3.5:(A)Weights( ij )ofastandardSI(G-BLUP)andoftheoptimalsparseselectionindex (SSI)versusthegenomicrelationship( g ij ).(B)ProportionofweightsintheSSIthatwerezero (non-active)andnon-zero(supportpoints);Wheat-largedataset,environment EHT . 3.5Discussion Samplesizehasbeenrecognizedasoneofthemainfactorslimitingpredictionaccuracy ingenomicprediction(Lorenzana&Bernardo,2009;delosCamposetal.,2013a;Habieretal., 2013).Inun-structuredpopulations,SNPe˙ectscanbeassumedtobehomogeneousand,therefore, genomicpredictionaccuracyincreaseswithsamplesize(e.g.,Daetwyleretal.,2008;delosCampos etal.,2013a).However,thisisnotnecessarilythecaseinstructuredandadmixedpopulations,in multi-familydata(e.g.,datafrombi-parentalfamilies),orinmulti-generationdata.Inthosecases, di˙erencesinallelefrequenciesandinLD-patternsacrossmaymakeSNPe˙ectsheterogenous acrosssubgroupsinthesample.Inthatcontext,alargertrainingdatasetmaynottranslateinto 48 higherpredictionaccuracy.Thisphenomenonhasbeenrecognizedinbothplantandanimal breeding,aswellasincomplextraitpredictioninhumans. Forexample,usingdatafromabroilerbreedingpopulation,Wolcetal.(2016)showedthatusing trainingsetsthatincludeddatafrommanygenerationsledtoslightlylowerpredictionaccuracythan theoneachievedwhenmodelsweretrainedwithdatafromjustthelast3generations.Likewise, Hayesetal.(2009)showedthatthepredictionaccuracyforHolsteincattlewasnotimprovedby addingtothetrainingsetdatafromJerseycattle.Inplantbreeding,usingdatafrombi-parental families,Jacobsonetal.(2014)reportedthatwithinfamilypredictionaccuracycouldbeincreased bytrainingmodelsusingonlydatafromfamiliesthatshareatleastoneoftheparents.Finally,in thecontextofhumandata,delosCamposetal.(2013b)notedthattheaccuracyofSNP-derived genomicrelationshipscouldbeverylowfordistantlyrelatedindividuals.Thus,combiningfamily datawithlargevolumesofdatafromdistantlyrelatedindividualsmaynotimprove(ormayeven reduce)predictionaccuracyrelativetomodelstrainedwithfamilydataonly. Thus,whendataoriginatesfromheterogeneoussourcestheremaybetrade-o˙sbetweensample sizeandthepossibilityofhavingahomogenousdatasetinwhichSNPcanbeconceivedas homogenouswithinthetrainingdataandbetweentrainingandtestingsets.Therecognitionthat ingenomicprediction'biggerisnotalwaysbetter'ledtothedevelopmentofseveralmodelsand model-trainingstrategiesaimingtoimprovepredictionaccuracy.Onelineofresearchattemptsto modele˙ectheterogeneityusinggroup-speci˝ce˙ects(e.g.,Veturietal.,2019;Rioetal.,2020). However,thisapproachisusefulwhenindividualsclusterinasmallnumber(e.g.,2or3)of well-de˝nedclusters,andbecomeslessusefulanddi˚culttoapplywhendataischaracterized byeitheralargenumberofgroups(e.g.,bi-parentalfamilies)orwhengroupsoverlapincryptic manners(e.g.,admixedpopulationsorpartially-overlapping-multi-generationdata).Anotherline ofresearchseekstoidentifyantrainingbyeitherselectingdatafromindividuals thatarecloselyrelatedtothepredictionset(e.g.,Rincentetal.,2012;Jacobsonetal.,2014;Wolc etal.,2016)orbyusingmoresophisticatedoptimizationalgorithms(e.g.,Akdemiretal.,2015). However,theseapproachesassumethatitispossibletobuildatrainingsetthatisoptimalforall 49 thegenotypesinthepredictionset.Ourapproachdi˙ersfromtheseonesinthatwedevelopeda methodologythatbuildssubject-speci˝ctrainingsets.Indeed,theSSIselects,foreachindividual inthepredictionset,acustomtrainingset(orsupportpoints)fromwhichpredictionsarederived. OurapproachbuildsonselectionindexmethodologybyaddinganL1-(sparsity-inducing-)penalty intotheoptimizationproblem.Theresultisanindexinwhichonlyasubsetofthetrainingpoints contributestopredictionaccuracy. Whenthetrainingdataconsistsofdisconnectedfamilies,pedigreeBLUPequationscanalso besparse.However,thisisnotthecaseoftheG-BLUPbecausegenomicrelationshipmatrices aredense.TheSSIbringsbacksparsityintogenomicprediction.Thelevelofsparsityislargely controlledbythepenalizationparameter( ).Thisparametercanbetunedusingcross-validation withinthetrainingdata.Aswithanyotherparameter,thevalueof thatmaximizesaccuracy maychangeslightlybetweentrn-tstpartitions;however,inourexperience,usingafew(e.g.,10) training-testingpartitionsareenoughtoobtainanaccurateestimateofthevalueoftheregularization parameterthatmaximizesaccuracy. Asnoted,anSSIidenti˝es,foreachindividualinthepredictionset,anetworkofgenotypes inthetrainingdataset(seeFigure3.4andSupplementaryFigureB.4)thatcontributetothe prediction.At˝rstglance,thisappearssimilartotheapproachusedinak-nearestneighbor(KNN) regression(Cover&Hart,1967).InKNN,the k geneticallyclosestindividuals(neighbors)predict eachselectioncandidate,andpredictionsarederivedusinganaverageofthephenotypesinthe neighborhood.Thereareimportantdi˙erencesbetweentheKNNandtheSSI.First,theKNNuses onlymarginalsimilarities/distancesbetweenapredictionpointandthepointsinthetrainingdata tobuilda'neighborhood';theSSI,however,alsoconsidersthecorrelations(i.e.,redundancies) betweenpointswithinthetrainingdata.Asaconsequence,theoptimalsupportsetoftheSSImay includesomedistantlyrelatedindividuals(seeFigure3.4andSupplementaryFigureB.4).Second, whileinthestandardKNNpredictionsaresimplythearithmeticmeanofthephenotypesinthe neighborhood,intheSSIeachtrainingpointcontributesdi˙erentlywithweights(the ij 's)that re˛ectboththecorrelationofthetrainingpointwiththepredictionpointaswellascorrelations 50 amongpointsinthetrainingset. BestLinearUnbiasedPrediction(BLUP)methodsareequivalenttoL2-penalizedregressions. InBLUP,shrinkageiscontrolledbythenoiseandsignalvariances( 0 = ˙ 2 " š ˙ 2 u ,seeEquation (3.2).WeaddedtotheoptimizationproblemanL1-penality;thus,theSSIusesbothL1andL2 (whichisintrinsicallybuiltintheSI)penalties.Therefore,theSSIcanbeseenasbeingatypeof Elastic-Net(Zou&Hastie,2005)regression.However,intheSSItheweightontheL2-penalty isdeterminedbytheratioofvariancecomponents( 0 = ˙ 2 " š ˙ 2 u )whichmayormaynotbean optimalchoicefromapredictionperspective(particularlyiftheunderlyingassumptionsofthe BLUPmethod,e.g.,homogeneityofe˙ects,donothold).Therefore,toadd˛exibilitytotheSSI weconsideredexplicitlyaddingL1-andL2-penalties,andsearchingforanoptimalcombination, usingcross-validation,oftherelativeweightsofthepenalizationparametersoftheElastic-Net( and )optimizationproblem: ~ ¹ ; º i = argmin i 2 6 6 6 6 4 1 2 0 i ¹ G + 0 I º i G 0 i + 1 2 ¹ 1 º n Õ j = 1 2 ij + n Õ j = 1 j ij j 3 7 7 7 7 5 Toavoidtoo-muchpenalization,wedecreasedtheweightoftheinitialL2-penaltyto 0 : 5 0 .We foundthatthispracticecouldincreasepredictionaccuracybyasmallfactor(2-3.5%,seeSupple- mentaryTableB.3fortheWheat-largedataset)relativetotheoriginalSSImethod(Equation(3.3)). However,thispracticedidnotprovideanyadvantageovertheoriginalSSIintheWheat-599data set(seeSupplementaryTableB.4). Inconclusion,wepresentedanovelpredictionmethodthatcombinesinasingleframework, selectionindexmethodologywithsparsity-inducingmethods.TheresultingSSIidenti˝esoptimal trainingsetsforeachpointinthepredictionset.Themethodcanbeusefulformultipleapplications, includingtheuseingenomicpredictionofdatafromstructuredpopulations,bi-parentalfamilies, andtheanalysesofverylargemulti-generationdatasets. 51 3.6Acknowledgments WearegratefultoCIMMYT'sGlobalWheatProgramthatprovidedboththeexperimental˝eld andmarkerdatausedinthisstudy.M.L.C.wassupportedbytheMonsanto'sBeachell-Borlaug InternationalScholarshipProgram(MBBISP)andbytheDissertationCompletionFellowship fundedbytheMichiganStateUniversityGraduateSchool. 52 CHAPTER4 GENOMICPREDICTIONINMULTI-GENERATIONALMAIZEHYBRIDSUSING SPARSEKERNELMODELS MarcoLopez-Cruz 1 ,YosephBeyene 2 ,ManjeGowda 2 ,JoseCrossa 3 , andGustavodelosCampos 4 ; 5 ; 6 1 DepartmentofPlant,SoilandMicrobialSciences,MichiganStateUniversity,USA 2 GlobalMaizeProgram,InternationalMaizeandWheatImprovementCenter(CIMMYT),Kenya 3 BiometricsandStatisticsUnit,InternationalMaizeandWheatImprovementCenter(CIMMYT), Mexico 4 DepartmentofEpidemiologyandBiostatistics,MichiganStateUniversity,USA 5 InstituteforQuantitativeHealthScienceandEngineering,MichiganStateUniversity,USA 6 DepartmentofStatisticsandProbability,MichiganStateUniversity,USA 53 4.1Abstract Therehasbeenmuchinterestintheuseoflargehistoricaldatatocalibratemoreaccurate predictionmodelstoimprovecurrentbreedingprograms.However,multi-generationaldataoften comeswithincreasedheterogeneitythatmightincludecomplexadmixturepatterns,inwhich geneticrelationshipsarereducedasgenerationsadvance.Ithasbeenrecognizedthatdi˙erences inheterogeneitypatternsbetweenthetrainingandthepredictionset,orincludinginthetraining setindividualsthataredistantlyrelatedtothepredictionsetcanreducethepredictionaccuracy. Mostoftheresearchinthissensefocusesondesigningoptimaltrainingsetsthatincludeonlyafew previousgenerationsoragroupofgenotypesthataremorecloselyrelatedtothecurrentprediction set.However,sometrainingindividualscanbeoptimalforsome,butnotallindividualsinthe predictionset.Thesparseselectionindex(SSI)determines,foreachindividualintheprediction set,acustomizedoptimaltrainingset.Usingadditivegenomicrelationships,theSSIcanprovide anincreasedaccuracyrelativetothestandardG-BLUP.ModelswithGaussiankernels(K-BLUP) havebeenshowntoyieldahigheraccuracybymaximizingthecovariancebetweencloselyrelated genotypes.WestudiedwhethertheSSIusingGaussiankernelscanprovideincreasedaccuracies. Usingathree-generationdoubledhaploidmaizedatasetfromtheInternationalMaizeandWheat ImprovementCenter(CIMMYT),weshowthatthestandardK-BLUPoutperformedtheG-BLUP. Also,wefoundthatusinganSSIwithadditivegenomicrelationships(sparseG-BLUP)leadsto gainsinaccuracybetween5%-20%,relativetothestandardG-BLUP.FortheK-BLUP,thegains obtainedbyaddingsparsityweresmallerandnotalwayssigni˝cant. 4.2Introduction AlmosttwodecadeshavepassedsinceGenomicSelection(GS)was˝rstproposedby(Meuwis- senetal.,2001).Thisgroundbreakingideawasquicklyadoptedforbreedingdairycattle(Hayes etal.,2009),beefcattle(Garrick,2011),broilers(Wolcetal.,2016),maize(Bernardo&Yu,2007), wheat(Polandetal.,2012),andmanyotheranimalspeciesandcrops(Xuetal.,2020).Overtime, 54 investmentsbypublicandprivateorganizationsledtothedevelopmentoflargegenomicdatasets comprisingDNA-sequencesandphenotypes.Thelargesamplesizeofmoderngenomicdatasets hasincreasedourabilitytotrainhigh-dimensionalgenomicpredictionequationsaccurately. However,thelargersamplesizeoftencomeswithanincreasedgeneticheterogeneity,including manygenerationsofdataandoftencomplexadmixturepatterns.Moreover,therehavebeensome signsthatingenomicprediction,`biggermaynotalwaysbebetter'.Forexample,Wolcetal.(2016) reportedthattheaccuracyofgenomicpredictioninabroilerbreedingprogramwashigherwhen usingdatafromthelastthreegenerationsrelativetopredictionequationstrainedusingdatafromthe last˝vegenerations.Likewise,Riedelsheimeretal.(2013)andJacobsonetal.(2014)reportedthat thepredictionaccuracywashigherwhenmodelsweretrainedusingdatafrombiparentalfamilies thatsharedatleastoneparentrelativetotrainingusingdatafromalltheavailablebiparental families. EarlyworkbyHabieretal.(2010)showedthatfamilyrelationshipshaveanimportantimpact onpredictionaccuracy,andmanystudieshavedemonstratedthatdistantlyrelatedindividuals makeasmall(sometimesnegligible)contributiontothepredictionaccuracy.However,asnoted above,someevidencesuggeststhatusingtrainingsetsformedbyindividualsdistantlyrelatedtothe genotypesofthepredictionsetmayactuallyhaveanegativeimpactonthepredictionaccuracy(e.g., Lorenz&Smith,2015).Thismayhappenif,forexample,heterogeneityinallelefrequencyand inlinkagedisequilibrium(LD)patternsbetweenthetrainingandpredictionsetleadtoSNP-e˙ect heterogeneity. Issuesrelatedtodata-ande˙ect-heterogeneityhavespawnedmultipleresearche˙orts.One lineofresearchmodelse˙ect-heterogeneityexplicitlyusingSNP-by-groupinteractionmodels ormultivariatemodels(`multi-breedgenomicprediction')inwhiche˙ectsareassumedtobe correlatedamonggroups(e.g.,Olsonetal.,2012;Lehermeieretal.,2015;Rioetal.,2020).This approachhasshownpromise,butitisonlyadequatewhengenotypescanbeclusteredintoclearly disjointgroups. Anotherlineofresearchseekstoincreasepredictionaccuracybyoptimaldesignoftraining 55 datasets.Themethodsproposedandusedtoidentifyanoptimaltrainingsetspanfromsimple threshold-basedmethods(e.g.,Clarketal.,2012;Lorenz&Smith,2015)tomoresophisticated algorithmsthatseektominimizepredictionerrorvarianceandfunctionsthereof(Rincentetal., 2012;Akdemir&Isidro-Sanchez,2019;Rothetal.,2020).Amainassumptionofthesetrainingset optimizationmethodsisthatasingletrainingsetisoptimalforallindividualsinthepredictionset. Butthismaynotbethecaseifsomegenotypesinthetrainingsetcanimprovepredictionaccuracy forsomeofthecandidatesofselectionandreduceitforothers. Toaddressthelimitationsofexistingmethods,inChapter3ofthisdissertation,wedevelopeda predictionmethod(sparseselectionindex,SSI)thatidenti˝es,foreachindividualintheprediction set,acustomizedtrainingset.Ourapproachintegratesintotheselectionindexmethodology(Smith, 1936;Hazel,1943),asparsity-inducingpenaltythatleadstosparseselectionindices. InChapter3,weusedtheSSImethodologytopredictgrainyieldintwowheatdatasets. Theapplicationpresentedinthatchapterusedadditivegenomicrelationships,andtheresults showedthattheSSIoutperformedthegenomic-BLUP(G-BLUP;VanRaden,2008)by5-10%in thecorrelationscale. ReproducingKernelHilbertSpaces(RKHS)regressionhasshowngoodpredictiveperformance inmanygenomicapplications(e.g.,delosCamposetal.,2010;González-Camachoetal.,2012). TheG-BLUPisaspecialcaseofRKHSregressioninwhichalinearkernel(additivegenomic relationships)isusedtodescribethegeneticsimilaritybetweengenotypes.However,several studies(e.g.,Crossaetal.,2010;Morota&Gianola,2014)havesuggestedthatusingnon-linear kernels(e.g.,Gaussiankernels)mayleadtoahighergenomicpredictionaccuracy.InaGaussian kernel,thecovariancebetweengeneticvaluesishigherforcloselyrelatedindividualsanddrops astwogenotypesbecomeincreasinglydistant.Therateatwhichthepriorcovariancebetween geneticvaluesdropsiscontrolledbyabandwidthparameter.Largebandwidthparametervalues (thatleadtohighlylocalcovariances)canbeusedtoderivepredictionswhicharelargelydependent oncloselyrelatedindividuals.Thus,thereisaclearlinkbetweenRKHSwithGaussiankernelsand theSSImethodologypresentedinChapter3.However,theGaussiankerneldoesnotyieldstrictly 56 sparsepredictionequations. Therefore,tofurtheradvanceourresearchintheuseofSSIs,inthischapter,westudywhether theSSIcanalsoimprovethepredictionperformanceofRKHSregressionswithnon-linearkernels. InthischapterweevaluatetheperformanceoftheSSIusingadditive(sparseG-BLUP)andnon- additive(sparseK-BLUP)kernelsusingathree-generationDH(doubledhaploid)maizedataset fromtheInternationalMaizeandWheatImprovementCenter(CIMMYT).Forseveralscenarios oftrainingsetcomposition,weshowthatthestandardRKHSregressionwithaGaussiankernel outperformedtheadditiveG-BLUP.InagreementwithwhatwereportinChapter3,wefoundthat theuseofanSSIwithadditivegenomicrelationships(sparseG-BLUP)leadstogainsinprediction accuracybetween5%-20%,relativetothestandard(non-sparse)G-BLUP.FortheK-BLUP,the gainsobtainedbyaddingsparsityweresmallerandnotalwayssigni˝cant. 4.3MaterialsandMethods 4.3.1Genotypesandphenotypicdata Thegenotypesusedinthestudyconsistof3068DHlinesderivedfrom54biparentalfamilies. TheDHlinesweredevelopedin2017,2018,and2019atCIMMYT'sMaizeDHfacilityatthe Agricultural&LivestockResearchOrganization(KALRO)inKiboko,Kenya.Thebiparental familieswereobtainedbycrossingeliteinbredlineswithdrought-tolerantlines.Seedsfromeach ofthefamilieswerecollectedandsubmittedforDHinduction.The3068DHlineswereselected fromalargerpopulationforstageImulti-locationyieldtrialsin2017-2019,basedontheresults ofevaluatinggermination,goodstandestablishment,planttype,lowearplacement,andwell-˝lled ears. Eachyear,theselectedDHlineswerecrossedwithasingle-crosstesterfromthecomplementary heteroticgroupandevaluatedunderwell-watered(denotedas OPT )anddrought(denotedas DRT ) environmentalconditions.Thenumberofhybrids(trials)plantedin2017,2018,and2019were 923(14),1423(34),and722(17),respectively;trialswereconnectedbyacommoncheckandthree 57 tosixcommercialchecks,plantedinanalpha-latticedesignwithtworeplications,andevaluatedin twowell-wateredlocationsandonemanageddroughtstresslocationsduringthe2017,2018,and 2019growingseasons.TheOPTexperimentswereconductedduringtherainyseason,applying supplementalirrigationasneeded.TheDRTexperimentswereconductedduringthedry(rain- free)seasonandirrigationwassuspended2weeksbefore˛oweringanduntilharvest.Entrieswere plantedintwo-rowplots,5mlong,with0.75mspacingbetweenrowsand0.25mbetweenhills. Twoseedsperhillwereinitiallyplantedandthen,threeweeksafteremergence,oneplantperhill wasmaintainedtoobtaina˝nalplantdensityof53333plants/ha.Fertilizerswereappliedatthe rateof60kgNand60kgP 2 O 5 /ha,asrecommendedforthearea.Nitrogenwasappliedtwice:at plantingand6weeksafteremergence.Fieldswerekeptfreeofweedsbyhandweeding.Grain yield(GY,tonsha 1 ),anthesisdate(AD,days),plantheight(PH,cm)traitswererecorded.Plots weremanuallyharvestedandGYwascorrectedtoa12.5%moisture.ADwasmeasuredfrom plantingtowhen50%oftheplantsshedpollen,andPHwasmeasuredfromthesoilsurfacetothe ˛agleafcollaron˝verepresentativeplantswithineachplot. Leafsamplesweretakenfromeachofthe3068DHlinesandsenttoIntertek,Sweden,forDNA extraction.TheDNAsampleplateswereforwardedtotheInstituteforGenomicDiversity,Cornell University,Ithaca,NY,USA,forgenotypingwithrepetitivesequences(rAmpSeq)asdescribedby Buckleretal.(2016).Atotalof5465markerscodedas0(absence)and1(presence)were˝ltered byminorallelefrequency(MAF<0.05)fromwhichonly5173werekeptforanalyses. Furtherinformationaboutthe2017and2018datacanbefoundinBeyeneetal.(2019)and Atandaetal.(2020),whohavepreviouslydescribedandanalyzedthesedatasets. 4.3.2Phenotypespre-processing AdjustedmeansofGY,PH,ADwereobtainedusingmixede˙ectsmodel˝ttedseparatelyforeach trait-year-environmental-conditioncombination.TheBestLinearUnbiasedEstimates(BLUE)of genotypesacrosslocationsfortheOPTexperimentswereestimatedusingtheMETA-Rsoftware 58 (Alvaradoetal.,2020)followingthelinearmixed: Y ijkl = + g i + L j + R k ¹ j º + B l ¹ kj º + ¹ g L º ij + e ijkl where Y ijkl isthephenotypicrecordofgenotype i atlocation j inreplicate k withintheblock l , istheoverallmean, L j isthe˝xede˙ectofthelocation j , R k ¹ j º isthe˝xede˙ectofthereplicate k withinlocation j , B l ¹ kj º istherandome˙ectoftheincompleteblock l withinreplicate k and location j assumedtobeindependentlyandidenticallydistributed( iid )normalwithmeanzeroand variance ˙ 2 b , g i isthe˝xede˙ectofgenotype i , ¹ g L º ij isthe˝xede˙ectofthegenotype location interaction,and e ijkl istherandomerrorassumedtobe iid normalwithmeanzeroandvariance ˙ 2 e .After˝ttingthemodeljustdescribed,adjustedphenotypes( y i )wereobtainedbysubtracting fromphenotypicrecords(GY,PH,andAD),theestimatede˙ectsoflocation,replicate,incomplete block,genotype locationinteraction,anderror.Likewise,withineachyear,theBLUEforeach traitforthesingle-locationDRTexperimentwasobtainedthroughthelinearmodel Y ikl = + g i + R k + B l ¹ k º + e ikl where R k isthe˝xede˙ectofthereplicate k , B l ¹ k º istherandome˙ectoftheincompleteblock l withinreplicate k assumedtobe iid normalwithmeanzeroandvariance ˙ 2 b ,andtheremaining factorsareasbefore.Theadjustedphenotypeswereobtainedbysubtractingfromthephenotypic records,theestimatede˙ectsofreplicate,incompleteblock,anderror. Afterphenotypespre-processing,atotalof n = 3039 linescontainingmarkerinformationand thatwereobservedinallenvironmentsforalltraitswerekeptforGSmodels.The˝nalnumberof linesineachyearisasfollows: n 1 = 901 linesin2017, n 2 = 1417 in2018,and n 3 = 721 in2019. 4.3.3Genomicselectionmodels Weconsideredfourdi˙erentmodels:genomic-BLUP(G-BLUP)usingadditivegenomicrelation- ships(VanRaden,2008),ReproducingKernelHilbertSpaces(RKHS)regression(Gianolaetal., 2006)whichisequivalenttoaG-BLUPwithanon-linearkernel,andsparseselectionindices(SSI) 59 obtainedbyimposinganL1-penaltyontheG-BLUP(sparseG-BLUP)andontheRKHS(sparse K-BLUP).Inwhatfollowswedescribeeachofthesemodels;forsimplicity,sinceallphenotypes werecentered,wepresentmodelswithoutinterceptnor˝xede˙ects. G-BLUP: Inthismodel,thedata-equationtakestheform y = u + " (4.1) where y = ¹ y 1 ;:::; y n º 0 , u = ¹ u 1 ;:::; u n º 0 ,and " = ¹ " 1 ;:::;" n º 0 arethevectorsofadjustedphenotypes, breedingvalues(BV),andenvironmentalerrorterms,respectively.Breedingvaluesanderrorsare assumedtobenormallydistributed u ˘ N ¹ 0 ;˙ 2 u G º and " ˘ N ¹ 0 ;˙ 2 " I º ,where ˙ 2 u and ˙ 2 " arethe geneticanderrorvariances, G istheadditivegeneticrelationshipmatrix(GRM),and I isanidentity matrix. Thegenomicrelationshipmatrix G = f G ij g wasderivedfrommarkers, X = f x im g ,using G = ZZ 0 š p ,where p = 5173 isthenumberofmarkersand Z = f¹ x im x m ºš sd x m g isthematrix ofcenteredandscaledmarkersobtainedbysubtractingfromeachmarkerentrythemeanofthe correspondingcolumnfollowedbyscalingbythestandarddeviationofthecolumn.Theresulting matrixhasanaverageofthediagonalelementsequaltoone. ThepredictedBVs( ^ u PS )fortheindividualsinthepredictionset(PS)arethenlinearcombi- nationsofthephenotypes( y TS )ofthesubjectsinthetrainingset(TS),thisis(e.g.,Searleetal., 1992) ^ u PS = B G y TS (4.2) where B G = G PS ; TS ¹ G TS + 0 I º 1 isaHatmatrix(i.e.,coe˚cientsofregressionofBVson phenotypes), 0 = ˙ 2 " š ˙ 2 u istheratiobetweenresidualandgeneticvariances, G PS ; TS isamatrix containingtheadditivegeneticrelationshipsbetweenthedatapointsinthepredictionsetwiththose inthetrainingset,and G TS representstheadditiveGRMofthetrainingdatapoints. RKHSregression: InaRKHSregression,thevectorofgenomicpredictions( u )inEquation (4.1)areobtainedaslinearcombinationsonkernelevaluationsas u = K ,where K isan n n matrixofkernelevaluationsonmarkersgenotypes,and isavectorofregressioncoe˚cients.As 60 shownindelosCamposetal.(2009a),ifthevector isassumedtobedistributed ˘ N ¹ 0 ;˙ 2 a K 1 º , itfollowsthat u ˘ N ¹ 0 ;˙ 2 a K º : (4.3) Therefore,thevectorofgenomicpredictionsfortheindividualsinthepredictionsetare ^ u PS = B K y TS (4.4) wheretheHatmatrixisnow B K = K PS ; TS ¹ K TS + 0 I º 1 , 0 = ˙ 2 " š ˙ 2 a , K PS ; TS isthematrix containingtheevaluationoftheGaussiankernelforpairsoftraining-predictiongenotypes,and K TS isthekernelGRMofthetrainingdata. FindingthesolutionofanRKHSmodelisthereforeequivalentto˝ndingthesolutionforthe G-BLUPmodelbutusing K insteadoftheadditivematrix G .Inthissense,werefertotheRKHS regressionastoK-BLUPmodel. WeconsideredK-BLUPmodelsusingGaussiankernelmatrices K = f K ij g , i ; j = 1 ;:::; n ,given by K ij ¹ º = exp ~ d 2 ij º ,where isabandwidthparameterand ~ d 2 ij isthescaledsquaredEuclidean distancebetweenindividuals i and j givenbytheirmarkersgenotypes,obtainedbydividing thedistance d 2 ij = Í p m = 1 ¹ x im x jm º 2 bytheaveragedistance d = 1 n 2 Í i Í j d 2 ij .Threeextreme Gaussiankernelswereusedasde˝nedbyGonzález-Camachoetal.(2012),namely K 1 = f K ij ¹ 1 ºg , K 2 = f K ij ¹ 2 ºg ,and K 3 = f K ij ¹ 3 ºg ,where 1 = 0 : 2 , 2 = 1 ,and 3 = 5 .SeeSupplementary FigureC.1forpairwisecomparisonsofkernel( K ij )versusadditive( G ij )genomicrelationships. Inaddition,kernelaveraging(KA)wasalsoimplementedasdescribedindelosCamposetal. (2010)usingthethreekernels K k , k = 1 ; 2 ; 3 .Brie˛y,thethreekernelsareconsideredtojointly contributetothepredictionas u = Í 3 k = 1 K k k ,whereeachsummandhasitsowndistribution(as inEquation(4.3))as K k k = u k ˘ N ¹ 0 ;˙ 2 a k K k º .ThisKA-BLUPmodelis˝ttedinaBayesian fashion;however,itcanberewrittenasasinglerandome˙ect(asinEquation(4.1))bymaking u ˘ N ¹ 0 ;˙ 2 a K A º ,where ˙ 2 a = ˙ 2 a 1 + ˙ 2 a 2 + ˙ 2 a 3 isthetotalkernelvarianceand K A isanaverage kernelGRMgivenby K A = ˙ 2 a 1 ˙ 2 a K 1 + ˙ 2 a 2 ˙ 2 a K 2 + ˙ 2 a 3 ˙ 2 a K 3 : (4.5) 61 SparseBLUPmodels: SparsitywasincorporatedintotheG-BLUPandK-BLUPmodelsas describedinChapter3.Brie˛y,inaSSI,theweightsoftheselectionindexforthe i th individualin thepredictionsetwasobtainedfromthefollowingL1-penalizedoptimizationproblem ~ b ¹ º i = argmin b i 2 6 6 6 6 4 1 2 b 0 i ¹ G TS + 0 I º b i G 0 i b i + n Õ j = 1 j b ij j 3 7 7 7 7 5 (4.6) where G 0 i = G PS ¹ i º ; TS isthevectorcontainingtheadditiverelationshipsbetweenthe i th subject inthepredictionsetandeachofthesubjectsinthetrainingset, isaparametercontrollingthe degreeofsparsityof ~ b ¹ º i ,and Í n j = 1 j b ij j isapenaltyontheL1-normof b i .A(sparse)Hatmatrix fortheSSI, ~ B ¹ º G ,containsineachrowthesolutionstoEquation(4.6),obtainedforeachtesting genotype.Avalueof = 0 yieldsthesame(non-sparse)HatmatrixofthestandardG-BLUPin Equation(4.2).ForthesparseK-BLUPmodelsweusedEquation(4.6)withtheGaussiankernel (either K 1 , K 2 , K 3 ,or K A )insteadofadditiverelationshipmatrices( G ).Althoughoptimization probleminEquation(4.6)doesnothaveaclosed-form,solutionsforitcanbederivedusinga coordinatedescentalgorithm(seeChapter3forfurtherdetails).Finally,anoptimalvalueof can beobtainedusingcross-validationwithinthetrainingset(moredetailsinChapter3). 4.3.4Variancecomponents TheimplementationofG-BLUP,K-BLUP,andthecorrespondingsparseversionsofthesemodels requireestimatesofvariancecomponents.Weobtainedtheseestimatesby˝ttingBayesiangenomic modelswithineachtrait-environmentcombination.Theseanalyseswereperformedusingthe BGLRR-package,withthedefaultsettingforhyper-parameters(Perez&delosCampos,2014). After˝ttingthemodels,posteriormeansofthevariancecomponentswereobtained.Forthe standardKA-BLUP,themodelwas˝ttedwiththethreekernelstogethertoestimatethekernel- speci˝cvariancesandthenusedtoderive ˙ 2 a andthekernel K A (Equation(4.5)).Aheritability estimatefortheG-BLUPmodelwasderivedas h 2 = ˙ 2 u š¹ ˙ 2 u + ˙ 2 " º .FortheK-BLUPmodels,the heritabilitywasobtainedbyreplacingtheestimateof ˙ 2 u bythekernelgeneticvarianceestimate ˙ 2 a .Allthesemodelswere˝ttedusingdatafromthetrainingsetonly. 62 4.3.5Assessmentofpredictionaccuracy VariancecomponentsestimatesandthecorrespondingGRM( G , K 1 , K 2 , K 3 ,or K A )wereused toderivethenon-sparse( B G or B K forthestandardBLUP)andthesparse( ~ B ¹ º G = f ~ b ¹ º 0 i G g or ~ B ¹ º K = f ~ b ¹ º 0 i K g )Hatmatrices.(NotethatintheSSI,therowsofthesparseHatmatrix aresimplythesolutionstoEquation(4.6),obtainedforeachtestinggenotype.)Thepredictions ( ^ u PS )werederived(asinEquation(4.2)and(4.4))astheproductofthe(non-sparseorsparse)Hat matrixtimesthevectorofphenotypesinthetrainingset.Predictionaccuracywasmeasuredasthe correlationbetweenobservedandpredictedvaluesinthepredictionset,i.e., ˆ = cor ¹ y PS ; ^ u PS º . Predictionaccuracywasassessedfordi˙erentpredictionscenariosusingcycle2019asthe predictionsetwithdi˙erenttrainingsetcompositions,asfollows:( i )thedatafromthe2019 cyclewasrandomlypartitionedinto85%-15%(i.e.,612and109individuals),( ii )the85%- set( n PS = 612 )fromtheyear2019waspredictedusingdataoftheearliergenerations2017 ( n TS = 901 ),2018( n TS = 1417 ),and2017+2018combined( n TS = 2318 )astrainingset,( iii )the predictionofthe612individualswasalsoperformedusingthesametrainingsetsbutaugmented byprogressivelyincludingtheremaining15%-setfrom2019,˝rst37(5%),then73(10%),and lastly109(15%)individuals.SeeTable4.1forasummaryofallthetrainingsetcompositions.All predictionswereperformed100timesusingdi˙erentrandompartitionsofthe2019data. 4.3.6Software AlltheaforementionedanalyseswereperformedintheRenvironment-language(RCoreTeam, 2019).AllstandardBayesianG-BLUPandK-BLUPmodelswere˝ttedusingtheBGLRR-package (Perez&delosCampos,2014)toestimatevariancecomponents.ThesparseHatmatrices( ~ B ¹ º G or ~ B ¹ º K )wereobtainedwiththeSFSIR-package(Lopez-Cruzetal.,2020).Foreachtrait- environment-partition,anoptimalvalueof wasobtainedusing10-foldcross-validationwithin thetrainingset. 63 Table4.1:Trainingset(TS)compositionusedineachpredictionscenario.(Thepredictionsetwas thesameforalltrainingscenariosandconsistedof612randomlychosenindividualsfrom2019). Datafrom%of2019datausedTotaltraining previousyears( n )fortraining( n )size( n TS ) 0(0)901 20175(37)938 (901)10(73)978 15(109)1010 0(0)1417 20185(37)1454 (1417)10(73)1490 15(109)1526 0(0)2318 2017+20185(37)2355 (2318)10(73)2391 15(109)2427 Figure4.1:(A)First3principalcomponentsoftheadditivegenomicrelationshipsmatrix, G . Pointsrepresentindividualsthatarecolorseparatedaccordingtocycle(2017,2018,or2019).(B) Heatmapofthegenomicrelationshipsmatrix. 64 4.4Results Thegermplasmusedinthisstudyisderivedfromdi˙erentbiparentalfamiliesacrossyears. Thisrichnessofthedataisre˛ectedinahighpopulationheterogeneityinwhichindividuals clusterintogroupswithinandacrossgenerations(Figure4.1).However,thecrossesperformed preventedtheformationofaclearstructure(e.g.,2clusters);insteadthepopulationshowsamore crypticsubstructurewithvaryingdegreesofadmixturebetweenfamilies.Theintermixingbetween generationsthatisobservedinFigure4.1canbeattributedtoallelesharingasonlyallelesfromthe selectedeliteparentsinonegenerationarepassedtothenextgeneration. 4.4.1PredictionaccuracycomparisonofG-BLUPandK-BLUPmodels Figure4.2showstheaccuracyofprediction(averagedacrossall100partitions)forGY-OPTusing allstandardBLUPmodelsforalldi˙erenttrainingsetcompositionsrepresentingacombinationof previouscycles(2017,2018,or2017+2018)plustheinclusionof0,5,10,and15%(i.e.,0,37,73, and109subjects)ofthetotalindividualsfromthesame2019cycle(seeTable4.1).Asexpected, theinclusionofindividualsfromthesamecycleincreasesthepredictionaccuracyacrossallmodels andtrainingsetcomposition.Forinstance,usingthe2017cycleastrainingset,theaccuracyofthe G-BLUPwasincreasedby100%whenadding73individuals;however,using2018astrainingset showeda60%increase,andwhenusing2017+2018,againof27%wasobserved. Likewise,predictionaccuracywashigherwhencombiningdatafrom2017+2018astraining setcomparedwithmodelsusingdatafrom2018or2017alonefortraining(seetop-leftpanelin Figure4.2).However,theaccuracywasnotalwaysincreasedwhenthetrainingsetisaugmented toinclude2017+2018together(seebottompanelsandtop-rightpanelinFigure4.2),and,insome cases,wasevenlowered(seeSupplementaryFigureC.3fortraitPH).Contrastingly,theprediction usingindividualsfrom2017+2018wassometimesequallyperformed(seethebottom-leftpanel inFigure4.2)andinsomecasesoutperformed(seeSupplementaryFigureC.2forGY-DRTand SupplementaryFigureC.3forPH)whenusingonlydatafromthe2017cycle. 65 Ingeneral,kernel-basedmodelsachievedhigherpredictionaccuracythanthestandardG-BLUP. Althoughthekernels K 1 , K 2 ,and K 3 arerankeddi˙erentlyacrosstrainingsetcompositions,models with K A seemstobemorestableacrossallscenariosperformingsimilartothebestofthethree kernels K 1 , K 2 ,or K 3 .Thisresultisinagreementwiththe˝ndingsindelosCamposetal.(2010). Figure4.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthedatafrom the2017,2018,or2017+2018cyclesalone(top-leftpanel),orincombinationwithaproportion (5%,10%,15%)ofthedatafromthe2019cycle.Thepredictionsetconsistedof612genotypes fromthe2019cyclethatwerenotusedformodeltraining.Modelswiththesameletterwithin panelindicatenosigni˝cantdi˙erencefromeachother( = 0 : 05 ,ANOVAfollowedbyTukey test).GY-OPTtrait-environmentcombination. 4.4.2E˙ectofsparsityonpredictionaccuracy Thesamepartitionsoftraining-predictionsetsusedtoobtaintheresultsforthestandardmodels wereusedtoevaluatethepredictionaccuracyofSSIs(sparsemodels).Across-validatedvalue CV wasfoundwithinthetrainingsettocalculateanoptimalsparseBLUPmodel.Table4.2contains theresultsofthepredictionsoftheGY-OPTtrait-environmentcombinationforthescenariowhere 15%ofdatafrom2019isincludedinthetrainingset(2017,2018,or2017+2018).Resultsfor 66 thecaseswhenadding0%,5%,and10%ofthe2019dataarepresentedinSupplementaryTable C.1.Withthistrainingsetcomposition,theaccuracyofthestandardG-BLUPmodelswasbetween 0.46-0.48.K-BLUP(standardorsparse)andsparseG-BLUPmodelsachievedhigherprediction accuracythanthestandardG-BLUP,withgainsinaccuracy(relativetoG-BLUP)rangingfrom minimal(1%)tosubstantial(12%). Table4.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition(including 15%ofsubjectsfromthe2019cycle),GY-OPTtrait-environmentcombination. Accuracy(SD)%Gain TS( n TS )GRM CV a n sup (RS) b h 2 StandardSparseI c II d G0.0181171(17)0.520.47(0.031)0.51(0.030)-9 K10.0037188(19)0.870.48(0.030)0.50(0.034)34 2017K20.0060219(22)0.750.51(0.028)0.52(0.027)92 (1010)K30.00001007(100)0.920.50(0.029)0.50(0.029)70 KA0.0041240(24)0.840.51(0.028)0.52(0.029)92 G0.0112337(22)0.610.46(0.026)0.48(0.028)-5 K10.0016480(31)0.910.48(0.026)0.49(0.027)33 2018K20.0023684(45)0.800.50(0.026)0.50(0.027)80 (1526)K30.00001526(100)0.900.46(0.029)0.46(0.029)00 KA0.0015683(45)0.870.49(0.027)0.49(0.028)70 G0.0188268(11)0.530.48(0.025)0.51(0.027)-7 K10.0033350(14)0.880.50(0.025)0.51(0.026)42 2017+18K20.0031750(31)0.770.52(0.025)0.52(0.026)90 (2427)K30.00002427(100)0.870.50(0.027)0.50(0.027)40 KA0.0020922(38)0.830.52(0.026)0.52(0.027)80 GRM:Geneticrelationshipmatrix.SD:standarddeviation. a PenalizationparameterinEquation(4.6)found bycross-validatingtheTS. b n sup =averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient inthesparseHatmatrix(supportset).RS:relativesparsity( 100 n TS š n sup ).Inthestandardmodels CV is equaltozeroand n sup isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy ofthe c standardK-BLUPrelativetothestandardG-BLUP,and d sparse*-BLUPrelativetothestandard *-BLUP(*=G-orK-). Thegainsinpredictionaccuracyaremoreevidentwhentheaccuracyofthenon-sparseG-BLUP modelswaslower(i.e.,whenfewerindividualsfromthe2019cycleareincludedinthetraining set).Forinstance,whentheaccuracyoftheG-BLUPisaslowas0.2(thecasewherenodata from2019isincludedinthetrainingset),SSIsyieldedgainsinaccuracy(relativetothestandard G-BLUP)ofupto28%(SupplementaryTableC.1).Itwasonlyintheselow-accuracysituations 67 that,insomecases,theuseofastandardK-BLUPmodelwith K 3 resultedin ˘ 6 %lossofaccuracy (relativetothestandardG-BLUP)andthatusingsparsemodels,theaccuracylostwas3-10%(see SupplementaryTableC.1).Theadvantageinaccuracyofasparsemodeloveritsstandardversion wasmoremarkedfortheG-BLUP(i.e.,whenusingadditiverelationshipsmatrices,23%);however, thesamegainsweresmallerfortheRKHSregressionsusingGaussiankernels.( ˘ 2 8 %gain inaccuracy,SupplementaryTableC.1).Nosigni˝cantdi˙erencebetweensparseandnon-sparse modelwasobservedwhenusingthelarge-bandwidthkernel K 3 .Similarresultscanbefoundfor traitGY-DRT(seeSupplementaryTableC.2). Figure4.3:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(horizontalaxis) versusthepredictionaccuracyofallothermodels(verticalaxisofeachpanel).(B)Prediction accuracyofthestandard*-BLUPmodel(horizontalaxis)versusthepredictionaccuracyofits sparseversion(verticalaxis),bytypeofkernelusedinpanels.Eachpointrepresentatraining- testingpartitionwithineachtrainingsetcomposition.Coloredpointsabove(below)the45degree linerepresentcasesforwhichonemodeloutperformedtheothermodel.P:p-valueforthetest (fromANOVA)fordi˙erencesinaccuracybetweenthetwomodels.TraitGY,environmentOPT. ForthePH-OPTtrait-environmentcombination,whenmodelsweretrainedusing15%ofthe 2019data,thegainsinaccuracyobtainedwiththesparseG-BLUP,relativetothenon-sparse G-BLUP,wereashighas11%,andupto18%withasparseK-BLUPmodel(Table4.3).These gainsinaccuracywereverynotable(>100%)whenaddingtothetrainingset10%orfewerof 68 theindividualsfromthe2019cycle(seeSupplementaryTableC.3).ResultsforthePH-DRT trait-environmentarereportedinSupplementaryTableC.4wheresimilarpatternsareobserved. Table4.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition(including 15%ofsubjectsfromthe2019cycle),PH-OPTtrait-environmentcombination. Accuracy(SD)%Gain TS( n TS )GRM CV a n sup (RS) b h 2 StandardSparseI c II d G0.0112506(50)0.560.49(0.037)0.51(0.042)0.03.4 K10.0008726(72)0.890.51(0.036)0.51(0.039)3.11.2 2017K20.0007773(77)0.770.54(0.034)0.54(0.034)9.10.1 (1010)K30.00001010(100)0.920.51(0.036)0.50(0.036)2.4-0.1 KA0.0003862(85)0.870.53(0.035)0.53(0.035)7.3-0.7 G0.0161208(14)0.710.50(0.038)0.56(0.033)0.010.9 K10.0020350(23)0.950.53(0.036)0.56(0.035)5.64.6 2018K20.0021627(41)0.880.56(0.033)0.57(0.031)11.31.4 (1526)K30.00001526(100)0.920.53(0.032)0.53(0.032)5.40.0 KA0.0015647(42)0.920.56(0.033)0.56(0.033)10.21.4 G0.0132356(15)0.640.47(0.038)0.52(0.039)0.010.8 K10.0015696(29)0.930.52(0.036)0.54(0.037)9.34.2 2017+18K20.00071728(71)0.850.56(0.034)0.56(0.034)17.7-0.2 (2427)K30.00002427(100)0.920.51(0.036)0.51(0.036)8.30.1 KA0.00041896(78)0.890.55(0.034)0.55(0.035)16.4-0.8 GRM:Geneticrelationshipmatrix.SD:standarddeviation. a PenalizationparameterinEquation(4.6)found bycross-validatingtheTS. b n sup :averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient inthesparseHatmatrix(supportset).RS:relativesparsity( 100 n TS š n sup ).Inthestandardmodels CV is equaltozeroand n sup isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy ofthe c standardK-BLUPrelativetothestandardG-BLUP,and d sparse*-BLUPrelativetothestandard *-BLUP(*=G-orK-). Acrossallscenarios,thestandardG-BLUPshowedthelowestaccuracyamongallmodels(SSI andstandardK-BLUP,seeFigure4.3AforGY-OPT).ThisinferiorityofthestandardG-BLUPwas alsoobservedforGY-DRTandPH(seeSupplementaryFigureC.4AandSupplementaryFigure C.5A).TheadditionofsparsitytotheK-BLUPmodelsresultedsometimesinanextraadvantage inaccuracywhenusingakernelwithasmallbandwidth( K 1 with = 0 : 2 ,and K 2 with = 1 )or averagedacrossextremekernels( K A )forGY(Figure4.3BandSupplementaryFigureC.4B)and PH(SupplementaryFigureC.5B). 69 4.4.3Automatictraining-sampleselection Tables4.2and4.3(andSupplementaryTablesC.1-C.4)showtheoptimalvalueofthepenalization parameter andthedegreeofsparsityoftheresultingindex,measuredbytheaveragenumberof subjectsfromthetrainingsetinthesupportset( n sup ,subjectswithanon-zerocoe˚cientinthe estimatedHatmatrix)ofeachpredictedgenotype.Thedegreeofsparsityvariedacrossmodels. FortheGY-OPTtrait-environmentcombination,acrossalltrainingsetcompositions,thestrongest sparsitywasachievedusingthegenomicmatrixGwitharelativesparsity( n sup š n TS )of11-33% (Table4.2andSupplementaryTableC.1)whiletherelativesparsitywithkernelsincreasesasthe bandwidthparameter increases(relativesparsityof14-47%for K 1 and22-59%for K 2 ).The relativesparsityachievedwhenusingthe K A kernel(25-62%)wassimilartothatofthe K 2 kernel (seeTable4.2andSupplementaryTableC.1).Thefactthatnodi˙erenceinaccuracywasobserved betweenstandardandsparse K 3 -BLUPmodelisduethattheoptimal CV waszero;thus,thesparse modelwasequivalenttothestandardmodel. Figure4.4displaysaheatmapofthesparseHatmatrix( ~ B ¹ º G )ofthesparseG-BLUPmodel. Individualsinthetrainingset(2017+2018plus15%from2019)appearincolumnsandthosein thepredictionsetareshowninrows.Individualsfromthetrainingsetthatdidnotcontributeto eachindex(i.e.,thosewithzeroweightintheindex)aredisplayedingrey.Thosewithanon-zero coe˚cient(supportset)areshowninayellow-blue(logarithm)scale.Theheatmapmakesevident howSSIsselectcustomtrainingsetsforeachgenotypeinthepredictionset.Individualgenotypes inthetrainingsetsupportsthepredictionofsomebutnotallthegenotypesinthepredictionset.The solutionfortheHatmatrixinFigure4.4isverysparse,withvaryingnumberofsupportpointsby testinggenotype.Predictionsofeachofthe612testinggenotypeswasperformedusingphenotypes from,onaverage,268(outof2427traininggenotypes,i.e.,11%,seeTable4.2)traininggenotypes. Forthesamepredictionscenario,aheatmapforthesparse K A -BLUPmodel(showinga38%of sparsity)isprovidedinSupplementaryFigureC.6. Figure4.5shows,foreachofthesparsemodels,theproportionofthetrainingindividualsfrom eachcycle(2017,2018,or2019)thatcontributedtotheprediction(withincyclesupportset)ofthe 70 testingindividuals.Eachpanelrepresentsthedi˙erenttrainingsetscomposedof2017+2018data plustheadditionofeither0%,5%,10%,or15%ofthe2019data. Figure4.4:Heatmapofthecoe˚cientsintheHatmatrix( ~ B ¹ º G )ofthesparseG-BLUPmodel foronetraining-prediction(TS-PS)partitioninthepredictionof n PS = 612 individualsfrom2019 using n TS = 2427 individuals(2017+2018plus15%ofthe2019set).Predictedindividualsare presentedincolumnsandtrainingindividualsarepresentedinrowsseparatedbycycleandnumber ofindividualsinparentheses.Thevalueof wasobtainedbycross-validation.Eachcolumn representsvaluesofthevector ~ b ¹ º i G = f ~ b ij g , j = 1 ;:::; 2427 (Equation(4.6)).Individualsno contributingtothepredictionhaveacoe˚cient ~ b ij = 0 representedingreycolor.Individualswith anon-zerocoe˚cientareshowninayellow-bluelogarithmscale(intheoriginalscale,yellow indicateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environmentcombination. 71 Asexpected,trainingindividualsthatbelongtothesamegroupasthetestingindividualsare morelikelytobeincludedinthesupportset.Forexample,usingasparseG-BLUPmodeltrained with2017+2018plus5%fromthe2019data(seethetop-rightpanelinFigure4.5),onaverage, 41%ofthe2019genotypesofthe37includedinthetrainingsetcontributedtothepredictionof thetestingindividuals.Althoughmoreabundant,asmallerportionofthetotalindividualsfrom previouscycles(19%ofthe901subjectsfrom2017and18%ofthe1417from2018)arealso contributingtotheprediction.Withasmallerdegreeofsparsity,similarpatternswerealsoobserved forthesparseK-BLUPmodels(seeFigure4.5)exceptwith K 3 thatdidnotrendersparsityatall (notshowninthe˝gure).PlotsshowingthewithincyclesparsitypatternsforGY-DRTandPH (OPTandDRT)areshowninSupplementaryFiguresC.7andC.8. Figure4.5:Proportionofthetrainingindividualsfromeachcyclethatcontributedtotheprediction ofthe612testinggenotypesfrom2019,usingsparsemodelswithdi˙erentrelationshipmatrices (horizontalaxis): G , K 1 , K 2 ,or K A .Thetrainingsetwascomposedbyindividualsfrom2017 ( n = 901 )and2018( n = 1417 )alone(top-leftpanel)orincombinationwithaproportion(5%, 10%,15%)ofthedatafromthe2019cycle.GY-OPTtrait-environmentcombination. 72 Asmoreindividualsfromthesamecycleareaddedtothetrainingset,fewerindividuals frompreviousgenerationsbecomelessfrequentinthesupportset.Forinstance,performingthe predictionwithasparseG-BLUPusing2017+2018dataincluding15%ofthe2019data(bottom- rightpanelinFigure4.5),yieldedasmallersupportsetwithonly10%(90of901)ofthe2017data and10%(142of1417)ofthedatafrom2018. 4.5Discussion Multiplefactorsa˙ectthepredictiveperformanceofGSmodels,includingsamplesize,trait heritability,theextentoflinkagedisequilibrium(LD)betweenmarkersandquantitativetraitloci (QTL),andtherelationshipsbetweentrainingandtestinggenotypes(Daetwyleretal.,2008;He˙ner etal.,2009;Lorenzana&Bernardo,2009;Combs&Bernardo,2013). Generalguidelinessuggestthatpredictionaccuracyismaximizedwhenthetrainingsetincludes asu˚cientlargenumberofindividualswhicharedistantlyrelatedtoeachother(Rincentetal., 2012)andarecloselyrelatedtothesubjectsinthepredictionset(Habieretal.,2010;Clarketal., 2012).Ontheotherhand,thereisevidencesuggestingthatincreasingthetrainingsetsizeby includingindividualsthataregeneticallydistanttothoseinthepredictionsetdoesnotnecessarily increaseandmightevendecreasethepredictionaccuracy(e.g.,Lorenz&Smith,2015). Eachcycleofabreedingprogramproducesanewbatchofgenomicandphenotypicdata; therefore,aftermanyyearsofadoptingGS,thedataavailableformodeltrainingistypically multi-generationalandmayoftenincludecomplexpatternsofpedigreerelationshipswithinand betweengenerations.ThereisclearevidencethataGSmodelneedstobere-trainedeverycycle (Wolcetal.,2011;Pszczola&Calus,2016;Wientjesetal.,2013).Whenre-trainingmodels, breedingorganizationsfacemanychallenges.Shouldalltheavailabledatabeusedformodel training?Shouldinsteadonerestrictthetrainingdatatoonlyincludegenotype/phenotypesfrom recentgenerations?Shouldoneexcludedatafromgenotypesdistantlyrelatedtothecurrentsetof candidatesofselection? Someevidencesuggeststhatingenomicprediction,`biggerisnotnecessarilybetter'.For 73 instance,usinghistoricwheatdatageneratedover17years,Dawsonetal.(2013)observedthat theaccuracyofyear-to-yearpredictionsusingtrainingsetscomposedofallpreviousyearswas approximatelythesameaswhenconsideringonlythreeyearsback.Likewise,inabroilerbreeding population,Wolcetal.(2016)foundthatthemaximumaccuracywasaccomplishedwhenthe trainingsetwascomposedofthethreemostrecentgenerations. TheSSI(sparseG-BLUP)methodologypresentedinChapter3o˙ersaframeworktoidentify, foreachindividualinthepredictionset,acustomizedtrainingset(orsupportpoints)fromwhichthe predictionsarederived.Thismethodologyconsidersboththerelationshipsbetweenthecandidate ofselectionandeachtraininggenotypeaswellasrelationshipsamongtraininggenotypes(Equation (4.6)).Therefore,inthischapter,weproposethatthesparseselectionindicespresentedinChapter 3canbeusedtoaddresstheproblemoftrainingsetoptimizationwithmulti-generationdata.We usedamulti-generationdataoriginatedfrommorethan50biparentalfamiliestomeasuretheimpact ofsparsityusingSSIsformedusingadditiveandnon-additivekernels. Ourresultscon˝rmedthatanSSIbasedonadditiverelationshipsyieldsahigherprediction accuracythanthestandardadditiveG-BLUP.Whenanon-additivekernelwasused,wefoundthat sparsityimprovedpredictionaccuracyprovidedthatthekernelusedwasnotalreadya`local'kernel, thatis,akernelinwhichgeneticcovariancesarepositiveonlyforcloselyrelatedindividuals.For localkernels( K 3 ),addingsparsitydidnotimprovepredictionaccuracyinaclearandsystematic way.Therefore,ourresultscon˝rmthebene˝tof`localpredictions',whichcanbeobtainedeither byusinganRKHSwithlocalkernelsorwithanSSIappliedtoadditivegenomicrelationships. BoththeSSIandtheKernelsregressionrequireoptimizingaparameterthatcontrolshowlocal predictionsare.TheSSIrequiresoptimizingthepenalizationparameter( ),whichcanbedone bycross-validationwithinthetrainingdataset.Ontheotherhand,theRKHSregressionrequires tuningthebandwidthparameterwhichcontrolshowfastcovariancesdropwithgeneticdistance. Thiscanbedoneeitherbycomparingmultiplekernelsusingcross-validationorbyusingmultiple kernelswith'kernelaveraging'asdiscussedindelosCamposetal.(2010). Standardtraining-optimizationmethodsassumethatasingletrainingsetisoptimalforall 74 candidatesofselection.TheSSIdoesnotmakethisassumption.Ourresultsshowclearlythat eachSSIpicksaparticularsetofsupportpointsandthattheoptimaltrainingsetvariesfrom genotypetogenotype.TheinspectionoftheHatmatrixoftheSSImakesitclearthatinprediction, one-size-does-not-˝t-all candidatesofselection.Likewise,theinspectionoftheHatmatrixshows thatoptimizingtrainingsetsbyrestrictingthetrainingdatatorecentgenerationsmayalsonot beoptimal.Indeed,mostoftheSSIspickedinformationfromallthegenerationsavailable,with varyinglevelsofsparsity. In conclusion :SSIscanbeusedtooptimizepredictionaccuracywhenthetrainingdataexhibit complexrelationshippatterns.Inthiscontext,di˙erencesinallelefrequenciesandinLD-patters maymakeSNPe˙ectheterogenousacrossfamiliesandsub-families,thusmakingthestandard G-BLUPsub-optimal.BothlocalkernelsandSSIscanbeusedtooptimizepredictionaccuracyin suchdatasets. 75 CHAPTER5 CONCLUDINGREMARKSANDFUTUREDIRECTIONS Thecoreofthisdissertationhasamethodologypointofview,presentinganinnovativeprocedure thatcombinestwowell-establishedapproaches:selectionindexandpenalizedregression.Thefor- merwasdevelopedforbreedingvaluepredictionusingdi˙erentsourcesofcorrelatedinformation. Thelatteriscommonlyusedinstatisticsandmachinelearningforvariableselectiontopreventover- ˝ttinginsituationswheretherearemorevariablesthanobservations.Thisnovelapproach,which wenamedsparseselectionindex(SSI),o˙ersopportunitiessuchasintegratinghigh-throughput phenotypesingeneticevaluationsandsolutionsfortrainingsetoptimizationingenomicselection withhighlyheterogeneousdata. Wemadeane˙orttopresentourSSImethodologyindeepdetail,developsoftwareforits implementation,andempiricallyvalidateit(withcross-validation)withrealdatausingseveraldata setswithgenomicandphenotypicinformation.Asabrand-newmethod,noexhaustiveevaluation oftheSSIwaspossibletobepresentedinthisdissertationatthisstage;however,theresultsobtained withthesedatasetsareverypromising,performingatleastasgoodasthestandardmethods. TheSSIapplicationspresentedinthisdissertationareofthetypesingle-traitmodel;they mightbefeasiblyextendedtomulti-traitmodelsallowingtheborrowingofinformationwithinand betweenindividualsatthesametime.Multi-traitmodelshavethepotentialtoincreaseprediction accuracy;therefore,furtherresearchisrequiredtoinvestigatewhethertheuseofamulti-traitSSI canalsobeadvantageous. ThescopeoftheSSIcangobeyondthebreedingvaluespredictiononly.Thismethodleaves thedooropentoothergeneticresearchareas(e.g.,genome-wideassociationanalysisonhigh- dimensionalphenotypesandnetworkanalysisfromgeneexpression).Therefore,moreresearchis neededtoexplorethefullpotentialoftheSSIprocedure. 76 APPENDICES 77 APPENDIXA SUPPLEMENTARYFIGURESANDTABLESFROMCHAPTER2 78 FigureA.1:Box-plotofgrainyieldphenotypicrecordsbyenvironmentalcondition. n ˇ 3200 observationswithinenvironment.SD:standarddeviation. 79 FigureA.2:Lightre˛ectancepatternsasfunctionofthewavelength.Eachlinerepresentsthe mean(across n ˇ 3200 observations)re˛ectanceforeachwaveband,withintime-point(˛ight date).Withineachenvironment,meanswerescaledtoliewithin0and1bydividingthembythe maximumaverage. 80 FigureA.3:AccuracyofindirectselectionofL1-PSIanditscomponents.Squarerootheritability, geneticcorrelationandaccuracyofindirectselection,allaveragedover100training-testingpar- titionsversusthenumberofbandsenteringintheindex;bytime-point(DAS=daysaftersowing, Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)withinenvironment. 81 FigureA.4:AccuracyofindirectselectionofL2-PSIanditscomponents.Squarerootheritability, geneticcorrelationandaccuracyofindirectselection,allaveragedover100training-testingparti- tionsversusthepenalizationparameter( ,logarithmscale)usedtobuildtheindex;bytime-point (DAS=daysaftersowing,Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)within environment. 82 FigureA.5:AccuracyofindirectselectionofPC-SIanditscomponents.Squarerootheritability, geneticcorrelationandaccuracyofindirectselection,allaveragedover100training-testingparti- tionsversusthenumberofprincipalcomponentsusedtobuildtheindex;bytime-point(DAS=days aftersowing,Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)withinenvironment. 83 FigureA.6:Squarerootofheritabilityofthestandard(SI),oftheregularized(PC-SIandL1-PSI) selectionindices,andoftheRNDVI.Thelinesprovidetheaveragesquarerootheritabilityover 100training-testingpartitions.Verticallinesrepresenta95%CIfortheaverage.Thehorizontal axisgivethetime-pointatwhichimageswerecollectedandareexpressedinbothdaysaftersowing (DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity). 84 FigureA.7:Geneticcorrelationbetweengrainyieldandall:thestandard(SI),theregularized (PC-SIandL1-PSI)selectionindices,andtheRNDVI.Thelinesprovidetheaveragegenetic correlationover100training-testingpartitions.Verticallinesrepresenta95%CIfortheaverage. Thehorizontalaxisgivethetime-pointatwhichimageswerecollectedandareexpressedinboth daysaftersowing(DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity). 85 FigureA.8:Phenotypic,genetic,andenvironmentalcovariances(absolutevalue)betweenwave- bandsandgrainyield.'D':discrepancybetweenphenotypicandgeneticcovariancesasmeasuredby thesumoftheabsolutedi˙erences;bytime-point(DAS:daysaftersowing,Stage:VEG=vegetative, GF=grain˝lling,MAT=maturity)withinenvironment. 86 TableA.1:Accuracyofindirectselection(averageover100training-testingpartitions)forbest phenotypicprediction(principalcomponents(PCR),L1-penalizedprediction(L1-Phen),RNDVI, andGNDVI)andforbestgenotypicprediction(standardSI,optimalPC-SI,L1-PSI,andL2-PSI). PhenotypicpredictionGenotypicprediction Env/TP PCRL1-PhenRNDVIGNDVI SIPC-SIL1-PSIL2-PSI Flat-Drought 45 0.24a0.23a0.23a0.21b 0.18c0.24a0.23a0.24a 52 0.27ab0.27ab0.27ab0.25b 0.20c0.27a0.27a0.27ab 65 0.42a0.42a0.35b0.35b 0.35b0.43a0.43a0.42a 73 0.45ab0.45ab0.41cd0.43bc 0.39d0.46a0.46a0.46a 80 0.44bc0.43c0.35e0.39d 0.40d0.46a0.45ab0.46a 85 0.41abc0.40cd0.32f0.39d 0.35e0.43a0.43ab0.43a 93 0.46bc0.47abc0.36e0.45cd 0.44d0.48ab0.49a0.49a 105 0.67a0.67a0.62b0.64b 0.63b0.68a0.67a0.68a 111 0.68ab0.68ab0.67bc0.64d 0.65cd0.69ab0.69ab0.69a Multi 0.68cd0.68bcd0.68d0.65e 0.00f0.70ab0.70abc0.70a Bed-2IR 50 0.18a0.14cd0.00f0.12d 0.09e0.18a0.15bc0.16ab 57 0.19a0.19a0.00c0.03b 0.19a0.20a0.20a0.20a 70 0.37a0.36a0.20c0.21c 0.31b0.37a0.36a0.38a 78 0.35a0.35a0.25c0.30b 0.28b0.36a0.36a0.36a 85 0.37a0.36a0.22c0.30b 0.29b0.38a0.37a0.38a 90 0.30abcd0.29cd0.21e0.28d 0.20e0.32a0.31abc0.32ab 98 0.45a0.46a0.18d0.35c 0.38b0.47a0.46a0.46a 110 0.40abc0.39bc0.34d0.39c 0.35d0.42a0.41ab0.42a 116 0.44a0.44a0.44a0.39b 0.38b0.45a0.44a0.45a Multi 0.53cd0.53d0.46e0.40f 0.01g0.55ab0.54bc0.56a Bed-5IR 50 0.18a0.17ab0.16ab0.15b 0.08c0.17ab0.16ab0.16ab 57 0.25a0.25a0.21c0.21bc 0.14d0.25a0.24a0.24ab 70 0.27a0.26a0.21b0.19b 0.20b0.27a0.27a0.26a 78 0.26a0.24a0.19b0.19b 0.18b0.26a0.24a0.26a 85 0.32a0.32a0.26b0.25b 0.24b0.32a0.32a0.33a 90 0.31a0.31a0.25c0.28b 0.22d0.32a0.32a0.32a 98 0.30a0.29a0.26b0.25b 0.16c0.29a0.28a0.28a 110 0.46a0.45a0.10d0.22c 0.34b0.45a0.45a0.45a 116 0.47a0.47a0.20d0.34c 0.38b0.47a0.47a0.47a Multi 0.54a0.54a0.32c0.37b 0.00d0.54a0.55a0.55a Bed-Heat 72 0.57a0.57a0.54b0.53b 0.50c0.57a0.57a0.57a 79 0.61a0.61a0.60a0.58b 0.51c0.61a0.61a0.61a 92 0.64a0.64a0.65a0.63a 0.55b0.64a0.64a0.64a 100 0.66a0.66a0.67a0.65a 0.57b0.66a0.66a0.66a 107 0.66a0.66a0.67a0.66a 0.56b0.66a0.67a0.66a 112 0.68ab0.68ab0.69a0.66b 0.62c0.68a0.68a0.69a 120 0.69a0.68a0.69a0.66b 0.59c0.69a0.68a0.68a 132 0.62a0.61a0.55b0.54b 0.54b0.62a0.61a0.61a 138 0.54a0.53a0.47b0.46b 0.46b0.54a0.53a0.54a 87 TableA.1(cont'd) Multi 0.71a0.70a0.70a0.67b 0.00c0.71a0.71a0.72a *Eachrowcontainsresultsforeachenvironmentandtime-point(DAS:daysaftersowing).Modelswiththe sameletter(withineachrow)arenotsigni˝cantlydi˙erentfromeachother( = 0 : 05 ,ANOVAfollowedby Tukeytest). 88 APPENDIXB SUPPLEMENTARYMATERIALFROMCHAPTER3 89 FigureB.1:Toptwoprincipalcomponentsofthegenomicrelationshipmatrix, G ,foreachdataset. Eachpointrepresentindividuals.(A)Wheat-599dataset.(B)Wheat-largedataset.Individuals arecolor-groupedbythecycle(sowing-harvestyear). 90 FigureB.2:Boxplotofgrainyieldphenotypicrecords(intonha 1 )byenvironmentalcondition forbothWheat-599andWheat-largedatasets.SDstandarddeviation. 91 FigureB.3:Distributionofthenumberoftrainingsupportpoints( n sup )inoptimalsparseselec- tionindices(resultsobtainedover100trn-tstpartitions; n trn =sizeofthetrainingdataset),by environmentalcondition,Wheat-599dataset. 92 FigureB.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)andthe correspondingsupportpoints(green).Greypointsrepresentgenotypesthatdidnotcontributeto thepredictionofthegeneticvalueofthegenotypeinyellow.Allpanelsrepresentsolutionsforthe environment1,Wheat-599dataset. 93 FigureB.5:(leftandcenter)Weights( ij )ofastandardSI(G-BLUP)andoftheoptimalsparse selectionindex(SSI)versusthegenomicrelationship( g ij ),and(right)proportionofweightsin theSSIthatbelongedtoeitherthesupportingornon-activesets,bygenomic-relationship;by environment,Wheat-largedataset. 94 FigureB.6:(leftandcenter)Weights( ij )ofastandardSI(G-BLUP)andoftheoptimalsparse selectionindex(SSI)versusthegenomicrelationship( g ij ),and(right)proportionofweightsin theSSIthatbelongedtoeitherthesupportingornon-activesets,bygenomic-relationship;by environment,Wheat-599dataset. 95 TableB.1:Numberofavailableobservations,averagegrainyield,andheritabilitybyenvironmental conditionfortheWheat-largedataset. PlantingconditionsNumberof Namen AverageHeritability DateSystemirrigations(SD)Yield(SD) a OptimumBed2B2I3,7324.53(0.261)0.41(0.029) OptimumBed5B5I29,4737.12(0.372)0.57(0.025) OptimumFlat5MEL4,4035.76(0.305)0.23(0.025) LateBed5LHT4,4043.83(0.375)0.51(0.025) OptimumBedMinimalDRB3,7632.74(0.275)0.38(0.029) EarlyBed5EHT2,0406.16(0.525)0.41(0.038) SD.Standarddeviation. a PosteriormeanandSDobtainedacross10,000MonteCarloreplicatesusing Gibbssampling. TableB.2:Numberofavailableobservations,averagegrainyield,andheritabilitybyenvironmental conditionfortheWheat-599dataset. MoistureregimeTemperatureNamen AverageHeritability (SD)Yield(SD) a Optimalirrigation,lowrainfallOptimalEnv15995.14(0.614)0.50(0.054) HighrainfallOptimalEnv25994.51(0.790)0.46(0.056) LowrainfallHighdroughtEnv35993.86(0.592)0.43(0.062) IrrigationorrainfallHot,lowhumidityEnv45993.23(0.636)0.44(0.061) SD.Standarddeviation. a PosteriormeanandSDobtainedacross10,000MonteCarloreplicatesusing Gibbssampling. 96 TableB.3:Maximumpredictionaccuracy(averageacross100partitions)achievedbytheSSIfor di˙erentvaluesoftheparameter ofanElastic-Net-typeSSI,byenvironmentalconditionforthe Wheat-largedataset. Environment 0 a opt b n sup c Accuracy(SD) d B2I 1.53201.000.01413950.649(0.032)b 0.7660 0.250.06673200.663(0.032)a 0.500.03203300.664(0.032)a 0.750.02332900.664(0.032)a 1.000.01553380.664(0.032)a B5I 1.84121.000.01191,2260.610(0.009)b 0.9215 0.250.04601,1870.630(0.009)a 0.500.02231,2030.631(0.009)a 0.750.01641,0440.631(0.009)a 1.000.01329430.631(0.009)a MEL 3.79341.000.01165610.665(0.046)b 1.8967 0.250.07053380.685(0.045)a 0.500.04062700.686(0.045)a 0.750.02942360.687(0.045)a 1.000.01952820.687(0.045)a LHT 0.98411.000.02182370.712(0.026)b 0.4921 0.250.08542480.727(0.025)a 0.500.04911940.729(0.025)a 0.750.02952230.729(0.025)a 1.000.02352020.730(0.025)a DRB 1.75551.000.03441030.679(0.038)b 0.8778 0.250.14611020.694(0.039)a 0.500.0823790.696(0.039)a 0.750.0588690.697(0.040)a 1.000.0386850.697(0.040)a EHT 1.55141.000.02841200.657(0.046)a 0.7757 0.250.09701590.670(0.048)a 0.500.05541260.672(0.048)a 0.750.03991100.672(0.048)a 1.000.02641330.673(0.048)a SD:Standarddeviationacrossthe100trn-tstpartitions. a ShrinkagefactorinvolvedinthestandardSI (Equation(3.2)).Withinenvironment,inthetoprowavalueof 0 wasusedasintheG-BLUPandin rowsbelow, 0 wasreducedtohalf. b Optimalvalueof thatyieldedanSSIwiththemaximumaccuracy amongallindicesobtainedforagridof100valuesof . c Averagenumberofindividualsinsupportingthe predictionofindividualsfromtestingset. d Modelswiththesameletterarenotsigni˝cantlydi˙erentfrom others(ANOVAfollowedbyTukey'sHSDtest,5%signi˝cancelevel. 97 TableB.4:Maximumpredictionaccuracy(averageacross100partitions)achievedbytheSSIfor di˙erentvaluesoftheparameter ofanElastic-Net-typeSSI,byenvironmentalconditionforthe Wheat-599dataset. Environment 0 a opt b n sup c Accuracy(SD) d Env1 1.21011.000.0314840.769(0.062)a 0.5061 0.250.1042990.772(0.063)a 0.500.04921010.773(0.063)a 0.750.02961100.773(0.063)a 1.000.02361030.773(0.063)a Env2 1.30341.000.01751510.708(0.085)a 0.6517 0.250.06861470.711(0.086)a 0.500.03971260.710(0.086)a 0.750.02401360.710(0.086)a 1.000.01921290.710(0.086)a Env3 1.40841.000.0514500.609(0.090)a 0.7042 0.250.2213480.611(0.089)a 0.500.1017480.610(0.090)a 0.750.0601540.609(0.091)a 1.000.0474500.609(0.091)a Env4 1.43801.000.0615400.722(0.073)a 0.7190 0.250.2689390.727(0.074)a 0.500.1483310.727(0.074)a 0.750.0870350.728(0.075)a 1.000.0681320.728(0.075)a SD:Standarddeviationacrossthe100trn-tstpartitions. a ShrinkagefactorinvolvedinthestandardSI (Equation(3.2)).Withinenvironment,inthetoprowavalueof 0 wasusedasintheG-BLUPandin rowsbelow, 0 wasreducedtohalf. b Optimalvalueof thatyieldedanSSIwiththemaximumaccuracy amongallindicesobtainedforagridof100valuesof . c Averagenumberofindividualsinsupportingthe predictionofindividualsfromtestingset. d Modelswiththesameletterarenotsigni˝cantlydi˙erentfrom others(ANOVAfollowedbyTukey'sHSDtest,5%signi˝cancelevel. 98 B.1EquivalencebetweenStandardSelectionIndexandBLUP Considerastandardsingle-traitmodeloftheform y = u + " where y = ¹ y 1 ;:::; y n º 0 , u = ¹ u 1 ;:::; u n º 0 ,and " = ¹ " 1 ;:::;" n º 0 arevectorsofphenotypes,genetic, andenvironmentale˙ects,respectively.Here,forsimplicityweassumethatallthesevectorshave zero-mean. InastandardG-BLUPmodel, u and " areassumedtobeindependent(i.e., co v ¹ u ; " 0 º = 0 ), bothhavenullmeans(i.e., E ¹ u º = E ¹ " º = 0 ),and(co)variancematrices v ar ¹ u º = ˙ 2 u G and v ar ¹ " º = ˙ 2 " I ,respectively;here G isarelationshipmatrixthatcouldbederivedfromapedigree orfromDNAsequences. Considernowapartitionofeachofthedatainintoatraining(trn)andatesting(tst)set. Theobjectiveistopredictthegeneticvaluesoftheindividualsinthetestingset( u tst )usingthe phenotypedataavailablefromthetrainingset( y trn ).The(co)variancematrixofthevectorof breedingvaluescanbepartitionedasfollows v ar © « 2 6 6 6 6 6 4 u trn u tst 3 7 7 7 7 7 5 ª ® ® ¬ = ˙ 2 u 2 6 6 6 6 6 4 G trn G trn ; tst G 0 trn ; tst G tst 3 7 7 7 7 7 5 where G trn and G tst arethegeneticrelationshipsubmatricesforthetrainingandtestingdatapoints, respectively,and G trn ; tst isthegeneticrelationshipsubmatrixbetweentrainingandtestingsubjects. TheBestLinearPredictor(BLP)of u tst ( ^ u tst )takestheform(e.g.,Searleetal.,1992): E ¹ u tst j y trn º = E ¹ u tst º + co v ¹ u tst ; y 0 trn º v ar ¹ y trn º 1 y trn E ¹ y trn º = G 0 trn ; tst ¹ G trn + 0 I º 1 y trn : Alternatively,onecanwrite ^ u tst = H y trn ,where H = G 0 trn ; tst ¹ G trn + 0 I º 1 isa matrix.Thus,theBLUPofthegeneticvalueofthe i th testingindividualis ^ u tst ¹ i º = H 0 i y trn where H 0 i isthe i th rowof H ,thatis H 0 i = G 0 i ¹ G trn + 0 I º 1 whichisequaltotheweightsofthestandard selectionindex, ^ 0 i = G 0 i ¹ G trn + 0 I º 1 (seeEquation(3.2)inthemanuscript). 99 B.2SparseSelectionIndices(SSI)usingtheSFSIR-package Inthissection,weusedatafromtheWheat-599datasetusedinthemanuscripttoillustrate howto˝tSparseSelectionIndicesusingtheSFSIR-package(Lopez-Cruzetal.,2020). B.2.1InstallingthepackagefromGitHub ThefollowingsnippetshowshowtoinstallthepackagefromGitHub. rm(list=ls()) #Installdevtoolspackagefirst install.packages('devtools',repos='https://cran.r-project.org/') #InstallSFSIpackagefromGitHub devtools::install_git('https://github.com/MarcooLopez/SFSI') library(SFSI)#Loadthepackage #InstallBGLRpackage(neededtodownloadthedata) install.packages('BGLR',repos='https://cran.r-project.org/') B.2.2Datapreparation ToillustratetheuseofthesoftwarewewillusedatafromtheWheat-599datasetwhichisavailable withtheBGLRR-package(Perez&delosCampos,2014).Thefollowingcodeshowshowto preparedataforenvironment1;alltheanalyseshereinafterarebasedonthisdata. data(wheat,package="BGLR")#LoaddatafromtheBGLRpackage #Selecttheenvironment1toworkwith y<-as.vector(scale(wheat.Y[,1])) #CalculateGmatrix G<-tcrossprod(scale(wheat.X))/ncol(wheat.X) #Createadirectoryandsavedata dir.create("data",recursive=TRUE) save(y,G,file="data/geno_pheno.RData") 100 B.2.3Heritabilityandvariancecomponents ImplementingtheSSIrequiresanestimateoftheheritability.WeobtainthisusingaG-BLUP model y i = + u i + " i with " i iid ˘ N ¹ 0 ;˙ 2 " º and u ˘ N ¹ 0 ;˙ 2 u G º .Thismodelcanbe˝ttedwiththe '˝tBLUP'functionincludedintheSFSIR-package.TheBGLRR-packagecanbealsousedto˝t aBayesianversionofthemodel.Thecodebelowillustrateshowtoestimateheritabilityusingthe '˝tBLUP'function. load("data/geno_pheno.RData")#Loaddata #Fitmodel fm0<-fitBLUP(y,K=G) fm0$h2<-fm0$varU/(fm0$varU+fm0$varE)#Estimateheritability c(fm0$varU,fm0$varE,fm0$h2)#Variancecomponents(varU,varE,h2) #Createadirectoryandsavedata dir.create("output",recursive=TRUE) save(fm0,file="output/varComps.RData") B.2.4Training-testingpartitions Thecodebelowproducestraining(trn, 70% )andtesting(tst, 30% )partitions.Theparameter 'nPart'de˝nesthenumberofpartitions.Theoutputisamatrixwith'nPart'columnsand180rows containingindicesrepresentingtheobservationsthatareassignedtothetestingsets.Theobjectis savedinthe˝le'partitions.RData'andwillbeusedinlateranalyses. nPart<-5#Numberofpartitions load("data/geno_pheno.RData")#Loaddata nTST<-ceiling(0.3*length(y))#NumberofelementsinTSTset partitions<-matrix(NA,nrow=nTST,ncol=nPart)#Matrixtostorepartitions seeds<-round(seq(1E3,.Machine$integer.max,length=nPart)) for(kin1:nPart) {set.seed(seeds[k]) partitions[,k]<-sample(1:length(y),nTST,replace=FALSE) } save(partitions,file="output/partitions.RData")#Savepartitions 101 B.2.5AccuracyoftheG-BLUPandoftheSparseSI ThefollowingscriptshowshowtoderiveSSIsusingthepartitionsabovecreated.Theweightsofthe SSIarecomputedusingthe'SSI'functionfor'nLambda=100'valuesof .TheG-BLUPmodelis ˝ttedforcomparisonusingthe'˝tBLUP'function.Estimatesof and h 2 arecomputedinternally inthe'SSI'functionwhenthesearenotprovided.TheseestimatesobtainedfromtheG-BLUP modelwillbepassedtothe'SSI'functiontosavetime.Indicesdenotingtrainingandtestingsets arepassedthroughthe'trn'and'tst'parameters,respectively.TheaccuracyoftheG-BLUPand SSImodelsarestoredintheobject'accSSI',andsavedinthe˝le'results_accuracy.RData'. #Loaddata load("data/geno_pheno.RData") load("output/varComps.RData") load("output/partitions.RData") accSSI<-mu<-h2<-c()#Objectstostoreresults for(kin1:ncol(partitions)) {cat("partition=",k,"\n") tst<-partitions[,k] trn<-(1:length(y))[-tst] yNA<-y;yNA[tst]<-NA #G-BLUPmodel fm1<-fitBLUP(yNA,K=G) mu[k]<-fm1$b#Retrievemuestimate h2[k]<-fm1$h2#Retrieveh2estimate #SparseSI fm2<-SSI(y,K=G,b=mu[k],h2=h2[k],trn=trn,tst=tst,mc.cores=1,nLambda=100) fm3<-summary(fm2)#Usefulfunctiontogetresults accuracy<-c(GBLUP=cor(fm1$u[tst],y[tst]),fm3$accuracy)/sqrt(fm0$h2) lambda<-c(min(fm3$lambda),fm3$lambda) df<-c(max(fm3$df),fm3$df) accSSI<-rbind(accSSI,data.frame(rep=k,SSI=names(accuracy),accuracy,lambda,df)) } save(mu,h2,accSSI,file="output/results_accuracy.RData") 102 B.2.6DisplayingResults Thefollowingcodecreatesaplot(asinFigure3.1inthemanuscript)showingtheestimatedgenetic predictionaccuracybyvaluesofthepenaltyparameter(inlogarithmicscale).Therightmostpoint intheplotcorrespondstotheG-BLUPmodel(obtainedwhen = 0 ).Thepointatthepeakdenotes themaximumaccuracythatwasobtainedbytheSSI. load("output/results_accuracy.RData") dat<-data.frame(do.call(rbind,lapply(split(accSSI,accSSI$SSI), function(x)apply(x[,-c(1:2)],2,mean)))) dat$Model<-unlist(lapply(strsplit(rownames(dat),"\\."),function(x)x[1])) dat2<-do.call(rbind,lapply(split(dat,dat$Mod),function(x)x[which.max(x$acc),])) ggplot(dat[dat$df>1,],aes(-log(lambda),accuracy))+ geom_hline(yintercept=dat["GBLUP",]$accuracy,linetype="dashed")+ geom_line(aes(color=Model),size=1.1)+theme_bw()+ geom_point(data=dat2,aes(color=Model),size=2.5) B.2.7Cross-validatingtoobtainanoptimalpenalization Thesnippetbelowcanbeusedtoperform,withineachtrn-tstpartition, k -foldsCVtogetan 'optimal'valueof withinthetrainingdata,andthenusedto˝tanSSIforthetestingset.TheCV isimplementedusingthe'SSI_CV'functionfromtheSFSIR-packageforone5-foldsCV,thiscan besetbychangingthe'nCV'and'nFolds'parameters. 103 load("data/geno_pheno.RData");load("output/varComps.RData") load("output/partitions.RData");load("output/results_accuracy.RData") lambdaCV<-accSSI_CV<-dfCV<-c()#Objectstostoreresults for(kin1:ncol(partitions)) {cat("partition=",k,"\n") tst<-partitions[,k] trn<-(1:length(y))[-tst] #Cross-validatingthetrainingset fm1<-SSI_CV(y,K=G,trn.CV=trn,nLambda=100,mc.cores=1,nFolds=5,nCV=1) lambdaCV[k]<-summary(fm1)$optCOR["mean","lambda"] #FitaSSIwiththeestimatedlambda fm2<-SSI(y,K=G,b=mu[k],h2=h2[k],trn=trn,tst=tst,lambda=lambdaCV[k]) accSSI_CV[k]<-summary(fm2)$accuracy/sqrt(fm0$h2) dfCV<-cbind(dfCV,fm2$df) } save(accSSI_CV,lambdaCV,dfCV,file="output/results_accuracyCV.RData") Afterrunningtheaboveanalysis,thefollowingsnippetcanberuntocreateaplot(asinFigure 3.2inthemanuscript)comparingpartition-wisetheaccuracyoftheoptimalSSIwiththatofthe G-BLUP.Theaverageaccuraciesarealsoshownintheplot. load("output/results_accuracy.RData") load("output/results_accuracyCV.RData") dat<-data.frame(GBLUP=accSSI[accSSI$SSI=="GBLUP",]$acc,SSI=accSSI_CV) rg<-range(dat) tmp<-c(mean(rg),diff(rg)*0.4) ggplot(dat,aes(GBLUP,SSI))+geom_abline(slope=1,linetype="dotted")+ geom_point(shape=21,color="orange")+xlim(rg)+ylim(rg)+ annotate("text",tmp[1],tmp[1]-tmp[2],label=round(mean(dat$GBLUP),3))+ annotate("text",tmp[1]-tmp[2],tmp[1],label=round(mean(dat$SSI),3)) Thecodebelowcreatesaplot(asinFigure3.3inthemanuscript)showingthedistributionof thenumberofpointsinthesupportsetfortheSSI,acrossallpartitions. 104 load("output/results_accuracyCV.RData") dat<-data.frame(df=as.vector(dfCV)) bw<-round(diff(range(dat$df))/40) ggplot(data=dat,aes(df,stat(count)/length(dfCV)))+theme_bw()+ geom_histogram(color="gray20",fill="lightblue",binwidth=bw)+ labs(x=bquote("Supportsetsize("*n[sup]*")"),y="Frequency") B.2.8Subject-speci˝ctrainingsets Thenextscriptcanbeusedtocreateaplot(asinFigure3.4inthemanuscript)showing(fora singletrn-tstpartition)thesubsetofpointsinthesupportset,foreachindividualbeingpredicted. Thisplotcanbemadethroughthe'plotNet'functionfromtheSFSIpackage. #Loaddata load("data/geno_pheno.RData") load("output/partitions.RData") load("output/results_accuracyCV.RData") part<-1#Chooseanypartitionfrom1,...,nPart tst<-partitions[,part] trn<-(1:length(y))[-tst] #FitSSIwithlambdapreviouslyestimatedusingCV fm<-SSI(y,K=G,trn=trn,tst=tst,lambda=lambdaCV[part]) plotNet(fm,K=G,tst=fm$tst[1:25],single=FALSE,title=NULL,bg.col="white") 105 APPENDIXC SUPPLEMENTARYFIGURESANDTABLESFROMCHAPTER4 106 FigureC.1:Genomicrelationships( G ij )versuskernelrelationships( K ij )ofindividualsincycle 2019withthoseincycles2017(left)and2018(right). G i j and K i j arethe ij th elementof G and K ¹ º ,respectively.(A) K 1 = K ¹ 0 : 2 º .(B) K 2 = K ¹ 1 º .(C) K 3 = K ¹ 5 º . 107 FigureC.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthedatafrom the2017,2018,or2017+2018cyclesalone(top-leftpanel),orincombinationwithaproportion (5%,10%,15%)ofthedatafromthe2019cycle.Thepredictionsetconsistedof612genotypes fromthe2019cyclethatwerenotusedformodeltraining.Modelswiththesameletterwithin panelindicatenosigni˝cantdi˙erencefromeachother( = 0 : 05 ,ANOVAfollowedbyTukey test).GY-DRTtrait-environmentcombination. 108 FigureC.3:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthedatafrom the2017,2018,or2017+2018cyclesalone(top-leftpanel),orincombinationwithaproportion (5%,10%,15%)ofthedatafromthe2019cycle.Thepredictionsetconsistedof612genotypes fromthe2019cyclethatwerenotusedformodeltraining.Modelswiththesameletterwithin panelindicatenosigni˝cantdi˙erencefromeachother( = 0 : 05 ,ANOVAfollowedbyTukey test).TraitPH(OPTandDRTenvironments). 109 FigureC.4:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(horizontalaxis) versusthepredictionaccuracyofallothermodels(verticalaxisofeachpanel).(B)Prediction accuracyofthestandard*-BLUPmodel(horizontalaxis)versusthepredictionaccuracyofits sparseversion(verticalaxis),bytypeofkernelusedinpanels.Eachpointrepresentatraining- testingpartitionwithineachtrainingsetcomposition.Coloredpointsabove(below)the45degree linerepresentcasesforwhichonemodeloutperformedtheothermodel.P:p-valueforthetest (fromANOVA)fordi˙erencesinaccuracybetweenthetwomodels.GY-DRTtrait-environment combination. 110 FigureC.5:(left)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(horizontalaxis) versusthepredictionaccuracyofallothermodels(verticalaxisofeachpanel).(right)Prediction accuracyofthestandard*-BLUPmodel(horizontalaxis)versusthepredictionaccuracyofits sparseversion(verticalaxis),bytypeofkernelusedinpanels.Eachpointrepresentatraining- testingpartitionwithineachtrainingsetcomposition.Coloredpointsabove(below)the45degree linerepresentcasesforwhichonemodeloutperformedtheothermodel.P:p-valueforthetest (fromANOVA)fordi˙erencesinaccuracybetweenthetwomodels.TraitPH(OPTandDRT environments). 111 FigureC.6:Heatmapofthecoe˚cientsintheHatmatrix( ~ B ¹ º K )ofthesparseKA-BLUPmodel foronetraining-prediction(TS-PS)partitioninthepredictionof n PS = 612 individualsfrom2019 using n TS = 2427 individuals(2017+2018plus15%ofthe2019set).Predictedindividualsare presentedincolumnsandtrainingindividualsarepresentedinrowsseparatedbycycleandnumber ofindividualsinparentheses.Thevalueof wasobtainedbycross-validation.Eachcolumn representsvaluesofthevector ~ b ¹ º i K = f ~ b ij g , j = 1 ;:::; 2427 (Equation(4.6)).Individualsno contributingtothepredictionhaveacoe˚cient ~ b ij = 0 representedingreycolor.Individualswith anon-zerocoe˚cientareshowninayellow-bluelogarithmscale(intheoriginalscale,yellow indicateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environmentcombination. 112 FigureC.7:Proportionofthetrainingindividualsfromeachcyclethatcontributedtotheprediction ofthe612testinggenotypesfrom2019,usingsparsemodelswithdi˙erentrelationshipmatrices (horizontalaxis): G , K 1 , K 2 ,or K A .Thetrainingsetwascomposedbyindividualsfrom2017 ( n = 901 )and2018( n = 1417 )alone(top-leftpanel)orincombinationwithaproportion(5%, 10%,15%)ofthedatafromthe2019cycle.GY-DRTtrait-environmentcombination. 113 FigureC.8:Proportionofthetrainingindividualsfromeachcyclethatcontributedtotheprediction ofthe612testinggenotypesfrom2019,usingsparsemodelswithdi˙erentrelationshipmatrices (horizontalaxis): G , K 1 , K 2 ,or K A .Thetrainingsetwascomposedbyindividualsfrom2017 ( n = 901 )and2018( n = 1417 )alone(top-leftpanels)orincombinationwithaproportion(5%, 10%,15%)ofthedatafromthe2019cycle.TraitPH(OPTandDRTenvironments). 114 TableC.1:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,and10%ofsubjectsfromthe2019cycle),traitGY,environmentOPT. Accuracy(SD)%Gain TS( n TS )GRM CV a n sup (RS) b h 2 StandardSparseI c II d TS+0%(2019) G0.0139271(30)0.510.20(0.017)0.25(0.018)024 K10.0028307(34)0.860.21(0.017)0.20(0.017)5-9 2017K20.0037388(43)0.730.24(0.017)0.26(0.017)177 (901)K30.0000901(100)0.960.26(0.016)0.26(0.016)280 KA0.0023435(48)0.860.24(0.019)0.26(0.029)196 G0.0086469(33)0.610.25(0.015)0.22(0.015)0-11 K10.0011664(47)0.910.25(0.015)0.22(0.015)2-13 2018K20.0019743(52)0.790.26(0.015)0.24(0.016)5-8 (1417)K30.00001417(100)0.920.23(0.015)0.23(0.015)-70 KA0.0015680(48)0.880.26(0.016)0.24(0.017)4-8 G0.0120569(25)0.520.33(0.015)0.33(0.015)02 K10.0028484(21)0.880.33(0.015)0.31(0.015)2-6 2017+18K20.00131374(59)0.760.34(0.015)0.33(0.015)4-2 (2318)K30.00002316(100)0.880.31(0.015)0.31(0.015)-40 KA0.00101446(62)0.830.33(0.016)0.33(0.016)3-3 TS+5%(2019) G0.0149241(26)0.500.32(0.036)0.36(0.035)012 K10.0027292(31)0.860.34(0.036)0.36(0.036)58 2017K20.0028469(50)0.730.37(0.037)0.38(0.038)162 (938)K30.0000938(100)0.930.34(0.038)0.34(0.038)70 KA0.0015550(59)0.840.36(0.038)0.37(0.038)141 G0.0097412(28)0.610.26(0.021)0.24(0.020)0-7 K10.0014563(39)0.910.27(0.021)0.26(0.024)2-3 2018K20.0015835(57)0.790.28(0.021)0.27(0.021)7-3 (1454)K30.00001454(100)0.920.28(0.021)0.28(0.021)60 KA0.0012749(52)0.890.28(0.021)0.26(0.024)6-6 G0.0142441(19)0.530.34(0.019)0.36(0.019)04 K10.0025534(23)0.880.35(0.019)0.34(0.024)3-4 2017+18K20.00181117(47)0.760.37(0.021)0.37(0.021)9-1 (2355)K30.00002354(100)0.880.36(0.022)0.36(0.022)40 KA0.00131235(52)0.830.37(0.021)0.37(0.020)8-2 TS+10%(2019) G0.0186183(19)0.520.42(0.029)0.45(0.030)07 K10.0041180(18)0.870.44(0.029)0.46(0.030)44 2017K20.0067218(22)0.750.47(0.030)0.47(0.030)121 (974)K30.0000974(100)0.920.48(0.032)0.48(0.032)140 KA0.0045244(25)0.850.46(0.030)0.47(0.032)101 G0.0111349(23)0.610.40(0.029)0.41(0.030)02 K10.0017480(32)0.910.41(0.029)0.42(0.030)32 115 TableC.1(cont'd) 2018K20.0017781(52)0.790.44(0.029)0.43(0.029)10-1 (1490)K30.00001490(100)0.900.42(0.028)0.42(0.028)50 KA0.0012781(52)0.880.43(0.029)0.42(0.029)8-1 G0.0126455(19)0.530.42(0.021)0.44(0.021)04 K10.0027459(19)0.880.44(0.021)0.44(0.026)41 2017+18K20.00211032(43)0.760.46(0.023)0.46(0.023)10-1 (2391)K30.00002390(100)0.870.46(0.025)0.46(0.025)90 KA0.00151153(48)0.830.46(0.023)0.46(0.023)9-1 GRM:Geneticrelationshipmatrix.SD:standarddeviation. a PenalizationparameterinEquation(4.6)found bycross-validatingtheTS. b n sup =averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient inthesparseHatmatrix(supportset).RS:relativesparsity( 100 n TS š n sup ).Inthestandardmodels CV is equaltozeroand n sup isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy ofthe c standardK-BLUPrelativetothestandardG-BLUP,and d sparse*-BLUPrelativetothestandard *-BLUP(*=G-orK-). 116 TableC.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitGY,environmentDRT. Accuracy(SD)%Gain TS( n TS )GRM CV a n sup (RS) b h 2 StandardSparseI c II d TS+0%(2019) G0.0071418(46)0.570.27(0.014)0.29(0.014)06 K10.0012522(58)0.890.28(0.014)0.20(0.020)4-30 2017K20.0015597(66)0.770.31(0.014)0.33(0.014)154 (901)K30.0000901(100)0.940.34(0.013)0.34(0.013)230 KA0.0008680(75)0.880.31(0.018)0.30(0.033)13-2 G0.0075561(40)0.410.20(0.018)0.23(0.018)017 K10.0011768(54)0.810.20(0.018)0.19(0.017)3-8 2018K20.00011387(98)0.600.23(0.018)0.24(0.018)182 (1417)K30.00001417(100)0.660.26(0.013)0.26(0.014)34-2 KA0.00001417(100)0.690.23(0.020)0.23(0.021)19-3 G0.0066954(41)0.440.23(0.018)0.27(0.018)014 K10.00131013(44)0.840.24(0.018)0.25(0.018)25 2017+18K20.00032049(88)0.650.28(0.017)0.29(0.017)211 (2318)K30.00002318(100)0.750.38(0.013)0.38(0.013)630 KA0.00002318(100)0.730.29(0.019)0.29(0.029)222 TS+5%(2019) G0.0151243(26)0.570.41(0.028)0.46(0.027)011 K10.0031331(35)0.890.43(0.028)0.44(0.053)42 2017K20.0035484(52)0.770.46(0.026)0.46(0.026)12-1 (938)K30.0000938(100)0.920.45(0.021)0.45(0.021)90 KA0.0010659(70)0.870.46(0.027)0.45(0.033)10-1 G0.0126373(26)0.410.39(0.034)0.43(0.031)010 K10.0013678(47)0.820.41(0.033)0.42(0.035)44 2018K20.00001453(100)0.610.45(0.030)0.45(0.032)150 (1454)K30.00001454(100)0.670.48(0.026)0.48(0.026)22-1 KA0.00001454(100)0.700.46(0.029)0.45(0.031)160 G0.0163387(16)0.440.40(0.035)0.46(0.024)016 K10.0020728(31)0.840.41(0.034)0.46(0.030)311 2017+18K20.00081696(72)0.650.46(0.031)0.45(0.034)15-1 (2355)K30.00002355(100)0.760.49(0.020)0.49(0.020)240 KA0.00061773(75)0.730.46(0.031)0.46(0.038)150 TS+10%(2019) G0.0174190(19)0.580.54(0.033)0.58(0.030)07 K10.0038184(19)0.890.55(0.032)0.58(0.028)24 2017K20.0064229(23)0.780.57(0.029)0.59(0.029)62 (974)K30.0000974(100)0.920.54(0.031)0.54(0.031)00 KA0.0040284(29)0.880.57(0.029)0.58(0.031)52 G0.0124354(24)0.410.49(0.026)0.53(0.025)09 K10.0020488(33)0.820.50(0.026)0.53(0.027)36 117 TableC.2(cont'd) 2018K20.00021416(95)0.610.54(0.024)0.54(0.025)110 (1490)K30.00001490(100)0.680.56(0.021)0.56(0.022)150 KA0.00001482(99)0.700.54(0.024)0.54(0.026)110 G0.0152389(16)0.450.51(0.030)0.56(0.028)010 K10.0022622(26)0.840.52(0.029)0.55(0.028)36 2017+18K20.00051981(83)0.660.56(0.026)0.56(0.027)100 (2391)K30.00002391(100)0.760.58(0.021)0.58(0.021)140 KA0.00012292(96)0.730.56(0.026)0.56(0.027)100 TS+15%(2019) G0.0236126(12)0.580.64(0.029)0.68(0.025)07 K10.0058125(12)0.890.65(0.028)0.68(0.025)24 2017K20.0097125(12)0.790.67(0.026)0.68(0.023)52 (1010)K30.00001010(100)0.920.64(0.025)0.64(0.025)10 KA0.0061145(14)0.870.66(0.026)0.68(0.024)43 G0.0126343(22)0.420.60(0.026)0.65(0.030)08 K10.0013705(46)0.830.62(0.026)0.64(0.028)33 2018K20.00011491(98)0.620.65(0.025)0.65(0.025)80 (1526)K30.00001526(100)0.690.65(0.024)0.65(0.024)80 KA0.00001518(99)0.710.65(0.026)0.65(0.026)80 G0.0175273(11)0.450.60(0.030)0.67(0.026)011 K10.0028458(19)0.840.62(0.029)0.66(0.026)37 2017+18K20.00111488(61)0.660.65(0.027)0.66(0.026)91 (2464)K30.00002427(100)0.760.65(0.024)0.65(0.024)80 KA0.00071682(69)0.740.65(0.027)0.66(0.026)90 GRM:Geneticrelationshipmatrix.SD:standarddeviation. a PenalizationparameterinEquation(4.6)found bycross-validatingtheTS. b n sup =averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient inthesparseHatmatrix(supportset).RS:relativesparsity( 100 n TS š n sup ).Inthestandardmodels CV is equaltozeroand n sup isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy ofthe c standardK-BLUPrelativetothestandardG-BLUP,and d sparse*-BLUPrelativetothestandard *-BLUP(*=G-orK-). 118 TableC.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,and10%ofsubjectsfromthe2019cycle),traitPH,environmentOPT. Accuracy(SD)%Gain TS( n TS )GRM CV a n sup (RS) b h 2 StandardSparseI c II d TS+0%(2019) G0.0024670(74)0.550.14(0.014)0.12(0.014)0-9 K10.0005735(82)0.880.14(0.014)0.13(0.015)1-8 2017K20.0006763(85)0.760.13(0.014)0.14(0.014)-23 (901)K30.0000901(100)0.960.04(0.012)0.04(0.012)-73-3 KA0.0002857(95)0.870.12(0.019)0.11(0.028)-14-3 G0.0131317(22)0.710.02(0.015)0.08(0.015)0430 K10.0020387(27)0.950.06(0.015)0.08(0.015)26543 2018K20.0024615(43)0.870.10(0.015)0.12(0.014)56422 (1417)K30.00001417(100)0.930.06(0.013)0.06(0.013)322-1 KA0.0019545(38)0.930.09(0.021)0.13(0.017)51733 G0.0129499(22)0.640.11(0.013)0.07(0.014)0-38 K10.0014835(36)0.930.15(0.013)0.07(0.013)39-51 2017+18K20.00081660(72)0.840.16(0.013)0.13(0.014)49-19 (2318)K30.00002318(100)0.940.06(0.013)0.06(0.013)-46-1 KA0.00051786(77)0.890.15(0.018)0.14(0.024)40-8 TS+5%(2019) G0.0010819(87)0.550.23(0.031)0.22(0.031)0-1 K10.0002859(92)0.880.23(0.031)0.23(0.032)3-3 2017K20.0003860(92)0.750.25(0.031)0.24(0.031)9-2 (938)K30.0000938(100)0.920.20(0.032)0.20(0.032)-100 KA0.0002882(94)0.870.24(0.033)0.22(0.035)5-5 G0.0103373(26)0.700.10(0.037)0.18(0.036)076 K10.0016458(31)0.940.16(0.036)0.19(0.036)5917 2018K20.0018714(49)0.870.24(0.032)0.25(0.032)1365 (1454)K30.00001454(100)0.920.24(0.027)0.24(0.027)1380 KA0.0013682(47)0.920.23(0.034)0.24(0.033)1215 G0.0121500(21)0.630.16(0.024)0.14(0.028)0-11 K10.00061388(59)0.920.22(0.025)0.17(0.027)34-19 2017+18K20.00061815(77)0.840.26(0.026)0.25(0.026)61-4 (2355)K30.00002355(100)0.920.21(0.027)0.21(0.027)280 KA0.00041847(78)0.890.25(0.028)0.24(0.026)53-5 TS+10%(2019) G0.0026706(72)0.550.28(0.043)0.28(0.044)0-1 K10.0004817(84)0.880.29(0.042)0.28(0.044)3-2 2017K20.0004857(88)0.760.30(0.043)0.30(0.043)8-1 (974)K30.0000974(100)0.920.28(0.046)0.28(0.046)-10 KA0.0002907(93)0.860.29(0.042)0.28(0.043)4-2 G0.0128291(20)0.700.17(0.072)0.25(0.065)048 K10.0020363(24)0.940.23(0.066)0.25(0.066)3511 119 TableC.3(cont'd) 2018K20.0022624(42)0.870.30(0.054)0.31(0.055)792 (1490)K30.00001490(100)0.920.32(0.043)0.32(0.043)920 KA0.0019540(36)0.920.29(0.057)0.30(0.055)704 G0.0117468(20)0.630.22(0.047)0.22(0.055)00 K10.0013798(33)0.920.28(0.044)0.25(0.053)26-8 2017+18K20.00051846(77)0.840.32(0.042)0.31(0.042)47-3 (2391)K30.00002391(100)0.920.29(0.042)0.29(0.042)320 KA0.00031982(83)0.890.31(0.043)0.30(0.044)43-3 GRM:Geneticrelationshipmatrix.SD:standarddeviation. a PenalizationparameterinEquation(4.6)found bycross-validatingtheTS. b n sup =averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient inthesparseHatmatrix(supportset).RS:relativesparsity( 100 n TS š n sup ).Inthestandardmodels CV is equaltozeroand n sup isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy ofthe c standardK-BLUPrelativetothestandardG-BLUP,and d sparse*-BLUPrelativetothestandard *-BLUP(*=G-orK-). 120 TableC.4:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition (including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitPH,environmentDRT. Accuracy(SD)%Gain TS( n TS )GRM CV a n sup (RS) b h 2 StandardSparseI c II d TS+0%(2019) G0.0025658(73)0.640.18(0.013)0.17(0.013)0-4 K10.0005718(80)0.920.19(0.013)0.34(0.012)978 2017K20.0003842(93)0.840.24(0.013)0.25(0.013)344 (901)K30.0000901(100)1.000.33(0.012)0.33(0.012)841 KA0.0000901(100)0.920.22(0.024)0.11(0.063)26-51 G0.0175263(19)0.59-0.15(0.017)-0.11(0.017)0-29 K10.0028319(23)0.91-0.15(0.017)-0.13(0.017)-3-13 2018K20.0027589(42)0.81-0.13(0.018)-0.12(0.018)-14-9 (1417)K30.00001417(100)0.96-0.06(0.019)-0.06(0.019)-630 KA0.0017675(48)0.87-0.14(0.021)-0.13(0.018)-11-8 G0.0178363(16)0.58-0.05(0.017)-0.04(0.017)0-18 K10.0028447(19)0.91-0.05(0.017)-0.11(0.018)-3115 2017+18K20.0033741(32)0.82-0.04(0.017)-0.05(0.017)-2836 (2318)K30.00002316(100)0.970.08(0.019)0.08(0.019)-2430 KA0.0022853(37)0.87-0.04(0.020)-0.05(0.026)-3039 TS+5%(2019) G0.0025664(71)0.660.55(0.024)0.56(0.025)02 K10.0003781(83)0.920.57(0.023)0.60(0.021)35 2017K20.0001907(97)0.860.61(0.020)0.60(0.021)90 (938)K30.0000938(100)0.980.58(0.030)0.58(0.029)50 KA0.0000919(98)0.930.60(0.022)0.57(0.039)7-5 G0.0157271(19)0.590.17(0.040)0.26(0.044)060 K10.0030286(20)0.910.22(0.043)0.29(0.044)3234 2018K20.0024625(43)0.820.31(0.045)0.33(0.045)877 (1454)K30.00001454(100)0.950.34(0.037)0.34(0.037)1040 KA0.0018658(45)0.870.30(0.046)0.32(0.047)835 G0.0214261(11)0.580.21(0.030)0.34(0.033)059 K10.0034344(15)0.910.27(0.032)0.32(0.080)2420 2017+18K20.0037636(27)0.820.36(0.034)0.36(0.035)681 (2355)K30.00002355(100)0.960.43(0.027)0.43(0.027)1020 KA0.0028699(30)0.870.35(0.037)0.36(0.038)652 TS+10%(2019) G0.0050625(64)0.670.63(0.025)0.64(0.028)01 K10.0006769(79)0.930.65(0.024)0.65(0.027)20 2017K20.0000970(99)0.860.68(0.023)0.67(0.025)60 (974)K30.0000974(100)0.960.67(0.029)0.67(0.029)50 KA0.0001947(97)0.930.67(0.024)0.66(0.028)5-1 G0.0184211(14)0.590.51(0.037)0.62(0.037)023 K10.0030260(17)0.910.58(0.033)0.64(0.035)1310 121 TableC.4(cont'd) 2018K20.0023628(42)0.830.65(0.028)0.66(0.028)282 (1490)K30.00001490(100)0.940.60(0.031)0.60(0.031)170 KA0.0018653(44)0.870.64(0.029)0.66(0.030)272 G0.0212231(10)0.580.49(0.030)0.63(0.027)028 K10.0034309(13)0.910.57(0.028)0.65(0.027)1513 2017+18K20.0040559(23)0.820.65(0.023)0.67(0.023)332 (2391)K30.00002391(100)0.950.64(0.022)0.64(0.022)310 KA0.0033556(23)0.870.65(0.024)0.67(0.025)323 TS+15%(2019) G0.0055487(48)0.660.67(0.023)0.69(0.025)03 K10.0039613(61)0.930.69(0.022)0.70(0.020)22 2017K20.0006804(80)0.860.71(0.021)0.72(0.021)60 (1010)K30.00001010(100)0.970.70(0.027)0.69(0.027)30 KA0.0003857(85)0.920.71(0.022)0.70(0.025)5-1 G0.0198173(11)0.590.54(0.032)0.65(0.031)020 K10.0034204(13)0.910.60(0.029)0.66(0.027)1110 2018K20.0025567(37)0.830.67(0.025)0.68(0.026)232 (1526)K30.00001526(100)0.940.67(0.027)0.67(0.027)220 KA0.0018626(41)0.870.67(0.025)0.68(0.026)231 G0.0213193(8)0.580.51(0.028)0.65(0.025)027 K10.0038253(10)0.910.59(0.026)0.66(0.025)1413 2017+18K20.0034600(25)0.820.67(0.023)0.68(0.023)302 (2464)K30.00002427(100)0.950.67(0.022)0.67(0.022)320 KA0.0028618(25)0.870.66(0.023)0.67(0.025)292 GRM:Geneticrelationshipmatrix.SD:standarddeviation. a PenalizationparameterinEquation(4.6)found bycross-validatingtheTS. b n sup =averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient inthesparseHatmatrix(supportset).RS:relativesparsity( 100 n TS š n sup ).Inthestandardmodels CV is equaltozeroand n sup isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy ofthe c standardK-BLUPrelativetothestandardG-BLUP,and d sparse*-BLUPrelativetothestandard *-BLUP(*=G-orK-). 122 BIBLIOGRAPHY 123 BIBLIOGRAPHY Aguate,FernandoM.,SamuelTrachsel,LorenaGonzález-Pérez,JuanBurgueño,JoséCrossa, MónicaBalzarini,DavidGouache,MatthieuBogard&GustavodelosCampos.2017.Use ofhyperspectralimagedataoutperformsvegetationindicesinpredictionofmaizeyield. Crop Science 57. Akdemir,Deniz&JulioIsidro-Sanchez.2019.Designoftrainingpopulationsforselectivepheno- typingingenomicprediction. Scienti˝cReports 9. Akdemir,Deniz,JulioISanchez&Jean-LucJannink.2015.Optimizationofgenomicselection trainingpopulationswithageneticalgorithm. GeneticsSelectionEvolution 47(38). Alvarado,Gregorio,FranciscoM.Rodríguez,AngelaPacheco,JuanBurgueño,JoséCrossa,Mateo Vargas,PaulinoPérez-Rodríguez&MarcoA.Lopez-Cruz.2020.META-R:Asoftwareto analyzedatafrommulti-environmentplantbreedingtrials. TheCropJournal 8(5). Araus,Luis&JillECairns.2014.Fieldhigh-throughputphenotyping:thenewcropbreeding frontier. TrendsinPlantScience 19(1). Atanda,SikiruAdeniyi,MichaelOlsen,JuanBurgueño,JoseCrossa,DanielDzidzienyo,Yoseph Beyene,ManjeGowda,KateDreher,XuecaiZhang,BoddupalliM.Prasanna,PangirayiTon- goona,EricYirenkyiDanquah,GbadeboOlaoye&KellyR.Robbins.2020.Maximizing e˚ciencyofgenomicselectioninCIMMYT'stropicalmaizebreedingprogram. Theoretical andAppliedGenetics . Babar,MA,MPReynolds,MVanGinkel,ARKlatt,WRRaun&MLStone.2006.Spectral re˛ectancetoestimategeneticvariationforin-seasonbiomass,leafchlorophyll,andcanopy temperatureinwheat. CropScience 46. Bates,Douglas,MartinMächler,BenjaminMBolker&StevenCWalker.2015.Fittinglinear mixed-e˙ectsmodelsusinglme4. JournalofStatisticalSoftware 67(1). Bernardo,Rex&JianmingYu.2007.ProspectsforGenomewideSelectionforQuantitativeTraits inMaize. CropScience 47. Beyene,Yoseph,ManjeGowda,MichaelOlsen,KellyR.Robbins,PaulinoPérez-Rodríguez, GregorioAlvarado,KateDreher,StarYanxinGao,StephenMugo,BoddupalliM.Prasanna& JoseCrossa.2019.EmpiricalComparisonofTropicalMaizeHybridsSelectedThroughGenomic andPhenotypicSelections. FrontiersinPlantScience 10(1502). Bulmer,MG.1985. Themathematicaltheoryofquantitativegenetics .NewYork,USA:Oxford UniversityPress. delosCampos,G.,D.Gianola&G.J.Rosa.2009a.ReproducingkernelHilbertspacesregression: ageneralframeworkforgeneticevaluation. Journalofanimalscience 87. 124 delosCampos,Gustavo,DanielGianola,GuilhermeJ.M.Rosa,KentA.Weigel&JoseCrossa. 2010.Semi-parametricgenomic-enabledpredictionofgeneticvaluesusingreproducingkernel Hilbertspacesmethods. GeneticsResearch 92(4). delosCampos,Gustavo,JohnM.Hickey,RicardoPong-Wong,HansD.Daetwyler&MarioP.L. Calus.2013a.Whole-genomeregressionandpredictionmethodsappliedtoplantandanimal breeding. Genetics 193(2). delosCampos,Gustavo,HugoNaya,DanielGianola,JoséCrossa,AndrésLegarra,Eduardo Manfredi,KentWeigel&JoséMiguelCotes.2009b.Predictingquantitativetraitswithregression modelsfordensemolecularmarkersandpedigree. Genetics 182. delosCampos,Gustavo,AnaIVazquez,RohanFernando,YannCKlimentidis&DanielSorensen. 2013b.PredictionofcomplexhumantraitsusingtheGenomicBestLinearUnbiasedPredictor. PLoSGenetics 9(7). delosCampos,Gustavo,YogasudhaVeturi,AnaI.Vazquez,ChristinaLehermeier&Paulino Pérez-Rodríguez.2015.Incorporatinggeneticheterogeneityinwhole-genomeregressionsusing interactions. JournalofAgricultural,Biological,andEnvironmentalStatistics 20(4). Clark,SamuelA.,JohnM.Hickey,HansD.Daetwyler&JuliusH.J.vanderWerf.2012.The importanceofinformationonrelativesforthepredictionofgenomicbreedingvaluesandthe implicationsforthemakeupofreferencedatasetsinlivestockbreedingschemes. Genetics SelectionEvolution 44(4). Combs,Emily&RexBernardo.2013.Accuracyofgenomewideselectionfordi˙erenttraitswith constantpopulationsize,heritability,andnumberofmarkers. ThePlantGenome 6(1). Cover,T&PHart.1967.Nearestneighborpatternclassi˝cation. IEEETransactionsonInformation Theory 13(1). Crossa,José,GustavodelosCampos,PaulinoPérez,DanielGianola,JuanBurgueño,JoséLuis Araus,DanMakumbi,RaviP.Singh,SusanneDreisigacker,JianbingYan,ViviArief,Marianne Banziger&HansJoachimBraun.2010.Predictionofgeneticvaluesofquantitativetraitsinplant breedingusingpedigreeandmolecularmarkers. Genetics 186(2). Daetwyler,HansD.,BeatrizVillanueva&JohnA.Woolliams.2008.Accuracyofpredictingthe geneticriskofdiseaseusingagenome-wideapproach. PLoSONE 3(10). Dagnachew,B.S.,T.H.E.Meuwissen&T.Ådnøy.2013.GeneticcomponentsofmilkFourier- transforminfraredspectrausedtopredictbreedingvaluesformilkcompositionandqualitytraits indairygoats. JournalofDairyScience 96(9). Dawson,JulieC.,Je˙reyB.Endelman,NicolasHeslot,JoseCrossa,JessePoland,SusanneDreisi- gacker,YannManès,MarkE.Sorrells&JeanLucJannink.2013.Theuseofunbalanced historicaldataforgenomicselectioninaninternationalwheatbreedingprogram. FieldCrops Research 154. 125 Dekkers,JCM.2007.Predictionofresponsetomarker-assistedandgenomicselectionusing selectionindextheory. JournalofAnimalBreedingandGenetics 124. Efron,B,THastie,IJohnstone&RTibshirani.2004.Leastangleregression. TheAnnalsof Statistics 32(2). Endelman,Je˙reyB.2011.RidgeRegressionandOtherKernelsforGenomicSelectionwithR PackagerrBLUP. ThePlantGenomeJournal 4(3). Falconer,DouglasS&TrudyFCMackay.1996. Introductiontoquantitativegenetics .Essex,UK: PrenticeHall4thedn. Ferragina,A,GdelosCampos,AIVazquez,ACecchinato&GBittante.2015.Bayesian regressionmodelsoutperformpartialleastsquaresmethodsforpredictingmilkcomponents andtechnologicalpropertiesusinginfraredspectraldata. JournalofDairyScience 98(11). Ferrio,JP,DVillegas,JZarco,NAparicio,JLAraus&CRoyo.2005.Assessmentofdurum wheatyieldusingvisibleandnear-infraredre˛ectancespectraofcanopies. FieldCropsResearch 94. Friedman,Jerome,TrevorHastie,HolgerHö˛ing&RobertTibshirani.2007.Pathwisecoordinate optimization. TheAnnalsofAppliedStatistics 1(2). Friedman,Jerome,TrevorHastie&RobTibshirani.2010.Regularizationpathsforgeneralized linearmodelsviacoordinatedescent. JournalofStatisticalSoftware 33(1). Fu,WenjiangJ.1998.Penalizedregressions:theBridgeversustheLASSO. JournalofComputa- tionalandGraphicalStatistics 7(3). Garnsworthy,PC,JWiseman&KFegeros.2000.Predictionofchemical,nutritiveandagronomic characteristicsofwheatbynearinfraredspectroscopy. JournalofAgriculturalScience 135. Garrick,DorianJ.2011.Thenature,scopeandimpactofgenomicpredictioninbeefcattleinthe UnitedStates. GeneticsSelectionEvolution 43(1). Garriga,Miguel,SebastiánRomero-Bravo,FélixEstrada,AlejandroEscobar,IvanAMatus,Ale- jandrodelPozo,CesarAAstudillo&GustavoALobos.2017.Assessingwheattraitsbyspectral re˛ectance:dowereallyneedtofocusonpredictedtrait-valuesordirectlyidentifytheelite genotypesgroup? FrontiersinPlantScience 8(280). Gianola,Daniel,RohanL.Fernando&AlessandraStella.2006.Genomic-assistedpredictionof geneticvaluewithsemiparametricprocedures. Genetics 173. Gitelson,AnatolyA,YoramJKaufman&MarkNMerzlyak.1996.Useofagreenchannelin remotesensingofglobalvegetationfromEOS-MODIS. RemoteSensingofEnvironment 58(3). 126 Goddard,Mike.2009.Genomicselection:Predictionofaccuracyandmaximisationoflongterm response. Genetica 136. González-Camacho,J.M.,G.delosCampos,P.Pérez,D.Gianola,J.E.Cairns,G.Mahuku, R.Babu&J.Crossa.2012.Genome-enabledpredictionofgeneticvaluesusingradialbasis functionneuralnetworks. TheoreticalandAppliedGenetics 125(4). Habier,D,RLFernando&JCMDekkers.2007.Theimpactofgeneticrelationshipinformation ongenome-assistedbreedingvalues. Genetics 177. Habier,D,RohanLFernando&DorianJGarrick.2013.GenomicBLUPDecoded:ALookinto theBlackBoxofGenomicPrediction. Genetics 194. Habier,David,JensTetens,FranzReinholdSeefried,PeterLichtner&GeorgThaller.2010.The impactofgeneticrelationshipinformationongenomicbreedingvaluesinGermanHolsteincattle. GeneticsSelectionEvolution 42(5). Haboudane,Driss,JohnRMiller,NicolasTremblay,PabloJZarco-Tejada&LouiseDextraze. 2002.Integratednarrow-bandvegetationindicesforpredictionofcropchlorophyllcontentfor applicationtoprecisionagriculture. RemoteSensingofEnvironment 81. Hadley,Wickman.2016. ggplot2:Elegantgraphicsfordataanalysis .Springer-VerlagNewYork. Hansen,PM&JKSchjoerring.2003.Re˛ectancemeasurementofcanopybiomassandnitrogen statusinwheatcropsusingnormalizeddi˙erencevegetationindicesandpartialleastsquares regression. RemoteSensingofEnvironment 86. Hastie,Trevor,RobertTibshirani&J.H.Friedman.2009. Theelementsofstatisticallearning: datamining,inference,andprediction .NewYork,USA:Springer2ndedn. Hayes,BenJ,PhillipJBowman,AmandaCChamberlain,KlaraVerbyla&MikeEGoddard.2009. Accuracyofgenomicbreedingvaluesinmulti-breeddairycattlepopulations. GeneticsSelection Evolution 41.51. Hazel,LN.1943.Thegeneticbasisforconstructingselectionindexes. Genetics 28(6). He˙ner,ElliotL,MarkESorrells&Jean-LucJannink.2009.GenomicSelectionforCrop Improvement. CropScience 49. Henderson,CR.1950.Estimationofgeneticparameters. TheAnnalsofMathematicalStatistics 21. Henderson,CR.1963.Selectionindexandexpectedgeneticadvance.In Statisticalgeneticsand plantbreeding:Asymposiumandworkshop ,Washington,D.C.:NationalAcademyof Sciences-NationalResearchCouncil. Henderson,CR&RLQuaas.1976.Multipletraitevaluationusingrelatives'records. Journalof AnimalScience 43(6). 127 Hernandez,Javier,GustavoALobos,IvánMatus,AlejandrodelPozo,PaolaSilva&Mauricio Galleguillos.2015.Usingridgeregressionmodelstoestimategrainyieldfrom˝eldspectral datainbreadwheat(TriticumaestivumL.)grownunderthreewaterregimes. RemoteSensing 7. Isidro,Julio,JanninkJean-Luc,DenizAkdemir,JessePoland,NicolasHeslot&MarkESorrells. 2015.Trainingsetoptimizationunderpopulationstructureingenomicselection. TheorAppl Genet 128. Jacobson,Amy,LianLian,ShengqiangZhong&RexBernardo.2014.Generalcombiningability modelforgenomewideselectioninabiparentalcross. CropScience 54(3). Jiang,Guo-Liang.2013.MolecularMarkersandMarker-AssistedBreedinginPlants.In Plant breedingfromlaboratoriesto˝elds ,InTech. Lehermeier,Christina,NicoleKrämer,EvaBauer,CyrilBauland,ChristianCamisan,LauraCampo, PascalFlament,AlbrechtEMelchinger,MonicaMenz,NinaMeyer,LaurenceMoreau,Jesús Moreno-González,MilenaOuzunova,HubertPausch,NicolasRanc,WolfgangSchipprack,Man- fredSchönleben,HildrunWalter,AlainCharcosset&Chris-CarolinSchön.2014.Usefulnessof MultiparentalPopulationsofMaize. Genetics 198. Lehermeier,Christina,ChrisCarolinSchön&GustavodelosCampos.2015.Assessmentof geneticheterogeneityinstructuredplantpopulationsusingmultivariatewhole-genomeregression models. Genetics 201(1). Lopez-Cruz,Marco,EricOlson,GabrielRovere,JoseCrossa,SusanneDreisigacker,Mondal Suchismita,RaviSingh&GustavodelosCampos.2020.Regularizedselectionindicesfor breedingvaluepredictionusinghyper-spectralimagedata. Scienti˝cReports 10(8195). Lorenz,Aaron&KevinPSmith.2015.AddingGeneticallyDistantIndividualstoTraining PopulationsReducesGenomicPredictionAccuracyinBarley. CropScience 55. Lorenzana,RobenzonE.&RexBernardo.2009.Accuracyofgenotypicvaluepredictionsfor marker-basedselectioninbiparentalplantpopulations. TheoreticalandAppliedGenetics 120(1). Lush,JayL.1935.Progenytestandindividualperformanceasindicatorsofananimal'sbreeding value. JournalofDairyScience 18(1). Lush,JayL.1937. Animalbreedingplans .Ames,Iowa:IowaStateCollege,Ames. Lush,JayL.1948. Thegeneticsofpopulations .Iowa:IowaStateCollege,Amesspecialreedn. Mahlein,AK,ECOerke,USteiner&HWilhelmDehne.2012.Recentadvancesinsensingplant diseasesforprecisioncropprotection. EuropeanJournalofPlantPathology 133. Melchinger,AlbrechtE.,H.FriedrichUtz&ChrisC.Schör.1998.Quantitativetraitlocus(QTL) mappingusingdi˙erenttestersandindependentpopulationsamplesinmaizerevealslowpower ofQTLdetectionandlargebiasinestimatesofQTLe˙ects. Genetics 149(1). 128 Meuwissen,THE,BJHayes&MEGoddard.2001.Predictionoftotalgeneticvalueusing genome-widedensemarkermaps. Genetics 157(4). Miah,G.,M.Y.Ra˝i,M.R.Ismail,A.B.Puteh,H.A.Rahim,R.Asfaliza&M.A.Latif.2013. Blastresistanceinrice:areviewofconventionalbreedingtomolecularapproaches. Molecular BiologyReports 40(3). Montes,JM,HFUtz,WSchipprack,BKusterer,JMuminovic,CPaul&AEMelchinger.2006. Near-infraredspectroscopyoncombineharvesterstomeasuremaizegraindrymattercontent andqualityparameters. PlantBreeding 125. Montes,JuanM,AlbrechtEMelchinger&JochenCReif.2007.Novelthroughputphenotyping platformsinplantgeneticstudies. TrendsinPlantScience 12(10). Montesinos-López,Osval,AbelardoMontesinos-López,JoseCrossa,GustavodelosCampos, GregorioAlvarado,MondalSuchismita,JessicaRutkoski,LorenaGonzález-Pérez&JuanBur- gueño.2017.Predictinggrainyieldusingcanopyhyperspectralre˛ectanceinwheatbreeding data. PlantMethods 13(4). Morota,Gota&DanielGianola.2014.Kernel-basedwhole-genomepredictionofcomplextraits: Areview. FrontiersinGenetics 5. Nagel,KerstinA.,AlexanderPutz,FrankGilmer,KathrinHeinz,AndreasFischbach,Johannes Pfeifer,MarcFaget,StephanBlossfeld,MichaelaErnst,ChryssaDimaki,BerndKastenholz, AnnKatrinKleinert,AnnaGalinski,HannoScharr,FabioFiorani&UlrichSchurr.2012. GROWSCREEN-Rhizoisanovelphenotypingrobotenablingsimultaneousmeasurementsof rootandshootgrowthforplantsgrowninsoil-˝lledrhizotrons. FunctionalPlantBiology 39(11). Oblath,EmilyA,TerryAIsbell,MarkABerhow,BrettAllen,DavidArcher,JackBrown,RussellW Gesch,JerryLHat˝eld,JalalDJabro,JamesRKiniry&DanielSLong.2016.Developmentof near-infraredspectroscopycalibrationstomeasurequalitycharacteristicsinintactBrassicaceae germplasm. IndustrialCrops&Products 89. Olson,K.M.,P.M.VanRaden&M.E.Tooker.2012.Multibreedgenomicevaluationsusing purebredHolsteins,Jerseys,andBrownSwiss. JournalofDairyScience 95(9). Perez,Paulino&GustavodelosCampos.2014.Genome-wideregressionandpredictionwiththe BGLRstatisticalpackage. Genetics 198(2). Pérez-Rodríguez,Paulino,JoséCrossa,JessicaRutkoski,JessePoland,RaviSingh,AndrésLegarra, EnriqueAutrique,GustavoDeLosCampos,JuanBurgueño&SusanneDreisigacker.2017. Single-stepgenomicandpedigreegenotype Ö environmentinteractionmodelsforpredicting wheatlinesininternationalenvironments. ThePlantGenomeJournal 10(2). Poland,Jesse,Je˙reyEndelman,JulieDawson,JessicaRutkoski,ShuangyeWu,YannManes, SusanneDreisigacker,JoséCrossa,HéctorSánchez-villeda,MarkSorrells&Jean-LucJannink. 2012.GenomicSelectioninWheatBreedingusingGenotyping-by-Sequencing. ThePlant GenomeJournal 5(3). 129 Pritchard,JonathanK&PeterDonnelly.2001.StudiesofAssociationinStructured orAdmixedPopulations. TheoreticalPopulationBiology 60. Pritchard,JonathanK.,MatthewStephens&PeterDonnelly.2000.Inferenceofpopulationstructure usingmultilocusgenotypedata. Genetics 155. Pszczola,M&MPLCalus.2016.Updatingthereferencepopulationtoachieveconstantgenomic predictionreliabilityacrossgenerations. Animal 10(6). RCoreTeam.2019. R:Alanguageandenvironmentforstatisticalcomputing .RFoundationfor StatisticalComputingVienna,Austria.https://www.R-project.org/. Riedelsheimer,Christian,Je˙reyB.Endelman,MichaelStange,MarkE.Sorrells,JeanLucJannink &AlbrechtE.Melchinger.2013.Genomicpredictabilityofinterconnectedbiparentalmaize populations. Genetics 194. Rincent,R,SNicolas,TAltmann,DBrunel,PRevilla,AMelchinger,EBauer,NMeyer, CGiau˙ret,CBauland,PJamin,JLaborde,HMonod,PFlament,ACharcosset&LMoreau. 2012.Maximizingthereliabilityofgenomicselectionbyoptimizingthecalibrationsetof referenceindividuals:comparisonofmethodsintwodiversegroupsofmaizeinbreds(Zeamays L.). Genetics 192. Rio,Simon,LaurenceMoreau,AlainCharcosset&TristanMary-Huard.2020.Accountingfor group-speci˝callelee˙ectsandadmixtureingenomicpredictions:Theoryandexperimental evaluationinmaize. Genetics 216(1). Roemer,C,MWahabzada,ABallvora,FPinto,MRossini,CPanigada,JBehmann,JLeon,CThu- rau,CBauckhage,KKersting,URascher&LPluemer.2012.EarlyDroughtStressDetection inCereals:SimplexVolumeMaximizationforHyperspectralImageAnalysis. FunctionalPlant Biology 39. Roth,Morgane,HélèneMuranty,MarioDiGuardo,WalterGuerra,AndreaPatocchi&Fabrizio Costa.2020.Genomicpredictionoffruittextureandtrainingpopulationoptimizationtowards theapplicationofgenomicselectioninapple. HorticultureResearch 7(1). Rutkoski,Jessica,JessePoland,SuchismitaMondal,EnriqueAutrique,LorenaPérez-González, JoséCrossa,MatthewReynolds&RaviSingh.2016.Canopytemperatureandvegetationindices fromhigh-throughputphenotypingimproveaccuracyofpedigreeandgenomicselectionforgrain yieldinheat. G3:Genes,Genomes,Genetics 6. Schulz-Streeck,T,JOOgutu,ZKaraman,CKnaak&HPPiepho.2012.GenomicSelectionusing MultiplePopulations. CropScience 52. Searle,SR,GCasella&CEMcCulloch.1992. Variancecomponents .JohnWiley&Sons,Inc. Smith,HF.1936.Adiscrimantfunctionforplantselection. AnnalsofEugenics 7. Soyeurt,H,IMisztal&NGengler.2010.Geneticvariabilityofmilkcomponentsbasedon mid-infraredspectraldata. JournalofDairyScience 93(4). 130 Spielbauer,Gertraud,PaulArmstrong,JohnWBaier,WilliamB.Allen,KatinaRichardson, BoShen&AMarkSettles.2009.High-throughputnear-infraredre˛ectancespectroscopy forpredictingquantitativeandqualitativecompositionphenotypesofindividualmaizekernels. CerealChemistry 86(5). Tang,Haibao,UzaySezen&AndrewH.Paterson.2010.Domesticationandplantgenomes. CurrentOpinioninPlantBiology 13. Tattaris,Maria,MatthewP.Reynolds&ScottC.Chapman.2016.Adirectcomparisonofremote sensingapproachesforhigh-throughputphenotypinginplantbreeding. FrontiersinPlantScience 7. Tibshirani,Robert.1996.RegressionshrinkageandselectionviatheLASSO. JournaloftheRoyal StatisticalSocietyB 58(1). Tucker,ComptonJ.1979.Redandphotographicinfraredlinearcombinationsformonitoring vegetation. RemoteSensingofEnvironment 8(2). VanRaden,PM.2007.Genomicmeasuresofrelationshipandinbreeding. InterbullBulletin 37. VanRaden,PM.2008.E˚cientMethodstoComputeGenomicPredictions. JournalofDairy Science 91(11). Veturi,Yogasudha,GustavodelosCampos,NengjunYi,WenHuang,AnaIVazquez&Brigitte Kühnel.2019.Modelingheterogeneityinthegeneticarchitectureofethnicallydiversegroups usingrandome˙ectinteractionmodels. Genetics 211(April). Weber,VS,JLAraus,JECairns,CSanchez,AEMelchinger&EOrsini.2012.Predictionof grainyieldusingre˛ectancespectraofcanopyandleavesinmaizeplantsgrownunderdi˙erent waterregimes. FieldCropsResearch 128. White,Je˙reyW,PedroAndrade-Sanchez,MichaelAGore,KevinFBronson,TerryACo˙elt, MatthewMConley,KennethAFeldmann,AndrewNFrench,JohnTHeun,DouglasJHunsaker, MatthewAJenks,BruceAKimball,RobertLRoth,RobertJStrand,KellyRThorp,GerardW Wall&GuangyaoWang.2012.FieldCropsResearchField-basedphenomicsforplantgenetics research. FieldCropsResearch 133. Wientjes,YvonneCJ,RoelFVeerkamp&MarioPLCalus.2013.TheE˙ectofLinkage DisequilibriumandFamilyRelationshipsontheReliabilityofGenomicPrediction. Genetics 193. William,H.M.,R.Trethowan&E.M.Crosby-Galvan.2007.Wheatbreedingassistedbymarkers: CIMMYT'sexperience. Euphytica 157(3). Wolc,Anna,JesusArango,PetekSettar,JanetE.Fulton,NeilP.O'Sullivan,RudolfPreisinger, DavidHabier,RohanFernando,DorianJ.Garrick&JackC.M.Dekkers.2011.Persistenceof accuracyofgenomicestimatedbreedingvaluesovergenerationsinlayerchickens. Genetics SelectionEvolution 43(23). 131 Wolc,Anna,A.Kranis,J.Arango,P.Settar,J.E.Fulton,N.P.O'Sullivan,A.Avendano,K.A. Watson,J.M.Hickey,G.delosCampos,R.L.Fernando,D.J.Garrick&J.C.M.Dekkers.2016. Implementationofgenomicselectioninthepoultryindustry. AnimalFrontiers 6(1). Xu,Yunbi&JonathanH.Crouch.2008.Marker-assistedselectioninplantbreeding:From publicationstopractice. CropScience 48(2). Xu,Yunbi,XiaogangLiu,JunjieFu,HongwuWang,JiankangWang,ChanglingHuang,Boddu- palliM.Prasanna,MichaelS.Olsen,GuoyingWang&AiminZhang.2020.EnhancingGenetic GainthroughGenomicSelection:FromLivestocktoPlants. PlantCommunications 1(1). 100005. Xu,Yunbi,DebraJ.Skinner,HuixiaWu,NataliaPalacios-Rojas,JoseLuisAraus,JianbingYan, ShibinGao,MarilynL.Warburton&JonathanH.Crouch.2009.Advancesinmaizegenomics andtheirvalueforenhancinggeneticgainsfrombreeding. InternationalJournalofPlant Genomics 1(957602). Yu,Jianming,JamesBHolland,MichaelDMcmullen&EdwardSBuckler.2008.GeneticDesign andStatisticalPowerofNestedAssociationMappinginMaize. GeneticsSelectionEvolution 178. Zhu,Chengsong,MichaelGore,EdwardSBuckler&JianmingYu.2008.Statusandprospectsof associationmappinginplants. ThePlantGenomeJournal 1(1). Zou,Hui&TrevorHastie.2005.Regularizationandvariableselectionviatheelasticnet. Journal oftheRoyalStatisticalSocietyB 67(2). 132