INTRODUCINGSPARSITYINTOSELECTIONINDEXMETHODOLOGYWITH
APPLICATIONSTOHIGH-THROUGHPUTPHENOTYPINGANDGENOMIC
PREDICTION
By
MarcoAntonioLopezCruz
ADISSERTATION
Submittedto
MichiganStateUniversity
inpartialful˝llmentoftherequirements
forthedegreeof
PlantBreeding,GeneticsandBiotechnologyCropandSoilSciencesDoctorofPhilosophy
2020
ABSTRACT
INTRODUCINGSPARSITYINTOSELECTIONINDEXMETHODOLOGYWITH
APPLICATIONSTOHIGH-THROUGHPUTPHENOTYPINGANDGENOMIC
PREDICTION
By
MarcoAntonioLopezCruz
Researchinplantandanimalbreedinghasbeenlargelyfocusedonthedevelopmentofmethods
foramoree˚cientselectionbyalteringthefactorsthata˙ectgeneticprogress:selectionintensity,
selectionaccuracy,geneticvariance,andlengthofthebreedingcycle.Mostofthebreedinge˙orts
havebeenprimarilytowardsincreasingselectionaccuracyandreducingthebreedingcycle.
Genomicselectionhasbeensuccessfullyadoptedbymanypublicandprivatebreedingorgani-
zations.Overyears,theseinstitutionshavedevelopedandaccumulatedlargevolumesofgenomic
datalinkedtophenotypesfrommultiplepopulationsandmultiplegenerations.Thisdataabun-
danceo˙erstheopportunitytorevolutionizegeneticresearch.However,thesedatasetsarealso
increasinglyheterogeneous,withmanysubpopulationsandmultiplegenerationsrepresentedinthe
data.Thistranslatesintopotentiallyheterogeneousallelefrequenciesanddi˙erentLDpatterns,
thusleadingtoSNP-e˙ectheterogeneity.
Genomicselectionmethodsweredevelopedwithreferencetohomogeneouspopulationsin
whichSNP-e˙ectsareassumedconstantacrossthewholepopulation.Thesemethodsarenot
necessarilyoptimalforthecontemporaryavailabledatasetsformodeltraining.Therefore,a˝rst
focusofthisdissertationisondevelopingnovelmethodsthatcanleveragethelarge-scaleofmodern
datasetswhilecopingwiththeheterogeneityandcomplexityofthistypeofdata.
Inrecentyears,therehavealsobeenimportantadvancesinhigh-throughputphenotyping
(HTP)technologiesthatcangeneratelargevolumesofdataatmultipletime-pointsofacrop.
Examplesofthisincludehyper-spectralimagingtechnologiesthatcancapturethere˛ectanceof
electromagneticpowerbycropsatpotentiallythousandsofwavelengths.TheintegrationofHTP
ingeneticevaluationsrepresentsagreatopportunitytofurtheradvanceplantbreeding;however,
thehigh-dimensionalnatureofHTPdataposesimportantchallenges.Therefore,asecondfocusof
thisdissertationisonthedevelopmentofanovelapproachtoe˚cientlyincorporateHTPdatafor
breedingvaluesprediction.
Thus,thisdissertationaimstocontributenovelmethodsthatcanimprovetheaccuracyof
genomicpredictionbyoptimizingtheuseoflarge,potentiallyheterogeneous,genomicdatasets
andbyenablingtheintegrationofHTPdata.Wepresentanovelstatisticalapproachthatcombines
thestandardselectionindexmethodologywithvariable-selectionmethodscommonlyusedin
machinelearningandstatistics,anddevelopedsoftwaretoimplementthemethod.Ourapproach
o˙erssolutionstobothgenomicselectionwithpotentiallyhighlyheterogeneousgenomicdatasets,
andtheintegrationofHTPingeneticevaluations.
Dedicatedtomymotherandtothememoryofmyfather.
iv
ACKNOWLEDGEMENTS
IwanttoexpressmyspecialgratitudetoDr.GustavodelosCamposforhismentoringduringthe
curseofmydoctorate,forallhisvaluableacademicandpersonaladvice,andfortheconstructive
suggestionsduringtheplanninganddevelopmentofthisresearch.
Iwanttothankmygraduatecommittee,Dr.EricOlson,Dr.GustavodelosCampos,Dr.David
Douches,andDr.DechunWangforacceptingbeingpartofmyPh.D.committeeandfortheir
adviceandassistance.
IwanttoexpressmygratitudetoDr.JoseCrossaandDr.SusanneDreisigackerfromthe
InternationalMaizeandWheatImprovementCenter(CIMMYT)totrustmeandencouragemeto
pursueaPh.D.,andfortheirsupportingettingfundingformystudies.
Io˙ermyspecialacknowledgmenttoDr.PaulinoPerezforbeingmy˝rstcontactwiththe
˝eldofStatisticalGeneticsandhissupportinthedevelopmentofthesoftwaregeneratedfromthis
research.IwishtothanklabmembersoftheQuantGengroupandotherfriendsImetatMSUfor
theirsupportandformakinganimpactonmylifeduringmyjourneyatMSU.
IwanttoexpressmyacknowledgmentstotheMonsanto'sBeachell-BorlaugInternationalSchol-
arshipProgram(MBBISP)forsponsoringmeduringthe˝rstfouryearsofmydoctorate.Likewise,
IextendmygratitudetotheMSUGraduateSchoolforprovidingmethedissertationcompletion
fellowshipinthelastsemesterofmyPh.D.IwouldalsoliketothankDr.JasonandDr.DanaLily,
andMr.ChrisandMrs.JudithRossmanforgrantingmethegraduatestudentfunds.
Lastly,Iwouldliketothankmyfamily,myparentsAmparoCruzandAntonioLopez,andmy
siblingsJuan,Vlady,andLuis,foralltheirunconditionalsupporttomakethisdreampossible.
v
TABLEOFCONTENTS
LISTOFTABLES
.......................................
ix
LISTOFFIGURES
.......................................
xi
CHAPTER1INTRODUCTION
...............................
1
1.1Chapter2:Incorporatinghyper-spectralimagedataintoselectionindicesfor
breedingvalueprediction...............................4
1.2Chapters3and4:Improvingtheaccuracyofgenomicpredictionusingsparse
selectionindices....................................6
CHAPTER2REGULARIZEDSELECTIONINDICESFORBREEDINGVALUEPRE-
DICTIONUSINGHYPER-SPECTRALIMAGEDATA
...........
8
2.1Abstract........................................9
2.2Introduction......................................9
2.3Results.........................................11
2.3.1Regularizedselectionindices.........................11
2.3.1.1Reduced-rankselectionindices..................12
2.3.1.2Penalizedselectionindices.....................12
2.3.2Accuracyofindirectselection........................14
2.3.3Regularizedselectionindicesforwheatgrainyieldusinghyper-spectral
imagedata...................................14
2.3.4Regularizationimprovestheheritabilityandtheaccuracyoftheindex...15
2.3.5Usingdatafrommultipletime-pointsfurtherimprovesselectionaccuracy.18
2.3.6L1-penalizationleadstosparseselectionindices...............18
2.3.7Comparisonwithphenotypicprediction...................19
2.4Discussion.......................................20
2.4.1IntegrationofPSIandPC-SIintogeneticevaluations............22
2.4.2Impactoftheuseofhigh-throughputphenotypesinbreedingprograms...24
2.4.3Regularizedselectionindicescanalsobeavaluabletoolingeneticresearch24
2.5Methods........................................25
2.5.1Standardselectionindex...........................25
2.5.2Reduced-rankselectionindex.........................26
2.5.3Penalizedselectionindices..........................26
2.5.4Data......................................28
2.5.5Phenotypepre-processing...........................28
2.5.6Heritabilityestimation............................29
2.5.7Training-testingpartitions..........................30
2.5.8Estimationofphenotypicandgeneticparameters..............31
2.5.9Estimationoftheaccuracyofindirectselection...............31
2.5.10Software....................................31
2.5.11Materialsanddataavailability........................32
vi
2.6Acknowledgments...................................32
CHAPTER3OPTIMALBREEDINGVALUEPREDICTIONUSINGASPARSEAM-
ILINDEX
..................................
33
3.1Abstract........................................34
3.2Introduction......................................34
3.3MaterialsandMethods................................36
3.3.1SparseSelectionIndexMethodology.....................37
3.3.2Data......................................38
3.3.3Analyses....................................39
3.3.4Software....................................41
3.3.5Dataavailability................................41
3.4Results.........................................41
3.4.1Sparsityimprovespredictionaccuracy....................41
3.4.2Usinganinternalcross-validationtoachieveoptimalsparsity........43
3.4.3SparseSelectionIndicesbuildsubject-speci˝ctrainingsets.........45
3.4.4Genomicrelationshipsandweightsinstandardandsparseselectionindices47
3.5Discussion.......................................48
3.6Acknowledgments...................................52
CHAPTER4GENOMICPREDICTIONINMULTI-GENERATIONALMAIZEHY-
BRIDSUSINGSPARSEKERNELMODELS
................
53
4.1Abstract........................................54
4.2Introduction......................................54
4.3MaterialsandMethods................................57
4.3.1Genotypesandphenotypicdata........................57
4.3.2Phenotypespre-processing..........................58
4.3.3Genomicselectionmodels..........................59
4.3.4Variancecomponents.............................62
4.3.5Assessmentofpredictionaccuracy......................63
4.3.6Software....................................63
4.4Results.........................................65
4.4.1PredictionaccuracycomparisonofG-BLUPandK-BLUPmodels.....65
4.4.2E˙ectofsparsityonpredictionaccuracy...................66
4.4.3Automatictraining-sampleselection.....................70
4.5Discussion.......................................73
CHAPTER5CONCLUDINGREMARKSANDFUTUREDIRECTIONS
.........
76
APPENDICES
.........................................
77
APPENDIXASUPPLEMENTARYFIGURESANDTABLESFROMCHAP-
TER2
.................................
78
APPENDIXBSUPPLEMENTARYMATERIALFROMCHAPTER3
.......
89
APPENDIXCSUPPLEMENTARYFIGURESANDTABLESFROMCHAP-
TER4
.................................
106
vii
BIBLIOGRAPHY
........................................
123
viii
LISTOFTABLES
Table2.1:Averagegrainyieldandheritabilitybyenvironmentalcondition..........15
Table2.2:Accuracyandrelativee˚ciencyofindirectselectionofanL1-penalizedSI
usingdatafromoneandninetime-points......................18
Table3.1:Predictionaccuracy(averageacross100partitions)achievedbysparsese-
lectionindices(SSIs)andbytheG-BLUP(standardSI),bydatasetand
environmentalcondition...............................44
Table4.1:Trainingset(TS)compositionusedineachpredictionscenario.(Thepredic-
tionsetwasthesameforalltrainingscenariosandconsistedof612randomly
chosenindividualsfrom2019)............................64
Table4.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including15%ofsubjectsfromthe2019cycle),GY-OPTtrait-environment
combination.....................................67
Table4.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including15%ofsubjectsfromthe2019cycle),PH-OPTtrait-environment
combination.....................................69
TableA.1:Accuracyofindirectselection(averageover100training-testingpartitions)
forbestphenotypicprediction(principalcomponents(PCR),L1-penalized
prediction(L1-Phen),RNDVI,andGNDVI)andforbestgenotypicprediction
(standardSI,optimalPC-SI,L1-PSI,andL2-PSI).................87
TableB.1:Numberofavailableobservations,averagegrainyield,andheritabilityby
environmentalconditionfortheWheat-largedataset................96
TableB.2:Numberofavailableobservations,averagegrainyield,andheritabilityby
environmentalconditionfortheWheat-599dataset................96
TableB.3:Maximumpredictionaccuracy(averageacross100partitions)achievedbythe
SSIfordi˙erentvaluesoftheparameter

ofanElastic-Net-typeSSI,by
environmentalconditionfortheWheat-largedataset................97
TableB.4:Maximumpredictionaccuracy(averageacross100partitions)achievedbythe
SSIfordi˙erentvaluesoftheparameter

ofanElastic-Net-typeSSI,by
environmentalconditionfortheWheat-599dataset................98
ix
TableC.1:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,and10%ofsubjectsfromthe2019cycle),traitGY,
environmentOPT..................................115
TableC.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitGY,
environmentDRT..................................117
TableC.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,and10%ofsubjectsfromthe2019cycle),traitPH,
environmentOPT..................................119
TableC.4:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitPH,
environmentDRT..................................121
x
LISTOFFIGURES
Figure2.1:Predictionofthegeneticmeritforgrainyieldusinghyper-spectralcropimage
data.(A)Dataconsistsofhyper-spectralre˛ectancedata(
x
i
)andphenotypic
measurementsofthetargettrait(
y
i
,e.g.,grainyield).(B)Asubsetofthedata
(thetrainingset)isusedtoderivethecoe˚cientsofaselectionindex.(C)
Thesecoe˚cientsarethenappliedtoimagedataofindividualsinthetesting
settoderivetheindex(
I
i
)foreachindividual.Thepredictiveabilityofthe
indexisassessedbycalculatingtheaccuracyofindirectselection(
Acc
¹Iº
)
inthetestingset...................................13
Figure2.2:AccuracyofindirectselectionofregularizedSIsanditscomponents.Square
rootheritability(green),geneticcorrelation(orange),andaccuracyofindirect
selection(purple,allaveragedover100training-testingpartitions),versusthe
numberofpredictorsusedtobuildtheindex:(A)numberofactivebands
inthecaseoftheL1-PSI,or(B)numberofPCsinthePC-SI.Eachpanel
representsoneenvironment(latesttime-point)...................16
Figure2.3:Accuracyofindirectselectionachievedbyastandard(SI)andbyregularized
(PC-SIandL1-PSI)selectionindices.Thelinesprovidetheaverageaccuracy
over100training-testingpartitions.Verticallinesrepresenta95%con˝dence
intervalfortheaverage.Thehorizontalaxisgivesthetime-pointatwhich
imageswerecollectedandareexpressedinbothdaysaftersowing(DAS)and
stages(VEG=vegetative,GF=grain˝lling,MAT=maturity)............17
Figure2.4:Heatmapofregressioncoe˚cientsforL1-penalizedselectionindices.Sepa-
rateindiceswerederivedforeachenvironmentusingmultitime-pointdata.
DAS=daysaftersowing,VEG,GF,MATrepresentvegetative,grain-˝lling
andmaturitystages,respectively.Thebottomcolor-barshowsthelightcolor
associatedwitheachwavebandinthevisiblespectrum(

750nm);blackwas
usedtorepresentthenear-infraredspectrum(wavelength>750nm).......19
Figure3.1:Predictionaccuracy(averageacross100trn-tstpartitions)oftheSSIversus
the(average)numberofpredictorsintrainingsetsupportingtheSSIofeach
individualintestingset(x-axis).Genomic-BLUP(bluerightmostpoint)
appearsasaspecialcaseofanSSI.Eachpanelrepresentsoneenvironment
withindataset.(A)Wheat-largedataset.(B)Wheat-599dataset.Vertical
barsrepresenta95%con˝denceintervalfortheaverage..............42
xi
Figure3.2:Predictionaccuracyoftheoptimalsparseselectionindex(SSI)versusthatof
theG-BLUP.Eachpointrepresentsatrn-tstpartition(atotalof100partitions
wereimplemented),thepointshapeandcolorrepresentenvironments.(A)
Wheat-largedataset.(B)Wheat-599dataset.Thevalueof

intheSSI
wasestimatedusing105-foldcross-validationsconductedwithinthetraining
data.Inparenthesis,bythelegend,isthep-valueforthetwo-sidedSign
(binomial)testforwithin-environmentdi˙erencesinaccuracybetweenthe
SSIandtheG-BLUP................................43
Figure3.3:Distributionofthenumberoftrainingsupportpoints(
n
sup
)inoptimalsparse
selectionindices(resultsobtainedover100trn-tstpartitions;
n
trn
=sizeofthe
trainingdataset),byenvironmentalcondition,Wheat-largedataset........45
Figure3.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)and
thecorrespondingsupportpoints(green).Greypointsrepresentgenotypes
thatdidnotcontributetothepredictionofthegeneticvalueofthegenotypein
yellow.Allpanelsrepresentsolutionsfortheenvironment
EHT
,Wheat-large
dataset.......................................46
Figure3.5:(A)Weights(

ij
)ofastandardSI(G-BLUP)andoftheoptimalsparse
selectionindex(SSI)versusthegenomicrelationship(
g
ij
).(B)Proportionof
weightsintheSSIthatwerezero(non-active)andnon-zero(supportpoints);
Wheat-largedataset,environment
EHT
......................48
Figure4.1:(A)First3principalcomponentsoftheadditivegenomicrelationshipsmatrix,
G
.Pointsrepresentindividualsthatarecolorseparatedaccordingtocycle
(2017,2018,or2019).(B)Heatmapofthegenomicrelationshipsmatrix.....64
Figure4.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthe
datafromthe2017,2018,or2017+2018cyclesalone(top-leftpanel),orin
combinationwithaproportion(5%,10%,15%)ofthedatafromthe2019
cycle.Thepredictionsetconsistedof612genotypesfromthe2019cycle
thatwerenotusedformodeltraining.Modelswiththesameletterwithin
panelindicatenosigni˝cantdi˙erencefromeachother(

=
0
:
05
,ANOVA
followedbyTukeytest).GY-OPTtrait-environmentcombination.........66
Figure4.3:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(hori-
zontalaxis)versusthepredictionaccuracyofallothermodels(verticalaxis
ofeachpanel).(B)Predictionaccuracyofthestandard*-BLUPmodel(hori-
zontalaxis)versusthepredictionaccuracyofitssparseversion(verticalaxis),
bytypeofkernelusedinpanels.Eachpointrepresentatraining-testingpar-
titionwithineachtrainingsetcomposition.Coloredpointsabove(below)the
45degreelinerepresentcasesforwhichonemodeloutperformedtheother
model.P:p-valueforthetest(fromANOVA)fordi˙erencesinaccuracy
betweenthetwomodels.TraitGY,environmentOPT...............68
xii
Figure4.4:Heatmapofthecoe˚cientsintheHatmatrix(
~
B
¹

º
G
)ofthesparseG-
BLUPmodelforonetraining-prediction(TS-PS)partitioninthepredictionof
n
PS
=
612
individualsfrom2019using
n
TS
=
2427
individuals(2017+2018
plus15%ofthe2019set).Predictedindividualsarepresentedincolumnsand
trainingindividualsarepresentedinrowsseparatedbycycleandnumberof
individualsinparentheses.Thevalueof

wasobtainedbycross-validation.
Eachcolumnrepresentsvaluesofthevector
~
b
¹

º
i
G
=
f
~
b
ij
g
,
j
=
1
;:::;
2427
(Equation(4.6)).Individualsnocontributingtothepredictionhaveacoe˚-
cient
~
b
ij
=
0
representedingreycolor.Individualswithanon-zerocoe˚cient
areshowninayellow-bluelogarithmscale(intheoriginalscale,yellowindi-
cateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environment
combination.....................................71
Figure4.5:Proportionofthetrainingindividualsfromeachcyclethatcontributedtothe
predictionofthe612testinggenotypesfrom2019,usingsparsemodelswith
di˙erentrelationshipmatrices(horizontalaxis):
G
,
K
1
,
K
2
,or
K
A
.The
trainingsetwascomposedbyindividualsfrom2017(
n
=
901
)and2018
(
n
=
1417
)alone(top-leftpanel)orincombinationwithaproportion(5%,
10%,15%)ofthedatafromthe2019cycle.GY-OPTtrait-environment
combination.....................................72
FigureA.1:Box-plotofgrainyieldphenotypicrecordsbyenvironmentalcondition.
n
ˇ
3200
observationswithinenvironment.SD:standarddeviation..........79
FigureA.2:Lightre˛ectancepatternsasfunctionofthewavelength.Eachlinerepresents
themean(across
n
ˇ
3200
observations)re˛ectanceforeachwaveband,
withintime-point(˛ightdate).Withineachenvironment,meanswerescaled
toliewithin0and1bydividingthembythemaximumaverage..........80
FigureA.3:AccuracyofindirectselectionofL1-PSIanditscomponents.Squareroot
heritability,geneticcorrelationandaccuracyofindirectselection,allaveraged
over100training-testingpartitionsversusthenumberofbandsenteringin
theindex;bytime-point(DAS=daysaftersowing,Stage:VEG=vegetative,
GF=grain˝lling,orMAT=maturity)withinenvironment.............81
FigureA.4:AccuracyofindirectselectionofL2-PSIanditscomponents.Squareroot
heritability,geneticcorrelationandaccuracyofindirectselection,allaver-
agedover100training-testingpartitionsversusthepenalizationparameter
(

,logarithmscale)usedtobuildtheindex;bytime-point(DAS=daysafter
sowing,Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)within
environment.....................................82
xiii
FigureA.5:AccuracyofindirectselectionofPC-SIanditscomponents.Squareroother-
itability,geneticcorrelationandaccuracyofindirectselection,allaveraged
over100training-testingpartitionsversusthenumberofprincipalcompo-
nentsusedtobuildtheindex;bytime-point(DAS=daysaftersowing,Stage:
VEG=vegetative,GF=grain˝lling,orMAT=maturity)withinenvironment....83
FigureA.6:Squarerootofheritabilityofthestandard(SI),oftheregularized(PC-SIand
L1-PSI)selectionindices,andoftheRNDVI.Thelinesprovidetheaverage
squarerootheritabilityover100training-testingpartitions.Verticallines
representa95%CIfortheaverage.Thehorizontalaxisgivethetime-point
atwhichimageswerecollectedandareexpressedinbothdaysaftersowing
(DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity)......84
FigureA.7:Geneticcorrelationbetweengrainyieldandall:thestandard(SI),thereg-
ularized(PC-SIandL1-PSI)selectionindices,andtheRNDVI.Thelines
providetheaveragegeneticcorrelationover100training-testingpartitions.
Verticallinesrepresenta95%CIfortheaverage.Thehorizontalaxisgivethe
time-pointatwhichimageswerecollectedandareexpressedinbothdaysafter
sowing(DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity)..85
FigureA.8:Phenotypic,genetic,andenvironmentalcovariances(absolutevalue)between
wavebandsandgrainyield.'D':discrepancybetweenphenotypicandgenetic
covariancesasmeasuredbythesumoftheabsolutedi˙erences;bytime-
point(DAS:daysaftersowing,Stage:VEG=vegetative,GF=grain˝lling,
MAT=maturity)withinenvironment........................86
FigureB.1:Toptwoprincipalcomponentsofthegenomicrelationshipmatrix,
G
,for
eachdataset.Eachpointrepresentindividuals.(A)Wheat-599dataset.(B)
Wheat-largedataset.Individualsarecolor-groupedbythecycle(sowing-
harvestyear).....................................90
FigureB.2:Boxplotofgrainyieldphenotypicrecords(intonha

1
)byenvironmental
conditionforbothWheat-599andWheat-largedatasets.SDstandarddeviation.91
FigureB.3:Distributionofthenumberoftrainingsupportpoints(
n
sup
)inoptimalsparse
selectionindices(resultsobtainedover100trn-tstpartitions;
n
trn
=sizeofthe
trainingdataset),byenvironmentalcondition,Wheat-599dataset........92
FigureB.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)and
thecorrespondingsupportpoints(green).Greypointsrepresentgenotypes
thatdidnotcontributetothepredictionofthegeneticvalueofthegenotype
inyellow.Allpanelsrepresentsolutionsfortheenvironment1,Wheat-599
dataset.......................................93
xiv
FigureB.5:(leftandcenter)Weights(

ij
)ofastandardSI(G-BLUP)andoftheoptimal
sparseselectionindex(SSI)versusthegenomicrelationship(
g
ij
),and(right)
proportionofweightsintheSSIthatbelongedtoeitherthesupportingor
non-activesets,bygenomic-relationship;byenvironment,Wheat-largedataset.94
FigureB.6:(leftandcenter)Weights(

ij
)ofastandardSI(G-BLUP)andoftheoptimal
sparseselectionindex(SSI)versusthegenomicrelationship(
g
ij
),and(right)
proportionofweightsintheSSIthatbelongedtoeitherthesupportingor
non-activesets,bygenomic-relationship;byenvironment,Wheat-599dataset..95
FigureC.1:Genomicrelationships(
G
ij
)versuskernelrelationships(
K
ij
)ofindividuals
incycle2019withthoseincycles2017(left)and2018(right).
G
i
j
and
K
i
j
arethe
ij
th
elementof
G
and
K
¹

º
,respectively.(A)
K
1
=
K
¹
0
:
2
º
.(B)
K
2
=
K
¹
1
º
.(C)
K
3
=
K
¹
5
º
.............................107
FigureC.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthe
datafromthe2017,2018,or2017+2018cyclesalone(top-leftpanel),orin
combinationwithaproportion(5%,10%,15%)ofthedatafromthe2019
cycle.Thepredictionsetconsistedof612genotypesfromthe2019cycle
thatwerenotusedformodeltraining.Modelswiththesameletterwithin
panelindicatenosigni˝cantdi˙erencefromeachother(

=
0
:
05
,ANOVA
followedbyTukeytest).GY-DRTtrait-environmentcombination.........108
FigureC.3:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthe
datafromthe2017,2018,or2017+2018cyclesalone(top-leftpanel),orin
combinationwithaproportion(5%,10%,15%)ofthedatafromthe2019
cycle.Thepredictionsetconsistedof612genotypesfromthe2019cycle
thatwerenotusedformodeltraining.Modelswiththesameletterwithin
panelindicatenosigni˝cantdi˙erencefromeachother(

=
0
:
05
,ANOVA
followedbyTukeytest).TraitPH(OPTandDRTenvironments).........109
FigureC.4:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(hori-
zontalaxis)versusthepredictionaccuracyofallothermodels(verticalaxis
ofeachpanel).(B)Predictionaccuracyofthestandard*-BLUPmodel(hori-
zontalaxis)versusthepredictionaccuracyofitssparseversion(verticalaxis),
bytypeofkernelusedinpanels.Eachpointrepresentatraining-testingpar-
titionwithineachtrainingsetcomposition.Coloredpointsabove(below)the
45degreelinerepresentcasesforwhichonemodeloutperformedtheother
model.P:p-valueforthetest(fromANOVA)fordi˙erencesinaccuracy
betweenthetwomodels.GY-DRTtrait-environmentcombination........110
xv
FigureC.5:(left)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(hor-
izontalaxis)versusthepredictionaccuracyofallothermodels(verticalaxis
ofeachpanel).(right)Predictionaccuracyofthestandard*-BLUPmodel
(horizontalaxis)versusthepredictionaccuracyofitssparseversion(vertical
axis),bytypeofkernelusedinpanels.Eachpointrepresentatraining-testing
partitionwithineachtrainingsetcomposition.Coloredpointsabove(be-
low)the45degreelinerepresentcasesforwhichonemodeloutperformed
theothermodel.P:p-valueforthetest(fromANOVA)fordi˙erencesin
accuracybetweenthetwomodels.TraitPH(OPTandDRTenvironments)....111
FigureC.6:Heatmapofthecoe˚cientsintheHatmatrix(
~
B
¹

º
K
)ofthesparseKA-
BLUPmodelforonetraining-prediction(TS-PS)partitioninthepredictionof
n
PS
=
612
individualsfrom2019using
n
TS
=
2427
individuals(2017+2018
plus15%ofthe2019set).Predictedindividualsarepresentedincolumnsand
trainingindividualsarepresentedinrowsseparatedbycycleandnumberof
individualsinparentheses.Thevalueof

wasobtainedbycross-validation.
Eachcolumnrepresentsvaluesofthevector
~
b
¹

º
i
K
=
f
~
b
ij
g
,
j
=
1
;:::;
2427
(Equation(4.6)).Individualsnocontributingtothepredictionhaveacoe˚-
cient
~
b
ij
=
0
representedingreycolor.Individualswithanon-zerocoe˚cient
areshowninayellow-bluelogarithmscale(intheoriginalscale,yellowindi-
cateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environment
combination.....................................112
FigureC.7:Proportionofthetrainingindividualsfromeachcyclethatcontributedtothe
predictionofthe612testinggenotypesfrom2019,usingsparsemodelswith
di˙erentrelationshipmatrices(horizontalaxis):
G
,
K
1
,
K
2
,or
K
A
.The
trainingsetwascomposedbyindividualsfrom2017(
n
=
901
)and2018
(
n
=
1417
)alone(top-leftpanel)orincombinationwithaproportion(5%,
10%,15%)ofthedatafromthe2019cycle.GY-DRTtrait-environment
combination.....................................113
FigureC.8:Proportionofthetrainingindividualsfromeachcyclethatcontributedto
thepredictionofthe612testinggenotypesfrom2019,usingsparsemodels
withdi˙erentrelationshipmatrices(horizontalaxis):
G
,
K
1
,
K
2
,or
K
A
.
Thetrainingsetwascomposedbyindividualsfrom2017(
n
=
901
)and
2018(
n
=
1417
)alone(top-leftpanels)orincombinationwithaproportion
(5%,10%,15%)ofthedatafromthe2019cycle.TraitPH(OPTandDRT
environments)....................................114
xvi
CHAPTER1
INTRODUCTION
Plantbreedingbeganthousandsofyearsagowhenhumansstarteddomesticatingwildspecies
turningthemintocrops(Tangetal.,2010).Domesticationoccurredwhenthemostpreferableplants
wereselectedtobepropagatedforapurpose(e.g.,foodrequirement).Thetransformationfrom
shatteringtonon-shatteringcereals(e.g.,riceandwheat)tofacilitateharvest,andthedeveloping
ofthecultivatedmaizefromitswildrelative(teosinte)aretwoimportantexamplesofplants'
domestication.Inthenineteenthcentury,hybridizationexperimentsandthediscoveryofthe
rulesofinheritancebyGregorMendelandDarwin'stheoryofnaturalselectionshedlightonthe
foundationsofadaptationbymeansofnaturalorarti˝cialselection.
Massphenotypicselection
hasbeenextremelysuccessfulinproducingmoderncultivarsand
hybridsofgreatagronomicpotential(Jiang,2013).Theseadvancesrequirednotonlyselectionfor
yieldpotentialbutalsoformorerobustplants;thedevelopmentofthehigh-yieldingsemi-dwarf
wheatandricevarietiesduringtheRevisaclearexampleofhowplantarchitecture
andplanphysiologyneedstobeadaptedtoimproveagronomicperformance.
Theexpectedrateofgeneticprogressfromselectionisdeterminedbytheinterplayingoffour
factors:selectionintensity,selectionaccuracy,geneticvariance,andlengthofthebreedingcycle.
Despiteofthegreatprogressachievedbymeansofdirectphenotypicselection,thistechnologyhas
severallimitations:(
i
)selectionaccuracyisboundedbytraitheritability,(
ii
)extensive(andexpen-
sive)phenotypictestingisrequiredtoachievehighselectionintensity,and(
iii
)theopportunities
toshortenthebreedingcyclearelimited.Inthelastcentury,researchinplantandanimalbreeding
hasbeenlargelyfocusedonthedevelopmentoftechnologiesthatcanimproveselectionbyaltering
thefourfactorsthata˙ectgeneticprogress.
Moreaccuratepredictionsofbreedingvalues(BV)canbeobtainedusingstatisticalmethods
(e.g.,selectionindices,BestLinearUnbiasedPrediction,BLUP)thatsmooth-outenvironmental
componentsofinter-individualdi˙erencesinphenotypesandenableborrowingofinformation
1
betweenrelatedindividuals.
SelectionIndex(SI)methodology
beganbypredictingBVsobtainedbycombiningindividual
performancedataandprogenyevaluations(Lush,1935).Smith(1936)andHazel(1943)generalized
ideasusedinprogenytestingtoamoregeneralframework(SI)whichusesallinformativedata.A
phenotypicmeasurementisinformativeabouttheBVofaselectioncandidateifitisgenetically
correlatedwiththeBV.Informativephenotypicmeasurementsincludetherecordsoftheselection
goalontheselectioncandidateorinrelativesoftheselectioncandidate,andmeasurementsoftraits
geneticallycorrelatedwiththeselectionobjective.Inaselectionindex,thetotalgeneticvalueof
anindividualispredictedusingaweightedsumoftheavailablephenotypicmeasurements.The
weightsoneachofthemeasuredphenotypesdependongeneticandphenotypic(co)variances.
Henderson'sBLUP
:Bythemiddleofthetwentycentury,C.R.Henderson(1963)introduced
theconceptoftheBestLinearUnbiasedPredictor(BLUP)ofthebreedingvalues.BLUPsof
BVsareobtainedusingmixedmodelsthatincorporategeneticrelationshipsamongallevaluated
individuals.Henderson'sBLUPcanbeshowntobeequivalenttoaSI;however,theBLUP
methodologyprovidesanadequateframeworkforsimultaneousmodelingofgeneticandnon-
genetice˙ects(e.g.,location,year,year-location,block).BLUPalsoprovidesaframeworkforthe
estimationofgeneticandenvironmental(co)varianceparameters.
SelectionindexandpedigreeBLUPbecomethemethodofchoiceforBVpredictioninthe
secondhalfofthetwentycenturyandthebeginningofthecurrentcentury.However,pedigree
informationpresentssomelimitations;forexample,pedigreeandancestraldatacannotpredict
inter-individualdi˙erencesingeneticvaluesbetweenmembersofabi-parentalfamily.
Thedevelopmentofmolecularmarkers(e.g.,DNAmarkers)hasbene˝tedplantsciencein
manyaspects,includinggermplasmcharacterization,geneintrogression,andthedevelopmentof
DNA-assistedprediction/selectionmethods(Xu&Crouch,2008).
Marker-assistedselection
(MAS)reliesontheidenti˝cationandvalidationofDNAmarkers
predictive(i.e.,tightlylinked)ofquantitativetraitloci(QTL)genotypes.MAScanincrease
selectionaccuracyandmayenableearlyscreening(e.g.,attheseedlingstage).MAShasbeenan
2
e˚cienttoolinthedetectionandvalidationintheimprovementofqualitativetraits(e.g.,disease
resistanceandgrainquality)invariouscrops(e.g.,Williametal.,2007;Xuetal.,2009;Miahetal.,
2013).However,mostofthetraitsofagronomicperformance(e.g.,grainyield,oilconcentration)
area˙ectedbyalargenumberofsmall-e˙ectloci.Detectingmarker-QTLassociationsforcomplex
traitsrequiresanexceedinglylargersamplesize(Melchingeretal.,1998)andspeci˝cdesigns
aimingtomaximizepower(e.g.,Yuetal.,2008;Zhuetal.,2008).Thedi˚cultyofmapping
small-e˙ectQTLlimitedtheadoptionofMASfortheimprovementoftraitsa˙ectedbyalarge
numberofQTLs.
GenomicSelection(GS)
:Thecontinueddevelopmentofgenotypingandsequencingtechnolo-
gieshasledtoasteadydecreaseingenotypingcosts.Inthe˝rstdecadeofthe21stcentury,
arrays,includingtensofthousandsofSNPs,becomeavailableformostagriculturalspecies.These
genotypingarrayso˙eredtheopportunitytotrack,vialinkagedisequilibriumwithcausalvariants,
geneticssignalsdistributedovertheentiregenomes.Inalandmarkpublication,Meuwissenetal.
(2001)proposedusingalargenumberofSNPsforbreedingvalueprediction.Genomicselection
exploitsmulti-locuslinkagedisequilibrium(LD)withthecausalvariants;withenoughmarker
density,GScanpotentiallycaptureallgeneticsignals(He˙neretal.,2009).
Genomicselectionhasbeensuccessfullyadoptedbymanypublicandprivatebreedingorgani-
zations.SamplesizehasbeenrecognizedasoneofthemainfactorslimitingtheaccuracyofGS
(Daetwyleretal.,2008;Goddard,2009;delosCamposetal.,2013a).Earlyimplementationsof
GSinplantbreedingwerebasedonrelativelysmalltrainingdatasets.However,overyears,private
andpublicorganizationshaveaccumulatedlargevolumesofgenomicdatalinkedtophenotypes
frommultiplepopulationsandmultiplegenerations.Theverylargesamplesizeofthesedatasets
impliesthathighlycomplexgenomicpredictionmodelscannowbeaccuratelycalibrated.However,
thesedatasetsarealsoincreasinglyheterogeneous,withmanysubpopulationsandmultiplegener-
ationsrepresentedinthedata.Thistranslatesintopotentiallyheterogeneousallelefrequenciesand
di˙erentLDpatterns,thusleadingtoSNP-e˙ectheterogeneity.
GSmethodsweredevelopedwithreferencetohomogeneouspopulationsinwhichSNP-e˙ects
3
canbeassumedconstantacrosssubsetsofthedatasets;thesemethodsarenotnecessarilyoptimalfor
thetypeofdatasetsavailableformodeltraining.Therefore,animportantfocusofthisdissertation
isondevelopingnovelmethodsthatcanleveragethelargesamplesizeofmoderngenomicdata
setswhilecopingwiththechallengesposedbytheinherentheterogeneityandcomplexityofthese
datasets.
Theadventofhigh-throughputphenotyping(HTP)
:Inrecentyears,therehasbeenanimpor-
tantimprovementinHTPtechnologies.Modernphenotypingplatformscangeneratelargevolumes
ofdataatmultipletime-pointsofacrop(Montesetal.,2007).Examplesofthisincludetheuseof
hyper-spectralimagingtechnologieswhichcancapturetheabsorbanceofelectromagneticpower
bycropsatpotentiallythousandsofwavelengths.TheintegrationofHTPingeneticevaluations
representsagreatopportunitytofurtheradvanceplantbreeding;however,thehigh-dimensional
natureofHTPdataposesimportantchallenges.Therefore,asecondfocusofthisdissertationison
thedevelopmentofanovelapproachforBVpredictionusingHTPdata.
Thus,theoverallaimofthisdissertationistocontributenovelmethodsthatcanimprovethe
accuracyofgenomicpredictionbyoptimizingtheuseoflarge,potentiallyheterogeneous,genomic
datasetsandbyenablingtheintegrationofHTPdata.Toachievethesegoals,wedevelopedanovel
statisticalapproachthatcombinesthestandardSImethodologywithsparsity-inducingmethods
commonlyusedinthe˝eldofstatisticsandmachinelearning.Theprocedurethatwedevelop,
whichwenamedassparseselectionindex(SSI),o˙erssolutionstobothGSwithpotentiallyhighly
heterogeneousgenomicdatasets,andtheintegrationofHTPingeneticevaluations.
1.1Chapter2:Incorporatinghyper-spectralimagedataintoselectionin-
dicesforbreedingvalueprediction
TheuseofHTPtechnologiescanenablescreeningalargernumberofgenotypesovermany
environmentsandlocations,thuso˙eringopportunitiestoincreaseselectionintensity.Furthermore,
indirectselectionbasedonphenotypescollectedearlyinthegrowingcyclecanleadtoareduction
inthelengthofthebreedingcycleofperennials.Therefore,theintegrationofHTPinbreeding
4
evaluationscanleadtoanincreaseofgeneticprogressbyeitherenablingamoreintensiveand/or
fasterselection.
Mostoftheresearche˙ortsoncropimaginghavefocusedondevelopingmethodstopredict
phenotypesfromHTPdata.Forinstance,vegetationindices(VI)derivedfromspectradata(e.g.,
NDVI;Tucker,1979),havebeenusedtopredictgreenbiomass,leafarea,chlorophyllcontent,
andyieldinwheatandmaizeunder˝eldconditions(e.g.,Babaretal.,2006;Garrigaetal.,2017;
Haboudaneetal.,2002;Rutkoskietal.,2016);andtodetectabiotic(e.g.,Roemeretal.,2012)and
abioticstress(e.g.,Mahleinetal.,2012)incrops.VIsusinginformationfromareducednumber
ofwavelengthsinthespectrum.Morerecently,researchershaveconsidereddevelopingmethods
thatusewhole-spectradata.Oneapproachconsistsofusingdimension-reductiontechniques(e.g.,
principalcomponents,PC,andpartialleastsquares,PLSregression)topredictagronomictraits,
e.g.,biomass(e.g.,Hansen&Schjoerring,2003)orgrainyieldinwheat(e.g.,Ferrioetal.,2005;
Hernandezetal.,2015)andinmaize(e.g.,Weberetal.,2012;Aguateetal.,2017).Another
approachconsistsofintroducingwhole-spectradataintoBayesianorpenalizedregression;this
approachhasbeenusedtopredictmilkcomponentsfrommilk-spectradata(Ferraginaetal.,2015)
andgrainyieldinwheat(e.g.,Montesinos-Lópezetal.,2017)andinmaize(e.g.,Aguateetal.,
2017)fromhyper-spectralcropimaging.
Theapproachesdescribedintheprecedingparagrapharewell-suitedforphenotypicprediction
butcanbesub-optimalforselectionbecausethebestpredictorofaphenotypeisnotnecessarily
thebestpredictorofthegeneticvalue.
Therefore,toaddressthelimitationsofexistingmethods,in
Chapter2
wepresentanovel
methodologytodeveloppenalizedandreduced-rankSIsusinghigh-dimensionalphenotypes.Our
approachintegratesintoauni˝edframeworkstandardSImethodologywithmethodsusedinhigh-
dimensionalregressionthatcanpreventover-˝tting.Weevaluatetheproposedmethodologyusing
amulti-environmentwheatdataset(
n
˘
3
;
200
)containinghyper-spectral(
p
=
2
;
250
wavebands)
andgrainyieldinformation.Ourresultsshowthatpenalizedandreduced-rankSIso˙erimproved
selectionaccuracy(
˘
10

40
%gain)relativetothestandard(i.e.,non-regularized)SI.
5
1.2Chapters3and4:Improvingtheaccuracyofgenomicpredictionusing
sparseselectionindices
Thesetwochaptersfurtherdevelopthepenalizedselectionindexmethodologyintroducedin
Chapter2todevelopmethodstoimprovethepredictionaccuracyofGS.Thecontextthatmotivates
thedevelopmentofthesemethodsisthatoflargeandpotentiallyheterogeneousgenomicdatasets
forwhichtheassumptionofe˙ect-homogeneitymaynothold.
Thechallengesposedbytheavailabilityoflarge,potentiallyheterogeneous,datasetshavebeen
recognizedinrecentyears,andtherehavebeenseveralattemptstodevelopmethodstoconfrontthese
challenges.A˝rstlineofresearchinvolvesusingmodelsthatincludeinteractionsbetweenSNPs
andgeneticgroups.Inthesemodels,SNP-e˙ectse˙ectsarerepresentedasthesumofamaine˙ect
plusadeviationthatisgroup-speci˝c(e.g.,Schulz-Streecketal.,2012;delosCamposetal.,2015;
Veturietal.,2019).Asimilar(insomecasesstatisticallyequivalent)approachistousemultivariate
modelsinwhichtheSNPe˙ectsareassumedtobedi˙erentbutcorrelatedbetweendi˙erentgenetic
groups(e.g.,Olsonetal.,2012;Lehermeieretal.,2015).Thesemethodsarewell-suitedfordata
setsinwhichindividualsclusterintode˝nedgroupswhichmaybeknowninadvance(e.g.,breeds,
families,pools)ormaybeinferredusingstatisticalmethods(e.g.,STRUCTURE;Pritchardetal.,
2000).Themethodsjustdescribedhaveshownpromise;however,theyarenotwellsuitedfor
datasetsinwhichgeneticheterogeneityexhibitsmorecomplex/crypticpatternsandthuscannotbe
reducedtotheclusteringofindividualsintowell-de˝neddistinctgroups.
Anotherlineofresearchconsistsofidentifying,foragivenpredictionsetasubsetofthetraining
datathatmaybeoptimaltopredictbreedingvaluesoftheselectioncandidates.Thisapproachis
motivatedbythefactthatgeneticsimilarities(e.g.,familyrelationships)haveagreatimpacton
predictionaccuracy(Habieretal.,2010).Indeed,severalstudieshaveshownthattheaccuracy
ofGScanbeverylowwhensubjectsinthetrainingsetaregeneticallydistanttothoseinthe
predictionset(Clarketal.,2012;Lorenz&Smith,2015).Furthermore,somestudies(delos
Camposetal.,2013b;Wolcetal.,2016)suggestthatpredictionaccuracycanbereducedwhen
individualsdistantlyrelatedtothoseofthepredictionsetarealsousedformodeltraining.Basedon
6
theseobservations,severalstudiesseektodevelopmethodsfortrainingsetoptimization/designing
(e.g.,Clarketal.,2012;Jacobsonetal.,2014;Lehermeieretal.,2014;Lorenz&Smith,2015).
Researchinthisareahasfocusedoncomparingvariousoptimizationcriteria(e.g.,Rincentetal.,
2012;Akdemir&Isidro-Sanchez,2019;Rothetal.,2020).However,alltheseapproachesassume
thatasingletrainingsetisoptimalforallthesubjectsinthepredictionset;thisassumptionmaynot
betruebecauseeachcandidateofselectionmaydrawusefulinformationfromdi˙erenttraining
datapoints.
Therefore,toovercomethelimitationsoftheexistingmethods,in
Chapter3
wepresenta
genomicpredictionapproachthatidenti˝es,foreachindividualinthepredictionset,anoptimal
trainingset(i.e.,asetofsupportpoints).Ourapproachisbasedonasparseselectionindex
(SSI)whichintegratesSImethodologywithsparsity-inducingmethods(i.e.,anL1-penalization).
TheSSIcanbeseenasasparseversionofthepopulategenomic-BLUP(G-BLUP,VanRaden,
2008).WedevelopedsoftwarethatimplementsSSIandevaluatedthemethodologyusingtwo
multi-environmentalwheatdatasets,arelativelysmall(
n
=
599
)highlystructuredoneandavery
large(
n
˘
29
;
000
)datasetthathasarelativelymorecrypticstructure.Inbothcases,wefound
that,comparedwiththeG-BLUP,theSSIcanyieldanimprovementinpredictionaccuracyofabout
5-10%.
Finally,
Chapter4
presentsanapplicationoftheSSImethodologytoaverylarge(
n
=
3
;
039
)
multi-generationdouble-haploid(DH)maizedatasetcomprisinggenotype,grainyield,andplant
heightrecords.Here,weappliedtheSSImethodologyusingadditivegenomicrelationshipsand
alsousinga(non-linear)Gaussiankernel(K-BLUP).LiketheSSI,theGaussiankernelcanbe
tunedtomaximizetheborrowingofinformationbetweencloselyrelatedindividuals.Thesparse
andnon-sparseK-BLUPmodelsperformedsimilarlywithupto28%ofgaininpredictionaccuracy
comparedwithG-BLUPbasedonadditiverelationships.Insomecases,thesparseK-BLUP
outperformedthenon-sparseK-BLUP.
7
CHAPTER2
REGULARIZEDSELECTIONINDICESFORBREEDINGVALUEPREDICTION
USINGHYPER-SPECTRALIMAGEDATA
[Materialpublishedin:Lopez-Cruz
etal.
2020.
Scienti˝cReports
10(8195)]
MarcoLopez-Cruz
1
,EricOlson
1
,GabrielRovere
2
;
3
;
4
,JoseCrossa
6
,SusanneDreisigacker
6
,
SuchismitaMondal
6
,RaviSingh
6
,andGustavodelosCampos
3
;
4
;
5
;

1
DepartmentofPlant,SoilandMicrobialSciences,MichiganStateUniversity,USA
2
DepartmentofAnimalScience,MichiganStateUniversity,USA
3
DepartmentofEpidemiologyandBiostatistics,MichiganStateUniversity,USA
4
InstituteforQuantitativeHealthScienceandEngineering,MichiganStateUniversity,USA
5
DepartmentofStatisticsandProbability,MichiganStateUniversity,USA
6
InternationalMaizeandWheatImprovementCenter(CIMMYT),Mexico

CorrespondenceandrequestshouldbeaddressedtoG.D.L.C.(e-mail:gustavoc@msu.edu).
8
2.1Abstract
High-throughputphenotyping(HTP)technologiescanproducedataonthousandsofphenotypes
perunitbeingmonitored.Thesedatacanbeusedtobreedforeconomicallyandenvironmentally
relevanttraits(e.g.,droughttolerance);however,incorporatinghigh-dimensionalphenotypesin
geneticanalysesandinbreedingschemesposesimportantstatisticalandcomputationalchallenges.
Toaddressthisproblem,wedevelopedregularizedselectionindices;themethodologyintegrates
techniquescommonlyusedinhigh-dimensionalphenotypicregressions(includingpenalization
andrank-reductionapproaches)intotheselectionindex(SI)framework.Usingextensivedatafrom
CIMMYT's(InternationalMaizeandWheatImprovementCenter)wheatbreedingprogramwe
showthatregularizedSIsderivedfromhyper-spectraldatao˙erconsistentlyhigheraccuracyfor
grainyieldthanthoseachievedbystandardSIs,andbyvegetationindicescommonlyusedtopredict
agronomictraits.RegularizedSIso˙erane˙ectiveapproachtoleverageHTPdatathatisroutinely
generatedinagriculture;themethodologycanalsobeusedtoconductgeneticstudiesusinghigh-
dimensionalphenotypesthatareoftencollectedinhumansandmodelorganismsincludingbody
imagesandwhole-genomegeneexpressionpro˝les.
2.2Introduction
High-throughputphenotyping(HTP)technologieshavebeenadoptedatafastpaceinagri-
culture;applicationsrangefromtheuseofHTPinhighlycontrolledenvironments(e.g.,growth
chambers(Nageletal.,2012))toextensiveHTPusingsensingdevicesmountedonaerial(e.g.,
hyper-spectralcamerasmountedonaerialvehicles(Araus&Cairns,2014))andterrestrialequip-
mentsuchastractorsandcombineharvesters(Montesetal.,2006).Modernagriculturalproduction
systemsuseHTPdatatooptimizemanagementpractices(Whiteetal.,2012),forecastagricultural
outputs(Ferrioetal.,2005)andtoassessthequality(e.g.,proteincontent)ofagriculturalcom-
modities(Spielbaueretal.,2009).HTPdatacanalsobeavaluableinputforbreedingprograms.
Forinstance,extensiveHTPmayenableanexpansionofgenetictestingthatcanleadtohigher
9
intensityofselectionandfastergeneticprogress.Moreover,HTPdatamayo˙eropportunitiesto
improvetraitssuchasdroughttolerancethatareotherwisedi˚culttomeasureandbreedfor.
Sensorscangeneratedataonhundredsorthousandsofphenotypesperunitbeingmonitored.
Forexample,hyper-spectralcamerascangeneratere˛ectanceofelectromagneticpowerathundreds
ofwavelengthsinthevisibleandinfraredspectrum.Thesemeasurementscanbeconsideredas
indicatorphenotypesthatcanbeusedtopredictothertraits.Anextensivebodyofresearchdeals
withtheuseHTPdatatopredictphenotypessuchasgrainyield(Ferrioetal.,2005;Garriga
etal.,2017;Weberetal.,2012;Aguateetal.,2017),drymatter(Montesetal.,2006),oiland
proteincontent(Garnsworthyetal.,2000;Oblathetal.,2016).However,therehasbeenmuchless
researchonhowtointegrateHTPdataingeneticstudiesandinbreedingschemes.Ingenetics,the
problemofpredictingthegeneticmeritofatargettraitgivenasetofcorrelatedphenotypeswas
˝rstaddressedbySmith(1936)andHazel(1943)whointroducedtheconceptofselectionindex
(SI)inplantandanimalbreeding,respectively.
ASIseekstoimproveatargettrait
y
i
(e.g.,grainyield)usinginformationfromanothersetof
measuredtraits(e.g.,hyper-spectralimagedata).AlinearSIisaweightedsumofthemeasured
phenotypeswithweightsderivedtomaximizethecorrelationbetweenthegeneticmeritforthe
selectiontargetandtheSI.Thus,theSImethodologyo˙ersanaturalframeworkforintegrating
HTPdataintobreedingdecisions.However,whenthemeasuredphenotypeishigh-dimensional,
thenaïveapplicationoftheSIcanleadtoover˝ttingandsub-optimalaccuracyofindirectselection.
Toaddressthisproblem,wedevelopedregularizedselectionindices(includingpenalizedand
reduced-rankmethods)thataretailoredtoachieveaccuratepredictionofgeneticvaluesusinghigh-
dimensionalphenotypes.TheproposedmethodologyintegratesintotheSIframeworkmethods
oftenusedtopreventover˝ttinginhigh-dimensionalphenotypicregressions(Hastieetal.,2009).
Usingextensivemulti-environmentcropimagingdatafromCIMMYT'swheatbreedingprogram
weshowthatregularizedSIso˙erimprovedaccuracyofindirectselectioninbothoptimaland
stressenvironments.
10
2.3Results
Aselectionindexisalinearcombinationof
p
measuredphenotypes,
x
i
=
¹
x
i
1
;:::;
x
ip
º
0
,ofthe
form
I
i
=
x
0
i

,where

=
¹

1
:::;
p
º
0
isavectorofregressioncoe˚cientswhoseentriesde˝nethe
weightsofeachofthemeasuredphenotypesintheSI.InastandardSItheweightsarederivedby
minimizingtheexpectedsquareddeviationbetweenthegeneticmeritfortheselectiongoal(
g
y
i
,
e.g.,thegeneticmeritforgrainyieldofthe
i
th
genotype)andtheindex,thatis:
^

=
argmin

1
2
E

g
y
i

x
0
i


2
(2.1)
Thesolutiontothisoptimizationproblemis(see
Methods
section):
^

=
P

1
x
G
x
;
y
;
(2.2)
where
G
x
;
y
=
E
¹
x
i
º
isavectorcontainingthegeneticcovariancesbetweentheselectionobjective
(
y
i
)andeachofthemeasuredtraits(
x
i
),and
P
x
isthe(population)phenotypicvariance-covariance
matrixofthemeasuredphenotypes,thatis,
P
x
=
E
¹
x
i
x
0
i
º
.Thus,astandardSItakestheform
I
i
=
x
0
i
P

1
x
G
x
;
y
.ThetheoryunderlyingthederivationofSIsandresponsetoindirectselectionis
wellestablished(Bulmer,1985;Falconer&Mackay,1996).
TheSIisbyconstructionthebestlinearpredictor(BLP)ofthegeneticmeritfortheselection
goal;thispropertyholdswhen
G
x
;
y
and
P
x
areknown.However,whenthenumberofmeasured
phenotypesislarge,errorsintheestimationof
P
x
and
G
x
;
y
mayleadtoover˝ttingandsub-optimal
accuracyofindirectselection.
2.3.1Regularizedselectionindices
Reduced-rank(e.g.,principalcomponentsmethods)andpenalizedregression(Hastieetal.,2009)
aretwoapproachescommonlyusedtoconfrontover˝ttinginhigh-dimensionalregressionproblems.
Thesemethodologiesweredevelopedforregressionproblemsinvolvinganobservablephenotype
(
y
i
).IntheSI,theresponse(
g
y
i
)isunobservable;however,thesameprinciplesthatareappliedin
phenotypicreduced-rankandpenalizedregressionscanbeintegratedintotheSIframework.
11
2.3.1.1Reduced-rankselectionindices
Inprincipalcomponents(PC)regression,theresponseisregressedonareducednumber(
q
<
p
)of
PCsextractedfromasetofpredictors(
x
i
);thesameconceptcanbeusedtoderiveareduced-rank
SI.Forinstance,onecanextractareducednumberofPCsfromacropimageandtheresulting
PCscanbeusedas'measuredtraits'inEquation(2.1).ThesolutionofEquation(2.1)willrender
estimatesoftheregressioncoe˚cientsforthePCs,whichcanbetransformedbacktocoe˚cients
applicabletothemeasuredtraits(see
Methods
).Thus,areduced-rankSI(referredtoasPC-SI)can
bederivedfollowingthesesteps:(
i
)extract,usingthesingularvaluedecomposition,
q
PCsfrom
thematrixcontainingthemeasuredphenotypes,(
ii
)estimatethegeneticcovariancesbetweenthe
˝rst
q
PCsandtheselectionobjective,(
iii
)usetheseestimated(co)variancestoderivecoe˚cients
associatedwiththetop
q
PCs;˝nally,(
i
v
)transformthesecoe˚cientsintocoe˚cientsforthe
measuredphenotypes.Thisprocesscanbedoneusing
q
=
1
;
2
;:::;
p
PCs(
q
=
p
rendersthe
standardSI).Forthesequenceofestimates(
^

¹
1
º
;
^

¹
2
º
;:::;
^

¹
p
º
),onecanevaluatetheaccuracyof
indirectselectionoftheresultingSIandanoptimalrankforthePC-SIcanbechosentomaximize
theaccuracyofindirectselection.
2.3.1.2Penalizedselectionindices
Inapenalizedregression,regularizationisachievedbyincludingintheobjectivefunctionapenalty
onmodelcomplexity.InthecontextofaSI,wehave
^

=
argmin


1
2
E

g
y
i

x
0
i


2
+

J
¹

º

;
(2.3)
where

isapenaltyparameter(

=
0
yieldsthecoe˚cientsforthestandardSI)and
J
¹

º
isapenalty
function.CommonlyusedpenaltiesincludetheL2(
jj

jj
2
2
=
Í
p
j
=
1

2
j
)andL1(
jj

jj
1
=
Í
p
j
=
1
j

j
j
)
norms(Fu,1998),oraweightedsumofthetwo(Zou&Hastie,2005).Using
J
¹

º
=
1
š
2
Í
p
j
=
1

2
j
inEquation(2.3)rendersaRidge-regression-typePSI(RR-PSI,see
Methods
):
^

L
2
=
¹
P
x
+

I
º

1
G
x
;
y
;
12
where
I
isa
p

p
identitymatrix.TheRR-PSI(referredtoastheL2-PSI)yieldsshrunkenestimates
oftheregressioncoe˚cients.
Inmanyapplications,variableselection(i.e.,aSIthatisafunctionofasubsetofthemeasured
phenotypes)maybedesirable.ThispropertycanbeobtainedusingpenaltiesinvolvingtheL1-
norm,eitheralone,
J
¹

º
=
Í
p
j
=
1
j

j
j
(LASSO(Tibshirani,1996)),orincombinationwiththe
L2-norm,
J
¹

º
=
1
š
2
¹
1


º
Í
p
j
=
1

2
j
+

Í
p
j
=
1
j

j
j
(elastic-net(Zou&Hastie,2005)).Unlikethe
L2-PSI,theLASSOandelastic-netSIs(hereinafterreferredtoasL1-PSIandEN-PSI,respectively)
donothaveaclosed-formsolution(Hastieetal.,2009).However,solutionsforPSIsinvolvingan
L1-penaltycanbeobtainedusingiterativeproceduressuchasthecoordinatedescent(Friedman
etal.,2007)andtheleastangleregression(Efronetal.,2004)(LARS)algorithms(see
Methods
).
AswiththePC-SI,anoptimalPSIcanbeobtainedbychoosingthevaluesoftheregularizing
parameters(
;
)thatmaximizetheaccuracyofindirectselection.
Figure2.1:Predictionofthegeneticmeritforgrainyieldusinghyper-spectralcropimagedata.(A)
Dataconsistsofhyper-spectralre˛ectancedata(
x
i
)andphenotypicmeasurementsofthetargettrait
(
y
i
,e.g.,grainyield).(B)Asubsetofthedata(thetrainingset)isusedtoderivethecoe˚cientsof
aselectionindex.(C)Thesecoe˚cientsarethenappliedtoimagedataofindividualsinthetesting
settoderivetheindex(
I
i
)foreachindividual.Thepredictiveabilityoftheindexisassessedby
calculatingtheaccuracyofindirectselection(
Acc
¹Iº
)inthetestingset.
13
2.3.2Accuracyofindirectselection
Indirectselectionaccuracyisde˝nedasthecorrelationbetweentheindexusedtorankgenotypes
andthegeneticmeritoftheselectionobjective,thatis,
Acc
¹Iº
=
cor
¹I
i
;
g
y
i
º
.Thisparameteris
equaltotheproductofthesquarerootoftheheritabilityoftheSI(
h
I
)timesthegeneticcorrelation
betweentheSIandtheselectiontarget,
cor
¹
g
I
i
;
g
y
i
º
(Falconer&Mackay,1996).Toavoid
estimationbias
Acc
¹Iº
mustbeestimatedusingdatathatwasnotusedtoderivethecoe˚cientsof
theindex(Figure2.1);therefore,intheapplicationpresentedbelowwe:(
i
)partitionedthedatainto
trainingandtestingsets,(
ii
)derivedthecoe˚cientsoftheSIinthetrainingset,(
iii
)appliedthese
coe˚cientstoimagedataofthetestingset(
I
i
=
x
0
i

),and(
i
v
)estimated
h
I
,
cor
¹
g
I
i
;
g
y
i
º
,and
Acc
¹Iº
inthetestingset.Furthermore,wequanti˝edthee˚ciencyofindirectselectionrelativeto
massphenotypicselection(RE)using
RE
=
h
I
h
y
cor
¹
g
I
i
;
g
y
i
º
(Falconer&Mackay,1996).
2.3.3Regularizedselectionindicesforwheatgrainyieldusinghyper-spectralimagedata
Weappliedthemethodologydescribedintheprevioussectiontodata(
n
=
3
;
276
)fromthe
CIMMYT'sGlobalWheatProgramconsistingofgrainyield(tonha

1
)andhyper-spectralimage
data.ThedatawerecollectedatCIMMYT'sexperimentalstationinCiudadObregon,Sonora,
Mexico(27

20'N,109

54'W,38masl)from39yieldtrialsinwhichatotalof1,092genotypes
weretested.RainfallinObregonisverylimited;therefore,fourdi˙erentenvironmentswere
generatedrepresentingacombinationofplantingmethods(
Flat
or
Bed
),controlledirrigation
(minimal,2or5irrigations),andplantingdates(optimumorearly-heat).Asexpected,average
yielddecreasedasdroughtstressintensityincreased(seeTable2.1andSupplementaryFigureA.1
forboxplotsofyieldbyenvironment).
Imagedatawascollectedusinganinfraredandanhyper-spectralcameraandconsistedofre-
˛ectanceofelectromagneticpowerat250wavebandswithinthevisibleandnear-infraredspectrums
(392-850nm).Separateimageswerecollectedat9time-pointscoveringvegetative(VEG),grain
˝lling(GF),andmaturity(MAT)stagesofthecrop(seeSupplementaryFigureA.2).Grainyield
14
andimagedatawerepre-adjustedusingmixed-e˙ectsmodelthataccountedforgenotype,trial,
replicate,andsub-block(see
Methods
section).
Table2.1:Averagegrainyieldandheritabilitybyenvironmentalcondition.
PlantingconditionsNumberof
Abbreviation
Average(SD)
Heritability(SD)
DateSystemirrigationsYield
Optimum
FlatMinimalFlat-Drought2.06(0.58)0.83(0.016)
Bed
2Bed-2IR3.67(0.43)0.66(0.032)
5Bed-5IR6.11(0.61)0.43(0.025)
Early5Bed-EHeat6.43(0.73)0.61(0.018)
SD:standarddeviation.
2.3.4Regularizationimprovestheheritabilityandtheaccuracyoftheindex
Toassessthee˙ectofregularizationontheaccuracyofindirectselectionwe˝ttedanL1-PSIovera
gridofvaluesoftheregularizationparameter(

¹
1
º
>
¹
2
º
>:::>
0
inEquation(2.3),using

=
0
rendersastandardSI).Foreachofthesolutions(
^

¹

¹
1
º
º
;
^

¹

¹
2
º
º
;:::
)weestimatedtheheritability
oftheresultingindexandthegeneticcorrelationbetweentheindexandtheselectiontarget,and
fromthoseestimateswederivedtheaccuracyofindirectselection.Thesameapproachwasused
toevaluatetheaccuracyofindirectselectionofPC-SIswithavaryingnumber(
1
;
2
;:::
)ofPCs.
We˝rst˝ttedPSIsandPC-SIsusingdatafromasingletime-point;theresultsfromthelatest
time-point(correspondingtoMATorlateGFstagesdependingontheenvironment)arepresented
inFigure2.2(seeSupplementaryFiguresA.3toA.5forothertime-points).Theheritabilityofthe
L1-PSI(Figure2.2A)decreasedasmorebandsbecameactiveintheindex.Likewise,theheritability
ofPC-SI(Figure2.2B)decreasedwiththenumberofPCsused.However,thegeneticcorrelation
increasedaseithermorebandsbecomeactiveintheL1-PSIormorePCswereusedinthePC-SI.
Consequently,themaximumaccuracyofindirectselectionwasachievedwithaSIofintermediate
complexity(withanywherebetween20and60ofthe250bandsbeingactiveintheL1-PSI,and
between20-60PCsinthePC-SI).Resultsforothertime-pointsandenvironments(Supplementary
FiguresA.3toA.5)exhibitedsimilarpatternswithsomedi˙erencesbetweenenvironments.The
15
accuracyofindirectselectionoftheoptimalL1-PSIwasalwaysclosetothatoftheoptimalPC-SI
andthatoftheoptimalL2-PSI(SupplementaryTableA.1).Importantly,inallcasestheaccuracyof
indirectselectionoftheoptimalregularizedSIswasconsiderablyhigherthanthatofthestandard
SI,whichistheonecorrespondingto250activebandsor250PCs(i.e.,theright-mostresultsin
theplotsinFigure2.2).
Figure2.2:AccuracyofindirectselectionofregularizedSIsanditscomponents.Squareroot
heritability(green),geneticcorrelation(orange),andaccuracyofindirectselection(purple,all
averagedover100training-testingpartitions),versusthenumberofpredictorsusedtobuildthe
index:(A)numberofactivebandsinthecaseoftheL1-PSI,or(B)numberofPCsinthePC-SI.
Eachpanelrepresentsoneenvironment(latesttime-point).
Figure2.3displaystheaccuracyofindirectselectionacrosstime-pointsfortheoptimal(i.e.,
theonewiththehighestaccuracyofindirectselection)L1-PSIandPC-SI.Forcomparisonwe
alsodisplayintheplottheaccuracyofindirectselectionachievedbyastandardSI(ingreen).
Theestimated95%con˝denceintervalsoftheaccuracyoftheregularizedSIs(eitherPC-SIor
L1-PSI)areallabove(anddonotoverlap)withthecon˝denceintervalsfortheaccuracyofthe
standardSI,exceptforasingletime-point(57DASinenvironment
Bed-2IR
).ResultsfromTukey's
HonestSigni˝canceDi˙erencecon˝rmedthattheaccuracyoftheregularizedSIsisstatistically
di˙erent(higher)thanthestandardSIata5%ofsigni˝cance(SupplementaryTableA.1)forall
16
butone(57DASinenvironment
Bed-2IR
)time-point/environment.Regularizationincreasedthe
selectionaccuracyacrosstime-pointsandenvironments.RegularizedSIs(eitherPC-SIorL1-PSI)
hadanaccuracyofindirectselectionthatwasinaverage10-40%higherthantheaccuracyachieved
byastandardSI.Thesegainsinaccuracywerestrongerintheoptimalenvironment(
Bed-5IR
withamedianof36%)andsmallerinthestressedenvironments(
Flat-Drought
and
Bed-EHeat
withamedianof16%).Interestingly,therewerenosizabledi˙erencesbetweentheaccuracyof
indirectselectionachievedwiththeoptimalL1-PSIandthatoftheoptimalPC-SI.Comparedwitha
standardSI,regularizedSIshadhigherheritability(SupplementaryFigureA.6);thiswasachieved
withoutcompromisingthegeneticcorrelation(SupplementaryFigureA.7),thusleadingtoahigher
accuracyofindirectselectionachievedbyeitherpenalizationorrank-reductionstrategies.
Figure2.3:Accuracyofindirectselectionachievedbyastandard(SI)andbyregularized(PC-SI
andL1-PSI)selectionindices.Thelinesprovidetheaverageaccuracyover100training-testing
partitions.Verticallinesrepresenta95%con˝denceintervalfortheaverage.Thehorizontalaxis
givesthetime-pointatwhichimageswerecollectedandareexpressedinbothdaysaftersowing
(DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity).
17
2.3.5Usingdatafrommultipletime-pointsfurtherimprovesselectionaccuracy
Theresultspresentedabovewerebasedondatafromasingletime-point.Wealsogenerated
selectionindicesusingdatafrommultipletime-points(inthiscase,
x
i
wasavectorcontaining
2,250traits,correspondingto250wavebandsmeasuredateachof9time-points).Integratingdata
frommultipletime-pointsfurtherincreasedtheaccuracyofL1-PSIbyamarginthatrangedfrom1
to8pointsonthecorrelationscale(Table2.2).Thegainsinselectionaccuracyobtainedusingdata
frommultipletime-pointsweremoreevidentinenvironmentswithloweraccuracy;similarresults
wereobtainedforthePC-SIandL2-PSI(SupplementaryTableA.1).
Table2.2:Accuracyandrelativee˚ciencyofindirectselectionofanL1-penalizedSIusingdata
fromoneandninetime-points.
Accuracy(SD)RelativeE˚ciency(SD)
EnvironmentBestsingleNinetime-pointsBestsingleNinetime-points
time-point*combinedtime-point*combined
Flat-Drought0.69(0.05)0.70(0.05)0.74(0.05)0.75(0.05)
Bed-2IR0.46(0.04)0.54(0.03)0.57(0.05)0.67(0.04)
Bed-5IR0.47(0.06)0.55(0.05)0.72(0.08)0.83(0.08)
Bed-EHeat0.68(0.04)0.71(0.04)0.88(0.05)0.91(0.04)
Valuesarepresentedasanaverageacross100training-testingpartitions.SD:standarddeviation.*Foreach
environmentweincludethetime-pointthatgavethehighestaccuracyofselection(seeFigure2.3forother
time-points).
2.3.6L1-penalizationleadstosparseselectionindices
Figure2.4showsaheatmapforthesolutionsoftheoptimalL1-PSIthatintegrateddatafrom
the9time-points.Eachpanelrepresentsanenvironment,horizontalbandsrepresenttime-points.
Withineachtime-pointwavebandsnotenteringinthesolutionareingreyandnon-zerocoe˚cients
arerepresentedinayellow-redscale(redindicateslargeabsolute-valuecoe˚cients).Thewell-
irrigatedenvironments(
Bed-5IR
and
Bed-EHeat
)hadconsiderablysparserindiceswithonlya
reducednumberofwavebandsinthesolutions;theseweremostlylocatedintheviolet,blueand
redregionsofthespectrum.Instressedenvironments(
Flat-Drought
and
Bed-2IR
)therewerealso
18
afewwavebandsinthegreenandinfraredregionsthatwereactive.Inalltheindices,therewere
wavebandsfromseveraltime-pointsthatwereactiveintheoptimalsolution,suggestingthatdata
frombothearlyandlatephenologicalstagesareinformativeaboutthegeneticmeritforgrainyield.
Figure2.4:Heatmapofregressioncoe˚cientsforL1-penalizedselectionindices.Separateindices
werederivedforeachenvironmentusingmultitime-pointdata.DAS=daysaftersowing,VEG,GF,
MATrepresentvegetative,grain-˝llingandmaturitystages,respectively.Thebottomcolor-bar
showsthelightcolorassociatedwitheachwavebandinthevisiblespectrum(

750nm);blackwas
usedtorepresentthenear-infraredspectrum(wavelength>750nm).
2.3.7Comparisonwithphenotypicprediction
WecomparedtheaccuracyofindirectselectionofthePSIandPC-SIwithvegetationindices
andpenalizedphenotypicprediction.Vegetationindicesareoftenusedtopredictyield(Tattaris
etal.,2016),biomass,andchlorophyllcontent(Babaretal.,2006;Haboudaneetal.,2002).We
consideredtwovegetationindices:theRedandGreenNormalizedDi˙erenceVegetationIndices
19
(RNDVI(Tucker,1979)andGNDVI(Gitelsonetal.,1996)respectively).Foreachoftheseindices
weestimatedthegeneticcorrelationwithgrainyield,aswellastheirheritabilityandaccuracy
ofindirectselection(SupplementaryTableA.1).Overall,theaccuracyofindirectselectionof
theGNDVIandRNDVIwaslowerthantheoneachievedwithaPSI(theaveragedi˙erencein
accuracybetweenRNDVIandtheL1-PSIvariedbyenvironmentfrom0.02to0.14pointsin
correlation,SupplementaryTableA.1,infavoroftheL1-PSI).TheheritabilityoftheGNDVI
andRNDVIwassimilarandsuperiorinsomecasestothatoftheL1-PSI(SupplementaryFigure
A.6);however,thegeneticcorrelationbetweenthevegetationindicesandgrainyieldwas(inmost
time-pointsandenvironments)lowerthanthegeneticcorrelationbetweentheL1-PSIandgrain
yield(SupplementaryFigureA.7).Thus,themaindriverofthedi˙erenceinaccuracybetweenthe
L1-PSIandthevegetationindiceswasthedi˙erenceingeneticcorrelation.
Wealso˝ttedL1-penalizedphenotypicprediction(L1-Phen)andcomparedtheaccuracyof
indirectselectionofthesephenotypicpredictionmethodswiththatofpenalizedSIs.Overall,the
L1-PhenachievedanaccuracyofindirectselectionveryclosetothatoftheL1-PSI(Supplementary
TableA.1);however,inafewenvironmentsatsometime-points,theL1-PSIachievedahigher
accuracyofindirectselectionthanthephenotypicprediction.
2.4Discussion
High-throughputphenotypinghasbeenextensivelyadoptedinagriculturalresearchandcom-
mercialproduction.ExtractinginterpretableinformationfromHTPdataposesimportantstatistical
challenges.Theclearmajorityofresearchinthisareahasfocusedoncalibratingequationsto
predictphenotypes(e.g.,totalbiomass,grainyield)usingHTPdataasinputs.Thisapproachis
well-suitedforphenotypicprediction;however,thesameapproachcanbesub-optimalforselection
becausethebestpredictorofaphenotypeisnotalwaysthebestpredictorofthegeneticmeritof
thesametrait.
Thebest(linear)phenotypicpredictoristhesumofthebestlinearpredictorofthegeneticmerit
(
g
y
)plusthebestlinearpredictoroftheenvironmentalterm(
"
y
),thatis,
E
¹
y
j
x
º
=
E
¹
g
y
j
x
º
+
E
¹
"
y
j
x
º
.
20
The˝rstterm,
E
¹
g
y
j
x
º
,istheSIanditis,byconstruction,maximallycorrelatedwiththegenetic
merit.Thesecondterm,
E
¹
"
y
j
x
º
,isrelevantforphenotypicpredictionbutrepresentsnoisewhen
theproblemisthatofselectingthebestgenotypes.
Selectionindicesexploitgeneticcovariances,whilephenotypicpredictionreliesonphenotypic
covariancesbetweentheselectiontargetandthemeasuredphenotype(e.g.,cropimaging).Thus,the
twomethodsyielddi˙erentresultswheneverthepatternsofphenotypiccorrelationsaresu˚ciently
di˙erentfromthepatternsofgeneticcorrelations.Inourdataset,environmentalconditions
werehighlycontrolled,withrelativelylowun-controlledwithin-trialvariabilityinenvironmental
conditions.Consequently,thepatternsofphenotypicandgeneticcorrelationswereverysimilar
(seeSupplementaryFigureA.8).Thiswastrueformanytime-pointsandenvironmentsbutnotin
others(e.g.,80,85and93DASin
Flat-Drought
,and90and98DASin
Bed-2IR
);itwasexactlyin
thosetime-pointsandenvironmentsthattheL1-PSIachievedhigheraccuracyofindirectselection
thantheL1-Phenmethod(SupplementaryTableA.1).
AstandardSI(Equation(2.1))is,byconstruction,maximallycorrelatedwiththegenetic
meritoftheselectionobjective.Thisoptimalitypropertyholdswhenthegeneticandphenotypic
(co)variancematricesthatareneededtoderivethecoe˚cientsoftheSI(seeEquation(2.2))are
knownwithouterror.However,whenthemeasuredphenotypeishigh-dimensional,estimation
errorsinthephenotypic(co)variancematrix(
P
x
),aswellasinthegeneticcovariances(
G
x
;
y
),can
makethestandardSIsub-optimal.Ourempiricalresultscon˝rmthis:standardSIsover-˝ttedthe
data;thisleadstoaSIwithlowheritabilityandlowaccuracyofindirectselection.
Topreventover˝tting,weconsideredintegratingideascommonlyusedinhigh-dimensional
regressionintotheSImethodology.Ourempiricalresultsshowthatregularizationconsistently
improvestheaccuracyofindirectselectionrelativetostandardSIs.Weveri˝edthisforvarious
environmentalconditionsandforcropimagingdatacollectedat9di˙erenttime-points.The
optimalPSIandtheoptimalPC-SIachievedalmostthesameaccuracyofindirectselectionforall
theenvironmentsandtime-points,suggestingthateithertypeofregularizationcanbee˙ective.
Reduced-rankselectionindicesareappealingbecauseafterdimensionreductiontheproblem
21
ofderivingaSIistrivialandcanbedealtwithmethodscommonlyusedtoderivestandardSIs.
Moreover,afterHTPhasbeenreducedtoafewderived-traits(saythetop10PCs),thesetraits
canbeintegratedintogeneticevaluations(eitherpedigree-based(Henderson&Quaas,1976)or
genomic-enabled(Meuwissenetal.,2001))usingstandardmulti-traitmodels.
Principalcomponents-basedmethodshavebeenconsideredbeforeintheanalysisofFourier-
transformedinfrared(FTIR)spectraderivedfrommilksamples.Forinstance,Soyeurtetal.
(2010)usedareducednumberofFTIR-derivedPCstoestimatevariancecomponentsforselection
objectives(e.g.,fatorproteincontentinmilk).Buildinguponthisidea,Dagnachewetal.(2013)
suggestedpredictingthegeneticmeritformilkfattyacidsusingFTIR-derivedPCsas'traits'ina
geneticevaluation.However,whenmappingfromgeneticpredictionsofPC-lodgingsontogenetic
predictionsfortheselectionobjectivetheauthorsusedcoe˚cientsderivedfromaphenotypic
(partialleastsquares)regression.Thisdoesnotguaranteethattheresultingindexismaximally
correlatedwiththegeneticmeritoftheselectiontarget.ThepenalizedandPC-SIpresentedinthis
studyaddressthatproblembyusingcoe˚cientsthatarederivedusinggenetic(andnotphenotypic)
covariances.
AdisadvantageofthePC-SIisthatthemethodologydoesnotnaturallyprovidevariable
selection,afeaturethatmaybedesirablewhenthemeasuredphenotypeishigh-dimensional.
Penalizedselectionindicescanperformvariableselectionbasedongeneticcovariances.While
thederivationofaPSIisabitmorechallengingthanthatofthePC-SI,thecomputationalburden
involvedinthederivationofaPSIisnotextremelyhigh.
2.4.1IntegrationofPSIandPC-SIintogeneticevaluations
TheSIsconsideredherepredictgeneticmeritforaselectiontargetfromasetoftraitsmeasured
onanindividual(
I
i
=
x
0
i

);suchindicesexploitborrowingofinformationbetweentraitswithin
anindividual.Borrowingofinformationbetweenindividualsincreasesselectionaccuracy;we
envisiontwowaysinwhichregularizedSIscanbeintegratedintopedigreeorgenomic-based
geneticevaluations.
22
Onepossibilityistouseatwo-stepsapproachwhereasinthe˝rststepaPSIoraPC-SIis
usedtopredictthegeneticmeritusingwithin-individualinformation.Thisstepcanbeconsidered
asataskwherepatternsattributabletogeneticcovariancesareextractedandthoseattributable
toenvironmentalcovariancesaresmoothed-out.Then,inasecondstep,theresultingindex-data
fI
1
;:::;
I
n
g
couldbeusedasatraitinageneticevaluation.
OurstudyshowsthattheuseofaregularizedSIleadstoaderived-phenotypethathashigher
geneticaccuracythanstandardSIs,andthatofbestphenotypicprediction.Inprinciple,usinga
moreaccuratephenotypeshouldleadtoahigheraccuracyofthepredictedbreedingvaluesinthe
secondstep.However,furtherstudiesareneededtodeterminewhetherthegainsinaccuracyat
theleveloftheHTP-derivedphenotypewillfullytranslateintoahigheraccuracyofthepredicted
breedingvaluesinatwo-stepsprocedure.
Aone-stepapproachisconceptuallyfeasibleandstatisticallymoree˚cientasito˙ersthe
possibilityofconsideringcorrelationsbetweentraits,relationshipsbetweengenotypes,andthe
e˙ectsofnon-geneticfactorsjointly;however,theimplementationoftheone-stepapproachusing
high-dimensionalphenotypescanbecomputationallychallenging.Toimplementaone-stepap-
proach,theoptimizationproblemofEquation(2.3)canbemodi˝edbyreplacing
x
i
,thevectorwith
themeasuredphenotypesonthe
i
th
individual,withavector
x
=
¹
x
0
1
;
x
0
2
;:::;
x
0
n
º
0
thatcontainsall
theavailableHTPdata(measuredonallnindividuals);afterexpandingthesquarederrorlossand
takingexpectationsweget
^

i
=
argmin

i

1
2
E

g
2
y
i


0
i
G
g
x
+
1
2

0
i
P
x

i
+

J
¹

i
º

where
G
g
x
isa
pn

1
vectorofgeneticcovariancesincludingbetween-traits-within-individual
(co)variancesandbetween-subjectscovariances.Instandardgeneticmodels,
G
g
x
takesaKronecker
form
G
g
x
=
A
i

G
x
;
y
,where
A
i
aregenetic(eitherDNA-orpedigree-derived)relationshipsbetween
thecandidateforselectionandeachoftheindividualsinthetrainingset,and
G
x
;
y
is,asbefore,a
vectorofgeneticcovariancesbetweentheselectionobjectiveandthemeasuredtraits(
x
).Likewise,
P
x
isa
pn

pn
phenotypic(co)variancematrix.Estimating
P
x
wouldrequireestimatingallthe
geneticandenvironmentalcovariancesamongthemeasuredtraits.
23
2.4.2Impactoftheuseofhigh-throughputphenotypesinbreedingprograms
Accordingtobreeders'equation(Falconer&Mackay,1996;Lush,1937),therateofgeneticgain
fromselectionisdirectlyproportionaltoselectionaccuracyandselectionintensity.Thus,relativeto
theuseofstandardSIs,theuseofregularizedSIsisexpectedtoincreaseselectiongainsbyafactor
equaltothegainsobservedinaccuracy,thatisbetween10%and40%.Relativetomassphenotypic
selection,thePSIshade˚ciencies,RE,rangingfrom60%to90%;therefore,relativetodirect
phenotypicselection,selectiondecisionsbasedonPSIderivedfromimagesareexpectedtoyield
lowergeneticgainsthantheonesthatcouldbeachievedviadirectmassselection.However,the
useofHTPtechnologies(e.g.,cropmonitoringusinghyper-spectralcamerasmountedindrones)
mayenabletheexpansionofthenumberofgenotypestested/measuredaswellasthenumberof
locationswherethosegenotypesaretested.Thiscouldleadtoanincreaseinselectionintensity
whichwillinturnincreaseselectiongains.Forinstance,iftheuseofHTPenablesdoublingthe
numberofgenotypestested,theincreaseinselectiongainsthatcouldbeachievedwithHTPmay
rangefrom20%(inthecasewherethePSIhasREof60%)to80%(forthetraits/environments
withREof90%).
Thediscussionintheprecedingparagraphisentirelybasedonbreeders'equation,whichdoes
notcontemplatethelong-termimpactsofselectioningeneticdiversity.Amoreaccurateand
moreintensiveselectionmaya˙ectdiversityandlong-termresponsetoselection.Toaddressthis
problem,attentiontodiversitywillbeneededwithregularizedSIsaswithanyotherselection
criteria.
2.4.3Regularizedselectionindicescanalsobeavaluabletoolingeneticresearch
High-dimensionalphenotypesarealsobecomingincreasinglyavailableingeneticstudiesinvolving
humansubjectsandmodelorganisms.Performinggeneticstudies(e.g.,genome-wideassociation
analyses)onhigh-dimensionalphenotypesischallengingandtheburdenofmultipletestingacross
hundredsorthousandsofphenotypes(e.g.,RNA-abundanceacrossthousandsofgenes)may
criticallycompromisepower.ThePSIandPC-SIpresentedinthisstudycouldbeusedtoextract
24
geneticpatternsfromhigh-dimensionalphenotypedatasuchasbrainimagingorwhole-genome
geneexpressionpro˝lesandthesepatternscanthenbeusedastraitsingeneticstudies.
Conclusion
.Weproposedtwonovelmethodsforpredictingthegeneticmeritforselectionob-
jectivesfromhigh-dimensionalphenotypes.Thesephenotypesarebecomingincreasinglyavailable
astheadoptionofHTPincropandanimalproductionincreases.Theproposedmethodsintegrate
regularizationprocedurescommonlyusedinhigh-dimensionalregressionsintotheSImethodology.
Regularizationpreventsover˝ttingandincreasestheaccuracyoftheindex.Themethodsproposed
herecanbeusedtoextractgeneticpatternsfromalmostanykindofhigh-dimensionalphenotype,
includingnotonlyHTPdataemerginginagriculturebutalsohigh-dimensionalphenotypesthat
emergeingeneticstudiesinvolvinghumansubjectsandmodelorganisms.
2.5Methods
2.5.1Standardselectionindex
TheweightsonaSIarederivedasthesolutiontotheoptimizationproblemofEquation(2.1):
^

=
argmin

1
2
E

g
y
i

x
0
i


2
Theright-handsidecanbeexpressedas
E
¹
g
y
i

x
0
i

º
2
=
E
¹
g
2
y
i
º
2
E
¹
g
y
i
x
i
º
0

+

0
E
¹
x
i
x
0
i
º

.
The˝rstterm,
E
¹
g
2
y
i
º
,doesnotinvolve

;therefore,itcanbedroppedfromtheobjectivefunction.
Furthermore,if
x
i
hasnullmean,andassumingthattheenvironmentale˙ectson
x
i
areorthogonal
to
g
y
i
,then
E
¹
g
y
i
x
i
º
=
G
x
;
y
isavectorcontainingthegeneticcovariancesbetweentheselection
targetandeachofthemeasuredphenotypes.Likewise,
E
¹
x
i
x
0
i
º
=
P
x
isthephenotypic(co)variance
matrixof
x
i
.Therefore,theprobleminEquation(2.1)canbewrittenas
^

=
argmin


G
0
x
;
y

+
1
2

0
P
x


:
Di˙erentiatingtheright-handsidewithrespecttovector

andsettingthederivativesequalto
zeroleadstothe˝rstorderconditions:
P
x
^

=
G
x
;
y
;therefore,
^

=
P

1
x
G
x
;
y
:
25
2.5.2Reduced-rankselectionindex
Recallthatthesingularvaluedecompositionofareal-valuedmatrix,
X
=
»
x
1
;
x
2
;:::;
x
n
¼
0
(indi-
vidualsinrows,phenotypesincolumns)takestheform
X
=
UDV
0
,where
U
=
»
u
1
;:::;
u
p
¼
isthe
matrixcontainingtheleft-singularvectorsthatspantherow-spaceof
X
,
V
=
»
v
1
;:::;
v
p
¼
isthe
matrixwiththeright-singularvectors,and
D
=
dia
g
¹
d
1
;:::;
d
p
º
isadiagonalmatrixwithpositive
orzeroelements.ThePCs
W
=
XV
=
UD
arelinearcombinationsofthemeasuredphenotypes.A
reduced-rankregressionusesthe˝rst
q
PCs(
~
W
=
»
w
1
;:::;
w
q
¼
,
q

p
)as"measuredphenotypes"
intheSI:
^

¹
q
º
=
argmin

1
2
E

g
y
i

~
w
0
i

¹
q
º

2
;
where
~
w
i
isavectorcontainingthescoresforthe
i
th
observationonthe˝rst
q
PCs.Thesolutionto
theoptimizationproblemtakestheform
^

¹
q
º
=
P

1
~
w
G
~
w
;
y
,where
P
~
w
isthephenotypic(co)variance
matrixofthe˝rst
q
PCsand
G
~
w
;
y
isavectorcontainingthegeneticcovariancesbetweeneachof
thetop
q
PCsandtheselectionobjective.Sincetheleft-singularvectorsareorthonormal(i.e.,
u
0
j
u
j
=
1
and
u
0
j
u
k
=
0
,for
j
,
k
),then
W
0
W
=
D
2
=
dia
g
¹
d
2
1
;:::;
d
2
p
º
.Hence,amethod-of-
momentsestimateofthephenotypic(co)variancematrixof
~
W
containsonlythe˝rst
q
elements
~
D
2
=
dia
g
¹
d
2
1
;:::;
d
2
q
º
;thisis
^
P
~
w
=
1
n

1
~
D
2
Using
^
P
~
w
makesthecoe˚cientsofthePCstobeproportionaltothegeneticcovariancebetween
eachofthePCsandtheselectionobjective:
^

¹
q
º
=
¹
n

1
º¹
~
D
2
º

1
G
~
w
;
y
.Thissolutioncanbemapped
tocoe˚cientsforthemeasuredtraitsusing
^

¹
q
º
=
¹
n

1
º
~
V
¹
~
D
2
º

1
G
~
w
;
y
,where
~
V
isthematrix
containingonlythe˝rst
q
right-singularvectors.
2.5.3Penalizedselectionindices
TheobjectivefunctionofthepenalizedSIisgivenbyEquation(2.3).HereweconsideredPSIs
usingeitherL1orL2-normsoracombinationofthetwo.
26
L2-PSI
.UsinganL2-normaspenalty,
J
¹

º
=
1
š
2
Í
p
j
=
1

2
j
=
1
š
2

0

,inEquation(2.3)leads
tothefollowingoptimizationproblem:
^

L
2
=
argmin


1
2
E

g
y
i

x
0
i


2
+

1
2

0


Therefore:
^

L
2
=
argmin


G
0
x
;
y

+
1
2

0
P
x

+

1
2

0


Thesecondandthirdright-handsidetermscanbecombinedtoobtain:
^

L
2
=
argmin


G
0
x
;
y

+
1
2

0
¹
P
x
+

I
º


;
where
I
isa
p

p
identitymatrix.Di˙erentiatingwithrespectto

andsettingthederivativesequal
tozero,weobtainthe˝rst-orderconditions:
¹
P
x
+

I
º
^

L
2
=
G
x
;
y
;therefore:
^

L
2
=
¹
P
x
+

I
º

1
G
x
;
y
EN-PSI
.Thecoe˚cientsfortheelastic-netfamilyareobtainedbyconsideringanobjective
functionasinEquation(2.3),with
J
¹

º
=
1
š
2
¹
1


º
Í
p
j
=
1

2
j
+

Í
p
j
=
1
j

j
j
;therefore,
^

EN
=
argmin

2
6
6
6
6
4

G
0
x
;
y

+
1
2

0
P
x

+

1
2

0

+

1
2
¹
1


º
p
Õ
j
=
1

2
j
+

p
Õ
j
=
1
j

j
j
3
7
7
7
7
5
:
TheL1-PSIandL2-PSIareparticularcasescorrespondingto

=
1
and

=
0
,respectively.
When

=
0
thesolutionhasaclosed-form(seeL2-PSIabove).If
>
0
,noclosed-formsolution
exists;however,asolutioncanbeobtainedusingthesameiterativealgorithmsthatareusedtosolve
elastic-netregressions(e.g.,LARSandcoordinatedescent(Hastieetal.,2009)).Thesealgorithms
canbeimplementedeitherby"partialresiduals"orusing"covarianceupdates"(Friedmanetal.,
2010).Inourcase,theobjectivefunctionisentirelybasedon(co)varianceterms.Theobjects
P
x
and
G
x
;
y
enterintheobjectivefunctionofthePSIinthesamewaythat
X
0
X
and
X
0
y
enterina
standardelastic-netregression.Therefore,toobtainsolutions,weimplementedthestandardLARS
algorithm(e.g.,Hastieetal.,2009)entirelybasedoncovarianceupdates.
27
2.5.4Data
Thedatasetconsistsof
1
;
092
inbredwheatlinesgroupedinto39trialsandgrownduringthe2013-
2014seasonattheNormanBorlaugexperimentalresearchstationinCiudadObregon,Sonora,
Mexico.Eachtrialconsistedof28breedinglinesthatwerearrangedinanalpha-latticedesign
withthreereplicatesandsixsub-blocks.Thetrialsweregrowninfourdi˙erentenvironments:
Flat-Drought
(sowingin˛atwithirrigationof180mmthroughdripsystem),
Bed-2IR
(sowingin
bedwith2irrigationsapproximately250mm),
Bed-EHeat
(bedsowing30daysbeforeoptimal
plantingdatewith5normalirrigationsapproximately500mm),and
Bed-5IR
(bedsowingwith5
normalirrigations).In2013,allthetrialswereplantedbymid-November(optimalplantingdate),
onthe
21
st
(
Bed-2IR
and
Bed-5IR
)andonthe
26
th
for
Flat-Drought
.Trialsfor
Bed-EHeat
were
plantedonOctober
30
th
.Grainyield(tonha

1
,totalplotyieldaftermaturity)wasrecorded.
Re˛ectancedatawerecollectedfromthe˝eldsusingbothinfrared(A600seriesInfrared
camera,FLIR,Wilsonville,OR)andhyper-spectral(A-series,Micro-Hyperspec,VNIRHeadwall
Photonics,Fitchburg,MA)camerasmountedonaPiperPA-16Clipperaircrafton9di˙erentdates
(time-points)betweenJanuary
10
th
andMarch
27
th
,2014.Duringeach˛ight,datafrom
p
=
250
wavebandsrangingfrom392to850nmwerecollectedforeachpixelinthepictures.UsingArcMap
software(ESRI,CA),theaveragere˛ectanceofallthepixelswithineachgeo-referencedtrialplot
wascalculatedandreportedasasingledata-pointforeachgenotypeforeachband.Daystoheading
wererecordedasthenumberofdaysfromthedateofsowing/˝rstirrigationuntil50%ofspike
emergenceineachplot.Headingofabout50-80%ofthetotalnumberofplotswasusedascriterion
todistinguishbetweenvegetative(VEG)andgrain˝lling(GF)stages.Thecropwasconsideredto
beatmaturity(MAT)stagewhentheaverageRNDVIdecreasedto
˘
0
:
4
.
2.5.5Phenotypepre-processing
Withineachenvironment,grainyieldphenotypicrecordswerepre-adjustedby˝ttingthefollowing
mixedmodel,
y
jklm
=

+
g
j
+
t
k
+
r
l
¹
k
º
+
b
m
¹
kl
º
+
"
jklm
28
where
y
jklm
isthegrainyieldphenotypevalueforthe
j
th
genotype,
k
th
trial,
l
th
replicate(within
trial),
m
th
sub-block(withintrialandreplicate),

istheoverallmeanand
g
j
,
t
k
,
r
l
¹
k
º
,and
b
m
¹
kl
º
arethegenotype,trial,replicate,andsub-blocke˙ects,respectively(allassumedtoberandom)
and
"
jklm
isanerrorterm.Randome˙ectswereassumedtobeindependentlyandidentically
distributed(
iid
)normalwithnullmeanande˙ect-speci˝cvariances.Likewise,theerrorterms
wereassumedtobeiidwithnullmeanandcommonerrorvariance.
Grainyielddatawerepre-adjustedbysubtractingfromthephenotypicrecord(
y
jklm
)themean
(
^

)plusBLUPsoftrial,replicate,andsub-blocke˙ects;thisis
y

jklm
=
y
jklm

^


^
t
k

^
r
l
¹
k
º

^
b
m
¹
kl
º
=
^
g
j
+
^
"
jklm
(2.4)
Re˛ectancedatawaspre-adjustedby˝ttingtheabovemodel,usingre˛ectanceatindividual
bandsasphenotypeexpandedwiththeinclusionofatime-pointe˙ect.Separatemodelswere˝tted
toeachofthewavebands.Aswithgrainyield,re˛ectancedatawerepre-adjustedbysubtracting
fromthemeasuredre˛ectancetheestimatedmeanandpredictedtime-point,trial,replicate,and
sub-blocke˙ects.
Forqualitycontrol,pre-adjustedgrainyieldandre˛ectancephenotypeswereremovedforthose
grainyieldscoreslyingbeyond3timestheinter-quantileregionfromthe0.25and0.75quantiles.
Afterpre-adjusting,allphenotypeswerestandardized(tohaveunitvariance);foreaseofexposition,
hereinafterwerefertotheadjusted-scaledphenotypes(includinggrainyieldandtheimagedata)
simplyasphenotypes.
2.5.6Heritabilityestimation
Afterpre-adjustingstandardization,weanalyzedphenotypesusingamixedmodeloftheform
y
ij
=
g
j
+
"
ij
(2.5)
where
y
ij
isthephenotypeforthe
i
th
observation(
i
hereisasingleindexforindices
k
,
l
,and
m
inEquation(2.4))ofthe
j
th
genotype;thegeneticvaluesare
g
j
iid
˘
N
¹
0
;˙
2
g
y
º
,where
˙
2
g
y
isthe
29
geneticvariance;andtheenvironmentaltermsare
"
ij
iid
˘
N
¹
0
;˙
2
"
y
º
.Plot-basisheritabilitywas
calculatedfromvariancecomponentsestimatesusing
h
2
y
=
˙
2
g
y
˙
2
g
y
+
˙
2
"
y
:
2.5.7Training-testingpartitions
Thedatasetcontainsinformationfrom39trialswith84observationseach.Toassesstheaccuracyof
indirectselection,werandomlyassignedcompletetrialstotestingsets.Thetrainingsetcomprised
allthedatafromthetrialsnotassignedtothetestingset.Thisapproachguaranteesthatnodata
fromasingletrialispresentinbothtrainingandtestingsets.Thisapproachaimsatrepresenting
asituationwhereonehascalibratedthecoe˚cientsoftheindexusinghistoricaltrialsandapply
thesecoe˚cientstoimagedataoffuturetrials.Asimilarvalidationschemehasbeenused(using
herd-year-seasongroupsinsteadoftrials)invalidationproblemsinpreviousstudiesinvolvingmilk
spectradata(Ferraginaetal.,2015).Ineachtraining-testingpartition,outofthe29trialsavailable,
26trials(
n
trn
ˇ
2
;
184
observations)wererandomlyassignedtothetrainingset,andthedata
fromtheremaining13trials(
n
tst
ˇ
1
;
092
)wasusedfortestingset.Theregressioncoe˚cients
oftheindices(the

'sforthestandardSI,PSI,andPC-SI)werecalculatedusinggrainyieldand
re˛ectancedataofthetrainingset.Estimatesofthecoe˚cientsandre˛ectancedatawerethen
usedtocalculatetheSI,
I
ij
=
x
0
ij
^

,foreachobservation
i
inthetestingset(
i
=
1
;:::;
n
tst
).The
heritabilityoftheindexandthegeneticcorrelationbetweentheindexandtheselectiongoalwere
estimatedinthetestingset.
Thetraining-testingproceduredescribedabovewasrepeated100timesbyrandomlyassigning
trialstotrainingandtestingsets.Fromtheseanalyses,wereportedthemeanofheritability,genetic
correlation,andselectionaccuracy;andtheirstandarddeviationacrosstraining-testingpartitions.
30
2.5.8Estimationofphenotypicandgeneticparameters
Thepopulationphenotypic(co)variancematrix
P
x
wasestimatedwithinthetrainingsetusingthe
unbiasedsample(co)variancematrixgivenby
^
P
x
=
1
n

1
Í
n
trn
i
=
1
¹
x
i

x
º¹
x
i

x
º
0
,where
x
isthe
vectorcontainingthesamplemeanofeachwaveband.Sincere˛ectancedataarecenteredand
standardized,thisreducesto
^
P
x
=
1
n

1
X
0
X
,where
X
=
»
x
1
;
x
2
;:::;
x
n
¼
0
isthematrixcontainingall
measuredtraitsinthetrainingset.
Thegeneticcovariance(
G
x
j
;
y
)betweengrainyieldandthe
j
th
measuredtrait(
j
=
1
;:::;
p
)was
estimatedusingasequenceofunivariategeneticmodelsasinEquation(2.5).We˝ttedthatmodel
withgrainyieldphenotypesasresponse,thenforeachofthere˛ectancebandsandthenforthesum
ofgrainyieldandeachofthebands.Thegeneticcovariancesbetweenthebandsandgrainyield
werethenestimatedusing
^
G
y
;
x
j
=
1
2

^
˙
2
g
y
+
x
j

^
˙
2
g
y

^
˙
2
g
x
j

where
^
˙
2
g
y
,
^
˙
2
g
x
j
and
^
˙
2
g
y
+
x
j
aretheestimatedgeneticvariancesforgrainyield,the
j
th
band,and
thesumofgrainyieldandthe
j
th
band,respectively.
2.5.9Estimationoftheaccuracyofindirectselection
Toassesstheaccuracyofindirectselectionweappliedtheregressioncoe˚cientsderivedinthe
trainingsettoimagedatafromthetestingsettoderive
I
ij
=
x
0
ij
^

.Then,usingamixedmodel
analysislikethatdescribedintheprevioussectionweestimatedtheheritabilityoftheSI(
h
2
I
),
theheritabilityofgrainyield(
h
2
y
),andthegeneticcorrelationbetweentheSIandgrainyield
(
cor
¹
g
I
i
;
g
y
i
º
).Fromtheseestimates,wederivedtheaccuracyofindirectselection,
Acc
¹Iº
=
h
I
cor
¹
g
I
i
;
g
y
i
º
,andtherelativee˚ciency,
RE
=
h
I
h
y
cor
¹
g
I
i
;
g
y
i
º
.
2.5.10Software
AlltheaforementionedanalyseswereimplementedintheRsoftwareenvironment(RCoreTeam,
2019),version3.5.1.Linearmixedmodelswereimplementedusingthe"lmer"functionfromthe
31
LME4(Batesetal.,2015)R-package.ThesoftwarethatimplementstheLARSandcoordinate
descentalgorithmsareavailablethroughtheSFSIR-package(https://github.com/MarcooLopez/
SFSI).
2.5.11Materialsanddataavailability
ThedatausedinthisstudyarepubliclyavailablebyCIMMYT(https://www.cimmyt.org/)who
ownsallrightsinthedata.ThedataisalsoincludedintheSFSIR-package.TheR-scriptsneeded
toperformtheanalysespresentedinthisstudycanbefoundinthedocumentationoftheSFSI
R-package.
2.6Acknowledgments
WeacknowledgeCIMMYT'sGlobalWheatProgramthatprovidedbothexperimental˝eld
andHTPdatausedinthiswork.M.L.C.wassupportedbytheMonsanto'sBeachell-Borlaug
InternationalScholarshipProgram(MBBISP).
32
CHAPTER3
OPTIMALBREEDINGVALUEPREDICTIONUSINGASPARSEAMILINDEX
[Materialsubmittedto:
Genetics
(conditionalaccepted,revisionpending)]
MarcoLopez-Cruz
1
andGustavodelosCampos
2
;
3
;
4
;

1
DepartmentofPlant,SoilandMicrobialSciences,MichiganStateUniversity,USA
2
DepartmentofEpidemiologyandBiostatistics,MichiganStateUniversity,USA
3
InstituteforQuantitativeHealthScienceandEngineering,MichiganStateUniversity,USA
4
DepartmentofStatisticsandProbability,MichiganStateUniversity,USA

CorrespondenceandrequestshouldbeaddressedtoG.D.L.C.(e-mail:gustavoc@msu.edu).
33
3.1Abstract
GenomicpredictionusesDNAsequencesandphenotypestopredictgeneticvalues.Inho-
mogeneouspopulations,theoryindicatesthattheaccuracyofgenomicpredictionincreaseswith
samplesize.However,di˙erencesinallelefrequenciesandinlinkagedisequilibriumpatterns
canleadtoheterogeneityinSNPe˙ects.Inthiscontext,calibratinggenomicpredictionsusinga
large,potentiallyheterogeneous,trainingdatamaynotleadtooptimalpredictionaccuracy.Some
studiestriedtoaddressthissamplesize/homogeneitytrade-o˙bydesigningalgorithmstoidentify
anoptimaltrainingset;however,thisapproachassumesthatasingletrainingdatasetisoptimum
forallindividualsinthepredictionset.Here,weproposeanapproachthatidenti˝es,foreach
individualinthepredictionset,asubsetfromthetrainingdata(i.e.,asetofsupportpoints)from
whichpredictionsarederived.Themethodologythatwepropose(whichwelabelSparseSelection
Index,SSI)integratestraditionalSelectionIndexmethodologywithsparsity-inducingtechniques
commonlyusedinhigh-dimensionalregressionsettings.Thesparsityoftheresultingindexiscon-
trolledbyaregularizationparameter(

);theG-BLUP(thepredictionmethodmostcommonlyused
inplantandanimalbreeding)appearsasaspecialcasewhichhappenwhen

=
0
.Inthisstudy,we
presentthemethodologyanddemonstrate,usingtwowheatdatasets(averylargemulti-generation
breedingpanelandasmaller,highly-structured,dataset)withphenotypescollectedintendi˙erent
environments,thattheSSIcanachievesigni˝cant(anywherebetween5-10%)gainsinprediction
accuracyrelativetotheG-BLUP.
3.2Introduction
Selectiondecisionsinplantandanimalbreedingrelyonthepredictedgeneticmeritofselection
candidates.Earlypredictionmethodswerebasedeitheronphenotypesmeasuredinthesame
selectioncandidatesoronprogenytesting(e.g.,Lush,1935).Thesemethodswerelaterextended
intoselectionindices(Hazel,1943;Smith,1936)thatcanuseinformationfromvarioussources
ofcorrelateddata,includingsecondarytraitsmeasuredonthesameindividual,measurementsof
34
thesamephenotypecollectedfromrelatives,andcombinationsthereof(Lush,1948).Henderson
(1950)furtherextendedthemethodologybydevelopingmixed-modelsthatcaninclude˝xedand
randome˙ects.
TheBestLinearUnbiasedPredictor(BLUP)predictsbreedingvaluesbyborrowing(i.e.,aver-
aging)informationfrommultiplesourcesofcorrelateddata.Pedigreesoftentracebackalimited
numberofgenerations(e.g.,5);therefore,inmostanimalandplantbreedingdatasets,pedigree
relationshipsoftende˝neandborrowingofinformationspanwithinthescopeofeach
family.However,thisisnotthecaseingenomic-BLUP(G-BLUP;VanRaden,2008)because
genomicrelationshipsarenotsparseaspedigree-derivedrelationships.
Inthelasttwodecades,genomicprediction(i.e.,genomicselection,GS;Meuwissenetal.,
2001)hasbecomethemethodofchoiceforbreedingvalueprediction.GSmodelspredictbreeding
valuesusinggenome-widemarkersandrelyinthemulti-locuslinkagedisequilibrium(LD)between
SNPsandquantitativetraitloci(QTL).However,itisalsowell-establishedthatfamilyrelationships
andpopulationstructurecontributetotheaccuracyofgenomicprediction(Habieretal.,2007).Ina
Genomicrelationshipmatrix(VanRaden,2007)allindividualsarerelatedtosomeextent;therefore,
everytrainingdatapointcontributestothepredictionofeachindividualinthetestingset.
Genomicpredictionmodelswereoriginallydevelopedwithreferencetoahomogeneouspopu-
lationinwhichmarkere˙ectsareassumedtobethesameacrosssubgroupsofthedata.However,
severalfactors,includingimperfectLDbetweenmarkersandQTLandnon-additivee˙ectscou-
pledwithpopulationstructureandadmixturecanmakemarkere˙ectsvaryacrosssubgroupsin
thesample(delosCamposetal.,2015;Pritchard&Donnelly,2001).Allthesefactorscanmake
thegenomicrelationshipsderivedfrommarkersinaccuratepredictorsofthegenomicrelationships
realizedatcausalloci(e.g.,delosCamposetal.,2013b).Therefore,theaccuracyofG-BLUPmay
besub-optimalwhenthetrainingdataconsistsofheterogeneousgroups(e.g.,multiplefamiliesor
multiplestrainsorbreeds)orevenmulti-generationdatainwhichLDpatternsmayvaryacross
distantgenerations.
SeveralauthorshaverecognizedtheneedtomodelheterogeneousSNP-e˙ectsinthecontextof
35
multi-breed(e.g.,Hayesetal.,2009)andstructured(e.g.,delosCamposetal.,2015)data.Mostof
theexistingmethodsmodelgroup-speci˝ce˙ectsusingeithermultivariateGaussianmodels(e.g.,
Olsonetal.,2012;Schulz-Streecketal.,2012)orinteractionmodels(e.g.,delosCamposetal.,
2015;Isidroetal.,2015;Veturietal.,2019).However,theseapproachescanbedi˚culttousein
thepresenceofcrypticgenetic-heterogeneitypatternswherenocleargroupscanbediscerned.
Anotherlineofresearchseekstoidentifyanoptimaltrainingsetforagivenpredictionset.
Theseoptimaltrainingsetsoftenconsistofindividualsthatarecloselyrelatedtotheindividualsin
thepredictionset,i.e.,thecandidatesofselection(Rincentetal.,2012;Akdemiretal.,2015;Isidro
etal.,2015;Pszczola&Calus,2016;Akdemir&Isidro-Sanchez,2019).However,thesemethods
assumethatasingletrainingsetisoptimalforalltheindividualsinthepredictionsetwhichisnot
necessarilythecase.Therefore,inthisstudy,wefocusondevelopingagenomicpredictionmethod
thatwillidentify,foreachindividualinapredictionsetanoptimaltrainingset(i.e.,asetofsupport
points).Ourapproachachievesthisgoalbyintegratingsparsity(byaddinganL1-penalty)intoa
selectionindex(SI)problem,werefertothemethodasasparseselectionindex(SSI).
3.3MaterialsandMethods
Astandardselectionindex(
I
i
)predictsthebreedingvalueofanindividual(
u
i
)usingalinear
combinationofthetrainingphenotypes(
y
=
¹
y
1
;:::;
y
n
º
0
):
I
i
=

0
i
y
=
Í
n
j
=
1

ij
y
j
.Here,pheno-
typesareassumedtobecenteredandcorrectedbynon-genetice˙ects(e.g.,experimentandblock
e˙ects),and

i
=
¹

i
1
;:::;
in
º
0
isavectorofweightswhichareobtainedasthesolutiontothe
followingoptimizationproblem:
^

i
=
argmin

i
1
2
E

u
i


0
i
y

2
Theright-handsideoftheaboveproblemexpandsto
E
¹
u
2
i
º
+

0
i
E
¹
yy
0
º

i

2
E
¹
y

u
i
º
0

i
.
Assumingthatgenetic(
u
i
)andnon-genetice˙ects(
"
i
)areindependent,eachwithmeanzeroand
(co)variancematricesvar
¹
u
º
=
˙
2
u
G
andvar
¹
"
º
=
˙
2
"
I
,wehavethat
E
¹
u
i


0
i
y
º
2
=
˙
2
u
+

0
i
P

i

2
˙
2
u
G
0
i

i
,where
˙
2
u
isageneticvarianceparameter,
P
=
˙
2
u
G
+
˙
2
"
I
isthephenotypic(co)variance
36
matrixofthetrainingphenotypes,and
G
i
isavectorcontainingthegeneticrelationshipsbetween
the
i
th
subjectofthepredictionsetandeachofthesubjectsinthetrainingdata.Since
˙
2
u
doesnot
dependon

i
,theaforementionedoptimizationproblemcanbereducedto
^

i
=
argmin

i

1
2

0
i
¹
G
+

0
I
º

i

G
0
i


(3.1)
where

0
=
˙
2
"
˙
2
u
=
1

h
2
h
2
istheratiooftheerrortothegeneticvariance,whichcanbeexpressedin
termsoftheheritability,
h
2
.Thesolutiontotheaboveproblemcanbeshowntobe
^

i
=
¹
G
+

0
I
º

1
G
i
(3.2)
Thevector
^

i
canbeshowntobethe
i
th
rowoftheHatmatrixoftheBLUPsofthegenetic
valuesoftheindividualsinthepredictionset(seeAppendixB.1foraproof),thus,dependingon
whether
G
isapedigree-orgenomic-derivedrelationshipmatrix,thestandardSIisequivalenttoa
pedigree-(Henderson,1963)orgenomic-BLUP,respectively.
When
G
isapedigree-basedrelationshipmatrixtheo˙-diagonalentriescorrespondingtopairs
ofsubjectsnotconnectedthroughthepedigreeareequaltozero.Inthatcase,someoftheentries
of
^

i
canalsobeequaltozerowhichimpliesthatthecorrespondingpredictedbreedingvalue
(
^
I
i
=
^

0
i
y
)drawsinformationfromasubsetofthetrainingdata.However,when
G
isagenomic
relationshiptypicallynoneoftheo˙-diagonalsareequaltozero;therefore,noneoftheentriesof
^

i
willbeexactlyequaltozero.Thisimpliesthatalltheobservationsinthetrainingsetcontribute
tosomeextenttopredictthebreedingvaluesofalltheindividualsinthepredictionset.
3.3.1SparseSelectionIndexMethodology
Asnotedearlier,thereareseveralreasons(e.g.,imperfectLD,e˙ectheterogeneity)whyborrowing
ofinformationbetweendistantlyrelatedindividualsmayhaveadetrimentale˙ectonpredictionac-
curacy.Therefore,toachievesparsity(andpossiblydi˙erentialshrinkageonthe
^

i
)weconsidered
addinganL1-penaltytotheobjectivefunctioninEquation(3.1);therefore,
~

i
=
argmin

i
2
6
6
6
6
4
1
2

0
i
¹
G
+

0
I
º

i

G
0
i

+

n
Õ
j
=
1
j

ij
j
3
7
7
7
7
5
(3.3)
37
Theaboveoptimizationproblemdoesnothaveaclosed-formsolution;however,solutionscan
beobtainedusingaCoordinateDescentalgorithmverysimilartotheoneusedtosolveLASSO
problems(seeLopez-Cruzetal.,2020).Theregularizationparameter

controlshowsparse
~

i
willbe;thisparameterisalsoexpectedtoa˙ecttheaccuracyoftheindex.Therefore,anoptimal
valueof

canbefoundbymaximizingtheaccuracyoftheresultingindex.
3.3.2Data
Weusedtwowheatbreedingdatasetstoevaluateandtocomparethepredictionperformanceof
standardandsparseselectionindices.The˝rstdataset(Wheat-large)isamulti-generationwheat
breedingdatasetofaverylargesamplesize(
n
˘
29
;
000
).Thesecondoneis(Wheat-599)isa
small,structureddata(seeSupplementaryFigureB.1).
TheWheat-largedatasetisfromCIMMYT'sGlobalWheatProgramanditincludesphenotype
datafrom
58
;
798
wheatlinesthatwereevaluatedduring˝veyears(2009-2013)attheCIMMYT's
experimentalstationinCiudadObregon,Mexico.Lineswereevaluatedundersixenvironmental
conditions(
B2I
,
B5I
,
MEL
,
LHT
,
DRB
,
EHT
)representingacombinationofplantingsystem(bed
vs˛at,thelaterreferredtoasmelgas),numberofirrigations(2,5irrigationsordripirrigation),
andsowingdate(optimum,lateorearlyplanting).Eachyear,grainyieldtrialswereestablishedin
an

-latticedesignwiththreereplicatesintoincompleteblocks.Moisture-standardizedgrainyield
(tonha

1
)wasmeasuredateachplot.Weusedmixed-e˙ectsmodelswitha(`˝xed')interceptand
therandome˙ectsofthetrial,block(withintrial)andreplicate(withintrial)toderiveleast-square
meansbylineandenvironmentalcondition.Separatemixedmodelswere˝ttedtodatafromeach
ofthesimulatedenvironments.Theaveragegrainyieldinthisdatasetvariedfrom2.72to7.12ton
ha

1
(seeSupplementaryFigureB.2Bforboxplotsofgrainyield)andtheheritabilityofsingle-
plotrecordsvariedbetween0.23and0.57(seeSupplementaryTableB.1forasummaryofthe
data).Onlyasubsetof
29
;
484
genotypeswasgenotypedusingaGBS(Genotyping-by-sequencing)
technologythatyielded
42
;
706
SNPs.WeremovedSNPswithmorethan70%ofmissingvalues
andthosewithminorallelefrequencylowerthan5%.Afterapplyingthese˝lters,weretained
38
9,045SNPs.ThemissingvaluesateachSNPwereimputedasthemeanoftheobservedSNPdata
acrossgenotypes.ThedatasethasbeenpreviouslydescribedandanalyzedbyPérez-Rodríguez
etal.(2017).
TheWheat-599datasetisalsofromCIMMYT'sGlobalWheatProgramanditiscomprised
ofgrainyieldandgenotypedatafor599historicalinbredlinesderivedalong25years.Lineswere
evaluatedintheEliteSpringWheatYieldTrials(ESWYT)thatweregroupedintofourdi˙erent
mega-environments(
Env1
,...,
Env4
).Theavailablephenotypicvaluesareleast-squaremeans
fromtworeplicates.Theaveragegrainyieldinthisdatasetrangedfrom3.23to5.14tonha

1
(seeSupplementaryFigureB.2Aforboxplotsofgrainyielddata)withheritabilityestimatesfor
theleast-squaremeansrangingbetween0.43to0.50(seeSupplementaryTableB.2).Eachofthe
lineswasgenotypedfor
1
;
279
diversityarraytechnology(DArT)markers.Thedatasetisavailable
withtheBGLRR-package(Perez&delosCampos,2014)andhasbeendescribedandanalyzedby
previousauthors(e.g.,delosCamposetal.,2009b;Crossaetal.,2010).
3.3.3Analyses
Foreachdataset,wecomputedagenomicrelationshipmatrix
G
using(centeredandstandardized)
markerinformation,
X
=
f
x
im
g
,as
G
=
ZZ
0
š
p
,where
p
isthenumberofmarkersand
Z
=
f¹
x
im

x
m
ºš
sd
x
m
ºg
isthematrixofcenteredandstandardizedmarkersobtainedbysubtracting
fromeachmarkerentrythemeanofeachcolumn(
x
m
)followedbyscalingbythestandarddeviation
ofthecolumn(
sd
x
m
).Theresultingmatrixhasanaverageofthediagonalelementsequalto1.
Toquantifythepredictionaccuracyofeachoftheindices,wedividedeachdatasetintotraining
(trn)andtesting(tst)setsbyrandomlyassigning30%(70%)ofthedatapointstotesting(training).
Predictionswerederivedby˝rstusingEquation(3.2):
^

i
=
¹
G
+

0
I
º

1
G
i
(forthestandardSI)andEquation(3.3):
~

¹

º
i
=
argmin

i
2
6
6
6
6
4
1
2

0
i
¹
G
+

0
I
º

i

G
0
i

+

n
Õ
j
=
1
j

ij
j
3
7
7
7
7
5
39
(fortheSSI),with
G
=
G
trn
representingthegenomicmatrixofthetrainingdatapoints(i.e.,with
dimensions
n
trn

n
trn
,where
n
trn
=
0
:
7
n
),and
G
i
=
G
trn
;
tst
¹
i
º
beingthevectorcontainingthe
genomicrelationshipsbetweenthe
i
th
data-pointofthetestingset,witheachoftheindividuals
assignedtothetrainingset(i.e.,thedimensionsof
G
i
are
n
trn

1
).Thiswasrepeatedforeach
individualinthetestingset(
i
=
1
;:::;
n
tst
,where
n
tst
=
0
:
3
n
).Subsequently,predictionsforeach
individualwereobtainedusing
^
I
i
=
^

0
i
y
trn
(forthestandardSI)and
^
I
i
=
~

¹

º
0
i
y
trn
(fortheSSI)
where
y
trn
isa
n
trn

1
vectorwiththeadjusted-centeredphenotypesofthetrainingset.
TheimplementationoftheSIrequiresheritabilityestimates.Wederivedthoseby˝ttinga
G-BLUPmodeloftheform
y
i
=

+
u
i
+
"
i
with
"
i
iid
˘
N
¹
0
;˙
2
"
º
and
u
˘
N
¹
0
;˙
2
u
G
º
.The
modelwas˝ttedusingtherrBLUPR-package,separatemodelswere˝ttedtograinyieldwithin
eachenvironmentineachdatasetwithinthetrainingset.Wethenusedthevarianceparameters
estimatestoderive
h
2
=
˙
2
u
š¹
˙
2
u
+
˙
2
"
º
forgrainyield.
Predictionaccuracy(
ˆ
)wasmeasuredwiththecorrelationbetweenthephenotypeandthe
index,dividedbythesquare-rootofthetraitheritabilityofthetrait,
ˆ
=
Acc
¹
^
Iº
=
cor
¹
^
I
i
;
y
i
ºš
h
(Dekkers,2007).
FortheSSI,weestimatedaccuracyoveragridofvaluesoftheregularizationparameter
(

=
0
<
¹
1
º
<
¹
2
º
<

<
max
)where

max
=
max
i
f
j
G
i
j
dia
g
¹
G
º
+

0
g
.Here

max
istheminimum
valueof

thatyieldsanSSIwithnoactivepredictors,and

=
0
givestheweightsofthestandard
SI.FollowingFriedmanetal.(2010)weusedagridofvaluesevenlyspacedinthelogarithmscale
withatotalof100values.Thus,foreachvalueof

inthegrid,wehadanestimateoftheresulting
accuracyoftheSSI.Thiswasusedtopro˝leaccuracyasafunctionoftheregularizationparameter
andalsotochooseanoptimalvalueof

.
Todetermineanoptimalvalueof

weimplementedacalibrationanalysisusingdatafromthe
trainingdataonly.Speci˝cally,foreachtrainingset,weconductedaninternalcross-validation
(CV)asfollows:(
i
)Thetrainingdatasetwaspartitionedinto
k
subsets.(
ii
)SSIswerederived
overagridofvaluesof

usingdatafrom
k

1
foldsfortrainingandthedatainthe
k
th
foldas
testing(i.e.,fortheestimationofaccuracy,seethepreviousparagraph).(
iii
)Theresultingcurves
40
pro˝lingaccuracy(
ˆ
)byvaluesof

wereusedtoidentifythevalueof

(
^

c
v
)thatmaximized
accuracy.(
i
v
)Finally,weusedallthedatafromthetrainingsettoderive
I
i
¹
^

c
v
º
andevaluatedthe
accuracy(
ˆ
)oftheresultingindexintheleft-outdatafromthetestingset.
3.3.4Software
AlltheanalyseswereperformedintheRenvironment-language(RCoreTeam,2019)version
3.5.TheheritabilityofeachofthetraitswasestimatedusingtherrBLUPR-package(Endelman,
2011).SparseSIswerederivedusingtheSSIfunctionfromtheSFSIR-packagethatimplements
theCoordinateDescentalgorithmdescribedinLopez-Cruzetal.(2020).Thepackageisaided
byggplot2(Hadley,2016)andparallel(RCoreTeam,2019)packagestovisualizeresultsandto
speedcomputation.ThispackageisavailablethroughtheGitHubrepositoryathttps://github.com/
MarcooLopez/SFSI.ScriptsillustratingtheuseofthispackageusingtheWheat-599datasetare
presentedintheAppendixB.2.
3.3.5Dataavailability
BothphenotypicandmarkerdatafortheWheat-largedatasetcanbedownloadedfromCIM-
MYT'srepositoryathttp://genomics.cimmyt.org/wheat_50k/PG/(accessedSep14th,2020).The
Wheat-599datasetcanbedownloadedfromtheBGLRR-packagebycallingAll
supplemental˝guresandtablesarecontainedintheAppendixB.Allsupplemental˝lesareavailable
atFigshareathttps://˝gshare.com/s/ce2258d168b16a86454d.
3.4Results
3.4.1Sparsityimprovespredictionaccuracy
Weassessedthee˙ectofsparsityontheaccuracy,by˝ttingtheSSIfor100valuesof

(
0
<
¹
1
º
<

¹
2
º
<

<
max
;thevalue

=
0
producesthecoe˚cientsofthestandardSIorG-BLUP).The
results(averagedover100trn-tstpartitions)areshowninFigure3.1.Thenumberofsupportpoints
41
(i.e.,thenumberoftrainingdatapointscontributingtotheprediction)was,asexpected,inversely
proportionalto

;therefore,tofacilitateinterpretation,thex-axisofFigure3.1isdisplayedasthe
averagenumberofsupportpoints,whichismoremeaningfulthanthe

values.Theaccuracyof
theG-BLUPisalsoshownattherightmostsideoftheplotwhosenumberofsupportpointsisequal
tothesizeofthetrainingdataset.Intermediatevaluesof

ledtosparseindicesthat,inmostcases,
achievedhigherpredictionaccuracythanthatoftheG-BLUP(shadedareainFigure3.1).
Figure3.1:Predictionaccuracy(averageacross100trn-tstpartitions)oftheSSIversusthe(average)
numberofpredictorsintrainingsetsupportingtheSSIofeachindividualintestingset(x-axis).
Genomic-BLUP(bluerightmostpoint)appearsasaspecialcaseofanSSI.Eachpanelrepresents
oneenvironmentwithindataset.(A)Wheat-largedataset.(B)Wheat-599dataset.Verticalbars
representa95%con˝denceintervalfortheaverage.
42
Themaximumaccuracyintheenvironment
EHT
(seeFigure3.1A)wasobtainedwitha
penalizationthatleadstoasparseindexwithanaverageof120supportpoints(
n
sup
).This
predictivesetofindividualsrepresentsaround8%ofthetotaltrainingset(
n
trn
=
1
;
428
)available
forprediction.
Forthesmalldataset(Figure3.1B),thesamepatterncanbeobservedinallenvironments,
exceptforenvironment2.ThiscaseshowsthattheSSIdoesnotalwaysoutperformtheG-BLUP;
however,theSSIachievesthepredictionaccuracyoftheG-BLUPwithasmallersupportset
(
n
sup
ˇ
151
outof419).
Figure3.2:Predictionaccuracyoftheoptimalsparseselectionindex(SSI)versusthatofthe
G-BLUP.Eachpointrepresentsatrn-tstpartition(atotalof100partitionswereimplemented),the
pointshapeandcolorrepresentenvironments.(A)Wheat-largedataset.(B)Wheat-599dataset.
Thevalueof

intheSSIwasestimatedusing105-foldcross-validationsconductedwithinthe
trainingdata.Inparenthesis,bythelegend,isthep-valueforthetwo-sidedSign(binomial)testfor
within-environmentdi˙erencesinaccuracybetweentheSSIandtheG-BLUP.
3.4.2Usinganinternalcross-validationtoachieveoptimalsparsity
TheresultsinFigure3.1suggestthatonecan˝ndavalueof

thatleadstoanindexwitha
predictiveperformanceasleastashigh(andinmostcaseshigher)astheG-BLUP.However,to
obtainanunbiasedestimateofthemaximumaccuracythatonecouldachievewithanSSI,one
43
shouldnotusedatafromthetestingsettoselecttheoptimalvalueof

.Therefore,werepeated
theanalysesdescribedabove,thistimeperformingthegridsearchforanoptimalvalueof

by
implementing105-foldCVswithineachtrainingdataset.ThisCVwasusedtochooseanoptimal
valueof

(
^

c
v
).Then,wesolvedtheSSIusing
^

c
v
andallthetraininggenotypes,andevaluated
theaccuracyof
I
i
¹
^

c
v
º
inatestingsetthatwasnotusedtochoose
^

c
v
.Thiswasrepeatedfor100
trn-tstpartitions.Figure3.2showstheaccuracyof
I
i
¹
^

c
v
º
versusthatoftheG-BLUP,eachpoint
intheplotrepresentsatrn-tstpartition.
Table3.1:Predictionaccuracy(averageacross100partitions)achievedbysparseselectionindices
(SSIs)andbytheG-BLUP(standardSI),bydatasetandenvironmentalcondition.
Environment
n
tst
n
trn
Method

c
v
a
n
sup
b
Accuracy(SD)
c
Counts
d
Wheat-large
B2I1,1202,612
G-BLUP0.00002,6120.617(0.031)b
97
SSI0.01354340.648(0.031)a
B5I8,84220,631
G-BLUP0.000020,6310.555(0.010)b
100
SSI0.01071,4700.609(0.009)a
MEL1,3213,082
G-BLUP0.00003,0820.600(0.045)b
99
SSI0.01315240.661(0.046)a
LHT1,3223,082
G-BLUP0.00003,0820.669(0.024)b
99
SSI0.01683800.709(0.025)a
DRB1,1292,634
G-BLUP0.00002,6340.629(0.035)b
98
SSI0.03221360.675(0.037)a
EHT6121,428
G-BLUP0.00001,4280.614(0.049)b
94
SSI0.03011780.649(0.047)a
Wheat-599
Env1180419
G-BLUP0.00004190.721(0.070)b
87
SSI0.0413780.760(0.067)a
Env2180419
G-BLUP0.00004190.702(0.087)a
41
SSI0.01232540.692(0.085)a
Env3180419
G-BLUP0.00004190.585(0.101)a
53
SSI0.0613840.586(0.093)a
Env4180419
G-BLUP0.00004190.663(0.082)b
87
SSI0.0617540.714(0.075)a
SD:Standarddeviationacrossthe100trn-tstpartitions.G-BLUPmodelcorrespondstoanSSIwhere

=
0
.
a
Averagevalueof

estimatedbycross-validatingthetrainingset.
b
Averagenumberofindividuals
intrainingsetsupportingthepredictionofindividualsfromtestingset.
c
Modelswiththesameletterarenot
signi˝cantlydi˙erentfromothers(ANOVAfollowedbyTukey'sHSDtest,5%signi˝cancelevel).
d
Number
oftimes(outofthe100partitions)thattheSSIoutperformedtheG-BLUPinpredictionaccuracy.
44
IntheWheat-largedataset,theoptimalSSIoutperformedtheG-BLUPin94%ofthecases
(Table3.1).Forthisdataset,theSSIo˙eredsigni˝cant(accordingtoTukey'sHonestSigni˝cance
Di˙erencetest,HSD)gainsinaccuraciesacrossenvironments.Thesegainsrangefrom5%(inthe
environmentB2I)to10%(intheenvironment
MELGAS
)inthecorrelationmetric.
SimilarpatternswereobservedwiththeWheat-599dataset.InEnvironments1and4the
SSIoutperformedtheG-BLUPinmorethan80%ofthetrn-tstpartitions(Table3.1),withgains
inaccuracyrangingfrom5-7%.However,inEnvironments2and3,therewerenostatistically
signi˝cantgainsinaccuracy(seeTable3.1).
Figure3.3:Distributionofthenumberoftrainingsupportpoints(
n
sup
)inoptimalsparseselec-
tionindices(resultsobtainedover100trn-tstpartitions;
n
trn
=sizeofthetrainingdataset),by
environmentalcondition,Wheat-largedataset.
3.4.3SparseSelectionIndicesbuildsubject-speci˝ctrainingsets
Foreachindividualinthepredictionset,anSSIyieldsasetofsupportpointsinthetrainingset
consistingoftheindexofallthenon-zeroentriesof
~

¹

º
i
.Figure3.3showsthedistribution(across
100trn-tstpartitions)ofthenumberofsupportpoints(
n
sup
)for
^

c
v
foreachoftheenvironments
45
oftheWheat-largedataset.At
^

c
v
,
n
sup
rangesfrom30to
ˇ
5
;
000
.In3oftheenvironments(
B2l
,
MELGAS
,and
LHT
)theaveragenumberofsupportpointswas
n
sup
ˇ
450
,thatis
˘
15

20
%of
thesizeofthetrainingset.Inenvironment
B5l
,theproportionofactivetrainingsupportpointswas
˘
5

10
%.Ontheotherhand,inenvironment
EHT
predictionsreliedonanaverageof
n
sup
ˇ
178
(of
1
;
428
)individualsfromtraining(Figure3.3).SimilarpatternswerealsoobservedintheWheat-
599data(SupplementaryFigureB.3);forinstance,testingphenotypesfromenvironment1were
optimallypredictedinaveragewith
n
sup
ˇ
78
(of419);however,therelativesparsity(
n
sup
š
n
trn
)
wassmallerintheWheat-largedataset(5-17%)thanintheWheat-599dataset(12-60%).
Figure3.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)andthe
correspondingsupportpoints(green).Greypointsrepresentgenotypesthatdidnotcontributeto
thepredictionofthegeneticvalueofthegenotypeinyellow.Allpanelsrepresentsolutionsforthe
environment
EHT
,Wheat-largedataset.
46
Figure3.4shows(forselectedtestinggenotypes)thecoordinatesonthe
1
st
and
2
nd
PCof
boththepredictionpoint(yellowcircle)andthetraininggenotypes.Activetraininggenotypes
arerepresentedinagreencircle,andthosenon-active(i.e.,withzeroweightintheindex)are
representedingrey.Insomecases,thesupportsetincludestraininggenotypesthatarenearby
(accordingtothecoordinatesonthe
1
st
2PCs)thepredictionpoint.However,inothercases,
thesupportsetspannedoutsidetheofwherethepredictionpointresides.This
suggeststhattheSSIdoesnotnecessarilychoosetrainingpointswithintheclusters.Asimilarplot
fortheWheat-599datasetispresentedinSupplementaryFigureB.4.
3.4.4Genomicrelationshipsandweightsinstandardandsparseselectionindices
Figure3.5Ashowsthecoe˚cientsoftheG-BLUPandoftheSSI(i.e.,the

ij
'sderivedfrom
Equations(3.2)and(3.3),respectively)versusthegenomicrelationship(
g
ij
,the
ij
entryof
G
).
InFigure3.5A,the

ij
'swerederivedforatraining-testingpartitionwith˝xedheritabilityand

chosenbyCVconductedwithinthetrainingset,forenvironmentEHTfromtheWheat-largedata
set.TheweightsusedbytheG-BLUPare,asexpected,alldi˙erentfromzeroandarepositively
associatedwiththegenomicrelationships(i.e.,onaverage,traininggenotypescloselyrelatedto
genotypesinthepredictionsetreceivehigherweightontheindex).However,thepointsdonotfall
overaperfectlinebecausetheweightgiventoeachofthetrainingpointsdependsnotonlyonthe
relationshipbetweenthetrainingpointandthepredictionpointbutalsoontherelationshipsamong
trainingpoints.Ontheotherhand,asexpected,theSSIzero-outsmostoftheweights.Interestingly,
theSSIseemstozero-outmostoftheweightsthatareinthetopleftandlower-rightquadrants(i.e.,
pointsthathadanegative(positive)relationshipandintheG-BLUPgotpositive(negative)weight,
comparebothplotsinFigure3.5A).PanelBinFigure3.5showstheproportionofcoe˚cients
thatarezeroed-outbylevelofgenomicrelationship.Mostofthecoe˚cientscorrespondingto
traininggenotypeswithrelationshipswithpredictionpointsbetween-0.1and0.1arezeroed-out;
theproportionofcoe˚cientsthatarezeroeddecreasesrapidlyas
g
ij
increases;however,the
decreaseseemstobefasterfortheWheat-largedatasetthanfortheWheat-599(Supplementary
47
FigureB.6).Interestingly,theproportionofcoe˚cientszeroed-outalsodecreasesfor`negative'
genomicrelationships,suggestingthattheSSIdoesnotusea`local'supportset;instead,theSSI
seemstouseinformativesupportpoints.Atleastinthecontextofa`linear'kernelastheoneused
here,anegativepriorcorrelation(i.e.,
g
ij
<
0
)canbeinformative.Thepatternsobservedinother
environmentsoftheWheat-largedatasetandinthefourenvironmentsoftheWheat-599dataset
wereconceptuallysimilartotheonespresentedinFigure3.5(seeSupplementaryFiguresB.5and
B.6).
Figure3.5:(A)Weights(

ij
)ofastandardSI(G-BLUP)andoftheoptimalsparseselectionindex
(SSI)versusthegenomicrelationship(
g
ij
).(B)ProportionofweightsintheSSIthatwerezero
(non-active)andnon-zero(supportpoints);Wheat-largedataset,environment
EHT
.
3.5Discussion
Samplesizehasbeenrecognizedasoneofthemainfactorslimitingpredictionaccuracy
ingenomicprediction(Lorenzana&Bernardo,2009;delosCamposetal.,2013a;Habieretal.,
2013).Inun-structuredpopulations,SNPe˙ectscanbeassumedtobehomogeneousand,therefore,
genomicpredictionaccuracyincreaseswithsamplesize(e.g.,Daetwyleretal.,2008;delosCampos
etal.,2013a).However,thisisnotnecessarilythecaseinstructuredandadmixedpopulations,in
multi-familydata(e.g.,datafrombi-parentalfamilies),orinmulti-generationdata.Inthosecases,
di˙erencesinallelefrequenciesandinLD-patternsacrossmaymakeSNPe˙ectsheterogenous
acrosssubgroupsinthesample.Inthatcontext,alargertrainingdatasetmaynottranslateinto
48
higherpredictionaccuracy.Thisphenomenonhasbeenrecognizedinbothplantandanimal
breeding,aswellasincomplextraitpredictioninhumans.
Forexample,usingdatafromabroilerbreedingpopulation,Wolcetal.(2016)showedthatusing
trainingsetsthatincludeddatafrommanygenerationsledtoslightlylowerpredictionaccuracythan
theoneachievedwhenmodelsweretrainedwithdatafromjustthelast3generations.Likewise,
Hayesetal.(2009)showedthatthepredictionaccuracyforHolsteincattlewasnotimprovedby
addingtothetrainingsetdatafromJerseycattle.Inplantbreeding,usingdatafrombi-parental
families,Jacobsonetal.(2014)reportedthatwithinfamilypredictionaccuracycouldbeincreased
bytrainingmodelsusingonlydatafromfamiliesthatshareatleastoneoftheparents.Finally,in
thecontextofhumandata,delosCamposetal.(2013b)notedthattheaccuracyofSNP-derived
genomicrelationshipscouldbeverylowfordistantlyrelatedindividuals.Thus,combiningfamily
datawithlargevolumesofdatafromdistantlyrelatedindividualsmaynotimprove(ormayeven
reduce)predictionaccuracyrelativetomodelstrainedwithfamilydataonly.
Thus,whendataoriginatesfromheterogeneoussourcestheremaybetrade-o˙sbetweensample
sizeandthepossibilityofhavingahomogenousdatasetinwhichSNPcanbeconceivedas
homogenouswithinthetrainingdataandbetweentrainingandtestingsets.Therecognitionthat
ingenomicprediction'biggerisnotalwaysbetter'ledtothedevelopmentofseveralmodelsand
model-trainingstrategiesaimingtoimprovepredictionaccuracy.Onelineofresearchattemptsto
modele˙ectheterogeneityusinggroup-speci˝ce˙ects(e.g.,Veturietal.,2019;Rioetal.,2020).
However,thisapproachisusefulwhenindividualsclusterinasmallnumber(e.g.,2or3)of
well-de˝nedclusters,andbecomeslessusefulanddi˚culttoapplywhendataischaracterized
byeitheralargenumberofgroups(e.g.,bi-parentalfamilies)orwhengroupsoverlapincryptic
manners(e.g.,admixedpopulationsorpartially-overlapping-multi-generationdata).Anotherline
ofresearchseekstoidentifyantrainingbyeitherselectingdatafromindividuals
thatarecloselyrelatedtothepredictionset(e.g.,Rincentetal.,2012;Jacobsonetal.,2014;Wolc
etal.,2016)orbyusingmoresophisticatedoptimizationalgorithms(e.g.,Akdemiretal.,2015).
However,theseapproachesassumethatitispossibletobuildatrainingsetthatisoptimalforall
49
thegenotypesinthepredictionset.Ourapproachdi˙ersfromtheseonesinthatwedevelopeda
methodologythatbuildssubject-speci˝ctrainingsets.Indeed,theSSIselects,foreachindividual
inthepredictionset,acustomtrainingset(orsupportpoints)fromwhichpredictionsarederived.
OurapproachbuildsonselectionindexmethodologybyaddinganL1-(sparsity-inducing-)penalty
intotheoptimizationproblem.Theresultisanindexinwhichonlyasubsetofthetrainingpoints
contributestopredictionaccuracy.
Whenthetrainingdataconsistsofdisconnectedfamilies,pedigreeBLUPequationscanalso
besparse.However,thisisnotthecaseoftheG-BLUPbecausegenomicrelationshipmatrices
aredense.TheSSIbringsbacksparsityintogenomicprediction.Thelevelofsparsityislargely
controlledbythepenalizationparameter(

).Thisparametercanbetunedusingcross-validation
withinthetrainingdata.Aswithanyotherparameter,thevalueof

thatmaximizesaccuracy
maychangeslightlybetweentrn-tstpartitions;however,inourexperience,usingafew(e.g.,10)
training-testingpartitionsareenoughtoobtainanaccurateestimateofthevalueoftheregularization
parameterthatmaximizesaccuracy.
Asnoted,anSSIidenti˝es,foreachindividualinthepredictionset,anetworkofgenotypes
inthetrainingdataset(seeFigure3.4andSupplementaryFigureB.4)thatcontributetothe
prediction.At˝rstglance,thisappearssimilartotheapproachusedinak-nearestneighbor(KNN)
regression(Cover&Hart,1967).InKNN,the
k
geneticallyclosestindividuals(neighbors)predict
eachselectioncandidate,andpredictionsarederivedusinganaverageofthephenotypesinthe
neighborhood.Thereareimportantdi˙erencesbetweentheKNNandtheSSI.First,theKNNuses
onlymarginalsimilarities/distancesbetweenapredictionpointandthepointsinthetrainingdata
tobuilda'neighborhood';theSSI,however,alsoconsidersthecorrelations(i.e.,redundancies)
betweenpointswithinthetrainingdata.Asaconsequence,theoptimalsupportsetoftheSSImay
includesomedistantlyrelatedindividuals(seeFigure3.4andSupplementaryFigureB.4).Second,
whileinthestandardKNNpredictionsaresimplythearithmeticmeanofthephenotypesinthe
neighborhood,intheSSIeachtrainingpointcontributesdi˙erentlywithweights(the

ij
's)that
re˛ectboththecorrelationofthetrainingpointwiththepredictionpointaswellascorrelations
50
amongpointsinthetrainingset.
BestLinearUnbiasedPrediction(BLUP)methodsareequivalenttoL2-penalizedregressions.
InBLUP,shrinkageiscontrolledbythenoiseandsignalvariances(

0
=
˙
2
"
š
˙
2
u
,seeEquation
(3.2).WeaddedtotheoptimizationproblemanL1-penality;thus,theSSIusesbothL1andL2
(whichisintrinsicallybuiltintheSI)penalties.Therefore,theSSIcanbeseenasbeingatypeof
Elastic-Net(Zou&Hastie,2005)regression.However,intheSSItheweightontheL2-penalty
isdeterminedbytheratioofvariancecomponents(

0
=
˙
2
"
š
˙
2
u
)whichmayormaynotbean
optimalchoicefromapredictionperspective(particularlyiftheunderlyingassumptionsofthe
BLUPmethod,e.g.,homogeneityofe˙ects,donothold).Therefore,toadd˛exibilitytotheSSI
weconsideredexplicitlyaddingL1-andL2-penalties,andsearchingforanoptimalcombination,
usingcross-validation,oftherelativeweightsofthepenalizationparametersoftheElastic-Net(

and

)optimizationproblem:
~

¹
;
º
i
=
argmin

i
2
6
6
6
6
4
1
2

0
i
¹
G
+

0
I
º

i

G
0
i

+

1
2
¹
1


º
n
Õ
j
=
1

2
ij
+

n
Õ
j
=
1
j

ij
j
3
7
7
7
7
5
Toavoidtoo-muchpenalization,wedecreasedtheweightoftheinitialL2-penaltyto
0
:
5

0
.We
foundthatthispracticecouldincreasepredictionaccuracybyasmallfactor(2-3.5%,seeSupple-
mentaryTableB.3fortheWheat-largedataset)relativetotheoriginalSSImethod(Equation(3.3)).
However,thispracticedidnotprovideanyadvantageovertheoriginalSSIintheWheat-599data
set(seeSupplementaryTableB.4).
Inconclusion,wepresentedanovelpredictionmethodthatcombinesinasingleframework,
selectionindexmethodologywithsparsity-inducingmethods.TheresultingSSIidenti˝esoptimal
trainingsetsforeachpointinthepredictionset.Themethodcanbeusefulformultipleapplications,
includingtheuseingenomicpredictionofdatafromstructuredpopulations,bi-parentalfamilies,
andtheanalysesofverylargemulti-generationdatasets.
51
3.6Acknowledgments
WearegratefultoCIMMYT'sGlobalWheatProgramthatprovidedboththeexperimental˝eld
andmarkerdatausedinthisstudy.M.L.C.wassupportedbytheMonsanto'sBeachell-Borlaug
InternationalScholarshipProgram(MBBISP)andbytheDissertationCompletionFellowship
fundedbytheMichiganStateUniversityGraduateSchool.
52
CHAPTER4
GENOMICPREDICTIONINMULTI-GENERATIONALMAIZEHYBRIDSUSING
SPARSEKERNELMODELS
MarcoLopez-Cruz
1
,YosephBeyene
2
,ManjeGowda
2
,JoseCrossa
3
,
andGustavodelosCampos
4
;
5
;
6
1
DepartmentofPlant,SoilandMicrobialSciences,MichiganStateUniversity,USA
2
GlobalMaizeProgram,InternationalMaizeandWheatImprovementCenter(CIMMYT),Kenya
3
BiometricsandStatisticsUnit,InternationalMaizeandWheatImprovementCenter(CIMMYT),
Mexico
4
DepartmentofEpidemiologyandBiostatistics,MichiganStateUniversity,USA
5
InstituteforQuantitativeHealthScienceandEngineering,MichiganStateUniversity,USA
6
DepartmentofStatisticsandProbability,MichiganStateUniversity,USA
53
4.1Abstract
Therehasbeenmuchinterestintheuseoflargehistoricaldatatocalibratemoreaccurate
predictionmodelstoimprovecurrentbreedingprograms.However,multi-generationaldataoften
comeswithincreasedheterogeneitythatmightincludecomplexadmixturepatterns,inwhich
geneticrelationshipsarereducedasgenerationsadvance.Ithasbeenrecognizedthatdi˙erences
inheterogeneitypatternsbetweenthetrainingandthepredictionset,orincludinginthetraining
setindividualsthataredistantlyrelatedtothepredictionsetcanreducethepredictionaccuracy.
Mostoftheresearchinthissensefocusesondesigningoptimaltrainingsetsthatincludeonlyafew
previousgenerationsoragroupofgenotypesthataremorecloselyrelatedtothecurrentprediction
set.However,sometrainingindividualscanbeoptimalforsome,butnotallindividualsinthe
predictionset.Thesparseselectionindex(SSI)determines,foreachindividualintheprediction
set,acustomizedoptimaltrainingset.Usingadditivegenomicrelationships,theSSIcanprovide
anincreasedaccuracyrelativetothestandardG-BLUP.ModelswithGaussiankernels(K-BLUP)
havebeenshowntoyieldahigheraccuracybymaximizingthecovariancebetweencloselyrelated
genotypes.WestudiedwhethertheSSIusingGaussiankernelscanprovideincreasedaccuracies.
Usingathree-generationdoubledhaploidmaizedatasetfromtheInternationalMaizeandWheat
ImprovementCenter(CIMMYT),weshowthatthestandardK-BLUPoutperformedtheG-BLUP.
Also,wefoundthatusinganSSIwithadditivegenomicrelationships(sparseG-BLUP)leadsto
gainsinaccuracybetween5%-20%,relativetothestandardG-BLUP.FortheK-BLUP,thegains
obtainedbyaddingsparsityweresmallerandnotalwayssigni˝cant.
4.2Introduction
AlmosttwodecadeshavepassedsinceGenomicSelection(GS)was˝rstproposedby(Meuwis-
senetal.,2001).Thisgroundbreakingideawasquicklyadoptedforbreedingdairycattle(Hayes
etal.,2009),beefcattle(Garrick,2011),broilers(Wolcetal.,2016),maize(Bernardo&Yu,2007),
wheat(Polandetal.,2012),andmanyotheranimalspeciesandcrops(Xuetal.,2020).Overtime,
54
investmentsbypublicandprivateorganizationsledtothedevelopmentoflargegenomicdatasets
comprisingDNA-sequencesandphenotypes.Thelargesamplesizeofmoderngenomicdatasets
hasincreasedourabilitytotrainhigh-dimensionalgenomicpredictionequationsaccurately.
However,thelargersamplesizeoftencomeswithanincreasedgeneticheterogeneity,including
manygenerationsofdataandoftencomplexadmixturepatterns.Moreover,therehavebeensome
signsthatingenomicprediction,`biggermaynotalwaysbebetter'.Forexample,Wolcetal.(2016)
reportedthattheaccuracyofgenomicpredictioninabroilerbreedingprogramwashigherwhen
usingdatafromthelastthreegenerationsrelativetopredictionequationstrainedusingdatafromthe
last˝vegenerations.Likewise,Riedelsheimeretal.(2013)andJacobsonetal.(2014)reportedthat
thepredictionaccuracywashigherwhenmodelsweretrainedusingdatafrombiparentalfamilies
thatsharedatleastoneparentrelativetotrainingusingdatafromalltheavailablebiparental
families.
EarlyworkbyHabieretal.(2010)showedthatfamilyrelationshipshaveanimportantimpact
onpredictionaccuracy,andmanystudieshavedemonstratedthatdistantlyrelatedindividuals
makeasmall(sometimesnegligible)contributiontothepredictionaccuracy.However,asnoted
above,someevidencesuggeststhatusingtrainingsetsformedbyindividualsdistantlyrelatedtothe
genotypesofthepredictionsetmayactuallyhaveanegativeimpactonthepredictionaccuracy(e.g.,
Lorenz&Smith,2015).Thismayhappenif,forexample,heterogeneityinallelefrequencyand
inlinkagedisequilibrium(LD)patternsbetweenthetrainingandpredictionsetleadtoSNP-e˙ect
heterogeneity.
Issuesrelatedtodata-ande˙ect-heterogeneityhavespawnedmultipleresearche˙orts.One
lineofresearchmodelse˙ect-heterogeneityexplicitlyusingSNP-by-groupinteractionmodels
ormultivariatemodels(`multi-breedgenomicprediction')inwhiche˙ectsareassumedtobe
correlatedamonggroups(e.g.,Olsonetal.,2012;Lehermeieretal.,2015;Rioetal.,2020).This
approachhasshownpromise,butitisonlyadequatewhengenotypescanbeclusteredintoclearly
disjointgroups.
Anotherlineofresearchseekstoincreasepredictionaccuracybyoptimaldesignoftraining
55
datasets.Themethodsproposedandusedtoidentifyanoptimaltrainingsetspanfromsimple
threshold-basedmethods(e.g.,Clarketal.,2012;Lorenz&Smith,2015)tomoresophisticated
algorithmsthatseektominimizepredictionerrorvarianceandfunctionsthereof(Rincentetal.,
2012;Akdemir&Isidro-Sanchez,2019;Rothetal.,2020).Amainassumptionofthesetrainingset
optimizationmethodsisthatasingletrainingsetisoptimalforallindividualsinthepredictionset.
Butthismaynotbethecaseifsomegenotypesinthetrainingsetcanimprovepredictionaccuracy
forsomeofthecandidatesofselectionandreduceitforothers.
Toaddressthelimitationsofexistingmethods,inChapter3ofthisdissertation,wedevelopeda
predictionmethod(sparseselectionindex,SSI)thatidenti˝es,foreachindividualintheprediction
set,acustomizedtrainingset.Ourapproachintegratesintotheselectionindexmethodology(Smith,
1936;Hazel,1943),asparsity-inducingpenaltythatleadstosparseselectionindices.
InChapter3,weusedtheSSImethodologytopredictgrainyieldintwowheatdatasets.
Theapplicationpresentedinthatchapterusedadditivegenomicrelationships,andtheresults
showedthattheSSIoutperformedthegenomic-BLUP(G-BLUP;VanRaden,2008)by5-10%in
thecorrelationscale.
ReproducingKernelHilbertSpaces(RKHS)regressionhasshowngoodpredictiveperformance
inmanygenomicapplications(e.g.,delosCamposetal.,2010;González-Camachoetal.,2012).
TheG-BLUPisaspecialcaseofRKHSregressioninwhichalinearkernel(additivegenomic
relationships)isusedtodescribethegeneticsimilaritybetweengenotypes.However,several
studies(e.g.,Crossaetal.,2010;Morota&Gianola,2014)havesuggestedthatusingnon-linear
kernels(e.g.,Gaussiankernels)mayleadtoahighergenomicpredictionaccuracy.InaGaussian
kernel,thecovariancebetweengeneticvaluesishigherforcloselyrelatedindividualsanddrops
astwogenotypesbecomeincreasinglydistant.Therateatwhichthepriorcovariancebetween
geneticvaluesdropsiscontrolledbyabandwidthparameter.Largebandwidthparametervalues
(thatleadtohighlylocalcovariances)canbeusedtoderivepredictionswhicharelargelydependent
oncloselyrelatedindividuals.Thus,thereisaclearlinkbetweenRKHSwithGaussiankernelsand
theSSImethodologypresentedinChapter3.However,theGaussiankerneldoesnotyieldstrictly
56
sparsepredictionequations.
Therefore,tofurtheradvanceourresearchintheuseofSSIs,inthischapter,westudywhether
theSSIcanalsoimprovethepredictionperformanceofRKHSregressionswithnon-linearkernels.
InthischapterweevaluatetheperformanceoftheSSIusingadditive(sparseG-BLUP)andnon-
additive(sparseK-BLUP)kernelsusingathree-generationDH(doubledhaploid)maizedataset
fromtheInternationalMaizeandWheatImprovementCenter(CIMMYT).Forseveralscenarios
oftrainingsetcomposition,weshowthatthestandardRKHSregressionwithaGaussiankernel
outperformedtheadditiveG-BLUP.InagreementwithwhatwereportinChapter3,wefoundthat
theuseofanSSIwithadditivegenomicrelationships(sparseG-BLUP)leadstogainsinprediction
accuracybetween5%-20%,relativetothestandard(non-sparse)G-BLUP.FortheK-BLUP,the
gainsobtainedbyaddingsparsityweresmallerandnotalwayssigni˝cant.
4.3MaterialsandMethods
4.3.1Genotypesandphenotypicdata
Thegenotypesusedinthestudyconsistof3068DHlinesderivedfrom54biparentalfamilies.
TheDHlinesweredevelopedin2017,2018,and2019atCIMMYT'sMaizeDHfacilityatthe
Agricultural&LivestockResearchOrganization(KALRO)inKiboko,Kenya.Thebiparental
familieswereobtainedbycrossingeliteinbredlineswithdrought-tolerantlines.Seedsfromeach
ofthefamilieswerecollectedandsubmittedforDHinduction.The3068DHlineswereselected
fromalargerpopulationforstageImulti-locationyieldtrialsin2017-2019,basedontheresults
ofevaluatinggermination,goodstandestablishment,planttype,lowearplacement,andwell-˝lled
ears.
Eachyear,theselectedDHlineswerecrossedwithasingle-crosstesterfromthecomplementary
heteroticgroupandevaluatedunderwell-watered(denotedas
OPT
)anddrought(denotedas
DRT
)
environmentalconditions.Thenumberofhybrids(trials)plantedin2017,2018,and2019were
923(14),1423(34),and722(17),respectively;trialswereconnectedbyacommoncheckandthree
57
tosixcommercialchecks,plantedinanalpha-latticedesignwithtworeplications,andevaluatedin
twowell-wateredlocationsandonemanageddroughtstresslocationsduringthe2017,2018,and
2019growingseasons.TheOPTexperimentswereconductedduringtherainyseason,applying
supplementalirrigationasneeded.TheDRTexperimentswereconductedduringthedry(rain-
free)seasonandirrigationwassuspended2weeksbefore˛oweringanduntilharvest.Entrieswere
plantedintwo-rowplots,5mlong,with0.75mspacingbetweenrowsand0.25mbetweenhills.
Twoseedsperhillwereinitiallyplantedandthen,threeweeksafteremergence,oneplantperhill
wasmaintainedtoobtaina˝nalplantdensityof53333plants/ha.Fertilizerswereappliedatthe
rateof60kgNand60kgP
2
O
5
/ha,asrecommendedforthearea.Nitrogenwasappliedtwice:at
plantingand6weeksafteremergence.Fieldswerekeptfreeofweedsbyhandweeding.Grain
yield(GY,tonsha

1
),anthesisdate(AD,days),plantheight(PH,cm)traitswererecorded.Plots
weremanuallyharvestedandGYwascorrectedtoa12.5%moisture.ADwasmeasuredfrom
plantingtowhen50%oftheplantsshedpollen,andPHwasmeasuredfromthesoilsurfacetothe
˛agleafcollaron˝verepresentativeplantswithineachplot.
Leafsamplesweretakenfromeachofthe3068DHlinesandsenttoIntertek,Sweden,forDNA
extraction.TheDNAsampleplateswereforwardedtotheInstituteforGenomicDiversity,Cornell
University,Ithaca,NY,USA,forgenotypingwithrepetitivesequences(rAmpSeq)asdescribedby
Buckleretal.(2016).Atotalof5465markerscodedas0(absence)and1(presence)were˝ltered
byminorallelefrequency(MAF<0.05)fromwhichonly5173werekeptforanalyses.
Furtherinformationaboutthe2017and2018datacanbefoundinBeyeneetal.(2019)and
Atandaetal.(2020),whohavepreviouslydescribedandanalyzedthesedatasets.
4.3.2Phenotypespre-processing
AdjustedmeansofGY,PH,ADwereobtainedusingmixede˙ectsmodel˝ttedseparatelyforeach
trait-year-environmental-conditioncombination.TheBestLinearUnbiasedEstimates(BLUE)of
genotypesacrosslocationsfortheOPTexperimentswereestimatedusingtheMETA-Rsoftware
58
(Alvaradoetal.,2020)followingthelinearmixed:
Y
ijkl
=

+
g
i
+
L
j
+
R
k
¹
j
º
+
B
l
¹
kj
º
+
¹
g

L
º
ij
+
e
ijkl
where
Y
ijkl
isthephenotypicrecordofgenotype
i
atlocation
j
inreplicate
k
withintheblock
l
,

istheoverallmean,
L
j
isthe˝xede˙ectofthelocation
j
,
R
k
¹
j
º
isthe˝xede˙ectofthereplicate
k
withinlocation
j
,
B
l
¹
kj
º
istherandome˙ectoftheincompleteblock
l
withinreplicate
k
and
location
j
assumedtobeindependentlyandidenticallydistributed(
iid
)normalwithmeanzeroand
variance
˙
2
b
,
g
i
isthe˝xede˙ectofgenotype
i
,
¹
g

L
º
ij
isthe˝xede˙ectofthegenotype

location
interaction,and
e
ijkl
istherandomerrorassumedtobe
iid
normalwithmeanzeroandvariance
˙
2
e
.After˝ttingthemodeljustdescribed,adjustedphenotypes(
y
i
)wereobtainedbysubtracting
fromphenotypicrecords(GY,PH,andAD),theestimatede˙ectsoflocation,replicate,incomplete
block,genotype

locationinteraction,anderror.Likewise,withineachyear,theBLUEforeach
traitforthesingle-locationDRTexperimentwasobtainedthroughthelinearmodel
Y
ikl
=

+
g
i
+
R
k
+
B
l
¹
k
º
+
e
ikl
where
R
k
isthe˝xede˙ectofthereplicate
k
,
B
l
¹
k
º
istherandome˙ectoftheincompleteblock
l
withinreplicate
k
assumedtobe
iid
normalwithmeanzeroandvariance
˙
2
b
,andtheremaining
factorsareasbefore.Theadjustedphenotypeswereobtainedbysubtractingfromthephenotypic
records,theestimatede˙ectsofreplicate,incompleteblock,anderror.
Afterphenotypespre-processing,atotalof
n
=
3039
linescontainingmarkerinformationand
thatwereobservedinallenvironmentsforalltraitswerekeptforGSmodels.The˝nalnumberof
linesineachyearisasfollows:
n
1
=
901
linesin2017,
n
2
=
1417
in2018,and
n
3
=
721
in2019.
4.3.3Genomicselectionmodels
Weconsideredfourdi˙erentmodels:genomic-BLUP(G-BLUP)usingadditivegenomicrelation-
ships(VanRaden,2008),ReproducingKernelHilbertSpaces(RKHS)regression(Gianolaetal.,
2006)whichisequivalenttoaG-BLUPwithanon-linearkernel,andsparseselectionindices(SSI)
59
obtainedbyimposinganL1-penaltyontheG-BLUP(sparseG-BLUP)andontheRKHS(sparse
K-BLUP).Inwhatfollowswedescribeeachofthesemodels;forsimplicity,sinceallphenotypes
werecentered,wepresentmodelswithoutinterceptnor˝xede˙ects.
G-BLUP:
Inthismodel,thedata-equationtakestheform
y
=
u
+
"
(4.1)
where
y
=
¹
y
1
;:::;
y
n
º
0
,
u
=
¹
u
1
;:::;
u
n
º
0
,and
"
=
¹
"
1
;:::;"
n
º
0
arethevectorsofadjustedphenotypes,
breedingvalues(BV),andenvironmentalerrorterms,respectively.Breedingvaluesanderrorsare
assumedtobenormallydistributed
u
˘
N
¹
0
;˙
2
u
G
º
and
"
˘
N
¹
0
;˙
2
"
I
º
,where
˙
2
u
and
˙
2
"
arethe
geneticanderrorvariances,
G
istheadditivegeneticrelationshipmatrix(GRM),and
I
isanidentity
matrix.
Thegenomicrelationshipmatrix
G
=
f
G
ij
g
wasderivedfrommarkers,
X
=
f
x
im
g
,using
G
=
ZZ
0
š
p
,where
p
=
5173
isthenumberofmarkersand
Z
=
f¹
x
im

x
m
ºš
sd
x
m
g
isthematrix
ofcenteredandscaledmarkersobtainedbysubtractingfromeachmarkerentrythemeanofthe
correspondingcolumnfollowedbyscalingbythestandarddeviationofthecolumn.Theresulting
matrixhasanaverageofthediagonalelementsequaltoone.
ThepredictedBVs(
^
u
PS
)fortheindividualsinthepredictionset(PS)arethenlinearcombi-
nationsofthephenotypes(
y
TS
)ofthesubjectsinthetrainingset(TS),thisis(e.g.,Searleetal.,
1992)
^
u
PS
=
B
G
y
TS
(4.2)
where
B
G
=
G
PS
;
TS
¹
G
TS
+

0
I
º

1
isaHatmatrix(i.e.,coe˚cientsofregressionofBVson
phenotypes),

0
=
˙
2
"
š
˙
2
u
istheratiobetweenresidualandgeneticvariances,
G
PS
;
TS
isamatrix
containingtheadditivegeneticrelationshipsbetweenthedatapointsinthepredictionsetwiththose
inthetrainingset,and
G
TS
representstheadditiveGRMofthetrainingdatapoints.
RKHSregression:
InaRKHSregression,thevectorofgenomicpredictions(
u
)inEquation
(4.1)areobtainedaslinearcombinationsonkernelevaluationsas
u
=
K

,where
K
isan
n

n
matrixofkernelevaluationsonmarkersgenotypes,and

isavectorofregressioncoe˚cients.As
60
shownindelosCamposetal.(2009a),ifthevector

isassumedtobedistributed

˘
N
¹
0
;˙
2
a
K

1
º
,
itfollowsthat
u
˘
N
¹
0
;˙
2
a
K
º
:
(4.3)
Therefore,thevectorofgenomicpredictionsfortheindividualsinthepredictionsetare
^
u
PS
=
B
K
y
TS
(4.4)
wheretheHatmatrixisnow
B
K
=
K
PS
;
TS
¹
K
TS
+

0
I
º

1
,

0
=
˙
2
"
š
˙
2
a
,
K
PS
;
TS
isthematrix
containingtheevaluationoftheGaussiankernelforpairsoftraining-predictiongenotypes,and
K
TS
isthekernelGRMofthetrainingdata.
FindingthesolutionofanRKHSmodelisthereforeequivalentto˝ndingthesolutionforthe
G-BLUPmodelbutusing
K
insteadoftheadditivematrix
G
.Inthissense,werefertotheRKHS
regressionastoK-BLUPmodel.
WeconsideredK-BLUPmodelsusingGaussiankernelmatrices
K
=
f
K
ij
g
,
i
;
j
=
1
;:::;
n
,given
by
K
ij
¹

º
=
exp


~
d
2
ij
º
,where

isabandwidthparameterand
~
d
2
ij
isthescaledsquaredEuclidean
distancebetweenindividuals
i
and
j
givenbytheirmarkersgenotypes,obtainedbydividing
thedistance
d
2
ij
=
Í
p
m
=
1
¹
x
im

x
jm
º
2
bytheaveragedistance
d
=
1
n
2
Í
i
Í
j
d
2
ij
.Threeextreme
Gaussiankernelswereusedasde˝nedbyGonzález-Camachoetal.(2012),namely
K
1
=
f
K
ij
¹

1
ºg
,
K
2
=
f
K
ij
¹

2
ºg
,and
K
3
=
f
K
ij
¹

3
ºg
,where

1
=
0
:
2
,

2
=
1
,and

3
=
5
.SeeSupplementary
FigureC.1forpairwisecomparisonsofkernel(
K
ij
)versusadditive(
G
ij
)genomicrelationships.
Inaddition,kernelaveraging(KA)wasalsoimplementedasdescribedindelosCamposetal.
(2010)usingthethreekernels
K
k
,
k
=
1
;
2
;
3
.Brie˛y,thethreekernelsareconsideredtojointly
contributetothepredictionas
u
=
Í
3
k
=
1
K
k

k
,whereeachsummandhasitsowndistribution(as
inEquation(4.3))as
K
k

k
=
u
k
˘
N
¹
0
;˙
2
a
k
K
k
º
.ThisKA-BLUPmodelis˝ttedinaBayesian
fashion;however,itcanberewrittenasasinglerandome˙ect(asinEquation(4.1))bymaking
u
˘
N
¹
0
;˙
2
a
K
A
º
,where
˙
2
a
=
˙
2
a
1
+
˙
2
a
2
+
˙
2
a
3
isthetotalkernelvarianceand
K
A
isanaverage
kernelGRMgivenby
K
A
=
˙
2
a
1
˙
2
a
K
1
+
˙
2
a
2
˙
2
a
K
2
+
˙
2
a
3
˙
2
a
K
3
:
(4.5)
61
SparseBLUPmodels:
SparsitywasincorporatedintotheG-BLUPandK-BLUPmodelsas
describedinChapter3.Brie˛y,inaSSI,theweightsoftheselectionindexforthe
i
th
individualin
thepredictionsetwasobtainedfromthefollowingL1-penalizedoptimizationproblem
~
b
¹

º
i
=
argmin
b
i
2
6
6
6
6
4
1
2
b
0
i
¹
G
TS
+

0
I
º
b
i

G
0
i
b
i
+

n
Õ
j
=
1
j
b
ij
j
3
7
7
7
7
5
(4.6)
where
G
0
i
=
G
PS
¹
i
º
;
TS
isthevectorcontainingtheadditiverelationshipsbetweenthe
i
th
subject
inthepredictionsetandeachofthesubjectsinthetrainingset,

isaparametercontrollingthe
degreeofsparsityof
~
b
¹

º
i
,and
Í
n
j
=
1
j
b
ij
j
isapenaltyontheL1-normof
b
i
.A(sparse)Hatmatrix
fortheSSI,
~
B
¹

º
G
,containsineachrowthesolutionstoEquation(4.6),obtainedforeachtesting
genotype.Avalueof

=
0
yieldsthesame(non-sparse)HatmatrixofthestandardG-BLUPin
Equation(4.2).ForthesparseK-BLUPmodelsweusedEquation(4.6)withtheGaussiankernel
(either
K
1
,
K
2
,
K
3
,or
K
A
)insteadofadditiverelationshipmatrices(
G
).Althoughoptimization
probleminEquation(4.6)doesnothaveaclosed-form,solutionsforitcanbederivedusinga
coordinatedescentalgorithm(seeChapter3forfurtherdetails).Finally,anoptimalvalueof

can
beobtainedusingcross-validationwithinthetrainingset(moredetailsinChapter3).
4.3.4Variancecomponents
TheimplementationofG-BLUP,K-BLUP,andthecorrespondingsparseversionsofthesemodels
requireestimatesofvariancecomponents.Weobtainedtheseestimatesby˝ttingBayesiangenomic
modelswithineachtrait-environmentcombination.Theseanalyseswereperformedusingthe
BGLRR-package,withthedefaultsettingforhyper-parameters(Perez&delosCampos,2014).
After˝ttingthemodels,posteriormeansofthevariancecomponentswereobtained.Forthe
standardKA-BLUP,themodelwas˝ttedwiththethreekernelstogethertoestimatethekernel-
speci˝cvariancesandthenusedtoderive
˙
2
a
andthekernel
K
A
(Equation(4.5)).Aheritability
estimatefortheG-BLUPmodelwasderivedas
h
2
=
˙
2
u
š¹
˙
2
u
+
˙
2
"
º
.FortheK-BLUPmodels,the
heritabilitywasobtainedbyreplacingtheestimateof
˙
2
u
bythekernelgeneticvarianceestimate
˙
2
a
.Allthesemodelswere˝ttedusingdatafromthetrainingsetonly.
62
4.3.5Assessmentofpredictionaccuracy
VariancecomponentsestimatesandthecorrespondingGRM(
G
,
K
1
,
K
2
,
K
3
,or
K
A
)wereused
toderivethenon-sparse(
B
G
or
B
K
forthestandardBLUP)andthesparse(
~
B
¹

º
G
=
f
~
b
¹

º
0
i
G
g
or
~
B
¹

º
K
=
f
~
b
¹

º
0
i
K
g
)Hatmatrices.(NotethatintheSSI,therowsofthesparseHatmatrix
aresimplythesolutionstoEquation(4.6),obtainedforeachtestinggenotype.)Thepredictions
(
^
u
PS
)werederived(asinEquation(4.2)and(4.4))astheproductofthe(non-sparseorsparse)Hat
matrixtimesthevectorofphenotypesinthetrainingset.Predictionaccuracywasmeasuredasthe
correlationbetweenobservedandpredictedvaluesinthepredictionset,i.e.,
ˆ
=
cor
¹
y
PS
;
^
u
PS
º
.
Predictionaccuracywasassessedfordi˙erentpredictionscenariosusingcycle2019asthe
predictionsetwithdi˙erenttrainingsetcompositions,asfollows:(
i
)thedatafromthe2019
cyclewasrandomlypartitionedinto85%-15%(i.e.,612and109individuals),(
ii
)the85%-
set(
n
PS
=
612
)fromtheyear2019waspredictedusingdataoftheearliergenerations2017
(
n
TS
=
901
),2018(
n
TS
=
1417
),and2017+2018combined(
n
TS
=
2318
)astrainingset,(
iii
)the
predictionofthe612individualswasalsoperformedusingthesametrainingsetsbutaugmented
byprogressivelyincludingtheremaining15%-setfrom2019,˝rst37(5%),then73(10%),and
lastly109(15%)individuals.SeeTable4.1forasummaryofallthetrainingsetcompositions.All
predictionswereperformed100timesusingdi˙erentrandompartitionsofthe2019data.
4.3.6Software
AlltheaforementionedanalyseswereperformedintheRenvironment-language(RCoreTeam,
2019).AllstandardBayesianG-BLUPandK-BLUPmodelswere˝ttedusingtheBGLRR-package
(Perez&delosCampos,2014)toestimatevariancecomponents.ThesparseHatmatrices(
~
B
¹

º
G
or
~
B
¹

º
K
)wereobtainedwiththeSFSIR-package(Lopez-Cruzetal.,2020).Foreachtrait-
environment-partition,anoptimalvalueof

wasobtainedusing10-foldcross-validationwithin
thetrainingset.
63
Table4.1:Trainingset(TS)compositionusedineachpredictionscenario.(Thepredictionsetwas
thesameforalltrainingscenariosandconsistedof612randomlychosenindividualsfrom2019).
Datafrom%of2019datausedTotaltraining
previousyears(
n
)fortraining(
n
)size(
n
TS
)
0(0)901
20175(37)938
(901)10(73)978
15(109)1010
0(0)1417
20185(37)1454
(1417)10(73)1490
15(109)1526
0(0)2318
2017+20185(37)2355
(2318)10(73)2391
15(109)2427
Figure4.1:(A)First3principalcomponentsoftheadditivegenomicrelationshipsmatrix,
G
.
Pointsrepresentindividualsthatarecolorseparatedaccordingtocycle(2017,2018,or2019).(B)
Heatmapofthegenomicrelationshipsmatrix.
64
4.4Results
Thegermplasmusedinthisstudyisderivedfromdi˙erentbiparentalfamiliesacrossyears.
Thisrichnessofthedataisre˛ectedinahighpopulationheterogeneityinwhichindividuals
clusterintogroupswithinandacrossgenerations(Figure4.1).However,thecrossesperformed
preventedtheformationofaclearstructure(e.g.,2clusters);insteadthepopulationshowsamore
crypticsubstructurewithvaryingdegreesofadmixturebetweenfamilies.Theintermixingbetween
generationsthatisobservedinFigure4.1canbeattributedtoallelesharingasonlyallelesfromthe
selectedeliteparentsinonegenerationarepassedtothenextgeneration.
4.4.1PredictionaccuracycomparisonofG-BLUPandK-BLUPmodels
Figure4.2showstheaccuracyofprediction(averagedacrossall100partitions)forGY-OPTusing
allstandardBLUPmodelsforalldi˙erenttrainingsetcompositionsrepresentingacombinationof
previouscycles(2017,2018,or2017+2018)plustheinclusionof0,5,10,and15%(i.e.,0,37,73,
and109subjects)ofthetotalindividualsfromthesame2019cycle(seeTable4.1).Asexpected,
theinclusionofindividualsfromthesamecycleincreasesthepredictionaccuracyacrossallmodels
andtrainingsetcomposition.Forinstance,usingthe2017cycleastrainingset,theaccuracyofthe
G-BLUPwasincreasedby100%whenadding73individuals;however,using2018astrainingset
showeda60%increase,andwhenusing2017+2018,againof27%wasobserved.
Likewise,predictionaccuracywashigherwhencombiningdatafrom2017+2018astraining
setcomparedwithmodelsusingdatafrom2018or2017alonefortraining(seetop-leftpanelin
Figure4.2).However,theaccuracywasnotalwaysincreasedwhenthetrainingsetisaugmented
toinclude2017+2018together(seebottompanelsandtop-rightpanelinFigure4.2),and,insome
cases,wasevenlowered(seeSupplementaryFigureC.3fortraitPH).Contrastingly,theprediction
usingindividualsfrom2017+2018wassometimesequallyperformed(seethebottom-leftpanel
inFigure4.2)andinsomecasesoutperformed(seeSupplementaryFigureC.2forGY-DRTand
SupplementaryFigureC.3forPH)whenusingonlydatafromthe2017cycle.
65
Ingeneral,kernel-basedmodelsachievedhigherpredictionaccuracythanthestandardG-BLUP.
Althoughthekernels
K
1
,
K
2
,and
K
3
arerankeddi˙erentlyacrosstrainingsetcompositions,models
with
K
A
seemstobemorestableacrossallscenariosperformingsimilartothebestofthethree
kernels
K
1
,
K
2
,or
K
3
.Thisresultisinagreementwiththe˝ndingsindelosCamposetal.(2010).
Figure4.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthedatafrom
the2017,2018,or2017+2018cyclesalone(top-leftpanel),orincombinationwithaproportion
(5%,10%,15%)ofthedatafromthe2019cycle.Thepredictionsetconsistedof612genotypes
fromthe2019cyclethatwerenotusedformodeltraining.Modelswiththesameletterwithin
panelindicatenosigni˝cantdi˙erencefromeachother(

=
0
:
05
,ANOVAfollowedbyTukey
test).GY-OPTtrait-environmentcombination.
4.4.2E˙ectofsparsityonpredictionaccuracy
Thesamepartitionsoftraining-predictionsetsusedtoobtaintheresultsforthestandardmodels
wereusedtoevaluatethepredictionaccuracyofSSIs(sparsemodels).Across-validatedvalue

CV
wasfoundwithinthetrainingsettocalculateanoptimalsparseBLUPmodel.Table4.2contains
theresultsofthepredictionsoftheGY-OPTtrait-environmentcombinationforthescenariowhere
15%ofdatafrom2019isincludedinthetrainingset(2017,2018,or2017+2018).Resultsfor
66
thecaseswhenadding0%,5%,and10%ofthe2019dataarepresentedinSupplementaryTable
C.1.Withthistrainingsetcomposition,theaccuracyofthestandardG-BLUPmodelswasbetween
0.46-0.48.K-BLUP(standardorsparse)andsparseG-BLUPmodelsachievedhigherprediction
accuracythanthestandardG-BLUP,withgainsinaccuracy(relativetoG-BLUP)rangingfrom
minimal(1%)tosubstantial(12%).
Table4.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition(including
15%ofsubjectsfromthe2019cycle),GY-OPTtrait-environmentcombination.
Accuracy(SD)%Gain
TS(
n
TS
)GRM

CV
a
n
sup
(RS)
b
h
2
StandardSparseI
c
II
d
G0.0181171(17)0.520.47(0.031)0.51(0.030)-9
K10.0037188(19)0.870.48(0.030)0.50(0.034)34
2017K20.0060219(22)0.750.51(0.028)0.52(0.027)92
(1010)K30.00001007(100)0.920.50(0.029)0.50(0.029)70
KA0.0041240(24)0.840.51(0.028)0.52(0.029)92
G0.0112337(22)0.610.46(0.026)0.48(0.028)-5
K10.0016480(31)0.910.48(0.026)0.49(0.027)33
2018K20.0023684(45)0.800.50(0.026)0.50(0.027)80
(1526)K30.00001526(100)0.900.46(0.029)0.46(0.029)00
KA0.0015683(45)0.870.49(0.027)0.49(0.028)70
G0.0188268(11)0.530.48(0.025)0.51(0.027)-7
K10.0033350(14)0.880.50(0.025)0.51(0.026)42
2017+18K20.0031750(31)0.770.52(0.025)0.52(0.026)90
(2427)K30.00002427(100)0.870.50(0.027)0.50(0.027)40
KA0.0020922(38)0.830.52(0.026)0.52(0.027)80
GRM:Geneticrelationshipmatrix.SD:standarddeviation.
a
PenalizationparameterinEquation(4.6)found
bycross-validatingtheTS.
b
n
sup
=averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient
inthesparseHatmatrix(supportset).RS:relativesparsity(
100
n
TS
š
n
sup
).Inthestandardmodels

CV
is
equaltozeroand
n
sup
isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy
ofthe
c
standardK-BLUPrelativetothestandardG-BLUP,and
d
sparse*-BLUPrelativetothestandard
*-BLUP(*=G-orK-).
Thegainsinpredictionaccuracyaremoreevidentwhentheaccuracyofthenon-sparseG-BLUP
modelswaslower(i.e.,whenfewerindividualsfromthe2019cycleareincludedinthetraining
set).Forinstance,whentheaccuracyoftheG-BLUPisaslowas0.2(thecasewherenodata
from2019isincludedinthetrainingset),SSIsyieldedgainsinaccuracy(relativetothestandard
G-BLUP)ofupto28%(SupplementaryTableC.1).Itwasonlyintheselow-accuracysituations
67
that,insomecases,theuseofastandardK-BLUPmodelwith
K
3
resultedin
˘
6
%lossofaccuracy
(relativetothestandardG-BLUP)andthatusingsparsemodels,theaccuracylostwas3-10%(see
SupplementaryTableC.1).Theadvantageinaccuracyofasparsemodeloveritsstandardversion
wasmoremarkedfortheG-BLUP(i.e.,whenusingadditiverelationshipsmatrices,23%);however,
thesamegainsweresmallerfortheRKHSregressionsusingGaussiankernels.(
˘
2

8
%gain
inaccuracy,SupplementaryTableC.1).Nosigni˝cantdi˙erencebetweensparseandnon-sparse
modelwasobservedwhenusingthelarge-bandwidthkernel
K
3
.Similarresultscanbefoundfor
traitGY-DRT(seeSupplementaryTableC.2).
Figure4.3:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(horizontalaxis)
versusthepredictionaccuracyofallothermodels(verticalaxisofeachpanel).(B)Prediction
accuracyofthestandard*-BLUPmodel(horizontalaxis)versusthepredictionaccuracyofits
sparseversion(verticalaxis),bytypeofkernelusedinpanels.Eachpointrepresentatraining-
testingpartitionwithineachtrainingsetcomposition.Coloredpointsabove(below)the45degree
linerepresentcasesforwhichonemodeloutperformedtheothermodel.P:p-valueforthetest
(fromANOVA)fordi˙erencesinaccuracybetweenthetwomodels.TraitGY,environmentOPT.
ForthePH-OPTtrait-environmentcombination,whenmodelsweretrainedusing15%ofthe
2019data,thegainsinaccuracyobtainedwiththesparseG-BLUP,relativetothenon-sparse
G-BLUP,wereashighas11%,andupto18%withasparseK-BLUPmodel(Table4.3).These
gainsinaccuracywereverynotable(>100%)whenaddingtothetrainingset10%orfewerof
68
theindividualsfromthe2019cycle(seeSupplementaryTableC.3).ResultsforthePH-DRT
trait-environmentarereportedinSupplementaryTableC.4wheresimilarpatternsareobserved.
Table4.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition(including
15%ofsubjectsfromthe2019cycle),PH-OPTtrait-environmentcombination.
Accuracy(SD)%Gain
TS(
n
TS
)GRM

CV
a
n
sup
(RS)
b
h
2
StandardSparseI
c
II
d
G0.0112506(50)0.560.49(0.037)0.51(0.042)0.03.4
K10.0008726(72)0.890.51(0.036)0.51(0.039)3.11.2
2017K20.0007773(77)0.770.54(0.034)0.54(0.034)9.10.1
(1010)K30.00001010(100)0.920.51(0.036)0.50(0.036)2.4-0.1
KA0.0003862(85)0.870.53(0.035)0.53(0.035)7.3-0.7
G0.0161208(14)0.710.50(0.038)0.56(0.033)0.010.9
K10.0020350(23)0.950.53(0.036)0.56(0.035)5.64.6
2018K20.0021627(41)0.880.56(0.033)0.57(0.031)11.31.4
(1526)K30.00001526(100)0.920.53(0.032)0.53(0.032)5.40.0
KA0.0015647(42)0.920.56(0.033)0.56(0.033)10.21.4
G0.0132356(15)0.640.47(0.038)0.52(0.039)0.010.8
K10.0015696(29)0.930.52(0.036)0.54(0.037)9.34.2
2017+18K20.00071728(71)0.850.56(0.034)0.56(0.034)17.7-0.2
(2427)K30.00002427(100)0.920.51(0.036)0.51(0.036)8.30.1
KA0.00041896(78)0.890.55(0.034)0.55(0.035)16.4-0.8
GRM:Geneticrelationshipmatrix.SD:standarddeviation.
a
PenalizationparameterinEquation(4.6)found
bycross-validatingtheTS.
b
n
sup
:averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient
inthesparseHatmatrix(supportset).RS:relativesparsity(
100
n
TS
š
n
sup
).Inthestandardmodels

CV
is
equaltozeroand
n
sup
isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy
ofthe
c
standardK-BLUPrelativetothestandardG-BLUP,and
d
sparse*-BLUPrelativetothestandard
*-BLUP(*=G-orK-).
Acrossallscenarios,thestandardG-BLUPshowedthelowestaccuracyamongallmodels(SSI
andstandardK-BLUP,seeFigure4.3AforGY-OPT).ThisinferiorityofthestandardG-BLUPwas
alsoobservedforGY-DRTandPH(seeSupplementaryFigureC.4AandSupplementaryFigure
C.5A).TheadditionofsparsitytotheK-BLUPmodelsresultedsometimesinanextraadvantage
inaccuracywhenusingakernelwithasmallbandwidth(
K
1
with

=
0
:
2
,and
K
2
with

=
1
)or
averagedacrossextremekernels(
K
A
)forGY(Figure4.3BandSupplementaryFigureC.4B)and
PH(SupplementaryFigureC.5B).
69
4.4.3Automatictraining-sampleselection
Tables4.2and4.3(andSupplementaryTablesC.1-C.4)showtheoptimalvalueofthepenalization
parameter

andthedegreeofsparsityoftheresultingindex,measuredbytheaveragenumberof
subjectsfromthetrainingsetinthesupportset(
n
sup
,subjectswithanon-zerocoe˚cientinthe
estimatedHatmatrix)ofeachpredictedgenotype.Thedegreeofsparsityvariedacrossmodels.
FortheGY-OPTtrait-environmentcombination,acrossalltrainingsetcompositions,thestrongest
sparsitywasachievedusingthegenomicmatrixGwitharelativesparsity(
n
sup
š
n
TS
)of11-33%
(Table4.2andSupplementaryTableC.1)whiletherelativesparsitywithkernelsincreasesasthe
bandwidthparameter

increases(relativesparsityof14-47%for
K
1
and22-59%for
K
2
).The
relativesparsityachievedwhenusingthe
K
A
kernel(25-62%)wassimilartothatofthe
K
2
kernel
(seeTable4.2andSupplementaryTableC.1).Thefactthatnodi˙erenceinaccuracywasobserved
betweenstandardandsparse
K
3
-BLUPmodelisduethattheoptimal

CV
waszero;thus,thesparse
modelwasequivalenttothestandardmodel.
Figure4.4displaysaheatmapofthesparseHatmatrix(
~
B
¹

º
G
)ofthesparseG-BLUPmodel.
Individualsinthetrainingset(2017+2018plus15%from2019)appearincolumnsandthosein
thepredictionsetareshowninrows.Individualsfromthetrainingsetthatdidnotcontributeto
eachindex(i.e.,thosewithzeroweightintheindex)aredisplayedingrey.Thosewithanon-zero
coe˚cient(supportset)areshowninayellow-blue(logarithm)scale.Theheatmapmakesevident
howSSIsselectcustomtrainingsetsforeachgenotypeinthepredictionset.Individualgenotypes
inthetrainingsetsupportsthepredictionofsomebutnotallthegenotypesinthepredictionset.The
solutionfortheHatmatrixinFigure4.4isverysparse,withvaryingnumberofsupportpointsby
testinggenotype.Predictionsofeachofthe612testinggenotypeswasperformedusingphenotypes
from,onaverage,268(outof2427traininggenotypes,i.e.,11%,seeTable4.2)traininggenotypes.
Forthesamepredictionscenario,aheatmapforthesparse
K
A
-BLUPmodel(showinga38%of
sparsity)isprovidedinSupplementaryFigureC.6.
Figure4.5shows,foreachofthesparsemodels,theproportionofthetrainingindividualsfrom
eachcycle(2017,2018,or2019)thatcontributedtotheprediction(withincyclesupportset)ofthe
70
testingindividuals.Eachpanelrepresentsthedi˙erenttrainingsetscomposedof2017+2018data
plustheadditionofeither0%,5%,10%,or15%ofthe2019data.
Figure4.4:Heatmapofthecoe˚cientsintheHatmatrix(
~
B
¹

º
G
)ofthesparseG-BLUPmodel
foronetraining-prediction(TS-PS)partitioninthepredictionof
n
PS
=
612
individualsfrom2019
using
n
TS
=
2427
individuals(2017+2018plus15%ofthe2019set).Predictedindividualsare
presentedincolumnsandtrainingindividualsarepresentedinrowsseparatedbycycleandnumber
ofindividualsinparentheses.Thevalueof

wasobtainedbycross-validation.Eachcolumn
representsvaluesofthevector
~
b
¹

º
i
G
=
f
~
b
ij
g
,
j
=
1
;:::;
2427
(Equation(4.6)).Individualsno
contributingtothepredictionhaveacoe˚cient
~
b
ij
=
0
representedingreycolor.Individualswith
anon-zerocoe˚cientareshowninayellow-bluelogarithmscale(intheoriginalscale,yellow
indicateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environmentcombination.
71
Asexpected,trainingindividualsthatbelongtothesamegroupasthetestingindividualsare
morelikelytobeincludedinthesupportset.Forexample,usingasparseG-BLUPmodeltrained
with2017+2018plus5%fromthe2019data(seethetop-rightpanelinFigure4.5),onaverage,
41%ofthe2019genotypesofthe37includedinthetrainingsetcontributedtothepredictionof
thetestingindividuals.Althoughmoreabundant,asmallerportionofthetotalindividualsfrom
previouscycles(19%ofthe901subjectsfrom2017and18%ofthe1417from2018)arealso
contributingtotheprediction.Withasmallerdegreeofsparsity,similarpatternswerealsoobserved
forthesparseK-BLUPmodels(seeFigure4.5)exceptwith
K
3
thatdidnotrendersparsityatall
(notshowninthe˝gure).PlotsshowingthewithincyclesparsitypatternsforGY-DRTandPH
(OPTandDRT)areshowninSupplementaryFiguresC.7andC.8.
Figure4.5:Proportionofthetrainingindividualsfromeachcyclethatcontributedtotheprediction
ofthe612testinggenotypesfrom2019,usingsparsemodelswithdi˙erentrelationshipmatrices
(horizontalaxis):
G
,
K
1
,
K
2
,or
K
A
.Thetrainingsetwascomposedbyindividualsfrom2017
(
n
=
901
)and2018(
n
=
1417
)alone(top-leftpanel)orincombinationwithaproportion(5%,
10%,15%)ofthedatafromthe2019cycle.GY-OPTtrait-environmentcombination.
72
Asmoreindividualsfromthesamecycleareaddedtothetrainingset,fewerindividuals
frompreviousgenerationsbecomelessfrequentinthesupportset.Forinstance,performingthe
predictionwithasparseG-BLUPusing2017+2018dataincluding15%ofthe2019data(bottom-
rightpanelinFigure4.5),yieldedasmallersupportsetwithonly10%(90of901)ofthe2017data
and10%(142of1417)ofthedatafrom2018.
4.5Discussion
Multiplefactorsa˙ectthepredictiveperformanceofGSmodels,includingsamplesize,trait
heritability,theextentoflinkagedisequilibrium(LD)betweenmarkersandquantitativetraitloci
(QTL),andtherelationshipsbetweentrainingandtestinggenotypes(Daetwyleretal.,2008;He˙ner
etal.,2009;Lorenzana&Bernardo,2009;Combs&Bernardo,2013).
Generalguidelinessuggestthatpredictionaccuracyismaximizedwhenthetrainingsetincludes
asu˚cientlargenumberofindividualswhicharedistantlyrelatedtoeachother(Rincentetal.,
2012)andarecloselyrelatedtothesubjectsinthepredictionset(Habieretal.,2010;Clarketal.,
2012).Ontheotherhand,thereisevidencesuggestingthatincreasingthetrainingsetsizeby
includingindividualsthataregeneticallydistanttothoseinthepredictionsetdoesnotnecessarily
increaseandmightevendecreasethepredictionaccuracy(e.g.,Lorenz&Smith,2015).
Eachcycleofabreedingprogramproducesanewbatchofgenomicandphenotypicdata;
therefore,aftermanyyearsofadoptingGS,thedataavailableformodeltrainingistypically
multi-generationalandmayoftenincludecomplexpatternsofpedigreerelationshipswithinand
betweengenerations.ThereisclearevidencethataGSmodelneedstobere-trainedeverycycle
(Wolcetal.,2011;Pszczola&Calus,2016;Wientjesetal.,2013).Whenre-trainingmodels,
breedingorganizationsfacemanychallenges.Shouldalltheavailabledatabeusedformodel
training?Shouldinsteadonerestrictthetrainingdatatoonlyincludegenotype/phenotypesfrom
recentgenerations?Shouldoneexcludedatafromgenotypesdistantlyrelatedtothecurrentsetof
candidatesofselection?
Someevidencesuggeststhatingenomicprediction,`biggerisnotnecessarilybetter'.For
73
instance,usinghistoricwheatdatageneratedover17years,Dawsonetal.(2013)observedthat
theaccuracyofyear-to-yearpredictionsusingtrainingsetscomposedofallpreviousyearswas
approximatelythesameaswhenconsideringonlythreeyearsback.Likewise,inabroilerbreeding
population,Wolcetal.(2016)foundthatthemaximumaccuracywasaccomplishedwhenthe
trainingsetwascomposedofthethreemostrecentgenerations.
TheSSI(sparseG-BLUP)methodologypresentedinChapter3o˙ersaframeworktoidentify,
foreachindividualinthepredictionset,acustomizedtrainingset(orsupportpoints)fromwhichthe
predictionsarederived.Thismethodologyconsidersboththerelationshipsbetweenthecandidate
ofselectionandeachtraininggenotypeaswellasrelationshipsamongtraininggenotypes(Equation
(4.6)).Therefore,inthischapter,weproposethatthesparseselectionindicespresentedinChapter
3canbeusedtoaddresstheproblemoftrainingsetoptimizationwithmulti-generationdata.We
usedamulti-generationdataoriginatedfrommorethan50biparentalfamiliestomeasuretheimpact
ofsparsityusingSSIsformedusingadditiveandnon-additivekernels.
Ourresultscon˝rmedthatanSSIbasedonadditiverelationshipsyieldsahigherprediction
accuracythanthestandardadditiveG-BLUP.Whenanon-additivekernelwasused,wefoundthat
sparsityimprovedpredictionaccuracyprovidedthatthekernelusedwasnotalreadya`local'kernel,
thatis,akernelinwhichgeneticcovariancesarepositiveonlyforcloselyrelatedindividuals.For
localkernels(
K
3
),addingsparsitydidnotimprovepredictionaccuracyinaclearandsystematic
way.Therefore,ourresultscon˝rmthebene˝tof`localpredictions',whichcanbeobtainedeither
byusinganRKHSwithlocalkernelsorwithanSSIappliedtoadditivegenomicrelationships.
BoththeSSIandtheKernelsregressionrequireoptimizingaparameterthatcontrolshowlocal
predictionsare.TheSSIrequiresoptimizingthepenalizationparameter(

),whichcanbedone
bycross-validationwithinthetrainingdataset.Ontheotherhand,theRKHSregressionrequires
tuningthebandwidthparameterwhichcontrolshowfastcovariancesdropwithgeneticdistance.
Thiscanbedoneeitherbycomparingmultiplekernelsusingcross-validationorbyusingmultiple
kernelswith'kernelaveraging'asdiscussedindelosCamposetal.(2010).
Standardtraining-optimizationmethodsassumethatasingletrainingsetisoptimalforall
74
candidatesofselection.TheSSIdoesnotmakethisassumption.Ourresultsshowclearlythat
eachSSIpicksaparticularsetofsupportpointsandthattheoptimaltrainingsetvariesfrom
genotypetogenotype.TheinspectionoftheHatmatrixoftheSSImakesitclearthatinprediction,
one-size-does-not-˝t-all
candidatesofselection.Likewise,theinspectionoftheHatmatrixshows
thatoptimizingtrainingsetsbyrestrictingthetrainingdatatorecentgenerationsmayalsonot
beoptimal.Indeed,mostoftheSSIspickedinformationfromallthegenerationsavailable,with
varyinglevelsofsparsity.
In
conclusion
:SSIscanbeusedtooptimizepredictionaccuracywhenthetrainingdataexhibit
complexrelationshippatterns.Inthiscontext,di˙erencesinallelefrequenciesandinLD-patters
maymakeSNPe˙ectheterogenousacrossfamiliesandsub-families,thusmakingthestandard
G-BLUPsub-optimal.BothlocalkernelsandSSIscanbeusedtooptimizepredictionaccuracyin
suchdatasets.
75
CHAPTER5
CONCLUDINGREMARKSANDFUTUREDIRECTIONS
Thecoreofthisdissertationhasamethodologypointofview,presentinganinnovativeprocedure
thatcombinestwowell-establishedapproaches:selectionindexandpenalizedregression.Thefor-
merwasdevelopedforbreedingvaluepredictionusingdi˙erentsourcesofcorrelatedinformation.
Thelatteriscommonlyusedinstatisticsandmachinelearningforvariableselectiontopreventover-
˝ttinginsituationswheretherearemorevariablesthanobservations.Thisnovelapproach,which
wenamedsparseselectionindex(SSI),o˙ersopportunitiessuchasintegratinghigh-throughput
phenotypesingeneticevaluationsandsolutionsfortrainingsetoptimizationingenomicselection
withhighlyheterogeneousdata.
Wemadeane˙orttopresentourSSImethodologyindeepdetail,developsoftwareforits
implementation,andempiricallyvalidateit(withcross-validation)withrealdatausingseveraldata
setswithgenomicandphenotypicinformation.Asabrand-newmethod,noexhaustiveevaluation
oftheSSIwaspossibletobepresentedinthisdissertationatthisstage;however,theresultsobtained
withthesedatasetsareverypromising,performingatleastasgoodasthestandardmethods.
TheSSIapplicationspresentedinthisdissertationareofthetypesingle-traitmodel;they
mightbefeasiblyextendedtomulti-traitmodelsallowingtheborrowingofinformationwithinand
betweenindividualsatthesametime.Multi-traitmodelshavethepotentialtoincreaseprediction
accuracy;therefore,furtherresearchisrequiredtoinvestigatewhethertheuseofamulti-traitSSI
canalsobeadvantageous.
ThescopeoftheSSIcangobeyondthebreedingvaluespredictiononly.Thismethodleaves
thedooropentoothergeneticresearchareas(e.g.,genome-wideassociationanalysisonhigh-
dimensionalphenotypesandnetworkanalysisfromgeneexpression).Therefore,moreresearchis
neededtoexplorethefullpotentialoftheSSIprocedure.
76
APPENDICES
77
APPENDIXA
SUPPLEMENTARYFIGURESANDTABLESFROMCHAPTER2
78
FigureA.1:Box-plotofgrainyieldphenotypicrecordsbyenvironmentalcondition.
n
ˇ
3200
observationswithinenvironment.SD:standarddeviation.
79
FigureA.2:Lightre˛ectancepatternsasfunctionofthewavelength.Eachlinerepresentsthe
mean(across
n
ˇ
3200
observations)re˛ectanceforeachwaveband,withintime-point(˛ight
date).Withineachenvironment,meanswerescaledtoliewithin0and1bydividingthembythe
maximumaverage.
80
FigureA.3:AccuracyofindirectselectionofL1-PSIanditscomponents.Squarerootheritability,
geneticcorrelationandaccuracyofindirectselection,allaveragedover100training-testingpar-
titionsversusthenumberofbandsenteringintheindex;bytime-point(DAS=daysaftersowing,
Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)withinenvironment.
81
FigureA.4:AccuracyofindirectselectionofL2-PSIanditscomponents.Squarerootheritability,
geneticcorrelationandaccuracyofindirectselection,allaveragedover100training-testingparti-
tionsversusthepenalizationparameter(

,logarithmscale)usedtobuildtheindex;bytime-point
(DAS=daysaftersowing,Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)within
environment.
82
FigureA.5:AccuracyofindirectselectionofPC-SIanditscomponents.Squarerootheritability,
geneticcorrelationandaccuracyofindirectselection,allaveragedover100training-testingparti-
tionsversusthenumberofprincipalcomponentsusedtobuildtheindex;bytime-point(DAS=days
aftersowing,Stage:VEG=vegetative,GF=grain˝lling,orMAT=maturity)withinenvironment.
83
FigureA.6:Squarerootofheritabilityofthestandard(SI),oftheregularized(PC-SIandL1-PSI)
selectionindices,andoftheRNDVI.Thelinesprovidetheaveragesquarerootheritabilityover
100training-testingpartitions.Verticallinesrepresenta95%CIfortheaverage.Thehorizontal
axisgivethetime-pointatwhichimageswerecollectedandareexpressedinbothdaysaftersowing
(DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity).
84
FigureA.7:Geneticcorrelationbetweengrainyieldandall:thestandard(SI),theregularized
(PC-SIandL1-PSI)selectionindices,andtheRNDVI.Thelinesprovidetheaveragegenetic
correlationover100training-testingpartitions.Verticallinesrepresenta95%CIfortheaverage.
Thehorizontalaxisgivethetime-pointatwhichimageswerecollectedandareexpressedinboth
daysaftersowing(DAS)andstages(VEG=vegetative,GF=grain˝lling,MAT=maturity).
85
FigureA.8:Phenotypic,genetic,andenvironmentalcovariances(absolutevalue)betweenwave-
bandsandgrainyield.'D':discrepancybetweenphenotypicandgeneticcovariancesasmeasuredby
thesumoftheabsolutedi˙erences;bytime-point(DAS:daysaftersowing,Stage:VEG=vegetative,
GF=grain˝lling,MAT=maturity)withinenvironment.
86
TableA.1:Accuracyofindirectselection(averageover100training-testingpartitions)forbest
phenotypicprediction(principalcomponents(PCR),L1-penalizedprediction(L1-Phen),RNDVI,
andGNDVI)andforbestgenotypicprediction(standardSI,optimalPC-SI,L1-PSI,andL2-PSI).
PhenotypicpredictionGenotypicprediction
Env/TP

PCRL1-PhenRNDVIGNDVI
SIPC-SIL1-PSIL2-PSI
Flat-Drought
45
0.24a0.23a0.23a0.21b
0.18c0.24a0.23a0.24a
52
0.27ab0.27ab0.27ab0.25b
0.20c0.27a0.27a0.27ab
65
0.42a0.42a0.35b0.35b
0.35b0.43a0.43a0.42a
73
0.45ab0.45ab0.41cd0.43bc
0.39d0.46a0.46a0.46a
80
0.44bc0.43c0.35e0.39d
0.40d0.46a0.45ab0.46a
85
0.41abc0.40cd0.32f0.39d
0.35e0.43a0.43ab0.43a
93
0.46bc0.47abc0.36e0.45cd
0.44d0.48ab0.49a0.49a
105
0.67a0.67a0.62b0.64b
0.63b0.68a0.67a0.68a
111
0.68ab0.68ab0.67bc0.64d
0.65cd0.69ab0.69ab0.69a
Multi
0.68cd0.68bcd0.68d0.65e
0.00f0.70ab0.70abc0.70a
Bed-2IR
50
0.18a0.14cd0.00f0.12d
0.09e0.18a0.15bc0.16ab
57
0.19a0.19a0.00c0.03b
0.19a0.20a0.20a0.20a
70
0.37a0.36a0.20c0.21c
0.31b0.37a0.36a0.38a
78
0.35a0.35a0.25c0.30b
0.28b0.36a0.36a0.36a
85
0.37a0.36a0.22c0.30b
0.29b0.38a0.37a0.38a
90
0.30abcd0.29cd0.21e0.28d
0.20e0.32a0.31abc0.32ab
98
0.45a0.46a0.18d0.35c
0.38b0.47a0.46a0.46a
110
0.40abc0.39bc0.34d0.39c
0.35d0.42a0.41ab0.42a
116
0.44a0.44a0.44a0.39b
0.38b0.45a0.44a0.45a
Multi
0.53cd0.53d0.46e0.40f
0.01g0.55ab0.54bc0.56a
Bed-5IR
50
0.18a0.17ab0.16ab0.15b
0.08c0.17ab0.16ab0.16ab
57
0.25a0.25a0.21c0.21bc
0.14d0.25a0.24a0.24ab
70
0.27a0.26a0.21b0.19b
0.20b0.27a0.27a0.26a
78
0.26a0.24a0.19b0.19b
0.18b0.26a0.24a0.26a
85
0.32a0.32a0.26b0.25b
0.24b0.32a0.32a0.33a
90
0.31a0.31a0.25c0.28b
0.22d0.32a0.32a0.32a
98
0.30a0.29a0.26b0.25b
0.16c0.29a0.28a0.28a
110
0.46a0.45a0.10d0.22c
0.34b0.45a0.45a0.45a
116
0.47a0.47a0.20d0.34c
0.38b0.47a0.47a0.47a
Multi
0.54a0.54a0.32c0.37b
0.00d0.54a0.55a0.55a
Bed-Heat
72
0.57a0.57a0.54b0.53b
0.50c0.57a0.57a0.57a
79
0.61a0.61a0.60a0.58b
0.51c0.61a0.61a0.61a
92
0.64a0.64a0.65a0.63a
0.55b0.64a0.64a0.64a
100
0.66a0.66a0.67a0.65a
0.57b0.66a0.66a0.66a
107
0.66a0.66a0.67a0.66a
0.56b0.66a0.67a0.66a
112
0.68ab0.68ab0.69a0.66b
0.62c0.68a0.68a0.69a
120
0.69a0.68a0.69a0.66b
0.59c0.69a0.68a0.68a
132
0.62a0.61a0.55b0.54b
0.54b0.62a0.61a0.61a
138
0.54a0.53a0.47b0.46b
0.46b0.54a0.53a0.54a
87
TableA.1(cont'd)
Multi
0.71a0.70a0.70a0.67b
0.00c0.71a0.71a0.72a
*Eachrowcontainsresultsforeachenvironmentandtime-point(DAS:daysaftersowing).Modelswiththe
sameletter(withineachrow)arenotsigni˝cantlydi˙erentfromeachother(

=
0
:
05
,ANOVAfollowedby
Tukeytest).
88
APPENDIXB
SUPPLEMENTARYMATERIALFROMCHAPTER3
89
FigureB.1:Toptwoprincipalcomponentsofthegenomicrelationshipmatrix,
G
,foreachdataset.
Eachpointrepresentindividuals.(A)Wheat-599dataset.(B)Wheat-largedataset.Individuals
arecolor-groupedbythecycle(sowing-harvestyear).
90
FigureB.2:Boxplotofgrainyieldphenotypicrecords(intonha

1
)byenvironmentalcondition
forbothWheat-599andWheat-largedatasets.SDstandarddeviation.
91
FigureB.3:Distributionofthenumberoftrainingsupportpoints(
n
sup
)inoptimalsparseselec-
tionindices(resultsobtainedover100trn-tstpartitions;
n
trn
=sizeofthetrainingdataset),by
environmentalcondition,Wheat-599dataset.
92
FigureB.4:Firsttwoprincipalcomponentscoordinatesforpredictionpoints(yellow)andthe
correspondingsupportpoints(green).Greypointsrepresentgenotypesthatdidnotcontributeto
thepredictionofthegeneticvalueofthegenotypeinyellow.Allpanelsrepresentsolutionsforthe
environment1,Wheat-599dataset.
93
FigureB.5:(leftandcenter)Weights(

ij
)ofastandardSI(G-BLUP)andoftheoptimalsparse
selectionindex(SSI)versusthegenomicrelationship(
g
ij
),and(right)proportionofweightsin
theSSIthatbelongedtoeitherthesupportingornon-activesets,bygenomic-relationship;by
environment,Wheat-largedataset.
94
FigureB.6:(leftandcenter)Weights(

ij
)ofastandardSI(G-BLUP)andoftheoptimalsparse
selectionindex(SSI)versusthegenomicrelationship(
g
ij
),and(right)proportionofweightsin
theSSIthatbelongedtoeitherthesupportingornon-activesets,bygenomic-relationship;by
environment,Wheat-599dataset.
95
TableB.1:Numberofavailableobservations,averagegrainyield,andheritabilitybyenvironmental
conditionfortheWheat-largedataset.
PlantingconditionsNumberof
Namen
AverageHeritability
DateSystemirrigations(SD)Yield(SD)
a
OptimumBed2B2I3,7324.53(0.261)0.41(0.029)
OptimumBed5B5I29,4737.12(0.372)0.57(0.025)
OptimumFlat5MEL4,4035.76(0.305)0.23(0.025)
LateBed5LHT4,4043.83(0.375)0.51(0.025)
OptimumBedMinimalDRB3,7632.74(0.275)0.38(0.029)
EarlyBed5EHT2,0406.16(0.525)0.41(0.038)
SD.Standarddeviation.
a
PosteriormeanandSDobtainedacross10,000MonteCarloreplicatesusing
Gibbssampling.
TableB.2:Numberofavailableobservations,averagegrainyield,andheritabilitybyenvironmental
conditionfortheWheat-599dataset.
MoistureregimeTemperatureNamen
AverageHeritability
(SD)Yield(SD)
a
Optimalirrigation,lowrainfallOptimalEnv15995.14(0.614)0.50(0.054)
HighrainfallOptimalEnv25994.51(0.790)0.46(0.056)
LowrainfallHighdroughtEnv35993.86(0.592)0.43(0.062)
IrrigationorrainfallHot,lowhumidityEnv45993.23(0.636)0.44(0.061)
SD.Standarddeviation.
a
PosteriormeanandSDobtainedacross10,000MonteCarloreplicatesusing
Gibbssampling.
96
TableB.3:Maximumpredictionaccuracy(averageacross100partitions)achievedbytheSSIfor
di˙erentvaluesoftheparameter

ofanElastic-Net-typeSSI,byenvironmentalconditionforthe
Wheat-largedataset.
Environment

0
a

opt
b
n
sup
c
Accuracy(SD)
d
B2I
1.53201.000.01413950.649(0.032)b
0.7660
0.250.06673200.663(0.032)a
0.500.03203300.664(0.032)a
0.750.02332900.664(0.032)a
1.000.01553380.664(0.032)a
B5I
1.84121.000.01191,2260.610(0.009)b
0.9215
0.250.04601,1870.630(0.009)a
0.500.02231,2030.631(0.009)a
0.750.01641,0440.631(0.009)a
1.000.01329430.631(0.009)a
MEL
3.79341.000.01165610.665(0.046)b
1.8967
0.250.07053380.685(0.045)a
0.500.04062700.686(0.045)a
0.750.02942360.687(0.045)a
1.000.01952820.687(0.045)a
LHT
0.98411.000.02182370.712(0.026)b
0.4921
0.250.08542480.727(0.025)a
0.500.04911940.729(0.025)a
0.750.02952230.729(0.025)a
1.000.02352020.730(0.025)a
DRB
1.75551.000.03441030.679(0.038)b
0.8778
0.250.14611020.694(0.039)a
0.500.0823790.696(0.039)a
0.750.0588690.697(0.040)a
1.000.0386850.697(0.040)a
EHT
1.55141.000.02841200.657(0.046)a
0.7757
0.250.09701590.670(0.048)a
0.500.05541260.672(0.048)a
0.750.03991100.672(0.048)a
1.000.02641330.673(0.048)a
SD:Standarddeviationacrossthe100trn-tstpartitions.
a
ShrinkagefactorinvolvedinthestandardSI
(Equation(3.2)).Withinenvironment,inthetoprowavalueof

0
wasusedasintheG-BLUPandin
rowsbelow,

0
wasreducedtohalf.
b
Optimalvalueof

thatyieldedanSSIwiththemaximumaccuracy
amongallindicesobtainedforagridof100valuesof

.
c
Averagenumberofindividualsinsupportingthe
predictionofindividualsfromtestingset.
d
Modelswiththesameletterarenotsigni˝cantlydi˙erentfrom
others(ANOVAfollowedbyTukey'sHSDtest,5%signi˝cancelevel.
97
TableB.4:Maximumpredictionaccuracy(averageacross100partitions)achievedbytheSSIfor
di˙erentvaluesoftheparameter

ofanElastic-Net-typeSSI,byenvironmentalconditionforthe
Wheat-599dataset.
Environment

0
a

opt
b
n
sup
c
Accuracy(SD)
d
Env1
1.21011.000.0314840.769(0.062)a
0.5061
0.250.1042990.772(0.063)a
0.500.04921010.773(0.063)a
0.750.02961100.773(0.063)a
1.000.02361030.773(0.063)a
Env2
1.30341.000.01751510.708(0.085)a
0.6517
0.250.06861470.711(0.086)a
0.500.03971260.710(0.086)a
0.750.02401360.710(0.086)a
1.000.01921290.710(0.086)a
Env3
1.40841.000.0514500.609(0.090)a
0.7042
0.250.2213480.611(0.089)a
0.500.1017480.610(0.090)a
0.750.0601540.609(0.091)a
1.000.0474500.609(0.091)a
Env4
1.43801.000.0615400.722(0.073)a
0.7190
0.250.2689390.727(0.074)a
0.500.1483310.727(0.074)a
0.750.0870350.728(0.075)a
1.000.0681320.728(0.075)a
SD:Standarddeviationacrossthe100trn-tstpartitions.
a
ShrinkagefactorinvolvedinthestandardSI
(Equation(3.2)).Withinenvironment,inthetoprowavalueof

0
wasusedasintheG-BLUPandin
rowsbelow,

0
wasreducedtohalf.
b
Optimalvalueof

thatyieldedanSSIwiththemaximumaccuracy
amongallindicesobtainedforagridof100valuesof

.
c
Averagenumberofindividualsinsupportingthe
predictionofindividualsfromtestingset.
d
Modelswiththesameletterarenotsigni˝cantlydi˙erentfrom
others(ANOVAfollowedbyTukey'sHSDtest,5%signi˝cancelevel.
98
B.1EquivalencebetweenStandardSelectionIndexandBLUP
Considerastandardsingle-traitmodeloftheform
y
=
u
+
"
where
y
=
¹
y
1
;:::;
y
n
º
0
,
u
=
¹
u
1
;:::;
u
n
º
0
,and
"
=
¹
"
1
;:::;"
n
º
0
arevectorsofphenotypes,genetic,
andenvironmentale˙ects,respectively.Here,forsimplicityweassumethatallthesevectorshave
zero-mean.
InastandardG-BLUPmodel,
u
and
"
areassumedtobeindependent(i.e.,
co
v
¹
u
;
"
0
º
=
0
),
bothhavenullmeans(i.e.,
E
¹
u
º
=
E
¹
"
º
=
0
),and(co)variancematrices
v
ar
¹
u
º
=
˙
2
u
G
and
v
ar
¹
"
º
=
˙
2
"
I
,respectively;here
G
isarelationshipmatrixthatcouldbederivedfromapedigree
orfromDNAsequences.
Considernowapartitionofeachofthedatainintoatraining(trn)andatesting(tst)set.
Theobjectiveistopredictthegeneticvaluesoftheindividualsinthetestingset(
u
tst
)usingthe
phenotypedataavailablefromthetrainingset(
y
trn
).The(co)variancematrixofthevectorof
breedingvaluescanbepartitionedasfollows
v
ar
©


«
2
6
6
6
6
6
4
u
trn
u
tst
3
7
7
7
7
7
5
ª
®
®
¬
=
˙
2
u
2
6
6
6
6
6
4
G
trn
G
trn
;
tst
G
0
trn
;
tst
G
tst
3
7
7
7
7
7
5
where
G
trn
and
G
tst
arethegeneticrelationshipsubmatricesforthetrainingandtestingdatapoints,
respectively,and
G
trn
;
tst
isthegeneticrelationshipsubmatrixbetweentrainingandtestingsubjects.
TheBestLinearPredictor(BLP)of
u
tst
(
^
u
tst
)takestheform(e.g.,Searleetal.,1992):
E
¹
u
tst
j
y
trn
º
=
E
¹
u
tst
º
+
co
v
¹
u
tst
;
y
0
trn
º

v
ar
¹
y
trn
º


1

y
trn

E
¹
y
trn
º

=
G
0
trn
;
tst
¹
G
trn
+

0
I
º

1
y
trn
:
Alternatively,onecanwrite
^
u
tst
=
H

y
trn
,where
H
=
G
0
trn
;
tst
¹
G
trn
+

0
I
º

1
isa
matrix.Thus,theBLUPofthegeneticvalueofthe
i
th
testingindividualis
^
u
tst
¹
i
º
=
H
0
i
y
trn
where
H
0
i
isthe
i
th
rowof
H
,thatis
H
0
i
=
G
0
i
¹
G
trn
+

0
I
º

1
whichisequaltotheweightsofthestandard
selectionindex,
^

0
i
=
G
0
i
¹
G
trn
+

0
I
º

1
(seeEquation(3.2)inthemanuscript).
99
B.2SparseSelectionIndices(SSI)usingtheSFSIR-package
Inthissection,weusedatafromtheWheat-599datasetusedinthemanuscripttoillustrate
howto˝tSparseSelectionIndicesusingtheSFSIR-package(Lopez-Cruzetal.,2020).
B.2.1InstallingthepackagefromGitHub
ThefollowingsnippetshowshowtoinstallthepackagefromGitHub.
rm(list=ls())
#Installdevtoolspackagefirst
install.packages('devtools',repos='https://cran.r-project.org/')
#InstallSFSIpackagefromGitHub
devtools::install_git('https://github.com/MarcooLopez/SFSI')
library(SFSI)#Loadthepackage
#InstallBGLRpackage(neededtodownloadthedata)
install.packages('BGLR',repos='https://cran.r-project.org/')
B.2.2Datapreparation
ToillustratetheuseofthesoftwarewewillusedatafromtheWheat-599datasetwhichisavailable
withtheBGLRR-package(Perez&delosCampos,2014).Thefollowingcodeshowshowto
preparedataforenvironment1;alltheanalyseshereinafterarebasedonthisdata.
data(wheat,package="BGLR")#LoaddatafromtheBGLRpackage
#Selecttheenvironment1toworkwith
y<-as.vector(scale(wheat.Y[,1]))
#CalculateGmatrix
G<-tcrossprod(scale(wheat.X))/ncol(wheat.X)
#Createadirectoryandsavedata
dir.create("data",recursive=TRUE)
save(y,G,file="data/geno_pheno.RData")
100
B.2.3Heritabilityandvariancecomponents
ImplementingtheSSIrequiresanestimateoftheheritability.WeobtainthisusingaG-BLUP
model
y
i
=

+
u
i
+
"
i
with
"
i
iid
˘
N
¹
0
;˙
2
"
º
and
u
˘
N
¹
0
;˙
2
u
G
º
.Thismodelcanbe˝ttedwiththe
'˝tBLUP'functionincludedintheSFSIR-package.TheBGLRR-packagecanbealsousedto˝t
aBayesianversionofthemodel.Thecodebelowillustrateshowtoestimateheritabilityusingthe
'˝tBLUP'function.
load("data/geno_pheno.RData")#Loaddata
#Fitmodel
fm0<-fitBLUP(y,K=G)
fm0$h2<-fm0$varU/(fm0$varU+fm0$varE)#Estimateheritability
c(fm0$varU,fm0$varE,fm0$h2)#Variancecomponents(varU,varE,h2)
#Createadirectoryandsavedata
dir.create("output",recursive=TRUE)
save(fm0,file="output/varComps.RData")
B.2.4Training-testingpartitions
Thecodebelowproducestraining(trn,
70%
)andtesting(tst,
30%
)partitions.Theparameter
'nPart'de˝nesthenumberofpartitions.Theoutputisamatrixwith'nPart'columnsand180rows
containingindicesrepresentingtheobservationsthatareassignedtothetestingsets.Theobjectis
savedinthe˝le'partitions.RData'andwillbeusedinlateranalyses.
nPart<-5#Numberofpartitions
load("data/geno_pheno.RData")#Loaddata
nTST<-ceiling(0.3*length(y))#NumberofelementsinTSTset
partitions<-matrix(NA,nrow=nTST,ncol=nPart)#Matrixtostorepartitions
seeds<-round(seq(1E3,.Machine$integer.max,length=nPart))
for(kin1:nPart)
{set.seed(seeds[k])
partitions[,k]<-sample(1:length(y),nTST,replace=FALSE)
}
save(partitions,file="output/partitions.RData")#Savepartitions
101
B.2.5AccuracyoftheG-BLUPandoftheSparseSI
ThefollowingscriptshowshowtoderiveSSIsusingthepartitionsabovecreated.Theweightsofthe
SSIarecomputedusingthe'SSI'functionfor'nLambda=100'valuesof

.TheG-BLUPmodelis
˝ttedforcomparisonusingthe'˝tBLUP'function.Estimatesof

and
h
2
arecomputedinternally
inthe'SSI'functionwhenthesearenotprovided.TheseestimatesobtainedfromtheG-BLUP
modelwillbepassedtothe'SSI'functiontosavetime.Indicesdenotingtrainingandtestingsets
arepassedthroughthe'trn'and'tst'parameters,respectively.TheaccuracyoftheG-BLUPand
SSImodelsarestoredintheobject'accSSI',andsavedinthe˝le'results_accuracy.RData'.
#Loaddata
load("data/geno_pheno.RData")
load("output/varComps.RData")
load("output/partitions.RData")
accSSI<-mu<-h2<-c()#Objectstostoreresults
for(kin1:ncol(partitions))
{cat("partition=",k,"\n")
tst<-partitions[,k]
trn<-(1:length(y))[-tst]
yNA<-y;yNA[tst]<-NA
#G-BLUPmodel
fm1<-fitBLUP(yNA,K=G)
mu[k]<-fm1$b#Retrievemuestimate
h2[k]<-fm1$h2#Retrieveh2estimate
#SparseSI
fm2<-SSI(y,K=G,b=mu[k],h2=h2[k],trn=trn,tst=tst,mc.cores=1,nLambda=100)
fm3<-summary(fm2)#Usefulfunctiontogetresults
accuracy<-c(GBLUP=cor(fm1$u[tst],y[tst]),fm3$accuracy)/sqrt(fm0$h2)
lambda<-c(min(fm3$lambda),fm3$lambda)
df<-c(max(fm3$df),fm3$df)
accSSI<-rbind(accSSI,data.frame(rep=k,SSI=names(accuracy),accuracy,lambda,df))
}
save(mu,h2,accSSI,file="output/results_accuracy.RData")
102
B.2.6DisplayingResults
Thefollowingcodecreatesaplot(asinFigure3.1inthemanuscript)showingtheestimatedgenetic
predictionaccuracybyvaluesofthepenaltyparameter(inlogarithmicscale).Therightmostpoint
intheplotcorrespondstotheG-BLUPmodel(obtainedwhen

=
0
).Thepointatthepeakdenotes
themaximumaccuracythatwasobtainedbytheSSI.
load("output/results_accuracy.RData")
dat<-data.frame(do.call(rbind,lapply(split(accSSI,accSSI$SSI),
function(x)apply(x[,-c(1:2)],2,mean))))
dat$Model<-unlist(lapply(strsplit(rownames(dat),"\\."),function(x)x[1]))
dat2<-do.call(rbind,lapply(split(dat,dat$Mod),function(x)x[which.max(x$acc),]))
ggplot(dat[dat$df>1,],aes(-log(lambda),accuracy))+
geom_hline(yintercept=dat["GBLUP",]$accuracy,linetype="dashed")+
geom_line(aes(color=Model),size=1.1)+theme_bw()+
geom_point(data=dat2,aes(color=Model),size=2.5)
B.2.7Cross-validatingtoobtainanoptimalpenalization
Thesnippetbelowcanbeusedtoperform,withineachtrn-tstpartition,
k
-foldsCVtogetan
'optimal'valueof

withinthetrainingdata,andthenusedto˝tanSSIforthetestingset.TheCV
isimplementedusingthe'SSI_CV'functionfromtheSFSIR-packageforone5-foldsCV,thiscan
besetbychangingthe'nCV'and'nFolds'parameters.
103
load("data/geno_pheno.RData");load("output/varComps.RData")
load("output/partitions.RData");load("output/results_accuracy.RData")
lambdaCV<-accSSI_CV<-dfCV<-c()#Objectstostoreresults
for(kin1:ncol(partitions))
{cat("partition=",k,"\n")
tst<-partitions[,k]
trn<-(1:length(y))[-tst]
#Cross-validatingthetrainingset
fm1<-SSI_CV(y,K=G,trn.CV=trn,nLambda=100,mc.cores=1,nFolds=5,nCV=1)
lambdaCV[k]<-summary(fm1)$optCOR["mean","lambda"]
#FitaSSIwiththeestimatedlambda
fm2<-SSI(y,K=G,b=mu[k],h2=h2[k],trn=trn,tst=tst,lambda=lambdaCV[k])
accSSI_CV[k]<-summary(fm2)$accuracy/sqrt(fm0$h2)
dfCV<-cbind(dfCV,fm2$df)
}
save(accSSI_CV,lambdaCV,dfCV,file="output/results_accuracyCV.RData")
Afterrunningtheaboveanalysis,thefollowingsnippetcanberuntocreateaplot(asinFigure
3.2inthemanuscript)comparingpartition-wisetheaccuracyoftheoptimalSSIwiththatofthe
G-BLUP.Theaverageaccuraciesarealsoshownintheplot.
load("output/results_accuracy.RData")
load("output/results_accuracyCV.RData")
dat<-data.frame(GBLUP=accSSI[accSSI$SSI=="GBLUP",]$acc,SSI=accSSI_CV)
rg<-range(dat)
tmp<-c(mean(rg),diff(rg)*0.4)
ggplot(dat,aes(GBLUP,SSI))+geom_abline(slope=1,linetype="dotted")+
geom_point(shape=21,color="orange")+xlim(rg)+ylim(rg)+
annotate("text",tmp[1],tmp[1]-tmp[2],label=round(mean(dat$GBLUP),3))+
annotate("text",tmp[1]-tmp[2],tmp[1],label=round(mean(dat$SSI),3))
Thecodebelowcreatesaplot(asinFigure3.3inthemanuscript)showingthedistributionof
thenumberofpointsinthesupportsetfortheSSI,acrossallpartitions.
104
load("output/results_accuracyCV.RData")
dat<-data.frame(df=as.vector(dfCV))
bw<-round(diff(range(dat$df))/40)
ggplot(data=dat,aes(df,stat(count)/length(dfCV)))+theme_bw()+
geom_histogram(color="gray20",fill="lightblue",binwidth=bw)+
labs(x=bquote("Supportsetsize("*n[sup]*")"),y="Frequency")
B.2.8Subject-speci˝ctrainingsets
Thenextscriptcanbeusedtocreateaplot(asinFigure3.4inthemanuscript)showing(fora
singletrn-tstpartition)thesubsetofpointsinthesupportset,foreachindividualbeingpredicted.
Thisplotcanbemadethroughthe'plotNet'functionfromtheSFSIpackage.
#Loaddata
load("data/geno_pheno.RData")
load("output/partitions.RData")
load("output/results_accuracyCV.RData")
part<-1#Chooseanypartitionfrom1,...,nPart
tst<-partitions[,part]
trn<-(1:length(y))[-tst]
#FitSSIwithlambdapreviouslyestimatedusingCV
fm<-SSI(y,K=G,trn=trn,tst=tst,lambda=lambdaCV[part])
plotNet(fm,K=G,tst=fm$tst[1:25],single=FALSE,title=NULL,bg.col="white")
105
APPENDIXC
SUPPLEMENTARYFIGURESANDTABLESFROMCHAPTER4
106
FigureC.1:Genomicrelationships(
G
ij
)versuskernelrelationships(
K
ij
)ofindividualsincycle
2019withthoseincycles2017(left)and2018(right).
G
i
j
and
K
i
j
arethe
ij
th
elementof
G
and
K
¹

º
,respectively.(A)
K
1
=
K
¹
0
:
2
º
.(B)
K
2
=
K
¹
1
º
.(C)
K
3
=
K
¹
5
º
.
107
FigureC.2:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthedatafrom
the2017,2018,or2017+2018cyclesalone(top-leftpanel),orincombinationwithaproportion
(5%,10%,15%)ofthedatafromthe2019cycle.Thepredictionsetconsistedof612genotypes
fromthe2019cyclethatwerenotusedformodeltraining.Modelswiththesameletterwithin
panelindicatenosigni˝cantdi˙erencefromeachother(

=
0
:
05
,ANOVAfollowedbyTukey
test).GY-DRTtrait-environmentcombination.
108
FigureC.3:Predictionaccuracybymodelandtrainingset(TS).TSsconsistedonallthedatafrom
the2017,2018,or2017+2018cyclesalone(top-leftpanel),orincombinationwithaproportion
(5%,10%,15%)ofthedatafromthe2019cycle.Thepredictionsetconsistedof612genotypes
fromthe2019cyclethatwerenotusedformodeltraining.Modelswiththesameletterwithin
panelindicatenosigni˝cantdi˙erencefromeachother(

=
0
:
05
,ANOVAfollowedbyTukey
test).TraitPH(OPTandDRTenvironments).
109
FigureC.4:(A)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(horizontalaxis)
versusthepredictionaccuracyofallothermodels(verticalaxisofeachpanel).(B)Prediction
accuracyofthestandard*-BLUPmodel(horizontalaxis)versusthepredictionaccuracyofits
sparseversion(verticalaxis),bytypeofkernelusedinpanels.Eachpointrepresentatraining-
testingpartitionwithineachtrainingsetcomposition.Coloredpointsabove(below)the45degree
linerepresentcasesforwhichonemodeloutperformedtheothermodel.P:p-valueforthetest
(fromANOVA)fordi˙erencesinaccuracybetweenthetwomodels.GY-DRTtrait-environment
combination.
110
FigureC.5:(left)Predictionaccuracyofthestandard(non-sparse)G-BLUPmodel(horizontalaxis)
versusthepredictionaccuracyofallothermodels(verticalaxisofeachpanel).(right)Prediction
accuracyofthestandard*-BLUPmodel(horizontalaxis)versusthepredictionaccuracyofits
sparseversion(verticalaxis),bytypeofkernelusedinpanels.Eachpointrepresentatraining-
testingpartitionwithineachtrainingsetcomposition.Coloredpointsabove(below)the45degree
linerepresentcasesforwhichonemodeloutperformedtheothermodel.P:p-valueforthetest
(fromANOVA)fordi˙erencesinaccuracybetweenthetwomodels.TraitPH(OPTandDRT
environments).
111
FigureC.6:Heatmapofthecoe˚cientsintheHatmatrix(
~
B
¹

º
K
)ofthesparseKA-BLUPmodel
foronetraining-prediction(TS-PS)partitioninthepredictionof
n
PS
=
612
individualsfrom2019
using
n
TS
=
2427
individuals(2017+2018plus15%ofthe2019set).Predictedindividualsare
presentedincolumnsandtrainingindividualsarepresentedinrowsseparatedbycycleandnumber
ofindividualsinparentheses.Thevalueof

wasobtainedbycross-validation.Eachcolumn
representsvaluesofthevector
~
b
¹

º
i
K
=
f
~
b
ij
g
,
j
=
1
;:::;
2427
(Equation(4.6)).Individualsno
contributingtothepredictionhaveacoe˚cient
~
b
ij
=
0
representedingreycolor.Individualswith
anon-zerocoe˚cientareshowninayellow-bluelogarithmscale(intheoriginalscale,yellow
indicateslargevaluesandblueindicatessmallvalue).GY-OPTtrait-environmentcombination.
112
FigureC.7:Proportionofthetrainingindividualsfromeachcyclethatcontributedtotheprediction
ofthe612testinggenotypesfrom2019,usingsparsemodelswithdi˙erentrelationshipmatrices
(horizontalaxis):
G
,
K
1
,
K
2
,or
K
A
.Thetrainingsetwascomposedbyindividualsfrom2017
(
n
=
901
)and2018(
n
=
1417
)alone(top-leftpanel)orincombinationwithaproportion(5%,
10%,15%)ofthedatafromthe2019cycle.GY-DRTtrait-environmentcombination.
113
FigureC.8:Proportionofthetrainingindividualsfromeachcyclethatcontributedtotheprediction
ofthe612testinggenotypesfrom2019,usingsparsemodelswithdi˙erentrelationshipmatrices
(horizontalaxis):
G
,
K
1
,
K
2
,or
K
A
.Thetrainingsetwascomposedbyindividualsfrom2017
(
n
=
901
)and2018(
n
=
1417
)alone(top-leftpanels)orincombinationwithaproportion(5%,
10%,15%)ofthedatafromthe2019cycle.TraitPH(OPTandDRTenvironments).
114
TableC.1:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,and10%ofsubjectsfromthe2019cycle),traitGY,environmentOPT.
Accuracy(SD)%Gain
TS(
n
TS
)GRM

CV
a
n
sup
(RS)
b
h
2
StandardSparseI
c
II
d
TS+0%(2019)
G0.0139271(30)0.510.20(0.017)0.25(0.018)024
K10.0028307(34)0.860.21(0.017)0.20(0.017)5-9
2017K20.0037388(43)0.730.24(0.017)0.26(0.017)177
(901)K30.0000901(100)0.960.26(0.016)0.26(0.016)280
KA0.0023435(48)0.860.24(0.019)0.26(0.029)196
G0.0086469(33)0.610.25(0.015)0.22(0.015)0-11
K10.0011664(47)0.910.25(0.015)0.22(0.015)2-13
2018K20.0019743(52)0.790.26(0.015)0.24(0.016)5-8
(1417)K30.00001417(100)0.920.23(0.015)0.23(0.015)-70
KA0.0015680(48)0.880.26(0.016)0.24(0.017)4-8
G0.0120569(25)0.520.33(0.015)0.33(0.015)02
K10.0028484(21)0.880.33(0.015)0.31(0.015)2-6
2017+18K20.00131374(59)0.760.34(0.015)0.33(0.015)4-2
(2318)K30.00002316(100)0.880.31(0.015)0.31(0.015)-40
KA0.00101446(62)0.830.33(0.016)0.33(0.016)3-3
TS+5%(2019)
G0.0149241(26)0.500.32(0.036)0.36(0.035)012
K10.0027292(31)0.860.34(0.036)0.36(0.036)58
2017K20.0028469(50)0.730.37(0.037)0.38(0.038)162
(938)K30.0000938(100)0.930.34(0.038)0.34(0.038)70
KA0.0015550(59)0.840.36(0.038)0.37(0.038)141
G0.0097412(28)0.610.26(0.021)0.24(0.020)0-7
K10.0014563(39)0.910.27(0.021)0.26(0.024)2-3
2018K20.0015835(57)0.790.28(0.021)0.27(0.021)7-3
(1454)K30.00001454(100)0.920.28(0.021)0.28(0.021)60
KA0.0012749(52)0.890.28(0.021)0.26(0.024)6-6
G0.0142441(19)0.530.34(0.019)0.36(0.019)04
K10.0025534(23)0.880.35(0.019)0.34(0.024)3-4
2017+18K20.00181117(47)0.760.37(0.021)0.37(0.021)9-1
(2355)K30.00002354(100)0.880.36(0.022)0.36(0.022)40
KA0.00131235(52)0.830.37(0.021)0.37(0.020)8-2
TS+10%(2019)
G0.0186183(19)0.520.42(0.029)0.45(0.030)07
K10.0041180(18)0.870.44(0.029)0.46(0.030)44
2017K20.0067218(22)0.750.47(0.030)0.47(0.030)121
(974)K30.0000974(100)0.920.48(0.032)0.48(0.032)140
KA0.0045244(25)0.850.46(0.030)0.47(0.032)101
G0.0111349(23)0.610.40(0.029)0.41(0.030)02
K10.0017480(32)0.910.41(0.029)0.42(0.030)32
115
TableC.1(cont'd)
2018K20.0017781(52)0.790.44(0.029)0.43(0.029)10-1
(1490)K30.00001490(100)0.900.42(0.028)0.42(0.028)50
KA0.0012781(52)0.880.43(0.029)0.42(0.029)8-1
G0.0126455(19)0.530.42(0.021)0.44(0.021)04
K10.0027459(19)0.880.44(0.021)0.44(0.026)41
2017+18K20.00211032(43)0.760.46(0.023)0.46(0.023)10-1
(2391)K30.00002390(100)0.870.46(0.025)0.46(0.025)90
KA0.00151153(48)0.830.46(0.023)0.46(0.023)9-1
GRM:Geneticrelationshipmatrix.SD:standarddeviation.
a
PenalizationparameterinEquation(4.6)found
bycross-validatingtheTS.
b
n
sup
=averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient
inthesparseHatmatrix(supportset).RS:relativesparsity(
100
n
TS
š
n
sup
).Inthestandardmodels

CV
is
equaltozeroand
n
sup
isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy
ofthe
c
standardK-BLUPrelativetothestandardG-BLUP,and
d
sparse*-BLUPrelativetothestandard
*-BLUP(*=G-orK-).
116
TableC.2:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitGY,environmentDRT.
Accuracy(SD)%Gain
TS(
n
TS
)GRM

CV
a
n
sup
(RS)
b
h
2
StandardSparseI
c
II
d
TS+0%(2019)
G0.0071418(46)0.570.27(0.014)0.29(0.014)06
K10.0012522(58)0.890.28(0.014)0.20(0.020)4-30
2017K20.0015597(66)0.770.31(0.014)0.33(0.014)154
(901)K30.0000901(100)0.940.34(0.013)0.34(0.013)230
KA0.0008680(75)0.880.31(0.018)0.30(0.033)13-2
G0.0075561(40)0.410.20(0.018)0.23(0.018)017
K10.0011768(54)0.810.20(0.018)0.19(0.017)3-8
2018K20.00011387(98)0.600.23(0.018)0.24(0.018)182
(1417)K30.00001417(100)0.660.26(0.013)0.26(0.014)34-2
KA0.00001417(100)0.690.23(0.020)0.23(0.021)19-3
G0.0066954(41)0.440.23(0.018)0.27(0.018)014
K10.00131013(44)0.840.24(0.018)0.25(0.018)25
2017+18K20.00032049(88)0.650.28(0.017)0.29(0.017)211
(2318)K30.00002318(100)0.750.38(0.013)0.38(0.013)630
KA0.00002318(100)0.730.29(0.019)0.29(0.029)222
TS+5%(2019)
G0.0151243(26)0.570.41(0.028)0.46(0.027)011
K10.0031331(35)0.890.43(0.028)0.44(0.053)42
2017K20.0035484(52)0.770.46(0.026)0.46(0.026)12-1
(938)K30.0000938(100)0.920.45(0.021)0.45(0.021)90
KA0.0010659(70)0.870.46(0.027)0.45(0.033)10-1
G0.0126373(26)0.410.39(0.034)0.43(0.031)010
K10.0013678(47)0.820.41(0.033)0.42(0.035)44
2018K20.00001453(100)0.610.45(0.030)0.45(0.032)150
(1454)K30.00001454(100)0.670.48(0.026)0.48(0.026)22-1
KA0.00001454(100)0.700.46(0.029)0.45(0.031)160
G0.0163387(16)0.440.40(0.035)0.46(0.024)016
K10.0020728(31)0.840.41(0.034)0.46(0.030)311
2017+18K20.00081696(72)0.650.46(0.031)0.45(0.034)15-1
(2355)K30.00002355(100)0.760.49(0.020)0.49(0.020)240
KA0.00061773(75)0.730.46(0.031)0.46(0.038)150
TS+10%(2019)
G0.0174190(19)0.580.54(0.033)0.58(0.030)07
K10.0038184(19)0.890.55(0.032)0.58(0.028)24
2017K20.0064229(23)0.780.57(0.029)0.59(0.029)62
(974)K30.0000974(100)0.920.54(0.031)0.54(0.031)00
KA0.0040284(29)0.880.57(0.029)0.58(0.031)52
G0.0124354(24)0.410.49(0.026)0.53(0.025)09
K10.0020488(33)0.820.50(0.026)0.53(0.027)36
117
TableC.2(cont'd)
2018K20.00021416(95)0.610.54(0.024)0.54(0.025)110
(1490)K30.00001490(100)0.680.56(0.021)0.56(0.022)150
KA0.00001482(99)0.700.54(0.024)0.54(0.026)110
G0.0152389(16)0.450.51(0.030)0.56(0.028)010
K10.0022622(26)0.840.52(0.029)0.55(0.028)36
2017+18K20.00051981(83)0.660.56(0.026)0.56(0.027)100
(2391)K30.00002391(100)0.760.58(0.021)0.58(0.021)140
KA0.00012292(96)0.730.56(0.026)0.56(0.027)100
TS+15%(2019)
G0.0236126(12)0.580.64(0.029)0.68(0.025)07
K10.0058125(12)0.890.65(0.028)0.68(0.025)24
2017K20.0097125(12)0.790.67(0.026)0.68(0.023)52
(1010)K30.00001010(100)0.920.64(0.025)0.64(0.025)10
KA0.0061145(14)0.870.66(0.026)0.68(0.024)43
G0.0126343(22)0.420.60(0.026)0.65(0.030)08
K10.0013705(46)0.830.62(0.026)0.64(0.028)33
2018K20.00011491(98)0.620.65(0.025)0.65(0.025)80
(1526)K30.00001526(100)0.690.65(0.024)0.65(0.024)80
KA0.00001518(99)0.710.65(0.026)0.65(0.026)80
G0.0175273(11)0.450.60(0.030)0.67(0.026)011
K10.0028458(19)0.840.62(0.029)0.66(0.026)37
2017+18K20.00111488(61)0.660.65(0.027)0.66(0.026)91
(2464)K30.00002427(100)0.760.65(0.024)0.65(0.024)80
KA0.00071682(69)0.740.65(0.027)0.66(0.026)90
GRM:Geneticrelationshipmatrix.SD:standarddeviation.
a
PenalizationparameterinEquation(4.6)found
bycross-validatingtheTS.
b
n
sup
=averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient
inthesparseHatmatrix(supportset).RS:relativesparsity(
100
n
TS
š
n
sup
).Inthestandardmodels

CV
is
equaltozeroand
n
sup
isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy
ofthe
c
standardK-BLUPrelativetothestandardG-BLUP,and
d
sparse*-BLUPrelativetothestandard
*-BLUP(*=G-orK-).
118
TableC.3:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,and10%ofsubjectsfromthe2019cycle),traitPH,environmentOPT.
Accuracy(SD)%Gain
TS(
n
TS
)GRM

CV
a
n
sup
(RS)
b
h
2
StandardSparseI
c
II
d
TS+0%(2019)
G0.0024670(74)0.550.14(0.014)0.12(0.014)0-9
K10.0005735(82)0.880.14(0.014)0.13(0.015)1-8
2017K20.0006763(85)0.760.13(0.014)0.14(0.014)-23
(901)K30.0000901(100)0.960.04(0.012)0.04(0.012)-73-3
KA0.0002857(95)0.870.12(0.019)0.11(0.028)-14-3
G0.0131317(22)0.710.02(0.015)0.08(0.015)0430
K10.0020387(27)0.950.06(0.015)0.08(0.015)26543
2018K20.0024615(43)0.870.10(0.015)0.12(0.014)56422
(1417)K30.00001417(100)0.930.06(0.013)0.06(0.013)322-1
KA0.0019545(38)0.930.09(0.021)0.13(0.017)51733
G0.0129499(22)0.640.11(0.013)0.07(0.014)0-38
K10.0014835(36)0.930.15(0.013)0.07(0.013)39-51
2017+18K20.00081660(72)0.840.16(0.013)0.13(0.014)49-19
(2318)K30.00002318(100)0.940.06(0.013)0.06(0.013)-46-1
KA0.00051786(77)0.890.15(0.018)0.14(0.024)40-8
TS+5%(2019)
G0.0010819(87)0.550.23(0.031)0.22(0.031)0-1
K10.0002859(92)0.880.23(0.031)0.23(0.032)3-3
2017K20.0003860(92)0.750.25(0.031)0.24(0.031)9-2
(938)K30.0000938(100)0.920.20(0.032)0.20(0.032)-100
KA0.0002882(94)0.870.24(0.033)0.22(0.035)5-5
G0.0103373(26)0.700.10(0.037)0.18(0.036)076
K10.0016458(31)0.940.16(0.036)0.19(0.036)5917
2018K20.0018714(49)0.870.24(0.032)0.25(0.032)1365
(1454)K30.00001454(100)0.920.24(0.027)0.24(0.027)1380
KA0.0013682(47)0.920.23(0.034)0.24(0.033)1215
G0.0121500(21)0.630.16(0.024)0.14(0.028)0-11
K10.00061388(59)0.920.22(0.025)0.17(0.027)34-19
2017+18K20.00061815(77)0.840.26(0.026)0.25(0.026)61-4
(2355)K30.00002355(100)0.920.21(0.027)0.21(0.027)280
KA0.00041847(78)0.890.25(0.028)0.24(0.026)53-5
TS+10%(2019)
G0.0026706(72)0.550.28(0.043)0.28(0.044)0-1
K10.0004817(84)0.880.29(0.042)0.28(0.044)3-2
2017K20.0004857(88)0.760.30(0.043)0.30(0.043)8-1
(974)K30.0000974(100)0.920.28(0.046)0.28(0.046)-10
KA0.0002907(93)0.860.29(0.042)0.28(0.043)4-2
G0.0128291(20)0.700.17(0.072)0.25(0.065)048
K10.0020363(24)0.940.23(0.066)0.25(0.066)3511
119
TableC.3(cont'd)
2018K20.0022624(42)0.870.30(0.054)0.31(0.055)792
(1490)K30.00001490(100)0.920.32(0.043)0.32(0.043)920
KA0.0019540(36)0.920.29(0.057)0.30(0.055)704
G0.0117468(20)0.630.22(0.047)0.22(0.055)00
K10.0013798(33)0.920.28(0.044)0.25(0.053)26-8
2017+18K20.00051846(77)0.840.32(0.042)0.31(0.042)47-3
(2391)K30.00002391(100)0.920.29(0.042)0.29(0.042)320
KA0.00031982(83)0.890.31(0.043)0.30(0.044)43-3
GRM:Geneticrelationshipmatrix.SD:standarddeviation.
a
PenalizationparameterinEquation(4.6)found
bycross-validatingtheTS.
b
n
sup
=averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient
inthesparseHatmatrix(supportset).RS:relativesparsity(
100
n
TS
š
n
sup
).Inthestandardmodels

CV
is
equaltozeroand
n
sup
isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy
ofthe
c
standardK-BLUPrelativetothestandardG-BLUP,and
d
sparse*-BLUPrelativetothestandard
*-BLUP(*=G-orK-).
120
TableC.4:Heritabilityandaccuracyofpredictionforeachtrainingset(TS)composition
(including0%,5%,10%,and15%ofsubjectsfromthe2019cycle),traitPH,environmentDRT.
Accuracy(SD)%Gain
TS(
n
TS
)GRM

CV
a
n
sup
(RS)
b
h
2
StandardSparseI
c
II
d
TS+0%(2019)
G0.0025658(73)0.640.18(0.013)0.17(0.013)0-4
K10.0005718(80)0.920.19(0.013)0.34(0.012)978
2017K20.0003842(93)0.840.24(0.013)0.25(0.013)344
(901)K30.0000901(100)1.000.33(0.012)0.33(0.012)841
KA0.0000901(100)0.920.22(0.024)0.11(0.063)26-51
G0.0175263(19)0.59-0.15(0.017)-0.11(0.017)0-29
K10.0028319(23)0.91-0.15(0.017)-0.13(0.017)-3-13
2018K20.0027589(42)0.81-0.13(0.018)-0.12(0.018)-14-9
(1417)K30.00001417(100)0.96-0.06(0.019)-0.06(0.019)-630
KA0.0017675(48)0.87-0.14(0.021)-0.13(0.018)-11-8
G0.0178363(16)0.58-0.05(0.017)-0.04(0.017)0-18
K10.0028447(19)0.91-0.05(0.017)-0.11(0.018)-3115
2017+18K20.0033741(32)0.82-0.04(0.017)-0.05(0.017)-2836
(2318)K30.00002316(100)0.970.08(0.019)0.08(0.019)-2430
KA0.0022853(37)0.87-0.04(0.020)-0.05(0.026)-3039
TS+5%(2019)
G0.0025664(71)0.660.55(0.024)0.56(0.025)02
K10.0003781(83)0.920.57(0.023)0.60(0.021)35
2017K20.0001907(97)0.860.61(0.020)0.60(0.021)90
(938)K30.0000938(100)0.980.58(0.030)0.58(0.029)50
KA0.0000919(98)0.930.60(0.022)0.57(0.039)7-5
G0.0157271(19)0.590.17(0.040)0.26(0.044)060
K10.0030286(20)0.910.22(0.043)0.29(0.044)3234
2018K20.0024625(43)0.820.31(0.045)0.33(0.045)877
(1454)K30.00001454(100)0.950.34(0.037)0.34(0.037)1040
KA0.0018658(45)0.870.30(0.046)0.32(0.047)835
G0.0214261(11)0.580.21(0.030)0.34(0.033)059
K10.0034344(15)0.910.27(0.032)0.32(0.080)2420
2017+18K20.0037636(27)0.820.36(0.034)0.36(0.035)681
(2355)K30.00002355(100)0.960.43(0.027)0.43(0.027)1020
KA0.0028699(30)0.870.35(0.037)0.36(0.038)652
TS+10%(2019)
G0.0050625(64)0.670.63(0.025)0.64(0.028)01
K10.0006769(79)0.930.65(0.024)0.65(0.027)20
2017K20.0000970(99)0.860.68(0.023)0.67(0.025)60
(974)K30.0000974(100)0.960.67(0.029)0.67(0.029)50
KA0.0001947(97)0.930.67(0.024)0.66(0.028)5-1
G0.0184211(14)0.590.51(0.037)0.62(0.037)023
K10.0030260(17)0.910.58(0.033)0.64(0.035)1310
121
TableC.4(cont'd)
2018K20.0023628(42)0.830.65(0.028)0.66(0.028)282
(1490)K30.00001490(100)0.940.60(0.031)0.60(0.031)170
KA0.0018653(44)0.870.64(0.029)0.66(0.030)272
G0.0212231(10)0.580.49(0.030)0.63(0.027)028
K10.0034309(13)0.910.57(0.028)0.65(0.027)1513
2017+18K20.0040559(23)0.820.65(0.023)0.67(0.023)332
(2391)K30.00002391(100)0.950.64(0.022)0.64(0.022)310
KA0.0033556(23)0.870.65(0.024)0.67(0.025)323
TS+15%(2019)
G0.0055487(48)0.660.67(0.023)0.69(0.025)03
K10.0039613(61)0.930.69(0.022)0.70(0.020)22
2017K20.0006804(80)0.860.71(0.021)0.72(0.021)60
(1010)K30.00001010(100)0.970.70(0.027)0.69(0.027)30
KA0.0003857(85)0.920.71(0.022)0.70(0.025)5-1
G0.0198173(11)0.590.54(0.032)0.65(0.031)020
K10.0034204(13)0.910.60(0.029)0.66(0.027)1110
2018K20.0025567(37)0.830.67(0.025)0.68(0.026)232
(1526)K30.00001526(100)0.940.67(0.027)0.67(0.027)220
KA0.0018626(41)0.870.67(0.025)0.68(0.026)231
G0.0213193(8)0.580.51(0.028)0.65(0.025)027
K10.0038253(10)0.910.59(0.026)0.66(0.025)1413
2017+18K20.0034600(25)0.820.67(0.023)0.68(0.023)302
(2464)K30.00002427(100)0.950.67(0.022)0.67(0.022)320
KA0.0028618(25)0.870.66(0.023)0.67(0.025)292
GRM:Geneticrelationshipmatrix.SD:standarddeviation.
a
PenalizationparameterinEquation(4.6)found
bycross-validatingtheTS.
b
n
sup
=averagenumberofindividualsfromtheTSwithanon-zerocoe˚cient
inthesparseHatmatrix(supportset).RS:relativesparsity(
100
n
TS
š
n
sup
).Inthestandardmodels

CV
is
equaltozeroand
n
sup
isequaltothetotalTSsize.WithineachTScycle,percentageofgaininaccuracy
ofthe
c
standardK-BLUPrelativetothestandardG-BLUP,and
d
sparse*-BLUPrelativetothestandard
*-BLUP(*=G-orK-).
122
BIBLIOGRAPHY
123
BIBLIOGRAPHY
Aguate,FernandoM.,SamuelTrachsel,LorenaGonzález-Pérez,JuanBurgueño,JoséCrossa,
MónicaBalzarini,DavidGouache,MatthieuBogard&GustavodelosCampos.2017.Use
ofhyperspectralimagedataoutperformsvegetationindicesinpredictionofmaizeyield.
Crop
Science
57.
Akdemir,Deniz&JulioIsidro-Sanchez.2019.Designoftrainingpopulationsforselectivepheno-
typingingenomicprediction.
Scienti˝cReports
9.
Akdemir,Deniz,JulioISanchez&Jean-LucJannink.2015.Optimizationofgenomicselection
trainingpopulationswithageneticalgorithm.
GeneticsSelectionEvolution
47(38).
Alvarado,Gregorio,FranciscoM.Rodríguez,AngelaPacheco,JuanBurgueño,JoséCrossa,Mateo
Vargas,PaulinoPérez-Rodríguez&MarcoA.Lopez-Cruz.2020.META-R:Asoftwareto
analyzedatafrommulti-environmentplantbreedingtrials.
TheCropJournal
8(5).
Araus,Luis&JillECairns.2014.Fieldhigh-throughputphenotyping:thenewcropbreeding
frontier.
TrendsinPlantScience
19(1).
Atanda,SikiruAdeniyi,MichaelOlsen,JuanBurgueño,JoseCrossa,DanielDzidzienyo,Yoseph
Beyene,ManjeGowda,KateDreher,XuecaiZhang,BoddupalliM.Prasanna,PangirayiTon-
goona,EricYirenkyiDanquah,GbadeboOlaoye&KellyR.Robbins.2020.Maximizing
e˚ciencyofgenomicselectioninCIMMYT'stropicalmaizebreedingprogram.
Theoretical
andAppliedGenetics
.
Babar,MA,MPReynolds,MVanGinkel,ARKlatt,WRRaun&MLStone.2006.Spectral
re˛ectancetoestimategeneticvariationforin-seasonbiomass,leafchlorophyll,andcanopy
temperatureinwheat.
CropScience
46.
Bates,Douglas,MartinMächler,BenjaminMBolker&StevenCWalker.2015.Fittinglinear
mixed-e˙ectsmodelsusinglme4.
JournalofStatisticalSoftware
67(1).
Bernardo,Rex&JianmingYu.2007.ProspectsforGenomewideSelectionforQuantitativeTraits
inMaize.
CropScience
47.
Beyene,Yoseph,ManjeGowda,MichaelOlsen,KellyR.Robbins,PaulinoPérez-Rodríguez,
GregorioAlvarado,KateDreher,StarYanxinGao,StephenMugo,BoddupalliM.Prasanna&
JoseCrossa.2019.EmpiricalComparisonofTropicalMaizeHybridsSelectedThroughGenomic
andPhenotypicSelections.
FrontiersinPlantScience
10(1502).
Bulmer,MG.1985.
Themathematicaltheoryofquantitativegenetics
.NewYork,USA:Oxford
UniversityPress.
delosCampos,G.,D.Gianola&G.J.Rosa.2009a.ReproducingkernelHilbertspacesregression:
ageneralframeworkforgeneticevaluation.
Journalofanimalscience
87.
124
delosCampos,Gustavo,DanielGianola,GuilhermeJ.M.Rosa,KentA.Weigel&JoseCrossa.
2010.Semi-parametricgenomic-enabledpredictionofgeneticvaluesusingreproducingkernel
Hilbertspacesmethods.
GeneticsResearch
92(4).
delosCampos,Gustavo,JohnM.Hickey,RicardoPong-Wong,HansD.Daetwyler&MarioP.L.
Calus.2013a.Whole-genomeregressionandpredictionmethodsappliedtoplantandanimal
breeding.
Genetics
193(2).
delosCampos,Gustavo,HugoNaya,DanielGianola,JoséCrossa,AndrésLegarra,Eduardo
Manfredi,KentWeigel&JoséMiguelCotes.2009b.Predictingquantitativetraitswithregression
modelsfordensemolecularmarkersandpedigree.
Genetics
182.
delosCampos,Gustavo,AnaIVazquez,RohanFernando,YannCKlimentidis&DanielSorensen.
2013b.PredictionofcomplexhumantraitsusingtheGenomicBestLinearUnbiasedPredictor.
PLoSGenetics
9(7).
delosCampos,Gustavo,YogasudhaVeturi,AnaI.Vazquez,ChristinaLehermeier&Paulino
Pérez-Rodríguez.2015.Incorporatinggeneticheterogeneityinwhole-genomeregressionsusing
interactions.
JournalofAgricultural,Biological,andEnvironmentalStatistics
20(4).
Clark,SamuelA.,JohnM.Hickey,HansD.Daetwyler&JuliusH.J.vanderWerf.2012.The
importanceofinformationonrelativesforthepredictionofgenomicbreedingvaluesandthe
implicationsforthemakeupofreferencedatasetsinlivestockbreedingschemes.
Genetics
SelectionEvolution
44(4).
Combs,Emily&RexBernardo.2013.Accuracyofgenomewideselectionfordi˙erenttraitswith
constantpopulationsize,heritability,andnumberofmarkers.
ThePlantGenome
6(1).
Cover,T&PHart.1967.Nearestneighborpatternclassi˝cation.
IEEETransactionsonInformation
Theory
13(1).
Crossa,José,GustavodelosCampos,PaulinoPérez,DanielGianola,JuanBurgueño,JoséLuis
Araus,DanMakumbi,RaviP.Singh,SusanneDreisigacker,JianbingYan,ViviArief,Marianne
Banziger&HansJoachimBraun.2010.Predictionofgeneticvaluesofquantitativetraitsinplant
breedingusingpedigreeandmolecularmarkers.
Genetics
186(2).
Daetwyler,HansD.,BeatrizVillanueva&JohnA.Woolliams.2008.Accuracyofpredictingthe
geneticriskofdiseaseusingagenome-wideapproach.
PLoSONE
3(10).
Dagnachew,B.S.,T.H.E.Meuwissen&T.Ådnøy.2013.GeneticcomponentsofmilkFourier-
transforminfraredspectrausedtopredictbreedingvaluesformilkcompositionandqualitytraits
indairygoats.
JournalofDairyScience
96(9).
Dawson,JulieC.,Je˙reyB.Endelman,NicolasHeslot,JoseCrossa,JessePoland,SusanneDreisi-
gacker,YannManès,MarkE.Sorrells&JeanLucJannink.2013.Theuseofunbalanced
historicaldataforgenomicselectioninaninternationalwheatbreedingprogram.
FieldCrops
Research
154.
125
Dekkers,JCM.2007.Predictionofresponsetomarker-assistedandgenomicselectionusing
selectionindextheory.
JournalofAnimalBreedingandGenetics
124.
Efron,B,THastie,IJohnstone&RTibshirani.2004.Leastangleregression.
TheAnnalsof
Statistics
32(2).
Endelman,Je˙reyB.2011.RidgeRegressionandOtherKernelsforGenomicSelectionwithR
PackagerrBLUP.
ThePlantGenomeJournal
4(3).
Falconer,DouglasS&TrudyFCMackay.1996.
Introductiontoquantitativegenetics
.Essex,UK:
PrenticeHall4thedn.
Ferragina,A,GdelosCampos,AIVazquez,ACecchinato&GBittante.2015.Bayesian
regressionmodelsoutperformpartialleastsquaresmethodsforpredictingmilkcomponents
andtechnologicalpropertiesusinginfraredspectraldata.
JournalofDairyScience
98(11).

Ferrio,JP,DVillegas,JZarco,NAparicio,JLAraus&CRoyo.2005.Assessmentofdurum
wheatyieldusingvisibleandnear-infraredre˛ectancespectraofcanopies.
FieldCropsResearch
94.
Friedman,Jerome,TrevorHastie,HolgerHö˛ing&RobertTibshirani.2007.Pathwisecoordinate
optimization.
TheAnnalsofAppliedStatistics
1(2).
Friedman,Jerome,TrevorHastie&RobTibshirani.2010.Regularizationpathsforgeneralized
linearmodelsviacoordinatedescent.
JournalofStatisticalSoftware
33(1).
Fu,WenjiangJ.1998.Penalizedregressions:theBridgeversustheLASSO.
JournalofComputa-
tionalandGraphicalStatistics
7(3).
Garnsworthy,PC,JWiseman&KFegeros.2000.Predictionofchemical,nutritiveandagronomic
characteristicsofwheatbynearinfraredspectroscopy.
JournalofAgriculturalScience
135.

Garrick,DorianJ.2011.Thenature,scopeandimpactofgenomicpredictioninbeefcattleinthe
UnitedStates.
GeneticsSelectionEvolution
43(1).
Garriga,Miguel,SebastiánRomero-Bravo,FélixEstrada,AlejandroEscobar,IvanAMatus,Ale-
jandrodelPozo,CesarAAstudillo&GustavoALobos.2017.Assessingwheattraitsbyspectral
re˛ectance:dowereallyneedtofocusonpredictedtrait-valuesordirectlyidentifytheelite
genotypesgroup?
FrontiersinPlantScience
8(280).
Gianola,Daniel,RohanL.Fernando&AlessandraStella.2006.Genomic-assistedpredictionof
geneticvaluewithsemiparametricprocedures.
Genetics
173.
Gitelson,AnatolyA,YoramJKaufman&MarkNMerzlyak.1996.Useofagreenchannelin
remotesensingofglobalvegetationfromEOS-MODIS.
RemoteSensingofEnvironment
58(3).

126
Goddard,Mike.2009.Genomicselection:Predictionofaccuracyandmaximisationoflongterm
response.
Genetica
136.
González-Camacho,J.M.,G.delosCampos,P.Pérez,D.Gianola,J.E.Cairns,G.Mahuku,
R.Babu&J.Crossa.2012.Genome-enabledpredictionofgeneticvaluesusingradialbasis
functionneuralnetworks.
TheoreticalandAppliedGenetics
125(4).
Habier,D,RLFernando&JCMDekkers.2007.Theimpactofgeneticrelationshipinformation
ongenome-assistedbreedingvalues.
Genetics
177.
Habier,D,RohanLFernando&DorianJGarrick.2013.GenomicBLUPDecoded:ALookinto
theBlackBoxofGenomicPrediction.
Genetics
194.
Habier,David,JensTetens,FranzReinholdSeefried,PeterLichtner&GeorgThaller.2010.The
impactofgeneticrelationshipinformationongenomicbreedingvaluesinGermanHolsteincattle.
GeneticsSelectionEvolution
42(5).
Haboudane,Driss,JohnRMiller,NicolasTremblay,PabloJZarco-Tejada&LouiseDextraze.
2002.Integratednarrow-bandvegetationindicesforpredictionofcropchlorophyllcontentfor
applicationtoprecisionagriculture.
RemoteSensingofEnvironment
81.
Hadley,Wickman.2016.
ggplot2:Elegantgraphicsfordataanalysis
.Springer-VerlagNewYork.
Hansen,PM&JKSchjoerring.2003.Re˛ectancemeasurementofcanopybiomassandnitrogen
statusinwheatcropsusingnormalizeddi˙erencevegetationindicesandpartialleastsquares
regression.
RemoteSensingofEnvironment
86.
Hastie,Trevor,RobertTibshirani&J.H.Friedman.2009.
Theelementsofstatisticallearning:
datamining,inference,andprediction
.NewYork,USA:Springer2ndedn.
Hayes,BenJ,PhillipJBowman,AmandaCChamberlain,KlaraVerbyla&MikeEGoddard.2009.
Accuracyofgenomicbreedingvaluesinmulti-breeddairycattlepopulations.
GeneticsSelection
Evolution
41.51.
Hazel,LN.1943.Thegeneticbasisforconstructingselectionindexes.
Genetics
28(6).
He˙ner,ElliotL,MarkESorrells&Jean-LucJannink.2009.GenomicSelectionforCrop
Improvement.
CropScience
49.
Henderson,CR.1950.Estimationofgeneticparameters.
TheAnnalsofMathematicalStatistics
21.
Henderson,CR.1963.Selectionindexandexpectedgeneticadvance.In
Statisticalgeneticsand
plantbreeding:Asymposiumandworkshop
,Washington,D.C.:NationalAcademyof
Sciences-NationalResearchCouncil.
Henderson,CR&RLQuaas.1976.Multipletraitevaluationusingrelatives'records.
Journalof
AnimalScience
43(6).
127
Hernandez,Javier,GustavoALobos,IvánMatus,AlejandrodelPozo,PaolaSilva&Mauricio
Galleguillos.2015.Usingridgeregressionmodelstoestimategrainyieldfrom˝eldspectral
datainbreadwheat(TriticumaestivumL.)grownunderthreewaterregimes.
RemoteSensing
7.

Isidro,Julio,JanninkJean-Luc,DenizAkdemir,JessePoland,NicolasHeslot&MarkESorrells.
2015.Trainingsetoptimizationunderpopulationstructureingenomicselection.
TheorAppl
Genet
128.
Jacobson,Amy,LianLian,ShengqiangZhong&RexBernardo.2014.Generalcombiningability
modelforgenomewideselectioninabiparentalcross.
CropScience
54(3).
Jiang,Guo-Liang.2013.MolecularMarkersandMarker-AssistedBreedinginPlants.In
Plant
breedingfromlaboratoriesto˝elds
,InTech.
Lehermeier,Christina,NicoleKrämer,EvaBauer,CyrilBauland,ChristianCamisan,LauraCampo,
PascalFlament,AlbrechtEMelchinger,MonicaMenz,NinaMeyer,LaurenceMoreau,Jesús
Moreno-González,MilenaOuzunova,HubertPausch,NicolasRanc,WolfgangSchipprack,Man-
fredSchönleben,HildrunWalter,AlainCharcosset&Chris-CarolinSchön.2014.Usefulnessof
MultiparentalPopulationsofMaize.
Genetics
198.
Lehermeier,Christina,ChrisCarolinSchön&GustavodelosCampos.2015.Assessmentof
geneticheterogeneityinstructuredplantpopulationsusingmultivariatewhole-genomeregression
models.
Genetics
201(1).
Lopez-Cruz,Marco,EricOlson,GabrielRovere,JoseCrossa,SusanneDreisigacker,Mondal
Suchismita,RaviSingh&GustavodelosCampos.2020.Regularizedselectionindicesfor
breedingvaluepredictionusinghyper-spectralimagedata.
Scienti˝cReports
10(8195).
Lorenz,Aaron&KevinPSmith.2015.AddingGeneticallyDistantIndividualstoTraining
PopulationsReducesGenomicPredictionAccuracyinBarley.
CropScience
55.
Lorenzana,RobenzonE.&RexBernardo.2009.Accuracyofgenotypicvaluepredictionsfor
marker-basedselectioninbiparentalplantpopulations.
TheoreticalandAppliedGenetics
120(1).

Lush,JayL.1935.Progenytestandindividualperformanceasindicatorsofananimal'sbreeding
value.
JournalofDairyScience
18(1).
Lush,JayL.1937.
Animalbreedingplans
.Ames,Iowa:IowaStateCollege,Ames.
Lush,JayL.1948.
Thegeneticsofpopulations
.Iowa:IowaStateCollege,Amesspecialreedn.
Mahlein,AK,ECOerke,USteiner&HWilhelmDehne.2012.Recentadvancesinsensingplant
diseasesforprecisioncropprotection.
EuropeanJournalofPlantPathology
133.
Melchinger,AlbrechtE.,H.FriedrichUtz&ChrisC.Schör.1998.Quantitativetraitlocus(QTL)
mappingusingdi˙erenttestersandindependentpopulationsamplesinmaizerevealslowpower
ofQTLdetectionandlargebiasinestimatesofQTLe˙ects.
Genetics
149(1).
128
Meuwissen,THE,BJHayes&MEGoddard.2001.Predictionoftotalgeneticvalueusing
genome-widedensemarkermaps.
Genetics
157(4).
Miah,G.,M.Y.Ra˝i,M.R.Ismail,A.B.Puteh,H.A.Rahim,R.Asfaliza&M.A.Latif.2013.
Blastresistanceinrice:areviewofconventionalbreedingtomolecularapproaches.
Molecular
BiologyReports
40(3).
Montes,JM,HFUtz,WSchipprack,BKusterer,JMuminovic,CPaul&AEMelchinger.2006.
Near-infraredspectroscopyoncombineharvesterstomeasuremaizegraindrymattercontent
andqualityparameters.
PlantBreeding
125.
Montes,JuanM,AlbrechtEMelchinger&JochenCReif.2007.Novelthroughputphenotyping
platformsinplantgeneticstudies.
TrendsinPlantScience
12(10).
Montesinos-López,Osval,AbelardoMontesinos-López,JoseCrossa,GustavodelosCampos,
GregorioAlvarado,MondalSuchismita,JessicaRutkoski,LorenaGonzález-Pérez&JuanBur-
gueño.2017.Predictinggrainyieldusingcanopyhyperspectralre˛ectanceinwheatbreeding
data.
PlantMethods
13(4).
Morota,Gota&DanielGianola.2014.Kernel-basedwhole-genomepredictionofcomplextraits:
Areview.
FrontiersinGenetics
5.
Nagel,KerstinA.,AlexanderPutz,FrankGilmer,KathrinHeinz,AndreasFischbach,Johannes
Pfeifer,MarcFaget,StephanBlossfeld,MichaelaErnst,ChryssaDimaki,BerndKastenholz,
AnnKatrinKleinert,AnnaGalinski,HannoScharr,FabioFiorani&UlrichSchurr.2012.
GROWSCREEN-Rhizoisanovelphenotypingrobotenablingsimultaneousmeasurementsof
rootandshootgrowthforplantsgrowninsoil-˝lledrhizotrons.
FunctionalPlantBiology
39(11).

Oblath,EmilyA,TerryAIsbell,MarkABerhow,BrettAllen,DavidArcher,JackBrown,RussellW
Gesch,JerryLHat˝eld,JalalDJabro,JamesRKiniry&DanielSLong.2016.Developmentof
near-infraredspectroscopycalibrationstomeasurequalitycharacteristicsinintactBrassicaceae
germplasm.
IndustrialCrops&Products
89.
Olson,K.M.,P.M.VanRaden&M.E.Tooker.2012.Multibreedgenomicevaluationsusing
purebredHolsteins,Jerseys,andBrownSwiss.
JournalofDairyScience
95(9).
Perez,Paulino&GustavodelosCampos.2014.Genome-wideregressionandpredictionwiththe
BGLRstatisticalpackage.
Genetics
198(2).
Pérez-Rodríguez,Paulino,JoséCrossa,JessicaRutkoski,JessePoland,RaviSingh,AndrésLegarra,
EnriqueAutrique,GustavoDeLosCampos,JuanBurgueño&SusanneDreisigacker.2017.
Single-stepgenomicandpedigreegenotype
Ö
environmentinteractionmodelsforpredicting
wheatlinesininternationalenvironments.
ThePlantGenomeJournal
10(2).
Poland,Jesse,Je˙reyEndelman,JulieDawson,JessicaRutkoski,ShuangyeWu,YannManes,
SusanneDreisigacker,JoséCrossa,HéctorSánchez-villeda,MarkSorrells&Jean-LucJannink.
2012.GenomicSelectioninWheatBreedingusingGenotyping-by-Sequencing.
ThePlant
GenomeJournal
5(3).
129
Pritchard,JonathanK&PeterDonnelly.2001.StudiesofAssociationinStructured
orAdmixedPopulations.
TheoreticalPopulationBiology
60.
Pritchard,JonathanK.,MatthewStephens&PeterDonnelly.2000.Inferenceofpopulationstructure
usingmultilocusgenotypedata.
Genetics
155.
Pszczola,M&MPLCalus.2016.Updatingthereferencepopulationtoachieveconstantgenomic
predictionreliabilityacrossgenerations.
Animal
10(6).
RCoreTeam.2019.
R:Alanguageandenvironmentforstatisticalcomputing
.RFoundationfor
StatisticalComputingVienna,Austria.https://www.R-project.org/.
Riedelsheimer,Christian,Je˙reyB.Endelman,MichaelStange,MarkE.Sorrells,JeanLucJannink
&AlbrechtE.Melchinger.2013.Genomicpredictabilityofinterconnectedbiparentalmaize
populations.
Genetics
194.
Rincent,R,SNicolas,TAltmann,DBrunel,PRevilla,AMelchinger,EBauer,NMeyer,
CGiau˙ret,CBauland,PJamin,JLaborde,HMonod,PFlament,ACharcosset&LMoreau.
2012.Maximizingthereliabilityofgenomicselectionbyoptimizingthecalibrationsetof
referenceindividuals:comparisonofmethodsintwodiversegroupsofmaizeinbreds(Zeamays
L.).
Genetics
192.
Rio,Simon,LaurenceMoreau,AlainCharcosset&TristanMary-Huard.2020.Accountingfor
group-speci˝callelee˙ectsandadmixtureingenomicpredictions:Theoryandexperimental
evaluationinmaize.
Genetics
216(1).
Roemer,C,MWahabzada,ABallvora,FPinto,MRossini,CPanigada,JBehmann,JLeon,CThu-
rau,CBauckhage,KKersting,URascher&LPluemer.2012.EarlyDroughtStressDetection
inCereals:SimplexVolumeMaximizationforHyperspectralImageAnalysis.
FunctionalPlant
Biology
39.
Roth,Morgane,HélèneMuranty,MarioDiGuardo,WalterGuerra,AndreaPatocchi&Fabrizio
Costa.2020.Genomicpredictionoffruittextureandtrainingpopulationoptimizationtowards
theapplicationofgenomicselectioninapple.
HorticultureResearch
7(1).
Rutkoski,Jessica,JessePoland,SuchismitaMondal,EnriqueAutrique,LorenaPérez-González,
JoséCrossa,MatthewReynolds&RaviSingh.2016.Canopytemperatureandvegetationindices
fromhigh-throughputphenotypingimproveaccuracyofpedigreeandgenomicselectionforgrain
yieldinheat.
G3:Genes,Genomes,Genetics
6.
Schulz-Streeck,T,JOOgutu,ZKaraman,CKnaak&HPPiepho.2012.GenomicSelectionusing
MultiplePopulations.
CropScience
52.
Searle,SR,GCasella&CEMcCulloch.1992.
Variancecomponents
.JohnWiley&Sons,Inc.
Smith,HF.1936.Adiscrimantfunctionforplantselection.
AnnalsofEugenics
7.
Soyeurt,H,IMisztal&NGengler.2010.Geneticvariabilityofmilkcomponentsbasedon
mid-infraredspectraldata.
JournalofDairyScience
93(4).
130
Spielbauer,Gertraud,PaulArmstrong,JohnWBaier,WilliamB.Allen,KatinaRichardson,
BoShen&AMarkSettles.2009.High-throughputnear-infraredre˛ectancespectroscopy
forpredictingquantitativeandqualitativecompositionphenotypesofindividualmaizekernels.
CerealChemistry
86(5).
Tang,Haibao,UzaySezen&AndrewH.Paterson.2010.Domesticationandplantgenomes.
CurrentOpinioninPlantBiology
13.
Tattaris,Maria,MatthewP.Reynolds&ScottC.Chapman.2016.Adirectcomparisonofremote
sensingapproachesforhigh-throughputphenotypinginplantbreeding.
FrontiersinPlantScience
7.
Tibshirani,Robert.1996.RegressionshrinkageandselectionviatheLASSO.
JournaloftheRoyal
StatisticalSocietyB
58(1).
Tucker,ComptonJ.1979.Redandphotographicinfraredlinearcombinationsformonitoring
vegetation.
RemoteSensingofEnvironment
8(2).
VanRaden,PM.2007.Genomicmeasuresofrelationshipandinbreeding.
InterbullBulletin
37.

VanRaden,PM.2008.E˚cientMethodstoComputeGenomicPredictions.
JournalofDairy
Science
91(11).
Veturi,Yogasudha,GustavodelosCampos,NengjunYi,WenHuang,AnaIVazquez&Brigitte
Kühnel.2019.Modelingheterogeneityinthegeneticarchitectureofethnicallydiversegroups
usingrandome˙ectinteractionmodels.
Genetics
211(April).
Weber,VS,JLAraus,JECairns,CSanchez,AEMelchinger&EOrsini.2012.Predictionof
grainyieldusingre˛ectancespectraofcanopyandleavesinmaizeplantsgrownunderdi˙erent
waterregimes.
FieldCropsResearch
128.
White,Je˙reyW,PedroAndrade-Sanchez,MichaelAGore,KevinFBronson,TerryACo˙elt,
MatthewMConley,KennethAFeldmann,AndrewNFrench,JohnTHeun,DouglasJHunsaker,
MatthewAJenks,BruceAKimball,RobertLRoth,RobertJStrand,KellyRThorp,GerardW
Wall&GuangyaoWang.2012.FieldCropsResearchField-basedphenomicsforplantgenetics
research.
FieldCropsResearch
133.
Wientjes,YvonneCJ,RoelFVeerkamp&MarioPLCalus.2013.TheE˙ectofLinkage
DisequilibriumandFamilyRelationshipsontheReliabilityofGenomicPrediction.
Genetics
193.
William,H.M.,R.Trethowan&E.M.Crosby-Galvan.2007.Wheatbreedingassistedbymarkers:
CIMMYT'sexperience.
Euphytica
157(3).
Wolc,Anna,JesusArango,PetekSettar,JanetE.Fulton,NeilP.O'Sullivan,RudolfPreisinger,
DavidHabier,RohanFernando,DorianJ.Garrick&JackC.M.Dekkers.2011.Persistenceof
accuracyofgenomicestimatedbreedingvaluesovergenerationsinlayerchickens.
Genetics
SelectionEvolution
43(23).
131
Wolc,Anna,A.Kranis,J.Arango,P.Settar,J.E.Fulton,N.P.O'Sullivan,A.Avendano,K.A.
Watson,J.M.Hickey,G.delosCampos,R.L.Fernando,D.J.Garrick&J.C.M.Dekkers.2016.
Implementationofgenomicselectioninthepoultryindustry.
AnimalFrontiers
6(1).
Xu,Yunbi&JonathanH.Crouch.2008.Marker-assistedselectioninplantbreeding:From
publicationstopractice.
CropScience
48(2).
Xu,Yunbi,XiaogangLiu,JunjieFu,HongwuWang,JiankangWang,ChanglingHuang,Boddu-
palliM.Prasanna,MichaelS.Olsen,GuoyingWang&AiminZhang.2020.EnhancingGenetic
GainthroughGenomicSelection:FromLivestocktoPlants.
PlantCommunications
1(1).
100005.
Xu,Yunbi,DebraJ.Skinner,HuixiaWu,NataliaPalacios-Rojas,JoseLuisAraus,JianbingYan,
ShibinGao,MarilynL.Warburton&JonathanH.Crouch.2009.Advancesinmaizegenomics
andtheirvalueforenhancinggeneticgainsfrombreeding.
InternationalJournalofPlant
Genomics
1(957602).
Yu,Jianming,JamesBHolland,MichaelDMcmullen&EdwardSBuckler.2008.GeneticDesign
andStatisticalPowerofNestedAssociationMappinginMaize.
GeneticsSelectionEvolution
178.
Zhu,Chengsong,MichaelGore,EdwardSBuckler&JianmingYu.2008.Statusandprospectsof
associationmappinginplants.
ThePlantGenomeJournal
1(1).
Zou,Hui&TrevorHastie.2005.Regularizationandvariableselectionviatheelasticnet.
Journal
oftheRoyalStatisticalSocietyB
67(2).
132