GRAPHESTIMATIONANDNETWORKCONSTRAINEDREGULARIZATIONW
ITH
APPLICATIONSINGENETICALGENOMICSANALYSIS
By
BinGao
ADISSERTATION
Submittedto
MichiganStateUniversity
inpartialfullmentoftherequirements
forthedegreeof
Statistics|DoctorofPhilosophy
2015
ABSTRACT
GRAPHESTIMATIONANDNETWORKCONSTRAINED
REGULARIZATIONWITHAPPLICATIONSINGENETICALGENOMICS
ANALYSIS
By
BinGao
Estimationandapplicationofgraphicalstructureareimportantto
picsinmodernstatis-
tics.Graphicalstructureisanidealtooltodescribegeneregulat
orynetworkwhichisimpor-
tantingeneticalgenomicsstudies.Inthisdissertation,agraphes
timationmethodandtwo
networkconstrainedregularizationmethodsareproposed.Both
theirtheoreticalproperties
andapplicationsingeneticalgenomicsarestudied.
Largeamountofresearcheortshavebeenfocusedonestimatin
ggenenetworksbased
ongeneexpressiondatainordertounderstandthefunctionalba
sisofalivingorganism.By
treatinggeneexpressionsasquantitativetraitswhileconsideringg
eneticmarkers,genetical
genomicsanalysishasshownitspowerinenhancingtheunderstandin
gofgeneregulation.
Knowingthatgeneexpressionsareoftenduetodirectedregulatio
ns,weintroduceacovariate-
adjustedGaussiangraphicalmodeltoestimatetheequivalencecla
ssofthedirectedacyclic
graphs(DAGs)inageneticalgenomicsanalysisframeworkinthesec
ondchapter.Anestima-
tionprocedureisintroducedandtheestimationconsistencyforsp
arseDAGsisestablished.
WeapplythemethodtoahumanAlzheimer'sdiseasedatasettoshowt
heutilityofthe
method.
Givenageneregulationnetwork,learnedeitherthroughbiologicale
xperimentsorvia
statisticalmethods,incorporatingsuchstructureinaregressio
nmodelwouldincreasethe
phenotypepredictionaccuracy.Inchapter3,weusegeneexpre
ssionstopredictphenotypic
responseswhileconsideringthegraphicalstructuresongenenet
worksinaparametricre-
gressionmodel.Giventhatgeneexpressionsareintermediatephen
otypesbetweenatrait
andgenes,wefollowtheinstrumentalvariableregressionframewo
rkproposedbyLinetal.
(2015)andtreatgeneticvariantsasinstrumentalvariablestode
alwiththeendogeneityissue.
Weproposeatwo-stepestimationprocedure.Intherststep,w
eapplylassotoestimate
theeectsofgeneticvariants.Inthesecondstep,weusethepr
edictedexpressionsobtained
fromtherststepaspredictorswhileadoptinganetworkconstra
inedregularizationmethod
toobtainbettervariableselectionandestimation.Weestablishthes
electionconsistency
andrelated
L
1
bound.
Itiscommonlyrecognizedthatgeneticeectsoncomplextraitsare
largelymodulated
byenvironmentalinuences.However,themodulationmechanismis
barelyunderstood.
Basedontheworkinchapter3,weadoptaninstrumentalvariabler
egressionapproach
whileincorporatingavaryingcoecientmodeltocapturepotential
nonlinearenvironmental
modulationeectsinchapter4.B-splinesareusedtoapproximatet
hecoecientfunctions.
Grouplassoandagraphconstrainedpenaltiesareappliedtoachieve
bettervariableselec-
tionandestimationwhennetworkstructuresareconsideredforg
eneexpressions.Selection
consistencyisestablishedundersomeassumptions.Weapplyourme
thodtoahumanliver
cohortdatasettodemonstrateitsutility.
Idedicatethisdissertationtomyparents,LianrongSongandQings
hengGao.
iv
ACKNOWLEDGMENTS
IwouldliketogratefullyandsincerelythankmyadvisorDr.YuehuaCu
iforhisguidance,
understanding,patienceduringmygraduatestudyandresearch
.Dr.Cuiguidedmeinto
theinterdisciplinaryeldofstatisticalgenetics.Hehelpedmeselect
importantpapersto
readandtaughtmethebasicconceptsandtechniquesinstatistica
lgeneticsatthebeginning
ofmyresearch.IrememberthosediscussionsinhisoceinwhichIlea
rnedbasictoolsand
techniqueswhichwouldbeveryimportantinmyresearch.Whenever
Ineedhelp,Dr.Cuiis
alwaysthere.EverytimeIknockedthedoorofhisoce,healwayss
penttimelisteningto
myquestionspatientlyandgavehelpfulcomments.IntheJournal
Clubwhichishostedby
Dr.Cui,Ilearnedalotfromothers'presentation.Ialsodeveloped
mypresentationskills.
Iappreciateothermembersinmycommittee.Dr.YiminXiaowasmyinitia
ladvisor.
Hegavemehelpfuladvisesontheselectionofderentcoursesand
guidanceonmylifewhen
IjustarrivedatUS.Dr.Ping-ShouZhonghelpedinmanyaspectsth
rougheverydiscussion
wehad.Hisinsightfulopinionsmademeviewquestionsderently.Dr.
QingLu'scomments
andquestionsintheJournalClubweregermaneandthoughtful.All
themembersinthe
committeehelpmeduringmygraduatestudy.Ireallyappreciatethe
irguidanceandhelp.
IappreciatealltheprofessorswhotaughtmeandhelpedmeintheD
epartmentofStatis-
ticsandProbability.Iobtainedknowledgefromthosewellprepared
coursesandenjoyedthe
processoflearningandtheinteractionwiththeinstructors.
MythanksgotostudentsinDr.Cui'sgroup,includingHonglangWang,
TaoHe,Shunjie
Guan,JingyiZhang,postdoctoralresearcherXuLiu,andforma
lstudentDr.CenWu.I
alsowanttothankotherstudentsinthedepartment,includingYuz
henZhou,XiaoqingZhu,
LiqianCai,SiddharthaNandy,JiwoongKimandmanyothers.Itismyluc
kthatsomany
v
nicepeopleandfriendsarearoundme.Theyhelpedmeinbothresear
chanddailylife.
Lastbutdenitelynotleast,Ithankmyparentsfortheirsupport
.Theyalwaysmakeall
possibleeortstohelpmewheneverIneed.Theyowncreditsinvery
achievementofmine.
vi
TABLEOFCONTENTS
LISTOFTABLES
....................................
ix
LISTOFFIGURES
...................................
x
Chapter1Introduction
...............................
1
1.1Overview......................................1

1.2Integrativegeneticalgenomicsanalysis..................
...2
1.3Estimatioinandapplicationofgraphicalstructures..........
....4
1.4Varyingcoecientmodelcapturinggene-environmentinteract
ion......6
1.5Objectivesandorganization...........................
7
Chapter2Learningdirectedacyclicgraphicalstructuresw
ithgenetical
genomicsdata
..............................
9
2.1Introduction....................................9

2.2ModelandEstimation..............................12
2.2.1Background................................12

2.2.2Covariate-adjustedgaussiangraphicalmodel...........
...13
2.2.3Atwo-stageestimationprocedure....................1
4
2.3Theoreticalresult.................................1
5
2.4Simulation.....................................17
2.4.1Simulationdesign.............................17

2.4.2Performanceinthelowdimensionalcase................19

2.4.3Performanceinthehighdimensionalcase................1
9
2.4.4Comparinggraphswithorwithoutcovariates.............2
2
2.4.5Comparingdirectedandundirectedgraphs...............
23
2.5ApplicationtoRealData.............................24

2.6Discussion.....................................28
Chapter3IntegrativeAnalysisofGeneticalGenomicsDataI
ncorporat-
ingNetworkStructure
.........................
30
3.1Introduction....................................30

3.2StatisticalMethodsandEstimation......................
.34
3.2.1TheModel.................................34

3.2.2Estimationofgeneticeects.......................35

3.2.3Networkconstrainedregularization...................3
6
3.3TheoreticalProperties..............................
37
3.4Simulation.....................................39

3.5ApplicationtoRealData.............................42

3.6Discussion.....................................46
vii
Chapter4GraphConstrainedRegularizationforNonparamet
ricInstru-
mentalVariablesRegressioninGeneticalGenomicAnalysis
.51
4.1Introduction....................................51

4.2NonparametricModelsandGraphConstrainedRegularization...
.....55
4.2.1TheModelSetup.............................55

4.2.2FirstStepEstimation...........................55

4.2.3SecondStepEstimation.........................56
4.3TheoreticalResults................................6
0
4.4Simulation.....................................62

4.5ApplicationtoRealData.............................67

4.6Discussion.....................................74
Chapter5Conclusionandfuturework
......................
75
APPENDICES
......................................
77
AppendixAProofofTheorem1............................
78
AppendixBProofofTheorem2............................
84
AppendixCProofofTheorem3............................
89
BIBLIOGRAPHY
....................................
98
viii
LISTOFTABLES
Table2.1Simulationsetupinthehighdimensionalcasewithxed

......21
Table2.2Simulationsetupinthehighdimensionalcasewithvarying
p
,q
and

.22
Table2.3Estimationcomparisonwithandwithoutadjustingforthema
rker
eectsunderderentscenarios.....................24
Table2.4EstimationcomparisonbetweenPCandcaPCwhen
q>p
......24
Table2.5Estimationcomparisonbetweendirectedandundirectedgr
aphs...25
Table3.1Simulationresultsforp=100,q=100,n=600............
.41
Table3.2Simulationresultsfor
p
=100,
q
=100,
n
=200

1400andnumSNP=444
Table3.3Simulationresultsfor
p
=600,
q
=600and
n
=300.........45
Table3.4Listoftop10selectedgenesforeachoftheIVGCandIVme
thod.46
Table3.5thegenesinpathwayofMetabolismofxenobioticsbycytoch
rome
P450-Homosapiens..........................49
Table4.1Simulationresultsfor
p
=100,
q
=100and

=0...........64
Table4.2Selectionratefor
p
=100,
q
=100,
n
=800and

=0........65
Table4.3Simulationresultsfor
p
=100,
q
=100and

=0
:25.........65
Table4.4Selectionratefor
p
=100,
q
=100,
n
=800and

=0
:25......68
Table4.5Listoftop10genesbasedonthestabilityselectionrate...
....72
ix
LISTOFFIGURES
Figure2.1Performanceofthetwo-stageestimationalgorithmunde
rderent
parameterandsamplesizesettings.Thecirclesrepresentthespa
rse
graphestimationresultswith
E
[N
]=2andthetrianglesrepresent
thedensegraphestimationresultswith
E
[N
]=5.Theupperand
lowerboundsofthe95%condenceintervalsarealsoshownwhich

aredculttoseparateunderlarge
n
..................20
Figure2.2BoxplotofTPRandFPRunderderentsettingswhen
q
=100...21
Figure2.3BoxplotofTPRandFPRunderderentsettingswhen
q
=500...22
Figure2.4BoxplotofTPRandFPRunderderentsimulationsettingsw
ith
varying
q
and

..............................23
Figure2.5TheestimatedCPDAGinthecase(left)andcontrol(right
)group
basedonthetop100genes.Thoseoverlappednodesindicatedual

regulationdirections...........................26
Figure2.6TheestimatedCPDAGinthecase(left)andcontrol(right
)group
basedongenesmappedtotheAlzheimer'sdiseasepathwayfromthe

KEGGdatabase.Thoseoverlappednodesindicatedualregulation

directions.................................27
Figure3.1Illustrationoftheinstrumentalvariablesmodelingenetic
algenomics
analysis..................................34
Figure3.2Simulationresultsforxed
p
=100and
q
=100butvarying
n
(200
to1400).................................43
Figure4.1Simulationresultsfor
p
=100,
q
=100and

=0...........66
Figure4.2Simulationresultsfor
p
=100,
q
=100and

=0onestimation
performance...............................67
Figure4.3Simulationresultsfor
p
=100,
q
=100and

=0onestimation
performance...............................68
Figure4.4Simulationresultsfor
p
=100,
q
=100and

=0
:25.........69
Figure4.5Simulationresultsfor
p
=100,
q
=100and

=0
:25onestimation.70
x
Figure4.6Simulationresultsfor
p
=100,
q
=100and

=0
:25onestimation.71
Figure4.7Graphicalstructureofthetop10genesselectedbyIVG
C.......73
Figure4.8TheestimatedcoecientfunctionsforgeneCYP2E1andU
GT2A2.73
xi
Chapter1

Introduction

1.1Overview

Inthisdissertation,westudytheestimationandapplicationofgrap
hicalstructuresinge-
neticalgenomicsanalysis.Generegulatorynetworkscontainabun
dantinformationabout
thefunctionsofgeneexpressionsandthemechanismsofgenereg
ulations.Inhumanbody,
hundredsofthousandsgenesareexpressedandinteractwithea
chothertoachieveallnec-
essaryfunctionsincells.Theyareexpressedatpropertimeandins
ucientamountto
determinewhenandhowproteinsaresynthesized.Iftheenvironm
entalfactorsarechanged,
likeatoxicsubstanceappears,dysfunctionsofgeneexpressionm
ayoccur.Theexpressions
ofasetofgenesmaybesuppressedorenlarged.Asaresult,theg
eneswhoseexpression
areaectedbythatsetofgenesmaybealsodysfunctional.Finally,
adiseasemayappears
andhencethehealthofhumanbodyisthreatened.So,itisalwaysan
interesttolearnthe
truearchitectureofgeneregulatorynetworksandtoapplythek
nowledgetofacilitateother
researchesandimprovepublichealth.Inthischapter,werstpro
videsomebackgroundon
geneticalgenomicsanalysisandthetheoryofgraphanditsvarious
applicationsin1.2and
1.3.Tosolvetheproblembroughtbygene-environmentinteraction
,varyingcoecientmodel
anditsadvantageandexibilityareintroducedin1.4.Theobjectives
andorganizationof
thedissertationarein1.5.
1
1.2Integrativegeneticalgenomicsanalysis

DNAmicroarraytechnologyisusedtomeasuretherelativeabundan
ceoflargeamountof
genessimultaneously.Inthepast,scientistscouldonlyanalyzeafe
wgenesatoncebecauseof
thelimitationofmeasurementtechnology.DNAmicroarrayoerspe
oplegreatopportunities
tounderstandthefunctionsofgeneexpressions.Recentdevelo
pmentsinnextgeneration
sequencingoerfaster,moreaccurate,andcheapertechnolog
ytomeasuregeneexpression
andgeneticvariantswithdigitalresolution.Alloftheprogressofm
easurementtechnology
providespromisingopportunityintheeldofgeneticalgenomicsana
lysis.
Regressionmodelsarecommonstatisticalmodelsusedtoexploreth
erelationshipofgene
expressionandaphenotypictrait.Aphenotypictraitcanbeadisea
seorabodyfeature.
However,thedimensionalityofgeneexpressionfeatures(
ˇ
40
K
)istypicallylarge.Classical
statisticalmethodsdonotworkinthissituation.Sincethepioneerin
gworkofLASSOby
Tibshirani(1996),variableselectioninpenalizedregressionshasbe
encommonlyadopted
tosolvethelarge
p
small
n
problem.Manyregularizationmethodshavebeendeveloped
sincethen,includingSCAD(FanandLi,2001),adaptiveLASSO(Zoua
ndHastie,2005),
groupLASSO(YuanandLin,2006),andMCP(Zhang,2010).Withth
evariableselection
technique,potentialcandidategenescanbescreenedforfurth
erbiologicalvalidation.
Whileusinggeneexpressiontopredictthephenotypeofinterest,w
ecanalsostudy
therelationbetweengenotypeandgeneexpression.Ithasbeena
longhistorythatpeople
trytondrelationshipbetweenquantitativetraitsandgeneticvar
iants.Quantitativetrait
loci(QTL)mappinghasbeenusedtondgeneticvariantswhichgove
rnthephenotypic
variationofacomplextrait,suchasbloodpressureorgrainyield.In
QTLmappingstudies,
geneticmarkerssuchassinglenucleotidepolymorphisms(SNPs)can
beusedtopinpointthe
2
locationofgeneticvariantsthataectthetraitofinterest.Since
thehugesuccessofQTL
mapping,expressionquantitativetraitloci(eQTL)mappinghasbee
nextensivelystudied.
IneQTLmapping,expressiontraitsaretreatedasquantitativetr
aits.Thus,onecanstudy
thegeneticarchitectureofgeneexpressionandfurtherdissect
thegeneregulationmechanism
suchascis-andtrans-regulation.IftheeQTLisintheregionoforc
losetothegenewhose
expressionistheinterestedquantitativetrait,sucheQTLiscalledc
is-eQTL.IftheeQTL
isfarfromthegenewhoseexpressionistheinterest,itiscalledtran
s-eQTL.Studieshave
shownthatcis-regulationisusuallystrongerthantrans-regulatio
nandmuchworkhasbeen
donetostudycis-andtrans-eQTL(e.g.,Giladetal.,2008).
eQTLmappingisalsocoinedasgeneticalgenomicsanalysissinceitcomb
inesgenetic
variantswithgeneexpressiondatatogether.Withthedevelopmen
tofhumanhapmapand
the1000genomeprojects,millionsofSNPvariantshavebeenident
ed.Thusithasbe-
comefeasibletopinpointspeccvariantslocatedinageneregionorin
ter-genicregionthat
regulategeneexpressions.Geneexpressiondatahavebeenbroa
dlyappliedtostudygenenet-
works.Forexample,basedonthecorrelationofgeneexpressions
,onecaninferthenetwork
structureusingtechniquessuchasBayesiannetwork(Friedmane
tal.,2000)orGaussian
graphicalmodel(Dobraetal.,2004).However,withoutadditional
biologicalinformation,
suchestimatednetworksareundirected.Giventhatgeneregulat
ionsareoftendirected,a
directednetworkinferenceshouldmakemorebiologicalsenseandb
emoreinformative.In
addition,twoindependentgenescouldshowfalsecorrelationifthey
shareacommongenetic
regulator.Thecorrelationshouldbegoneifthecommongeneticee
ctisadjusted(Caietal.
2013).Thismotivatesustodevelopadirectedgraphestimationpro
cedurewhileadjusting
forthegeneticeects.
Afterformulatingmodelstodescribetherelationbetweenphenoty
peandgeneexpres-
3
sionandtherelationbetweengeneexpressionandgenotype,wear
einterestedinhowto
incorporategeneregulatorynetworksintotheanalysis.Forthisp
urpose,weintroducesome
backgroundknowledgeontheestimationandapplicationofgraphica
lstructuresinthenext
section.

1.3Estimatioinandapplicationofgraphicalstructures

Generegulatorynetworkisagraphicalstructurewhichcanbelear
nedeitherbyexperimental
methodsorbystatisticalmethods.Itisusedtodescriberegulato
ryrelationsbetweengenes.
Ingeneral,agraphoranetworkconsistsofnodesandedges.Ifw
euse
G
todenoteagraph,
V
todenotethesetofthenodes,and
E
todenotethesetoftheedges,wecanwriteagraph
as
G
=(
V;E
).Here,anodeisageneexpressionandanedgeisacorrelationbetw
eentwo
genes.Towkindsofgraphsarestudiedorappliedinthisdissertation
,directedgraphand
undirectedgraph.
Derentkindsofgraphicalstructuresaredenedintheliteratur
e.Let
X
=(
X
1
;:::;X
p
)
followamultivariatenormaldistribution
N
(
;
)
;=(
˙
ij
)
;=

1
=(
!
ij
).Atleast
threegraphicalstructurescanbedened.
1.Structuredcovariancematrix.Forexample,itisassumedclosed
ataintimeare
stronglycorrelatedinlongitudinalananlysis.
2.Gaussiangraphicalmodel.
X
i
??
X
j
jf
X
t
g
t
6=
i;j
()
!
i;j
=0.
3.Causalrelation(directedgraph)canbedenedifotherassump
tionsareassumed.
Therstoneisusedtoinferthecorrelationrelationsbetweengene
s.Thesecondisused
toinfertheconditionalcorrelationrelationsbetweengene.Thelas
toneisusedtodocausal
inference.alotofliteraturescanbefoundaboutthesetopics(Bic
kelandLevina,2008a,
4
2008b;Rothman,LevinaandZhu,2009;LamandFan,2009;Meinsh
arsenandBuhlmann,
2006;Friedman,HastieandTibshirani,2007;Cai,Li,LiuandXie,2013
).
Aftergeneregulatorynetworksarelearned,thenextchallengeis
howtoapplythemas
priorknowledgetohelpfurtheranalysis,forinstance,studyingth
erelationsbetweengene
expressionsandtraitsofinterest.Thefollowinglistsfourregulariz
ationmethodswhichtake
advantageofthegenegraphicalstructuresinvariableselectiona
ndestimation:
1.FusedLASSO(Tibshirani,etal.,2005.):
p
(

1
;
2
)=

1
k

k
1
+

2
P
j
j

j


j

1
j
2.GRACE(LiandLi,2008):
p
(

1
;
2
)=

1
k

k
1
+

2
P
i
˘
j
(

i
p
d
i


j
q
d
j
)
2
3.GFlasso(KimandXing,2009):
p
(

1
;
2
)=

1
k

k
1
+

2
P
i
˘
j
j

i

sign(^
r
ij
)

j
j
4.GOSCAR(Yangetal.,2012):
p
(

1
;
2
)=

1
k

k
1
+

2
P
i
˘
j
max(

i
;
j
)
where
p
(

1
;
2
)isapenaltytermand

1
;
2
aretuningparameters.Intheseworks,derent
penaltiesareimposedtodealwithderentgraphicalstructures.
ThefusedLASSOrequiressmoothnessontheeectsofadjacent
variables.Thiscanbe
achievedbyorderingthevariablesproperly.GRACEusesthespec
cgraphicalstructure
andencouragegreatereectsof\hub"geneswhicharethosewit
hmanyconnectionsto
othergenes.GFlassoissimilartotheideaoffusedLASSOexceptitus
esnetworkstructures
ratherthanadjacentsmoothness.GOSCARisusedtoencourage
theeectsoftwoconnected
genestobeexactlythesame.Thederencebetweenabsolutepen
altyandsquaredpenalty
isabsolutepenaltyencouragestwocoecientstobeexactlythesa
mewhilesquaredpenalty
doesnot.Thesesuccessfulapplicationsofgraphsinspireustoap
plynetworkinformationin
ourwork.InChapter3andChapter4,aparametricgraphicalstr
uctureandanonparametric
graphicalstructureareusedforinstrumentalvariablesregres
sionmodel.
5
1.4Varyingcoecientmodelcapturinggene-environment
interaction
Geneexpressionsarethedirectproductsofgenefunctions.How
ever,mostgenefunctions
areinuencedbyenvironmentalfactors,resultingintheso-called
gene-environment(G

)
interactionswhicharebroadlystudiedinliterature.Modelingtheee
ctofanenvironmental
factorasafunctionbringsexibilitytocaptureeitherlinearornon-
linearmodulationeect.
AtypicalG

Einteractionmodelassumesalinearinteractionstructurebetwee
nagenetic
variable(
G
)andenvironmentalvariable(
X
),i.e.,
Y
=

0
+

1
X
+

1
G
+

2
XG
+

=

0
+

1
X
+(

1
+

2
X
)
G
+
:
AsdemonstratedinMaetal.(2011),suchasimpleregressionmodel
couldeasilymissany
interactionsignalsthatdonotshowalinearstructure.Theauthor
sproposedthefollowing
varying-coecient(VC)modelwhichcancapturethenonlinearmod
ulationeect,
Y
=

(
X
)+

(
X
)
G
+
:
where

(
X
)and

(
X
)representtheinterceptandinteractionfunctionwhichareassu
med
tobesmooth.NonparametricestimationtechniquessuchasB-splin
ecanbeappliedto
estimatethefunctions.Suchanonparametrictreatmenthasmuc
hexibilitytocapturethe
realbiologicalfunctionofenvironmentalforces.AsfortheVCmo
delling,vastliterature
canbefoundonestimationorvariableselection(e.g.,HastieandTibsh
irani,1993;Wang
6
andXia,2009;Fengetal.,2011;Tangetal.,2012).Littleworkhasbe
endoneongenetical
genomicsanalysiswhileincorporatingtheeectofenvironmentalfa
ctors.
1.5Objectivesandorganization

Inthisdissertation,weproposeagraphestimationmethodtolearn
directedgraphicalstruc-
turesandtwographapplicationmethodsusingnetworkstructure
saspriorknowledgeto
improvetheperformanceofvariableselection.Bothparametrican
dnonparametricmodels
andmethodsareconsidered.
PCalgorithmhasbeenoneofthemostpopularmethodsinestimatingd
irectedgraphs.
However,covariateeectsarenotconsidered.Intheestimation
ofundirectedgraphs,two
connectedgenesmaysharecommonregulators(e.g.,SNPs).Thet
wogenesmaybeinde-
pendentiftheeectofcommonregulator(s)isadjusted.Thesam
ephenomenonappliesto
theestimationofdirectedgraphs.MotivatedbytheworkofCaiet
al.(2013),weproposea
covariate-adjustedPCalgorithminchapter2whichisusedtolearnt
heregulatorynetworks
ofgeneswhileadjustingfortheeectsofgeneticvariants.Weest
ablishtheestimationand
selectionconsistencyoftheproposedtwo-stepestimationalgorit
hm.
Inageneticalgenomicsanalysistoevaluatethecorrelationbetwee
nphenotypictraitsof
interestandgeneexpressions,oneusuallyassumesthattheerro
rtermisindependentofthe
predictors.However,thisisnotalwaystrue.Endogeneityinwhichs
omepredictorscorrelate
withtheerrormayarise.Instrumentalvariables(IV)regression
isacommonwaytosolvethis
problem.Linetal.(2015)proposedanIVregressionframeworkinw
hichgeneticvariants
aretreatedastheinstrumentalvariablestoeasethecorrelation
problem.Theauthors
showedthesuperiorperformanceofthevariableselectionmethod
viaincorporatingtheIV
7
regression.Asweillustratedpreviously,incorporatinggenenetwo
rksinvariableselection
couldpotentiallyimprovetheperformanceofgeneselection.Thus,
weproposeanetwork
constrainedregularizationmethodinaIVregressionframeworkan
dapplythemethodto
anintegrativeanalysisofgeneticalgenomicsdatainchapter3.Wee
stablishtheestimation
andselectionconsistencyoftheproposedmethod.
Asweillustratedpreviously,environmentalfactorscouldpotentia
llymodulategenetic
eectsongeneexpressionandfurtheraectthephenotypicvar
iation.Forexample,the
amountofexposuretoheavymetalcouldmodifygeneticeectong
eneexpressionand
furtherleadtoincreasedriskofneurovegetativediseases.Vary
ing-coecientmodelisa
naturalchoicetodescribeanypotentialnonlinearmodulationeec
t.Builtuponthework
inchapter3,weextendtheIVregressionmodeltoincorporateth
enonlinearmoderation
eectofanenvironmentalfactorinchapter4.Thevaryingcoec
ientfunctionsaremodelled
throughnonparametrictechniquessuchasB-splineandnetworkc
onstrainedregularization
isalsoconsideredtoimprovetheperformanceofgeneselection.We
alsoestablishthe
estimationandselectionconsistencyoftheproposedmethod.
Weconductextensivesimulationstudiestoevaluatetheperforman
ceofthethreemethods
andcomparewithexistingones.Theutilityofthemethodsisdemonst
ratedthroughreal
dataanalysis.Conclusionandfutureworkarefollowedinchapter5.
Technicalproofsare
giveninAppendices.Thedissertationcontributestothedevelopme
ntofgraphestimation
andvariableselectioninstatisticaltheoryandmethod,andtothea
nalysisofgenetical
genomicsdatainapplication.
8
Chapter2

Learningdirectedacyclicgraphical

structureswithgeneticalgenomics

data

2.1Introduction

Thepastdecadeshavewitnessedenormousmethodologydevelopm
entsingenenetworkin-
ferenceusinggeneexpressiondata(e.g.,CheungandSpielman,200
2;Schadtetal.,2003).
Suchnetworksweredevelopedmostlybyassessingthepairwisecor
relationofgeneexpres-
sions.Fromastatisticalpointofview,undertheassumptionthatt
hejointdistribution
ofthegeneexpressionsofinterestisamultivariatenormaldistribu
tion,suchnetworkscan
beconstructedbyassessingthenonzeroelementsoftheinverse
covariancematrix,theso
calledprecisionmatrixorconcentrationmatrix.Assume
X
=(
X
1
;:::;X
p
)
˘
N
(0
;)and
=

1
.X
i
??
X
j
jf
X
t
g
t
6=
i;j
()
!
i;j
=0,where
!
i;j
'sarethe(
i;j
)thentriesin.
Thatis,
X
i
and
X
j
areconditionallyindependentifandonlyif
!
i;j
=0.Thusonecan
inferthepositionsofzeroentriesintoinferthegraphicalstruct
ure,i.e.,theconditional
independenciesbetweencomponentsin
X
.However,whenthedatadimensionislarger
thanthesamplesize,theinverseofthesamplecovariancematrices
arenotavailable.When
9
consideringsuchnetworksasundirectedgraphs,therehasbeen
hugestatisticalinterestsin
developingGaussiangraphicalmodelsassumingsparsityoftheprec
isionmatrix,tonamea
few,Friedmanetal.(2008),Caietal.(2013),Meinshausenetal.(
2006),andYuanand
Lin(2007).Undercertainassumptions,thepositionsofzeroentr
iesintheprecisionmatrix
indicateconditionalindependencebetweenvariablesofstudy.
Wheninferringnetworkonlybasedongeneexpressions,twogenes
couldshowcorre-
lationinexpressionsimplybecausetheyshareacommonregulator,w
hiletheyshouldbe
independentconditioningonthecommonregulator.Motivatedbyth
is,afewmethodology
developmentshavebeenfocusedoncovariates-adjustedgraph
icalmodelsbyusinggenetic
markersascovariatestocorrectbothfalsepositivesandfalsene
gatives(e.g.,LeeandLiu,
2012;YinandLi,2011,2013;Caietal.,2013).Intheirestimationpr
ocedures,theeectof
geneticvariantsisestimatedintherststep.Then,thegraphical
structureisestimatedin
thesecondstepwhileadjustingforthegeneticeects.Theirsimula
tionresultsandrealdata
analysisshowedthatgraphestimatesweresubstantiallyimprovedw
hentakingthegenetic
eectsintoaccount.Thesemodelsareallfocusedontheestimatio
nofundirectedgraphs
withoutconsideringdirectioninformation.Thedirectedgraphs,ho
wever,aremoreattrac-
tiveinthesensethatoftenpeopleareinterestedinthecausalrela
tionships,thatis,interests
arenotonlyfocusedonwhethertwovariablesareconditionallyindep
endentbutalsoifone
causestheotherone.Inreallife,geneexpressionsareoftenth
eresultsofdirectedregula-
tions.Thus,theestimationofdirectedgraphsshouldprovidemore
biologicalinformation
forfurtherfunctionalinvestigation.Geneticalgenomicsdatapr
ovidesoptimalresourcesto
learnsuchdirectedgraphicalnetworks.
Intypicaldirectedacyclicgraph(DAG)inference,oneprobabilityd
istributionhasa
setofcorrespondingDAGswhichareMarkovequivalentundercert
ainassumptions.The
10
directionsoftheedgesinthegraphsindicatecausalrelations.Stu
diesondirectedgraphs
havebeenourishedinliteraturebothintheoryandinapplication(e.g
.,Andersonetal.,
1997;RichardsonandSpirtes,2002;Alietal.,2009).Themostcom
moncaseistheGaussian
case.Let
Y
=(
Y
1
;:::;Y
p
)bea
p
-dimensionalGaussianrandomvector,andeachcoordinate
representsageneexpression.Toinfertheregulatoryrelationsh
ipofthese
p
geneexpressions,
thePC-algorithmdevelopedbySpirtesetal.(2000)canbeappliedto
constructacompleted
partiallyDAG(CPDAG).TheCPDAGcanuniquelyrepresentallDAGsco
rrespondingto
thecommondistributionof
Y
,andcanberepresentedastheregulatorynetworksofthe
geneexpressions.KalischandBuhlmann(2007)laterprovedthat
thegraphobtainedfrom
thePC-algorithmisconsistentundersomeconditions,eveninhighdim
ensionalcases.
DirectedgraphsconstructedwiththePC-algorithmaremorebiolog
icallyrelevantcom-
paredtoundirectedgraphs.However,asmentionedbefore,iftw
ogenesshareacommon
regulator,theirexpressionrelationshipcouldbecomplicatedoreve
ndistortedwithoutcon-
sideringtheunderlyinggeneticstructure.Whenadjustingforthe
genotype(i.e.,thecommon
regulator)eects,thetworelatedgenescouldbeindependentor
viceversa.Thismotivates
ustodevelopacovariate-adjusteddirectedgraphicalmodelbye
xtendingtheCPDAGes-
timationproceduretoaregressionframework.Wetreatgeneticm
arkersascovariatesand
developatwo-stageestimationmethodwhichcombinesthepenalized
estimationandPC-
algorithmtoestimatethemarker-adjustedCPDAGstructure.
Inthefollowingofthepaper,werstprovidesomebackgroundkno
wledgeaboutgraphical
modelsandtheformulationofourmodelandthetwo-stageestimat
ionprocedureinSection
2.2.TheoreticalresultsaregiveninSection2.3.Simulationsandreal
dataanalysisaregiven
inSection2.4and2.5tojustifyourmethodfollowedbydiscussionsinSe
ction2.6.Allthe
proofsofthetheoreticalresultsarerenderedintheAppendixA.
11
2.2ModelandEstimation

2.2.1Background

Let
G
=(
V;E
)beagraphwhere
V
representsthesetofverticesand
E
representstheset
ofedges.Adirectedgraphisagraphinwhicheachedgehasadirectio
nrepresentedby
a\
!
"sign.Adirectedacyclicgraph(DAG)isadirectedgraphwithnodirec
tedcycles.
Acriteriacalled
d
-separationisusedtoreadconditionalindependencerelationships
froma
DAGstructure(Pearl,2000).Othercriteriacanbeusedtogener
ateconditionalindepen-
dencerelationships.Thereasonwhy
d
-separationisparticularlyusefulisthatitisrelatedto
causalinference(Pearl,2000;Spirtesetal.,2000)andforthepu
rposetoconstructdirected
graphs.
Givenarandomvector,itscoordinatescanbeviewedasthevertice
sofagraph
G
=
(
V;E
).AprobabilitydistributioniscalledfaithfultoaDAGifandonlyiftheco
nditional
independencerelationshipsindicatedbythedistributionareasthes
ameasthoseobtained
via
d
-separation.MultipleDAGsmayrepresentthesamesetofcondition
alindependence
relationships.TheDAGstowhichaprobabilitydistributionisfaithfula
reMarkovequivalent
andcomposeaMarkovequivalenceclass.TwoDAGsareMarkovequiv
alentifandonlyif
theyhavethesameskeletonandthesamev-structures(Fryden
berg,1990;VermaandPearl,
1990,1992).TheskeletonofaDAGisthegraphwherealldirectede
dgesarereplacedwith
undirectededges.Av-structureisatriple(v
1
,v
2
,v
3
)havingthestructurev
1
!
v
2
 
v
3
.AMarkovequivalenceclassofDAGscanbedescribedbyauniqueCPDA
G(Chickering,
2002).ACPDAGistheunionoftheDAGsinaMarkovequivalenceclassin
thesensethata
directededgeexistsifthatedgeexistsineachDAGandanundirecte
dedgeexistsifthatedge
hasderentdirectionsintwoDAGs.PC-algorithm(Spirtesetal.,200
0)isaneectiveand
12
ecientalgorithmtoestimateCPDAGs.Thereasonwhyweestimatet
heCPDAGisthat
multipleDAGscorrespondtoaprobabilitydistributionwhileonlyoneCPD
AGcorresponds
toaprobabilitydistribution.AfteraCPDAGisobtained,wecanexten
dittotheDAGs
byaddingdirectionsontheedgessuchthatnodirectedcyclesareg
enerated.Ithasbeen
proventhatPC-algorithmcangenerateaconsistentestimateund
ercertainassumptions,
eveninhighdimensionalcases(KalischandBuhlmann,2007).

2.2.2Covariate-adjustedgaussiangraphicalmodel

Consideramulti-responselinearregressionmodelwith
p
responsesand
q
covariates.Suppose
wehave
n
independentobservations(
y
i
;x
i
),
i
=1
;:::;n
,where
y
i
=(
y
i
1
;:::;y
ip
)
T
and
x
i
=(
x
i
1
;:::;x
iq
)
T
.Weassumethatthedataarecenteredandstandardized.Dene
Y
=
(
y
1
;:::;y
n
)
T
and
X
=(
x
1
;:::;x
n
)
T
.Therelationshipbetweenthemultivariateresponse
Y
andthepredictors
X
canbedescribedbythefollowingregressionmodel,
Y
=
XB
+
E
where
B
=(
B
1
;:::;B
p
)isa
q

p
coecientmatrixand
E
=(

1
;:::;
n
)
T
istheerrorterm.We
assumethattheerrors

i
=(

i
1
;:::;
ip
)
T
,i
=1
;:::;n
,are
i.i.d.
randomvariablesfollowing
amultivariatenormaldistribution
N
(0
;).Let
y
j
=(
y
1
j
;
;y
nj
)
T
bethe
j
-thresponse
vector(
j
=1
;::;p
)and
x
l
=(
x
1
l
;
;x
nl
)
T
bethe
l
-thpredictorvector(
l
=1
;::;q
).In
ageneticalgenomicsframework,
X
representtheSNPmarkersand
Y
representthegene
expressions.OurinterestistoestimatetheCPDAGstructureof
Y
,whileadjustingfor
theeectof
X
.Conditionalon
X
,weexpectthatfalsepositivesand/orfalsenegativesof
connectionscouldbereduced.
13
2.2.3Atwo-stageestimationprocedure

Without
X
,thePC-algorithmcanbeappliedtoinfertheCPDAGstructures.In
themodel
describedabove,theconditionalmeanof
Y
changesas
X
varies.ThusthePC-algorithm
cannotbedirectlyappliedsinceitisonlyforamodelwithaconstantme
an,i.e.,
X
=
(
X
1
;:::;X
p
)followsamultivariatenormaldistributionwithmean

andcovariancematrix
.Toovercomethedculty,weproposeatwo-stageestimationpr
oceduretoestimatethe
meanfunctionandtheCPDAGstructure.Intherststage,wees
timatethecoecient
matrix
B
.Weimplementapenalizedestimationprocedurewithrespecttoeach
responsein
Y
viasolvingthefollowing
p
optimizationproblems,
^
B
j
=argmin
B
j

1
2
n
tr
f
(
y
j

XB
j
)
T
(
y
j

XB
j
)
g
+

j;n
k
B
j
k
1

;where

j;n
;j
=1
;:::;p;
aretuningparameters,and
kk
1
referstothe
L
1
norm.Theestimate
of
B
isdenotedas
^
B
=(
^
B
1
;:::;
^
B
p
).Let
^
E
=
Y

X
^
B
.Inthesecondstage,thePC-algorithm
isappliedto
^
E
toestimatethedirectedgraphicalstructurecorrespondingtot
heprobability
model
N
(0
;).Thetwo-stageestimationalgorithmistermedasthecovariate-
adjusted
PC-algorithm(caPC).
Ingeneral,thecaPCalgorithmcontainstwosteps.Therststepis
toestimatetheskele-
tonandthesecondstepistoextendtheskeletontotheCPDAG.In
formationobtainedfrom
therststepisusedtodeterminethedirectionsoftheedgesinthe
secondstep.Ifwehave
theperfectknowledgeabouttheconditionalindependencerelatio
nships,thePC-algorithm
canreturnuswiththecorrectskeletonintherststep(Spirtese
tal.,2000).However,we
donothaveperfectknowledgeinpractice.Therefore,weusesam
plestoestimatetheskele-
tonbyimplementingatestingprocedure.Fortestingtheconditiona
lindependence,Kalisch
14
andBuhlmann(2007)appliedestimatedpartialcorrelationsandFis
hertransformationto
constructatestingstatistic.TheyprovedthatthePC-algorithm
couldgenerateconsistent
estimatesundercertainassumptionseveninthehighdimensionalsit
uations.Tomakethe
paperself-contained,wealsoincludedthePCalgorithminthesupplem
entalle.
2.3Theoreticalresult

Inthisworkweallowthenumberofresponsestoincreasealongwitht
hesamplesize,i.e.,
p
isafunctionof
n
.Thenumberofmarkercovariates
q
isxed.Thisisavalidtreatment
aswecandoaneQTLmappingstudytorstndtheeQTLsfortheex
pressionresponses,
thenxtheminthemodel.Belowweshowthatthetwo-stageestimat
ionmethodgenerates
consistentestimateofaCPDAG.Toshowtheestimationconsistenc
y,weadoptthesetupof
KnightandFu(2000)andKalischandBuhlmann(2007).Thedimensio
nofamultivariate
responsevariableisdenotedas
p
(
n
)andaDAGisdenotedas
G
=
G
n
.Conditionsand
remarkscanbefoundintheAppendixA.
Let
ˆ
n
;i;j
bethecorrelationbetween

i
and

j
.Let
ˆ
n
;i;j
jk
bethepartialcorrelation
between

i
and

j
given
f

r
;r
2
k
g
,where
k
isasubsetof
f
1
;:::;p
(
n
)
gnf
i;j
g
.^
ˆ
n
;i;j
and
^
ˆ
n
;i;j
jk
arethecorrespondingestimatedones.
Toestablishtheestimationconsistency,weneedthefollowinglemmas
.Lemma1.
Assumethatthedistribution
P
n
isnormallydistributed,wehave
^
ˆ
n
;i;j
(
^
B
)

^
ˆ
n
;i;j
(
B
)=
o
p
(1)
;8
i;j
=1
;:::;p:
15
Lemma2.
Assumethatthedistribution
P
n
isnormallydistributed,wehave
^
ˆ
n
;i;j
jk
(
^
B
)

^
ˆ
n
;i;j
jk
(
B
)=
o
p
(1)
;8
i;j
=1
;:::;p:
Lemma3.
Assumethatthedistribution
P
n
isnormallydistributedand
sup
n;i;j;
k
j
ˆ
i;j
jk
j
M<
1
.Then,forany
>
0
,wehave
sup
i;j;
k
2
K
m
n
i;j
P
(
j
^
ˆ
n
;i;j
jk
(
^
B
)

ˆ
n
;i;j
jk
j
>
)

C
1
(
n

2

m
n
)

exp((
n

4

m
n
)log(
4


2
4+

2
))
forsomepositiveconstant
C
1
.
Lemma1showsthatthederencebetweentheestimatedcorrelat
ioncoecient^
ˆ
n
;i;j
giventheestimatedcoecientmatrix
^
B
andtheestimatedcorrelationcoecientgiventhe
truecoecientmatrix
B
is
o
p
(1).Lemma2statesthatthederencebetweentheestimated
partialcorrelationcoecient^
ˆ
n
;i;j
jk
giventheestimatedcoecientmatrix
^
B
andtheesti-
matedpartialcorrelationcoecientgiventhetruecoecientmatr
ix
B
is
o
p
(1).Lemma3
showsthatthederencebetweentheestimatedpartialcorrelat
ioncoecient^
ˆ
n
;i;j
jk
given
theestimatedcoecientmatrix
^
B
andthetruepartialcorrelationcoecient
ˆ
n
;i;j
jk
is
bounded.TheproofsofthethreelemmasarerelegatedintheSupp
lementalle.
Dene

n
asthesigncancelevelfortestingthesigncanceofthepartialc
orrelation
afterFisher'stransformation,i.e.,fortesting
H
0
:ˆ
i;j
jk
=0.Withtheaboveassumptions,
wehavethefollowingtheorem.

Theorem1.
Dene
^
G
(

n
)
astheestimatedCPDAGusingthetwo-stageestimationproce
-
16
dureand
G
isthetrueCPDAG.If

j;n
=
o
(
n
)
;j
=1
;:::;p
,andConditions(C1)-(C6)are
satised,thenthereexists

n
suchthat
P
(
^
G
(

n
)=
G
)=1

O
(exp(

cn
1

2
d
))
!
1
;as
n
!1
where
c
isapositiveconstantand
d>
0
isasinCondition(C6).
TheconvergentrateoftheCPDAGestimationisthesameasthatinT
heorem2in
KalischandBuhlmann(2007).Thismeanstheerrorsoccurredinth
erststagedonothave
agreateectonthesecondstageoftheestimationofgraphicals
tructures.Theproofofthe
TheoremcanbefoundintheAppendixA.

2.4Simulation

2.4.1Simulationdesign

Wedidextensivesimulationstoassesstheestimationperformance.
Wefollowedthesimu-
lationsetupinYinandLi(2013)andKalischandBuhlmann(2007).To
generatetheerror
term,webeganwithanadjacencymatrix
A
p

p
lledwithzeros.Then,everyentryinthe
lowertrianglewasreplacedwithindependentrealizationsofaBernou
llirandomvariable
withsuccessprobability
p
A
(0
<p
A
<
1),where
p
A
iscalledthesparsenessofthemodel.
Entrieswith1werethenreplacedbyrealizationsofaUniform(0,1)ra
ndomvariable.The
correspondingDAGwasconstructedbydrawingadirectededgefo
rmnode
s
tonode
t
if
A
st
6=0and
s<t
.ThecreatedDAGhasthefollowingproperty:
E
(
N
j
)=
p
A
(
p

1),
where
N
j
isthenumberofneighborsforagivennode
j
.Thesizeof
E
(
N
j
)determinesthe
sparsenessofthegraphwithsmallvaluescorrespondingtoaspar
segraphandlargevalues
17
correspondingtoadensegraph.
Thematrix
A
willbeusedtogeneratethedataasfollows.

j
˘
N
(0
;1)
;j
=1
;
;p;

1
=

1
;
j
=
j

1
X
k
=1
A
jk

k
+

j
;where

0
j
s
areindependent.WeappliedtheR-package
pcalg
togenerateDAGsaccordingto
theabovedescription.Suchprocesseswererepeatedtogenera
tetheerrormatrix
E
n

p
.Togeneratethe
q

p
coecientmatrix
B
,wegenerateda
q

p
sparseindicatormatrix
whoseentryis1withaprobabilityproportionalto
=q
,where

iscalledthesparseness
ofthecoecientmatrix.Ifanentryis1,itisreplacedbyarealization
fromUnif([-1,-
u
][
[u
,1]),where
u
isarealizationfromUnif([0
:1
;0
:9]).Nextwegenerated
X
whoseentries
wererealizationsofBernoulli(1,0.5).Finally,wegenerated
Y
by
Y
=
XB
+
E
.Theperformancewasevaluatedbasedontruepositiverate(TPR)
,falsepositiverate
(FPR)andthestructuralHammingdistance(SHD)followingtheeva
luationcriteriapro-
posedbyKalischandBuhlmann(2007).TPRandFPRwereusedfort
heevaluationofthe
estimateoftheskeleton.TPRisdenedastheratioofthenumbero
fcorrectlyestimated
edgestothenumberoftotaledges.FPRisdenedastheratiooft
henumberofincorrectly
estimatededgestothenumberoftotalnon-edges(thenumbero
flocationswherethereare
noedges).SHDwasusedfortheevaluationoftheestimatedCPDAG
,andisdenedasthe
numberofedgeswherethetrueCPDAGandtheestimatedCPDAGar
ederent.
18
2.4.2Performanceinthelowdimensionalcase

Wechose

=0
:01inthelowdimensionalsituationwhere
p
wasxedand
n
varied.Wecalled
itlowdimensionalcaseinwhichthesamplesize
n
islargerthan
p
and
q
.Forcomparison
purpose,wealsoincludedthe
log
(
n
)=4case.Threeparametersettingsandsevensample
sizeswerechosen,eachwith40replications.Figure2.1depictsthes
imulationresults.Inall
thecases,TPRincreasesandSHDdecreasesasthesamplesizeincr
eases(denotedby
log
(
n
)
intheplot).ThetendencyofFPRisnotclear.However,theoverall
scaleofFPRissmallin
allcases,implyingreasonablecontrolofFPR.Asimilarpatternwasa
lsoobservedinKalisch
andBuhlmann(2007).Theauthorsexplainedthatthisisbecausea
constant

wasused
underallsamplesizes.Inaddition,theperformanceofasparsegr
aphestimation(denotedas
circleswith
E
[N
]=2)appearstobebetterthanthatofadensegraphestimation(
denoted
astriangleswith
E
[N
]=5).The95%condenceintervalbasedonthe40replicatesfore
ach
caseisalsoshown.Itisclearthatthecondenceintervalbecomes
narrowerasthesample
sizeincreasesinallthecases.Thisisinconsistentwithourlargesamp
letheoreticalresult.
2.4.3Performanceinthehighdimensionalcase

Inthissection,wereportedtheperformanceoftheestimationinh
ighdimensionalsituations
where
p
variedas
n
increased,tochecktheconsistencyresultsasstatedinTheorem
1.Inall
thecases,
p
wasassumedtobelargerthan
n
.WecloselyfollowedthesetupinKalischand
Buhlmann(2007)andchose

=0
:05inthehighdimensionalcase.Speccally,datawere
simulatedfollowingTable2.1,andthenumberofcovariateswasassum
edtobe
q
=100and
500.Theaveragenumberofneighborswascalculatedas
E
[N
]=
n
0
:5
=
6.Thesparsenessof
thecoecientmatrixwascontrolledby

.Weran30simulationsineachsetting.Figures
19
45678910
0.00.40.8
log(n)TPR45678910
0.000.020.04
log(n)FPR45678910
0204060
log(n)SHDScenario1:
p
=25,
q
=10,
E
[N
]=2or5,

=3
45678910
0.00.40.8
log(n)TPR45678910
0.0000.0100.020
log(n)FPR45678910
050100150
log(n)SHDScenario2:
p
=50,
q
=40,
E
[N
]=2or5,

=4
45678910
0.00.40.8
log(n)TPR45678910
0.0000.0020.004
log(n)FPR45678910
050150250
log(n)SHDScenario3:
p
=100,
q
=100,
E
[N
]=2or5,

=3.5
Figure2.1:Performanceofthetwo-stageestimationalgorithmund
erderentparameter
andsamplesizesettings.Thecirclesrepresentthesparsegraphe
stimationresultswith
E
[N
]=2andthetrianglesrepresentthedensegraphestimationresult
swith
E
[N
]=5.The
upperandlowerboundsofthe95%condenceintervalsarealsosho
wnwhicharedcult
toseparateunderlarge
n
.2.2and2.3showtheresultsfor
q
=100and
q
=500,respectively.Asshowninbothgures,
itisclearthatTPRisincreasingandFPRisdecreasingalongwiththeincr
easingofthe
samplesize.Theseobservationsareconsistentwithourtheoretic
alresults.Inaddition,the
sizeof
q
alsohasaneectontheestimationperformanceinwhichlarger
q
(=500)gives
20
Table2.1:Simulationsetupinthehighdimensionalcasewithxed

.set1set2set3set4set5set6
p
16239867498013101661
n
50100150200250300
E
[N
]1.181.672.042.362.642.89

555555
smallerTPRandlargerFPRcomparedtotheresultsundersmaller
q
(=100).
set1set2set3set4set5set6
0.40.50.60.70.8
TPRset1set2set3set4set5set6
0.0020.0040.0060.008
FPRFigure2.2:BoxplotofTPRandFPRunderderentsettingswhen
q
=100.
Wedidanothersimulationinwhichboth
p
and
q
increasewith
n
.Thesamplesize
n
was
assumedtobe50,100,150,200,250,and300.Thedimensionof
Y
and
X
wasassumed
tobe
p
=
n
1
:5
=
6
:3and
q
=
p=
2,respectively.Theaveragenumberofneighborsisgivenas
E
[N
]=
n
0
:5
=
6.Thesparsenessofthecoecientmatrixwascontrolledby

=log
n
.The
detailedsetupcanbefoundinTable2.2.Thirtysimulationswererunine
achsetting.The
resultsofthesimulationareshowninFigure2.4.Weobservedasimilarp
atternasshownin
Figures2.2and2.3,thatis,TPRisincreasingandFPRisdecreasingas
n
increases.This
resultprovidessupportiveevidencethatourmethodalsoworksint
hesituationwhereboth
p
and
q
increaseswith
n
.21
set1set2set3set4set5set6
0.20.30.40.50.60.70.8
TPRset1set2set3set4set5set6
0.0020.0040.0060.0080.010
FPRFigure2.3:BoxplotofTPRandFPRunderderentsettingswhen
q
=500.
Table2.2:Simulationsetupinthehighdimensionalcasewithvarying
p
,q
and

.set1set2set3set4set5set6
p
712003675667911039
q
35100184283395520
n
50100150200250300
E
[N
]1.181.672.042.362.642.89

3.94.65.05.35.55.7
2.4.4Comparinggraphswithorwithoutcovariates

TheproposedmethodimprovestheDAGestimationbyadjustingfor
potentialmarkeref-
fects.Toevaluatethegainbyadjustingforthemarkereects,w
ecomparedtheestimation
performanceofbeforeandafteradjustingforthemarkereec
tsusingthePC-algorithm
(denotedasPCinthetable)andcovariate-adjustedPC-algorithm
(denotedascaPCinthe
table)basedonsimulateddata.Weconsideredsixscenariosasshow
ninTable2.3.we
observedconsistentlylargerTPR,smallerFPRandSHDbyusingthec
aPCestimation
methodcomparedtotheresultsusingthePC-algorithminallthesimu
lationscenarios.This
simulationfurtherjustesthebenetofadjustingforthecovar
iates'eectwhenestimating
theDAGstructure.Wealsosimulatethecasewhere
q>p
.SimilarpatternthatcaPC
22
set1set2set3set4set5set6
0.50.60.70.8
TPRset1set2set3set4set5set6
0.0040.0080.012
FPRFigure2.4:BoxplotofTPRandFPRunderderentsimulationsettings
withvarying
q
and

.outperformedPCwasobserved.

2.4.5Comparingdirectedandundirectedgraphs

Wedidsomesimulationsfollowingtocomparetheperformanceofthed
irectedandundirected
graphestimation.Herewechoose
p
=50
;100
;200,
q
=2
p
and
n
=250.Asbefore,wechose
E
[N
]=2,

=3.5.TheresultsaresummarizedinTable5.WeappliedgraphicalLASS
Oto
estimatetheundirectedgraphicalstructure.Wecanseethatun
directedmethodachievesa
higherTPR,butthedirectedmethodachievesalowerFPRandsmaller
SHD.HereSHD
isusedtomeasuretheskeletonofthegraphsinceundirectedgrap
hsdonothavedirection
information.ThereasonwhyundirectedmethodisbetteronTPRisb
ecause,intheory,
directedgraphissparsethanundirectedgraph.Sooverall,thedir
ectedgraphperforms
betterthantheundirectedmethodonbothFPRandSHD,butsue
rsalittlebitonTPR.
23
Table2.3:Estimationcomparisonwithandwithoutadjustingforthem
arkereectsunder
derentscenarios.
MethodTPRFPRSHD
p
=25,
q
=10,
n
=250,
E
[N
]=2,

=3.5caPC0.76810.001111.5
PC0.66980.038025.9
p
=50,
q
=50,
n
=250,
E
[N
]=2,

=4caPC0.74650.002126.7
PC0.56800.019755.3
p
=100,
q
=100,
n
=250,
E
[N
]=2,

=3caPC0.74030.001752.4
PC0.67970.005981.7
p
=400,
q
=200,
n
=250,
E
[N
]=2.5,

=20caPC0.76780.0054648.6
PC0.23460.01381504.9
p
=800,
q
=200,
n
=250,
E
[N
]=1.5,

=25caPC0.78190.00521948.1
PC0.13780.00823175.9
p
=1000,
q
=200,
n
=250,
E
[N
]=1.5,

=20caPC0.79020.00432543.4
PC0.27370.00653874.4
Table2.4:EstimationcomparisonbetweenPCandcaPCwhen
q>p
MethodTPRFPRSHD
p
=100,
q
=200,
n
=200,
E
[N
]=2,

=3caPC0.81760.008684.0
PC0.72640.0161129.8
p
=200,
q
=400,
n
=200,
E
[N
]=2,

=20caPC0.66120.0107324.9
PC0.26450.0177516.4
p
=400,
q
=800,
n
=200,
E
[N
]=2,

=20caPC0.57040.0076866.7
PC0.29840.01001138.3
2.5ApplicationtoRealData

Weappliedourproposedtwo-stageestimationmethodtoanAlzheime
rdiseasedataset
(Websteretal.2009).Followingthepaper,176Alzheimercasesand
187controlswere
includedinouranalysis.FortheSNPs,thosewithlowgenotypingcallr
ate(lessthan90%),
lowminorallelefrequency(lessthan5%)andthosefailedtheHardy-W
einbergequilibrium
testincontrolgroup(p-value
<
0.001)wereremoved.Aftertheseoperations,around332,000
SNPmarkerswereleftforfurtheranalysis.Weusedtheresiduale
xpressiondataafter
adjustingfortheeectsofseveralcovariatessuchasgender,
APOEstatusandageatdeath
24
Table2.5:Estimationcomparisonbetweendirectedandundirectedg
raphs
MethodTPRFPRSHD
p
=50,
q
=100,
n
=250,
E
[N
]=2,

=3.5directed0.79040.001312.5
undirected0.87360.036949.9
p
=100,
q
=200,
n
=250,
E
[N
]=2,

=3.5directed0.76400.001431.9
undirected0.83290.014990.2
p
=200,
q
=400,
n
=250,
E
[N
]=2.5,

=3.5directed0.78160.001475.1
undirected0.81180.0049137.5
(datacanbefoundathttp://labs.med.miami.edu/myers/LFuN/data
ajhg.html).Thereare
8,560geneexpressions.Weimplementedatwosamplet-testforeac
hgeneexpressionand
selectedtop100signcantgeneexpressionstolearntheirDAGstr
ucture.Foreachofthe
100geneexpressions,wettedasimplelinearregressionwitheachS
NPmarkerasthe
predictorandselectedSNPswithp-value
<
0
:001.AmongtheseSNPs,wechosethosethat
showassociationwithtwoormoregeneexpressions(p-value
<
0
:001).Thisendedupwith
3,776SNPsinthecasegroupand3,871SNPsinthecontrolgroupwith
60SNPsincommon.
Thenaldatasetusedfortheanalysiscontained100geneexpres
sions(asthe
Y
variable),
3,776SNPsinthecasegroupand3,871SNPsinthecontrolgroup(as
the
X
variables).We
appliedourestimatingmethodtothetwogroupsseparately.Ourgo
alwastolearntheDAG
structureforthe100geneswhileadjustingfortheeectsofSNP
markersinthecaseand
controlgroupswiththehopetoidentifyderentialDAGstructur
esthatcandistinguishthe
twogroups.Anyderencesmightpotentiallyexplainthediseaseetio
logyofAlzheimer.
WelearnedtheCPDAGsatthe

=0
:001level.Forthecasegroup,theestimated
CPDAGhas101directededges(Figure2.5).Forthecontrolgroup
,theestimatedCPDAG
has88directededges(Figure2.5).Thegraphsweregeneratedwit
htheRpackage
igraph
.Inbothgraphs,geneswererepresentedwithnumbers.Thecorr
espondinggenenamescan
befoundinthesupplementaltable.ThetwoCPDAGsshare18edges
incommon.We
25
noticedthatsomesub-CPDAGstructuresshowninthecasegroup
werenotpresentinthe
controlgroup.Forexample,thegraph70
!
95
 
38wereobservedinthecasegroup,while
thethreegenesareseparatedfromothergroupsinthecontrol
population.Ontheother
hand,someCPDAGsshowninthecontrolgroupwerenotpresentin
thecasegroup.For
example,thesub-graphby17-26-31-32-76isobservedinthecon
trolgroupwhilethegenes
arelessconnectedinthecasegroup.Suchderentialgraphicals
tructuresbetweenthetwo
groupsmightindicateregulationheterogeneitybetweenthetwogr
oups.Furtherbiological
vercationsareneedtovalidatethendings.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100Figure2.5:TheestimatedCPDAGinthecase(left)andcontrol(righ
t)groupbasedonthe
top100genes.Thoseoverlappednodesindicatedualregulationdir
ections.
Asacomparison,weappliedPC-algorithmdirectlytothe100geneswit
houtadjusting
fortheSNPs'eects.Forthecasegroup,theestimatedgraphh
as124edges.Forthecontrol
group,theestimatedgraphhas100edges.Thisisconsistentwitho
urassumptionthatthe
numberofedgesshouldbereducedafteradjustingforthemarke
rs'eect.Thisphenomenon
26
hasbeenexplainedanddemonstratedinotherworksfocusingonun
directedgraphs(e.g.,
YinandLi,2011;Caietal.2013).
ADAM17
APPNAE1
APBB1GAPDHBACE1
BACE2
PSENENAPH1AAPH1BNDUFV1NDUFV2NDUFV3NDUFA1
NDUFA2
NDUFA3
NDUFA4
NDUFA5
NDUFA6
NDUFA7
NDUFA8
NDUFA9
NDUFA10
NDUFAB1
NDUFA12
NDUFA13
NDUFB1NDUFB2NDUFB4NDUFB5NDUFB6NDUFB7NDUFB8NDUFB9NDUFB10NDUFS2NDUFS3NDUFS4NDUFS5NDUFS6NDUFS7NDUFS8NDUFC1NDUFC2SDHASDHBSDHCSDHDUQCRFS1CYC1UQCRC1UQCRC2UQCRHUQCRBUQCRQUQCR10UQCR11COX1
COX2
COX4I1
COX5A
COX5B
COX6A1
COX6B1
COX6C
COX7A1
COX7A2
COX7A2L
COX7B
COX7C
COX8A
ATP5A1
ATP5B
ATP5C1
ATP5D
ATP5E
ATP6
ATP5F1
ATP5G1
ATP5G3
ATP5H
ATP5O
ATP5J
HSD17B10LPLAPOELRP1TNFRSF1ABIDCALM2CALM3CALM1PPP3CAPPP3CBPPP3CCPPP3R1BAD
CYCSAPAF1
CASP9CASP3PLCB1PLCB4GRIN1GRIN2CCACNA1C
MAPK1MAPK3RYR3
ITPR1ITPR3ATP2A2
ATF6
CAPN1CAPN2CDK5R1CDK5MAPTGSK3BSNCAADAM17
APPNAE1
APBB1GAPDHBACE1
BACE2
PSENENAPH1AAPH1BNDUFV1NDUFV2NDUFV3NDUFA1
NDUFA2
NDUFA3
NDUFA4
NDUFA5
NDUFA6
NDUFA7
NDUFA8
NDUFA9
NDUFA10
NDUFAB1
NDUFA12
NDUFA13
NDUFB1NDUFB2NDUFB4NDUFB5NDUFB6NDUFB7NDUFB8NDUFB9NDUFB10NDUFS2NDUFS3NDUFS4NDUFS5NDUFS6NDUFS7NDUFS8NDUFC1NDUFC2SDHASDHBSDHCSDHDUQCRFS1CYC1UQCRC1UQCRC2UQCRHUQCRBUQCRQUQCR10UQCR11COX1
COX2
COX4I1
COX5A
COX5B
COX6A1
COX6B1
COX6C
COX7A1
COX7A2
COX7A2L
COX7B
COX7C
COX8A
ATP5A1
ATP5B
ATP5C1
ATP5D
ATP5E
ATP6
ATP5F1
ATP5G1
ATP5G3
ATP5H
ATP5O
ATP5J
HSD17B10LPLAPOELRP1TNFRSF1ABIDCALM2CALM3CALM1PPP3CAPPP3CBPPP3CCPPP3R1BAD
CYCSAPAF1
CASP9CASP3PLCB1PLCB4GRIN1GRIN2CCACNA1C
MAPK1MAPK3RYR3
ITPR1ITPR3ATP2A2
ATF6
CAPN1CAPN2CDK5R1CDK5MAPTGSK3BSNCAFigure2.6:TheestimatedCPDAGinthecase(left)andcontrol(righ
t)groupbasedongenes
mappedtotheAlzheimer'sdiseasepathwayfromtheKEGGdatabase
.Thoseoverlapped
nodesindicatedualregulationdirections.
WedidanothercasestudybyfocusingontheAlzheimer'sdiseasepat
hwayfromthe
KEGGdatabase.Therearetotal168genesinthispathway,buton
ly120geneexpressions
inthisdatasetweremappedtothepathway.Foreachofthe120ge
neexpressions,we
performedasinglemarkerlinearregressionanalysisusingeachofth
eSNPs.Weselected
thosemarkerswithp-value
<
0
:001.Thenthe5000SNPs(with93incommon)withthe
smallestp-valueswereselectedincasegroupandcontrolgroupre
spectively.
Thedataweusedforthenalanalysiscontain120geneexpression
s(the
Y
variables),
5000SNPmarkersinthecasegroupand5000SNPmarkersinthecon
trolgroup(the
X
variables).OurgoalistolearntheDAGstructuresbasedonthe12
0geneexpressionswhile
adjustingfortheeectsof5000SNPmarkersinbothgroups.Wea
ppliedourestimating
27
methodtothecaseandcontrolgroupseparately.Inthecasegr
oup,theestimatedCPDAG
has108directededges(Figure2.6).Inthecontrolgroup,thees
timatedCPDAGhas109
directededges(Figure2.6).Thereare25commonedgesbetweent
hetwographs.We
canseederentgraphicalstructuresbetweenthetwogroups.
Someoftheestimatedsub-
DAGsagreeswiththerealbiologicalnetwork.Forexample,somege
nesinmitochondria
wereidentedasconnectedinthecasegraph(e.g.,COX7A2,NDUFB
2,SDHC;NDUFV3,
COX7B,COX4I1,COX1,NDUFB4,NDUFA10).

2.6Discussion

Genesfunctioninnetworksandgeneexpressionsareoftenresult
edfromgeneregulations.
Hence,learningdirectedregulationscanenhanceourknowledgeab
outgenefunctionanddis-
easeetiology.Inthiswork,weintroducedacovariate-adjustedm
odeltostudythegraphical
structuresofmultiplegeneexpressionswithdirectionregulationinf
ormation.Weproposed
atwo-stageproceduretoobtaintheestimatedCPDAGcorrespon
dingtotheprobabilitydis-
tribution.Themarkeradjustedexpressionnetworkcanreducet
hefalseconnectionsoftwo
genesiftheyshareacommonregulator.Simulateddatawereusedt
otestourmethodand
showedsupportiveresultsbothinlowandhighdimensionalcases.We
providedaconsistency
resultforthecasewhen
p
isincreasingalongwithsamplesize
n
.Weappliedourmethod
toarealcase-controldatasettounderstandthegeneregulat
ionmechanisminAlzheimer
disease.Derentgraphswerelearnedcorrespondingtothecase
andcontrolgroup.For
theAlzheimerpathwayanalysis,wewereabletorecoversomeofthe
structuresforgenes
involvedinMitochondriafunction.
Intherealdataanalysis,wecouldnotrecovertheexactcausalr
elationshipsbetween
28
variablesthatareshownintheKEGGdatabase.Thisisduetothecom
plexityofthe
biologicalstructureandlimiteddatasamples.Moreover,thedatam
ayviolatethenormality
assumption.Otherrobustmethodssuchasthenonparanormalm
odel(Liuetal.2009;Liu
etal.2012)whichimplementanonparametrictransformationofthe
datamaybeapplied
toourframework.Inaddition,weexpectmorereliablestructures
canbeobtainedifwe
incorporatepriorknowninformationintoouranalysisbypre-settin
gsomeedges.Weexpect
moreaccurateandstableestimationcanbeachievedviaincorporat
ingbothstatisticaland
biologicalknowledgeintothemodeldevelopmentprocess.Thesewill
beincorporatedinto
ourfutureinvestigation.
29
Chapter3

IntegrativeAnalysisofGenetical

GenomicsDataIncorporating

NetworkStructure

3.1Introduction

Ingeneticalgenomicsanalysis,achallengingproblemishowtocombine
phenotype,genotype
andgeneexpressiondatatogethertoobtainnewknowledgeabout
thegeneticbasisofgene
expressionsandtofurtherprovidenovelinsightsintogenefunct
ions.Muchworkonmodel
developmentshasbeendonetocombinederentsourcesofdata.
Forexample,QTLmapping
studiescombinegenotypeandphenotypedatatoinferthegenetic
architectureofacomplex
trait.eQTLmappingcombinesgeneexpressionsandgenotypeswhe
regeneexpressionsare
viewedasquantitativetraitstounderstandthegeneticbasisofge
neexpression.Gaussian
graphicalmodels(MeinshausenandBuhlmann,2006;Friedmaneta
l.,2007)havebeen
appliedtogeneexpressiondatatoinfergeneregulatorynetworks
.Recently,covariate-
adjustedGaussiangraphicalmodels(YinandLi,2013;Caietal.,201
3)havebeendeveloped
tocombinegenotypesandgeneexpressionstoimprovegenenetwo
rklearning.Morerecently,
integrativemodelingandtestingmethods(e.g.,Huangetal.,2014;Zh
aoetal.,2014)combine
30
phenotypes,genotypesandgeneexpressionstoimprovethepow
erofstatisticaltesting.As
moreandmoregenomicdataarebecomingavailable,integrativeanaly
sisofmulti-source
genomicdatawouldprovideamorecomprehensivepictureofabiologic
alsystem,hence
oeringpromisingopportunitiesforpersonalizedmedicineandtreat
ment.
Whenusinggeneexpressionstopredictphenotypicresponses,line
arregressionhasbeen
thestandardmeanstomodeltherelationbetweenphenotypesan
dgeneexpressions.Dueto
thehighcostofobtainingexpressiondata,samplesizeistypicallysma
llerthanthenumberof
geneexpressions.Tosolvethisproblem,variousregularizationmet
hodshavebeendeveloped
toachievevariableselectionandestimation.Fromthebiologicalpers
pective,thephenotypic
responseofinterestandgeneexpressionsareofteninuencedb
ycommonexternalforces.
Theseexternalforcesareusuallyunobservableandcannotbemo
delledexplicitlyinthe
model.Hencetheireectsareusuallyputintheerrorterm.Sothee
xogeneitycondition
inwhichthepredictorsandtheerrorareuncorrelatedisviolated,a
ndconsequentlythe
estimatorsobtainedbyordinaryleastsquaresmaynotbeconsiste
nt.Toovercomethe
issuebroughtbyendogeneity,Linetal.(2015)appliedinstrumenta
lvariables(IV)models
andregularizationmethodstohandletheproblemcausedbythecor
relationbetweengene
expressionsandtheerrortermsinahighdimensionalsetupwithbot
hgeneexpressionsand
genotypes.Geneticvariantsareusedasinstrumentalvariabless
incetheyareindependent
ofexternalenvironmentalexposuresandaectthephenotypic
traitofinterestonlythrough
aectinggeneexpressions.Givengeneticvariantsusedasinstrum
entalvariables,theproblem
broughtbythedependencebetweenthegeneexpressionsandth
eerrortermswhichincluding
environmentalexposuresorotherfactorscanbesolvedbysubs
titutingthepredictionsofgene
expressionsintheregressionmodel.Thedcultiesonhighdimensiona
lityofgenotypesand
geneexpressionsarehandledbyregularizationmethodsuchasthe
LASSOtyperegression.
31
Weallknowthatgenesfunctioninnetworkstofulltheirjointtask
.Thusgeneexpres-
sionstendtobecorrelatedwhentheybelongtothesamenetwork(
e.g.,pathway).When
geneexpressionsareconsideredinaregressionmodel,suchcorre
lationinformationshouldbe
consideredwhenselectingimportantgenes.Althoughvastamount
sofstatisticalliterature
havebeenfocusedonthevariableselectionprobleminahighdimension
alsetup,manyof
themfailforcorrelatedvariables.AspointedoutbyZouandHastie
(2005),LASSOtendsto
selectonlyonevariablefromagroupofvariableshighlycorrelatedwit
heachother.Several
methodswereintroducedtosolvetheproblem.ZouandHastie(200
5)proposedelasticnet
toachievegroupingeect,i.e.,\thecoecientsofagroupofhighlyc
orrelatedvariablestend
tobeequal(uptoachangeofsignifnegativelycorrelated)".InLe
mma2andTheorem2in
theirpaper,itsaidthatifastrictlyconvexpenaltyisapplied,thehigh
erthecorrelationof
twovariablesis,thelesstheupperboundofthedistanceoftheest
imatedcoecientsofthe
twovariablesis.Ifthecorrelationisalmostzero,thetwoestimated
coecientsarealmost
thesame(exceptaminussignifnegativelycorrelated).LiandLi(2
008)proposedanetwork
constrainedmethodwhichtakesadvantageofthecorrelationinfo
rmationnamelynetwork
information.TheyintroducedaLaplacianmatrixand
L
2
penaltytocopewiththecorre-
lationproblem.Theyapplied
L
2
normonthepairwisederencesofthecoecientsofthe
correlatedvariablestoachievegroupingeect,andobtainedsimilar
theoreticalresultsasin
ZouandHastie(2005).Theirsimulationresultsshowedthattheirme
thodworksbetterthan
elasticnetincaseswherepriorknowledgeongraphicalstructures
isconsidered.InLiand
Li(2010),theauthorsproposedamodedpenaltyfunctionwhich
takesaccountofthesign
derencestoencouragetheabsolutevaluesofthecoecientsof
theconnectedvariablesto
shrinktowardthesame.Huangetal.(2011)proposedasparseLa
placianshrinkagemethod
andproveditsoracleproperty.
32
MotivatedbyLiandLi(2008),inthischapterweproposeatwo-st
epproceduretoachieve
variableselectionandestimationundertheIVregressionframewor
kincorporatinggene
regulatorynetwork.Throughapplyinggraphicalstructuresasp
riorknowledge,wehopeto
addresstheproblemcausedbyhighcorrelationsbetweengenesina
pathwaytoachievebetter
variableselectionandestimationresults.Werstadoptamultivariat
emulti-responselinear
modeltoprojectgeneexpressionsintothegeneticspace.Under
theassumptionthateach
geneisonlycontrolledbyafewgeneticvariants,weusetheLASSOalg
orithmtoestimatethe
coecientmatrix.Sincegeneticvariablesareindependentoftheer
rorterms,theprojected
expressionvaluesarepotentiallyuncorrelatedwiththeerrorterm
s.Thentheprojected
valuesareusedinthesecondstageestimationwhilegenenetworkst
ructuresareconsidered.
Weproposeagraphconstrainedregularizationmethodtoachievev
ariableselectionand
estimationinthesecondstage.Assumingcertaingraphicalstruct
uresongeneexpressions
whichcanbeobtainedeitherbybiologicalexperimentsorbystatistic
alestimation,two
penalties(LASSOandgraphconstrainedpenalties)areapplied.The
graphconstrained
penaltyisusedtoencourageshrinkingcoecientsofapairofvariab
lesconnectedinthe
networktowardthesametoachievegroupeect.Inthestudyof
thetheoreticalproperties
oftheestimatorobtainedthroughourmethod,weassumethe\gr
aphicalirrepresentable
condition"whichisusedtoguaranteetheselectionconsistency.
Thischapterisorganizedasthefollowing.In3.2,weintroducetheins
trumentalvariables
modelandthetwo-stepestimationmethod.Theoreticalconsiste
ncyresultsaregivenin3.3.
Simulationsandrealdataanalysisaregivenin3.4and3.5,respectively
,followedbydiscussion
in3.6.AllthetechnicalproofsarerenderedintheAppendixB.
33
3.2StatisticalMethodsandEstimation

3.2.1TheModel

Let
Y
=(
Y
1
;:::;Y
n
)
T
,X
=(
X
1
;:::;X
n
)
T
,and
G
=(
G
1
;:::;G
n
)
T
bematricesof
n
inde-
pendentandidenticallydistributedphenotypes,geneexpressions
,andgenotypes,where
Y
i
isascaler,
X
i
=(
X
i
1
;:::;X
ip
)
T
,and
G
i
=(
G
i
1
;:::;G
iq
)
T
.Themodelsweusearethefollowing
Y
=
X
+

;(3.1)
X
=
G

+

;(3.2)
where

=(
1
;:::;

p
)isa
q

p
coecientmatrix,

=(

1
;:::;
p
)
T
isacoecientvector,

=(

1
;:::;
n
)
T
and

=(

1
;:::;
n
)
T
areerrorvectoranderrormatrixsatisfying(

T

i
;
i
)
T
are
i.i.d.
p
+1dimensionalrandomvariablesfollowingamultivariatenormaldistrib
ution
N
(0
;).Wewriteas=
0

B

@

11

12

21
˙
22
1

C

A
,where
11
isthecovariancematrixof

i
,˙
22
is
thevarianceof

i
,and
12
=
T

21
isthecovariancebetween

i
and

i
.Figure3.1:Illustrationoftheinstrumentalvariablesmodelingenet
icalgenomicsanalysis.
Therelationshipbetweengenotype,geneexpressionandphenoty
pecanbemodelledwith
linearregressionmodels.Figure3.1showstherelationshipbetweent
hethreevariablesas
34
wellasthelatentvariable
E
,where
Y
isthephenotypeofinterest,
X
isthegeneexpression,
G
isthegenotype,and
E
representsexternalstimulatingfactorswhichareunobservable
henceareputintotheerrorterm.Weassume
G
aect
Y
onlythrough
X
,and
E
aects
both
X
and
Y
.ToillustratetheideaofIVregressioninourcase,weconsiderasimple
regressionmodel
Y
=
X
+

(
X
)since
E
aectsboth
Y
and
X
.Sotheerrorterm

coulddependson
X
andwehave
dY
dX
=

+
d
(
X
)
dX
.Wecanseethatusingordinaryleastsquaresregressionwill
notgiveconsistentestimatefor

unless
d
(
X
)
dX
=0,i.e.,

(
X
)isindependentof
X
.For
thepurposeofhopingtoobtainconsistentestimatorof

,PhilipG.Wrightintroducedthe
instrumentalvariablesmodelin1928.InFigure3.1,geneticvariant
G
isusedasinstrumental
variableinourgeneticalgenomicsanalysis.Asexplainedbefore,
G
isagoodchoicesinceit
aect
Y
onlythrough
X
andisindependentof
E
.Moredetailedexplanationcanbefound
inLinetal.(2015).

3.2.2Estimationofgeneticeects

Undertheassumptionthatonlyafewgeneticvariantsinuencethe
geneexpressions,the
coecientmatrix

issparse.Weapplyregularizedleastsquaresmethodtoestimateea
ch
geneexpressionseparately.Theestimationproblemcanbeformula
tedasthefollowing
p
optimizationproblems
^

j
=argmin

j

1
2
n
n
X
i
=1
tr
f
(
X
j

G

j
)
T
(
X
j

G

j
)
g
+

j
k

j
k
1

;j
=1
;:::;p:
(3.3)
where

j
;j
=1
;::;p
aretuningparameters,
kk
1
referstothe
L
1
normand
X
j
=(
X
1
j
;:::;X
nj
)
T
;j
=1
;:::;p
.Theestimateof

isdenotedas
^

=(
^

1
;:::;
^

p
).
35
3.2.3Networkconstrainedregularization

Intherststep,weobtainanestimateofthecoecientmatrix

.So,wehavepredictions
ofthecovariatesas
^
X
=
G
^

.Wesubstitute
^
X
for
X
inthesecondstep.
Theoptimizationproblemweareinterestedinis
^

=argmin

n
1
2
n
n
X
i
=1

Y
i

p
X
j
=1
^
X
ij

j

2
+
p

(

)
o
;(3.4)
p

(

)=


p
X

j
=1
j

j
j
+
1


2
X
k
˘
t
j
r
kt
j
(

k

s
kt

t
)
2

;(3.5)
where
r
kt
isaweightofcorrelationstrengthbetweentwovariables,
s
kt
=sign(
ˆ
kt
),
ˆ
kt
is
correlationcoecient,
>
0and

2
[0
;1]aretuningparameters,and
k
˘
t
means
r
kt
6=0.
ThepenaltyfunctioninEquation3.5includestwopenaltieswhichareus
edtoselectvariables
andtohandletheproblemsbroughtbyhighcorrelationsviaencoura
gingcoecientsoftwo
correlatedvariablestoshrinktothesamevaluetoachievegrouping
eect.
Weapplycoordinatedescentalgorithmtosolvetheoptimizationprob
lem.Wecenterthe
responsevariable
Y
andstandardize
^
X
j
,j
=1
;:::;p
.Let
h
2f
1
;:::;p
g
,wehave
1
2
n
n
X
i
=1

Y
i

p
X

j
=1
^
X
ij

j

2
+


p
X
j
=1
j

j
j
+
1


2
X
k
˘
t
j
r
kt
j
(

k

s
kt

t
)
2

=
1
2
n
n
X
i
=1

Y
i

^
X
ih

h

X
j
6=
h
^
X
ij

j

2
+


j

h
j
+

X
j
6=
h
j

j
j
+
1


2
j
r
hj
j
(

u

s
hj

j
)
2
+
1


2
X
k
˘
t
k;t
6=
h
j
r
kt
j
(

k

s
kt

t
)
2

:36
Takingthederivativewithrespectto

h
,wehave
@
@
h
n
1
2
n
n
X
i
=1

Y
i

p
X

j
=1
^
X
ij

j

2
+


p
X
j
=1
j

j
j
+
1


2
X
k
˘
t
j
r
kt
j
(

k

s
kt

t
)
2

o
=

1
n
n
X
i
=1

Y
i

X
j
6=
h
^
X
ij

j


(1


)
X
h
˘
j
r
hj

j
+

1+

(1


)
X
h
˘
j
j
r
hj
j


h
+


h
j

h
j
:Let
@@
h
n
1
2
n
P
n

i
=1

Y
i

P
p

j
=1
^
X
ij

j

2
+
p

(

)
i
=0.Wesolvetheequationandobtain
thesolutionas
^

h
=
S
(
1
n
P
n

i
=1

Y
i

P
j
6=
h
^
X
ij

j

+

(1


)
P
h
˘
j
r
hj

j
;
)
1+

(1


)
P
h
˘
j
j
r
hj
j
:SimilartoFriedmanetal.(2007)andFriedmanetal.(2010),weset

max
=
max
j
jh
^
X
j
;Y
ij
n
and

min
=

max
,where

=0
:001,andconstructasequenceof
K
valuesof

decreasing
from

max
to

min
onthelogscale.Weusecrossvalidationtochoosethetuningparame
ters

and

.3.3TheoreticalProperties

Herewegivethetheoreticalresultsonthevariableselectionandes
timation.Weadopt
notationsusedinLinetal.(2015).Foramatrix
A
,k
A
k
1
=max
j
P
i
j
a
ij
j
and
k
A
k
1
=
max
i
P
j
j
a
ij
j
.Foravector

,amatrix
A
,andindexsets
I
and
J
,
I
and
A
IJ
denotethe
subvectorandthesubmatrix.
J
c
denotesthecomplementof
J
and
j
J
j
denotesthenumber
oftheelementsin
J
.Denetherestrictedeigenvalueforamatrix
A
n

m
and1

s

m
by

(
A;s
)=min
jJ

s
min

6=0
k

J
c
k
1

3
k

J
k
1
k
A
k
2
p
n
k

J
k
2
.Let
r
=max
1

j

p
j
supp(
j
)
j
,s
=
j
supp(

)
j
,and
˙
max
=max
1

j

p
˙
j
,wheresupp(

)isthesupportiveset.Weassume
k

k
1

M
1
37
and
k

k
1

M
2
forsomepositiveconstants
M
1
and
M
2
.Wewrite(3.5)as
p

(

)=


P
p

j
=1
j

j
j
+
1


2

T
L


=

1
P
p

j
=1
j

j
j
+

2

T
L

,where

1
=

and

2
=

1


2
.L
isanon-negativedenitematrixequippedwiththegraphicalinforma
tion.Weassume
k
L

k
1
<C
L
forsomeconstant
C
L
.Inordertocontroltheestimationerrorintherst
step,thefollowingconditionisimposed.
(C1)

(
G
;r
)


forsome
>
0.
Weallow
p
,q
,r
,s
,and

todependonthesamplesize
n
.InLinetal.(2015),they
establishedatheoremontheestimationandpredictionlossof
^

.Webaseouranalysison
theirresults.
Inordertoobtaintheselectionconsistencyandthe
L
1
bound,weneedtoimpose
anassumptionon
G

whichissimilartotheirrepresentableconditioninZhaoandYu
(2006).Dene
C
=
1
n
(
G

)
T
G

,S
=supp(

),and
˚
=
k
(
C
SS
+2

2
L
SS
)

1
k
1
.Let
b
0
=min
j
2
S
j

j
j
.Theassumptionisneededtoobtainselectionconsistency,
(C2)
k
(
C
S
c
S
+2

2
L
S
c
S
)(
C
SS
+2

2
L
SS
)

1
k
1
<
1


,where

isaconstantand
0
<<
1.
SimilarlytotheargumentinZhaoandYu(2006),(C2)requiresthe\r
egressioncoe-
cients"oftheirrelevantcovariatesontherelevantcovariatesar
elessthan1.InLemma4
oftheAppendixB,weprovethatcondition(C2)holdsforthesample
covariancematrix
of
^
X
withasmaller

.Withthese,weestablishthefollowingtheoremabouttheselection
consistencyof
^

.Theorem2.
Iftheregularizationparametersintherststepareselect
edas

j
=
C˙
j

q
(log
p
+log
q
)
n
andsatisfy
16
˚

2
rs
max
(2
M
1
+

max
)


2(4


)
,where
C>
2
p
2
and

max
=
max
1

j

p

j
,theregularizationparametersinthesecondstepareselec
tedas

1
=
C
0

38

q
r
(log
p
+log
q
)
n
and

2
C
L


(16

3

)
4(4


)(8

3

)

1
,where
C
0
=
c
0
L
max(
˙
p
+1
;M
2
˙
max
)
with
c
0
>
0
,
b
0
hasalowerboundthat
b
0
>
2(4


)
8

3

˚
h
8


2(4


)

1
+2

2
C
L
i
,andfurtherassume
that(C1)and(C2)hold,thenwithprobabilityatleast
1

c
1
(
pq
)

c
2
,where
c
1
;c
2
>
0
,
^

obtainedfrom(3.4)satises
sign(
^

)=sign(

)
;k
^

S


S
k
1

16(4


)
˚C
0
(8

3

)
2

r
r
(log
p
+log
q
)
n
:3.4Simulation

HerewecloselyfollowthesimulationsetupproposedinLinetal.(2015)
,butimpose
certaincorrelationstructuresongenestoshowtheimpactofnet
workstructuresonvariable
selectionandestimation.Wesimulatetotal
p
variablesofgeneexpressionandconsiderthree
groupstructureson

with5variablesineachgroup.Withineachgroup,thevariablesare
correlated.Thestrengthofcorrelationiscontrolledbythenumbe
rofeectiveSNPsthey
shareincommon.Twogroupsofvariableshavenonzerocoecients
andthe3rdgrouphas
zerocoecients.Alltherestofthe
p

15variableshavezerocoecientsandhavenograph
structure,i.e.,

=(

1
;
;
5
|
{z
}
5
;
6
;
;
10
|
{z
}
5
;0
;
;0
|
{z
}
5
;0
;
;0
|
{z
}
p

15
)
T
.Thepurposetosimulate
thestructuredgroupwithzerocoecientsisforthecomparisonp
urpose.Twosimulation
scenariosareconsidered.Inscenario1,wesetallthenonzeroco
ecientstobe0.5(i.e.,

k
=0
:5for
k
=1
;
;10),whileinscenario2,

k
˘
U
(0
:5
;1)for
k
=1
;
;10.Again,
thetwosetupsareforthepurposeofcomparison.
39
Forthecovariancematrixcov(

;
)==
0

B

@

11

12

21
˙
22
1

C

A
,the
i
throwand
j
thcolumn
entryin
11
issettobe(0
:2)
ji

j
jfor
i;j
=1
;
;p
,and
˙
22
=1.Tomake

andthe15
X
variableswithagraphstructurecorrelated,wesetthe15entries
inthelastcolumnsof

12
as0.23andtherestas0.Thenwesimulate(

;
)
˘
N
p
+1
(
0
;).
G
isgeneratedby
samplingfromaBernoullidistributionwithsuccessprobability0.5.
Tosimulate
X
,weneedtogeneratethecoecientmatrix

rst.Aswementionedbefore,
thegraphstructurein
X
iscontrolledbythenumberofcommonlyshared
G
variables.The
moretheyshareincommon,thestrongerthecorrelationbetween
them.Weassumethe
numberofcommonSNPsas3,4or5toachievederentstrengthso
fcorrelation.The
correspondingcoecientsin

aresettobeindependentrealizationsfrom
U
(0
:75
;1).Here
eachcolumnof

isacoecientvectorcorrespondingtoanexpressionvariable.For
the
restcolumnsof

correspondingtothe
X
variableswithoutastructure,5nonzeroentriesof
eachcolumnarerandomlyselectedandtheirvaluesareindependent
lyfrom
U
([

1
;
0
:75]
[
[0
:75
;1]).Thenwecangenerate
X
and
Y
by
X
=

G
+

and
Y
=

T
X
+

.Weconsiderbothlowdimensionalandhighdimensionalsituationsinour
simulation.
Inthelowdimensionalcase,weset
p
=100
;q
=100andvary
n
from200to1400when
both
p
and
q
arexed.Inthehighdimensionalcase,weset
p
=600
;q
=600and
n
=
300.WecompareourmethodnamelyIVregressionwithgraphconst
rainedregularization
(denotedasIVGC)withthemethodproposedbyLinetal.(2015)wit
hIVregressiononly
(denotedasIV).Themeasuresweusetoevaluateandcompared
erentmethodsandsetups
arethenumberofcorrectlyestimatednonzerocoecients(True
Positive),thenumberof
estimatednonzerocoecients(ModelSize),
k
^


k
2
(EstimationLoss),
n

1
=
2
k
X
(
^


40

)
k
2
(ModelError),and
MCC
(=
TP

TN

FP

FN
p
(
TP
+
FP
)(
TP
+
FN
)(
TN
+
FP
)(
TN
+
FN
)
)whereTP=true
positive,TN=truenegative,FP=falsepositiveandFN=falsenegativ
e.TNisdenedasthe
numberofcorrectlyestimatedzerocoecients;FPisdenedasth
enumberofincorrectly
estimatednonzerocoecients;andFNisdenedasthenumberofin
correctlyestimated
zerocoecients.ThegreatertheMCCis,thebetteravariablesele
ctionperformanceis.
Thereasonthatwechoosemodelerrorinsteadofpredictionerro
risexplainedinLinetal.
(2015).Weusecrossvalidationtochoosethepenaltytuningparam
eters.Foreachcase,we
run200replicationsandreportthesamplemeansandstandarderr
ors.
Table3.1:Simulationresultsforp=100,q=100,n=600
MethodnumSNPEstimationLossModelErrorTruePositiveModelSiz
eMCC

k
=0
:5
;k
=1
;
;10
30.67(0.28)0.34(0.11)10.00(0.00)14.79(3.29)0.82(0.10)

40.62(0.34)0.35(0.12)10.00(0.00)13.73(4.10)0.86(0.11)

51.33(1.43)0.59(0.46)9.60(1.15)12.98(3.50)0.84(0.13)
IVGC

k
˘
U
(0
:5
;1)
;k
=1
;
;10
31.51(0.41)0.67(0.14)10.00(0.00)15.56(4.15)0.80(0.11)

41.28(0.38)0.59(0.14)10.00(0.00)14.35(3.96)0.83(0.11)

52.13(1.70)0.95(0.59)9.63(1.07)12.97(3.61)0.84(0.11)

k
=0
:5
;k
=1
;
;10
31.62(0.46)0.66(0.14)9.99(0.10)15.77(4.36)0.79(0.11)

41.82(0.49)0.75(0.15)9.96(0.20)14.38(3.87)0.83(0.11)

54.20(0.71)1.52(0.24)7.60(1.06)10.88(3.15)0.71(0.12)
IV

k
˘
U
(0
:5
;1)
;k
=1
;
;10
32.05(0.58)0.89(0.18)10.00(0.00)15.49(4.42)0.80(0.11)

42.30(0.62)0.94(0.19)9.96(0.20)14.64(4.26)0.82(0.11)

55.70(1.22)2.14(0.44)8.01(1.11)11.44(3.32)0.74(0.14)
Table3.1showstheresultsforthelowdimensionalsituationwith
n
=600.Asintroduced
inthesimulationsetup,thegreaterthenumberofsharedSNPs(nu
mSNP),thehigherthe
correlationbetweentheexpressionvariables.WecanseeIVGChas
lessestimationerrorsand
41
modelerrorsinthetwoscenarioswithderent

values.Forthetruepositive,bothIVGC
andIVperformsimilarwhennumSNPis3or4(weakcorrelationcompar
edtonumSNP=5).
WhennumSNPincreasesto5,IVGChasahighertruepositiverateth
anIVdoes.This
iswhatweexpecttosee.IVimplementsaLASSOpenaltywhichonlyran
domlypicks
onevariableamongcorrelatedones.Introducinggraphicalconst
rainedpenaltyachievesthe
selectionconsistencyforgraphicalstructurespresent.Forth
emodelsizeandMCC,IVGC
andIVhavequitesimilarperformance,althoughIVGChasslightlylarg
erMCCthnaIV
does.Ingeneral,IVGCperformsbettercomparingtoIVinbothva
riableselectionand
estimation.
Wealsovariedthesamplesize
n
butxed
p
and
q
.TheresultsaredisplayedinFigure
3.2.Wecanseethatbothmethodsperformbetterasthesamplesiz
eincreases.IVGC
completelydominatesIVintermsofestimationlossandmodelerror.
Inaddition,IVGC
hasslightlysmallermodelsizeandlargerMCCcomparedtoIV.Bothme
thodsperform
quitesimilarintermsoftruepositive,especiallywhen
n
getslarge.Adetailedcomparison
canbefoundinTable3.2alongwiththestandarderrorgiveninthepar
enthesis.
Table3.3showstheresultsinhighdimensionalsituations.Wecansees
imilarpatterns
asweobservedinthelowdimensionalcases.Ingeneral,IVGCperfo
rmsasgoodasorbetter
thanIVdoes.

3.5ApplicationtoRealData

Weapplyourmethodtoahumanlivercohortdatasettoshowtheutilit
yofthemethod.
Thedatacontaingenotypes,geneexpressions,andphenotypes
ofenzymeactivitieswhich
canbedownloadedfromsagebionetworks'synapseplatformusing
SynapseIDsyn4499.For
42
0123456
Estimation LossSample size
Estimation loss200400600800100012001400
IVGC

IV0.00.51.01.52.02.5
Model Error
Sample size
Model error200400600800100012001400
IVGC

IV0246810
True Positive
Sample size
No. of true positives
200400600800100012001400
IVGC

IV1012141618
Model Size
Sample size
Model size
200400600800100012001400
IVGC

IV0.60.70.80.91.0
MCCSample size
MCC200400600800100012001400
IVGC

IVFigure3.2:Simulationresultsforxed
p
=100and
q
=100butvarying
n
(200to1400)
detailsofthedataset,pleaserefertoSchadtetal.(2008)andYa
ngetal.(2010).The
phenotypesareenzymeactivitymeasurementsofCytochromeP4
50.Therearetotal170
individualsmeasuredon18,556genetranscripts,449,699SNPsands
omecovariateslike
43
Table3.2:Simulationresultsfor
p
=100,
q
=100,
n
=200

1400andnumSNP=4
n
MethodEstimationLossModelErrorTruePositiveModelSizeMCC
200IVGC1.29(0.77)0.72(0.28)9.93(0.41)14.31(3.87)0.83(0.11)
IV3.20(0.78)1.32(0.27)9.30(0.76)13.69(3.60)0.79(0.12)
400IVGC0.86(0.43)0.47(0.15)9.98(0.20)14.43(3.53)0.83(0.11)
IV2.34(0.63)0.95(0.20)9.77(0.47)14.74(3.43)0.80(0.11)
600IVGC0.62(0.34)0.35(0.12)10.00(0.00)13.73(4.10)0.86(0.11)
IV1.82(0.49)0.75(0.15)9.96(0.20)14.38(3.87)0.83(0.11)
800IVGC0.55(0.25)0.29(0.09)10.00(0.00)14.36(4.15)0.83(0.11)
IV1.70(0.42)0.68(0.14)9.99(0.10)14.91(4.14)0.82(0.11)
1000IVGC0.49(0.22)0.26(0.09)10.00(0.00)14.52(3.85)0.83(0.11)
IV1.47(0.38)0.60(0.12)10.00(0.00)15.27(3.84)0.80(0.11)
1200IVGC0.45(0.18)0.24(0.08)10.00(0.00)13.91(3.44)0.85(0.11)
IV1.29(0.33)0.53(0.11)10.00(0.00)14.33(3.32)0.83(0.10)
1400IVGC0.43(0.19)0.23(0.07)10.00(0.00)14.50(4.02)0.83(0.11)
IV1.25(0.33)0.50(0.10)10.00(0.00)14.71(4.28)0.83(0.12)
genderandage.Therearetotal9enzymeactivitymeasuresandw
eparticularlyfocuson
CYP2E1sinceitistheonlyonepassedtheShapiro-Wilknormalitytest(
p-value
>
0
:1)after
thelogtransformation.Weregressthelog-transformedCYP2E1
overcovariatesgenderand
age,thenobtainthecovariatesadjustedresponsesforfollowup
analysis.
FortheSNPdata,thosewithgenotypingcallratelessthan90%orm
inorallelefrequency
(MAF)lessthan5%areremovedfromthedataset.Afterthisstep
,therearetotal312,082
SNPsleft.HerewefocusonKEGGpathway\MetabolismofXenobiotic
sbyCytochrome
P450".Therearetotal74genesinthispathwayand72ofthemare
mappedtoourdata
set.Wetalinearregressionmodelforeachgenetranscripttose
lectthetop1,000SNPs
accordingtotheirmarginalp-values.Thenwetamultipleregressio
nmodelwiththe1,000
44
Table3.3:Simulationresultsfor
p
=600,
q
=600and
n
=300
MethodnumSNPEstimationLossModelErrorTruePositiveModelSiz
eMCC

k
=0
:5
;k
=1
;
;10
31.65(1.08)0.73(0.27)10.00(0.00)20.70(11.02)0.74(0.14)

41.68(0.91)0.84(0.25)9.98(0.14)18.17(8.91)0.78(0.15)

52.93(1.76)1.25(0.51)9.13(1.38)16.25(7.36)0.75(0.15)
IVGC

k
˘
U
(0
:5
;1)
;k
=1
;
;10
32.59(0.94)1.22(0.27)9.99(0.10)19.65(7.07)0.74(0.12)

42.62(1.04)1.37(0.36)9.91(0.45)17.54(7.25)0.78(0.13)

54.36(2.64)1.93(0.70)9.15(1.42)16.95(8.67)0.75(0.18)

k
=0
:5
;k
=1
;
;10
32.95(1.21)1.13(0.25)9.84(0.39)22.22(11.68)0.70(0.14)

43.28(1.21)1.31(0.28)9.63(0.56)19.77(10.38)0.73(0.15)

54.86(1.23)1.77(0.35)7.71(1.03)16.70(9.69)0.64(0.15)
IV

k
˘
U
(0
:5
;1)
;k
=1
;
;10
33.86(1.48)1.61(0.35)9.90(0.30)21.84(10.98)0.71(0.13)

44.45(1.10)1.86(0.35)9.66(0.59)19.35(7.50)0.72(0.12)

56.91(1.69)2.64(0.44)7.87(1.08)16.49(9.00)0.65(0.16)
SNPsandestimatetheircoecientsusingtheLASSOalgorithm.Weth
enobtainthetted
valuesforthe72genetranscriptsandapplythegraphconstraine
dIVmethodtoselect
importantones.Wemanuallyobtainthegraphstructureforthe72
genesinthispathway
fromtheKEGGpathwaydatabase.Stabilityselection(Meinshausen
andBuhlmann,2009;
ShahandSamworh,2013)isappliedtoobtainastablevariableselectio
nresult.Eachtime
werandomlyselect80%ofthedatatorunouralgorithmandwerepea
t100times,then
calculatethepercentageofselectionforeachgeneaatheselectio
nrate.
Table3.4showsthetop10selectedgenesusingIVGCandIV.Wecans
eethatalarge
proportionofgenesselectedwiththetwomethodsoverlap.Howev
er,ahigherselectrate
isachievedfortheIVGCmethod.ThisimpliesthattheIVGCmethodism
orestable
whengraphicalinformationistakenintoaccount.Amongthelistedg
enes,CYP2E1is
selected100%usingtheIVGCmethodcomparedtoa99%fortheIVm
ethod.Weexpect
45
Table3.4:Listoftop10selectedgenesforeachoftheIVGCandIVm
ethod
IVGCIV
CYP2E11.000CYP2E10.990
UGT1A50.775UGT1A50.765
SULT2A10.540SULT2A10.420
CYP2B60.465CYP2B60.295
MGST10.445MGST10.240
HSD11B10.435HSD11B10.295
CYP2C90.365CYP2C90.195
UGT1A10.345-
GSTM30.325GSTM30.210
-CYP2F10.165
UGT1A30.295-
-UGT1A10.120
thatthisgeneshouldbeselectedverytimesincetheresponseisthe
activityofthisgene.
ThehighselectionratefortheIVGCmethodindicatetherobustofo
urmethodwhen
networkinformationisincorporated.Inadditiontothisgene,othe
rgenessuchasUGT1A5,
SULT2A1,CYP2B6,MGST1,andHSD11B1alsohaveselectionratesla
rgerthan40%.The
stabilityevaluationindicatestherelativeimportanceofthesegenes
intheactivityofenzyme
CYP2E1.Furtherbiologicalvalidationisneededtodisclosetherealr
elationshipofthese
geneswiththeenzymeactivity.Acompletelistofthegenesinthepat
hwayisgiveninTable
3.5.

3.6Discussion

Linetal.(2015)proposedanIVregressionframeworktodealwith
theendogeneityissuein
geneticalgenomicdataanalysis.Giventhatgenesfunctioninnetwo
rktofulltheirjoint
tasks,inthisworkweproposedagraphconstrainedselectionande
stimationmethodunder
IVregressionframework.OurworkisbuiltupontheworkofLinetal.
(2015)whilein-
46
corporatingthepriorknowledgeaboutthegraphstructureofge
nesbelongingtoanetwork
(e.g.,pathway).Weestablishedtheselectionconsistencyunderth
eproposedtwo-stepesti-
mationprocedureandshowedthe
L
1
boundwhichprovidessometheoreticalinsightsinto
thepropertiesofourmethod.Intensivesimulationswereconduct
edtoevaluatethemodel
performancecomparedwiththeIVregressionmethodbyLinetal.(
2015).Sincetheau-
thorshavedemonstratedtheadvantageoftheIVregressionov
eranaiveregressionwithout
consideringinstrumentalvariables(Linetal.2015),wedidnotinclud
ethecomparisonof
ourmethodwiththenaivemethodinthiswork.Weappliedourmethodt
oahumanliver
cohortdatasettodemonstratetheeectivenessofourmethod
.Thestabilityevaluationre-
sultsindicatetherobustnessofourmethodcomparedtotheIVme
thodwithoutconsidering
therealbiologicalregulationrelationship.
Ingeneral,LASSOdoesnothaveoracleproperty.Itispossibletha
ttheestimation
errorsofthecoecientscannotbereducedtozeroevenunderla
rgesamples.SCADand
MCParenaturalalternativeswhichinfactenjoytheoracleproper
ty.Bothpenaltycanbe
appliedinourproceduretoreplacethelassopenalty.Wewillconsider
thesepenaltiesinour
futurework,althoughsubstantialmodcationmaybeneededfor
theproofoftheselection
consistency.
Whenapplyingthegraphconstrainedpenalizationmethod,thegrap
hicalstructureneeds
tobeknownbeforethepenalizedestimationisapplied.Inourrealda
taanalysis,weusethe
pathwayinformationfromtheKEGGpathwaydatabaseaspriorkno
wledgetoestablishthe
graphstructure.Thisisanaturalwayifpriorknowledgeisavailable.
However,whensuch
knowledgeisnotavailable,somestatisticalmethodscanbeappliedto
estimatethenetwork
structurebeforeapplyingourmethod.Suchmethodsinclude,for
example,thethresholding
methodsbyBickelandLevina(2008a,2008b),Rothman,Levinaan
dZhu(2009),andLam
47
andFan(2009).Inaddition,ourcurrentmodelisdevelopedforco
ntinuousresponse.In
humangenetics,manydiseaseresponsesareshownasbinaryvaria
bles.Wewillgeneralize
ourmethodtoageneralizedlinearmodelframeworkinfutureinvest
igations.
48
Table3.5:thegenesinpathwayofMetabolismofxenobioticsbycytoc
hromeP450-Homo
sapiens
CYP1A1cytochromeP450,family1,subfamilyA,polypeptide1[KO:K07
408]
CYP2C9cytochromeP450,family2,subfamilyC,polypeptide9[KO:K17
719]
CYP3A4cytochromeP450,family3,subfamilyA,polypeptide4[KO:K17
689]
CYP1B1cytochromeP450,family1,subfamilyB,polypeptide1[KO:K07
410]
GSTA5glutathioneS-transferasealpha5[KO:K00799]

GSTA2glutathioneS-transferasealpha2[KO:K00799]

GSTA4glutathioneS-transferasealpha4[KO:K00799]

GSTO2glutathioneS-transferaseomega2[KO:K00799]

GSTM4glutathioneS-transferasemu4[KO:K00799]

GSTT2glutathioneS-transferasetheta2(gene/pseudogene)[K
O:K00799]
GSTT1glutathioneS-transferasetheta1[KO:K00799]

GSTM3glutathioneS-transferasemu3(brain)[KO:K00799]

MGST1microsomalglutathioneS-transferase1[KO:K00799]

MGST3microsomalglutathioneS-transferase3[KO:K00799]

GSTP1glutathioneS-transferasepi1[KO:K00799]

GSTM1glutathioneS-transferasemu1[KO:K00799]

GSTM5glutathioneS-transferasemu5[KO:K00799]

MGST2microsomalglutathioneS-transferase2[KO:K00799]

GSTA1glutathioneS-transferasealpha1[KO:K00799]

GSTM2glutathioneS-transferasemu2(muscle)[KO:K00799]

GSTA3glutathioneS-transferasealpha3[KO:K00799]

GSTO1glutathioneS-transferaseomega1[KO:K00799]

GSTT2BglutathioneS-transferasetheta2B(gene/pseudogene
)[KO:K00799]
GSTK1glutathioneS-transferasekappa1[KO:K13299]

EPHX1epoxidehydrolase1,microsomal(xenobiotic)[KO:K01253]

CYP2B6cytochromeP450,family2,subfamilyB,polypeptide6[KO:K17
709]
SULT2A1sulfotransferasefamily,cytosolic,2A,dehydroepiandr
osterone
(DHEA)-preferring,member1[KO:K11822]
CYP1A2cytochromeP450,family1,subfamilyA,polypeptide2[KO:K07
409]
CYP2A6cytochromeP450,family2,subfamilyA,polypeptide6[KO:K17
683]
CYP2E1cytochromeP450,family2,subfamilyE,polypeptide1[KO:K07
415]
CYP2F1cytochromeP450,family2,subfamilyF,polypeptide1[KO:K07
416]
CYP2S1cytochromeP450,family2,subfamilyS,polypeptide1[KO:K07
420]
AKR1C2aldo-ketoreductasefamily1,memberC2[KO:K00089K00212
]AKR1C4aldo-ketoreductasefamily1,memberC4[KO:K00037K00089
K00092
K00212]
AKR1C1aldo-ketoreductasefamily1,memberC1[KO:K00089K00212
]49
Table3.5(cont'd)
DHDHdihydrodioldehydrogenase(dimeric)[KO:K00078[EC:1.1.1.1791.3
.1.20]
CYP2A13cytochromeP450,family2,subfamilyA,polypeptide13[KO:K
17685]
CYP2D6cytochromeP450,family2,subfamilyD,polypeptide6[KO:K17
712]
HSD11B1hydroxysteroid(11-beta)dehydrogenase1[KO:K15680
]CBR1carbonylreductase1[KO:K00079]

CBR3carbonylreductase3[KO:K00084]

UGT2A1UDPglucuronosyltransferase2family,polypeptideA1,com
plexlocus
[KO:K00699]
UGT2A3UDPglucuronosyltransferase2family,polypeptideA3[KO:K0
0699]
UGT2B17UDPglucuronosyltransferase2family,polypeptideB17[KO
:K00699]
UGT2B11UDPglucuronosyltransferase2family,polypeptideB11[KO
:K00699]
UGT2B28UDPglucuronosyltransferase2family,polypeptideB28[KO
:K00699]
UGT1A6UDPglucuronosyltransferase1family,polypeptideA6[KO:K0
0699]
UGT1A4UDPglucuronosyltransferase1family,polypeptideA4[KO:K0
0699]
UGT1A1UDPglucuronosyltransferase1family,polypeptideA1[KO:K0
0699]
UGT1A3UDPglucuronosyltransferase1family,polypeptideA3[KO:K0
0699]
UGT2B10UDPglucuronosyltransferase2family,polypeptideB10[KO
:K00699]
UGT1A9UDPglucuronosyltransferase1family,polypeptideA9[KO:K0
0699]
UGT2B7UDPglucuronosyltransferase2family,polypeptideB7[KO:K0
0699]
UGT1A10UDPglucuronosyltransferase1family,polypeptideA10[KO
:K00699]
UGT1A8UDPglucuronosyltransferase1family,polypeptideA8[KO:K0
0699]
UGT1A5UDPglucuronosyltransferase1family,polypeptideA5[KO:K0
0699]
UGT2B15UDPglucuronosyltransferase2family,polypeptideB15[KO
:K00699]
UGT1A7UDPglucuronosyltransferase1family,polypeptideA7[KO:K0
0699]
UGT2B4UDPglucuronosyltransferase2family,polypeptideB4[KO:K0
0699]
UGT2A2UDPglucuronosyltransferase2family,polypeptideA2[KO:K0
0699]
CYP3A5cytochromeP450,family3,subfamilyA,polypeptide5[KO:K17
690]
AKR7A2aldo-ketoreductasefamily7,memberA2(aatoxinaldehyd
ereductase)
[KO:K15303]
AKR7A3aldo-ketoreductasefamily7,memberA3(aatoxinaldehyd
ereductase)
[KO:K15303]
ALDH3B1aldehydedehydrogenase3family,memberB1[KO:K00129]

ALDH3B2aldehydedehydrogenase3family,memberB2[KO:K00129]

ALDH1A3aldehydedehydrogenase1family,memberA3[KO:K00129]

ALDH3A1aldehydedehydrogenase3family,memberA1[KO:K00129]

ADH1Aalcoholdehydrogenase1A(classI),alphapolypeptide[KO:K1
3951]
ADH1Balcoholdehydrogenase1B(classI),betapolypeptide[KO:K1
3951]
ADH1Calcoholdehydrogenase1C(classI),gammapolypeptide[KO:K
13951]
ADH7alcoholdehydrogenase7(classIV),muorsigmapolypeptide[K
O:K13951]
ADH4alcoholdehydrogenase4(classII),pipolypeptide[KO:K1398
0]
ADH5alcoholdehydrogenase5(classIII),chipolypeptide[KO:K00
121]
ADH6alcoholdehydrogenase6(classV)[KO:K13952]
50
Chapter4

GraphConstrainedRegularizationfor

NonparametricInstrumental

VariablesRegressioninGenetical

GenomicAnalysis

4.1Introduction

Drivenbytheadvancementofmodernbiotechnology,vastamount
ofgeneticandgenomics
dataareroutinelygeneratedwiththehopetounderstandthemole
cularfunctionofliving
organisms.Anyphenotypictraitsrootingeneticsinwhichgenefunc
tionsareconstantly
modedbyexternalfactorssuchasenvironmentalchanges.As
publichealthisbecoming
anincreasingconcernduetovariousissuessuchasfoodsafetyan
dair/waterpollution,
understandingtheimpactofexternalforcesongenefunctionan
dthedegreeandmechanism
bywhichtheseexternalstimuliactonourgenomeisthusparamoun
tlyimportantindisease
preventionandimprovingpublichealth.Geneticalgenomicdatawhich
includegenetic
variants(e.g.,SNPs),geneexpressionandphenotypictraitscoup
lingwithenvironmental
measurements,providepowerfulresourcestodeciphersuchme
chanisms.
51
ThecentraldogmaingeneticstellsusthatDNAtranscribestoRNAa
ndRNAtranslates
toproteinwhicheventuallyleadingtophenotypicproducts.Withgen
eticalgenomicdata,
therelationshipamongthethreecanbemodelledandtestedthroug
hpowerfulstatistical
methods.Ingeneticassociationstudies,therelationshipbetween
aphenotypicresponse
andgeneticvariantscanbemodeledbyasimplelinearorlogisticregres
sionmodeldepend-
ingonthenatureofthedatadistribution.Let
Y
beaquantitativetraitresponseand
G
=(
G
1
;
;G
q
)
T
denote
q
geneticvariants,thentherelationshipbetween
Y
(assuming
centered)and
G
canbedescribedbythefollowingmodel,
Y
=
G

+
"
,where

represent
thegeneticeects.Let
Z
beanenvironmentalvariable.Tounderstandhow
Z
inuences
G
toaect
Y
,Maetal.(2011)introducedavarying-coecientmodeltostudyn
onlinear
environmentalmodulationeecton
G
withthemodel
Y
=
q
X
k
=1

k
(
Z
)
G
k
+
":
(4.1)
Let
X
=(
X
1
;
;X
p
)
T
denote
p
geneexpressionvaluesinageneticalgenomicstudy.As
theintermediatephenotypebetween
G
and
Y
,onenaturalchoiceistousegeneexpression
valuesasthepredictorstondoutwhichgenesareassociatedwith
aphenotype
Y
.Thiscan
bedonebyttingthemodel
Y
=
X

+

,where

representtheeectsofgeneexpressions.
IfweassumeacausalrelationshipwiththeorderofDNA-RNA-phen
otype,i.e.,DNAaects
phenotypeonlythroughgeneexpression,thenthefollowingtwo-s
tagesequentialmodelscan
bettedtodissecttherelationship,i.e.,
Y
=
X
+

;X
=
G

+

;52
wherewelet
Y
n

1
,X
n

p
,and
G
n

q
bematricesof
n
independentandidenticallydis-
tributedphenotypes,geneexpressions,andgenotypes,respe
ctively;

q

p
and

p

1
are
coecientmatrixandvector,

n

p
and

n

1
areerrormatrixandvector.Asarguedin
chapter3,whenunobservablelatentvariablesaectboth
X
and
Y
,G
canbeconsidered
asinstrumentalvariablesandtheeectsof
X
on
Y
canbeconsistentlyestimated.Linet
al.(2015)proposedaninstrumentalvariables(IV)regressionfr
ameworktoachievevariable
selectionandestimationwhenbothdimensionsof
p
and
q
arelargetosolvetheendogeneity
problem.Intherststage,apenalizedvariableselectionalgorithmis
appliedtoeachgene
expression.Then
X
isreplacedbythettedvalues
^
X
inthesecondstageandthemodelof
interestbecomes
Y
=(
G
^

)

+

:(4.2)
Whentheeectofanenvironmentalvariable
Z
isconsidered,wecanincorporatemodel
(4.1)tounderstandhow
Z
modulatesgeneeecttoaect
Y
.Thusmodel(4.2)canbe
modedtoreectthisby,
Y
=
G
^


(
Z
)+

:where

(

)are
p
-dimensionalfunctionswhichareassumedtobesmooth.Thus,und
erthe
IVregressionframework,itisnaturaltoconsiderthefollowingmod
el,
Y
=
X
(
Z
)+

;X
=
G

+

;(4.3)
Aswearguedinchapter3,genesfunctioninanetworkfashiontofu
lltheirjointtasks.
Astheintermediatephenotype,geneexpressionscouldbehighlyco
rrelatedforgenesbe-
53
longingtothesamenetworkorpathway.Thus,anetworkconstra
inedpenalizationmethod
ismoreappealingtosolvetheissuesbroughtbythegraphicalstruc
tureofgeneexpressions.
Focusingonmodel(4.3),inthischapterweapplyinstrumentalvaria
blestosolvetheproblem
causedbyendogeneityandproposeagraphconstrainedregulariz
ationmethodtosolvethe
problemcausedbyhighcorrelation.Theeectofageneexpression
ismodeledbyavarying
coecientfunctionwhichreectsthenonlinear(ifany)moderation
eectof
G
on
Y
.We
proposeatwo-stepproceduretoachievevariableselectionandes
timation.Intherststep,
undertheassumptionthateachgeneisonlycontrolledbyafewgene
ticvariants,weuse
LASSOtoestimatethecoecientmatrix

.ThenweapplygroupLASSOandgraphcon-
strainedregularizationtoachievevariableselectionandestimationf
orthevaryingcoecient
functions.Afterreplacing
X
withthettedvalues
X
,weuseB-splineexpansionstoapprox-
imatethecoecientfunctions.Finallyweimposetwopenalties,group
LASSOpenaltyand
graphconstrainedpenaltyonB-splinecoecients.Thegraphcons
trainedpenaltyisused
tohandlethecorrelationproblemandencouragethecoecientsof
apairofvariableswhich
areconnectedinthenetworktoshrinktowardthesamevalues.
Thechapterisorganizedasthefollowing.In4.2,weintroducethemo
delandthetwo-
stepestimationmethod.Theoreticalconsistencyresultsaregive
nin4.3.Simulatingdata
andrealdataanalysesareusedtojustifyourmethodin4.4and4.5,
respectively,followed
byabriefdiscussionin4.6.ProofsarerenderedintheAppendixC.
54
4.2NonparametricModelsandGraphConstrainedReg-
ularization
4.2.1TheModelSetup

Let
Y
=(
Y
1
;:::;Y
n
)
T
,X
=(
X
1
;:::;X
n
)
T
,G
=(
G
1
;:::;G
n
)
T
,and
Z
=(
Z
1
;:::;Z
n
)
T
be
matricesof
n
independentandidenticallydistributed(
i.i.d.
)phenotype,geneexpressions,
genotypes,andenvironmentalvariable,where
Y
i
isascaler,
X
i
=(
X
i
1
;:::;X
ip
)
T
,G
i
=
(
G
i
1
;:::;G
iq
)
T
,and
Z
i
isascaler.Hereweproposethefollowingmodel,
Y
i
=

0
(
Z
i
)+
p
X

j
=1

j
(
Z
i
)
X
ij
+

i
;(4.4)
X
i
=

T
G
i
+

i
;(4.5)
where

=(
1
;:::;

p
)isa
q

p
coecientmatrix,

j
(

)
;j
=0
;1
;:::;p;
aresmoothfunctions.

=(

1
;:::;
n
)
T
and

=(

1
;:::;
n
)
T
areerrorvectoranderrormatrixsatisfying(

T

i
;
i
)
T
are
i.i.d.
p
+1dimensionalrandomvariablesfollowingamultivariatenormaldistrib
ution
N
(0
;).Wewriteas=
0

B

@

11

12

21
˙
22
1

C

A
,where
11
isthecovariancematrixof

i
,˙
22
is
thevarianceof

i
,and
12
=
T

21
isthecovariancebetween

i
and

i
.4.2.2FirstStepEstimation

Undertheassumptionthatonlyafewgeneticvariantsinuencethe
geneexpressions,the
coecientmatrix

issparse.Weadoptanaiveestimationmethodandapplyregularized
leastsquarestoeachgeneexpressionseparately.Theestimation
problemcanbeformulated
55
asthefollowing
p
optimizationproblems,i.e.,
^

j
=argmin

j

1
2
n
n
X
i
=1
tr
f
(
X
j

G

j
)
T
(
X
j

G

j
)
g
+

j
k

j
k
1

;j
=1
;:::;p:
(4.6)
where

j
;j
=1
;::;p
aretuningparameters,
kk
1
referstothe
L
1
normand
X
j
=(
X
1
j
;:::;X
nj
)
T
;j
=
1
;:::;p
.Theestimateofisdenotedas
^

=(
^

1
;:::;
^

p
).
4.2.3SecondStepEstimation

Intherststep,weobtainanestimateofthecoecientmatrix

.Thettedvalues
^
X
=
G
^

isthenusedtosubstitute
X
inthesecondstep.
Theoptimizationproblemweareinterestedinis
^

(

)=argmin

(
)
n
1
2
n
n
X
i
=1

Y
i

p
X
j
=0

j
(
Z
i
)
^
X
ij

2
+
p

(

(

))
o
;(4.7)
where
p

(

(

))=


p
X
j
=1
k

j
(

)
k
+
1


2
X
k
˘
t
j
r
kt
jk

k
(

)

s
kt

t
(

)
k
2

;
(

)=(

0
(

)
;
1
(

)
;:::;
p
(

))
T
,r
kt
isaweightofcorrelationstrengthbetweentwovariables;
s
kt
=sign(
ˆ
kt
);
ˆ
kt
iscorrelationcoecient;
>
0and

2
[0
;1]aretuningparameters;
and
k
˘
t
means
r
kt
6=0.Denethenormofafunction
g
(
t
)as
jj
g
(
t
)
jj
=
q
R
g
2
(
t
)d
t
.Similarformsofoptimizationproblemhaveappearedintheliterature.
Huangetal.
(2011)studiedthepropertyofthisprobleminaparametricsituatio
n.Dayeetal.(2012)
proposedasimilarproblemandanalgorithm.However,theydidnotinc
orporatethege-
neticvariantsasinstrumentalvariables.Weapplythenonparamet
ricgraphconstrained
regularizationinanIVregressionframework.Thisbringschallenges
intheproofofselection
56
consistencysince
^
X
arenoti.i.d.samples.InDayeetal.(2012),theauthorsappliedtheir
methodtolongitudinaldatawhichbroughtsomechallengesincomput
ation.Weadoptan
algorithmsimilartowhatWeietal.(2011)proposedandthecomputa
tionproblemiseased.
Weapproximatesmoothfunctions

j
(

)
;j
=0
;1
;
;p
bypolynomialsplines.Without
lossofgenerality,weassume
Z
2
[0
;1].Let
k
j
beapartitionoftheinterval[0,1],with
K
j
interiorknots
k
j
=
f
0=
k
j;
0
<k
j;
1
<:::<k
j;K
j
<k
j;K
j
+1
=1
g
;j
=0
;1
;:::;p:
Let
B
j
beasetoffunctionson[0,1]whichsatisfythefollowing:(a)thefunc
tionsarepolyno-
mialsof
p
degreesorlessonintervals[
k
j;s
;k
j;s
+1
)
;s
=0
;1
;:::;K
j

1,and[
k
j;K
j
;k
j;K
j
+1
];and(b)thefunctionshave
p

1timescontinuousderivatives.
Let
f
B
jl
(

)
g
L
j
l
=1
betheB-splinebasesof
B
j
with
L
j
=
p
+
K
j
+1.Sothefunction

j
(

)
canbeapproximatedby

j
(

)
ˇ
P
L
j
l
=1

jl
B
jl
(

).Dene
B
j
(

)=(
B
j
1
(

)
;
;B
jL
j
(

))
T
,
j
=(

j
1
;
;
jL
j
)
T
,j
=0
;1
;
;p
,and
R
ij
=
R
B
i
(
t
)
B
T
j
(
t
)d
t
.Hereintegrationofa
matrixmeansintegrationofeveryelementofthatmatrix.Bythede
nitionofthenormof
afunction,wehave
k

j
(
t
)
k
=
s
Z

2
j
(
t
)d
t
ˇ
v

u

u

u

t
Z

L
j
X
l
=1

jl
B
jl
(
t
)

2
d
t
=
q

T
j
R
jj

j
;57
k

i
(
t
)


j
(
t
)
k
2
=
Z
(

i
(
t
)

s
ij

j
(
t
))
2
d
t
ˇ
Z

L
i
X
l
=1

il
B
il
(
t
)

s
ij
L
j
X
l
=1

jl
B
jl
(
t
)

2
d
t
=

T
i
R
ii

i

2
s
ij

T
i
R
ij

j
+

T
j
R
jj

j
:Dene
B
(
t
)=
0

B

B

B

B

B

@
B
01
(
t
)
B
02
(
t
)

B
0
L
0
(
t
)

00

0
.
.
..
.
..
.
..
.
..
.
..
.
.00

0

B
p
1
(
t
)
B
p
2
(
t
)

B
pL
p
(
t
)
:1

C

C

C

C

C

A
Andfurtherdene
U
=(
B
T
(
Z
1
)
^
X
1
;
;B
T
(
Z
n
)
^
X
n
)
T
.Here
^
X
i
=(1
;^
X
i
1
;
;^
X
ip
)
;i
=
1
;
;n
.Let
U
j
=(
^
X
1
j
B
j
(
Z
1
)
;
;^
X
nj
B
j
(
Z
n
))
T
;j
=0
;1
;
;p
.Wehave
U
=(
U
0
;
;U
p
).
Dene

=(

T
0
;:::;
T
p
)
T
.So,theoptimizationproblemin(
??
)ischangedto,
^

=argmin

f
1
2
n
(
Y

p
X

j
=0
U
j

j
)
T
(
Y

p
X
j
=0
U
j

j
)+
p

(

)
g
;(4.8)
where
p

(

)=

h

p
X
j
=1
q

T
j
R
jj

j
+
1


2
X
k
˘
t
j
r
kt
j
(

T
k
R
kk

k

2
s
kt

T
k
R
kt

t
+

T
t
R
tt

t
)
i
:(4.9)
Therstpartin(4.4)isagroupLASSOpenalty(YuaandLin,2007)wit
hkernelmatrices
R
i;i
(
i
=1
;
;p
)sinceHuangetal.(2004)provedthat
R
ii
aresymmetricandpositive
denite.Thesecondpartin(4.4)isagraphconstrainedregularizat
ionpenaltywhichis
appliedtotakeadvantageofthegraphinformationongenestohan
dlethehighcorrelation
58
problem.
Wechoosethesamebasicfunctionsforall
p
+1functions

j
(

)
;j
=0
;1
;
;p
tosimplify
ouralgorithm.Hence,wesimplifythenotationsanddene
K
=
K
j
;L
=
L
j
;f
B
l
(

)
g
L

l
=1
=
f
B
jl
(

)
g
L
j
l
=1
;R
=
R
ij
;i;j
=1
;
;p
.Thistreatmentismoreoncomputationalpurposesince
itiscomputationallychallengetoselecttheorderandknotsforallth
e
p
+1functions.
Let
R
=
R
1
=
2
R
1
=
2
since
R
isasymmetricpositivedenitematrix,and
˘
j
=
R
1
=
2

j
and
V
j
=
U
j
R

1
=
2
;j
=0
;
;p
.(
??
)and(4.4)changeto
˘

=argmin
˘
f
1
2
n
(
Y

p
X

j
=0
V
j
˘
j
)
T
(
Y

p
X
j
=0
V
j
˘
j
)+~
p

(
˘
)
g
(4.10)
~
p

(
˘
)=

h

p
X

j
=1
k
˘
j
k
2
+
1


2
X
k
˘
t
j
r
kt
j
(
˘
T
k
˘
k

2
s
kt
˘
T
k
˘
t
+
˘
T
t
˘
t
)
i
(4.11)
Furthermore,weassumeeach
V
j
isorthonormalized,i.e.
1
n
V
T
j
V
j
=
I;j
=0
;1
;
;p
.TheorthonormalizationcanbeachievedusingQRdecomposition.
Weapplycoordinatedescentalgorithmtosolvetheoptimizationprob
lem.If
h
2
f
1
;
;p
g
,wehave
1
2
n
(
Y

p
X

j
=0
V
j
˘
j
)
T
(
Y

p
X

j
=0
V
j
˘
j
)+
p

(
˘
)
=
1
2
n
(
Y

V
h
˘
h

X
j
6=
h
V
j
˘
j
)
T
(
Y

V
h
˘
h

X
j
6=
h
V
j
˘
j
)+

h

k
˘
h
k
2
+

p
X
j
6=
h
k
˘
j
k
2
+
1


2
X
h
˘
j
j
r
hj
j
(
˘
T
h
˘
h

2
s
hj
˘
T
h
˘
j
+
˘
T
j
˘
j
)+
1


2
X
k
˘
t
k;t
6=
h
j
r
kt
j
(
˘
T
k
˘
k

2
s
kt
˘
T
k
˘
t
+
˘
T
t
˘
t
)
i
:59
Takingthederivativewithrespectto
˘
h
,weobtain
@
@˘
h
h
1
2
n
(
Y

p
X
j
=0
V
j
˘
j
)
T
(
Y

p
X
j
=0
V
j
˘
j
)+
p

(
˘
)
i
=

1
n
V
T
h
(
Y

X
j
6=
h
V
j
˘
j
)


(1


)
X
h
˘
j
r
hj
˘
j
+(1+

(1


)
X
h
˘
j
j
r
hj
j
)
˘
h
+

˘
h
k
˘
h
k
2
:Let
@@˘
h
h
1
2
n
(
Y

P
p

j
=0
V
j
˘
j
)
T
(
Y

P
p

j
=0
V
j
˘
j
)+
p

(
˘
)
i
=0.Wesolvetheequation
andobtainthesolutionas
˘
h
=
1
1+

(1


)
P
h
˘
j
j
r
hj
j

1


k
1
n
V
T
h
(
Y

P
j
6=
h
V
j
˘
j
)+

(1


)
P
h
˘
j
r
hj
˘
j
k
2

+

h
1
n
V
T
h
(
Y

X
j
6=
h
V
j
˘
j
)+

(1


)
X
h
˘
j
r
hj
˘
j
i
:If
h
=0,wehave
˘
0
=
1
n
V
T
0
(
Y

P
j
6=0
V
j
˘
j
).SimilartoFriedmanetal.(2007)and
Friedmanetal.(2010),weset

max
=max
j
k
V
T
j
Y
k
2
=
(
n
)and

min
=

max
,where

=0
:001andconstructasequenceof
K
valuesof

decreasingfrom

max
to

min
onthe
logscale.Weusecrossvalidationtochoosethetuningparameters

and

.Afterobtaining
˘

,wehave^

j
=
R

1
=
2
˘

j
;j
=0
;1
;
;p
.4.3TheoreticalResults

LikeDayeetal.(2012),weassumethecoecientfunctionsthemse
lvesaresplinefunc-
tionswithxednumberofknots.Asinthealgorithm,wefurtherass
umeallcoecient
functionshavethesamebasefunctions.Let
U
=(
B
T
(
Z
1
)
^
X
1
;
;B
T
(
Z
n
)
^
X
n
)
T
.Nowthe
60
optimizationproblemchangesto
min

f
1
2
n
k
Y

U
k
2
+

1
p
X
j
=1
q

T
j
R
jj

j
+

2

T
L
R

g
;(4.12)
where

1
=

and

2
=

1


2
.Thegraphicalinformationisincludedinthematrix
L
R
.Herewegivethetheoryonthepropertiesofselectionandestimatio
nofourmethod.
OurproofcloselyfollowsthatinLinetal.(2015)andisgivenintheAppe
ndixC.Werst
introducesomenecessarynotationsandassumptionsthatwillbeu
sed.Thispartoverlaps
withsomeinChapter3.Tomakethischapterself-contained,weintr
oducethemhereagain.
Foramatrix
A
,k
A
k
1
=max
j
P
i
j
a
ij
j
and
k
A
k
1
=max
i
P
j
j
a
ij
j
.Foravector

,amatrix
A
,andindexsets
I
and
J
,
I
and
A
IJ
denotethesubvectorandsubmatrix.
J
c
denotes
thecomplementof
J
and
j
J
j
denotesthenumberoftheelementsin
J
.Denetherestricted
eigenvalueforamatrix
A
n

m
and1

s

m
by

(
A;s
)=min
jJ
j<s
min

6=0
k

J
c
k
1

3
k

J
k
1
k
A
k
2
p
n
k

J
k
2
.Let
r
=max
1

j

p
j
supp(

j
)
j
,s
=
j
supp(

)
j
and
s
isthenumberofsupportivegroups,
and
˙
max
=max
1

j

p
˙
j
,wheresupp(

)isthesupportiveset.Weassume
k

k
1

M
1
and
k

k
2

M
2
forsomeconstants
M
1
and
M
2
,and
k
L

k
2
<C
L
forsomeconstant
C
L
.Inordertocontroltheestimationerrorintherststep,thefo
llowingconditionis
imposed.Let
N
1
=max
fk
R
1
=
2
k
2
;k
R

1
=
2
k
2
g
and
N
2
=max
fk
B

k
2
;k
B

T
k
2
g
,where
B

=
0

B

B

B

B

B

@
B
1
(
t
1
)

B
l
(
t
1
)
.
.
..
.
.B
1
(
t
n
)

B
l
(
t
n
)
1

C

C

C

C

C

A
;t
i
2
[0
;1]
;i
=1
;
;n
.Thefollowingconditionisusedtocontroltheerrorsfromthersts
tep.
(C1)

(
G
;r
)


forsome
>
0.
Weconsiderthelowdimensionalsituationwherethedimensionisxed.
Inordertoobtain
61
theselectionconsistencyandthe
L
1
bound,weneedtoassumeonemoreassumption.Dene
C
=
1
n
U
(
G

)
T
U
(
G

),where
X
isreplacedwith
G

,and
S
=supp(

)whichistheindex
ofsupportivegroups.Let
b
0
=min
j
2
S
k

j
k
2
.Thefollowingconditionwhichisinspirit
similartotheirrepresentableconditioninZhaoandYu(2006),isneed
ed.
(C2)
k
(
C
tS
+2

2
L
R

tS
)(
C
tS
+2

2
L
R

SS
)

1
k
2
<
1


s
2
=
3
N
2
1
,8
t
2
S
C
,where

isaconstant
and0
<<
1.
SimilarlytotheargumentinZhaoandYu(2006),(C2)requiresthe\r
egressioncoe-
cients"oftheirrelevantcovariatesontherelevantcovariatesar
elessthan1.Thefollowing
theoremstatestheselectionconsistencyof
^

.Theorem3.
Iftheregularizationparametersselectedintherststeps
atisfysomeregular-
izationconditions;theregularizationparametersinthes
econdstepareselectedas

1
=
O
(
n

1
=
2
)
and

2
C
L
h
(1

3
4

)
1
p
s
+1
i


(16

3

)
16(4


)

1
;
b
0
hasalowerboundthat
b
0
>
2(4


)
8

3

˚
h
8


2(4


)

1
p
s
+2

2
C
L
i
,andfurtherassume(C1)and(C2)hold,thenwithproba-
bilityapproachingto1,theestimated
^

satises
sign(
^

)=sign(

)
;k
^

S


S
k
1

2(4


)
8

3

˚
h
8


2(4


)

1
p
s
+2

2
C
L
i
s
1
:5
N
1
:4.4Simulation

Todenethecoecientfunctions

(

)=(

0
(

)
;
1
(

)
;
;
p
(

))
T
,weusethefunctionsde-
nedinDayeetal.(2012).Thefollowingfunctionsareusedtoconst
ructthetruecoecient
functions.
62
Polynomial:
f
pn
1
(
z
)=2
z

1
;f
pn
2
(
z
)=(1

2
z
)
3
Low-Frequency:
f
lf
1
(
z
)=sin(2
ˇz
)
;f
lf
2
(
z
)=cos(2
ˇz
)
High-Frequency:
f
hf
1
(
z
)=sin(8
ˇz
)
;f
hf
2
(
z
)=cos(8
ˇz
)
Compound:
f
cp
(
z
)=(1

2
z
)
3

cos(4
ˇz
).
Now,wedenethetruecoecientsasthefollowing.

0
=2
f
cp
;
3
=(1


)
f
lf
2
+
f
lf
1
;
1
=(1


)
f
lf
2
+
f
pn
1
;
4
=(1


)
f
lf
2
+
f
hf
2
;
2
=(1


)
f
lf
2
+
f
hf
1
;
5
=(1


)
f
lf
2
+
f
pn
2
:where

0
istheinterceptfunctionandothersarethose5nonzerofunction
s.

isaconstant.
If

=0,allthevecoecientfunctionsarethesame.If

6=0,thecoecientfunctionsare
notthesamesowecanassesstherobustnessofourmethod.
Wegeneratethegraphicalstructureaswedidinchapter3with3gr
oups.Thedegreeof
correlationwithineachgroupsiscontrolledbythenumberofSNPsth
eyshareincommon
whichissetas4.Forthethreegroupswithagraphicalstructure,
5variablesaresimulatedin
eachgroup.Weassumetwogroupsofvariableshavenonzerocoe
cientfunctionsasspeced
bythevefunctionsgivenabove.Therestvariableshavezerocoe
cients.Wesimulatetwo
cases.Intherstcase,weset

as0.Inthesecondcase,weset

as0.25toevaluate
theselectionperformanceunderderentsituations.
;
;G
;X
,and
Y
aregeneratedasin
chapter3.
WecompareourmethodnamelyIVregressionwithgraphconstraine
dpenalty(IVGC)
withtheIVregressionmethodwithoutgraphconstrainedpenalty(
IV).Thefollowingve
measuresareusedtoevaluatetheperformanceofthetwometho
ds.Sensitivityisdened
63
astheratiooftruepositivestothenumberofnonzerocoecients
.Speccityisdened
astheratiooftruenegativestothenumberofzerocoecients.G
-measureisdenedas
p
sensitivity

specificity
.Thecloserto1aG-measureis,thebetterthevariableselection
performanceis.Modelerrorisdenedas
n

1
=
2
k
X
^

(
Z
)

X
(
Z
)
k
2
.Weusecrossvalidation
tochoosethetuningparameters.Foreachcase,wesimulate200t
imesandreporttheaverage
valueandstandarderror.
Table4.1:Simulationresultsfor
p
=100,
q
=100and

=0
nMethodEstimationLossModelErrorSensitivitySpeccityG-meas
ure
800IVGC1.10(0.57)1.39(0.23)0.99(0.06)0.69(0.09)0.68(0.09)
IV4.45(1.13)2.36(0.28)0.74(0.11)0.67(0.10)0.5(0.07)
1000IVGC1.02(0.74)1.29(0.26)1.00(0.04)0.71(0.08)0.71(0.08)
IV4.04(1.09)2.24(0.27)0.79(0.11)0.68(0.08)0.53(0.07)
1200IVGC0.84(0.29)1.17(0.15)1.00(0.00)0.70(0.07)0.70(0.07)
IV3.63(1.11)2.10(0.29)0.83(0.11)0.66(0.07)0.55(0.08)
1400IVGC0.80(0.58)1.11(0.21)1.00(0.03)0.72(0.08)0.72(0.08)
IV3.46(1.04)2.02(0.28)0.85(0.09)0.67(0.08)0.57(0.07)
1600IVGC0.67(0.22)1.03(0.14)1.00(0.00)0.70(0.08)0.70(0.08)
IV3.12(0.90)1.91(0.25)0.88(0.08)0.64(0.08)0.56(0.07)
Table4.1showstheresultswhen
p
=100,
q
=100and

=0.IVGCobtainssmaller
estimationlossandmodelerrorcomparedtoIV.ThisimpliesthatIVG
Chasabetter
estimationperformancethanIVdoes.ThesensitivityofIVGCishigh
erthanthatofIV.
Thisiswhatweexpected.HerewesetthenumberofcommonSNPsas
4.Whencorrelation
isstrong,graphconstrainedregularizationcanobtainbettervar
iableselectionperformance.
Forspeccity,thetwomethodsperformquitesimilar.IVGChasahig
herG-measurevalue
whichmeansIVGChasabetterperformanceinvariableselection.In
Figure4.1,wecansee
clearlythatIVGChasabettervariableselectionandestimationperf
ormancethanIVdoes.
64
Figures4.2and4.3showstheintegratedmeansquarederrorsfort
ennonzerocoecient
functions.WecanalsoseethatIVGCoutperformsIVintheestimat
ionofallnonzero
coecients.Table4.2showstheselectionrateswhichisdenedasth
etimesagenebeing
selecteddividedbythetimesofsimulation.IVGCcanselectallnonzer
ocoecientsalmost
everytime.However,IVhasapoorselectperformance.
Table4.2:Selectionratefor
p
=100,
q
=100,
n
=800and

=0
Method
SelectionRate

1

2

3

4

5

6

7

8

9

10
IVGC1.000.980.991.000.981.000.991.001.000.98
IV1.000.110.580.870.691.000.781.000.960.45
Forthecase
p
=100,
q
=100,

=0
:25,Tables4.3,4.4andFigures4.4-4.6showthe
result.Weobservesimilarpatternsasthecasewhen

=0.Overall,IVGChasbetter
variableselectionandestimationperformancesthanIVdoes.
Table4.3:Simulationresultsfor
p
=100,
q
=100and

=0
:25
nMethodEstimationLossModelErrorSensitivitySpeccityG-meas
ure
800IVGC1.08(0.55)1.61(0.19)0.99(0.07)0.66(0.08)0.65(0.09)
IV2.95(0.66)2.18(0.19)0.72(0.11)0.65(0.08)0.47(0.07)
1000IVGC0.84(0.26)1.47(0.12)0.99(0.04)0.69(0.08)0.68(0.08)
IV2.49(0.52)2.00(0.16)0.80(0.12)0.66(0.09)0.52(0.09)
1200IVGC0.70(0.14)1.39(0.09)1.00(0.00)0.69(0.08)0.69(0.08)
IV2.18(0.52)1.89(0.17)0.84(0.10)0.65(0.09)0.54(0.07)
1400IVGC0.70(0.19)1.37(0.10)1.00(0.00)0.69(0.08)0.69(0.08)
IV1.97(0.51)1.83(0.17)0.88(0.1)0.65(0.08)0.57(0.08)
1600IVGC0.63(0.13)1.32(0.08)1.00(0.00)0.69(0.07)0.69(0.07)
IV1.77(0.41)1.74(0.14)0.90(0.08)0.64(0.07)0.57(0.07)
Insummary,simulationresultsshowthattheproposedIVGCmetho
dcanachievebetter
selectionresultsandestimationaccuracycomparedtotheIVmeth
odwithoutconsidering
65
0.20.40.60.81.0
SensitivitySample size
Sensitivity4006008001000120014001600
IVGC

IV0.500.550.600.650.700.75
SpecitivitySample size
Specitivity4006008001000120014001600
IVGC

IV0.00.20.40.60.8
G-measureSample size
G-measure4006008001000120014001600
IVGC

IV0246810
Estimation lossSample size
Estimation loss4006008001000120014001600
IVGC

IV012345
Model Error
Sample size
Model error4006008001000120014001600
IVGC

IVFigure4.1:Simulationresultsfor
p
=100,
q
=100and

=0
thegraphicalstructure.Asgenegraphstructurealwaysprese
ntsinreality(wemaynotknow
theexactstructurethough),ourmethodthatincorporatesgr
aphpenaltyshouldalwaysbe
preferredtoachievebetterestimationandselection.
66
0.00.20.40.60.81.0
IMSE of Beta1Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta2Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta3Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta4Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta5Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta6Sample size
IMSE4006008001000120014001600
IVGC

IVFigure4.2:Simulationresultsfor
p
=100,
q
=100and

=0onestimationperformance
4.5ApplicationtoRealData

Inthissection,weapplyourmethodtoaHumanLiverCohortdatase
t.Forthedetails
ofthedata,pleaserefertosection3.5inchapter3.Inthisanalysis
,wepickageasthe
Z
67
0.00.20.40.60.81.0
IMSE of Beta7Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta8Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta9Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.20.40.60.81.0
IMSE of Beta10Sample size
IMSE4006008001000120014001600
IVGC

IVFigure4.3:Simulationresultsfor
p
=100,
q
=100and

=0onestimationperformance
Table4.4:Selectionratefor
p
=100,
q
=100,
n
=800and

=0
:25
Method
SelectionRate

1

2

3

4

5

6

7

8

9

10
IVGC1.001.000.960.961.000.980.981.001.000.98
IV1.001.000.180.211.000.790.490.801.000.76
variableandtrytounderstandhowgeneeectschangeoveraget
oaecttheenzymeCYP2E1
activity.CYP2E1isoneofthecytochromeP450proteinswhichcata
lyzemanyreactions
involvedinthesynthesisofcholesterol,steroidsandotherlipids.As
forthegeneexpressions,
wefocusontheMetabolismofXenobioticsbyCytochromeP450path
way.Followingour
two-stageestimationalgorithm,weapplyunivariatelinearregressio
ntoselectthetop1,000
68
0.20.40.60.81.0
SensitivitySample size
Sensitivity4006008001000120014001600
IVGC

IV0.500.550.600.650.700.75
SpecitivitySample size
Specitivity4006008001000120014001600
IVGC

IV0.00.20.40.60.8
G-measureSample size
G-measure4006008001000120014001600
IVGC

IV02468
Estimation lossSample size
Estimation loss4006008001000120014001600
IVGC

IV1.01.52.02.53.03.54.0
Model Error
Sample size
Model error4006008001000120014001600
IVGC

IVFigure4.4:Simulationresultsfor
p
=100,
q
=100and

=0
:25
SNPsforeachgenetranscript,thenestimatetheircoecientsus
ingLASSObyttingallthe
1,000transcriptsinthesamemodel.Thettedgeneexpressionvalu
eswerethenusedfor
thesecondstagevariableselectionandestimation.Thesameasinch
apter3,thegraphical
69
0.00.10.20.30.40.5
IMSE of Beta1Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.10.20.30.40.5
IMSE of Beta2Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.10.20.30.40.5
IMSE of Beta3Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.10.20.30.40.5
IMSE of Beta4Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.10.20.30.40.5
IMSE of Beta5Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.10.20.30.40.5
IMSE of Beta6Sample size
IMSE4006008001000120014001600
IVGC

IVFigure4.5:Simulationresultsfor
p
=100,
q
=100and

=0
:25onestimation
structureforthepathwaywasmanuallyobtainedfromtheKEGGda
tabase.Ingeneral,if
twogenesareconnectedeitherdirectlyorindirectlyinthepathway
,weconsiderthemas
correlatedandthecorrelationisassumedtobe1whichisusedasthe
weightinthegraph
70
0.00.10.20.30.40.5
IMSE of Beta7Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.10.20.30.40.5
IMSE of Beta8Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.51.01.52.0
IMSE of Beta9Sample size
IMSE4006008001000120014001600
IVGC

IV0.00.10.20.30.40.5
IMSE of Beta10Sample size
IMSE4006008001000120014001600
IVGC

IVFigure4.6:Simulationresultsfor
p
=100,
q
=100and

=0
:25onestimation
constrainedpenaltyfunction.
Stabilityisusedtoshowtheselectionrobustnessoftheselectedge
nes.Werandomly
select80%ofthedatausedtorunouralgorithm.Werepeatdoingth
is100timesand
choosethosevariablesappearthemostoften.Table4.5liststheto
p10genesusingeach
method.Weobserveverysimilarlistofgenesbythetwomethods,ex
ceptthatIVGChas
higherselectionratethanIVdoes.Amongthetopgenes,CYP2E1h
asthehighestselection
rate.Thisisnotsurprisesincetheresponseistheactivityofthisge
ne.Thehigherselection
rateforthisgenesusingIVGCmethodindicatestherobustnessof
ourmethod.Figure4.7
showsthegraphicalstructureforthetop10selectedgenesusin
gtheIVGCmethodextracted
71
fromtheMetabolismofXenobioticsbyCytochromeP450pathwayint
heKEGGpathway
database.CYP2E1hasmoreconnectionsinthegraphandincorpor
atingthegraphstructure
inthepenalizedregressiongreatlyenhancestheselectionrateofr
elevantgenesthatfunction
togetherwithCYP2E1toaecttheenzymeactivityofCYP2E1.
Table4.5:Listoftop10genesbasedonthestabilityselectionrate
IVGCIV
SelectedgenesSelectionrateSelectedgenesSelectionrate
CYP2E10.975CYP2E10.840
UGT2A20.855UGT2A20.635
CYP2F10.740CYP2F10.520
UGT1A90.715UGT1A90.475
GSTM20.690GSTM20.455
UGT1A50.645UGT1A50.355
CYP2S10.635CYP2S10.350
GSTA40.610GSTA40.285
GSTM30.560GSTM30.205
UGT1A40.485MGST20.150
Tounderstandhowthegenefunctionsoverage,weplottheestima
tedcoecientfunctions
forthetop2genesCYP2E1andUGT2A2inTable4.5.AsshowninFigure
4.8,theestimated
functionsoverageforthetwogenesshowanonlinearpattern.Th
eoveralldecreasingpattern
forCYP2E1aspeopleagingindicatesthenonlinearnegativeeectof
ageongeneexpression.
Intheotherwords,aspeopleaging,theexpressionofCYP2E1red
ucesandthecorresponding
enzymeactivitygetsweak.MoststudiesonCYP2E1activityinhuman
arefocusedonfetus
orinfantsandtheresultsdonotapplytoourcase.Studiesinratsh
aveshowndecrease
inCYP2E1activityasratsagingandthismightbecorrelatedwithanac
cumulationof
oxidativedamage(Wauthieretal.2004).Thedecreasingeectoft
hisgeneaspeopleaging
mightalsocontributestotheaccumulativeoxidativedamage.Furth
erbiologicalvalidation
isneededtoconrmthis.
72
CYP2E1UGT2A2CYP2F1UGT1A9GSTM2UGT1A5CYP2S1GSTA4
GSTM3UGT1A4Figure4.7:Graphicalstructureofthetop10genesselectedbyIV
GC.
0.00.20.40.60.81.0
0.00.51.0
CYP2E1Age/80b(Age80)0.00.20.40.60.81.0
0.00.51.0
UGT2A2Age/80b(Age80)Figure4.8:TheestimatedcoecientfunctionsforgeneCYP2E1and
UGT2A2.
Inaddition,comparingtotheresultsweobtainedinchapter3,more
geneshaveaselection
ratelargerthan50%.Thisindicatesthatincorporatingthenonlinea
rmodulationeectofage
73
inthethemodeladdsmorepowertoidentifythepotentialtruth.F
urtherlabinvestigations
areneededtorevealtherealmechanismofthesegenefunctions
.4.6Discussion

Weintroduceamethodwhichcombinesinstrumentalvariablesandgr
aphconstrainedreg-
ularizationtoselectimportantgenesassociatedwithaphenotypeo
finterest.Weallowthe
coecientstovaryastheenvironmentalvariablechanges.Modelin
gthenonlinearmodca-
tioneectofenvironmentalvariableallowsustostudyhowgenesint
eractwithanenviron-
mentalvariabletoaectaphenotypictrait.Weproposeatwo-ste
pestimationprocedure.
Selectionconsistencyisestablished.Simulationresultsshowthatou
rmethodperformsbet-
tercomparedtoIVregressionwithoutconsideringnetworkstruc
ture.Ahumanlivecohort
datasetisusedtodemonstratetheutilityofourmethod.
Ingeneral,groupLASSOdoesnothaveoracleproperty.GroupSC
ADandgroupMCP
arealternativeswhichenjoytheoracleproperty.Thesepenaltyt
ermscanbeappliedinour
proceduretoreplacegroupLASSOpenalty.Thesewillbeinvestigat
edinourfuturestudies.
Inourrealdataanalysis,weapplytheknowngraphstructureobt
ainedfromtheKEGG
pathwaydatabase.Ifthegraphicalstructurecannotbeobtain
edinpractice,itcanbe
learnedfromthedataitself.Forexample,wecanapplymethodsliket
hethresholding
methodsproposedbyBickelandLevina(2008a,2008b),Rothman
,LevinaandZhu(2009),
andLamandFan(2009),tolearnthecorrelationstructure.Then
thelearnedstructurecan
beincorporatedintotheproposedIVGCproceduretoachievebet
tergeneselectionresults.
Therobustnessofthemodelifwronggraphicalstructuresarea
ppliedisanothertopicthat
isworthyexploringinthefuture.
74
Chapter5

Conclusionandfuturework

Estimationandapplicationofgraphicalstructuresarethemaintop
icsinthisdissertation.
Chapter2isabouttheestimationofgraphsandChapters3and4ar
eabouttheapplication
ofgraphs.InChapter2,weproposedacovariate-adjustedPCa
lgorithmtoestimatedirected
graphswhichprovidescausalinformation.InChapter3,weconsid
erthevariableselection
andestimationprobleminaframeworkofinstrumentalvariablesreg
ression.Agraphcon-
strainedregularizationisappliedtotakeadvantageofthepriorkno
wledgeongenenetworks.
InChapter4,weextendtheworkinChapter3toanonparametricm
odelwherevarying
coecientmodelisadoptedtocapturethenonlinearmodulationee
ctofanenvironment
variableongenefunction.
Regularizationmethodsarecommonlyappliedinhigh-dimensionalregr
essionstoover-
comethedcultiesbroughtbythe\curseofdimensionality".Thesp
iritisencouraging
sparsity.Whenweestimategraphicalstructures,thedimensiono
ftheedgeswewantto
estimateismuchhigherthantheamountofthedatainthesample.As
sumingmostofthe
edgesdonotexist,wecansuccessfullyestimatethepositionsofth
eexistentedges.However,
theproblembecomesmoredcultwhenthetotalnumberofedgesa
reincreasingatavery
highratealongwiththeincreaseofthesamplesize.Screeningisagoo
dmethodtohandle
thisproblem.Afterscreeningthevariables,wehavelessvariablest
oconsider.Thecombi-
nationofscreeningmethodsandregularizationmethodsisanintere
stingtopictoexplorein
thecontextofgeneticalgenomicanalysis.
75
Anotherwaytoestimategraphisthroughhypothesistesting(Liu,
2013).Wemay
constructtestprocedurestotesttheexistenceofeachedge,
andcontrolthefamily-wiseerror
rateorfalsediscoveryrate.Thisrelatestheestimationofgraphic
alstructurestoanother
importantstatisticstopicinmultipletesting.Asmanymethodsformu
ltipletestinghave
beenourishedinliterature,howtoapplythemingeneticalgenomica
nalysisisachallenge
topic.Thedcultyreliesonthetwo-dimensionalscanoverthousan
dsofgeneexpressions
andmillionsofSNPvariants.Amongthemethodsbeingstudied,Gauss
ianrandomelds
andtherelatedgeometryhavebeenusedtocontrolfamily-wiseer
rorrate(AdlerandTaylor,
2007),andseemstobepromisingtoinvestigateinthiseld.
TheworksinChapter3andChapter4considerquantitativetraits.
Theycanbegen-
eralizedtogeneralizedlinearmodelswithbinarydiseaseresponseswh
icharecommonly
observedinhumangeneticsstudies.Suchanextensionshouldbeco
mputationallyfeasible
andeasytoimplement.However,theoreticalevaluationshouldbec
hallenging.Wewill
considertheIVGCmethodinaGLMmodelsetupandinvestigatethem
odelperformance
boththeoreticallyandempiricallyinthefuture.
Insummary,thisthesiscontributestothetheoryofgraphestima
tionandpenalized
regressionincorporatinggraphicalstructure,aswellastogene
ticalgenomicsdataanalysis
inapplication.Wewillconsidertheaforementionedfutureworkande
xtensions,andcontinue
statisticalinvestigationsalongtheline.
76
APPENDICES
77
AppendixA

ProofofTheorem1

Thefollowingsaretherequiredconditions.
(C1)
1
n
X
T
X
!
C
as
n
!1
,where
C
isapositivedenitematrix;
(C2)
1
n
max
1

i

n
x
T

i
x
i
!
0;
(C3)Thedistribution
P
n
oftheerrorterminthemodelisnormallydistributedand
faithfultoaDAG
G
n
forall
n
;(C4)
p
n
=
O
(
n
a
),forsome0

a<
1
,and
q
isxed;
(C5)Dene

n
=max
1

j

p
n
j
adj
(
G;j
)
j
,where
adj
(
G;j
)representstheadjacentsetof
j
in
G
.
n
=
O
(
n
1

b
),forsome0
<b

1.
(C6)Let
ˆ
i;j
jk
bethepartialcorrelationbetween

i
and

j
given
f

r
;r
2
k
g
,where
k
is
asubsetof
f
1
;:::;p
(
n
)
gnf
i;j
g
.Thepartialcorrelationsarebounded,i.e.,
(i)inf
i;j;
k
fj
ˆ
i;j
jk
j
:ˆ
i;j
jk
6=0
g
c
n
,where
c

1
n
=
O
(
n
d
),0
<d<b=
2,and
b
isasin(C5);
and

(ii)sup
i;j;
k
j
ˆ
i;j
jk
j
M<
1.
Remarks
:Conditions(C1)and(C2)wereassumedinKnightandFu(2000),an
dare
appliedtoobtainaneectiveestimateofthecoecientmatrixinour
rststep.Conditions
(C3)-(C6)arethesameasAssumptions(A1)-(A4)inKalischandB
uhlmann(2007)where
theywereusedtoestablishtheconsistentresults.Wealsoneedth
eseconditionstoestab-
lishtheestimateconsistencyforourestimatedCPDAG.Theestablis
hmentofourtheorem
parallelstothatinKalischandBuhlmann(2007)exceptthatwehave
totakeintoaccount
78
theerrorsbroughtinbytheestimationofthemeansintherststa
ge.
ProofoftheLemma1.
z
i
=
y
i

XB
i
;z
j
=
y
j

XB
j
;^
z
i
=
y
i

X
^
B
i
;^
z
j
=
y
j

X
^
B
j
;8
i;j
=
1
;
;p:
^
ˆ
n
;i;j
(
^
B
)=
1
=n
P
n

t
=1
(^
z
i
t

^
z
j
)(^
z
j
t

^
z
j
)
1
=n
q
P
n

t
=1
(^
z
i
t

^
z
i
)
2
P
n

t
=1
(^
z
j
t

^
z
j
)
2
=
a
b
^
ˆ
n
;i;j
(
B
)=
1
=n
P
n

t
=1
(
z
i
t

z
i
)(
z
j
t

z
j
)
1
=n
q
P
n

t
=1
(
z
i
t

z
i
)
2
P
n

t
=1
(
z
j
t

z
j
)
2
=
c
d
^
ˆ
n
;i;j
(
^
B
)

^
ˆ
n
;i;j
(
B
)=
a
b

c
d
=
(
a

c
)
d
+(
d

b
)
c
bd
a

c
=
I
+
II
+
III
+
IV
I
=
1
n
n
X
t
=1
(
y
ti
(

x
T

t
+
x
T
))(
^
B
j

B
j
)
II
=
1
n
n
X
t
=1
(
y
tj
(

x
T

t
+
x
T
))(
^
B
i

B
i
)
III
=(
^
B
i

B
i
)
T
(
1
n
X
T
X

x
(
x
)
T
)
^
B
j
IV
=
B
T
i
(
1
n
X
T
X

x
(
x
)
T
)(
^
B
j

B
j
)
Wehave
j
I
jk
1
n
P
n

t
=1
(
y
ti
(

x
T

t
+
x
T
))
k
1
k
^
B
j

B
j
k
1
.UsingTheorem1ofWainwright
(2009),wehave
k
^
B
j

B
j
k
1
=
o
p
(1).Wewanttoprove
k
1
n
P
n

t
=1
(
y
ti
(

x
T

t
+
x
T
))
k
1
=
O
p
(1).
First,wehave
k
1
n
P
n

t
=1
(
y
ti
(

x
T

t
+
x
T
))
k
1
k
1
n
P
n

t
=1
y
ti
x
T

t
k
1
+
k
1
n
P
n

t
=1
y
ti
x
T
k
1
.Forany
j
=1
;
;q
,usingCauchy-Schwarzinequality,wehave
j
1
n
P
n

t
=1
y
ti
x
tj
j
r
P
n

t
=1
y
2
ti
n
P
n

t
=1
x
2

tj
n
and
j
1
n
P
n

t
=1
y
ti
x
j
j
=
j
1
n
P
n

t
=1
y
ti
jj
1
n
P
n

t
=1
x
tj
jj
1
n
P
n

t
=1
y
ti
j
q
1
n
P
n

t
=1
x
2

tj
.Becauseof
P
n

t
=1
y
i
t
n
P
!
E
[y
1
i
],P
n

t
=1
(
y
i
t
)
2
n
P
!
E
[y
2
1
i
],and
P
n

t
=1
x
2

tj
n
!
C
jj
,weobtain
k
1
n
P
n

t
=1
y
ti
x
T

t
k
1
=
79
O
p
(1)and
k
1
n
P
n

t
=1
y
ti
x
T
k
1
=
O
p
(1).So,weproved
I
=
o
p
(1).Similarly,wecanobtain
II
=
o
p
(1).
Wehave
III
k
^
B
i

B
i
k
1
k
(
1
n
X
T
X

x
(
x
)
T
)
^
B
j
k
1
.UsingTheorem1ofWainwright
(2009),wehave
k
^
B
i

B
i
k
1
=
o
p
(1).First,wehave
k
(
1
n
X
T
X

x
(
x
)
T
)
^
B
j
k
1
k
1
n
X
T
X
^
B
j
k
1
+
k
x
(
x
)
T
^
B
j
k
1
.Because

j;n
=
o
(
n
),wehave
^
B
j
=
O
p
(1)
;j
=1
;
;p
byTheorem2of
KnightandFu(2000).Besides,wealsoknow
j
(
x
(
x
)
T
)
ij
j
=
j
1
n
P
n

t
=1
x
ti
jj
1
n
P
n

t
=1
x
tj
j
q
1
n
P
n

t
=1
x
2

ti
q
1
n
P
n

t
=1
x
2

tj
;i
=1
;
;p;j
=1
;:::q;
and
1
n
X
T
X
!
C
.So,weobtainthe
followingresults
k
1
n
X
T
X
^
B
j
k
1
=
O
p
(1)and
k
x
(
x
)
T
^
B
j
k
1
=
O
p
(1).Hence,
III
=
o
p
(1).
Similarly,wecanobtain
IV
=
o
p
(1).Combining
I;II;III;IV
,weobtain
a

c
=
o
p
(1).
Weknow
1
n
P
n

t
=1
(
z
i
t

z
i
)
2
P
!
˙
2
i
;8
i
=1
;:::;p
.So,wehave
d
P
!
˙
i
˙
j
.1
n
n
X
t
=1
(^
z
i
t

^
z
i
)
2

1
n
n
X
t
=1
(
z
i
t

z
i
)
2
=
V
+
VI
V
=
2
n
n
X
t
=1
(
z
i
t

z
i
)(

x
T

t
+
x
T
)(
^
B
i

B
i
)
VI
=(
^
B
i

B
i
)
T
(
1
n
X
T
X

x
(
x
)
T
)(
^
B
i

B
i
)
Implementingthesimilaranalysisaswedidforproving
a

c
=
o
p
(1),wecanobtain
V
=
o
p
(1)and
VI
=
o
p
(1).So,weproved
d

b
=
o
p
(1).With
d
P
!
˙
i
˙
j
,wealsohave
b
P
!
˙
i
˙
j
.Combiningalltheresultsaboveand
c
P
!
˙
ij
,wenallyobtain^
ˆ
n
;i;j
(
^
B
)

^
ˆ
n
;i;j
(
B
)=
o
p
(1)
;8
i;j
=1
;
;p:
ProofoftheLemma2.
WeproveLemma2usingarecursivemethod.ByLemma1,weknow
(A.1)isrightif
j
k
j
0.
80
Now,weassume(1)isrightforall
k
with
j
k
j
m
.Wewanttoprove(A.1)isrightfor
all
k
with
j
k
j
m
+1.
Usingtheformula
ˆ
n
;i;j
jk
=
ˆ
n
;i;j
jk
n
h

ˆ
n
;i;h
jk
n
h
ˆ
n
;j;h
jk
n
h
r
(1

ˆ
2

n
;i;h
jk
n
h
)(1

ˆ
2

n
;j;h
jk
n
h
)
,wehave
^
ˆ
n
;i;j
jk
(
^
B
)

^
ˆ
n
;i;j
jk
(
B
)
=
D(^
ˆ
1

ˆ
1
)+(
D
^
D)
ˆ
1
+
D^
ˆ
2
(
ˆ
3

^
ˆ
3
)+^
ˆ
2
ˆ
3
(
^
D
D)+
^
Dˆ
3
(
ˆ
2

^
ˆ
2
)
D^
D^
ˆ
1
=^
ˆ
n
;i;j
jk
n
h
ˆ
1
=
ˆ
n
;i;j
jk
n
h
^
ˆ
2
=^
ˆ
n
;i;h
jk
n
h
ˆ
2
=
ˆ
n
;i;h
jk
n
h
^
ˆ
3
=^
ˆ
n
;j;h
jk
n
h
ˆ
3
=
ˆ
n
;j;h
jk
n
h
^
D=(1

^
ˆ
2

n
;i;h
jk
n
h
(
^
B
))(1

^
ˆ
2

n
;j;h
jk
n
h
(
^
B
))
D=(1

^
ˆ
2

n
;i;h
jk
n
h
(
B
))(1

^
ˆ
2

n
;j;h
jk
n
h
(
B
))
Usingtheassumptionthat(1)isrightforall
k
with
j
k
j
m
and
1
n
X
T
X
!
C
,weobtain
^
ˆ
1

ˆ
1
P
!
o
p
(1)
;^
ˆ
2

ˆ
2
P
!
o
p
(1)
;^
ˆ
3

ˆ
3
P
!
o
p
(1)
;and
^
D
DP
!
o
p
(1).Besides,we
know
ˆ
1
;ˆ
2
;ˆ
3
;and
Dareconvergentinprobability.Hence,wenallyobtain^
ˆ
n
;i;j
jk
(
^
B
)

^
ˆ
n
;i;j
jk
(
B
)=
o
p
(1),forany
k
satisfying
j
k
j
m
+1.
ProofoftheLemma3.
P
(
j
^
ˆ
n
;i;j
jk
(
^
B
)

ˆ
n
;i;j
jk
j
>
)

P
(
j
^
ˆ
n
;i;j
jk
(
B
)

ˆ
n
;i;j
jk
j
>

j
^
ˆ
n
;i;j
jk
(
^
B
)

^
ˆ
n
;i;j
jk
(
B
)
j
).UsingLemma2andCorollary1inKalischandBuhlmann(2007),
proofisdone.
ProofofTheorem1:TheoremsinKalischandBuhlmann(2007)areb
aseduponCorollary
81
1intheirpaper.ByLemma3,weobtainedthesameresultastheirCor
ollary1after
introducingtheestimationof
B
.So,wecanobtaintheresultastheirTheorem2underthe
assumptionsweputinourTheorem1.
HerewewritethePC-algorithminKalischandBuhlmann(2007).Ther
earetwosteps
inthealgorithm.Intherststep,theskeletonstructureisestima
ted.Inthesecondstep,
thedirectionoftheedgesaredetermined.
Input
:VertexSet
V
,ConditionalIndependenceInformation
Output
:Estimatedskeleton
C
,separationsets
S
(onlyneededwhendirectingthe
skeletonafterwards)
Formthecompleteundirectedgraph
~
C
onthevertexset
V
;repeat
l
=

1;
C
=
~
C
;repeat
Selecta(new)orderedpairofnodes
i;j
thatareadjacentin
C
suchthat
j
adj
(
C;i
)
nf
j
gj
l
;repeat
Choose(new)
k

adj
(
C;i
)
nf
j
g
with
j
k
j
=
l
;if
i
and
j
areconditionalindependentgiven
k
then
Deleteedge
i;j
;Denotethisnewgraphby
C
;Save
k
in
S
(
i;j
)and
S
(
j;i
);
end
until
edge
i;j
isdeletedorall
k

adj
(
C;i
)
nf
j
g
with
j
k
j
=
l
havebeen
chosen
;until
allorderedpairsofadjacentvariables
i
and
j
suchthat
j
adj
(
C;i
)
nf
j
gj
l
and
k

adj
(
C;i
)
nf
j
g
with
j
k
j
=
l
havebeentestedforconditionalindependence
;until
foreachorderedpairofadjacentnodes
i;j
:j
adj
(
C;i
)
nf
j
gj
<l
;Algorithm0:
PC-algorithm:step1
82
Input
:Skeleton
G
skel
,separationsets
S
Output
:CPDAG
G
for
allpairsofnonadjacentvariables
i;j
withcommonneighbour
k
do
if
k
62
S
(
i;j
)
then
Replace
i

k

j
in
G
skel
by
i
!
k
 
j
end
end

IntheresultingPDAG,trytoorientasmanyundirectededgesaspo
ssibleby
repeatedapplicationofthefollowingrules:

(R1)Orient
j

k
into
j
!
k
wheneverthereisanarrow
i
!
j
suchthat
i
and
k
are
nonadjacent.

(R2)Orient
i

j
into
i
!
j
wheneverthereisachain
i
!
k
!
j
.(R3)Orient
i

j
into
i
!
j
whenevertherearetwochains
i

k
!
j
and
i

l
!
j
suchthat
k
and
l
arenonadjacent.
(R4)Orient
i

j
into
i
!
j
whenevertherearetwochains
i

k
!
j
and
k
!
l
!
j
suchthat
k
and
l
arenonadjacent.
Algorithm0:
PC-algorithm:step2
83
AppendixB

ProofofTheorem2

WecloselyfollowproofsinLinetal.(2015)toestablishthetheory.Fir
st,weproveCondition
(C2)alsoholdforthesamplematrixof
^
X
withasmaller

.Lemma4.
Iftheparametersintherststep

j
areselectedtosatisfy
16
˚

2
rs
max
(2
L
+

max
)


2(4


)
andCondition(C2)holds,thenwithprobabilityatleast
1

P
p

j
=1
q
exp(

n
2

j
8
˙
2
j
)
,
thematrix
^
C
=
1
n
(
^
X
)
T
^
X
=
1
n
(
G
^

)
T
G
^

satises
k
(
^
C
SS
+2

2
L
SS
)

1
k
1

2(4


)
8

3

˚
(B.1)
k
(
^
C
S
c
S
+2

2
L
S
c
S
)(
^
C
SS
+2

2
L
SS
)

1
k
1

1

3
4
:
(B.2)
Proof.
Byapplying(A.1)and(A.2)inLinetal.,similartotheirproof,wehave
˚
k
^
C
SS

C
SS
k
1


2(4


)
and
˚
k
^
C
S
C
S

C
S
C
S
k
1


2(4


)
.Then,wehave
k
(
^
C
SS
+2

2
L
SS
)

1

(
C
SS
+2

2
L
SS
)

1
k
1

˚
k
^
C
SS
+2

2
L
SS

C
SS
+2

2
L
SS
k
1
1

˚
k
^
C
SS
+2

2
L
SS

C
SS
+2

2
L
SS
k
1
˚
=
˚
k
^
C
SS

C
SS
k
1
1

˚
k
^
C
SS

C
SS
k
1
˚


8

3

˚:
84
Thetriangleinequalityimplies
k
(
^
C
SS
+2

2
L
SS
)

1
k
1
k
(
^
C
SS
+2

2
L
SS
)

1

(
C
SS
+2

2
L
SS
)

1
k
1
+
k
(
C
SS
+2

2
L
SS
)

1
k
1


8

3

˚
+
˚
=
2(4


)
8

3

˚:
Thenwehave
k
(
^
C
S
c
S
+2

2
L
S
c
S
)(
^
C
SS
+2

2
L
SS
)

1

(
C
S
c
S
+2

2
L
S
c
S
)(
C
SS
+2

2
L
SS
)

1
k
1
=
k
(
^
C
S
C
S

C
S
C
S
)(
^
C
SS
+2

2
L
SS
)

1

(
C
S
c
S
+2

2
L
S
c
S
)(
C
SS
+2

2
L
SS
)

1
(
^
C
SS

C
SS
)(
^
C
SS
+2

2
L
SS
)

1
k
1
k
(
^
C
S
C
S

C
S
C
S
)
k
1
k
(
^
C
SS
+2

2
L
SS
)

1
k
1
+
k
(
C
S
c
S
+2

2
L
S
c
S
)(
C
SS
+2

2
L
SS
)

1
k
1
k
(
^
C
SS

C
SS
)
k
1
k
(
^
C
SS
+2

2
L
SS
)

1
k
1


2(4


)
˚
2(4


)
8

3

˚
+(1


)

2(4


)
˚
2(4


)
8

3

˚
=
2


8

3


1
4
:
85
Finallywehave
k
(
^
C
S
c
S
+2

2
L
S
c
S
)(
^
C
SS
+2

2
L
SS
)

1
k
1
k
(
^
C
S
c
S
+2

2
L
S
c
S
)(
^
C
SS
+2

2
L
SS
)

1

(
C
S
c
S
+2

2
L
S
c
S
)(
C
SS
+2

2
L
SS
)

1
k
1
+
k
(
C
S
c
S
+2

2
L
S
c
S
)(
C
SS
+2

2
L
SS
)

1
k
1

1
4

+1


=1

3
4
:
ProofoftheTheorem2.
HerewecloselyfollowproofsinLinetal.(2015).
Foranindexset
I
,dene
X
J
asthesubmatrixconsistingofthe
j
thcolumnsof
X
,where
j
2
J
.ByKarush-Kuhn-Tuckerconditions,thesolution
^

ofEquation3.4mustsatises
1
n
^
X
T
^
S
(
Y

^
X
^

)

2

2
L
T
^
S
^

=

1
sign(
^

^
S
)(B.3)
k
1
n
^
X
T
^
S
C
(
Y

^
X
^

)

2

2
L
T
^
S
C
^

k
1


1
:(B.4)
Let
^

S
C
=0.Werstnda
^

S
from(B.3),thenweprovesuch
^

alsosatises(B.4).
Besides,weprovesuch
^

possessesthepropertyofconsistency.
UsingthesimilarargumentinLinetal.(2015),wecanndconstants
c
0
;c
1
;c
2
>
0such
that,ifweselect

1
asintheTheorem,thenwithprobabilityatleast1

c
1
(
pq
)

c
2
,wehave
k
1
n
^
X
T


1
n
^
X
T
(
^
X

X
)

)
k
1


2(4


)

1
:(B.5)
Fromnowon,webaseouranalysison(B.5).
86
Byusingtheequality
Y
=
X
S

S
+

,wehave
Y

^
X
^

=


(
^
X
S

X
S
)

S

^
X
S
(
^

S


S
).Replacing
^
S
with
S
,wewrite(B.3)as
^
C
SS
(
^

S


S
)+2

2
L
SS
^

S
+2

2
L
T

S
C
S
^

S
C
=
1
n
^
X
T
S


1
n
^
X
T
S
(
^
X
S

X
S
)

S


1
sign(
^

S
)
:Aftersomealgebra,wehave
^

S


S
=(
^
C
SS
+2

2
L
SS
)

1
h
1
n
^
X
T
S


1
n
^
X
T
S
(
^
X
S

X
S
)

S


1
sign(
^

S
)

2

2
L
SS

S
i
:(B.6)
ByLemma4and(B.5),wehave
k
^

S


S
k
1
k
(
^
C
SS
+2

2
L
SS
)

1
k
1
h
k
1
n
^
X
T
S


1
n
^
X
T
S
(
^
X
S

X
S
)

S
k
1
+
k

1
sign(
^

S
)
k
1
+
k
2

2
L
SS

S
k
1
i

2(4


)
8

3

˚
h

2(4


)

1
+

1
+2

2
C
L
i
=
2(4


)
8

3

˚
h
8


2(4


)

1
+2

2
C
L
i
<b
0
whichimpliessign(
^

S
)=sign(

S
).Besides,weknow
^

S
C
=0bydenition.Hence,we
have
^
S
=
S
.Let
^

S
bedenedby(B.6)withsign(
^

S
)replacedbysign(

S
).
Now,weneedtocheckif(B.4)holds.Byusingtheequality
Y

^
X
^

=


(
^
X
S

87
X
S
)

S

^
X
S
(
^

S


S
)and(B.6),wehave
1
n
^
X
T
S
C
(
Y

^
X
^

)

2

2
L
T

S
C
^

=
1
n
^
X
T
S
C


1
n
^
X
T
S
C
(
^
X
S

X
S
)

S

(
^
C
S
c
S
+2

2
L
S
c
S
)(
^
C
SS
+2

2
L
SS
)

1
h
1
n
^
X
T
S
C


1
n
^
X
T
S
C
(
^
X
S

X
S
)

S


1
sign(
^

S
)

2

2
L
SS

S
i

2

2
L
S
C
S

S
ByLemma4,(B.5),and

2
C
L


(16

3

)
4(4


)(8

3

)

1
,wehave
k
1
n
^
X
T
S
C
(
Y

^
X
^

)

2

2
L
T

S
C
^

k
1
k
1
n
^
X
T
S
C


1
n
^
X
T
S
C
(
^
X
S

X
S
)

S
k
1
+
k
(
^
C
S
c
S
+2

2
L
S
c
S
)(
^
C
SS
+2

2
L
SS
)

1
k
1
h
k
1
n
^
X
T
S
C


1
n
^
X
T
S
C
(
^
X
S

X
S
)

S
k
1
+
k

1
sign(
^

S
)
k
1
+
k
2

2
L
SS

S
k
1
i
+
k
2

2
L
S
C
S

S
k
1


2(4


)

1
+(1

3
4

)
h

2(4


)

1
+

1
+2

2
C
L
i
+2

2
C
L
=
3

2

24

+32
8(4


)

1
+
8

3

2

2
C
L


1
:Since
^
S
=
S
,wesee
^

alsosatises(B.4).Lastly,wehave
k
^

S


S
k
1

2(4


)
8

3

˚
h
8


2(4


)

1
+2

2
C
L
i

2(4


)
8

3

˚
h
8


2(4


)

1
+2

(16

3

)
4(4


)(8

3

)

1
i
=
16(4


)
˚C
0
(8

3

)
2

r
r
(log
p
+log
q
)
n
88
AppendixC

ProofofTheorem3

Lemma5.
Ifsomeregularizationconditiononrststep

j
issatisedandCondition(C2)
holds,thenwithprobabilityapproachingto1,thematrix
^
C
=
1
n
^
V
T
^
V
satises
k
(
^
C
SS
+2

2
L
SS
)

1
k
2

2(4


)
8

3

˚
(C.1)
k
(
^
C
tS
+2

2
L
tS
)(
^
C
SS
+2

2
L
SS
)

1
k
2

(1

3
4

)
1
p
s
;t
2
S
C
:(C.2)
where
˚
=
k
(
C
SS
+2

2
L
SS
)

1
k
2
and
C
=
1
n
V
T
V
.
Proof.
k
^
C
tS

C
tS
k
2
=
k
R

1
2
(
1
n
^
U
T
t
^
U
S

1
n
U
T
t
U
S
)
0

B

B

B

B

B

@
R

1
2
0
...0
R

1
2
1

C

C

C

C

C

A
k
2
k
R

1
2
k
2
k
1
n
^
U
T
t
^
U
S

1
n
U
T
t
U
S
k
2
k
0

B

B

B

B

B

@
R

1
2
0
...0
R

1
2
1

C

C

C

C

C

A
k
2
89
Bydenition,weknow
U
t
=
0

B

B

B

B

B

@
X
1
t
(
B
1
(
Z
1
)
;
;B
l
(
Z
1
))
.
.
.X
nt
(
B
1
(
Z
n
)
;
;B
l
(
Z
n
))
1

C

C

C

C

C

A
=
X

t
B

;t
=1
;
;p
where
X

t
=
0

B

B

B

B

B

@
X
1
t
0
...0
X
nt
1

C

C

C

C

C

A
and
B

=
0

B

B

B

B

B

@
B
1
(
Z
1
)

B
l
(
Z
1
)
.
.
..
.
.B
1
(
Z
n
)

B
l
(
Z
n
)
1

C

C

C

C

C

A
.k
1
n
^
U
T
t
^
U
S

1
n
U
T
t
U
S
k
2
=
k
B

T
"
1
n
^
X

t

^
X

1
;
;^
X

s


1
n
X

t

X

1
;
;X

s

#
0

B

B

B

B

B

@
B

0
...0
B

1

C

C

C

C

C

A
k
2
k
B

T
k
2
k
1
n
^
X

t

^
X

1
;
;^
X

s


1
n
X

t

X

1
;
;X

s

k
2
k
0

B

B

B

B

B

@
B

0
...0
B

1

C

C

C

C

C

A
k
2
FromLinetal.(2015),wehavethefollowingwithprobabilityatleast1

P
p

j
=1
q
exp(

n
2

j
8
˙
2
j
),
k
1
n
^
X

t

^
X

1
;
;^
X

s


1
n
X

t

X

1
;
;X

s

k
2

C
(
;s;
f

j
g
p

j
=1
;L
)
;8
t:
90
Ifwechoose
f

j
g
p

j
=1
suchthat
˚N
2
1
N
2
2
C
(
;s;
f

j
g
p

j
=1
;L
)
s
2
:5


2(4


)
,wehave
˚
k
^
C
tS

C
tS
k
2

˚N
2
1
N
2
2
C
(
;s;
f

j
g
p

j
=1
;L
)
s
2


2(4


)
p
s
;8
t
2
S:
and
˚
k
^
C
SS

C
SS
k
2


2(4


)
:k
(
^
C
SS
+2

2
L
SS
)

1

(
C
SS
+2

2
L
SS
)

1
k
2

˚
k
^
C
SS
+2

2
L
SS

C
SS

2

2
L
SS
k
2
1

˚
k
^
C
SS
+2

2
L
SS

C
SS

2

2
L
SS
k
2
˚
=
˚
k
^
C
SS

C
SS
k
2
1

˚
k
^
C
SS

C
SS
k
2
˚


8

3

˚
Thetriangleinequalityimplies
k
(
^
C
SS
+2

2
L
SS
)

1
k
2
k
(
^
C
SS
+2

2
L
SS
)

1

(
C
SS
+2

2
L
SS
)

1
k
2
+
k
(
C
SS
+2

2
L
SS
)

1
k
2


8

3

˚
+
˚
=
2(4


)
8

3

˚:
91
Wenotice
(
C
tS
+2

2
L
tS
)(
C
SS
+2

2
L
SS
)

1
=
R

1
2

1
n
U
T
t
U
S
+2

2
L
R

tS

1
n
U
T
S
U
S
+2

2
L
R

SS


1
0

B

B

B

B

B

@
R
1
2
0
...0
R
1
2
1

C

C

C

C

C

A
So,wehave
k
(
^
C
tS
+2

2
L
tS
)(
^
C
SS
+2

2
L
SS
)

1

(
C
tS
+2

2
L
tS
)(
C
SS
+2

2
L
SS
)

1
k
2
=
k
(
^
C
tS

C
tS
)(
^
C
SS
+2

2
L
SS
)

1

(
C
tS
+2

2
L
tS
)(
C
SS
+2

2
L
SS
)

1
(
^
C
SS

C
SS
)(
^
C
SS
+2

2
L
SS
)

1
k
2
k
(
^
C
tS

C
tS
)
k
2
k
(
^
C
SS
+2

2
L
SS
)

1
k
2
+
k
(
C
tS
+2

2
L
tS
)(
C
SS
+2

2
L
SS
)

1
k
2
k
(
^
C
SS

C
SS
)
k
2
k
(
^
C
SS
+2

2
L
SS
)

1
k
2


2(4


)
p
s˚
2(4


)
8

3

˚
+
1


p
s

2(4


)
˚
2(4


)
8

3

˚
=
(2


)

(8

3

)
p
s


4
p
s
:92
Andnallywehave,for
8
t
2
S
C
,k
(
^
C
tS
+2

2
L
tS
)(
^
C
SS
+2

2
L
SS
)

1
k
2
k
(
^
C
tS
+2

2
L
S
c
S
)(
^
C
SS
+2

2
L
SS
)

1

(
C
tS
+2

2
L
tS
)(
C
SS
+2

2
L
SS
)

1
k
2
+
k
(
C
tS
+2

2
L
tS
)(
C
SS
+2

2
L
SS
)

1
k
2


4
p
s
+
1


p
s
=(1

3
4

)
1
p
s
:ProofoftheTheorem3.
HerewecloselyfollowproofsinLinetal.(2015).
Foranindexset
I
,dene
X
J
asthesubmatrixconsistingofthe
j
thcolumnof
X
,where
j
2
J
.ByKarush-Kuhn-Tuckerconditions,thesolution
^
˘
ofEquation4.10mustsatises
1
n
^
V
T
^
S
(
Y

^
V
^
˘
)

2

2
L
T
^
S
^
˘
=

1

^
˘
t
k
^
˘
t
k
2

t
2
^
S
(C.3)
k
1
n
^
V
T
t
(
Y

^
V
^
˘
)

2

2
L
T

t
^
˘
k
2


1
;8
t
2
^
S
C
:(C.4)
Let
^
˘
S
C
=0.Werstnda
^
˘
S
from(C.3),thenweprovesuch
^
˘
alsosatises(C.4).Besides,
weprovesuch
^
˘
possessesthepropertyofconsistency.
k
1
n
^
V
T
t


1
n
^
V
T
t
(
^
V
S

V
S
)
˘
S
k
2
=
k
R

1
2
h
1
n
^
U
T
t


1
n
^
U
T
t
(
^
U
S

U
S
)

0
S
i
k
2
k
R

1
2
k
2
k
1
n
^
U
T
t


1
n
^
U
T
t
(
^
U
S

U
S
)

0
S
k
2
93
Assumetherst
s
elementsof
˘
arenotzeroes.
k
1
n
^
U
T
t


1
n
^
U
T
t
(
^
U
S

U
S
)

S
k
2
=
k
1
n
B

T
^
X

t


1
n
B

T
^
X

t
(
^
X

1

X

1
;
;^
X

s

X

s
)
0

B

B

B

B

B

@
B

0
...0
B

1

C

C

C

C

C

A

S
k
2
k
B

T
k
2
k
1
n
^
X

t


1
n
^
X

t
(
^
X

1

X

1
;
;^
X

s

X

s
)
0

B

B

B

B

B

@
B

0
...0
B

1

C

C

C

C

C

A

S
k
2
UsingthesimilarargumentinLinetal.(2015),wecanndconstants
c
0
;c
1
;c
2
>
0such
that,ifweselect

1
asintheTheorem,thenwithprobabilityconvergingto1,wehave
k
1
n
^
X

t


1
n
^
X

t
(
^
X

1

X

1
;
;^
X

s

X

s
)
0

B

B

B

B

B

@
B

0
...0
B

1

C

C

C

C

C

A

S
k
2


2(4


)
N
1
N
2

1
So,wehave
k
1
n
^
V
T
t


1
n
^
V
T
t
(
^
V
S

V
S
)
˘
S
k
2


2(4


)

1
(C.5)
94
Usingtheequality
Y
=
V
S
˘
S
+

andreplace
^
S
with
S
,wehave
^
˘
S

˘
S
=(
^
C
SS
+2

2
L
SS
)

1
h
1
n
^
V
T
S


1
n
^
V
T
S
(
^
V
S

V
S
)
˘
S


1

^
˘
t
k
^
˘
t
k
2

t
2
S

2

2
L
SS
˘
S
i
:(C.6)
Then,wehave
k
^
˘
S

˘
S
k
2
k
(
^
C
SS
+2

2
L
SS
)

1
k
2
h
k
1
n
^
V
T
S


1
n
^
V
T
S
(
^
V
S

V
S
)
˘
S
k
2
+
k

1

^
˘
t
k
^
˘
t
k
2

t
2
S
k
2
+
k
2

2
L
SS
˘
S
k
2
i
k
^
˘
t

˘
t
k
2
=
k
((
^
C
SS
+2

2
L
SS
)

1
)
T

t
h
1
n
^
V
T
S


1
n
^
V
T
S
(
^
V
S

V
S
)
˘
S


1

^
˘
t
k
^
˘
t
k
2

t
2
S

2

2
L
SS
˘
S
i
k
2
k
((
^
C
SS
+2

2
L
SS
)

1
)
T

t
k
2
h
k
1
n
^
V
T
S


1
n
^
V
T
S
(
^
V
S

V
S
)
˘
S
k
2
+
k

1

^
˘
t
k
^
˘
t
k
2

t
2
S
k
2
+
k
2

2
L
SS
˘
S
k
2
i
=
k
((
^
C
SS
+2

2
L
SS
)

1
)
T

t
k
2
h
(
X
t
2
S
k
1
n
^
V
T
t


1
n
^
V
T
t
(
^
V
S

V
S
)
˘
S
k
2

2
)
1
2
+
k

1

^
˘
t
k
^
˘
t
k
2

t
2
S
k
2
+
k
2

2
L
SS
˘
S
k
2
i

2(4


)
8

3

˚
h

2(4


)

1
p
s
+

1
p
s
+2

2
C
L
i
<b
0
:whichimpliessign(
^
˘
S
)=sign(
˘
S
).Besides,weknow
^
˘
S
C
=0bydenition.Hence,wehave
95
^
S
=
S
.Let
^
˘
S
bedenedby(C.6)withsign(
^
˘
S
)replacedbysign(
˘
S
).
Now,wecheckifthesolutionweobtainedsatises(C.4).First,weha
ve
1
n
^
V
T
t
(
Y

^
V
^
˘
)

2

2
L
T

t
^
˘
=
1
n
^
V
T
t


1
n
^
V
T
t
(
^
V
S

V
S
)
˘
S

(
^
C
tS
+2

2
L
tS
)(
^
C
SS
+2

2
L
SS
)

1
h
1
n
^
V
T
S


1
n
^
V
T
S
(
^
V
S

V
S
)
˘
S


1

^
˘
t
k
^
˘
t
k
2

t
2
S

2

2
L
SS
˘
S
i

2

2
L
tS
˘
S
;8
t
2
S
C
:So,
8
t
2
S
C
,wehave
k
1
n
^
V
T
t
(
Y

^
V
^
˘
)

2

2
L
T

t
^
˘
k
2
=
k
1
n
^
V
T
t


1
n
^
V
T
t
(
^
V
S

V
S
)
˘
S

(
^
C
tS
+2

2
L
tS
)(
^
C
SS
+2

2
L
SS
)

1
h
1
n
^
V
T
S


1
n
^
V
T
S
(
^
V
S

V
S
)
˘
S


1

^
˘
t
k
^
˘
t
k
2

t
2
S

2

2
L
SS
˘
S
i

2

2
L
tS
˘
S
k
2
k
1
n
^
V
T
t


1
n
^
V
T
t
(
^
V
S

V
S
)
˘
S
k
2
+
k
(
^
C
tS
+2

2
L
tS
)(
^
C
SS
+2

2
L
SS
)

1
k
2
h
k
1
n
^
V
T
S


1
n
^
V
T
S
(
^
V
S

V
S
)
˘
S
k
2
+
k

1

^
˘
t
k
^
˘
t
k
2

t
2
S
k
2
+
k
2

2
L
SS
˘
S
k
2
i
+
k
2

2
L
tS
˘
S
k
2


2(4


)

1
+(1

3
4

)
1
p
s
h

2(4


)

1
p
s
+

1
p
s
+2

2
C
L
i
+2

2
C
L


1
Hence,thesolutionweobtainedisactuallythesolutionoftheoptimiza
tionproblem.
Wehave
k
^
˘
S

˘
S
k
2

2(4


)
8

3

˚
h
8


2(4


)

1
p
s
+2

2
C
L
i
p
s:
96
So,nally,wehave
k
^

S


S
k
2

2(4


)
8

3

˚
h
8


2(4


)

1
p
s
+2

2
C
L
i
s
1
:5
N
1
:97
BIBLIOGRAPHY
98
BIBLIOGRAPHY
[1]Adler,R.J.,andTaylor,J.E.(2009).Randomeldsandgeometr
y.SpringerScience
&BusinessMedia.
[2]Ali,R.A.,Richardson,T.S.andSpirtes,P.(2009).Markovequiv
alenceforancestral
graphs.
TheAnnalsofStatistics
,37
,2808-2837.
[3]Andersson,S.A.,Madigan,D.andPerlman,M.D.(1997).Achar
acterizationof
Markovequivalenceclassesforacyclicdigraphs.
TheAnnalsofStatistics
,25
,505-541.
[4]Bickel,P.,andLevina,E.(2008).Covarianceregularizationthre
sholding.
TheAnnals
ofStatistics
,36
,2577-2604.
[5]Bondell,H.,andReich,B.(2008).Simultaneousregressionshrink
age,variableselection,
andsupervisedclusteringofpredictiorswithOSCAR.
Biometrics
,64
,115-123.
[6]Cai,T.,Li,H.,Liu,W.andXie,J.(2013).Covariate-adjustedpre
cisionmatrixesti-
mationwithanapplicationingeneticalgenomics.
Biometrika
,100(1)
,407-499.
[7]Cheung,V.G.andSpielman,R.S.(2002).Thegeneticsofvariat
ioningeneexpression.
NatureGenetics
,32
,522-525.
[8]Chichering,D.(2002),LearningequivalenceclassesofBayesian
-networkstructures.
TheJournalofMachineLearningResearch
,2
,445-498.
[9]Daye,Z.,Xie,J.,andLi,H.(2012).Asparsestructuredshrinka
geestimatorfornon-
parametricvarying-coecientmodelwithanapplicationingenomics.
JournalofCom-
putationalandGraphicalStatistics
,21
,110-133.
[10]Dobra,A.,Jones,B.,Hans,C.,Nevins,J.R.andWest,M.(2004)
.Sparsegraphical
modelsforexploringgeneexpressiondata.
JournalofMultivariateAnalysis
,90
,196-
212.
[11]Efron,B.,Hastie,T.,Johnstone,I.andTibshirani,R.(2004).
Leastangleregression.
TheAnnalsofStatistics
,32
,407-499.
99
[12]Fan,J.,andLi,R.(2001).Variableselectionvianonconcavepen
alizedlikelihoodand
itsoracleproperties.
JournaloftheAmericanStatisticalAssociation
,96
,1348-1360.
[13]Fan,J.,andLiao,Y.(2014).Endogeneityinhighdimensions.
TheAnnalsofStatistics
,42
,872-917.
[14]Fan,J.,Han,F.,andLiu,H.(2014).Challengesofbigdataanalys
is.
NationalScience
Review
,1
,293-314.
[15]Friedman,J.,Hastie,T.andTibshirani,R.(2008).Sparseinver
secovarianceestimation
withthegraphicallasso.
Biostatistics
,9
,432-441.
[16]Friedman,J.,Hastie,T.,andTibshirani,R.(2010).Regularizatio
npathsforgeneral-
izedlinearmodelsviacoordinatedescent.
JournalofStatisticalSoftware
,33
,1-22.
[17]Friedman,J.,Hastie,T.,Hoing,H.,andTibshirani,R.(2007).P
athwisecoordinate
optimization.
TheAnnalsofAppliedStatistics
,1
,302-332.
[18]Friedman,N.,M.Linial,I.NachmanandD.Pe'er.(2000).UsingBa
yesiannetworks
toanalyzeexpressiondata.
Journalofcomputationalbiology
,7
,601-620.
[19]Frydenberg,M.(1990).ThechaingraphMarkovproperty.
ScandinavianJournalof
Statistics
,17
,333-353.
[20]Gilad,Y.,Rifkin,S.A.,andPritchard,J.K.(2008).Revealingthear
chitectureofgene
regulation:thepromiseofeQTLstudies.
Trendsingenetics
,24(8)
,408-415.
[21]Kalisch,M.andBuhlmann,P.(2007).EstimatingHigh-Dimension
alDirectedAcyclic
GraphswiththePC-Algorithm.
JournalofMachineLearningResearch
,8
,613-636.
[22]Kalisch,M.,Machler,M.,Colombo,D.,Maathuis,M.andBuhlmann
,P.(2012).Causal
InferenceusingGraphicalModelswiththeRpackagepcalg.
JournalofStatistical
Software
,47(11)
,1-26.
[23]Kang,H.,ZhangA.,CaiT.,andSmall,D.(2015).Instumentalva
riablesestimation
withsomeinvalidinstrumentalanditsapplicationtoMedelian.
JournaloftheAmer-
icanStatisticalAssociation
,(just-accepted).
[24]Knight,K.andFu,W.(2000).AsymptoticsforLasso-Typees
timators.
TheAnnalsof
Statistics
,28
,1356-1378.
100
[25]Knight,K.,andFu,W.(2000).AsymptoticsforLasso-Typees
timators.
TheAnnals
ofStatistics
,28
,1356-1378.
[26]Lee,W.andLiu,Y.(2012).Simultaneousmultipleresponseregr
essionandinverse
covariancematrixestimationviapenalizedGaussianmaximumlikelihood.
Journalof
MultivariateAnalysis
111
,241-255.
[27]Lee,W.,andLiu,Y.(2012).Simultaneousmultipleresponseregr
essionandinverse
covariancematrixestimationviapenalizedGaussianmaximumlikelihood.
Journalof
MultivariateAnalysis
,111
,241-255.
[28]Li,C.,andLi,H.(2008).Network-constrainedregularizationa
ndvariableselectionfor
analysisofgenomicdata.
Bioinformatics
,24
,1175-1182.
[29]Li,C.,andLi,H.(2010).Variableselectionandregressionanalys
isforgraph-structrued
covariateswithanapplicationtogenomics.
TheAnnalsofAppliedStatistics
,4
,1498-
1516.
[30]Lin,W.,Feng,R.,andLi,H.(2015).Regularizationmethodsforh
igh-dimensional
instrumentalvariablesregressionwithapplicationtogeneticalgen
omics.
Journalof
theAmericanStatisticalAssociation
,toappear.
[31]Liu,H.,Han,F.,Yuan,M.,Laerty,J.andWasserman,L.(2012
).Highdimensional
semiparametricGaussiancopulagraphicalmodels.
TheAnnalsofStatistics
40
2293-
2326.
[32]Liu,W.(2013).Gaussiangraphicalmodelestimationwithfalsed
iscoveryratecontrol.
TheAnnalsofStatistics
,41
,2948-2978.
[33]Schadt,E.E.,Monks,S.A.,Drake,T.A.,Lusis,A.J.,Che,N.,Co
linayo,V.,and
Friend,S.H.(2003).Geneticsofgeneexpressionsurveyedinmaiz
e,mouseandman.
Nature
,422(6929)
,297-302.
[34]Meinsharsen,N.,andBuhlmann,P.(2006).High-dimensionalg
raphsandvariable
selectionwiththelasso.
TheAnnalsofStatistics
,34
,1436-1462.
[35]Richardson,T.andSpirtes,P.(2002).AncestralgraphMar
kovmodels.
TheAnnalsof
Statistics
,30
,962-1030.
[36]Ru,T.G.,Milligan,S.B.,Lamb,J.R.,Cavet,G.,Linsley,P.S.,Mao,
M.,Schadt,
E.E.,Monks,S.A.,Drake,T.A.,Lusis,A.J.,Che,N.,Colinayo,V.,Sch
adt,E.et
101
al.(2008).Mappingthegeneticarchitectureofgeneexpressionin
humanliver.
PLoS
biology
,6
,1020-1032.
[37]Spirtes,P.,Glymour,C.N.,andScheines,R.(2000).Causation
,prediction,andsearch.
MITpress.
[38]Stoughton,R.B.andFriend,S.H.(2003).Geneticsofgenee
xpressionsurveyedin
Structures.
JournalofMachineLearningResearch
,2
,445-498.
[39]Tibshirani,R.(1996).Regressionshrinkageandselectionviath
elasso.
Journalofthe
RoyalStatisticalSociety,SeriesB
,58
,267-288.
[40]Verma,T.andPearl,J.(1990).Equivalenceandsynthesisofc
ausalmodels.
InUncer-
taintyinAritcialIntellegence:ProceedingsoftheSixth
Conference(M.Henrion,R.
Shachter,L.KanalandJ.Lemmer,eds)
,220-227.MorganKaufman,SanFrancisco.
[41]Verma,T.andPearl,J.(1992).Analgorithmfordecidingifaset
ofobservedindepen-
dencieshasacausalexplanation.
InUncertaintyinAritcialIntellegence:Proceedings
oftheEighthConference(D.Dubois,M.P.Wellman,B.D'Ambr
osioandP.Smets,
eds.)
,323-330.MorganKaufman,SanFrancisco.
[42]Wainwright,M.J.(2009).Sharpthresholdsforhigh-dimension
alandnoisyrecovery
ofsparsityusing
`
1
-constrainedquadraticprogramming(lasso).
IEEETransactionson
InformationTheory
,55
,2183-2202.
[43]Wang,H.,andXia,Y.(2009).Shrinkageestimationofthevaryin
gcoecientmodel.
JournaloftheAmericanStatisticalAssociation
,104
,747-757.
[44]Wauthier,V.,Verbeeck,R.K.,andBucCalderon,P.(2004).Age
-relatedchangesinthe
proteinandmRNAlevelsofCYP2E1andCYP3Aisoformsaswellasinthe
irhepatic
activitiesinWistarrats.Whatroleforoxidativestress?
Archivesoftoxicology
.78
,131-138.
[45]Webster,JA,Gibbs,JR,Clarke,J,Ray,Metal.(2009).Genet
iccontrolofhuman
braintranscriptexpressioninAlzheimerdisease.
TheAmericanJournalofHuman
Genetics
84(4)
445-58.
[46]WeiF.,Huang,J.,andLi,H.(2011).Variableselectionandestima
tioninhigh-
dimensionalvarying-coecientmodels.
StatisticaSinica
,21
,1515-1540.
[47]Yang,X.etal.(2010).Systematicgeneticandgenomicanalysis
ofcytochromeP450
enzymeactivitiesinhumanliver.
Genomeresearch
,20
,1020-1036.
102
[48]Yin,J.andLi,H.(2011).AsparseconditionalGaussiangraphic
almodesforanalysis
ofgeneticalgenomicsdata.
AnnalsofAppliedStatistics
,5
,2630-2650.
[49]Yin,J.andLi,H.(2013).Ajustingforhigh-dimensionalcovaria
tesinsparseprecision
matrixinestimationby
l
1
-penalization.
JournalofMultivariateAnalysis
,116
,365-381.
[50]Yin,J.,andLi,H.(2011).AsparseconditionalGaussiangraphic
almodesforanalysis
ofgeneticalgenomicsdata.
AnnalsofAppliedStatistics
,5
,2630-2650.
[51]Yin,J.,andLi,H.(2013).Adjustingforhigh-dimensionalcovar
iatesinsparseprecision
matrixinestimationby
l
1
-penalization.
JournalofMultivariateAnalysis
,116
,365-381.
[52]Yuan,M.,andLin,Y.(2007).Modelselectionandestimationinth
eGaussiangraphical
model.
Biometrika
,94
,19-35.
[53]Zhang,C.(2010).Nearlyunbiasedvariableselectionunderminim
axconcavepenalty.
TheAnnalsofStatistics
,38
,894-942.
[54]Zou,H.,andHastie,T.(2005).Regularizationandvariableselec
tionviatheelastic
net.
JournaloftheRoyalStatisticalSociety,SeriesB
,67
,301-320.
103