GRAPHESTIMATIONANDNETWORKCONSTRAINEDREGULARIZATIONW ITH APPLICATIONSINGENETICALGENOMICSANALYSIS By BinGao ADISSERTATION Submittedto MichiganStateUniversity inpartialfullmentoftherequirements forthedegreeof Statistics|DoctorofPhilosophy 2015 ABSTRACT GRAPHESTIMATIONANDNETWORKCONSTRAINED REGULARIZATIONWITHAPPLICATIONSINGENETICALGENOMICS ANALYSIS By BinGao Estimationandapplicationofgraphicalstructureareimportantto picsinmodernstatis- tics.Graphicalstructureisanidealtooltodescribegeneregulat orynetworkwhichisimpor- tantingeneticalgenomicsstudies.Inthisdissertation,agraphes timationmethodandtwo networkconstrainedregularizationmethodsareproposed.Both theirtheoreticalproperties andapplicationsingeneticalgenomicsarestudied. Largeamountofresearcheortshavebeenfocusedonestimatin ggenenetworksbased ongeneexpressiondatainordertounderstandthefunctionalba sisofalivingorganism.By treatinggeneexpressionsasquantitativetraitswhileconsideringg eneticmarkers,genetical genomicsanalysishasshownitspowerinenhancingtheunderstandin gofgeneregulation. Knowingthatgeneexpressionsareoftenduetodirectedregulatio ns,weintroduceacovariate- adjustedGaussiangraphicalmodeltoestimatetheequivalencecla ssofthedirectedacyclic graphs(DAGs)inageneticalgenomicsanalysisframeworkinthesec ondchapter.Anestima- tionprocedureisintroducedandtheestimationconsistencyforsp arseDAGsisestablished. WeapplythemethodtoahumanAlzheimer'sdiseasedatasettoshowt heutilityofthe method. Givenageneregulationnetwork,learnedeitherthroughbiologicale xperimentsorvia statisticalmethods,incorporatingsuchstructureinaregressio nmodelwouldincreasethe phenotypepredictionaccuracy.Inchapter3,weusegeneexpre ssionstopredictphenotypic responseswhileconsideringthegraphicalstructuresongenenet worksinaparametricre- gressionmodel.Giventhatgeneexpressionsareintermediatephen otypesbetweenatrait andgenes,wefollowtheinstrumentalvariableregressionframewo rkproposedbyLinetal. (2015)andtreatgeneticvariantsasinstrumentalvariablestode alwiththeendogeneityissue. Weproposeatwo-stepestimationprocedure.Intherststep,w eapplylassotoestimate theeectsofgeneticvariants.Inthesecondstep,weusethepr edictedexpressionsobtained fromtherststepaspredictorswhileadoptinganetworkconstra inedregularizationmethod toobtainbettervariableselectionandestimation.Weestablishthes electionconsistency andrelated L 1 bound. Itiscommonlyrecognizedthatgeneticeectsoncomplextraitsare largelymodulated byenvironmentalinuences.However,themodulationmechanismis barelyunderstood. Basedontheworkinchapter3,weadoptaninstrumentalvariabler egressionapproach whileincorporatingavaryingcoecientmodeltocapturepotential nonlinearenvironmental modulationeectsinchapter4.B-splinesareusedtoapproximatet hecoecientfunctions. Grouplassoandagraphconstrainedpenaltiesareappliedtoachieve bettervariableselec- tionandestimationwhennetworkstructuresareconsideredforg eneexpressions.Selection consistencyisestablishedundersomeassumptions.Weapplyourme thodtoahumanliver cohortdatasettodemonstrateitsutility. Idedicatethisdissertationtomyparents,LianrongSongandQings hengGao. iv ACKNOWLEDGMENTS IwouldliketogratefullyandsincerelythankmyadvisorDr.YuehuaCu iforhisguidance, understanding,patienceduringmygraduatestudyandresearch .Dr.Cuiguidedmeinto theinterdisciplinaryeldofstatisticalgenetics.Hehelpedmeselect importantpapersto readandtaughtmethebasicconceptsandtechniquesinstatistica lgeneticsatthebeginning ofmyresearch.IrememberthosediscussionsinhisoceinwhichIlea rnedbasictoolsand techniqueswhichwouldbeveryimportantinmyresearch.Whenever Ineedhelp,Dr.Cuiis alwaysthere.EverytimeIknockedthedoorofhisoce,healwayss penttimelisteningto myquestionspatientlyandgavehelpfulcomments.IntheJournal Clubwhichishostedby Dr.Cui,Ilearnedalotfromothers'presentation.Ialsodeveloped mypresentationskills. Iappreciateothermembersinmycommittee.Dr.YiminXiaowasmyinitia ladvisor. Hegavemehelpfuladvisesontheselectionofderentcoursesand guidanceonmylifewhen IjustarrivedatUS.Dr.Ping-ShouZhonghelpedinmanyaspectsth rougheverydiscussion wehad.Hisinsightfulopinionsmademeviewquestionsderently.Dr. QingLu'scomments andquestionsintheJournalClubweregermaneandthoughtful.All themembersinthe committeehelpmeduringmygraduatestudy.Ireallyappreciatethe irguidanceandhelp. IappreciatealltheprofessorswhotaughtmeandhelpedmeintheD epartmentofStatis- ticsandProbability.Iobtainedknowledgefromthosewellprepared coursesandenjoyedthe processoflearningandtheinteractionwiththeinstructors. MythanksgotostudentsinDr.Cui'sgroup,includingHonglangWang, TaoHe,Shunjie Guan,JingyiZhang,postdoctoralresearcherXuLiu,andforma lstudentDr.CenWu.I alsowanttothankotherstudentsinthedepartment,includingYuz henZhou,XiaoqingZhu, LiqianCai,SiddharthaNandy,JiwoongKimandmanyothers.Itismyluc kthatsomany v nicepeopleandfriendsarearoundme.Theyhelpedmeinbothresear chanddailylife. Lastbutdenitelynotleast,Ithankmyparentsfortheirsupport .Theyalwaysmakeall possibleeortstohelpmewheneverIneed.Theyowncreditsinvery achievementofmine. vi TABLEOFCONTENTS LISTOFTABLES .................................... ix LISTOFFIGURES ................................... x Chapter1Introduction ............................... 1 1.1Overview......................................1 1.2Integrativegeneticalgenomicsanalysis.................. ...2 1.3Estimatioinandapplicationofgraphicalstructures.......... ....4 1.4Varyingcoecientmodelcapturinggene-environmentinteract ion......6 1.5Objectivesandorganization........................... 7 Chapter2Learningdirectedacyclicgraphicalstructuresw ithgenetical genomicsdata .............................. 9 2.1Introduction....................................9 2.2ModelandEstimation..............................12 2.2.1Background................................12 2.2.2Covariate-adjustedgaussiangraphicalmodel........... ...13 2.2.3Atwo-stageestimationprocedure....................1 4 2.3Theoreticalresult.................................1 5 2.4Simulation.....................................17 2.4.1Simulationdesign.............................17 2.4.2Performanceinthelowdimensionalcase................19 2.4.3Performanceinthehighdimensionalcase................1 9 2.4.4Comparinggraphswithorwithoutcovariates.............2 2 2.4.5Comparingdirectedandundirectedgraphs............... 23 2.5ApplicationtoRealData.............................24 2.6Discussion.....................................28 Chapter3IntegrativeAnalysisofGeneticalGenomicsDataI ncorporat- ingNetworkStructure ......................... 30 3.1Introduction....................................30 3.2StatisticalMethodsandEstimation...................... .34 3.2.1TheModel.................................34 3.2.2Estimationofgeneticeects.......................35 3.2.3Networkconstrainedregularization...................3 6 3.3TheoreticalProperties.............................. 37 3.4Simulation.....................................39 3.5ApplicationtoRealData.............................42 3.6Discussion.....................................46 vii Chapter4GraphConstrainedRegularizationforNonparamet ricInstru- mentalVariablesRegressioninGeneticalGenomicAnalysis .51 4.1Introduction....................................51 4.2NonparametricModelsandGraphConstrainedRegularization... .....55 4.2.1TheModelSetup.............................55 4.2.2FirstStepEstimation...........................55 4.2.3SecondStepEstimation.........................56 4.3TheoreticalResults................................6 0 4.4Simulation.....................................62 4.5ApplicationtoRealData.............................67 4.6Discussion.....................................74 Chapter5Conclusionandfuturework ...................... 75 APPENDICES ...................................... 77 AppendixAProofofTheorem1............................ 78 AppendixBProofofTheorem2............................ 84 AppendixCProofofTheorem3............................ 89 BIBLIOGRAPHY .................................... 98 viii LISTOFTABLES Table2.1Simulationsetupinthehighdimensionalcasewithxed ......21 Table2.2Simulationsetupinthehighdimensionalcasewithvarying p ,q and .22 Table2.3Estimationcomparisonwithandwithoutadjustingforthema rker eectsunderderentscenarios.....................24 Table2.4EstimationcomparisonbetweenPCandcaPCwhen q>p ......24 Table2.5Estimationcomparisonbetweendirectedandundirectedgr aphs...25 Table3.1Simulationresultsforp=100,q=100,n=600............ .41 Table3.2Simulationresultsfor p =100, q =100, n =200 1400andnumSNP=444 Table3.3Simulationresultsfor p =600, q =600and n =300.........45 Table3.4Listoftop10selectedgenesforeachoftheIVGCandIVme thod.46 Table3.5thegenesinpathwayofMetabolismofxenobioticsbycytoch rome P450-Homosapiens..........................49 Table4.1Simulationresultsfor p =100, q =100and =0...........64 Table4.2Selectionratefor p =100, q =100, n =800and =0........65 Table4.3Simulationresultsfor p =100, q =100and =0 :25.........65 Table4.4Selectionratefor p =100, q =100, n =800and =0 :25......68 Table4.5Listoftop10genesbasedonthestabilityselectionrate... ....72 ix LISTOFFIGURES Figure2.1Performanceofthetwo-stageestimationalgorithmunde rderent parameterandsamplesizesettings.Thecirclesrepresentthespa rse graphestimationresultswith E [N ]=2andthetrianglesrepresent thedensegraphestimationresultswith E [N ]=5.Theupperand lowerboundsofthe95%condenceintervalsarealsoshownwhich aredculttoseparateunderlarge n ..................20 Figure2.2BoxplotofTPRandFPRunderderentsettingswhen q =100...21 Figure2.3BoxplotofTPRandFPRunderderentsettingswhen q =500...22 Figure2.4BoxplotofTPRandFPRunderderentsimulationsettingsw ith varying q and ..............................23 Figure2.5TheestimatedCPDAGinthecase(left)andcontrol(right )group basedonthetop100genes.Thoseoverlappednodesindicatedual regulationdirections...........................26 Figure2.6TheestimatedCPDAGinthecase(left)andcontrol(right )group basedongenesmappedtotheAlzheimer'sdiseasepathwayfromthe KEGGdatabase.Thoseoverlappednodesindicatedualregulation directions.................................27 Figure3.1Illustrationoftheinstrumentalvariablesmodelingenetic algenomics analysis..................................34 Figure3.2Simulationresultsforxed p =100and q =100butvarying n (200 to1400).................................43 Figure4.1Simulationresultsfor p =100, q =100and =0...........66 Figure4.2Simulationresultsfor p =100, q =100and =0onestimation performance...............................67 Figure4.3Simulationresultsfor p =100, q =100and =0onestimation performance...............................68 Figure4.4Simulationresultsfor p =100, q =100and =0 :25.........69 Figure4.5Simulationresultsfor p =100, q =100and =0 :25onestimation.70 x Figure4.6Simulationresultsfor p =100, q =100and =0 :25onestimation.71 Figure4.7Graphicalstructureofthetop10genesselectedbyIVG C.......73 Figure4.8TheestimatedcoecientfunctionsforgeneCYP2E1andU GT2A2.73 xi Chapter1 Introduction 1.1Overview Inthisdissertation,westudytheestimationandapplicationofgrap hicalstructuresinge- neticalgenomicsanalysis.Generegulatorynetworkscontainabun dantinformationabout thefunctionsofgeneexpressionsandthemechanismsofgenereg ulations.Inhumanbody, hundredsofthousandsgenesareexpressedandinteractwithea chothertoachieveallnec- essaryfunctionsincells.Theyareexpressedatpropertimeandins ucientamountto determinewhenandhowproteinsaresynthesized.Iftheenvironm entalfactorsarechanged, likeatoxicsubstanceappears,dysfunctionsofgeneexpressionm ayoccur.Theexpressions ofasetofgenesmaybesuppressedorenlarged.Asaresult,theg eneswhoseexpression areaectedbythatsetofgenesmaybealsodysfunctional.Finally, adiseasemayappears andhencethehealthofhumanbodyisthreatened.So,itisalwaysan interesttolearnthe truearchitectureofgeneregulatorynetworksandtoapplythek nowledgetofacilitateother researchesandimprovepublichealth.Inthischapter,werstpro videsomebackgroundon geneticalgenomicsanalysisandthetheoryofgraphanditsvarious applicationsin1.2and 1.3.Tosolvetheproblembroughtbygene-environmentinteraction ,varyingcoecientmodel anditsadvantageandexibilityareintroducedin1.4.Theobjectives andorganizationof thedissertationarein1.5. 1 1.2Integrativegeneticalgenomicsanalysis DNAmicroarraytechnologyisusedtomeasuretherelativeabundan ceoflargeamountof genessimultaneously.Inthepast,scientistscouldonlyanalyzeafe wgenesatoncebecauseof thelimitationofmeasurementtechnology.DNAmicroarrayoerspe oplegreatopportunities tounderstandthefunctionsofgeneexpressions.Recentdevelo pmentsinnextgeneration sequencingoerfaster,moreaccurate,andcheapertechnolog ytomeasuregeneexpression andgeneticvariantswithdigitalresolution.Alloftheprogressofm easurementtechnology providespromisingopportunityintheeldofgeneticalgenomicsana lysis. Regressionmodelsarecommonstatisticalmodelsusedtoexploreth erelationshipofgene expressionandaphenotypictrait.Aphenotypictraitcanbeadisea seorabodyfeature. However,thedimensionalityofgeneexpressionfeatures( ˇ 40 K )istypicallylarge.Classical statisticalmethodsdonotworkinthissituation.Sincethepioneerin gworkofLASSOby Tibshirani(1996),variableselectioninpenalizedregressionshasbe encommonlyadopted tosolvethelarge p small n problem.Manyregularizationmethodshavebeendeveloped sincethen,includingSCAD(FanandLi,2001),adaptiveLASSO(Zoua ndHastie,2005), groupLASSO(YuanandLin,2006),andMCP(Zhang,2010).Withth evariableselection technique,potentialcandidategenescanbescreenedforfurth erbiologicalvalidation. Whileusinggeneexpressiontopredictthephenotypeofinterest,w ecanalsostudy therelationbetweengenotypeandgeneexpression.Ithasbeena longhistorythatpeople trytondrelationshipbetweenquantitativetraitsandgeneticvar iants.Quantitativetrait loci(QTL)mappinghasbeenusedtondgeneticvariantswhichgove rnthephenotypic variationofacomplextrait,suchasbloodpressureorgrainyield.In QTLmappingstudies, geneticmarkerssuchassinglenucleotidepolymorphisms(SNPs)can beusedtopinpointthe 2 locationofgeneticvariantsthataectthetraitofinterest.Since thehugesuccessofQTL mapping,expressionquantitativetraitloci(eQTL)mappinghasbee nextensivelystudied. IneQTLmapping,expressiontraitsaretreatedasquantitativetr aits.Thus,onecanstudy thegeneticarchitectureofgeneexpressionandfurtherdissect thegeneregulationmechanism suchascis-andtrans-regulation.IftheeQTLisintheregionoforc losetothegenewhose expressionistheinterestedquantitativetrait,sucheQTLiscalledc is-eQTL.IftheeQTL isfarfromthegenewhoseexpressionistheinterest,itiscalledtran s-eQTL.Studieshave shownthatcis-regulationisusuallystrongerthantrans-regulatio nandmuchworkhasbeen donetostudycis-andtrans-eQTL(e.g.,Giladetal.,2008). eQTLmappingisalsocoinedasgeneticalgenomicsanalysissinceitcomb inesgenetic variantswithgeneexpressiondatatogether.Withthedevelopmen tofhumanhapmapand the1000genomeprojects,millionsofSNPvariantshavebeenident ed.Thusithasbe- comefeasibletopinpointspeccvariantslocatedinageneregionorin ter-genicregionthat regulategeneexpressions.Geneexpressiondatahavebeenbroa dlyappliedtostudygenenet- works.Forexample,basedonthecorrelationofgeneexpressions ,onecaninferthenetwork structureusingtechniquessuchasBayesiannetwork(Friedmane tal.,2000)orGaussian graphicalmodel(Dobraetal.,2004).However,withoutadditional biologicalinformation, suchestimatednetworksareundirected.Giventhatgeneregulat ionsareoftendirected,a directednetworkinferenceshouldmakemorebiologicalsenseandb emoreinformative.In addition,twoindependentgenescouldshowfalsecorrelationifthey shareacommongenetic regulator.Thecorrelationshouldbegoneifthecommongeneticee ctisadjusted(Caietal. 2013).Thismotivatesustodevelopadirectedgraphestimationpro cedurewhileadjusting forthegeneticeects. Afterformulatingmodelstodescribetherelationbetweenphenoty peandgeneexpres- 3 sionandtherelationbetweengeneexpressionandgenotype,wear einterestedinhowto incorporategeneregulatorynetworksintotheanalysis.Forthisp urpose,weintroducesome backgroundknowledgeontheestimationandapplicationofgraphica lstructuresinthenext section. 1.3Estimatioinandapplicationofgraphicalstructures Generegulatorynetworkisagraphicalstructurewhichcanbelear nedeitherbyexperimental methodsorbystatisticalmethods.Itisusedtodescriberegulato ryrelationsbetweengenes. Ingeneral,agraphoranetworkconsistsofnodesandedges.Ifw euse G todenoteagraph, V todenotethesetofthenodes,and E todenotethesetoftheedges,wecanwriteagraph as G =( V;E ).Here,anodeisageneexpressionandanedgeisacorrelationbetw eentwo genes.Towkindsofgraphsarestudiedorappliedinthisdissertation ,directedgraphand undirectedgraph. Derentkindsofgraphicalstructuresaredenedintheliteratur e.Let X =( X 1 ;:::;X p ) followamultivariatenormaldistribution N ( ; ) ;=( ˙ ij ) ;= 1 =( ! ij ).Atleast threegraphicalstructurescanbedened. 1.Structuredcovariancematrix.Forexample,itisassumedclosed ataintimeare stronglycorrelatedinlongitudinalananlysis. 2.Gaussiangraphicalmodel. X i ?? X j jf X t g t 6= i;j () ! i;j =0. 3.Causalrelation(directedgraph)canbedenedifotherassump tionsareassumed. Therstoneisusedtoinferthecorrelationrelationsbetweengene s.Thesecondisused toinfertheconditionalcorrelationrelationsbetweengene.Thelas toneisusedtodocausal inference.alotofliteraturescanbefoundaboutthesetopics(Bic kelandLevina,2008a, 4 2008b;Rothman,LevinaandZhu,2009;LamandFan,2009;Meinsh arsenandBuhlmann, 2006;Friedman,HastieandTibshirani,2007;Cai,Li,LiuandXie,2013 ). Aftergeneregulatorynetworksarelearned,thenextchallengeis howtoapplythemas priorknowledgetohelpfurtheranalysis,forinstance,studyingth erelationsbetweengene expressionsandtraitsofinterest.Thefollowinglistsfourregulariz ationmethodswhichtake advantageofthegenegraphicalstructuresinvariableselectiona ndestimation: 1.FusedLASSO(Tibshirani,etal.,2005.): p ( 1 ; 2 )= 1 k k 1 + 2 P j j j j 1 j 2.GRACE(LiandLi,2008): p ( 1 ; 2 )= 1 k k 1 + 2 P i ˘ j ( i p d i j q d j ) 2 3.GFlasso(KimandXing,2009): p ( 1 ; 2 )= 1 k k 1 + 2 P i ˘ j j i sign(^ r ij ) j j 4.GOSCAR(Yangetal.,2012): p ( 1 ; 2 )= 1 k k 1 + 2 P i ˘ j max( i ; j ) where p ( 1 ; 2 )isapenaltytermand 1 ; 2 aretuningparameters.Intheseworks,derent penaltiesareimposedtodealwithderentgraphicalstructures. ThefusedLASSOrequiressmoothnessontheeectsofadjacent variables.Thiscanbe achievedbyorderingthevariablesproperly.GRACEusesthespec cgraphicalstructure andencouragegreatereectsof\hub"geneswhicharethosewit hmanyconnectionsto othergenes.GFlassoissimilartotheideaoffusedLASSOexceptitus esnetworkstructures ratherthanadjacentsmoothness.GOSCARisusedtoencourage theeectsoftwoconnected genestobeexactlythesame.Thederencebetweenabsolutepen altyandsquaredpenalty isabsolutepenaltyencouragestwocoecientstobeexactlythesa mewhilesquaredpenalty doesnot.Thesesuccessfulapplicationsofgraphsinspireustoap plynetworkinformationin ourwork.InChapter3andChapter4,aparametricgraphicalstr uctureandanonparametric graphicalstructureareusedforinstrumentalvariablesregres sionmodel. 5 1.4Varyingcoecientmodelcapturinggene-environment interaction Geneexpressionsarethedirectproductsofgenefunctions.How ever,mostgenefunctions areinuencedbyenvironmentalfactors,resultingintheso-called gene-environment(G ) interactionswhicharebroadlystudiedinliterature.Modelingtheee ctofanenvironmental factorasafunctionbringsexibilitytocaptureeitherlinearornon- linearmodulationeect. AtypicalG Einteractionmodelassumesalinearinteractionstructurebetwee nagenetic variable( G )andenvironmentalvariable( X ),i.e., Y = 0 + 1 X + 1 G + 2 XG + = 0 + 1 X +( 1 + 2 X ) G + : AsdemonstratedinMaetal.(2011),suchasimpleregressionmodel couldeasilymissany interactionsignalsthatdonotshowalinearstructure.Theauthor sproposedthefollowing varying-coecient(VC)modelwhichcancapturethenonlinearmod ulationeect, Y = ( X )+ ( X ) G + : where ( X )and ( X )representtheinterceptandinteractionfunctionwhichareassu med tobesmooth.NonparametricestimationtechniquessuchasB-splin ecanbeappliedto estimatethefunctions.Suchanonparametrictreatmenthasmuc hexibilitytocapturethe realbiologicalfunctionofenvironmentalforces.AsfortheVCmo delling,vastliterature canbefoundonestimationorvariableselection(e.g.,HastieandTibsh irani,1993;Wang 6 andXia,2009;Fengetal.,2011;Tangetal.,2012).Littleworkhasbe endoneongenetical genomicsanalysiswhileincorporatingtheeectofenvironmentalfa ctors. 1.5Objectivesandorganization Inthisdissertation,weproposeagraphestimationmethodtolearn directedgraphicalstruc- turesandtwographapplicationmethodsusingnetworkstructure saspriorknowledgeto improvetheperformanceofvariableselection.Bothparametrican dnonparametricmodels andmethodsareconsidered. PCalgorithmhasbeenoneofthemostpopularmethodsinestimatingd irectedgraphs. However,covariateeectsarenotconsidered.Intheestimation ofundirectedgraphs,two connectedgenesmaysharecommonregulators(e.g.,SNPs).Thet wogenesmaybeinde- pendentiftheeectofcommonregulator(s)isadjusted.Thesam ephenomenonappliesto theestimationofdirectedgraphs.MotivatedbytheworkofCaiet al.(2013),weproposea covariate-adjustedPCalgorithminchapter2whichisusedtolearnt heregulatorynetworks ofgeneswhileadjustingfortheeectsofgeneticvariants.Weest ablishtheestimationand selectionconsistencyoftheproposedtwo-stepestimationalgorit hm. Inageneticalgenomicsanalysistoevaluatethecorrelationbetwee nphenotypictraitsof interestandgeneexpressions,oneusuallyassumesthattheerro rtermisindependentofthe predictors.However,thisisnotalwaystrue.Endogeneityinwhichs omepredictorscorrelate withtheerrormayarise.Instrumentalvariables(IV)regression isacommonwaytosolvethis problem.Linetal.(2015)proposedanIVregressionframeworkinw hichgeneticvariants aretreatedastheinstrumentalvariablestoeasethecorrelation problem.Theauthors showedthesuperiorperformanceofthevariableselectionmethod viaincorporatingtheIV 7 regression.Asweillustratedpreviously,incorporatinggenenetwo rksinvariableselection couldpotentiallyimprovetheperformanceofgeneselection.Thus, weproposeanetwork constrainedregularizationmethodinaIVregressionframeworkan dapplythemethodto anintegrativeanalysisofgeneticalgenomicsdatainchapter3.Wee stablishtheestimation andselectionconsistencyoftheproposedmethod. Asweillustratedpreviously,environmentalfactorscouldpotentia llymodulategenetic eectsongeneexpressionandfurtheraectthephenotypicvar iation.Forexample,the amountofexposuretoheavymetalcouldmodifygeneticeectong eneexpressionand furtherleadtoincreasedriskofneurovegetativediseases.Vary ing-coecientmodelisa naturalchoicetodescribeanypotentialnonlinearmodulationeec t.Builtuponthework inchapter3,weextendtheIVregressionmodeltoincorporateth enonlinearmoderation eectofanenvironmentalfactorinchapter4.Thevaryingcoec ientfunctionsaremodelled throughnonparametrictechniquessuchasB-splineandnetworkc onstrainedregularization isalsoconsideredtoimprovetheperformanceofgeneselection.We alsoestablishthe estimationandselectionconsistencyoftheproposedmethod. Weconductextensivesimulationstudiestoevaluatetheperforman ceofthethreemethods andcomparewithexistingones.Theutilityofthemethodsisdemonst ratedthroughreal dataanalysis.Conclusionandfutureworkarefollowedinchapter5. Technicalproofsare giveninAppendices.Thedissertationcontributestothedevelopme ntofgraphestimation andvariableselectioninstatisticaltheoryandmethod,andtothea nalysisofgenetical genomicsdatainapplication. 8 Chapter2 Learningdirectedacyclicgraphical structureswithgeneticalgenomics data 2.1Introduction Thepastdecadeshavewitnessedenormousmethodologydevelopm entsingenenetworkin- ferenceusinggeneexpressiondata(e.g.,CheungandSpielman,200 2;Schadtetal.,2003). Suchnetworksweredevelopedmostlybyassessingthepairwisecor relationofgeneexpres- sions.Fromastatisticalpointofview,undertheassumptionthatt hejointdistribution ofthegeneexpressionsofinterestisamultivariatenormaldistribu tion,suchnetworkscan beconstructedbyassessingthenonzeroelementsoftheinverse covariancematrix,theso calledprecisionmatrixorconcentrationmatrix.Assume X =( X 1 ;:::;X p ) ˘ N (0 ;)and = 1 .X i ?? X j jf X t g t 6= i;j () ! i;j =0,where ! i;j 'sarethe( i;j )thentriesin. Thatis, X i and X j areconditionallyindependentifandonlyif ! i;j =0.Thusonecan inferthepositionsofzeroentriesintoinferthegraphicalstruct ure,i.e.,theconditional independenciesbetweencomponentsin X .However,whenthedatadimensionislarger thanthesamplesize,theinverseofthesamplecovariancematrices arenotavailable.When 9 consideringsuchnetworksasundirectedgraphs,therehasbeen hugestatisticalinterestsin developingGaussiangraphicalmodelsassumingsparsityoftheprec isionmatrix,tonamea few,Friedmanetal.(2008),Caietal.(2013),Meinshausenetal.( 2006),andYuanand Lin(2007).Undercertainassumptions,thepositionsofzeroentr iesintheprecisionmatrix indicateconditionalindependencebetweenvariablesofstudy. Wheninferringnetworkonlybasedongeneexpressions,twogenes couldshowcorre- lationinexpressionsimplybecausetheyshareacommonregulator,w hiletheyshouldbe independentconditioningonthecommonregulator.Motivatedbyth is,afewmethodology developmentshavebeenfocusedoncovariates-adjustedgraph icalmodelsbyusinggenetic markersascovariatestocorrectbothfalsepositivesandfalsene gatives(e.g.,LeeandLiu, 2012;YinandLi,2011,2013;Caietal.,2013).Intheirestimationpr ocedures,theeectof geneticvariantsisestimatedintherststep.Then,thegraphical structureisestimatedin thesecondstepwhileadjustingforthegeneticeects.Theirsimula tionresultsandrealdata analysisshowedthatgraphestimatesweresubstantiallyimprovedw hentakingthegenetic eectsintoaccount.Thesemodelsareallfocusedontheestimatio nofundirectedgraphs withoutconsideringdirectioninformation.Thedirectedgraphs,ho wever,aremoreattrac- tiveinthesensethatoftenpeopleareinterestedinthecausalrela tionships,thatis,interests arenotonlyfocusedonwhethertwovariablesareconditionallyindep endentbutalsoifone causestheotherone.Inreallife,geneexpressionsareoftenth eresultsofdirectedregula- tions.Thus,theestimationofdirectedgraphsshouldprovidemore biologicalinformation forfurtherfunctionalinvestigation.Geneticalgenomicsdatapr ovidesoptimalresourcesto learnsuchdirectedgraphicalnetworks. Intypicaldirectedacyclicgraph(DAG)inference,oneprobabilityd istributionhasa setofcorrespondingDAGswhichareMarkovequivalentundercert ainassumptions.The 10 directionsoftheedgesinthegraphsindicatecausalrelations.Stu diesondirectedgraphs havebeenourishedinliteraturebothintheoryandinapplication(e.g .,Andersonetal., 1997;RichardsonandSpirtes,2002;Alietal.,2009).Themostcom moncaseistheGaussian case.Let Y =( Y 1 ;:::;Y p )bea p -dimensionalGaussianrandomvector,andeachcoordinate representsageneexpression.Toinfertheregulatoryrelationsh ipofthese p geneexpressions, thePC-algorithmdevelopedbySpirtesetal.(2000)canbeappliedto constructacompleted partiallyDAG(CPDAG).TheCPDAGcanuniquelyrepresentallDAGsco rrespondingto thecommondistributionof Y ,andcanberepresentedastheregulatorynetworksofthe geneexpressions.KalischandBuhlmann(2007)laterprovedthat thegraphobtainedfrom thePC-algorithmisconsistentundersomeconditions,eveninhighdim ensionalcases. DirectedgraphsconstructedwiththePC-algorithmaremorebiolog icallyrelevantcom- paredtoundirectedgraphs.However,asmentionedbefore,iftw ogenesshareacommon regulator,theirexpressionrelationshipcouldbecomplicatedoreve ndistortedwithoutcon- sideringtheunderlyinggeneticstructure.Whenadjustingforthe genotype(i.e.,thecommon regulator)eects,thetworelatedgenescouldbeindependentor viceversa.Thismotivates ustodevelopacovariate-adjusteddirectedgraphicalmodelbye xtendingtheCPDAGes- timationproceduretoaregressionframework.Wetreatgeneticm arkersascovariatesand developatwo-stageestimationmethodwhichcombinesthepenalized estimationandPC- algorithmtoestimatethemarker-adjustedCPDAGstructure. Inthefollowingofthepaper,werstprovidesomebackgroundkno wledgeaboutgraphical modelsandtheformulationofourmodelandthetwo-stageestimat ionprocedureinSection 2.2.TheoreticalresultsaregiveninSection2.3.Simulationsandreal dataanalysisaregiven inSection2.4and2.5tojustifyourmethodfollowedbydiscussionsinSe ction2.6.Allthe proofsofthetheoreticalresultsarerenderedintheAppendixA. 11 2.2ModelandEstimation 2.2.1Background Let G =( V;E )beagraphwhere V representsthesetofverticesand E representstheset ofedges.Adirectedgraphisagraphinwhicheachedgehasadirectio nrepresentedby a\ ! "sign.Adirectedacyclicgraph(DAG)isadirectedgraphwithnodirec tedcycles. Acriteriacalled d -separationisusedtoreadconditionalindependencerelationships froma DAGstructure(Pearl,2000).Othercriteriacanbeusedtogener ateconditionalindepen- dencerelationships.Thereasonwhy d -separationisparticularlyusefulisthatitisrelatedto causalinference(Pearl,2000;Spirtesetal.,2000)andforthepu rposetoconstructdirected graphs. Givenarandomvector,itscoordinatescanbeviewedasthevertice sofagraph G = ( V;E ).AprobabilitydistributioniscalledfaithfultoaDAGifandonlyiftheco nditional independencerelationshipsindicatedbythedistributionareasthes ameasthoseobtained via d -separation.MultipleDAGsmayrepresentthesamesetofcondition alindependence relationships.TheDAGstowhichaprobabilitydistributionisfaithfula reMarkovequivalent andcomposeaMarkovequivalenceclass.TwoDAGsareMarkovequiv alentifandonlyif theyhavethesameskeletonandthesamev-structures(Fryden berg,1990;VermaandPearl, 1990,1992).TheskeletonofaDAGisthegraphwherealldirectede dgesarereplacedwith undirectededges.Av-structureisatriple(v 1 ,v 2 ,v 3 )havingthestructurev 1 ! v 2 v 3 .AMarkovequivalenceclassofDAGscanbedescribedbyauniqueCPDA G(Chickering, 2002).ACPDAGistheunionoftheDAGsinaMarkovequivalenceclassin thesensethata directededgeexistsifthatedgeexistsineachDAGandanundirecte dedgeexistsifthatedge hasderentdirectionsintwoDAGs.PC-algorithm(Spirtesetal.,200 0)isaneectiveand 12 ecientalgorithmtoestimateCPDAGs.Thereasonwhyweestimatet heCPDAGisthat multipleDAGscorrespondtoaprobabilitydistributionwhileonlyoneCPD AGcorresponds toaprobabilitydistribution.AfteraCPDAGisobtained,wecanexten dittotheDAGs byaddingdirectionsontheedgessuchthatnodirectedcyclesareg enerated.Ithasbeen proventhatPC-algorithmcangenerateaconsistentestimateund ercertainassumptions, eveninhighdimensionalcases(KalischandBuhlmann,2007). 2.2.2Covariate-adjustedgaussiangraphicalmodel Consideramulti-responselinearregressionmodelwith p responsesand q covariates.Suppose wehave n independentobservations( y i ;x i ), i =1 ;:::;n ,where y i =( y i 1 ;:::;y ip ) T and x i =( x i 1 ;:::;x iq ) T .Weassumethatthedataarecenteredandstandardized.Dene Y = ( y 1 ;:::;y n ) T and X =( x 1 ;:::;x n ) T .Therelationshipbetweenthemultivariateresponse Y andthepredictors X canbedescribedbythefollowingregressionmodel, Y = XB + E where B =( B 1 ;:::;B p )isa q p coecientmatrixand E =( 1 ;:::; n ) T istheerrorterm.We assumethattheerrors i =( i 1 ;:::; ip ) T ,i =1 ;:::;n ,are i.i.d. randomvariablesfollowing amultivariatenormaldistribution N (0 ;).Let y j =( y 1 j ; ;y nj ) T bethe j -thresponse vector( j =1 ;::;p )and x l =( x 1 l ; ;x nl ) T bethe l -thpredictorvector( l =1 ;::;q ).In ageneticalgenomicsframework, X representtheSNPmarkersand Y representthegene expressions.OurinterestistoestimatetheCPDAGstructureof Y ,whileadjustingfor theeectof X .Conditionalon X ,weexpectthatfalsepositivesand/orfalsenegativesof connectionscouldbereduced. 13 2.2.3Atwo-stageestimationprocedure Without X ,thePC-algorithmcanbeappliedtoinfertheCPDAGstructures.In themodel describedabove,theconditionalmeanof Y changesas X varies.ThusthePC-algorithm cannotbedirectlyappliedsinceitisonlyforamodelwithaconstantme an,i.e., X = ( X 1 ;:::;X p )followsamultivariatenormaldistributionwithmean andcovariancematrix .Toovercomethedculty,weproposeatwo-stageestimationpr oceduretoestimatethe meanfunctionandtheCPDAGstructure.Intherststage,wees timatethecoecient matrix B .Weimplementapenalizedestimationprocedurewithrespecttoeach responsein Y viasolvingthefollowing p optimizationproblems, ^ B j =argmin B j 1 2 n tr f ( y j XB j ) T ( y j XB j ) g + j;n k B j k 1 ;where j;n ;j =1 ;:::;p; aretuningparameters,and kk 1 referstothe L 1 norm.Theestimate of B isdenotedas ^ B =( ^ B 1 ;:::; ^ B p ).Let ^ E = Y X ^ B .Inthesecondstage,thePC-algorithm isappliedto ^ E toestimatethedirectedgraphicalstructurecorrespondingtot heprobability model N (0 ;).Thetwo-stageestimationalgorithmistermedasthecovariate- adjusted PC-algorithm(caPC). Ingeneral,thecaPCalgorithmcontainstwosteps.Therststepis toestimatetheskele- tonandthesecondstepistoextendtheskeletontotheCPDAG.In formationobtainedfrom therststepisusedtodeterminethedirectionsoftheedgesinthe secondstep.Ifwehave theperfectknowledgeabouttheconditionalindependencerelatio nships,thePC-algorithm canreturnuswiththecorrectskeletonintherststep(Spirtese tal.,2000).However,we donothaveperfectknowledgeinpractice.Therefore,weusesam plestoestimatetheskele- tonbyimplementingatestingprocedure.Fortestingtheconditiona lindependence,Kalisch 14 andBuhlmann(2007)appliedestimatedpartialcorrelationsandFis hertransformationto constructatestingstatistic.TheyprovedthatthePC-algorithm couldgenerateconsistent estimatesundercertainassumptionseveninthehighdimensionalsit uations.Tomakethe paperself-contained,wealsoincludedthePCalgorithminthesupplem entalle. 2.3Theoreticalresult Inthisworkweallowthenumberofresponsestoincreasealongwitht hesamplesize,i.e., p isafunctionof n .Thenumberofmarkercovariates q isxed.Thisisavalidtreatment aswecandoaneQTLmappingstudytorstndtheeQTLsfortheex pressionresponses, thenxtheminthemodel.Belowweshowthatthetwo-stageestimat ionmethodgenerates consistentestimateofaCPDAG.Toshowtheestimationconsistenc y,weadoptthesetupof KnightandFu(2000)andKalischandBuhlmann(2007).Thedimensio nofamultivariate responsevariableisdenotedas p ( n )andaDAGisdenotedas G = G n .Conditionsand remarkscanbefoundintheAppendixA. Let ˆ n ;i;j bethecorrelationbetween i and j .Let ˆ n ;i;j jk bethepartialcorrelation between i and j given f r ;r 2 k g ,where k isasubsetof f 1 ;:::;p ( n ) gnf i;j g .^ ˆ n ;i;j and ^ ˆ n ;i;j jk arethecorrespondingestimatedones. Toestablishtheestimationconsistency,weneedthefollowinglemmas .Lemma1. Assumethatthedistribution P n isnormallydistributed,wehave ^ ˆ n ;i;j ( ^ B ) ^ ˆ n ;i;j ( B )= o p (1) ;8 i;j =1 ;:::;p: 15 Lemma2. Assumethatthedistribution P n isnormallydistributed,wehave ^ ˆ n ;i;j jk ( ^ B ) ^ ˆ n ;i;j jk ( B )= o p (1) ;8 i;j =1 ;:::;p: Lemma3. Assumethatthedistribution P n isnormallydistributedand sup n;i;j; k j ˆ i;j jk j M< 1 .Then,forany > 0 ,wehave sup i;j; k 2 K m n i;j P ( j ^ ˆ n ;i;j jk ( ^ B ) ˆ n ;i;j jk j > ) C 1 ( n 2 m n ) exp(( n 4 m n )log( 4 2 4+ 2 )) forsomepositiveconstant C 1 . Lemma1showsthatthederencebetweentheestimatedcorrelat ioncoecient^ ˆ n ;i;j giventheestimatedcoecientmatrix ^ B andtheestimatedcorrelationcoecientgiventhe truecoecientmatrix B is o p (1).Lemma2statesthatthederencebetweentheestimated partialcorrelationcoecient^ ˆ n ;i;j jk giventheestimatedcoecientmatrix ^ B andtheesti- matedpartialcorrelationcoecientgiventhetruecoecientmatr ix B is o p (1).Lemma3 showsthatthederencebetweentheestimatedpartialcorrelat ioncoecient^ ˆ n ;i;j jk given theestimatedcoecientmatrix ^ B andthetruepartialcorrelationcoecient ˆ n ;i;j jk is bounded.TheproofsofthethreelemmasarerelegatedintheSupp lementalle. Dene n asthesigncancelevelfortestingthesigncanceofthepartialc orrelation afterFisher'stransformation,i.e.,fortesting H 0 :ˆ i;j jk =0.Withtheaboveassumptions, wehavethefollowingtheorem. Theorem1. Dene ^ G ( n ) astheestimatedCPDAGusingthetwo-stageestimationproce - 16 dureand G isthetrueCPDAG.If j;n = o ( n ) ;j =1 ;:::;p ,andConditions(C1)-(C6)are satised,thenthereexists n suchthat P ( ^ G ( n )= G )=1 O (exp( cn 1 2 d )) ! 1 ;as n !1 where c isapositiveconstantand d> 0 isasinCondition(C6). TheconvergentrateoftheCPDAGestimationisthesameasthatinT heorem2in KalischandBuhlmann(2007).Thismeanstheerrorsoccurredinth erststagedonothave agreateectonthesecondstageoftheestimationofgraphicals tructures.Theproofofthe TheoremcanbefoundintheAppendixA. 2.4Simulation 2.4.1Simulationdesign Wedidextensivesimulationstoassesstheestimationperformance. Wefollowedthesimu- lationsetupinYinandLi(2013)andKalischandBuhlmann(2007).To generatetheerror term,webeganwithanadjacencymatrix A p p lledwithzeros.Then,everyentryinthe lowertrianglewasreplacedwithindependentrealizationsofaBernou llirandomvariable withsuccessprobability p A (0

p .SimilarpatternthatcaPC 22 set1set2set3set4set5set6 0.50.60.70.8 TPRset1set2set3set4set5set6 0.0040.0080.012 FPRFigure2.4:BoxplotofTPRandFPRunderderentsimulationsettings withvarying q and .outperformedPCwasobserved. 2.4.5Comparingdirectedandundirectedgraphs Wedidsomesimulationsfollowingtocomparetheperformanceofthed irectedandundirected graphestimation.Herewechoose p =50 ;100 ;200, q =2 p and n =250.Asbefore,wechose E [N ]=2, =3.5.TheresultsaresummarizedinTable5.WeappliedgraphicalLASS Oto estimatetheundirectedgraphicalstructure.Wecanseethatun directedmethodachievesa higherTPR,butthedirectedmethodachievesalowerFPRandsmaller SHD.HereSHD isusedtomeasuretheskeletonofthegraphsinceundirectedgrap hsdonothavedirection information.ThereasonwhyundirectedmethodisbetteronTPRisb ecause,intheory, directedgraphissparsethanundirectedgraph.Sooverall,thedir ectedgraphperforms betterthantheundirectedmethodonbothFPRandSHD,butsue rsalittlebitonTPR. 23 Table2.3:Estimationcomparisonwithandwithoutadjustingforthem arkereectsunder derentscenarios. MethodTPRFPRSHD p =25, q =10, n =250, E [N ]=2, =3.5caPC0.76810.001111.5 PC0.66980.038025.9 p =50, q =50, n =250, E [N ]=2, =4caPC0.74650.002126.7 PC0.56800.019755.3 p =100, q =100, n =250, E [N ]=2, =3caPC0.74030.001752.4 PC0.67970.005981.7 p =400, q =200, n =250, E [N ]=2.5, =20caPC0.76780.0054648.6 PC0.23460.01381504.9 p =800, q =200, n =250, E [N ]=1.5, =25caPC0.78190.00521948.1 PC0.13780.00823175.9 p =1000, q =200, n =250, E [N ]=1.5, =20caPC0.79020.00432543.4 PC0.27370.00653874.4 Table2.4:EstimationcomparisonbetweenPCandcaPCwhen q>p MethodTPRFPRSHD p =100, q =200, n =200, E [N ]=2, =3caPC0.81760.008684.0 PC0.72640.0161129.8 p =200, q =400, n =200, E [N ]=2, =20caPC0.66120.0107324.9 PC0.26450.0177516.4 p =400, q =800, n =200, E [N ]=2, =20caPC0.57040.0076866.7 PC0.29840.01001138.3 2.5ApplicationtoRealData Weappliedourproposedtwo-stageestimationmethodtoanAlzheime rdiseasedataset (Websteretal.2009).Followingthepaper,176Alzheimercasesand 187controlswere includedinouranalysis.FortheSNPs,thosewithlowgenotypingcallr ate(lessthan90%), lowminorallelefrequency(lessthan5%)andthosefailedtheHardy-W einbergequilibrium testincontrolgroup(p-value < 0.001)wereremoved.Aftertheseoperations,around332,000 SNPmarkerswereleftforfurtheranalysis.Weusedtheresiduale xpressiondataafter adjustingfortheeectsofseveralcovariatessuchasgender, APOEstatusandageatdeath 24 Table2.5:Estimationcomparisonbetweendirectedandundirectedg raphs MethodTPRFPRSHD p =50, q =100, n =250, E [N ]=2, =3.5directed0.79040.001312.5 undirected0.87360.036949.9 p =100, q =200, n =250, E [N ]=2, =3.5directed0.76400.001431.9 undirected0.83290.014990.2 p =200, q =400, n =250, E [N ]=2.5, =3.5directed0.78160.001475.1 undirected0.81180.0049137.5 (datacanbefoundathttp://labs.med.miami.edu/myers/LFuN/data ajhg.html).Thereare 8,560geneexpressions.Weimplementedatwosamplet-testforeac hgeneexpressionand selectedtop100signcantgeneexpressionstolearntheirDAGstr ucture.Foreachofthe 100geneexpressions,wettedasimplelinearregressionwitheachS NPmarkerasthe predictorandselectedSNPswithp-value < 0 :001.AmongtheseSNPs,wechosethosethat showassociationwithtwoormoregeneexpressions(p-value < 0 :001).Thisendedupwith 3,776SNPsinthecasegroupand3,871SNPsinthecontrolgroupwith 60SNPsincommon. Thenaldatasetusedfortheanalysiscontained100geneexpres sions(asthe Y variable), 3,776SNPsinthecasegroupand3,871SNPsinthecontrolgroup(as the X variables).We appliedourestimatingmethodtothetwogroupsseparately.Ourgo alwastolearntheDAG structureforthe100geneswhileadjustingfortheeectsofSNP markersinthecaseand controlgroupswiththehopetoidentifyderentialDAGstructur esthatcandistinguishthe twogroups.Anyderencesmightpotentiallyexplainthediseaseetio logyofAlzheimer. WelearnedtheCPDAGsatthe =0 :001level.Forthecasegroup,theestimated CPDAGhas101directededges(Figure2.5).Forthecontrolgroup ,theestimatedCPDAG has88directededges(Figure2.5).Thegraphsweregeneratedwit htheRpackage igraph .Inbothgraphs,geneswererepresentedwithnumbers.Thecorr espondinggenenamescan befoundinthesupplementaltable.ThetwoCPDAGsshare18edges incommon.We 25 noticedthatsomesub-CPDAGstructuresshowninthecasegroup werenotpresentinthe controlgroup.Forexample,thegraph70 ! 95 38wereobservedinthecasegroup,while thethreegenesareseparatedfromothergroupsinthecontrol population.Ontheother hand,someCPDAGsshowninthecontrolgroupwerenotpresentin thecasegroup.For example,thesub-graphby17-26-31-32-76isobservedinthecon trolgroupwhilethegenes arelessconnectedinthecasegroup.Suchderentialgraphicals tructuresbetweenthetwo groupsmightindicateregulationheterogeneitybetweenthetwogr oups.Furtherbiological vercationsareneedtovalidatethendings. 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100Figure2.5:TheestimatedCPDAGinthecase(left)andcontrol(righ t)groupbasedonthe top100genes.Thoseoverlappednodesindicatedualregulationdir ections. Asacomparison,weappliedPC-algorithmdirectlytothe100geneswit houtadjusting fortheSNPs'eects.Forthecasegroup,theestimatedgraphh as124edges.Forthecontrol group,theestimatedgraphhas100edges.Thisisconsistentwitho urassumptionthatthe numberofedgesshouldbereducedafteradjustingforthemarke rs'eect.Thisphenomenon 26 hasbeenexplainedanddemonstratedinotherworksfocusingonun directedgraphs(e.g., YinandLi,2011;Caietal.2013). ADAM17 APPNAE1 APBB1GAPDHBACE1 BACE2 PSENENAPH1AAPH1BNDUFV1NDUFV2NDUFV3NDUFA1 NDUFA2 NDUFA3 NDUFA4 NDUFA5 NDUFA6 NDUFA7 NDUFA8 NDUFA9 NDUFA10 NDUFAB1 NDUFA12 NDUFA13 NDUFB1NDUFB2NDUFB4NDUFB5NDUFB6NDUFB7NDUFB8NDUFB9NDUFB10NDUFS2NDUFS3NDUFS4NDUFS5NDUFS6NDUFS7NDUFS8NDUFC1NDUFC2SDHASDHBSDHCSDHDUQCRFS1CYC1UQCRC1UQCRC2UQCRHUQCRBUQCRQUQCR10UQCR11COX1 COX2 COX4I1 COX5A COX5B COX6A1 COX6B1 COX6C COX7A1 COX7A2 COX7A2L COX7B COX7C COX8A ATP5A1 ATP5B ATP5C1 ATP5D ATP5E ATP6 ATP5F1 ATP5G1 ATP5G3 ATP5H ATP5O ATP5J HSD17B10LPLAPOELRP1TNFRSF1ABIDCALM2CALM3CALM1PPP3CAPPP3CBPPP3CCPPP3R1BAD CYCSAPAF1 CASP9CASP3PLCB1PLCB4GRIN1GRIN2CCACNA1C MAPK1MAPK3RYR3 ITPR1ITPR3ATP2A2 ATF6 CAPN1CAPN2CDK5R1CDK5MAPTGSK3BSNCAADAM17 APPNAE1 APBB1GAPDHBACE1 BACE2 PSENENAPH1AAPH1BNDUFV1NDUFV2NDUFV3NDUFA1 NDUFA2 NDUFA3 NDUFA4 NDUFA5 NDUFA6 NDUFA7 NDUFA8 NDUFA9 NDUFA10 NDUFAB1 NDUFA12 NDUFA13 NDUFB1NDUFB2NDUFB4NDUFB5NDUFB6NDUFB7NDUFB8NDUFB9NDUFB10NDUFS2NDUFS3NDUFS4NDUFS5NDUFS6NDUFS7NDUFS8NDUFC1NDUFC2SDHASDHBSDHCSDHDUQCRFS1CYC1UQCRC1UQCRC2UQCRHUQCRBUQCRQUQCR10UQCR11COX1 COX2 COX4I1 COX5A COX5B COX6A1 COX6B1 COX6C COX7A1 COX7A2 COX7A2L COX7B COX7C COX8A ATP5A1 ATP5B ATP5C1 ATP5D ATP5E ATP6 ATP5F1 ATP5G1 ATP5G3 ATP5H ATP5O ATP5J HSD17B10LPLAPOELRP1TNFRSF1ABIDCALM2CALM3CALM1PPP3CAPPP3CBPPP3CCPPP3R1BAD CYCSAPAF1 CASP9CASP3PLCB1PLCB4GRIN1GRIN2CCACNA1C MAPK1MAPK3RYR3 ITPR1ITPR3ATP2A2 ATF6 CAPN1CAPN2CDK5R1CDK5MAPTGSK3BSNCAFigure2.6:TheestimatedCPDAGinthecase(left)andcontrol(righ t)groupbasedongenes mappedtotheAlzheimer'sdiseasepathwayfromtheKEGGdatabase .Thoseoverlapped nodesindicatedualregulationdirections. WedidanothercasestudybyfocusingontheAlzheimer'sdiseasepat hwayfromthe KEGGdatabase.Therearetotal168genesinthispathway,buton ly120geneexpressions inthisdatasetweremappedtothepathway.Foreachofthe120ge neexpressions,we performedasinglemarkerlinearregressionanalysisusingeachofth eSNPs.Weselected thosemarkerswithp-value < 0 :001.Thenthe5000SNPs(with93incommon)withthe smallestp-valueswereselectedincasegroupandcontrolgroupre spectively. Thedataweusedforthenalanalysiscontain120geneexpression s(the Y variables), 5000SNPmarkersinthecasegroupand5000SNPmarkersinthecon trolgroup(the X variables).OurgoalistolearntheDAGstructuresbasedonthe12 0geneexpressionswhile adjustingfortheeectsof5000SNPmarkersinbothgroups.Wea ppliedourestimating 27 methodtothecaseandcontrolgroupseparately.Inthecasegr oup,theestimatedCPDAG has108directededges(Figure2.6).Inthecontrolgroup,thees timatedCPDAGhas109 directededges(Figure2.6).Thereare25commonedgesbetweent hetwographs.We canseederentgraphicalstructuresbetweenthetwogroups. Someoftheestimatedsub- DAGsagreeswiththerealbiologicalnetwork.Forexample,somege nesinmitochondria wereidentedasconnectedinthecasegraph(e.g.,COX7A2,NDUFB 2,SDHC;NDUFV3, COX7B,COX4I1,COX1,NDUFB4,NDUFA10). 2.6Discussion Genesfunctioninnetworksandgeneexpressionsareoftenresult edfromgeneregulations. Hence,learningdirectedregulationscanenhanceourknowledgeab outgenefunctionanddis- easeetiology.Inthiswork,weintroducedacovariate-adjustedm odeltostudythegraphical structuresofmultiplegeneexpressionswithdirectionregulationinf ormation.Weproposed atwo-stageproceduretoobtaintheestimatedCPDAGcorrespon dingtotheprobabilitydis- tribution.Themarkeradjustedexpressionnetworkcanreducet hefalseconnectionsoftwo genesiftheyshareacommonregulator.Simulateddatawereusedt otestourmethodand showedsupportiveresultsbothinlowandhighdimensionalcases.We providedaconsistency resultforthecasewhen p isincreasingalongwithsamplesize n .Weappliedourmethod toarealcase-controldatasettounderstandthegeneregulat ionmechanisminAlzheimer disease.Derentgraphswerelearnedcorrespondingtothecase andcontrolgroup.For theAlzheimerpathwayanalysis,wewereabletorecoversomeofthe structuresforgenes involvedinMitochondriafunction. Intherealdataanalysis,wecouldnotrecovertheexactcausalr elationshipsbetween 28 variablesthatareshownintheKEGGdatabase.Thisisduetothecom plexityofthe biologicalstructureandlimiteddatasamples.Moreover,thedatam ayviolatethenormality assumption.Otherrobustmethodssuchasthenonparanormalm odel(Liuetal.2009;Liu etal.2012)whichimplementanonparametrictransformationofthe datamaybeapplied toourframework.Inaddition,weexpectmorereliablestructures canbeobtainedifwe incorporatepriorknowninformationintoouranalysisbypre-settin gsomeedges.Weexpect moreaccurateandstableestimationcanbeachievedviaincorporat ingbothstatisticaland biologicalknowledgeintothemodeldevelopmentprocess.Thesewill beincorporatedinto ourfutureinvestigation. 29 Chapter3 IntegrativeAnalysisofGenetical GenomicsDataIncorporating NetworkStructure 3.1Introduction Ingeneticalgenomicsanalysis,achallengingproblemishowtocombine phenotype,genotype andgeneexpressiondatatogethertoobtainnewknowledgeabout thegeneticbasisofgene expressionsandtofurtherprovidenovelinsightsintogenefunct ions.Muchworkonmodel developmentshasbeendonetocombinederentsourcesofdata. Forexample,QTLmapping studiescombinegenotypeandphenotypedatatoinferthegenetic architectureofacomplex trait.eQTLmappingcombinesgeneexpressionsandgenotypeswhe regeneexpressionsare viewedasquantitativetraitstounderstandthegeneticbasisofge neexpression.Gaussian graphicalmodels(MeinshausenandBuhlmann,2006;Friedmaneta l.,2007)havebeen appliedtogeneexpressiondatatoinfergeneregulatorynetworks .Recently,covariate- adjustedGaussiangraphicalmodels(YinandLi,2013;Caietal.,201 3)havebeendeveloped tocombinegenotypesandgeneexpressionstoimprovegenenetwo rklearning.Morerecently, integrativemodelingandtestingmethods(e.g.,Huangetal.,2014;Zh aoetal.,2014)combine 30 phenotypes,genotypesandgeneexpressionstoimprovethepow erofstatisticaltesting.As moreandmoregenomicdataarebecomingavailable,integrativeanaly sisofmulti-source genomicdatawouldprovideamorecomprehensivepictureofabiologic alsystem,hence oeringpromisingopportunitiesforpersonalizedmedicineandtreat ment. Whenusinggeneexpressionstopredictphenotypicresponses,line arregressionhasbeen thestandardmeanstomodeltherelationbetweenphenotypesan dgeneexpressions.Dueto thehighcostofobtainingexpressiondata,samplesizeistypicallysma llerthanthenumberof geneexpressions.Tosolvethisproblem,variousregularizationmet hodshavebeendeveloped toachievevariableselectionandestimation.Fromthebiologicalpers pective,thephenotypic responseofinterestandgeneexpressionsareofteninuencedb ycommonexternalforces. Theseexternalforcesareusuallyunobservableandcannotbemo delledexplicitlyinthe model.Hencetheireectsareusuallyputintheerrorterm.Sothee xogeneitycondition inwhichthepredictorsandtheerrorareuncorrelatedisviolated,a ndconsequentlythe estimatorsobtainedbyordinaryleastsquaresmaynotbeconsiste nt.Toovercomethe issuebroughtbyendogeneity,Linetal.(2015)appliedinstrumenta lvariables(IV)models andregularizationmethodstohandletheproblemcausedbythecor relationbetweengene expressionsandtheerrortermsinahighdimensionalsetupwithbot hgeneexpressionsand genotypes.Geneticvariantsareusedasinstrumentalvariabless incetheyareindependent ofexternalenvironmentalexposuresandaectthephenotypic traitofinterestonlythrough aectinggeneexpressions.Givengeneticvariantsusedasinstrum entalvariables,theproblem broughtbythedependencebetweenthegeneexpressionsandth eerrortermswhichincluding environmentalexposuresorotherfactorscanbesolvedbysubs titutingthepredictionsofgene expressionsintheregressionmodel.Thedcultiesonhighdimensiona lityofgenotypesand geneexpressionsarehandledbyregularizationmethodsuchasthe LASSOtyperegression. 31 Weallknowthatgenesfunctioninnetworkstofulltheirjointtask .Thusgeneexpres- sionstendtobecorrelatedwhentheybelongtothesamenetwork( e.g.,pathway).When geneexpressionsareconsideredinaregressionmodel,suchcorre lationinformationshouldbe consideredwhenselectingimportantgenes.Althoughvastamount sofstatisticalliterature havebeenfocusedonthevariableselectionprobleminahighdimension alsetup,manyof themfailforcorrelatedvariables.AspointedoutbyZouandHastie (2005),LASSOtendsto selectonlyonevariablefromagroupofvariableshighlycorrelatedwit heachother.Several methodswereintroducedtosolvetheproblem.ZouandHastie(200 5)proposedelasticnet toachievegroupingeect,i.e.,\thecoecientsofagroupofhighlyc orrelatedvariablestend tobeequal(uptoachangeofsignifnegativelycorrelated)".InLe mma2andTheorem2in theirpaper,itsaidthatifastrictlyconvexpenaltyisapplied,thehigh erthecorrelationof twovariablesis,thelesstheupperboundofthedistanceoftheest imatedcoecientsofthe twovariablesis.Ifthecorrelationisalmostzero,thetwoestimated coecientsarealmost thesame(exceptaminussignifnegativelycorrelated).LiandLi(2 008)proposedanetwork constrainedmethodwhichtakesadvantageofthecorrelationinfo rmationnamelynetwork information.TheyintroducedaLaplacianmatrixand L 2 penaltytocopewiththecorre- lationproblem.Theyapplied L 2 normonthepairwisederencesofthecoecientsofthe correlatedvariablestoachievegroupingeect,andobtainedsimilar theoreticalresultsasin ZouandHastie(2005).Theirsimulationresultsshowedthattheirme thodworksbetterthan elasticnetincaseswherepriorknowledgeongraphicalstructures isconsidered.InLiand Li(2010),theauthorsproposedamodedpenaltyfunctionwhich takesaccountofthesign derencestoencouragetheabsolutevaluesofthecoecientsof theconnectedvariablesto shrinktowardthesame.Huangetal.(2011)proposedasparseLa placianshrinkagemethod andproveditsoracleproperty. 32 MotivatedbyLiandLi(2008),inthischapterweproposeatwo-st epproceduretoachieve variableselectionandestimationundertheIVregressionframewor kincorporatinggene regulatorynetwork.Throughapplyinggraphicalstructuresasp riorknowledge,wehopeto addresstheproblemcausedbyhighcorrelationsbetweengenesina pathwaytoachievebetter variableselectionandestimationresults.Werstadoptamultivariat emulti-responselinear modeltoprojectgeneexpressionsintothegeneticspace.Under theassumptionthateach geneisonlycontrolledbyafewgeneticvariants,weusetheLASSOalg orithmtoestimatethe coecientmatrix.Sincegeneticvariablesareindependentoftheer rorterms,theprojected expressionvaluesarepotentiallyuncorrelatedwiththeerrorterm s.Thentheprojected valuesareusedinthesecondstageestimationwhilegenenetworkst ructuresareconsidered. Weproposeagraphconstrainedregularizationmethodtoachievev ariableselectionand estimationinthesecondstage.Assumingcertaingraphicalstruct uresongeneexpressions whichcanbeobtainedeitherbybiologicalexperimentsorbystatistic alestimation,two penalties(LASSOandgraphconstrainedpenalties)areapplied.The graphconstrained penaltyisusedtoencourageshrinkingcoecientsofapairofvariab lesconnectedinthe networktowardthesametoachievegroupeect.Inthestudyof thetheoreticalproperties oftheestimatorobtainedthroughourmethod,weassumethe\gr aphicalirrepresentable condition"whichisusedtoguaranteetheselectionconsistency. Thischapterisorganizedasthefollowing.In3.2,weintroducetheins trumentalvariables modelandthetwo-stepestimationmethod.Theoreticalconsiste ncyresultsaregivenin3.3. Simulationsandrealdataanalysisaregivenin3.4and3.5,respectively ,followedbydiscussion in3.6.AllthetechnicalproofsarerenderedintheAppendixB. 33 3.2StatisticalMethodsandEstimation 3.2.1TheModel Let Y =( Y 1 ;:::;Y n ) T ,X =( X 1 ;:::;X n ) T ,and G =( G 1 ;:::;G n ) T bematricesof n inde- pendentandidenticallydistributedphenotypes,geneexpressions ,andgenotypes,where Y i isascaler, X i =( X i 1 ;:::;X ip ) T ,and G i =( G i 1 ;:::;G iq ) T .Themodelsweusearethefollowing Y = X + ;(3.1) X = G + ;(3.2) where =( 1 ;:::; p )isa q p coecientmatrix, =( 1 ;:::; p ) T isacoecientvector, =( 1 ;:::; n ) T and =( 1 ;:::; n ) T areerrorvectoranderrormatrixsatisfying( T i ; i ) T are i.i.d. p +1dimensionalrandomvariablesfollowingamultivariatenormaldistrib ution N (0 ;).Wewriteas= 0 B @ 11 12 21 ˙ 22 1 C A ,where 11 isthecovariancematrixof i ,˙ 22 is thevarianceof i ,and 12 = T 21 isthecovariancebetween i and i .Figure3.1:Illustrationoftheinstrumentalvariablesmodelingenet icalgenomicsanalysis. Therelationshipbetweengenotype,geneexpressionandphenoty pecanbemodelledwith linearregressionmodels.Figure3.1showstherelationshipbetweent hethreevariablesas 34 wellasthelatentvariable E ,where Y isthephenotypeofinterest, X isthegeneexpression, G isthegenotype,and E representsexternalstimulatingfactorswhichareunobservable henceareputintotheerrorterm.Weassume G aect Y onlythrough X ,and E aects both X and Y .ToillustratetheideaofIVregressioninourcase,weconsiderasimple regressionmodel Y = X + ( X )since E aectsboth Y and X .Sotheerrorterm coulddependson X andwehave dY dX = + d ( X ) dX .Wecanseethatusingordinaryleastsquaresregressionwill notgiveconsistentestimatefor unless d ( X ) dX =0,i.e., ( X )isindependentof X .For thepurposeofhopingtoobtainconsistentestimatorof ,PhilipG.Wrightintroducedthe instrumentalvariablesmodelin1928.InFigure3.1,geneticvariant G isusedasinstrumental variableinourgeneticalgenomicsanalysis.Asexplainedbefore, G isagoodchoicesinceit aect Y onlythrough X andisindependentof E .Moredetailedexplanationcanbefound inLinetal.(2015). 3.2.2Estimationofgeneticeects Undertheassumptionthatonlyafewgeneticvariantsinuencethe geneexpressions,the coecientmatrix issparse.Weapplyregularizedleastsquaresmethodtoestimateea ch geneexpressionseparately.Theestimationproblemcanbeformula tedasthefollowing p optimizationproblems ^ j =argmin j 1 2 n n X i =1 tr f ( X j G j ) T ( X j G j ) g + j k j k 1 ;j =1 ;:::;p: (3.3) where j ;j =1 ;::;p aretuningparameters, kk 1 referstothe L 1 normand X j =( X 1 j ;:::;X nj ) T ;j =1 ;:::;p .Theestimateof isdenotedas ^ =( ^ 1 ;:::; ^ p ). 35 3.2.3Networkconstrainedregularization Intherststep,weobtainanestimateofthecoecientmatrix .So,wehavepredictions ofthecovariatesas ^ X = G ^ .Wesubstitute ^ X for X inthesecondstep. Theoptimizationproblemweareinterestedinis ^ =argmin n 1 2 n n X i =1 Y i p X j =1 ^ X ij j 2 + p ( ) o ;(3.4) p ( )= p X j =1 j j j + 1 2 X k ˘ t j r kt j ( k s kt t ) 2 ;(3.5) where r kt isaweightofcorrelationstrengthbetweentwovariables, s kt =sign( ˆ kt ), ˆ kt is correlationcoecient, > 0and 2 [0 ;1]aretuningparameters,and k ˘ t means r kt 6=0. ThepenaltyfunctioninEquation3.5includestwopenaltieswhichareus edtoselectvariables andtohandletheproblemsbroughtbyhighcorrelationsviaencoura gingcoecientsoftwo correlatedvariablestoshrinktothesamevaluetoachievegrouping eect. Weapplycoordinatedescentalgorithmtosolvetheoptimizationprob lem.Wecenterthe responsevariable Y andstandardize ^ X j ,j =1 ;:::;p .Let h 2f 1 ;:::;p g ,wehave 1 2 n n X i =1 Y i p X j =1 ^ X ij j 2 + p X j =1 j j j + 1 2 X k ˘ t j r kt j ( k s kt t ) 2 = 1 2 n n X i =1 Y i ^ X ih h X j 6= h ^ X ij j 2 + j h j + X j 6= h j j j + 1 2 j r hj j ( u s hj j ) 2 + 1 2 X k ˘ t k;t 6= h j r kt j ( k s kt t ) 2 :36 Takingthederivativewithrespectto h ,wehave @ @ h n 1 2 n n X i =1 Y i p X j =1 ^ X ij j 2 + p X j =1 j j j + 1 2 X k ˘ t j r kt j ( k s kt t ) 2 o = 1 n n X i =1 Y i X j 6= h ^ X ij j (1 ) X h ˘ j r hj j + 1+ (1 ) X h ˘ j j r hj j h + h j h j :Let @@ h n 1 2 n P n i =1 Y i P p j =1 ^ X ij j 2 + p ( ) i =0.Wesolvetheequationandobtain thesolutionas ^ h = S ( 1 n P n i =1 Y i P j 6= h ^ X ij j + (1 ) P h ˘ j r hj j ; ) 1+ (1 ) P h ˘ j j r hj j :SimilartoFriedmanetal.(2007)andFriedmanetal.(2010),weset max = max j jh ^ X j ;Y ij n and min = max ,where =0 :001,andconstructasequenceof K valuesof decreasing from max to min onthelogscale.Weusecrossvalidationtochoosethetuningparame ters and .3.3TheoreticalProperties Herewegivethetheoreticalresultsonthevariableselectionandes timation.Weadopt notationsusedinLinetal.(2015).Foramatrix A ,k A k 1 =max j P i j a ij j and k A k 1 = max i P j j a ij j .Foravector ,amatrix A ,andindexsets I and J , I and A IJ denotethe subvectorandthesubmatrix. J c denotesthecomplementof J and j J j denotesthenumber oftheelementsin J .Denetherestrictedeigenvalueforamatrix A n m and1 s m by ( A;s )=min jJ s min 6=0 k J c k 1 3 k J k 1 k A k 2 p n k J k 2 .Let r =max 1 j p j supp( j ) j ,s = j supp( ) j ,and ˙ max =max 1 j p ˙ j ,wheresupp( )isthesupportiveset.Weassume k k 1 M 1 37 and k k 1 M 2 forsomepositiveconstants M 1 and M 2 .Wewrite(3.5)as p ( )= P p j =1 j j j + 1 2 T L = 1 P p j =1 j j j + 2 T L ,where 1 = and 2 = 1 2 .L isanon-negativedenitematrixequippedwiththegraphicalinforma tion.Weassume k L k 1 0. Weallow p ,q ,r ,s ,and todependonthesamplesize n .InLinetal.(2015),they establishedatheoremontheestimationandpredictionlossof ^ .Webaseouranalysison theirresults. Inordertoobtaintheselectionconsistencyandthe L 1 bound,weneedtoimpose anassumptionon G whichissimilartotheirrepresentableconditioninZhaoandYu (2006).Dene C = 1 n ( G ) T G ,S =supp( ),and ˚ = k ( C SS +2 2 L SS ) 1 k 1 .Let b 0 =min j 2 S j j j .Theassumptionisneededtoobtainselectionconsistency, (C2) k ( C S c S +2 2 L S c S )( C SS +2 2 L SS ) 1 k 1 < 1 ,where isaconstantand 0 << 1. SimilarlytotheargumentinZhaoandYu(2006),(C2)requiresthe\r egressioncoe- cients"oftheirrelevantcovariatesontherelevantcovariatesar elessthan1.InLemma4 oftheAppendixB,weprovethatcondition(C2)holdsforthesample covariancematrix of ^ X withasmaller .Withthese,weestablishthefollowingtheoremabouttheselection consistencyof ^ .Theorem2. Iftheregularizationparametersintherststepareselect edas j = C˙ j q (log p +log q ) n andsatisfy 16 ˚ 2 rs max (2 M 1 + max ) 2(4 ) ,where C> 2 p 2 and max = max 1 j p j ,theregularizationparametersinthesecondstepareselec tedas 1 = C 0 38 q r (log p +log q ) n and 2 C L (16 3 ) 4(4 )(8 3 ) 1 ,where C 0 = c 0 L max( ˙ p +1 ;M 2 ˙ max ) with c 0 > 0 , b 0 hasalowerboundthat b 0 > 2(4 ) 8 3 ˚ h 8 2(4 ) 1 +2 2 C L i ,andfurtherassume that(C1)and(C2)hold,thenwithprobabilityatleast 1 c 1 ( pq ) c 2 ,where c 1 ;c 2 > 0 , ^ obtainedfrom(3.4)satises sign( ^ )=sign( ) ;k ^ S S k 1 16(4 ) ˚C 0 (8 3 ) 2 r r (log p +log q ) n :3.4Simulation HerewecloselyfollowthesimulationsetupproposedinLinetal.(2015) ,butimpose certaincorrelationstructuresongenestoshowtheimpactofnet workstructuresonvariable selectionandestimation.Wesimulatetotal p variablesofgeneexpressionandconsiderthree groupstructureson with5variablesineachgroup.Withineachgroup,thevariablesare correlated.Thestrengthofcorrelationiscontrolledbythenumbe rofeectiveSNPsthey shareincommon.Twogroupsofvariableshavenonzerocoecients andthe3rdgrouphas zerocoecients.Alltherestofthe p 15variableshavezerocoecientsandhavenograph structure,i.e., =( 1 ; ; 5 | {z } 5 ; 6 ; ; 10 | {z } 5 ;0 ; ;0 | {z } 5 ;0 ; ;0 | {z } p 15 ) T .Thepurposetosimulate thestructuredgroupwithzerocoecientsisforthecomparisonp urpose.Twosimulation scenariosareconsidered.Inscenario1,wesetallthenonzeroco ecientstobe0.5(i.e., k =0 :5for k =1 ; ;10),whileinscenario2, k ˘ U (0 :5 ;1)for k =1 ; ;10.Again, thetwosetupsareforthepurposeofcomparison. 39 Forthecovariancematrixcov( ; )== 0 B @ 11 12 21 ˙ 22 1 C A ,the i throwand j thcolumn entryin 11 issettobe(0 :2) ji j jfor i;j =1 ; ;p ,and ˙ 22 =1.Tomake andthe15 X variableswithagraphstructurecorrelated,wesetthe15entries inthelastcolumnsof 12 as0.23andtherestas0.Thenwesimulate( ; ) ˘ N p +1 ( 0 ;). G isgeneratedby samplingfromaBernoullidistributionwithsuccessprobability0.5. Tosimulate X ,weneedtogeneratethecoecientmatrix rst.Aswementionedbefore, thegraphstructurein X iscontrolledbythenumberofcommonlyshared G variables.The moretheyshareincommon,thestrongerthecorrelationbetween them.Weassumethe numberofcommonSNPsas3,4or5toachievederentstrengthso fcorrelation.The correspondingcoecientsin aresettobeindependentrealizationsfrom U (0 :75 ;1).Here eachcolumnof isacoecientvectorcorrespondingtoanexpressionvariable.For the restcolumnsof correspondingtothe X variableswithoutastructure,5nonzeroentriesof eachcolumnarerandomlyselectedandtheirvaluesareindependent lyfrom U ([ 1 ; 0 :75] [ [0 :75 ;1]).Thenwecangenerate X and Y by X = G + and Y = T X + .Weconsiderbothlowdimensionalandhighdimensionalsituationsinour simulation. Inthelowdimensionalcase,weset p =100 ;q =100andvary n from200to1400when both p and q arexed.Inthehighdimensionalcase,weset p =600 ;q =600and n = 300.WecompareourmethodnamelyIVregressionwithgraphconst rainedregularization (denotedasIVGC)withthemethodproposedbyLinetal.(2015)wit hIVregressiononly (denotedasIV).Themeasuresweusetoevaluateandcompared erentmethodsandsetups arethenumberofcorrectlyestimatednonzerocoecients(True Positive),thenumberof estimatednonzerocoecients(ModelSize), k ^ k 2 (EstimationLoss), n 1 = 2 k X ( ^ 40 ) k 2 (ModelError),and MCC (= TP TN FP FN p ( TP + FP )( TP + FN )( TN + FP )( TN + FN ) )whereTP=true positive,TN=truenegative,FP=falsepositiveandFN=falsenegativ e.TNisdenedasthe numberofcorrectlyestimatedzerocoecients;FPisdenedasth enumberofincorrectly estimatednonzerocoecients;andFNisdenedasthenumberofin correctlyestimated zerocoecients.ThegreatertheMCCis,thebetteravariablesele ctionperformanceis. Thereasonthatwechoosemodelerrorinsteadofpredictionerro risexplainedinLinetal. (2015).Weusecrossvalidationtochoosethepenaltytuningparam eters.Foreachcase,we run200replicationsandreportthesamplemeansandstandarderr ors. Table3.1:Simulationresultsforp=100,q=100,n=600 MethodnumSNPEstimationLossModelErrorTruePositiveModelSiz eMCC k =0 :5 ;k =1 ; ;10 30.67(0.28)0.34(0.11)10.00(0.00)14.79(3.29)0.82(0.10) 40.62(0.34)0.35(0.12)10.00(0.00)13.73(4.10)0.86(0.11) 51.33(1.43)0.59(0.46)9.60(1.15)12.98(3.50)0.84(0.13) IVGC k ˘ U (0 :5 ;1) ;k =1 ; ;10 31.51(0.41)0.67(0.14)10.00(0.00)15.56(4.15)0.80(0.11) 41.28(0.38)0.59(0.14)10.00(0.00)14.35(3.96)0.83(0.11) 52.13(1.70)0.95(0.59)9.63(1.07)12.97(3.61)0.84(0.11) k =0 :5 ;k =1 ; ;10 31.62(0.46)0.66(0.14)9.99(0.10)15.77(4.36)0.79(0.11) 41.82(0.49)0.75(0.15)9.96(0.20)14.38(3.87)0.83(0.11) 54.20(0.71)1.52(0.24)7.60(1.06)10.88(3.15)0.71(0.12) IV k ˘ U (0 :5 ;1) ;k =1 ; ;10 32.05(0.58)0.89(0.18)10.00(0.00)15.49(4.42)0.80(0.11) 42.30(0.62)0.94(0.19)9.96(0.20)14.64(4.26)0.82(0.11) 55.70(1.22)2.14(0.44)8.01(1.11)11.44(3.32)0.74(0.14) Table3.1showstheresultsforthelowdimensionalsituationwith n =600.Asintroduced inthesimulationsetup,thegreaterthenumberofsharedSNPs(nu mSNP),thehigherthe correlationbetweentheexpressionvariables.WecanseeIVGChas lessestimationerrorsand 41 modelerrorsinthetwoscenarioswithderent values.Forthetruepositive,bothIVGC andIVperformsimilarwhennumSNPis3or4(weakcorrelationcompar edtonumSNP=5). WhennumSNPincreasesto5,IVGChasahighertruepositiverateth anIVdoes.This iswhatweexpecttosee.IVimplementsaLASSOpenaltywhichonlyran domlypicks onevariableamongcorrelatedones.Introducinggraphicalconst rainedpenaltyachievesthe selectionconsistencyforgraphicalstructurespresent.Forth emodelsizeandMCC,IVGC andIVhavequitesimilarperformance,althoughIVGChasslightlylarg erMCCthnaIV does.Ingeneral,IVGCperformsbettercomparingtoIVinbothva riableselectionand estimation. Wealsovariedthesamplesize n butxed p and q .TheresultsaredisplayedinFigure 3.2.Wecanseethatbothmethodsperformbetterasthesamplesiz eincreases.IVGC completelydominatesIVintermsofestimationlossandmodelerror. Inaddition,IVGC hasslightlysmallermodelsizeandlargerMCCcomparedtoIV.Bothme thodsperform quitesimilarintermsoftruepositive,especiallywhen n getslarge.Adetailedcomparison canbefoundinTable3.2alongwiththestandarderrorgiveninthepar enthesis. Table3.3showstheresultsinhighdimensionalsituations.Wecansees imilarpatterns asweobservedinthelowdimensionalcases.Ingeneral,IVGCperfo rmsasgoodasorbetter thanIVdoes. 3.5ApplicationtoRealData Weapplyourmethodtoahumanlivercohortdatasettoshowtheutilit yofthemethod. Thedatacontaingenotypes,geneexpressions,andphenotypes ofenzymeactivitieswhich canbedownloadedfromsagebionetworks'synapseplatformusing SynapseIDsyn4499.For 42 0123456 Estimation LossSample size Estimation loss200400600800100012001400 IVGC IV0.00.51.01.52.02.5 Model Error Sample size Model error200400600800100012001400 IVGC IV0246810 True Positive Sample size No. of true positives 200400600800100012001400 IVGC IV1012141618 Model Size Sample size Model size 200400600800100012001400 IVGC IV0.60.70.80.91.0 MCCSample size MCC200400600800100012001400 IVGC IVFigure3.2:Simulationresultsforxed p =100and q =100butvarying n (200to1400) detailsofthedataset,pleaserefertoSchadtetal.(2008)andYa ngetal.(2010).The phenotypesareenzymeactivitymeasurementsofCytochromeP4 50.Therearetotal170 individualsmeasuredon18,556genetranscripts,449,699SNPsands omecovariateslike 43 Table3.2:Simulationresultsfor p =100, q =100, n =200 1400andnumSNP=4 n MethodEstimationLossModelErrorTruePositiveModelSizeMCC 200IVGC1.29(0.77)0.72(0.28)9.93(0.41)14.31(3.87)0.83(0.11) IV3.20(0.78)1.32(0.27)9.30(0.76)13.69(3.60)0.79(0.12) 400IVGC0.86(0.43)0.47(0.15)9.98(0.20)14.43(3.53)0.83(0.11) IV2.34(0.63)0.95(0.20)9.77(0.47)14.74(3.43)0.80(0.11) 600IVGC0.62(0.34)0.35(0.12)10.00(0.00)13.73(4.10)0.86(0.11) IV1.82(0.49)0.75(0.15)9.96(0.20)14.38(3.87)0.83(0.11) 800IVGC0.55(0.25)0.29(0.09)10.00(0.00)14.36(4.15)0.83(0.11) IV1.70(0.42)0.68(0.14)9.99(0.10)14.91(4.14)0.82(0.11) 1000IVGC0.49(0.22)0.26(0.09)10.00(0.00)14.52(3.85)0.83(0.11) IV1.47(0.38)0.60(0.12)10.00(0.00)15.27(3.84)0.80(0.11) 1200IVGC0.45(0.18)0.24(0.08)10.00(0.00)13.91(3.44)0.85(0.11) IV1.29(0.33)0.53(0.11)10.00(0.00)14.33(3.32)0.83(0.10) 1400IVGC0.43(0.19)0.23(0.07)10.00(0.00)14.50(4.02)0.83(0.11) IV1.25(0.33)0.50(0.10)10.00(0.00)14.71(4.28)0.83(0.12) genderandage.Therearetotal9enzymeactivitymeasuresandw eparticularlyfocuson CYP2E1sinceitistheonlyonepassedtheShapiro-Wilknormalitytest( p-value > 0 :1)after thelogtransformation.Weregressthelog-transformedCYP2E1 overcovariatesgenderand age,thenobtainthecovariatesadjustedresponsesforfollowup analysis. FortheSNPdata,thosewithgenotypingcallratelessthan90%orm inorallelefrequency (MAF)lessthan5%areremovedfromthedataset.Afterthisstep ,therearetotal312,082 SNPsleft.HerewefocusonKEGGpathway\MetabolismofXenobiotic sbyCytochrome P450".Therearetotal74genesinthispathwayand72ofthemare mappedtoourdata set.Wetalinearregressionmodelforeachgenetranscripttose lectthetop1,000SNPs accordingtotheirmarginalp-values.Thenwetamultipleregressio nmodelwiththe1,000 44 Table3.3:Simulationresultsfor p =600, q =600and n =300 MethodnumSNPEstimationLossModelErrorTruePositiveModelSiz eMCC k =0 :5 ;k =1 ; ;10 31.65(1.08)0.73(0.27)10.00(0.00)20.70(11.02)0.74(0.14) 41.68(0.91)0.84(0.25)9.98(0.14)18.17(8.91)0.78(0.15) 52.93(1.76)1.25(0.51)9.13(1.38)16.25(7.36)0.75(0.15) IVGC k ˘ U (0 :5 ;1) ;k =1 ; ;10 32.59(0.94)1.22(0.27)9.99(0.10)19.65(7.07)0.74(0.12) 42.62(1.04)1.37(0.36)9.91(0.45)17.54(7.25)0.78(0.13) 54.36(2.64)1.93(0.70)9.15(1.42)16.95(8.67)0.75(0.18) k =0 :5 ;k =1 ; ;10 32.95(1.21)1.13(0.25)9.84(0.39)22.22(11.68)0.70(0.14) 43.28(1.21)1.31(0.28)9.63(0.56)19.77(10.38)0.73(0.15) 54.86(1.23)1.77(0.35)7.71(1.03)16.70(9.69)0.64(0.15) IV k ˘ U (0 :5 ;1) ;k =1 ; ;10 33.86(1.48)1.61(0.35)9.90(0.30)21.84(10.98)0.71(0.13) 44.45(1.10)1.86(0.35)9.66(0.59)19.35(7.50)0.72(0.12) 56.91(1.69)2.64(0.44)7.87(1.08)16.49(9.00)0.65(0.16) SNPsandestimatetheircoecientsusingtheLASSOalgorithm.Weth enobtainthetted valuesforthe72genetranscriptsandapplythegraphconstraine dIVmethodtoselect importantones.Wemanuallyobtainthegraphstructureforthe72 genesinthispathway fromtheKEGGpathwaydatabase.Stabilityselection(Meinshausen andBuhlmann,2009; ShahandSamworh,2013)isappliedtoobtainastablevariableselectio nresult.Eachtime werandomlyselect80%ofthedatatorunouralgorithmandwerepea t100times,then calculatethepercentageofselectionforeachgeneaatheselectio nrate. Table3.4showsthetop10selectedgenesusingIVGCandIV.Wecans eethatalarge proportionofgenesselectedwiththetwomethodsoverlap.Howev er,ahigherselectrate isachievedfortheIVGCmethod.ThisimpliesthattheIVGCmethodism orestable whengraphicalinformationistakenintoaccount.Amongthelistedg enes,CYP2E1is selected100%usingtheIVGCmethodcomparedtoa99%fortheIVm ethod.Weexpect 45 Table3.4:Listoftop10selectedgenesforeachoftheIVGCandIVm ethod IVGCIV CYP2E11.000CYP2E10.990 UGT1A50.775UGT1A50.765 SULT2A10.540SULT2A10.420 CYP2B60.465CYP2B60.295 MGST10.445MGST10.240 HSD11B10.435HSD11B10.295 CYP2C90.365CYP2C90.195 UGT1A10.345- GSTM30.325GSTM30.210 -CYP2F10.165 UGT1A30.295- -UGT1A10.120 thatthisgeneshouldbeselectedverytimesincetheresponseisthe activityofthisgene. ThehighselectionratefortheIVGCmethodindicatetherobustofo urmethodwhen networkinformationisincorporated.Inadditiontothisgene,othe rgenessuchasUGT1A5, SULT2A1,CYP2B6,MGST1,andHSD11B1alsohaveselectionratesla rgerthan40%.The stabilityevaluationindicatestherelativeimportanceofthesegenes intheactivityofenzyme CYP2E1.Furtherbiologicalvalidationisneededtodisclosetherealr elationshipofthese geneswiththeenzymeactivity.Acompletelistofthegenesinthepat hwayisgiveninTable 3.5. 3.6Discussion Linetal.(2015)proposedanIVregressionframeworktodealwith theendogeneityissuein geneticalgenomicdataanalysis.Giventhatgenesfunctioninnetwo rktofulltheirjoint tasks,inthisworkweproposedagraphconstrainedselectionande stimationmethodunder IVregressionframework.OurworkisbuiltupontheworkofLinetal. (2015)whilein- 46 corporatingthepriorknowledgeaboutthegraphstructureofge nesbelongingtoanetwork (e.g.,pathway).Weestablishedtheselectionconsistencyunderth eproposedtwo-stepesti- mationprocedureandshowedthe L 1 boundwhichprovidessometheoreticalinsightsinto thepropertiesofourmethod.Intensivesimulationswereconduct edtoevaluatethemodel performancecomparedwiththeIVregressionmethodbyLinetal.( 2015).Sincetheau- thorshavedemonstratedtheadvantageoftheIVregressionov eranaiveregressionwithout consideringinstrumentalvariables(Linetal.2015),wedidnotinclud ethecomparisonof ourmethodwiththenaivemethodinthiswork.Weappliedourmethodt oahumanliver cohortdatasettodemonstratetheeectivenessofourmethod .Thestabilityevaluationre- sultsindicatetherobustnessofourmethodcomparedtotheIVme thodwithoutconsidering therealbiologicalregulationrelationship. Ingeneral,LASSOdoesnothaveoracleproperty.Itispossibletha ttheestimation errorsofthecoecientscannotbereducedtozeroevenunderla rgesamples.SCADand MCParenaturalalternativeswhichinfactenjoytheoracleproper ty.Bothpenaltycanbe appliedinourproceduretoreplacethelassopenalty.Wewillconsider thesepenaltiesinour futurework,althoughsubstantialmodcationmaybeneededfor theproofoftheselection consistency. Whenapplyingthegraphconstrainedpenalizationmethod,thegrap hicalstructureneeds tobeknownbeforethepenalizedestimationisapplied.Inourrealda taanalysis,weusethe pathwayinformationfromtheKEGGpathwaydatabaseaspriorkno wledgetoestablishthe graphstructure.Thisisanaturalwayifpriorknowledgeisavailable. However,whensuch knowledgeisnotavailable,somestatisticalmethodscanbeappliedto estimatethenetwork structurebeforeapplyingourmethod.Suchmethodsinclude,for example,thethresholding methodsbyBickelandLevina(2008a,2008b),Rothman,Levinaan dZhu(2009),andLam 47 andFan(2009).Inaddition,ourcurrentmodelisdevelopedforco ntinuousresponse.In humangenetics,manydiseaseresponsesareshownasbinaryvaria bles.Wewillgeneralize ourmethodtoageneralizedlinearmodelframeworkinfutureinvest igations. 48 Table3.5:thegenesinpathwayofMetabolismofxenobioticsbycytoc hromeP450-Homo sapiens CYP1A1cytochromeP450,family1,subfamilyA,polypeptide1[KO:K07 408] CYP2C9cytochromeP450,family2,subfamilyC,polypeptide9[KO:K17 719] CYP3A4cytochromeP450,family3,subfamilyA,polypeptide4[KO:K17 689] CYP1B1cytochromeP450,family1,subfamilyB,polypeptide1[KO:K07 410] GSTA5glutathioneS-transferasealpha5[KO:K00799] GSTA2glutathioneS-transferasealpha2[KO:K00799] GSTA4glutathioneS-transferasealpha4[KO:K00799] GSTO2glutathioneS-transferaseomega2[KO:K00799] GSTM4glutathioneS-transferasemu4[KO:K00799] GSTT2glutathioneS-transferasetheta2(gene/pseudogene)[K O:K00799] GSTT1glutathioneS-transferasetheta1[KO:K00799] GSTM3glutathioneS-transferasemu3(brain)[KO:K00799] MGST1microsomalglutathioneS-transferase1[KO:K00799] MGST3microsomalglutathioneS-transferase3[KO:K00799] GSTP1glutathioneS-transferasepi1[KO:K00799] GSTM1glutathioneS-transferasemu1[KO:K00799] GSTM5glutathioneS-transferasemu5[KO:K00799] MGST2microsomalglutathioneS-transferase2[KO:K00799] GSTA1glutathioneS-transferasealpha1[KO:K00799] GSTM2glutathioneS-transferasemu2(muscle)[KO:K00799] GSTA3glutathioneS-transferasealpha3[KO:K00799] GSTO1glutathioneS-transferaseomega1[KO:K00799] GSTT2BglutathioneS-transferasetheta2B(gene/pseudogene )[KO:K00799] GSTK1glutathioneS-transferasekappa1[KO:K13299] EPHX1epoxidehydrolase1,microsomal(xenobiotic)[KO:K01253] CYP2B6cytochromeP450,family2,subfamilyB,polypeptide6[KO:K17 709] SULT2A1sulfotransferasefamily,cytosolic,2A,dehydroepiandr osterone (DHEA)-preferring,member1[KO:K11822] CYP1A2cytochromeP450,family1,subfamilyA,polypeptide2[KO:K07 409] CYP2A6cytochromeP450,family2,subfamilyA,polypeptide6[KO:K17 683] CYP2E1cytochromeP450,family2,subfamilyE,polypeptide1[KO:K07 415] CYP2F1cytochromeP450,family2,subfamilyF,polypeptide1[KO:K07 416] CYP2S1cytochromeP450,family2,subfamilyS,polypeptide1[KO:K07 420] AKR1C2aldo-ketoreductasefamily1,memberC2[KO:K00089K00212 ]AKR1C4aldo-ketoreductasefamily1,memberC4[KO:K00037K00089 K00092 K00212] AKR1C1aldo-ketoreductasefamily1,memberC1[KO:K00089K00212 ]49 Table3.5(cont'd) DHDHdihydrodioldehydrogenase(dimeric)[KO:K00078[EC:1.1.1.1791.3 .1.20] CYP2A13cytochromeP450,family2,subfamilyA,polypeptide13[KO:K 17685] CYP2D6cytochromeP450,family2,subfamilyD,polypeptide6[KO:K17 712] HSD11B1hydroxysteroid(11-beta)dehydrogenase1[KO:K15680 ]CBR1carbonylreductase1[KO:K00079] CBR3carbonylreductase3[KO:K00084] UGT2A1UDPglucuronosyltransferase2family,polypeptideA1,com plexlocus [KO:K00699] UGT2A3UDPglucuronosyltransferase2family,polypeptideA3[KO:K0 0699] UGT2B17UDPglucuronosyltransferase2family,polypeptideB17[KO :K00699] UGT2B11UDPglucuronosyltransferase2family,polypeptideB11[KO :K00699] UGT2B28UDPglucuronosyltransferase2family,polypeptideB28[KO :K00699] UGT1A6UDPglucuronosyltransferase1family,polypeptideA6[KO:K0 0699] UGT1A4UDPglucuronosyltransferase1family,polypeptideA4[KO:K0 0699] UGT1A1UDPglucuronosyltransferase1family,polypeptideA1[KO:K0 0699] UGT1A3UDPglucuronosyltransferase1family,polypeptideA3[KO:K0 0699] UGT2B10UDPglucuronosyltransferase2family,polypeptideB10[KO :K00699] UGT1A9UDPglucuronosyltransferase1family,polypeptideA9[KO:K0 0699] UGT2B7UDPglucuronosyltransferase2family,polypeptideB7[KO:K0 0699] UGT1A10UDPglucuronosyltransferase1family,polypeptideA10[KO :K00699] UGT1A8UDPglucuronosyltransferase1family,polypeptideA8[KO:K0 0699] UGT1A5UDPglucuronosyltransferase1family,polypeptideA5[KO:K0 0699] UGT2B15UDPglucuronosyltransferase2family,polypeptideB15[KO :K00699] UGT1A7UDPglucuronosyltransferase1family,polypeptideA7[KO:K0 0699] UGT2B4UDPglucuronosyltransferase2family,polypeptideB4[KO:K0 0699] UGT2A2UDPglucuronosyltransferase2family,polypeptideA2[KO:K0 0699] CYP3A5cytochromeP450,family3,subfamilyA,polypeptide5[KO:K17 690] AKR7A2aldo-ketoreductasefamily7,memberA2(aatoxinaldehyd ereductase) [KO:K15303] AKR7A3aldo-ketoreductasefamily7,memberA3(aatoxinaldehyd ereductase) [KO:K15303] ALDH3B1aldehydedehydrogenase3family,memberB1[KO:K00129] ALDH3B2aldehydedehydrogenase3family,memberB2[KO:K00129] ALDH1A3aldehydedehydrogenase1family,memberA3[KO:K00129] ALDH3A1aldehydedehydrogenase3family,memberA1[KO:K00129] ADH1Aalcoholdehydrogenase1A(classI),alphapolypeptide[KO:K1 3951] ADH1Balcoholdehydrogenase1B(classI),betapolypeptide[KO:K1 3951] ADH1Calcoholdehydrogenase1C(classI),gammapolypeptide[KO:K 13951] ADH7alcoholdehydrogenase7(classIV),muorsigmapolypeptide[K O:K13951] ADH4alcoholdehydrogenase4(classII),pipolypeptide[KO:K1398 0] ADH5alcoholdehydrogenase5(classIII),chipolypeptide[KO:K00 121] ADH6alcoholdehydrogenase6(classV)[KO:K13952] 50 Chapter4 GraphConstrainedRegularizationfor NonparametricInstrumental VariablesRegressioninGenetical GenomicAnalysis 4.1Introduction Drivenbytheadvancementofmodernbiotechnology,vastamount ofgeneticandgenomics dataareroutinelygeneratedwiththehopetounderstandthemole cularfunctionofliving organisms.Anyphenotypictraitsrootingeneticsinwhichgenefunc tionsareconstantly modedbyexternalfactorssuchasenvironmentalchanges.As publichealthisbecoming anincreasingconcernduetovariousissuessuchasfoodsafetyan dair/waterpollution, understandingtheimpactofexternalforcesongenefunctionan dthedegreeandmechanism bywhichtheseexternalstimuliactonourgenomeisthusparamoun tlyimportantindisease preventionandimprovingpublichealth.Geneticalgenomicdatawhich includegenetic variants(e.g.,SNPs),geneexpressionandphenotypictraitscoup lingwithenvironmental measurements,providepowerfulresourcestodeciphersuchme chanisms. 51 ThecentraldogmaingeneticstellsusthatDNAtranscribestoRNAa ndRNAtranslates toproteinwhicheventuallyleadingtophenotypicproducts.Withgen eticalgenomicdata, therelationshipamongthethreecanbemodelledandtestedthroug hpowerfulstatistical methods.Ingeneticassociationstudies,therelationshipbetween aphenotypicresponse andgeneticvariantscanbemodeledbyasimplelinearorlogisticregres sionmodeldepend- ingonthenatureofthedatadistribution.Let Y beaquantitativetraitresponseand G =( G 1 ; ;G q ) T denote q geneticvariants,thentherelationshipbetween Y (assuming centered)and G canbedescribedbythefollowingmodel, Y = G + " ,where represent thegeneticeects.Let Z beanenvironmentalvariable.Tounderstandhow Z inuences G toaect Y ,Maetal.(2011)introducedavarying-coecientmodeltostudyn onlinear environmentalmodulationeecton G withthemodel Y = q X k =1 k ( Z ) G k + ": (4.1) Let X =( X 1 ; ;X p ) T denote p geneexpressionvaluesinageneticalgenomicstudy.As theintermediatephenotypebetween G and Y ,onenaturalchoiceistousegeneexpression valuesasthepredictorstondoutwhichgenesareassociatedwith aphenotype Y .Thiscan bedonebyttingthemodel Y = X + ,where representtheeectsofgeneexpressions. IfweassumeacausalrelationshipwiththeorderofDNA-RNA-phen otype,i.e.,DNAaects phenotypeonlythroughgeneexpression,thenthefollowingtwo-s tagesequentialmodelscan bettedtodissecttherelationship,i.e., Y = X + ;X = G + ;52 wherewelet Y n 1 ,X n p ,and G n q bematricesof n independentandidenticallydis- tributedphenotypes,geneexpressions,andgenotypes,respe ctively; q p and p 1 are coecientmatrixandvector, n p and n 1 areerrormatrixandvector.Asarguedin chapter3,whenunobservablelatentvariablesaectboth X and Y ,G canbeconsidered asinstrumentalvariablesandtheeectsof X on Y canbeconsistentlyestimated.Linet al.(2015)proposedaninstrumentalvariables(IV)regressionfr ameworktoachievevariable selectionandestimationwhenbothdimensionsof p and q arelargetosolvetheendogeneity problem.Intherststage,apenalizedvariableselectionalgorithmis appliedtoeachgene expression.Then X isreplacedbythettedvalues ^ X inthesecondstageandthemodelof interestbecomes Y =( G ^ ) + :(4.2) Whentheeectofanenvironmentalvariable Z isconsidered,wecanincorporatemodel (4.1)tounderstandhow Z modulatesgeneeecttoaect Y .Thusmodel(4.2)canbe modedtoreectthisby, Y = G ^ ( Z )+ :where ( )are p -dimensionalfunctionswhichareassumedtobesmooth.Thus,und erthe IVregressionframework,itisnaturaltoconsiderthefollowingmod el, Y = X ( Z )+ ;X = G + ;(4.3) Aswearguedinchapter3,genesfunctioninanetworkfashiontofu lltheirjointtasks. Astheintermediatephenotype,geneexpressionscouldbehighlyco rrelatedforgenesbe- 53 longingtothesamenetworkorpathway.Thus,anetworkconstra inedpenalizationmethod ismoreappealingtosolvetheissuesbroughtbythegraphicalstruc tureofgeneexpressions. Focusingonmodel(4.3),inthischapterweapplyinstrumentalvaria blestosolvetheproblem causedbyendogeneityandproposeagraphconstrainedregulariz ationmethodtosolvethe problemcausedbyhighcorrelation.Theeectofageneexpression ismodeledbyavarying coecientfunctionwhichreectsthenonlinear(ifany)moderation eectof G on Y .We proposeatwo-stepproceduretoachievevariableselectionandes timation.Intherststep, undertheassumptionthateachgeneisonlycontrolledbyafewgene ticvariants,weuse LASSOtoestimatethecoecientmatrix .ThenweapplygroupLASSOandgraphcon- strainedregularizationtoachievevariableselectionandestimationf orthevaryingcoecient functions.Afterreplacing X withthettedvalues X ,weuseB-splineexpansionstoapprox- imatethecoecientfunctions.Finallyweimposetwopenalties,group LASSOpenaltyand graphconstrainedpenaltyonB-splinecoecients.Thegraphcons trainedpenaltyisused tohandlethecorrelationproblemandencouragethecoecientsof apairofvariableswhich areconnectedinthenetworktoshrinktowardthesamevalues. Thechapterisorganizedasthefollowing.In4.2,weintroducethemo delandthetwo- stepestimationmethod.Theoreticalconsistencyresultsaregive nin4.3.Simulatingdata andrealdataanalysesareusedtojustifyourmethodin4.4and4.5, respectively,followed byabriefdiscussionin4.6.ProofsarerenderedintheAppendixC. 54 4.2NonparametricModelsandGraphConstrainedReg- ularization 4.2.1TheModelSetup Let Y =( Y 1 ;:::;Y n ) T ,X =( X 1 ;:::;X n ) T ,G =( G 1 ;:::;G n ) T ,and Z =( Z 1 ;:::;Z n ) T be matricesof n independentandidenticallydistributed( i.i.d. )phenotype,geneexpressions, genotypes,andenvironmentalvariable,where Y i isascaler, X i =( X i 1 ;:::;X ip ) T ,G i = ( G i 1 ;:::;G iq ) T ,and Z i isascaler.Hereweproposethefollowingmodel, Y i = 0 ( Z i )+ p X j =1 j ( Z i ) X ij + i ;(4.4) X i = T G i + i ;(4.5) where =( 1 ;:::; p )isa q p coecientmatrix, j ( ) ;j =0 ;1 ;:::;p; aresmoothfunctions. =( 1 ;:::; n ) T and =( 1 ;:::; n ) T areerrorvectoranderrormatrixsatisfying( T i ; i ) T are i.i.d. p +1dimensionalrandomvariablesfollowingamultivariatenormaldistrib ution N (0 ;).Wewriteas= 0 B @ 11 12 21 ˙ 22 1 C A ,where 11 isthecovariancematrixof i ,˙ 22 is thevarianceof i ,and 12 = T 21 isthecovariancebetween i and i .4.2.2FirstStepEstimation Undertheassumptionthatonlyafewgeneticvariantsinuencethe geneexpressions,the coecientmatrix issparse.Weadoptanaiveestimationmethodandapplyregularized leastsquarestoeachgeneexpressionseparately.Theestimation problemcanbeformulated 55 asthefollowing p optimizationproblems,i.e., ^ j =argmin j 1 2 n n X i =1 tr f ( X j G j ) T ( X j G j ) g + j k j k 1 ;j =1 ;:::;p: (4.6) where j ;j =1 ;::;p aretuningparameters, kk 1 referstothe L 1 normand X j =( X 1 j ;:::;X nj ) T ;j = 1 ;:::;p .Theestimateofisdenotedas ^ =( ^ 1 ;:::; ^ p ). 4.2.3SecondStepEstimation Intherststep,weobtainanestimateofthecoecientmatrix .Thettedvalues ^ X = G ^ isthenusedtosubstitute X inthesecondstep. Theoptimizationproblemweareinterestedinis ^ ( )=argmin ( ) n 1 2 n n X i =1 Y i p X j =0 j ( Z i ) ^ X ij 2 + p ( ( )) o ;(4.7) where p ( ( ))= p X j =1 k j ( ) k + 1 2 X k ˘ t j r kt jk k ( ) s kt t ( ) k 2 ; ( )=( 0 ( ) ; 1 ( ) ;:::; p ( )) T ,r kt isaweightofcorrelationstrengthbetweentwovariables; s kt =sign( ˆ kt ); ˆ kt iscorrelationcoecient; > 0and 2 [0 ;1]aretuningparameters; and k ˘ t means r kt 6=0.Denethenormofafunction g ( t )as jj g ( t ) jj = q R g 2 ( t )d t .Similarformsofoptimizationproblemhaveappearedintheliterature. Huangetal. (2011)studiedthepropertyofthisprobleminaparametricsituatio n.Dayeetal.(2012) proposedasimilarproblemandanalgorithm.However,theydidnotinc orporatethege- neticvariantsasinstrumentalvariables.Weapplythenonparamet ricgraphconstrained regularizationinanIVregressionframework.Thisbringschallenges intheproofofselection 56 consistencysince ^ X arenoti.i.d.samples.InDayeetal.(2012),theauthorsappliedtheir methodtolongitudinaldatawhichbroughtsomechallengesincomput ation.Weadoptan algorithmsimilartowhatWeietal.(2011)proposedandthecomputa tionproblemiseased. Weapproximatesmoothfunctions j ( ) ;j =0 ;1 ; ;p bypolynomialsplines.Without lossofgenerality,weassume Z 2 [0 ;1].Let k j beapartitionoftheinterval[0,1],with K j interiorknots k j = f 0= k j; 0 0. Weconsiderthelowdimensionalsituationwherethedimensionisxed. Inordertoobtain 61 theselectionconsistencyandthe L 1 bound,weneedtoassumeonemoreassumption.Dene C = 1 n U ( G ) T U ( G ),where X isreplacedwith G ,and S =supp( )whichistheindex ofsupportivegroups.Let b 0 =min j 2 S k j k 2 .Thefollowingconditionwhichisinspirit similartotheirrepresentableconditioninZhaoandYu(2006),isneed ed. (C2) k ( C tS +2 2 L R tS )( C tS +2 2 L R SS ) 1 k 2 < 1 s 2 = 3 N 2 1 ,8 t 2 S C ,where isaconstant and0 << 1. SimilarlytotheargumentinZhaoandYu(2006),(C2)requiresthe\r egressioncoe- cients"oftheirrelevantcovariatesontherelevantcovariatesar elessthan1.Thefollowing theoremstatestheselectionconsistencyof ^ .Theorem3. Iftheregularizationparametersselectedintherststeps atisfysomeregular- izationconditions;theregularizationparametersinthes econdstepareselectedas 1 = O ( n 1 = 2 ) and 2 C L h (1 3 4 ) 1 p s +1 i (16 3 ) 16(4 ) 1 ; b 0 hasalowerboundthat b 0 > 2(4 ) 8 3 ˚ h 8 2(4 ) 1 p s +2 2 C L i ,andfurtherassume(C1)and(C2)hold,thenwithproba- bilityapproachingto1,theestimated ^ satises sign( ^ )=sign( ) ;k ^ S S k 1 2(4 ) 8 3 ˚ h 8 2(4 ) 1 p s +2 2 C L i s 1 :5 N 1 :4.4Simulation Todenethecoecientfunctions ( )=( 0 ( ) ; 1 ( ) ; ; p ( )) T ,weusethefunctionsde- nedinDayeetal.(2012).Thefollowingfunctionsareusedtoconst ructthetruecoecient functions. 62 Polynomial: f pn 1 ( z )=2 z 1 ;f pn 2 ( z )=(1 2 z ) 3 Low-Frequency: f lf 1 ( z )=sin(2 ˇz ) ;f lf 2 ( z )=cos(2 ˇz ) High-Frequency: f hf 1 ( z )=sin(8 ˇz ) ;f hf 2 ( z )=cos(8 ˇz ) Compound: f cp ( z )=(1 2 z ) 3 cos(4 ˇz ). Now,wedenethetruecoecientsasthefollowing. 0 =2 f cp ; 3 =(1 ) f lf 2 + f lf 1 ; 1 =(1 ) f lf 2 + f pn 1 ; 4 =(1 ) f lf 2 + f hf 2 ; 2 =(1 ) f lf 2 + f hf 1 ; 5 =(1 ) f lf 2 + f pn 2 :where 0 istheinterceptfunctionandothersarethose5nonzerofunction s. isaconstant. If =0,allthevecoecientfunctionsarethesame.If 6=0,thecoecientfunctionsare notthesamesowecanassesstherobustnessofourmethod. Wegeneratethegraphicalstructureaswedidinchapter3with3gr oups.Thedegreeof correlationwithineachgroupsiscontrolledbythenumberofSNPsth eyshareincommon whichissetas4.Forthethreegroupswithagraphicalstructure, 5variablesaresimulatedin eachgroup.Weassumetwogroupsofvariableshavenonzerocoe cientfunctionsasspeced bythevefunctionsgivenabove.Therestvariableshavezerocoe cients.Wesimulatetwo cases.Intherstcase,weset as0.Inthesecondcase,weset as0.25toevaluate theselectionperformanceunderderentsituations. ; ;G ;X ,and Y aregeneratedasin chapter3. WecompareourmethodnamelyIVregressionwithgraphconstraine dpenalty(IVGC) withtheIVregressionmethodwithoutgraphconstrainedpenalty( IV).Thefollowingve measuresareusedtoevaluatetheperformanceofthetwometho ds.Sensitivityisdened 63 astheratiooftruepositivestothenumberofnonzerocoecients .Speccityisdened astheratiooftruenegativestothenumberofzerocoecients.G -measureisdenedas p sensitivity specificity .Thecloserto1aG-measureis,thebetterthevariableselection performanceis.Modelerrorisdenedas n 1 = 2 k X ^ ( Z ) X ( Z ) k 2 .Weusecrossvalidation tochoosethetuningparameters.Foreachcase,wesimulate200t imesandreporttheaverage valueandstandarderror. Table4.1:Simulationresultsfor p =100, q =100and =0 nMethodEstimationLossModelErrorSensitivitySpeccityG-meas ure 800IVGC1.10(0.57)1.39(0.23)0.99(0.06)0.69(0.09)0.68(0.09) IV4.45(1.13)2.36(0.28)0.74(0.11)0.67(0.10)0.5(0.07) 1000IVGC1.02(0.74)1.29(0.26)1.00(0.04)0.71(0.08)0.71(0.08) IV4.04(1.09)2.24(0.27)0.79(0.11)0.68(0.08)0.53(0.07) 1200IVGC0.84(0.29)1.17(0.15)1.00(0.00)0.70(0.07)0.70(0.07) IV3.63(1.11)2.10(0.29)0.83(0.11)0.66(0.07)0.55(0.08) 1400IVGC0.80(0.58)1.11(0.21)1.00(0.03)0.72(0.08)0.72(0.08) IV3.46(1.04)2.02(0.28)0.85(0.09)0.67(0.08)0.57(0.07) 1600IVGC0.67(0.22)1.03(0.14)1.00(0.00)0.70(0.08)0.70(0.08) IV3.12(0.90)1.91(0.25)0.88(0.08)0.64(0.08)0.56(0.07) Table4.1showstheresultswhen p =100, q =100and =0.IVGCobtainssmaller estimationlossandmodelerrorcomparedtoIV.ThisimpliesthatIVG Chasabetter estimationperformancethanIVdoes.ThesensitivityofIVGCishigh erthanthatofIV. Thisiswhatweexpected.HerewesetthenumberofcommonSNPsas 4.Whencorrelation isstrong,graphconstrainedregularizationcanobtainbettervar iableselectionperformance. Forspeccity,thetwomethodsperformquitesimilar.IVGChasahig herG-measurevalue whichmeansIVGChasabetterperformanceinvariableselection.In Figure4.1,wecansee clearlythatIVGChasabettervariableselectionandestimationperf ormancethanIVdoes. 64 Figures4.2and4.3showstheintegratedmeansquarederrorsfort ennonzerocoecient functions.WecanalsoseethatIVGCoutperformsIVintheestimat ionofallnonzero coecients.Table4.2showstheselectionrateswhichisdenedasth etimesagenebeing selecteddividedbythetimesofsimulation.IVGCcanselectallnonzer ocoecientsalmost everytime.However,IVhasapoorselectperformance. Table4.2:Selectionratefor p =100, q =100, n =800and =0 Method SelectionRate 1 2 3 4 5 6 7 8 9 10 IVGC1.000.980.991.000.981.000.991.001.000.98 IV1.000.110.580.870.691.000.781.000.960.45 Forthecase p =100, q =100, =0 :25,Tables4.3,4.4andFigures4.4-4.6showthe result.Weobservesimilarpatternsasthecasewhen =0.Overall,IVGChasbetter variableselectionandestimationperformancesthanIVdoes. Table4.3:Simulationresultsfor p =100, q =100and =0 :25 nMethodEstimationLossModelErrorSensitivitySpeccityG-meas ure 800IVGC1.08(0.55)1.61(0.19)0.99(0.07)0.66(0.08)0.65(0.09) IV2.95(0.66)2.18(0.19)0.72(0.11)0.65(0.08)0.47(0.07) 1000IVGC0.84(0.26)1.47(0.12)0.99(0.04)0.69(0.08)0.68(0.08) IV2.49(0.52)2.00(0.16)0.80(0.12)0.66(0.09)0.52(0.09) 1200IVGC0.70(0.14)1.39(0.09)1.00(0.00)0.69(0.08)0.69(0.08) IV2.18(0.52)1.89(0.17)0.84(0.10)0.65(0.09)0.54(0.07) 1400IVGC0.70(0.19)1.37(0.10)1.00(0.00)0.69(0.08)0.69(0.08) IV1.97(0.51)1.83(0.17)0.88(0.1)0.65(0.08)0.57(0.08) 1600IVGC0.63(0.13)1.32(0.08)1.00(0.00)0.69(0.07)0.69(0.07) IV1.77(0.41)1.74(0.14)0.90(0.08)0.64(0.07)0.57(0.07) Insummary,simulationresultsshowthattheproposedIVGCmetho dcanachievebetter selectionresultsandestimationaccuracycomparedtotheIVmeth odwithoutconsidering 65 0.20.40.60.81.0 SensitivitySample size Sensitivity4006008001000120014001600 IVGC IV0.500.550.600.650.700.75 SpecitivitySample size Specitivity4006008001000120014001600 IVGC IV0.00.20.40.60.8 G-measureSample size G-measure4006008001000120014001600 IVGC IV0246810 Estimation lossSample size Estimation loss4006008001000120014001600 IVGC IV012345 Model Error Sample size Model error4006008001000120014001600 IVGC IVFigure4.1:Simulationresultsfor p =100, q =100and =0 thegraphicalstructure.Asgenegraphstructurealwaysprese ntsinreality(wemaynotknow theexactstructurethough),ourmethodthatincorporatesgr aphpenaltyshouldalwaysbe preferredtoachievebetterestimationandselection. 66 0.00.20.40.60.81.0 IMSE of Beta1Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta2Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta3Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta4Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta5Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta6Sample size IMSE4006008001000120014001600 IVGC IVFigure4.2:Simulationresultsfor p =100, q =100and =0onestimationperformance 4.5ApplicationtoRealData Inthissection,weapplyourmethodtoaHumanLiverCohortdatase t.Forthedetails ofthedata,pleaserefertosection3.5inchapter3.Inthisanalysis ,wepickageasthe Z 67 0.00.20.40.60.81.0 IMSE of Beta7Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta8Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta9Sample size IMSE4006008001000120014001600 IVGC IV0.00.20.40.60.81.0 IMSE of Beta10Sample size IMSE4006008001000120014001600 IVGC IVFigure4.3:Simulationresultsfor p =100, q =100and =0onestimationperformance Table4.4:Selectionratefor p =100, q =100, n =800and =0 :25 Method SelectionRate 1 2 3 4 5 6 7 8 9 10 IVGC1.001.000.960.961.000.980.981.001.000.98 IV1.001.000.180.211.000.790.490.801.000.76 variableandtrytounderstandhowgeneeectschangeoveraget oaecttheenzymeCYP2E1 activity.CYP2E1isoneofthecytochromeP450proteinswhichcata lyzemanyreactions involvedinthesynthesisofcholesterol,steroidsandotherlipids.As forthegeneexpressions, wefocusontheMetabolismofXenobioticsbyCytochromeP450path way.Followingour two-stageestimationalgorithm,weapplyunivariatelinearregressio ntoselectthetop1,000 68 0.20.40.60.81.0 SensitivitySample size Sensitivity4006008001000120014001600 IVGC IV0.500.550.600.650.700.75 SpecitivitySample size Specitivity4006008001000120014001600 IVGC IV0.00.20.40.60.8 G-measureSample size G-measure4006008001000120014001600 IVGC IV02468 Estimation lossSample size Estimation loss4006008001000120014001600 IVGC IV1.01.52.02.53.03.54.0 Model Error Sample size Model error4006008001000120014001600 IVGC IVFigure4.4:Simulationresultsfor p =100, q =100and =0 :25 SNPsforeachgenetranscript,thenestimatetheircoecientsus ingLASSObyttingallthe 1,000transcriptsinthesamemodel.Thettedgeneexpressionvalu eswerethenusedfor thesecondstagevariableselectionandestimation.Thesameasinch apter3,thegraphical 69 0.00.10.20.30.40.5 IMSE of Beta1Sample size IMSE4006008001000120014001600 IVGC IV0.00.10.20.30.40.5 IMSE of Beta2Sample size IMSE4006008001000120014001600 IVGC IV0.00.10.20.30.40.5 IMSE of Beta3Sample size IMSE4006008001000120014001600 IVGC IV0.00.10.20.30.40.5 IMSE of Beta4Sample size IMSE4006008001000120014001600 IVGC IV0.00.10.20.30.40.5 IMSE of Beta5Sample size IMSE4006008001000120014001600 IVGC IV0.00.10.20.30.40.5 IMSE of Beta6Sample size IMSE4006008001000120014001600 IVGC IVFigure4.5:Simulationresultsfor p =100, q =100and =0 :25onestimation structureforthepathwaywasmanuallyobtainedfromtheKEGGda tabase.Ingeneral,if twogenesareconnectedeitherdirectlyorindirectlyinthepathway ,weconsiderthemas correlatedandthecorrelationisassumedtobe1whichisusedasthe weightinthegraph 70 0.00.10.20.30.40.5 IMSE of Beta7Sample size IMSE4006008001000120014001600 IVGC IV0.00.10.20.30.40.5 IMSE of Beta8Sample size IMSE4006008001000120014001600 IVGC IV0.00.51.01.52.0 IMSE of Beta9Sample size IMSE4006008001000120014001600 IVGC IV0.00.10.20.30.40.5 IMSE of Beta10Sample size IMSE4006008001000120014001600 IVGC IVFigure4.6:Simulationresultsfor p =100, q =100and =0 :25onestimation constrainedpenaltyfunction. Stabilityisusedtoshowtheselectionrobustnessoftheselectedge nes.Werandomly select80%ofthedatausedtorunouralgorithm.Werepeatdoingth is100timesand choosethosevariablesappearthemostoften.Table4.5liststheto p10genesusingeach method.Weobserveverysimilarlistofgenesbythetwomethods,ex ceptthatIVGChas higherselectionratethanIVdoes.Amongthetopgenes,CYP2E1h asthehighestselection rate.Thisisnotsurprisesincetheresponseistheactivityofthisge ne.Thehigherselection rateforthisgenesusingIVGCmethodindicatestherobustnessof ourmethod.Figure4.7 showsthegraphicalstructureforthetop10selectedgenesusin gtheIVGCmethodextracted 71 fromtheMetabolismofXenobioticsbyCytochromeP450pathwayint heKEGGpathway database.CYP2E1hasmoreconnectionsinthegraphandincorpor atingthegraphstructure inthepenalizedregressiongreatlyenhancestheselectionrateofr elevantgenesthatfunction togetherwithCYP2E1toaecttheenzymeactivityofCYP2E1. Table4.5:Listoftop10genesbasedonthestabilityselectionrate IVGCIV SelectedgenesSelectionrateSelectedgenesSelectionrate CYP2E10.975CYP2E10.840 UGT2A20.855UGT2A20.635 CYP2F10.740CYP2F10.520 UGT1A90.715UGT1A90.475 GSTM20.690GSTM20.455 UGT1A50.645UGT1A50.355 CYP2S10.635CYP2S10.350 GSTA40.610GSTA40.285 GSTM30.560GSTM30.205 UGT1A40.485MGST20.150 Tounderstandhowthegenefunctionsoverage,weplottheestima tedcoecientfunctions forthetop2genesCYP2E1andUGT2A2inTable4.5.AsshowninFigure 4.8,theestimated functionsoverageforthetwogenesshowanonlinearpattern.Th eoveralldecreasingpattern forCYP2E1aspeopleagingindicatesthenonlinearnegativeeectof ageongeneexpression. Intheotherwords,aspeopleaging,theexpressionofCYP2E1red ucesandthecorresponding enzymeactivitygetsweak.MoststudiesonCYP2E1activityinhuman arefocusedonfetus orinfantsandtheresultsdonotapplytoourcase.Studiesinratsh aveshowndecrease inCYP2E1activityasratsagingandthismightbecorrelatedwithanac cumulationof oxidativedamage(Wauthieretal.2004).Thedecreasingeectoft hisgeneaspeopleaging mightalsocontributestotheaccumulativeoxidativedamage.Furth erbiologicalvalidation isneededtoconrmthis. 72 CYP2E1UGT2A2CYP2F1UGT1A9GSTM2UGT1A5CYP2S1GSTA4 GSTM3UGT1A4Figure4.7:Graphicalstructureofthetop10genesselectedbyIV GC. 0.00.20.40.60.81.0 0.00.51.0 CYP2E1Age/80b(Age80)0.00.20.40.60.81.0 0.00.51.0 UGT2A2Age/80b(Age80)Figure4.8:TheestimatedcoecientfunctionsforgeneCYP2E1and UGT2A2. Inaddition,comparingtotheresultsweobtainedinchapter3,more geneshaveaselection ratelargerthan50%.Thisindicatesthatincorporatingthenonlinea rmodulationeectofage 73 inthethemodeladdsmorepowertoidentifythepotentialtruth.F urtherlabinvestigations areneededtorevealtherealmechanismofthesegenefunctions .4.6Discussion Weintroduceamethodwhichcombinesinstrumentalvariablesandgr aphconstrainedreg- ularizationtoselectimportantgenesassociatedwithaphenotypeo finterest.Weallowthe coecientstovaryastheenvironmentalvariablechanges.Modelin gthenonlinearmodca- tioneectofenvironmentalvariableallowsustostudyhowgenesint eractwithanenviron- mentalvariabletoaectaphenotypictrait.Weproposeatwo-ste pestimationprocedure. Selectionconsistencyisestablished.Simulationresultsshowthatou rmethodperformsbet- tercomparedtoIVregressionwithoutconsideringnetworkstruc ture.Ahumanlivecohort datasetisusedtodemonstratetheutilityofourmethod. Ingeneral,groupLASSOdoesnothaveoracleproperty.GroupSC ADandgroupMCP arealternativeswhichenjoytheoracleproperty.Thesepenaltyt ermscanbeappliedinour proceduretoreplacegroupLASSOpenalty.Thesewillbeinvestigat edinourfuturestudies. Inourrealdataanalysis,weapplytheknowngraphstructureobt ainedfromtheKEGG pathwaydatabase.Ifthegraphicalstructurecannotbeobtain edinpractice,itcanbe learnedfromthedataitself.Forexample,wecanapplymethodsliket hethresholding methodsproposedbyBickelandLevina(2008a,2008b),Rothman ,LevinaandZhu(2009), andLamandFan(2009),tolearnthecorrelationstructure.Then thelearnedstructurecan beincorporatedintotheproposedIVGCproceduretoachievebet tergeneselectionresults. Therobustnessofthemodelifwronggraphicalstructuresarea ppliedisanothertopicthat isworthyexploringinthefuture. 74 Chapter5 Conclusionandfuturework Estimationandapplicationofgraphicalstructuresarethemaintop icsinthisdissertation. Chapter2isabouttheestimationofgraphsandChapters3and4ar eabouttheapplication ofgraphs.InChapter2,weproposedacovariate-adjustedPCa lgorithmtoestimatedirected graphswhichprovidescausalinformation.InChapter3,weconsid erthevariableselection andestimationprobleminaframeworkofinstrumentalvariablesreg ression.Agraphcon- strainedregularizationisappliedtotakeadvantageofthepriorkno wledgeongenenetworks. InChapter4,weextendtheworkinChapter3toanonparametricm odelwherevarying coecientmodelisadoptedtocapturethenonlinearmodulationee ctofanenvironment variableongenefunction. Regularizationmethodsarecommonlyappliedinhigh-dimensionalregr essionstoover- comethedcultiesbroughtbythe\curseofdimensionality".Thesp iritisencouraging sparsity.Whenweestimategraphicalstructures,thedimensiono ftheedgeswewantto estimateismuchhigherthantheamountofthedatainthesample.As sumingmostofthe edgesdonotexist,wecansuccessfullyestimatethepositionsofth eexistentedges.However, theproblembecomesmoredcultwhenthetotalnumberofedgesa reincreasingatavery highratealongwiththeincreaseofthesamplesize.Screeningisagoo dmethodtohandle thisproblem.Afterscreeningthevariables,wehavelessvariablest oconsider.Thecombi- nationofscreeningmethodsandregularizationmethodsisanintere stingtopictoexplorein thecontextofgeneticalgenomicanalysis. 75 Anotherwaytoestimategraphisthroughhypothesistesting(Liu, 2013).Wemay constructtestprocedurestotesttheexistenceofeachedge, andcontrolthefamily-wiseerror rateorfalsediscoveryrate.Thisrelatestheestimationofgraphic alstructurestoanother importantstatisticstopicinmultipletesting.Asmanymethodsformu ltipletestinghave beenourishedinliterature,howtoapplythemingeneticalgenomica nalysisisachallenge topic.Thedcultyreliesonthetwo-dimensionalscanoverthousan dsofgeneexpressions andmillionsofSNPvariants.Amongthemethodsbeingstudied,Gauss ianrandomelds andtherelatedgeometryhavebeenusedtocontrolfamily-wiseer rorrate(AdlerandTaylor, 2007),andseemstobepromisingtoinvestigateinthiseld. TheworksinChapter3andChapter4considerquantitativetraits. Theycanbegen- eralizedtogeneralizedlinearmodelswithbinarydiseaseresponseswh icharecommonly observedinhumangeneticsstudies.Suchanextensionshouldbeco mputationallyfeasible andeasytoimplement.However,theoreticalevaluationshouldbec hallenging.Wewill considertheIVGCmethodinaGLMmodelsetupandinvestigatethem odelperformance boththeoreticallyandempiricallyinthefuture. Insummary,thisthesiscontributestothetheoryofgraphestima tionandpenalized regressionincorporatinggraphicalstructure,aswellastogene ticalgenomicsdataanalysis inapplication.Wewillconsidertheaforementionedfutureworkande xtensions,andcontinue statisticalinvestigationsalongtheline. 76 APPENDICES 77 AppendixA ProofofTheorem1 Thefollowingsaretherequiredconditions. (C1) 1 n X T X ! C as n !1 ,where C isapositivedenitematrix; (C2) 1 n max 1 i n x T i x i ! 0; (C3)Thedistribution P n oftheerrorterminthemodelisnormallydistributedand faithfultoaDAG G n forall n ;(C4) p n = O ( n a ),forsome0 a< 1 ,and q isxed; (C5)Dene n =max 1 j p n j adj ( G;j ) j ,where adj ( G;j )representstheadjacentsetof j in G . n = O ( n 1 b ),forsome0 ) P ( j ^ ˆ n ;i;j jk ( B ) ˆ n ;i;j jk j > j ^ ˆ n ;i;j jk ( ^ B ) ^ ˆ n ;i;j jk ( B ) j ).UsingLemma2andCorollary1inKalischandBuhlmann(2007), proofisdone. ProofofTheorem1:TheoremsinKalischandBuhlmann(2007)areb aseduponCorollary 81 1intheirpaper.ByLemma3,weobtainedthesameresultastheirCor ollary1after introducingtheestimationof B .So,wecanobtaintheresultastheirTheorem2underthe assumptionsweputinourTheorem1. HerewewritethePC-algorithminKalischandBuhlmann(2007).Ther earetwosteps inthealgorithm.Intherststep,theskeletonstructureisestima ted.Inthesecondstep, thedirectionoftheedgesaredetermined. Input :VertexSet V ,ConditionalIndependenceInformation Output :Estimatedskeleton C ,separationsets S (onlyneededwhendirectingthe skeletonafterwards) Formthecompleteundirectedgraph ~ C onthevertexset V ;repeat l = 1; C = ~ C ;repeat Selecta(new)orderedpairofnodes i;j thatareadjacentin C suchthat j adj ( C;i ) nf j gj l ;repeat Choose(new) k adj ( C;i ) nf j g with j k j = l ;if i and j areconditionalindependentgiven k then Deleteedge i;j ;Denotethisnewgraphby C ;Save k in S ( i;j )and S ( j;i ); end until edge i;j isdeletedorall k adj ( C;i ) nf j g with j k j = l havebeen chosen ;until allorderedpairsofadjacentvariables i and j suchthat j adj ( C;i ) nf j gj l and k adj ( C;i ) nf j g with j k j = l havebeentestedforconditionalindependence ;until foreachorderedpairofadjacentnodes i;j :j adj ( C;i ) nf j gj 0such that,ifweselect 1 asintheTheorem,thenwithprobabilityatleast1 c 1 ( pq ) c 2 ,wehave k 1 n ^ X T 1 n ^ X T ( ^ X X ) ) k 1 2(4 ) 1 :(B.5) Fromnowon,webaseouranalysison(B.5). 86 Byusingtheequality Y = X S S + ,wehave Y ^ X ^ = ( ^ X S X S ) S ^ X S ( ^ S S ).Replacing ^ S with S ,wewrite(B.3)as ^ C SS ( ^ S S )+2 2 L SS ^ S +2 2 L T S C S ^ S C = 1 n ^ X T S 1 n ^ X T S ( ^ X S X S ) S 1 sign( ^ S ) :Aftersomealgebra,wehave ^ S S =( ^ C SS +2 2 L SS ) 1 h 1 n ^ X T S 1 n ^ X T S ( ^ X S X S ) S 1 sign( ^ S ) 2 2 L SS S i :(B.6) ByLemma4and(B.5),wehave k ^ S S k 1 k ( ^ C SS +2 2 L SS ) 1 k 1 h k 1 n ^ X T S 1 n ^ X T S ( ^ X S X S ) S k 1 + k 1 sign( ^ S ) k 1 + k 2 2 L SS S k 1 i 2(4 ) 8 3 ˚ h 2(4 ) 1 + 1 +2 2 C L i = 2(4 ) 8 3 ˚ h 8 2(4 ) 1 +2 2 C L i 0such that,ifweselect 1 asintheTheorem,thenwithprobabilityconvergingto1,wehave k 1 n ^ X t 1 n ^ X t ( ^ X 1 X 1 ; ;^ X s X s ) 0 B B B B B @ B 0 ...0 B 1 C C C C C A S k 2 2(4 ) N 1 N 2 1 So,wehave k 1 n ^ V T t 1 n ^ V T t ( ^ V S V S ) ˘ S k 2 2(4 ) 1 (C.5) 94 Usingtheequality Y = V S ˘ S + andreplace ^ S with S ,wehave ^ ˘ S ˘ S =( ^ C SS +2 2 L SS ) 1 h 1 n ^ V T S 1 n ^ V T S ( ^ V S V S ) ˘ S 1 ^ ˘ t k ^ ˘ t k 2 t 2 S 2 2 L SS ˘ S i :(C.6) Then,wehave k ^ ˘ S ˘ S k 2 k ( ^ C SS +2 2 L SS ) 1 k 2 h k 1 n ^ V T S 1 n ^ V T S ( ^ V S V S ) ˘ S k 2 + k 1 ^ ˘ t k ^ ˘ t k 2 t 2 S k 2 + k 2 2 L SS ˘ S k 2 i k ^ ˘ t ˘ t k 2 = k (( ^ C SS +2 2 L SS ) 1 ) T t h 1 n ^ V T S 1 n ^ V T S ( ^ V S V S ) ˘ S 1 ^ ˘ t k ^ ˘ t k 2 t 2 S 2 2 L SS ˘ S i k 2 k (( ^ C SS +2 2 L SS ) 1 ) T t k 2 h k 1 n ^ V T S 1 n ^ V T S ( ^ V S V S ) ˘ S k 2 + k 1 ^ ˘ t k ^ ˘ t k 2 t 2 S k 2 + k 2 2 L SS ˘ S k 2 i = k (( ^ C SS +2 2 L SS ) 1 ) T t k 2 h ( X t 2 S k 1 n ^ V T t 1 n ^ V T t ( ^ V S V S ) ˘ S k 2 2 ) 1 2 + k 1 ^ ˘ t k ^ ˘ t k 2 t 2 S k 2 + k 2 2 L SS ˘ S k 2 i 2(4 ) 8 3 ˚ h 2(4 ) 1 p s + 1 p s +2 2 C L i