GRAPHESTIMATIONANDNETWORKCONSTRAINEDREGULARIZATIONW ITH APPLICATIONSINGENETICALGENOMICSANALYSIS By BinGao ADISSERTATION Submittedto MichiganStateUniversity inpartialfullmentoftherequirements forthedegreeof Statistics|DoctorofPhilosophy 2015 ABSTRACT GRAPHESTIMATIONANDNETWORKCONSTRAINED REGULARIZATIONWITHAPPLICATIONSINGENETICALGENOMICS ANALYSIS By BinGao Estimationandapplicationofgraphicalstructureareimportantto picsinmodernstatis- tics.Graphicalstructureisanidealtooltodescribegeneregulat orynetworkwhichisimpor- tantingeneticalgenomicsstudies.Inthisdissertation,agraphes timationmethodandtwo networkconstrainedregularizationmethodsareproposed.Both theirtheoreticalproperties andapplicationsingeneticalgenomicsarestudied. Largeamountofresearcheortshavebeenfocusedonestimatin ggenenetworksbased ongeneexpressiondatainordertounderstandthefunctionalba sisofalivingorganism.By treatinggeneexpressionsasquantitativetraitswhileconsideringg eneticmarkers,genetical genomicsanalysishasshownitspowerinenhancingtheunderstandin gofgeneregulation. Knowingthatgeneexpressionsareoftenduetodirectedregulatio ns,weintroduceacovariate- adjustedGaussiangraphicalmodeltoestimatetheequivalencecla ssofthedirectedacyclic graphs(DAGs)inageneticalgenomicsanalysisframeworkinthesec ondchapter.Anestima- tionprocedureisintroducedandtheestimationconsistencyforsp arseDAGsisestablished. WeapplythemethodtoahumanAlzheimer'sdiseasedatasettoshowt heutilityofthe method. Givenageneregulationnetwork,learnedeitherthroughbiologicale xperimentsorvia statisticalmethods,incorporatingsuchstructureinaregressio nmodelwouldincreasethe phenotypepredictionaccuracy.Inchapter3,weusegeneexpre ssionstopredictphenotypic responseswhileconsideringthegraphicalstructuresongenenet worksinaparametricre- gressionmodel.Giventhatgeneexpressionsareintermediatephen otypesbetweenatrait andgenes,wefollowtheinstrumentalvariableregressionframewo rkproposedbyLinetal. (2015)andtreatgeneticvariantsasinstrumentalvariablestode alwiththeendogeneityissue. Weproposeatwo-stepestimationprocedure.Intherststep,w eapplylassotoestimate theeectsofgeneticvariants.Inthesecondstep,weusethepr edictedexpressionsobtained fromtherststepaspredictorswhileadoptinganetworkconstra inedregularizationmethod toobtainbettervariableselectionandestimation.Weestablishthes electionconsistency andrelated L 1 bound. Itiscommonlyrecognizedthatgeneticeectsoncomplextraitsare largelymodulated byenvironmentalinuences.However,themodulationmechanismis barelyunderstood. Basedontheworkinchapter3,weadoptaninstrumentalvariabler egressionapproach whileincorporatingavaryingcoecientmodeltocapturepotential nonlinearenvironmental modulationeectsinchapter4.B-splinesareusedtoapproximatet hecoecientfunctions. Grouplassoandagraphconstrainedpenaltiesareappliedtoachieve bettervariableselec- tionandestimationwhennetworkstructuresareconsideredforg eneexpressions.Selection consistencyisestablishedundersomeassumptions.Weapplyourme thodtoahumanliver cohortdatasettodemonstrateitsutility. Idedicatethisdissertationtomyparents,LianrongSongandQings hengGao. iv ACKNOWLEDGMENTS IwouldliketogratefullyandsincerelythankmyadvisorDr.YuehuaCu iforhisguidance, understanding,patienceduringmygraduatestudyandresearch .Dr.Cuiguidedmeinto theinterdisciplinaryeldofstatisticalgenetics.Hehelpedmeselect importantpapersto readandtaughtmethebasicconceptsandtechniquesinstatistica lgeneticsatthebeginning ofmyresearch.IrememberthosediscussionsinhisoceinwhichIlea rnedbasictoolsand techniqueswhichwouldbeveryimportantinmyresearch.Whenever Ineedhelp,Dr.Cuiis alwaysthere.EverytimeIknockedthedoorofhisoce,healwayss penttimelisteningto myquestionspatientlyandgavehelpfulcomments.IntheJournal Clubwhichishostedby Dr.Cui,Ilearnedalotfromothers'presentation.Ialsodeveloped mypresentationskills. Iappreciateothermembersinmycommittee.Dr.YiminXiaowasmyinitia ladvisor. Hegavemehelpfuladvisesontheselectionofderentcoursesand guidanceonmylifewhen IjustarrivedatUS.Dr.Ping-ShouZhonghelpedinmanyaspectsth rougheverydiscussion wehad.Hisinsightfulopinionsmademeviewquestionsderently.Dr. QingLu'scomments andquestionsintheJournalClubweregermaneandthoughtful.All themembersinthe committeehelpmeduringmygraduatestudy.Ireallyappreciatethe irguidanceandhelp. IappreciatealltheprofessorswhotaughtmeandhelpedmeintheD epartmentofStatis- ticsandProbability.Iobtainedknowledgefromthosewellprepared coursesandenjoyedthe processoflearningandtheinteractionwiththeinstructors. MythanksgotostudentsinDr.Cui'sgroup,includingHonglangWang, TaoHe,Shunjie Guan,JingyiZhang,postdoctoralresearcherXuLiu,andforma lstudentDr.CenWu.I alsowanttothankotherstudentsinthedepartment,includingYuz henZhou,XiaoqingZhu, LiqianCai,SiddharthaNandy,JiwoongKimandmanyothers.Itismyluc kthatsomany v nicepeopleandfriendsarearoundme.Theyhelpedmeinbothresear chanddailylife. Lastbutdenitelynotleast,Ithankmyparentsfortheirsupport .Theyalwaysmakeall possibleeortstohelpmewheneverIneed.Theyowncreditsinvery achievementofmine. vi TABLEOFCONTENTS LISTOFTABLES .................................... ix LISTOFFIGURES ................................... x Chapter1Introduction ............................... 1 1.1Overview......................................1 1.2Integrativegeneticalgenomicsanalysis.................. ...2 1.3Estimatioinandapplicationofgraphicalstructures.......... ....4 1.4Varyingcoecientmodelcapturinggene-environmentinteract ion......6 1.5Objectivesandorganization........................... 7 Chapter2Learningdirectedacyclicgraphicalstructuresw ithgenetical genomicsdata .............................. 9 2.1Introduction....................................9 2.2ModelandEstimation..............................12 2.2.1Background................................12 2.2.2Covariate-adjustedgaussiangraphicalmodel........... ...13 2.2.3Atwo-stageestimationprocedure....................1 4 2.3Theoreticalresult.................................1 5 2.4Simulation.....................................17 2.4.1Simulationdesign.............................17 2.4.2Performanceinthelowdimensionalcase................19 2.4.3Performanceinthehighdimensionalcase................1 9 2.4.4Comparinggraphswithorwithoutcovariates.............2 2 2.4.5Comparingdirectedandundirectedgraphs............... 23 2.5ApplicationtoRealData.............................24 2.6Discussion.....................................28 Chapter3IntegrativeAnalysisofGeneticalGenomicsDataI ncorporat- ingNetworkStructure ......................... 30 3.1Introduction....................................30 3.2StatisticalMethodsandEstimation...................... .34 3.2.1TheModel.................................34 3.2.2Estimationofgeneticeects.......................35 3.2.3Networkconstrainedregularization...................3 6 3.3TheoreticalProperties.............................. 37 3.4Simulation.....................................39 3.5ApplicationtoRealData.............................42 3.6Discussion.....................................46 vii Chapter4GraphConstrainedRegularizationforNonparamet ricInstru- mentalVariablesRegressioninGeneticalGenomicAnalysis .51 4.1Introduction....................................51 4.2NonparametricModelsandGraphConstrainedRegularization... .....55 4.2.1TheModelSetup.............................55 4.2.2FirstStepEstimation...........................55 4.2.3SecondStepEstimation.........................56 4.3TheoreticalResults................................6 0 4.4Simulation.....................................62 4.5ApplicationtoRealData.............................67 4.6Discussion.....................................74 Chapter5Conclusionandfuturework ...................... 75 APPENDICES ...................................... 77 AppendixAProofofTheorem1............................ 78 AppendixBProofofTheorem2............................ 84 AppendixCProofofTheorem3............................ 89 BIBLIOGRAPHY .................................... 98 viii LISTOFTABLES Table2.1Simulationsetupinthehighdimensionalcasewithxed ......21 Table2.2Simulationsetupinthehighdimensionalcasewithvarying p ,q and .22 Table2.3Estimationcomparisonwithandwithoutadjustingforthema rker eectsunderderentscenarios.....................24 Table2.4EstimationcomparisonbetweenPCandcaPCwhen q>p ......24 Table2.5Estimationcomparisonbetweendirectedandundirectedgr aphs...25 Table3.1Simulationresultsforp=100,q=100,n=600............ .41 Table3.2Simulationresultsfor p =100, q =100, n =200 1400andnumSNP=444 Table3.3Simulationresultsfor p =600, q =600and n =300.........45 Table3.4Listoftop10selectedgenesforeachoftheIVGCandIVme thod.46 Table3.5thegenesinpathwayofMetabolismofxenobioticsbycytoch rome P450-Homosapiens..........................49 Table4.1Simulationresultsfor p =100, q =100and =0...........64 Table4.2Selectionratefor p =100, q =100, n =800and =0........65 Table4.3Simulationresultsfor p =100, q =100and =0 :25.........65 Table4.4Selectionratefor p =100, q =100, n =800and =0 :25......68 Table4.5Listoftop10genesbasedonthestabilityselectionrate... ....72 ix LISTOFFIGURES Figure2.1Performanceofthetwo-stageestimationalgorithmunde rderent parameterandsamplesizesettings.Thecirclesrepresentthespa rse graphestimationresultswith E [N ]=2andthetrianglesrepresent thedensegraphestimationresultswith E [N ]=5.Theupperand lowerboundsofthe95%condenceintervalsarealsoshownwhich aredculttoseparateunderlarge n ..................20 Figure2.2BoxplotofTPRandFPRunderderentsettingswhen q =100...21 Figure2.3BoxplotofTPRandFPRunderderentsettingswhen q =500...22 Figure2.4BoxplotofTPRandFPRunderderentsimulationsettingsw ith varying q and ..............................23 Figure2.5TheestimatedCPDAGinthecase(left)andcontrol(right )group basedonthetop100genes.Thoseoverlappednodesindicatedual regulationdirections...........................26 Figure2.6TheestimatedCPDAGinthecase(left)andcontrol(right )group basedongenesmappedtotheAlzheimer'sdiseasepathwayfromthe KEGGdatabase.Thoseoverlappednodesindicatedualregulation directions.................................27 Figure3.1Illustrationoftheinstrumentalvariablesmodelingenetic algenomics analysis..................................34 Figure3.2Simulationresultsforxed p =100and q =100butvarying n (200 to1400).................................43 Figure4.1Simulationresultsfor p =100, q =100and =0...........66 Figure4.2Simulationresultsfor p =100, q =100and =0onestimation performance...............................67 Figure4.3Simulationresultsfor p =100, q =100and =0onestimation performance...............................68 Figure4.4Simulationresultsfor p =100, q =100and =0 :25.........69 Figure4.5Simulationresultsfor p =100, q =100and =0 :25onestimation.70 x Figure4.6Simulationresultsfor p =100, q =100and =0 :25onestimation.71 Figure4.7Graphicalstructureofthetop10genesselectedbyIVG C.......73 Figure4.8TheestimatedcoecientfunctionsforgeneCYP2E1andU GT2A2.73 xi Chapter1 Introduction 1.1Overview Inthisdissertation,westudytheestimationandapplicationofgrap hicalstructuresinge- neticalgenomicsanalysis.Generegulatorynetworkscontainabun dantinformationabout thefunctionsofgeneexpressionsandthemechanismsofgenereg ulations.Inhumanbody, hundredsofthousandsgenesareexpressedandinteractwithea chothertoachieveallnec- essaryfunctionsincells.Theyareexpressedatpropertimeandins ucientamountto determinewhenandhowproteinsaresynthesized.Iftheenvironm entalfactorsarechanged, likeatoxicsubstanceappears,dysfunctionsofgeneexpressionm ayoccur.Theexpressions ofasetofgenesmaybesuppressedorenlarged.Asaresult,theg eneswhoseexpression areaectedbythatsetofgenesmaybealsodysfunctional.Finally, adiseasemayappears andhencethehealthofhumanbodyisthreatened.So,itisalwaysan interesttolearnthe truearchitectureofgeneregulatorynetworksandtoapplythek nowledgetofacilitateother researchesandimprovepublichealth.Inthischapter,werstpro videsomebackgroundon geneticalgenomicsanalysisandthetheoryofgraphanditsvarious applicationsin1.2and 1.3.Tosolvetheproblembroughtbygene-environmentinteraction ,varyingcoecientmodel anditsadvantageandexibilityareintroducedin1.4.Theobjectives andorganizationof thedissertationarein1.5. 1 1.2Integrativegeneticalgenomicsanalysis DNAmicroarraytechnologyisusedtomeasuretherelativeabundan ceoflargeamountof genessimultaneously.Inthepast,scientistscouldonlyanalyzeafe wgenesatoncebecauseof thelimitationofmeasurementtechnology.DNAmicroarrayoerspe oplegreatopportunities tounderstandthefunctionsofgeneexpressions.Recentdevelo pmentsinnextgeneration sequencingoerfaster,moreaccurate,andcheapertechnolog ytomeasuregeneexpression andgeneticvariantswithdigitalresolution.Alloftheprogressofm easurementtechnology providespromisingopportunityintheeldofgeneticalgenomicsana lysis. Regressionmodelsarecommonstatisticalmodelsusedtoexploreth erelationshipofgene expressionandaphenotypictrait.Aphenotypictraitcanbeadisea seorabodyfeature. However,thedimensionalityofgeneexpressionfeatures( ˇ 40 K )istypicallylarge.Classical statisticalmethodsdonotworkinthissituation.Sincethepioneerin gworkofLASSOby Tibshirani(1996),variableselectioninpenalizedregressionshasbe encommonlyadopted tosolvethelarge p small n problem.Manyregularizationmethodshavebeendeveloped sincethen,includingSCAD(FanandLi,2001),adaptiveLASSO(Zoua ndHastie,2005), groupLASSO(YuanandLin,2006),andMCP(Zhang,2010).Withth evariableselection technique,potentialcandidategenescanbescreenedforfurth erbiologicalvalidation. Whileusinggeneexpressiontopredictthephenotypeofinterest,w ecanalsostudy therelationbetweengenotypeandgeneexpression.Ithasbeena longhistorythatpeople trytondrelationshipbetweenquantitativetraitsandgeneticvar iants.Quantitativetrait loci(QTL)mappinghasbeenusedtondgeneticvariantswhichgove rnthephenotypic variationofacomplextrait,suchasbloodpressureorgrainyield.In QTLmappingstudies, geneticmarkerssuchassinglenucleotidepolymorphisms(SNPs)can beusedtopinpointthe 2 locationofgeneticvariantsthataectthetraitofinterest.Since thehugesuccessofQTL mapping,expressionquantitativetraitloci(eQTL)mappinghasbee nextensivelystudied. IneQTLmapping,expressiontraitsaretreatedasquantitativetr aits.Thus,onecanstudy thegeneticarchitectureofgeneexpressionandfurtherdissect thegeneregulationmechanism suchascis-andtrans-regulation.IftheeQTLisintheregionoforc losetothegenewhose expressionistheinterestedquantitativetrait,sucheQTLiscalledc is-eQTL.IftheeQTL isfarfromthegenewhoseexpressionistheinterest,itiscalledtran s-eQTL.Studieshave shownthatcis-regulationisusuallystrongerthantrans-regulatio nandmuchworkhasbeen donetostudycis-andtrans-eQTL(e.g.,Giladetal.,2008). eQTLmappingisalsocoinedasgeneticalgenomicsanalysissinceitcomb inesgenetic variantswithgeneexpressiondatatogether.Withthedevelopmen tofhumanhapmapand the1000genomeprojects,millionsofSNPvariantshavebeenident ed.Thusithasbe- comefeasibletopinpointspeccvariantslocatedinageneregionorin ter-genicregionthat regulategeneexpressions.Geneexpressiondatahavebeenbroa dlyappliedtostudygenenet- works.Forexample,basedonthecorrelationofgeneexpressions ,onecaninferthenetwork structureusingtechniquessuchasBayesiannetwork(Friedmane tal.,2000)orGaussian graphicalmodel(Dobraetal.,2004).However,withoutadditional biologicalinformation, suchestimatednetworksareundirected.Giventhatgeneregulat ionsareoftendirected,a directednetworkinferenceshouldmakemorebiologicalsenseandb emoreinformative.In addition,twoindependentgenescouldshowfalsecorrelationifthey shareacommongenetic regulator.Thecorrelationshouldbegoneifthecommongeneticee ctisadjusted(Caietal. 2013).Thismotivatesustodevelopadirectedgraphestimationpro cedurewhileadjusting forthegeneticeects. Afterformulatingmodelstodescribetherelationbetweenphenoty peandgeneexpres- 3 sionandtherelationbetweengeneexpressionandgenotype,wear einterestedinhowto incorporategeneregulatorynetworksintotheanalysis.Forthisp urpose,weintroducesome backgroundknowledgeontheestimationandapplicationofgraphica lstructuresinthenext section. 1.3Estimatioinandapplicationofgraphicalstructures Generegulatorynetworkisagraphicalstructurewhichcanbelear nedeitherbyexperimental methodsorbystatisticalmethods.Itisusedtodescriberegulato ryrelationsbetweengenes. Ingeneral,agraphoranetworkconsistsofnodesandedges.Ifw euse G todenoteagraph, V todenotethesetofthenodes,and E todenotethesetoftheedges,wecanwriteagraph as G =( V;E ).Here,anodeisageneexpressionandanedgeisacorrelationbetw eentwo genes.Towkindsofgraphsarestudiedorappliedinthisdissertation ,directedgraphand undirectedgraph. Derentkindsofgraphicalstructuresaredenedintheliteratur e.Let X =( X 1 ;:::;X p ) followamultivariatenormaldistribution N ( ; ) ;=( ˙ ij ) ;= 1 =( ! ij ).Atleast threegraphicalstructurescanbedened. 1.Structuredcovariancematrix.Forexample,itisassumedclosed ataintimeare stronglycorrelatedinlongitudinalananlysis. 2.Gaussiangraphicalmodel. X i ?? X j jf X t g t 6= i;j () ! i;j =0. 3.Causalrelation(directedgraph)canbedenedifotherassump tionsareassumed. Therstoneisusedtoinferthecorrelationrelationsbetweengene s.Thesecondisused toinfertheconditionalcorrelationrelationsbetweengene.Thelas toneisusedtodocausal inference.alotofliteraturescanbefoundaboutthesetopics(Bic kelandLevina,2008a, 4 2008b;Rothman,LevinaandZhu,2009;LamandFan,2009;Meinsh arsenandBuhlmann, 2006;Friedman,HastieandTibshirani,2007;Cai,Li,LiuandXie,2013 ). Aftergeneregulatorynetworksarelearned,thenextchallengeis howtoapplythemas priorknowledgetohelpfurtheranalysis,forinstance,studyingth erelationsbetweengene expressionsandtraitsofinterest.Thefollowinglistsfourregulariz ationmethodswhichtake advantageofthegenegraphicalstructuresinvariableselectiona ndestimation: 1.FusedLASSO(Tibshirani,etal.,2005.): p ( 1 ; 2 )= 1 k k 1 + 2 P j j j j 1 j 2.GRACE(LiandLi,2008): p ( 1 ; 2 )= 1 k k 1 + 2 P i ˘ j ( i p d i j q d j ) 2 3.GFlasso(KimandXing,2009): p ( 1 ; 2 )= 1 k k 1 + 2 P i ˘ j j i sign(^ r ij ) j j 4.GOSCAR(Yangetal.,2012): p ( 1 ; 2 )= 1 k k 1 + 2 P i ˘ j max( i ; j ) where p ( 1 ; 2 )isapenaltytermand 1 ; 2 aretuningparameters.Intheseworks,derent penaltiesareimposedtodealwithderentgraphicalstructures. ThefusedLASSOrequiressmoothnessontheeectsofadjacent variables.Thiscanbe achievedbyorderingthevariablesproperly.GRACEusesthespec cgraphicalstructure andencouragegreatereectsof\hub"geneswhicharethosewit hmanyconnectionsto othergenes.GFlassoissimilartotheideaoffusedLASSOexceptitus esnetworkstructures ratherthanadjacentsmoothness.GOSCARisusedtoencourage theeectsoftwoconnected genestobeexactlythesame.Thederencebetweenabsolutepen altyandsquaredpenalty isabsolutepenaltyencouragestwocoecientstobeexactlythesa mewhilesquaredpenalty doesnot.Thesesuccessfulapplicationsofgraphsinspireustoap plynetworkinformationin ourwork.InChapter3andChapter4,aparametricgraphicalstr uctureandanonparametric graphicalstructureareusedforinstrumentalvariablesregres sionmodel. 5 1.4Varyingcoecientmodelcapturinggene-environment interaction Geneexpressionsarethedirectproductsofgenefunctions.How ever,mostgenefunctions areinuencedbyenvironmentalfactors,resultingintheso-called gene-environment(G ) interactionswhicharebroadlystudiedinliterature.Modelingtheee ctofanenvironmental factorasafunctionbringsexibilitytocaptureeitherlinearornon- linearmodulationeect. AtypicalG Einteractionmodelassumesalinearinteractionstructurebetwee nagenetic variable( G )andenvironmentalvariable( X ),i.e., Y = 0 + 1 X + 1 G + 2 XG + = 0 + 1 X +( 1 + 2 X ) G + : AsdemonstratedinMaetal.(2011),suchasimpleregressionmodel couldeasilymissany interactionsignalsthatdonotshowalinearstructure.Theauthor sproposedthefollowing varying-coecient(VC)modelwhichcancapturethenonlinearmod ulationeect, Y = ( X )+ ( X ) G + : where ( X )and ( X )representtheinterceptandinteractionfunctionwhichareassu med tobesmooth.NonparametricestimationtechniquessuchasB-splin ecanbeappliedto estimatethefunctions.Suchanonparametrictreatmenthasmuc hexibilitytocapturethe realbiologicalfunctionofenvironmentalforces.AsfortheVCmo delling,vastliterature canbefoundonestimationorvariableselection(e.g.,HastieandTibsh irani,1993;Wang 6 andXia,2009;Fengetal.,2011;Tangetal.,2012).Littleworkhasbe endoneongenetical genomicsanalysiswhileincorporatingtheeectofenvironmentalfa ctors. 1.5Objectivesandorganization Inthisdissertation,weproposeagraphestimationmethodtolearn directedgraphicalstruc- turesandtwographapplicationmethodsusingnetworkstructure saspriorknowledgeto improvetheperformanceofvariableselection.Bothparametrican dnonparametricmodels andmethodsareconsidered. PCalgorithmhasbeenoneofthemostpopularmethodsinestimatingd irectedgraphs. However,covariateeectsarenotconsidered.Intheestimation ofundirectedgraphs,two connectedgenesmaysharecommonregulators(e.g.,SNPs).Thet wogenesmaybeinde- pendentiftheeectofcommonregulator(s)isadjusted.Thesam ephenomenonappliesto theestimationofdirectedgraphs.MotivatedbytheworkofCaiet al.(2013),weproposea covariate-adjustedPCalgorithminchapter2whichisusedtolearnt heregulatorynetworks ofgeneswhileadjustingfortheeectsofgeneticvariants.Weest ablishtheestimationand selectionconsistencyoftheproposedtwo-stepestimationalgorit hm. Inageneticalgenomicsanalysistoevaluatethecorrelationbetwee nphenotypictraitsof interestandgeneexpressions,oneusuallyassumesthattheerro rtermisindependentofthe predictors.However,thisisnotalwaystrue.Endogeneityinwhichs omepredictorscorrelate withtheerrormayarise.Instrumentalvariables(IV)regression isacommonwaytosolvethis problem.Linetal.(2015)proposedanIVregressionframeworkinw hichgeneticvariants aretreatedastheinstrumentalvariablestoeasethecorrelation problem.Theauthors showedthesuperiorperformanceofthevariableselectionmethod viaincorporatingtheIV 7 regression.Asweillustratedpreviously,incorporatinggenenetwo rksinvariableselection couldpotentiallyimprovetheperformanceofgeneselection.Thus, weproposeanetwork constrainedregularizationmethodinaIVregressionframeworkan dapplythemethodto anintegrativeanalysisofgeneticalgenomicsdatainchapter3.Wee stablishtheestimation andselectionconsistencyoftheproposedmethod. Asweillustratedpreviously,environmentalfactorscouldpotentia llymodulategenetic eectsongeneexpressionandfurtheraectthephenotypicvar iation.Forexample,the amountofexposuretoheavymetalcouldmodifygeneticeectong eneexpressionand furtherleadtoincreasedriskofneurovegetativediseases.Vary ing-coecientmodelisa naturalchoicetodescribeanypotentialnonlinearmodulationeec t.Builtuponthework inchapter3,weextendtheIVregressionmodeltoincorporateth enonlinearmoderation eectofanenvironmentalfactorinchapter4.Thevaryingcoec ientfunctionsaremodelled throughnonparametrictechniquessuchasB-splineandnetworkc onstrainedregularization isalsoconsideredtoimprovetheperformanceofgeneselection.We alsoestablishthe estimationandselectionconsistencyoftheproposedmethod. Weconductextensivesimulationstudiestoevaluatetheperforman ceofthethreemethods andcomparewithexistingones.Theutilityofthemethodsisdemonst ratedthroughreal dataanalysis.Conclusionandfutureworkarefollowedinchapter5. Technicalproofsare giveninAppendices.Thedissertationcontributestothedevelopme ntofgraphestimation andvariableselectioninstatisticaltheoryandmethod,andtothea nalysisofgenetical genomicsdatainapplication. 8 Chapter2 Learningdirectedacyclicgraphical structureswithgeneticalgenomics data 2.1Introduction Thepastdecadeshavewitnessedenormousmethodologydevelopm entsingenenetworkin- ferenceusinggeneexpressiondata(e.g.,CheungandSpielman,200 2;Schadtetal.,2003). Suchnetworksweredevelopedmostlybyassessingthepairwisecor relationofgeneexpres- sions.Fromastatisticalpointofview,undertheassumptionthatt hejointdistribution ofthegeneexpressionsofinterestisamultivariatenormaldistribu tion,suchnetworkscan beconstructedbyassessingthenonzeroelementsoftheinverse covariancematrix,theso calledprecisionmatrixorconcentrationmatrix.Assume X =( X 1 ;:::;X p ) ˘ N (0 ;)and = 1 .X i ?? X j jf X t g t 6= i;j () ! i;j =0,where ! i;j 'sarethe( i;j )thentriesin. Thatis, X i and X j areconditionallyindependentifandonlyif ! i;j =0.Thusonecan inferthepositionsofzeroentriesintoinferthegraphicalstruct ure,i.e.,theconditional independenciesbetweencomponentsin X .However,whenthedatadimensionislarger thanthesamplesize,theinverseofthesamplecovariancematrices arenotavailable.When 9 consideringsuchnetworksasundirectedgraphs,therehasbeen hugestatisticalinterestsin developingGaussiangraphicalmodelsassumingsparsityoftheprec isionmatrix,tonamea few,Friedmanetal.(2008),Caietal.(2013),Meinshausenetal.( 2006),andYuanand Lin(2007).Undercertainassumptions,thepositionsofzeroentr iesintheprecisionmatrix indicateconditionalindependencebetweenvariablesofstudy. Wheninferringnetworkonlybasedongeneexpressions,twogenes couldshowcorre- lationinexpressionsimplybecausetheyshareacommonregulator,w hiletheyshouldbe independentconditioningonthecommonregulator.Motivatedbyth is,afewmethodology developmentshavebeenfocusedoncovariates-adjustedgraph icalmodelsbyusinggenetic markersascovariatestocorrectbothfalsepositivesandfalsene gatives(e.g.,LeeandLiu, 2012;YinandLi,2011,2013;Caietal.,2013).Intheirestimationpr ocedures,theeectof geneticvariantsisestimatedintherststep.Then,thegraphical structureisestimatedin thesecondstepwhileadjustingforthegeneticeects.Theirsimula tionresultsandrealdata analysisshowedthatgraphestimatesweresubstantiallyimprovedw hentakingthegenetic eectsintoaccount.Thesemodelsareallfocusedontheestimatio nofundirectedgraphs withoutconsideringdirectioninformation.Thedirectedgraphs,ho wever,aremoreattrac- tiveinthesensethatoftenpeopleareinterestedinthecausalrela tionships,thatis,interests arenotonlyfocusedonwhethertwovariablesareconditionallyindep endentbutalsoifone causestheotherone.Inreallife,geneexpressionsareoftenth eresultsofdirectedregula- tions.Thus,theestimationofdirectedgraphsshouldprovidemore biologicalinformation forfurtherfunctionalinvestigation.Geneticalgenomicsdatapr ovidesoptimalresourcesto learnsuchdirectedgraphicalnetworks. Intypicaldirectedacyclicgraph(DAG)inference,oneprobabilityd istributionhasa setofcorrespondingDAGswhichareMarkovequivalentundercert ainassumptions.The 10 directionsoftheedgesinthegraphsindicatecausalrelations.Stu diesondirectedgraphs havebeenourishedinliteraturebothintheoryandinapplication(e.g .,Andersonetal., 1997;RichardsonandSpirtes,2002;Alietal.,2009).Themostcom moncaseistheGaussian case.Let Y =( Y 1 ;:::;Y p )bea p -dimensionalGaussianrandomvector,andeachcoordinate representsageneexpression.Toinfertheregulatoryrelationsh ipofthese p geneexpressions, thePC-algorithmdevelopedbySpirtesetal.(2000)canbeappliedto constructacompleted partiallyDAG(CPDAG).TheCPDAGcanuniquelyrepresentallDAGsco rrespondingto thecommondistributionof Y ,andcanberepresentedastheregulatorynetworksofthe geneexpressions.KalischandBuhlmann(2007)laterprovedthat thegraphobtainedfrom thePC-algorithmisconsistentundersomeconditions,eveninhighdim ensionalcases. DirectedgraphsconstructedwiththePC-algorithmaremorebiolog icallyrelevantcom- paredtoundirectedgraphs.However,asmentionedbefore,iftw ogenesshareacommon regulator,theirexpressionrelationshipcouldbecomplicatedoreve ndistortedwithoutcon- sideringtheunderlyinggeneticstructure.Whenadjustingforthe genotype(i.e.,thecommon regulator)eects,thetworelatedgenescouldbeindependentor viceversa.Thismotivates ustodevelopacovariate-adjusteddirectedgraphicalmodelbye xtendingtheCPDAGes- timationproceduretoaregressionframework.Wetreatgeneticm arkersascovariatesand developatwo-stageestimationmethodwhichcombinesthepenalized estimationandPC- algorithmtoestimatethemarker-adjustedCPDAGstructure. Inthefollowingofthepaper,werstprovidesomebackgroundkno wledgeaboutgraphical modelsandtheformulationofourmodelandthetwo-stageestimat ionprocedureinSection 2.2.TheoreticalresultsaregiveninSection2.3.Simulationsandreal dataanalysisaregiven inSection2.4and2.5tojustifyourmethodfollowedbydiscussionsinSe ction2.6.Allthe proofsofthetheoreticalresultsarerenderedintheAppendixA. 11 2.2ModelandEstimation 2.2.1Background Let G =( V;E )beagraphwhere V representsthesetofverticesand E representstheset ofedges.Adirectedgraphisagraphinwhicheachedgehasadirectio nrepresentedby a\ ! "sign.Adirectedacyclicgraph(DAG)isadirectedgraphwithnodirec tedcycles. Acriteriacalled d -separationisusedtoreadconditionalindependencerelationships froma DAGstructure(Pearl,2000).Othercriteriacanbeusedtogener ateconditionalindepen- dencerelationships.Thereasonwhy d -separationisparticularlyusefulisthatitisrelatedto causalinference(Pearl,2000;Spirtesetal.,2000)andforthepu rposetoconstructdirected graphs. Givenarandomvector,itscoordinatescanbeviewedasthevertice sofagraph G = ( V;E ).AprobabilitydistributioniscalledfaithfultoaDAGifandonlyiftheco nditional independencerelationshipsindicatedbythedistributionareasthes ameasthoseobtained via d -separation.MultipleDAGsmayrepresentthesamesetofcondition alindependence relationships.TheDAGstowhichaprobabilitydistributionisfaithfula reMarkovequivalent andcomposeaMarkovequivalenceclass.TwoDAGsareMarkovequiv alentifandonlyif theyhavethesameskeletonandthesamev-structures(Fryden berg,1990;VermaandPearl, 1990,1992).TheskeletonofaDAGisthegraphwherealldirectede dgesarereplacedwith undirectededges.Av-structureisatriple(v 1 ,v 2 ,v 3 )havingthestructurev 1 ! v 2 v 3 .AMarkovequivalenceclassofDAGscanbedescribedbyauniqueCPDA G(Chickering, 2002).ACPDAGistheunionoftheDAGsinaMarkovequivalenceclassin thesensethata directededgeexistsifthatedgeexistsineachDAGandanundirecte dedgeexistsifthatedge hasderentdirectionsintwoDAGs.PC-algorithm(Spirtesetal.,200 0)isaneectiveand 12 ecientalgorithmtoestimateCPDAGs.Thereasonwhyweestimatet heCPDAGisthat multipleDAGscorrespondtoaprobabilitydistributionwhileonlyoneCPD AGcorresponds toaprobabilitydistribution.AfteraCPDAGisobtained,wecanexten dittotheDAGs byaddingdirectionsontheedgessuchthatnodirectedcyclesareg enerated.Ithasbeen proventhatPC-algorithmcangenerateaconsistentestimateund ercertainassumptions, eveninhighdimensionalcases(KalischandBuhlmann,2007). 2.2.2Covariate-adjustedgaussiangraphicalmodel Consideramulti-responselinearregressionmodelwith p responsesand q covariates.Suppose wehave n independentobservations( y i ;x i ), i =1 ;:::;n ,where y i =( y i 1 ;:::;y ip ) T and x i =( x i 1 ;:::;x iq ) T .Weassumethatthedataarecenteredandstandardized.Dene Y = ( y 1 ;:::;y n ) T and X =( x 1 ;:::;x n ) T .Therelationshipbetweenthemultivariateresponse Y andthepredictors X canbedescribedbythefollowingregressionmodel, Y = XB + E where B =( B 1 ;:::;B p )isa q p coecientmatrixand E =( 1 ;:::; n ) T istheerrorterm.We assumethattheerrors i =( i 1 ;:::; ip ) T ,i =1 ;:::;n ,are i.i.d. randomvariablesfollowing amultivariatenormaldistribution N (0 ;).Let y j =( y 1 j ; ;y nj ) T bethe j -thresponse vector( j =1 ;::;p )and x l =( x 1 l ; ;x nl ) T bethe l -thpredictorvector( l =1 ;::;q ).In ageneticalgenomicsframework, X representtheSNPmarkersand Y representthegene expressions.OurinterestistoestimatetheCPDAGstructureof Y ,whileadjustingfor theeectof X .Conditionalon X ,weexpectthatfalsepositivesand/orfalsenegativesof connectionscouldbereduced. 13 2.2.3Atwo-stageestimationprocedure Without X ,thePC-algorithmcanbeappliedtoinfertheCPDAGstructures.In themodel describedabove,theconditionalmeanof Y changesas X varies.ThusthePC-algorithm cannotbedirectlyappliedsinceitisonlyforamodelwithaconstantme an,i.e., X = ( X 1 ;:::;X p )followsamultivariatenormaldistributionwithmean andcovariancematrix .Toovercomethedculty,weproposeatwo-stageestimationpr oceduretoestimatethe meanfunctionandtheCPDAGstructure.Intherststage,wees timatethecoecient matrix B .Weimplementapenalizedestimationprocedurewithrespecttoeach responsein Y viasolvingthefollowing p optimizationproblems, ^ B j =argmin B j 1 2 n tr f ( y j XB j ) T ( y j XB j ) g + j;n k B j k 1 ;where j;n ;j =1 ;:::;p; aretuningparameters,and kk 1 referstothe L 1 norm.Theestimate of B isdenotedas ^ B =( ^ B 1 ;:::; ^ B p ).Let ^ E = Y X ^ B .Inthesecondstage,thePC-algorithm isappliedto ^ E toestimatethedirectedgraphicalstructurecorrespondingtot heprobability model N (0 ;).Thetwo-stageestimationalgorithmistermedasthecovariate- adjusted PC-algorithm(caPC). Ingeneral,thecaPCalgorithmcontainstwosteps.Therststepis toestimatetheskele- tonandthesecondstepistoextendtheskeletontotheCPDAG.In formationobtainedfrom therststepisusedtodeterminethedirectionsoftheedgesinthe secondstep.Ifwehave theperfectknowledgeabouttheconditionalindependencerelatio nships,thePC-algorithm canreturnuswiththecorrectskeletonintherststep(Spirtese tal.,2000).However,we donothaveperfectknowledgeinpractice.Therefore,weusesam plestoestimatetheskele- tonbyimplementingatestingprocedure.Fortestingtheconditiona lindependence,Kalisch 14 andBuhlmann(2007)appliedestimatedpartialcorrelationsandFis hertransformationto constructatestingstatistic.TheyprovedthatthePC-algorithm couldgenerateconsistent estimatesundercertainassumptionseveninthehighdimensionalsit uations.Tomakethe paperself-contained,wealsoincludedthePCalgorithminthesupplem entalle. 2.3Theoreticalresult Inthisworkweallowthenumberofresponsestoincreasealongwitht hesamplesize,i.e., p isafunctionof n .Thenumberofmarkercovariates q isxed.Thisisavalidtreatment aswecandoaneQTLmappingstudytorstndtheeQTLsfortheex pressionresponses, thenxtheminthemodel.Belowweshowthatthetwo-stageestimat ionmethodgenerates consistentestimateofaCPDAG.Toshowtheestimationconsistenc y,weadoptthesetupof KnightandFu(2000)andKalischandBuhlmann(2007).Thedimensio nofamultivariate responsevariableisdenotedas p ( n )andaDAGisdenotedas G = G n .Conditionsand remarkscanbefoundintheAppendixA. Let ˆ n ;i;j bethecorrelationbetween i and j .Let ˆ n ;i;j jk bethepartialcorrelation between i and j given f r ;r 2 k g ,where k isasubsetof f 1 ;:::;p ( n ) gnf i;j g .^ ˆ n ;i;j and ^ ˆ n ;i;j jk arethecorrespondingestimatedones. Toestablishtheestimationconsistency,weneedthefollowinglemmas .Lemma1. Assumethatthedistribution P n isnormallydistributed,wehave ^ ˆ n ;i;j ( ^ B ) ^ ˆ n ;i;j ( B )= o p (1) ;8 i;j =1 ;:::;p: 15 Lemma2. Assumethatthedistribution P n isnormallydistributed,wehave ^ ˆ n ;i;j jk ( ^ B ) ^ ˆ n ;i;j jk ( B )= o p (1) ;8 i;j =1 ;:::;p: Lemma3. Assumethatthedistribution P n isnormallydistributedand sup n;i;j; k j ˆ i;j jk j M< 1 .Then,forany > 0 ,wehave sup i;j; k 2 K m n i;j P ( j ^ ˆ n ;i;j jk ( ^ B ) ˆ n ;i;j jk j > ) C 1 ( n 2 m n ) exp(( n 4 m n )log( 4 2 4+ 2 )) forsomepositiveconstant C 1 . Lemma1showsthatthederencebetweentheestimatedcorrelat ioncoecient^ ˆ n ;i;j giventheestimatedcoecientmatrix ^ B andtheestimatedcorrelationcoecientgiventhe truecoecientmatrix B is o p (1).Lemma2statesthatthederencebetweentheestimated partialcorrelationcoecient^ ˆ n ;i;j jk giventheestimatedcoecientmatrix ^ B andtheesti- matedpartialcorrelationcoecientgiventhetruecoecientmatr ix B is o p (1).Lemma3 showsthatthederencebetweentheestimatedpartialcorrelat ioncoecient^ ˆ n ;i;j jk given theestimatedcoecientmatrix ^ B andthetruepartialcorrelationcoecient ˆ n ;i;j jk is bounded.TheproofsofthethreelemmasarerelegatedintheSupp lementalle. Dene n asthesigncancelevelfortestingthesigncanceofthepartialc orrelation afterFisher'stransformation,i.e.,fortesting H 0 :ˆ i;j jk =0.Withtheaboveassumptions, wehavethefollowingtheorem. Theorem1. Dene ^ G ( n ) astheestimatedCPDAGusingthetwo-stageestimationproce - 16 dureand G isthetrueCPDAG.If j;n = o ( n ) ;j =1 ;:::;p ,andConditions(C1)-(C6)are satised,thenthereexists n suchthat P ( ^ G ( n )= G )=1 O (exp( cn 1 2 d )) ! 1 ;as n !1 where c isapositiveconstantand d> 0 isasinCondition(C6). TheconvergentrateoftheCPDAGestimationisthesameasthatinT heorem2in KalischandBuhlmann(2007).Thismeanstheerrorsoccurredinth erststagedonothave agreateectonthesecondstageoftheestimationofgraphicals tructures.Theproofofthe TheoremcanbefoundintheAppendixA. 2.4Simulation 2.4.1Simulationdesign Wedidextensivesimulationstoassesstheestimationperformance. Wefollowedthesimu- lationsetupinYinandLi(2013)andKalischandBuhlmann(2007).To generatetheerror term,webeganwithanadjacencymatrix A p p lledwithzeros.Then,everyentryinthe lowertrianglewasreplacedwithindependentrealizationsofaBernou llirandomvariable withsuccessprobability p A (0
p
.SimilarpatternthatcaPC
22
set1set2set3set4set5set6
0.50.60.70.8
TPRset1set2set3set4set5set6
0.0040.0080.012
FPRFigure2.4:BoxplotofTPRandFPRunderderentsimulationsettings
withvarying
q
and
.outperformedPCwasobserved.
2.4.5Comparingdirectedandundirectedgraphs
Wedidsomesimulationsfollowingtocomparetheperformanceofthed
irectedandundirected
graphestimation.Herewechoose
p
=50
;100
;200,
q
=2
p
and
n
=250.Asbefore,wechose
E
[N
]=2,
=3.5.TheresultsaresummarizedinTable5.WeappliedgraphicalLASS
Oto
estimatetheundirectedgraphicalstructure.Wecanseethatun
directedmethodachievesa
higherTPR,butthedirectedmethodachievesalowerFPRandsmaller
SHD.HereSHD
isusedtomeasuretheskeletonofthegraphsinceundirectedgrap
hsdonothavedirection
information.ThereasonwhyundirectedmethodisbetteronTPRisb
ecause,intheory,
directedgraphissparsethanundirectedgraph.Sooverall,thedir
ectedgraphperforms
betterthantheundirectedmethodonbothFPRandSHD,butsue
rsalittlebitonTPR.
23
Table2.3:Estimationcomparisonwithandwithoutadjustingforthem
arkereectsunder
derentscenarios.
MethodTPRFPRSHD
p
=25,
q
=10,
n
=250,
E
[N
]=2,
=3.5caPC0.76810.001111.5
PC0.66980.038025.9
p
=50,
q
=50,
n
=250,
E
[N
]=2,
=4caPC0.74650.002126.7
PC0.56800.019755.3
p
=100,
q
=100,
n
=250,
E
[N
]=2,
=3caPC0.74030.001752.4
PC0.67970.005981.7
p
=400,
q
=200,
n
=250,
E
[N
]=2.5,
=20caPC0.76780.0054648.6
PC0.23460.01381504.9
p
=800,
q
=200,
n
=250,
E
[N
]=1.5,
=25caPC0.78190.00521948.1
PC0.13780.00823175.9
p
=1000,
q
=200,
n
=250,
E
[N
]=1.5,
=20caPC0.79020.00432543.4
PC0.27370.00653874.4
Table2.4:EstimationcomparisonbetweenPCandcaPCwhen
q>p
MethodTPRFPRSHD
p
=100,
q
=200,
n
=200,
E
[N
]=2,
=3caPC0.81760.008684.0
PC0.72640.0161129.8
p
=200,
q
=400,
n
=200,
E
[N
]=2,
=20caPC0.66120.0107324.9
PC0.26450.0177516.4
p
=400,
q
=800,
n
=200,
E
[N
]=2,
=20caPC0.57040.0076866.7
PC0.29840.01001138.3
2.5ApplicationtoRealData
Weappliedourproposedtwo-stageestimationmethodtoanAlzheime
rdiseasedataset
(Websteretal.2009).Followingthepaper,176Alzheimercasesand
187controlswere
includedinouranalysis.FortheSNPs,thosewithlowgenotypingcallr
ate(lessthan90%),
lowminorallelefrequency(lessthan5%)andthosefailedtheHardy-W
einbergequilibrium
testincontrolgroup(p-value
<
0.001)wereremoved.Aftertheseoperations,around332,000
SNPmarkerswereleftforfurtheranalysis.Weusedtheresiduale
xpressiondataafter
adjustingfortheeectsofseveralcovariatessuchasgender,
APOEstatusandageatdeath
24
Table2.5:Estimationcomparisonbetweendirectedandundirectedg
raphs
MethodTPRFPRSHD
p
=50,
q
=100,
n
=250,
E
[N
]=2,
=3.5directed0.79040.001312.5
undirected0.87360.036949.9
p
=100,
q
=200,
n
=250,
E
[N
]=2,
=3.5directed0.76400.001431.9
undirected0.83290.014990.2
p
=200,
q
=400,
n
=250,
E
[N
]=2.5,
=3.5directed0.78160.001475.1
undirected0.81180.0049137.5
(datacanbefoundathttp://labs.med.miami.edu/myers/LFuN/data
ajhg.html).Thereare
8,560geneexpressions.Weimplementedatwosamplet-testforeac
hgeneexpressionand
selectedtop100signcantgeneexpressionstolearntheirDAGstr
ucture.Foreachofthe
100geneexpressions,wettedasimplelinearregressionwitheachS
NPmarkerasthe
predictorandselectedSNPswithp-value
<
0
:001.AmongtheseSNPs,wechosethosethat
showassociationwithtwoormoregeneexpressions(p-value
<
0
:001).Thisendedupwith
3,776SNPsinthecasegroupand3,871SNPsinthecontrolgroupwith
60SNPsincommon.
Thenaldatasetusedfortheanalysiscontained100geneexpres
sions(asthe
Y
variable),
3,776SNPsinthecasegroupand3,871SNPsinthecontrolgroup(as
the
X
variables).We
appliedourestimatingmethodtothetwogroupsseparately.Ourgo
alwastolearntheDAG
structureforthe100geneswhileadjustingfortheeectsofSNP
markersinthecaseand
controlgroupswiththehopetoidentifyderentialDAGstructur
esthatcandistinguishthe
twogroups.Anyderencesmightpotentiallyexplainthediseaseetio
logyofAlzheimer.
WelearnedtheCPDAGsatthe
=0
:001level.Forthecasegroup,theestimated
CPDAGhas101directededges(Figure2.5).Forthecontrolgroup
,theestimatedCPDAG
has88directededges(Figure2.5).Thegraphsweregeneratedwit
htheRpackage
igraph
.Inbothgraphs,geneswererepresentedwithnumbers.Thecorr
espondinggenenamescan
befoundinthesupplementaltable.ThetwoCPDAGsshare18edges
incommon.We
25
noticedthatsomesub-CPDAGstructuresshowninthecasegroup
werenotpresentinthe
controlgroup.Forexample,thegraph70
!
95
38wereobservedinthecasegroup,while
thethreegenesareseparatedfromothergroupsinthecontrol
population.Ontheother
hand,someCPDAGsshowninthecontrolgroupwerenotpresentin
thecasegroup.For
example,thesub-graphby17-26-31-32-76isobservedinthecon
trolgroupwhilethegenes
arelessconnectedinthecasegroup.Suchderentialgraphicals
tructuresbetweenthetwo
groupsmightindicateregulationheterogeneitybetweenthetwogr
oups.Furtherbiological
vercationsareneedtovalidatethendings.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100Figure2.5:TheestimatedCPDAGinthecase(left)andcontrol(righ
t)groupbasedonthe
top100genes.Thoseoverlappednodesindicatedualregulationdir
ections.
Asacomparison,weappliedPC-algorithmdirectlytothe100geneswit
houtadjusting
fortheSNPs'eects.Forthecasegroup,theestimatedgraphh
as124edges.Forthecontrol
group,theestimatedgraphhas100edges.Thisisconsistentwitho
urassumptionthatthe
numberofedgesshouldbereducedafteradjustingforthemarke
rs'eect.Thisphenomenon
26
hasbeenexplainedanddemonstratedinotherworksfocusingonun
directedgraphs(e.g.,
YinandLi,2011;Caietal.2013).
ADAM17
APPNAE1
APBB1GAPDHBACE1
BACE2
PSENENAPH1AAPH1BNDUFV1NDUFV2NDUFV3NDUFA1
NDUFA2
NDUFA3
NDUFA4
NDUFA5
NDUFA6
NDUFA7
NDUFA8
NDUFA9
NDUFA10
NDUFAB1
NDUFA12
NDUFA13
NDUFB1NDUFB2NDUFB4NDUFB5NDUFB6NDUFB7NDUFB8NDUFB9NDUFB10NDUFS2NDUFS3NDUFS4NDUFS5NDUFS6NDUFS7NDUFS8NDUFC1NDUFC2SDHASDHBSDHCSDHDUQCRFS1CYC1UQCRC1UQCRC2UQCRHUQCRBUQCRQUQCR10UQCR11COX1
COX2
COX4I1
COX5A
COX5B
COX6A1
COX6B1
COX6C
COX7A1
COX7A2
COX7A2L
COX7B
COX7C
COX8A
ATP5A1
ATP5B
ATP5C1
ATP5D
ATP5E
ATP6
ATP5F1
ATP5G1
ATP5G3
ATP5H
ATP5O
ATP5J
HSD17B10LPLAPOELRP1TNFRSF1ABIDCALM2CALM3CALM1PPP3CAPPP3CBPPP3CCPPP3R1BAD
CYCSAPAF1
CASP9CASP3PLCB1PLCB4GRIN1GRIN2CCACNA1C
MAPK1MAPK3RYR3
ITPR1ITPR3ATP2A2
ATF6
CAPN1CAPN2CDK5R1CDK5MAPTGSK3BSNCAADAM17
APPNAE1
APBB1GAPDHBACE1
BACE2
PSENENAPH1AAPH1BNDUFV1NDUFV2NDUFV3NDUFA1
NDUFA2
NDUFA3
NDUFA4
NDUFA5
NDUFA6
NDUFA7
NDUFA8
NDUFA9
NDUFA10
NDUFAB1
NDUFA12
NDUFA13
NDUFB1NDUFB2NDUFB4NDUFB5NDUFB6NDUFB7NDUFB8NDUFB9NDUFB10NDUFS2NDUFS3NDUFS4NDUFS5NDUFS6NDUFS7NDUFS8NDUFC1NDUFC2SDHASDHBSDHCSDHDUQCRFS1CYC1UQCRC1UQCRC2UQCRHUQCRBUQCRQUQCR10UQCR11COX1
COX2
COX4I1
COX5A
COX5B
COX6A1
COX6B1
COX6C
COX7A1
COX7A2
COX7A2L
COX7B
COX7C
COX8A
ATP5A1
ATP5B
ATP5C1
ATP5D
ATP5E
ATP6
ATP5F1
ATP5G1
ATP5G3
ATP5H
ATP5O
ATP5J
HSD17B10LPLAPOELRP1TNFRSF1ABIDCALM2CALM3CALM1PPP3CAPPP3CBPPP3CCPPP3R1BAD
CYCSAPAF1
CASP9CASP3PLCB1PLCB4GRIN1GRIN2CCACNA1C
MAPK1MAPK3RYR3
ITPR1ITPR3ATP2A2
ATF6
CAPN1CAPN2CDK5R1CDK5MAPTGSK3BSNCAFigure2.6:TheestimatedCPDAGinthecase(left)andcontrol(righ
t)groupbasedongenes
mappedtotheAlzheimer'sdiseasepathwayfromtheKEGGdatabase
.Thoseoverlapped
nodesindicatedualregulationdirections.
WedidanothercasestudybyfocusingontheAlzheimer'sdiseasepat
hwayfromthe
KEGGdatabase.Therearetotal168genesinthispathway,buton
ly120geneexpressions
inthisdatasetweremappedtothepathway.Foreachofthe120ge
neexpressions,we
performedasinglemarkerlinearregressionanalysisusingeachofth
eSNPs.Weselected
thosemarkerswithp-value
<
0
:001.Thenthe5000SNPs(with93incommon)withthe
smallestp-valueswereselectedincasegroupandcontrolgroupre
spectively.
Thedataweusedforthenalanalysiscontain120geneexpression
s(the
Y
variables),
5000SNPmarkersinthecasegroupand5000SNPmarkersinthecon
trolgroup(the
X
variables).OurgoalistolearntheDAGstructuresbasedonthe12
0geneexpressionswhile
adjustingfortheeectsof5000SNPmarkersinbothgroups.Wea
ppliedourestimating
27
methodtothecaseandcontrolgroupseparately.Inthecasegr
oup,theestimatedCPDAG
has108directededges(Figure2.6).Inthecontrolgroup,thees
timatedCPDAGhas109
directededges(Figure2.6).Thereare25commonedgesbetweent
hetwographs.We
canseederentgraphicalstructuresbetweenthetwogroups.
Someoftheestimatedsub-
DAGsagreeswiththerealbiologicalnetwork.Forexample,somege
nesinmitochondria
wereidentedasconnectedinthecasegraph(e.g.,COX7A2,NDUFB
2,SDHC;NDUFV3,
COX7B,COX4I1,COX1,NDUFB4,NDUFA10).
2.6Discussion
Genesfunctioninnetworksandgeneexpressionsareoftenresult
edfromgeneregulations.
Hence,learningdirectedregulationscanenhanceourknowledgeab
outgenefunctionanddis-
easeetiology.Inthiswork,weintroducedacovariate-adjustedm
odeltostudythegraphical
structuresofmultiplegeneexpressionswithdirectionregulationinf
ormation.Weproposed
atwo-stageproceduretoobtaintheestimatedCPDAGcorrespon
dingtotheprobabilitydis-
tribution.Themarkeradjustedexpressionnetworkcanreducet
hefalseconnectionsoftwo
genesiftheyshareacommonregulator.Simulateddatawereusedt
otestourmethodand
showedsupportiveresultsbothinlowandhighdimensionalcases.We
providedaconsistency
resultforthecasewhen
p
isincreasingalongwithsamplesize
n
.Weappliedourmethod
toarealcase-controldatasettounderstandthegeneregulat
ionmechanisminAlzheimer
disease.Derentgraphswerelearnedcorrespondingtothecase
andcontrolgroup.For
theAlzheimerpathwayanalysis,wewereabletorecoversomeofthe
structuresforgenes
involvedinMitochondriafunction.
Intherealdataanalysis,wecouldnotrecovertheexactcausalr
elationshipsbetween
28
variablesthatareshownintheKEGGdatabase.Thisisduetothecom
plexityofthe
biologicalstructureandlimiteddatasamples.Moreover,thedatam
ayviolatethenormality
assumption.Otherrobustmethodssuchasthenonparanormalm
odel(Liuetal.2009;Liu
etal.2012)whichimplementanonparametrictransformationofthe
datamaybeapplied
toourframework.Inaddition,weexpectmorereliablestructures
canbeobtainedifwe
incorporatepriorknowninformationintoouranalysisbypre-settin
gsomeedges.Weexpect
moreaccurateandstableestimationcanbeachievedviaincorporat
ingbothstatisticaland
biologicalknowledgeintothemodeldevelopmentprocess.Thesewill
beincorporatedinto
ourfutureinvestigation.
29
Chapter3
IntegrativeAnalysisofGenetical
GenomicsDataIncorporating
NetworkStructure
3.1Introduction
Ingeneticalgenomicsanalysis,achallengingproblemishowtocombine
phenotype,genotype
andgeneexpressiondatatogethertoobtainnewknowledgeabout
thegeneticbasisofgene
expressionsandtofurtherprovidenovelinsightsintogenefunct
ions.Muchworkonmodel
developmentshasbeendonetocombinederentsourcesofdata.
Forexample,QTLmapping
studiescombinegenotypeandphenotypedatatoinferthegenetic
architectureofacomplex
trait.eQTLmappingcombinesgeneexpressionsandgenotypeswhe
regeneexpressionsare
viewedasquantitativetraitstounderstandthegeneticbasisofge
neexpression.Gaussian
graphicalmodels(MeinshausenandBuhlmann,2006;Friedmaneta
l.,2007)havebeen
appliedtogeneexpressiondatatoinfergeneregulatorynetworks
.Recently,covariate-
adjustedGaussiangraphicalmodels(YinandLi,2013;Caietal.,201
3)havebeendeveloped
tocombinegenotypesandgeneexpressionstoimprovegenenetwo
rklearning.Morerecently,
integrativemodelingandtestingmethods(e.g.,Huangetal.,2014;Zh
aoetal.,2014)combine
30
phenotypes,genotypesandgeneexpressionstoimprovethepow
erofstatisticaltesting.As
moreandmoregenomicdataarebecomingavailable,integrativeanaly
sisofmulti-source
genomicdatawouldprovideamorecomprehensivepictureofabiologic
alsystem,hence
oeringpromisingopportunitiesforpersonalizedmedicineandtreat
ment.
Whenusinggeneexpressionstopredictphenotypicresponses,line
arregressionhasbeen
thestandardmeanstomodeltherelationbetweenphenotypesan
dgeneexpressions.Dueto
thehighcostofobtainingexpressiondata,samplesizeistypicallysma
llerthanthenumberof
geneexpressions.Tosolvethisproblem,variousregularizationmet
hodshavebeendeveloped
toachievevariableselectionandestimation.Fromthebiologicalpers
pective,thephenotypic
responseofinterestandgeneexpressionsareofteninuencedb
ycommonexternalforces.
Theseexternalforcesareusuallyunobservableandcannotbemo
delledexplicitlyinthe
model.Hencetheireectsareusuallyputintheerrorterm.Sothee
xogeneitycondition
inwhichthepredictorsandtheerrorareuncorrelatedisviolated,a
ndconsequentlythe
estimatorsobtainedbyordinaryleastsquaresmaynotbeconsiste
nt.Toovercomethe
issuebroughtbyendogeneity,Linetal.(2015)appliedinstrumenta
lvariables(IV)models
andregularizationmethodstohandletheproblemcausedbythecor
relationbetweengene
expressionsandtheerrortermsinahighdimensionalsetupwithbot
hgeneexpressionsand
genotypes.Geneticvariantsareusedasinstrumentalvariabless
incetheyareindependent
ofexternalenvironmentalexposuresandaectthephenotypic
traitofinterestonlythrough
aectinggeneexpressions.Givengeneticvariantsusedasinstrum
entalvariables,theproblem
broughtbythedependencebetweenthegeneexpressionsandth
eerrortermswhichincluding
environmentalexposuresorotherfactorscanbesolvedbysubs
titutingthepredictionsofgene
expressionsintheregressionmodel.Thedcultiesonhighdimensiona
lityofgenotypesand
geneexpressionsarehandledbyregularizationmethodsuchasthe
LASSOtyperegression.
31
Weallknowthatgenesfunctioninnetworkstofulltheirjointtask
.Thusgeneexpres-
sionstendtobecorrelatedwhentheybelongtothesamenetwork(
e.g.,pathway).When
geneexpressionsareconsideredinaregressionmodel,suchcorre
lationinformationshouldbe
consideredwhenselectingimportantgenes.Althoughvastamount
sofstatisticalliterature
havebeenfocusedonthevariableselectionprobleminahighdimension
alsetup,manyof
themfailforcorrelatedvariables.AspointedoutbyZouandHastie
(2005),LASSOtendsto
selectonlyonevariablefromagroupofvariableshighlycorrelatedwit
heachother.Several
methodswereintroducedtosolvetheproblem.ZouandHastie(200
5)proposedelasticnet
toachievegroupingeect,i.e.,\thecoecientsofagroupofhighlyc
orrelatedvariablestend
tobeequal(uptoachangeofsignifnegativelycorrelated)".InLe
mma2andTheorem2in
theirpaper,itsaidthatifastrictlyconvexpenaltyisapplied,thehigh
erthecorrelationof
twovariablesis,thelesstheupperboundofthedistanceoftheest
imatedcoecientsofthe
twovariablesis.Ifthecorrelationisalmostzero,thetwoestimated
coecientsarealmost
thesame(exceptaminussignifnegativelycorrelated).LiandLi(2
008)proposedanetwork
constrainedmethodwhichtakesadvantageofthecorrelationinfo
rmationnamelynetwork
information.TheyintroducedaLaplacianmatrixand
L
2
penaltytocopewiththecorre-
lationproblem.Theyapplied
L
2
normonthepairwisederencesofthecoecientsofthe
correlatedvariablestoachievegroupingeect,andobtainedsimilar
theoreticalresultsasin
ZouandHastie(2005).Theirsimulationresultsshowedthattheirme
thodworksbetterthan
elasticnetincaseswherepriorknowledgeongraphicalstructures
isconsidered.InLiand
Li(2010),theauthorsproposedamodedpenaltyfunctionwhich
takesaccountofthesign
derencestoencouragetheabsolutevaluesofthecoecientsof
theconnectedvariablesto
shrinktowardthesame.Huangetal.(2011)proposedasparseLa
placianshrinkagemethod
andproveditsoracleproperty.
32
MotivatedbyLiandLi(2008),inthischapterweproposeatwo-st
epproceduretoachieve
variableselectionandestimationundertheIVregressionframewor
kincorporatinggene
regulatorynetwork.Throughapplyinggraphicalstructuresasp
riorknowledge,wehopeto
addresstheproblemcausedbyhighcorrelationsbetweengenesina
pathwaytoachievebetter
variableselectionandestimationresults.Werstadoptamultivariat
emulti-responselinear
modeltoprojectgeneexpressionsintothegeneticspace.Under
theassumptionthateach
geneisonlycontrolledbyafewgeneticvariants,weusetheLASSOalg
orithmtoestimatethe
coecientmatrix.Sincegeneticvariablesareindependentoftheer
rorterms,theprojected
expressionvaluesarepotentiallyuncorrelatedwiththeerrorterm
s.Thentheprojected
valuesareusedinthesecondstageestimationwhilegenenetworkst
ructuresareconsidered.
Weproposeagraphconstrainedregularizationmethodtoachievev
ariableselectionand
estimationinthesecondstage.Assumingcertaingraphicalstruct
uresongeneexpressions
whichcanbeobtainedeitherbybiologicalexperimentsorbystatistic
alestimation,two
penalties(LASSOandgraphconstrainedpenalties)areapplied.The
graphconstrained
penaltyisusedtoencourageshrinkingcoecientsofapairofvariab
lesconnectedinthe
networktowardthesametoachievegroupeect.Inthestudyof
thetheoreticalproperties
oftheestimatorobtainedthroughourmethod,weassumethe\gr
aphicalirrepresentable
condition"whichisusedtoguaranteetheselectionconsistency.
Thischapterisorganizedasthefollowing.In3.2,weintroducetheins
trumentalvariables
modelandthetwo-stepestimationmethod.Theoreticalconsiste
ncyresultsaregivenin3.3.
Simulationsandrealdataanalysisaregivenin3.4and3.5,respectively
,followedbydiscussion
in3.6.AllthetechnicalproofsarerenderedintheAppendixB.
33
3.2StatisticalMethodsandEstimation
3.2.1TheModel
Let
Y
=(
Y
1
;:::;Y
n
)
T
,X
=(
X
1
;:::;X
n
)
T
,and
G
=(
G
1
;:::;G
n
)
T
bematricesof
n
inde-
pendentandidenticallydistributedphenotypes,geneexpressions
,andgenotypes,where
Y
i
isascaler,
X
i
=(
X
i
1
;:::;X
ip
)
T
,and
G
i
=(
G
i
1
;:::;G
iq
)
T
.Themodelsweusearethefollowing
Y
=
X
+
;(3.1)
X
=
G
+
;(3.2)
where
=(
1
;:::;
p
)isa
q
p
coecientmatrix,
=(
1
;:::;
p
)
T
isacoecientvector,
=(
1
;:::;
n
)
T
and
=(
1
;:::;
n
)
T
areerrorvectoranderrormatrixsatisfying(
T
i
;
i
)
T
are
i.i.d.
p
+1dimensionalrandomvariablesfollowingamultivariatenormaldistrib
ution
N
(0
;).Wewriteas=
0
B
@
11
12
21
˙
22
1
C
A
,where
11
isthecovariancematrixof
i
,˙
22
is
thevarianceof
i
,and
12
=
T
21
isthecovariancebetween
i
and
i
.Figure3.1:Illustrationoftheinstrumentalvariablesmodelingenet
icalgenomicsanalysis.
Therelationshipbetweengenotype,geneexpressionandphenoty
pecanbemodelledwith
linearregressionmodels.Figure3.1showstherelationshipbetweent
hethreevariablesas
34
wellasthelatentvariable
E
,where
Y
isthephenotypeofinterest,
X
isthegeneexpression,
G
isthegenotype,and
E
representsexternalstimulatingfactorswhichareunobservable
henceareputintotheerrorterm.Weassume
G
aect
Y
onlythrough
X
,and
E
aects
both
X
and
Y
.ToillustratetheideaofIVregressioninourcase,weconsiderasimple
regressionmodel
Y
=
X
+
(
X
)since
E
aectsboth
Y
and
X
.Sotheerrorterm
coulddependson
X
andwehave
dY
dX
=
+
d
(
X
)
dX
.Wecanseethatusingordinaryleastsquaresregressionwill
notgiveconsistentestimatefor
unless
d
(
X
)
dX
=0,i.e.,
(
X
)isindependentof
X
.For
thepurposeofhopingtoobtainconsistentestimatorof
,PhilipG.Wrightintroducedthe
instrumentalvariablesmodelin1928.InFigure3.1,geneticvariant
G
isusedasinstrumental
variableinourgeneticalgenomicsanalysis.Asexplainedbefore,
G
isagoodchoicesinceit
aect
Y
onlythrough
X
andisindependentof
E
.Moredetailedexplanationcanbefound
inLinetal.(2015).
3.2.2Estimationofgeneticeects
Undertheassumptionthatonlyafewgeneticvariantsinuencethe
geneexpressions,the
coecientmatrix
issparse.Weapplyregularizedleastsquaresmethodtoestimateea
ch
geneexpressionseparately.Theestimationproblemcanbeformula
tedasthefollowing
p
optimizationproblems
^
j
=argmin
j
1
2
n
n
X
i
=1
tr
f
(
X
j
G
j
)
T
(
X
j
G
j
)
g
+
j
k
j
k
1
;j
=1
;:::;p:
(3.3)
where
j
;j
=1
;::;p
aretuningparameters,
kk
1
referstothe
L
1
normand
X
j
=(
X
1
j
;:::;X
nj
)
T
;j
=1
;:::;p
.Theestimateof
isdenotedas
^
=(
^
1
;:::;
^
p
).
35
3.2.3Networkconstrainedregularization
Intherststep,weobtainanestimateofthecoecientmatrix
.So,wehavepredictions
ofthecovariatesas
^
X
=
G
^
.Wesubstitute
^
X
for
X
inthesecondstep.
Theoptimizationproblemweareinterestedinis
^
=argmin
n
1
2
n
n
X
i
=1
Y
i
p
X
j
=1
^
X
ij
j
2
+
p
(
)
o
;(3.4)
p
(
)=
p
X
j
=1
j
j
j
+
1
2
X
k
˘
t
j
r
kt
j
(
k
s
kt
t
)
2
;(3.5)
where
r
kt
isaweightofcorrelationstrengthbetweentwovariables,
s
kt
=sign(
ˆ
kt
),
ˆ
kt
is
correlationcoecient,
>
0and
2
[0
;1]aretuningparameters,and
k
˘
t
means
r
kt
6=0.
ThepenaltyfunctioninEquation3.5includestwopenaltieswhichareus
edtoselectvariables
andtohandletheproblemsbroughtbyhighcorrelationsviaencoura
gingcoecientsoftwo
correlatedvariablestoshrinktothesamevaluetoachievegrouping
eect.
Weapplycoordinatedescentalgorithmtosolvetheoptimizationprob
lem.Wecenterthe
responsevariable
Y
andstandardize
^
X
j
,j
=1
;:::;p
.Let
h
2f
1
;:::;p
g
,wehave
1
2
n
n
X
i
=1
Y
i
p
X
j
=1
^
X
ij
j
2
+
p
X
j
=1
j
j
j
+
1
2
X
k
˘
t
j
r
kt
j
(
k
s
kt
t
)
2
=
1
2
n
n
X
i
=1
Y
i
^
X
ih
h
X
j
6=
h
^
X
ij
j
2
+
j
h
j
+
X
j
6=
h
j
j
j
+
1
2
j
r
hj
j
(
u
s
hj
j
)
2
+
1
2
X
k
˘
t
k;t
6=
h
j
r
kt
j
(
k
s
kt
t
)
2
:36
Takingthederivativewithrespectto
h
,wehave
@
@
h
n
1
2
n
n
X
i
=1
Y
i
p
X
j
=1
^
X
ij
j
2
+
p
X
j
=1
j
j
j
+
1
2
X
k
˘
t
j
r
kt
j
(
k
s
kt
t
)
2
o
=
1
n
n
X
i
=1
Y
i
X
j
6=
h
^
X
ij
j
(1
)
X
h
˘
j
r
hj
j
+
1+
(1
)
X
h
˘
j
j
r
hj
j
h
+
h
j
h
j
:Let
@@
h
n
1
2
n
P
n
i
=1
Y
i
P
p
j
=1
^
X
ij
j
2
+
p
(
)
i
=0.Wesolvetheequationandobtain
thesolutionas
^
h
=
S
(
1
n
P
n
i
=1
Y
i
P
j
6=
h
^
X
ij
j
+
(1
)
P
h
˘
j
r
hj
j
;
)
1+
(1
)
P
h
˘
j
j
r
hj
j
:SimilartoFriedmanetal.(2007)andFriedmanetal.(2010),weset
max
=
max
j
jh
^
X
j
;Y
ij
n
and
min
=
max
,where
=0
:001,andconstructasequenceof
K
valuesof
decreasing
from
max
to
min
onthelogscale.Weusecrossvalidationtochoosethetuningparame
ters
and
.3.3TheoreticalProperties
Herewegivethetheoreticalresultsonthevariableselectionandes
timation.Weadopt
notationsusedinLinetal.(2015).Foramatrix
A
,k
A
k
1
=max
j
P
i
j
a
ij
j
and
k
A
k
1
=
max
i
P
j
j
a
ij
j
.Foravector
,amatrix
A
,andindexsets
I
and
J
,
I
and
A
IJ
denotethe
subvectorandthesubmatrix.
J
c
denotesthecomplementof
J
and
j
J
j
denotesthenumber
oftheelementsin
J
.Denetherestrictedeigenvalueforamatrix
A
n
m
and1
s
m
by
(
A;s
)=min
jJ
s
min
6=0
k
J
c
k
1
3
k
J
k
1
k
A
k
2
p
n
k
J
k
2
.Let
r
=max
1
j
p
j
supp(
j
)
j
,s
=
j
supp(
)
j
,and
˙
max
=max
1
j
p
˙
j
,wheresupp(
)isthesupportiveset.Weassume
k
k
1
M
1
37
and
k
k
1
M
2
forsomepositiveconstants
M
1
and
M
2
.Wewrite(3.5)as
p
(
)=
P
p
j
=1
j
j
j
+
1
2
T
L
=
1
P
p
j
=1
j
j
j
+
2
T
L
,where
1
=
and
2
=
1
2
.L
isanon-negativedenitematrixequippedwiththegraphicalinforma
tion.Weassume
k
L
k
1