DATAQUALITYCONTROLANDINTER-FUNCTIONALANALYSISONDYNAMIC PHENOTYPE-ENVIRONMENTALRELATIONSHIPS By LeiXu ATHESIS Submittedto MichiganStateUniversity inpartialentoftherequirements forthedegreeof ComputerScience-MasterofScience 2016 ABSTRACT DATAQUALITYCONTROLANDINTER-FUNCTIONALANALYSISONDYNAMIC PHENOTYPE-ENVIRONMENTALRELATIONSHIPS By LeiXu Plantphenomicshavebecomeessentialcomponentofmodernplantscience.Suchcom- plexdatasetsarecriticalforunderstandingthemechanismsgoverningenergyintakeand storageinplants.Large-scalephenotypingtechniqueshavebeendevelopedtoconducthigh- throughputphenotypingonplants.However,amajorissuefacingtheseisthede- terminationofthequalityofphenotypicdata.Automatedmethodsareneededtoidentify andcharacterizealteractionscausedbysystemerrors,allofwhicharetoremovein thedatacollectionstep.Anotherissueiswearelimitedbythetoolstoanalyzefullythe phenomicsdata, esp. thedynamicrelationshipsbetweenenvironmentsandphenotypes. Theoverarchinggoalofthisthesisistoexploredynamicphenotype-environmentaldata viadatamining/machinelearningmethods.Rawdatameasuredfrombiologicaldevicesis pre-processedtonumericaldata,thencleanedbyDynamicFiltertoensurehighdataquality forfurtheranalysis.Thecleaneddataisfurtherexploredandappliedwithinter-functional analysisinordertopatternsthatcomplywithbothmachinelearningmethodologies andbiologicalconstraints. Inthisthesiswedevelopedtwotoolstomakeexplorationofphenotypingdataavailable: (1)Fordataqualitycontrol,wedevelopedamodelcalledDynamicFilter toidentifyabnormalitiesinplantphotosynthesisphenotypedata.(2)Forinter-functional phenomicsdataanalysis,wepresentanewalgorithmcalledPhenoCurveforinter-functional phenomicsdataanalysis. ACKNOWLEDGMENTS Firstandforemost,Ifeelindebtedtomyadvisor,ProfessorJinChen,forhisguidance, encouragement,andinspiringsupervisionthroughoutthecourseofthisresearchwork.His patience,prudentialattitude,extensiveknowledge,andcreativethinkinghavebeenthe sourceofinspirationforme.HewasavailableforadviceoracademichelpwheneverIneeded andgentlyguidedmefordeeperunderstanding.WhenIhesitatedbetweenresearchprogram andcourseprogramtwoyearsago,Iwasnotsureaboutmydecision,butnowIamsohappy andproudtosaythatImadeasowisedecision.Itisextremelyhardtoexpresshowgrateful Iamforhisunwaveringsupportoverthelasttwoyearsandinthecomingfuture. IwouldliketotakeonthisopportunitytothankmythesiscommitteemembersProfessor ArunRossandProfessorYanniSunwhohaveaccommodatedmytimingconstraintsdespite theirfullschedules,andprovidedmewithpreciousfeedbackforthepresentationofthe results,inbothwrittenandoralform.IwouldalsoliketothankProfessorRongJinforhis helpandadviceduringmygraduateprogramstudies. DuringmyMasterstudies,Ihadthepleasureofcollaboratingwithmanyresearchers fromeachandeveryoneofwhichIhadthingstolearn,andthequalityofmyresearchwas considerablyenhancedbytheseinteractions.IwouldliketothankDavidMKramer, ACruz,LindaJSavage,QiaoziGao,OliverTessmer,YifanYang,andZheyunFengforall thediscussionswehadandthefunmomentswespenttogetherondoingresearchandfuture planning. LivinginEastLansingwithoutmygoodfriendswouldnothavebeeneasy.Iwantto thankallmyfriendsinthedepartmentandoutsidethedepartment. Lastbutnotleast,Iwanttoexpressmydeepestgratitudetomybeloved iii parentsanddearestgirlfriend.Theirloveandunwaveringsupporthavebeencrucialtomy success,andaconstantsourceofcomfortandcounsel. iv TABLEOFCONTENTS LISTOFTABLES .................................... vii LISTOFFIGURES ................................... viii LISTOFALGORITHMS ............................... x Chapter1Introduction ................................ 1 1.1Dataqualitycontrol...............................3 1.2Inter-functionalAnalysis.............................4 1.3ThesisContributions...............................6 1.4ThesisOverview..................................7 1.5BibliographicNotes................................8 Chapter2Background ................................ 9 2.1Datacleaningforhighdataquality.......................9 2.1.1Relatedwork...............................10 2.2Inter-functionalanalysis.............................11 2.2.1Relatedwork...............................13 2.2.1.1SlidingWindowbasedCurveFittingMethods........13 2.2.1.2BayesianLinearModelwithNormalInverseGammaPrior.14 Chapter3DatacleaningwithDynamicFilter .................. 16 3.1Methods......................................16 3.1.1TheoreticalPhotosyntheticCurve....................18 3.1.2FrameworkofDynamicFilter......................20 3.1.2.1Step1.Coarseprocesstoidentifyabnormalcandidates...23 3.1.2.2Step2.KNNprocesstoabnormalityiden.23 3.1.3RelatedWorks..............................26 3.1.3.1GaussianMixtureModel....................26 3.1.3.2LinearDiscriminantAnalysisforFeatureSelection......27 3.2Experiment....................................28 3.2.1Realphenotypedataset..........................29 3.2.2Syntheticdataset.............................33 Chapter4Inter-functionalanalysisbyPhenoCurve .............. 37 4.1Method......................................37 4.1.1DataSeparationwithSlidingWindow.................38 4.1.2LocalCurveFitting............................38 4.1.3PolynomialGeneralization........................40 v 4.1.4BayesianMLEoptimization.......................41 4.2ExperimentalResults...............................43 4.2.1ExperimentalData............................44 4.2.2EvaluationCriteria............................45 4.2.3ExperimentalResultsonRealData...................46 4.2.4ExperimentalResultsonSyntheticData................48 4.2.5BiologicalV..........................50 Chapter5SummaryandConclusions ....................... 52 5.1Contributions...................................52 5.1.1Datacleaningforhighdataquality...................52 5.1.2Inter-functionalanalysis.........................54 5.2Conclusions....................................55 Chapter6FutureWork ................................ 56 6.1FutureWork....................................56 6.1.1DynamicFilterfordataqualitycontrol.................56 6.1.2Inter-functionalanalysis.........................56 APPENDIX .................................. 58 BIBLIOGRAPHY ................................... 67 vi LISTOFTABLES Table5.1TestingtherobustnessofPhenoCurveagainstwindowapproach withmultiplenoiseandbiasrates....................54 Table1Performancecomparisonon63typesofsyntheticdatasets.Eachpair ofthescoreisaMCCscoreanditsstandarddeviation.Boldfont indicatesthebestperformanceineachdataset.............63 Table2Performancecomparisonon63typesofsyntheticdatasets.Eachpair ofthescoreisaTPRscoreanditsstandarddeviation........65 vii LISTOFFIGURES Figure2.1Photosynthesis-Irradiance(PI)curve..................12 Figure3.1TheframeworkofDynamicFilter....................17 Figure3.2AnexampleofDynamicFilter.Thesolidpointsareabnormalities andthehollowpointsarenormalvalues................21 Figure3.3Performanceevaluationofprecision-recallandMatthewsCorrelation Cotonrealdataset.DFrepresentsDynamicFilter.......29 Figure3.4PerformanceimprovementbyapplyingtheKNNtprocess.31 Figure3.5Performancecomparison.Eachversioncorrespondstoat versionofDynamicFilter(DF)without:KNNt(v5),it- erationofEM(v4),consensusonregions(v3),reassignmentofnor- mal/abnormallabelsinEM(v2),orthewholeedprocess(v1)..32 Figure3.6AcasestudyontherealdatashowsthatDynamicFiltercorrectly idenalltheabnormalities......................33 Figure3.7AcasestudyontherealdatashowsthatDynamicFilteriden truebiologicaldiscoveriesunderthediurnallightcondition.Lines withthesamemarkerrepresentbiologicalreplicates..........34 Figure3.8Performancetestontemporalcheckingregionsize.(a)Fixingmax numberofabnormalities n ,andvarying m ;(b)Fixing m ,andvarying maxnumberofabnormalities n .....................35 Figure3.9TheMatthewsCorrelationCoentofbiologicaldiscoveriesand abnormalitiesonsyntheticdata.....................35 Figure4.1AnillustrativeexampleofPhenoCurvethatoptimizesbothlocal tingandsmoothnessof i 1 = 2 ,whichshowsA)dynamiclightvariation overtime,B)corresponding II values,C)anoptimizedon localregion,andD) i 1 = 2 valuesonalltemporalregions.......39 Figure4.2DemonstrationofPhenoCurverunningexampleonsyntheticdata..44 Figure4.3Cotofdetermination R 2 andsmoothnessof i 1 = 2 ontheunre- liableregionsandthewholerealphenotypedata............47 viii Figure4.4Cotofdetermination R 2 andsmoothnessof i 1 = 2 ontheunre- liableregionsandthewholesyntheticphenotypedata.........48 Figure4.5 II and i 1 = 2 onthesyntheticphenotypedatawith0.10noiseand biasrate..................................49 Figure4.6Phenotypeclusteringbasedonmaximalrelative i 1 = 2 andmaximal II .51 ix LISTOFALGORITHMS Algorithm1KNNprocesstoresults........................................24 Algorithm2FeatureSelection.....................................................24 Algorithm3EMoptimizationoneachlocalregion r ..............................25 Algorithm4DynamicFilter:DynamicFilterAlgorithm...........................59 Algorithm5Sub-procedures:getInterval..............................59 Algorithm6Sub-procedures:UpdateCandidate..................................60 Algorithm7Sub-procedures:UpdateWindow....................................60 Algorithm8Consensus...........................................................60 x Chapter1 Introduction Plantscapturesunlightto CO 2 intoenergyrichmolecules,thussupplyingourecosystem with O 2 andessentiallyallofitsbiologicalenergy,including100%ofourfood.Inplants, photosynthesisistheprimaryenergysourceformetabolismandgrowth.Understandinghow plantsoptimizeorregulateitinresponsetoacontinuouslychangingandunpredictableen- vironmentisanessentialcomponentfordevelopingstrategiestoimprovecropyieldstomeet ourgrowingneedsforfoodandfuel[32].Recentworkhasfocusedonimprovingthe ofphotosynthesistomeetourgrowingneedsforfoodandfuel([9,65,32]).Inordertode- velopoostingmechanismsthatreduceenergylossesorenhance CO 2 deliveryto cellsduringphotosynthesis,advancedtechnologiesinhigh-throughputplantphotosynthetic phenotypingandpheno-informaticshavebeendeveloped([74],[29,61,14]).Thesetechnolo- gieshaveallowedplantphotosynthesisphenotypicvariabilitytobecharacterizedandtobe relatedtoputativebiologicalfunctions,leadingtoabetterunderstandingoftheunderlying mechanismsthatcontrolphotosyntheticpropertiesundervariousenvironmentalconditions. Plantphenomicsisaassetforunderstandingthemechanismsregulatingenergy intakeinplants([52,19]). Plantphenotypingsystemsmonitorphotosyntheticperformanceformanyplantsboth continuouslyandsimultaneously.Phenomicsdatasetsarelargeandcontinuetogrowas weincreasedurationofsamplingandresolution.Yetdespitethesizeandrichnessofthe data,smallclustersoferroneousvalues,whichgivetheappearanceofrealin 1 biologicalresponses,canskewtheanalysistowardsaninvalidinterpretation([27]).Thereare severalwaysinwhichameasurementcanbeinerror:errorsoriginatingfrominstrumentation malfunctions,biasedvaluesfrommiscalibratedsensors,andinevitableerrorsofprecision.All theseissuescompromisethedownstreamdataanalysistasks.Giventhevalueofcleandata foranyoperation,theabilitytoimprovedataqualityisakeyrequirementfore knowledgeminingfromlarge-scalephenotypedata. Advancedtechnologiesinhigh-throughputplantphotosyntheticphenotypinghavebeen recentlydeveloped,allowingvariousphotosynthesisparameters(suchas II , q E , q L , q I ,and NPQ )tobetocharacterizedandtoberelatedtoputativebiologicalfunctions[29,52,59, 14].Inparallel,machinelearningandcomputervisionalgorithmshavebeendevelopedfor phenotypeinformationretrieval,dataqualitycontrol,andknowledgediscovery[24,62,71, 72,69,21]. Withthelargevolumeofplantphenotypedatabeengenerated,normalizedandcleaned, biologistsexpecttoimmediatelyidentifymutantstrainswithtphotosynthesisma- chinery,andquicklygenerateandtestbiologicalhypothesesthatmayleadtonewbreak- throughinbio-energyresearch.Indeed,computationaltoolshavebeendevelopedtoidentify temporalpatternsfromphenomicsdata[70],togroupplantsbyphenotypes[21],andto predictunknowngenefunctions[64].Nevertheless,ithasbeenemphasizedinliteraturethat itiscrucialtosimultaneouslymeasureandmodelmultiplekindsofphenotypesandenviron- mentalfactorstoarriveataholisticcharacterizationofplantperformance[68,33,25,66]. Sp,photosynthesismustrespondtochangingenvironmenttoprovidetheoptimal amountofenergytomeettheneedsoftheorganism,inthecorrectforms,withoutproducing toxicbyproducts,e.g.reactiveoxygenspeciesorglycolate[32].Inthiscontext,photosyn- thesisisasetofintegratedmodulesthatformaself-regulatingnetworkmodulatedbysignal 2 transduction. Thischapterisdevotedtoanoverviewofthesetwobroadtopicsofexploringphenotyping data,aimingtodevelopageneralcorrespondencefromdatacleaningtointer-functional analysis.Herewemovetowardstotheinafairlynon-technicalmannerandthe formaldetailedwillbegiveninChapter2. 1.1Dataqualitycontrol Inthisthesis,wefocusondataabnormalitiesdetection,whichisatypeofmeasurement error,inordertodemonstratehowcleanphenotypedatacanbeobtained.Similartosensor data,abnormalitiesinplantphenotypedatadeviatetlyfromexpectedpatterns andarevisibleoutliersinthewholedataset([56,58]).Themajorityofabnormalitiesin plantphenotypingoriginatefrominstrumentationmalfunctions(e.g.,lossofsensorsynchro- nizationduringmeasurement),ornon-biologicalstatisticaloutlierscausedbydatacollection limitations(e.g.,deteriorationofsignal-to-noiseratioforasampleasitprogressesthrough theexperiment). Dataabnormalitiesareoftenviewedasoutliersinthewholedataset.Recentworkhas showntheenessofapplyingdataminingtechniques,especiallyoutlierdetection,for thepurposeofdatacleaning([39]),makingitpossibletoautomatethecleansingprocess foravarietyofdomains([48,12,40,17]).Inthesemethods,bydetectingtheminoritiesof valuesthatdonotconformtothegeneralcharacteristicsofagivendatacollection,outliers areidenandareconsideredviolationsofassociationrulesorotherpatternsinthedata. However,theexistingmodelsarenotsuitableforphenotypedatacleaning.Thesemethods, whileappliedtophenotypedata,mayremoveoutliersincludingbothmeasurementerrors 3 andtruebiologicaldiscoveries,sincetruebiologicaldiscoveries,tosomeextent,areoutliers aswell.Furthermore,detectingabnormalitiesfromlongtime-seriesphenotypedatarequires handlingahightemporaldimension,whichincreasesthemodelcomplexity. Inordertoidentifyandremoveabnormalitiesinphenotypedataandtominimizethe deletionofbiologicaldiscoveries,wehavedevelopedadresidualanalysis algorithm,called DynamicFilter .DynamicFilterhasthreekeysteps:1)identifyabnormal candidatesatthecoarselevel,2)abnormalityideninaprojectedfeature space,and3)iterativelyidentifyabnormalitiesatthelevel.DynamicFiltercan speedupthedatapreparationprocessandmakeitmoree.Suchimprovementswill minimizetimeconsumingandlaborintensivedatapreparationandincreasethe andinbiologicaldiscoveries.Insummary,ourmodelhasthefollowingadvantages: * Toourknowledge,DynamicFilteristheworktointegratebiologicalconstraints withtime-seriesphenotypedatafordatacleaning. * Ourmodelcanidentifybothabnormalitiesandbiologicaldiscoveries. * Dynamicteroutperformstheexistingsolutionsbyoptimizingthebetween phenotypedataandbiologicalconstraints. 1.2Inter-functionalAnalysis Studyingthecomplexrelationshipsbetweenphenotypeandenvironmentposesseveralcom- putationalchallenges.First,giventhatthephenotype-environmentrelationshipislargely unknown,peopletendtolearntheinter-functionalrelationshipusingdatadrivenapproaches suchaslinearregressionorcurveting.However,1)itistochoosethebest 4 functionsincetherelationshipisusuallynon-linear[55];2)biologicalknowledgefordescrib- ingtheorganismresponsestoenvironmentalchangescannotbeelyincorporatedinto thesemodels;3)thepurelydatadrivenapproachesmaybetlybyoutliers andbiasinphenomicsdata,resultingininaccuratephenotype-environmentrelationships. Second,ifaphenotype-environmentrelationshiphasbeenwellstudied,peopletendtodi- rectlydataontoaknownbiologicalmodel(suchasMichaelis-Mentenkinetics,oneofthe bestknownmodelsofenzymekinetics)[16,11].Usingabiologicalmodel,phenotypedata canbeconvertedintolesscomplicatedmodelparametervalues,whicharebiologicallymean- ingfulandareeasiertointerpretthanrawdata[18].However,biologicalmodelsaresimple comparedwiththerealworldsituation.Itisinappropriatetodirectlyuseastatictheoretical modeltolearndynamicrelationshipsthatvaryovertimeandenvironmentalconditions[69]. Finally,photosynthesisphenotypesareusuallymeasuredunderdynamicenvironmentalcon- ditions,inarelativelylongtimeperiod,andonmanyplantswithvastlytgenetic backgrounds.Thebroadrangeofdatavariationaddsanotherlevelofcomplexitytothe problem.Insummary,newinter-functionalalgorithmsarerequiredtoexplorethecomplex phenotype-environmentrelationships. Inthisarticle,wepresentacomprehensivedataanalysisapproachcalled PhenoCurve toexplorethedynamicphenotype-environmentrelationships.Comparingtotheexisting methodsthatsolelymodelphenotypesorenvironmentalfactors,PhenoCurve,whichenables researcherstomodelphenotypesandenvironmentalfactorssimultaneously,hasthreema- joradvantages.First,althoughphenotypeandenvironmentaremeasuredseparatelywith tsensortechniques,theyarebiologicallycorrelated.Studyingtheinter-functional relationshipmayrevealpatternsthatcannotbediscoveredbyonlyusingphenotypeorenvi- ronmentaldata.Second,theinter-functionalanalysisallowsustocondensethehugeamount 5 ofphenotypedataintoasuccinct,highlycompactedform,whichwillballthefollowed dataminingprocessessuchasclusteringandgeneranking.Third,ttopurelydata drivenapproaches,PhenoCurveisabletofurtherimproveitsperformancebyincorporating preciousbiologicalknowledgeonplantphotosynthesisphenotypes.Inthefollowingcontent, wedemonstratetheenessofPhenoCurvebyidentifyingthedynamicrelationshipsbe- tweenakeyphotosynthesisphenotype II (steadystatequantumyieldofphotosystemsII) andlightintensity(denotedas i ).PhenoCurvecanbeeasilyextendedforotherphenotype andenvironmentdatawithsimplemo 1.3ThesisContributions Inthissectionweshallelaborateonthemainproblemsconsideredinthisthesisandourkey contributionstoaddresstheseproblems. Thisdissertationmainlydealswiththeexploringphenotypingdatausingmachinelearn- ing/dataminingtechniques,includingdataqualitycontrolandinter-functionalanalysis. Generally,weattempttodevelopanapplicationthatisabletoidentifyoutlierswithhigh performance,andweattempttodesignanalgorithmforreliableandrobustcurvefor phenotypingvalues. DynamicFilter .ThethesisproposesamodelcalledDynamicFil- tertoidentifyabnormalitiesinplantphotosynthesisphenotypedatabycomparing lightresponsesofphotosynthesisusingaedkineticmodelofphotosynthesis. DynamicFilteremploysanExpectation-Maximizationprocesstoadjustthekinetic modelincoarseandregionstoidentifybothabnormalitiesandbiologicalout- liers.Theexperimentalresultsshowthatouralgorithmcanelyidentifymost 6 oftheabnormalitiesinboththerealandsyntheticdatasets. Phenocurveforinter-functionalanalysis .Thethesispresentsanewalgo- rithmcalledPhenoCurveforinter-functionalphenomicsdataanalysis.PhenoCurveis amodelbasedcurvealgorithmaimingtostudyboththevaluesandthechanging ratesofthedynamicphenomicsdata.Theevaluationontherealandsimulatedphe- notypedatashowedthatPhenoCurvehasthebestperformanceamongallthetested methods.Itsapplicationonrealplantphotosynthesisdatarevealednewfunctionsof coregenesandprocessesthatcontrolphotosynthesisinresponsetovarying environmentalconditions,whicharecriticalforunderstandingplantenergystorage andimprovingcropproductivity. 1.4ThesisOverview Theremainderofthisdissertationisorganizedasfollows.Chapter2laysoutthefoundation fortherestofthethesis. Thepartofthethesisfocusesondataqualitycontrolproblem.InChapter3we focusondataqualitycontrolandtheapplicationDynamicFilterdeveloped,investigatehow itisproposedandtrytoaddressrealcaseproblemoflowdataquality. Thesecondpartofthethesisdealswithinter-functionalanalysisproblem.Chapter4 discussesthemotivationandcontextofdoinginter-functionalanalysisforphenotypingdata, especiallyafterweobtainedhighdataqualityusingDynamicFilter. Finally,Chapter5summarizesthisworkbyconcludingthemaincontributions,some potentialextensionsandthefutureresearchdirections.Inordertofacilitateindependent readingofvariouschapters,someofthedarerepeatedformultipletimes. 7 1.5BibliographicNotes Someoftheresultsinthisdissertationhaveappearedinpriorpublications.Thematerial inChapter3isbasedonaworkpublishedintheBioinformaticsonOxfordJournal[69] (Bioinformatics)andthecontentofChapter4comesfromcurrentworkwhichistobe submitted. 8 Chapter2 Background Thegoalofthischapteristogiveageneralandformaloverviewofthematerialsrelatedto theworkthathasbeendoneinthisthesis.Inparticular,wewilldiscussthekeyconcepts andquestionsrelevanttoproblemsofdatacleaningandcurveTheexpositiongiven hereisnecessarilyverybriefandthedetaileddiscussionwillbeprovidedintherelevant chapters. 2.1Datacleaningforhighdataquality Datacleaningistheprocessofidentifyingincorrectorcorruptedrecordsinadataset.The goalofdatacleaningistoensureanaccuraterepresentationofthereal-worldconstructsto whichthedatarefer.Removingimpuritiesfromdataistraditionallyanengineeringproblem, wheread-hoctoolsmadeupoflow-levelrules(suchasdetectingsyntaxerrors)andman- uallytunedalgorithmsaredesignedforsptasks(suchastheeliminationofintegrity constraintsviolations)([44]).Detectionandeliminationofcomplexerrorsrepresentingin- validvalues,however,gobeyondthecheckingandenforcementofintegrityconstraints.They ofteninvolverelationshipsbetweentwoormoreattributesthatareverytouncover anddescribebyintegrityconstraints.Recentworkhasshowntheenessofapplying techniquesfromstatisticallearningforthepurposeofdatacleaning.Inparticular,outlier detectionmethodshavemadeitpossibletoautomatethecleansingprocessforavarietyof 9 domains([13,17,31,39,40,48]). 2.1.1Relatedwork Noneoftheexistingoutlier-detectionbasedmethodsaresuitableforphenotypedataclean- ing.First,bothbiologicaldiscoveriesanderrorsofdetectionaretoseparatefrom distribution.Second,thecohesivenessruleusedintemporaldatacleaningisnotapplicable forthephenotypedata,because1)anon-cohesivetime-serialcouldrepresentaninteresting phenotypepatternratherthananerror;2)alltheobservationsatthesametimepointmay besimilarlybyasystematicabnormalevent([44]). Alternatively,ratherthancheckingtherawvalues,residueanalysiscanbeemployedto modelthebetweentherealvaluesandthetheoreticalcurve,whichisusually derivedfrombiologicalconstraintssuchasthegeneralizedlightreactions([30,38]).Thisis oftencalledthe goo model.Thegoobaseddatacleaningmodelscanbe intotwocategories.First,statisticaldistributioncharacterssuchasmean,standard deviation,intervalorrangehavebeenusedtounexpectedvaluesindicating possibleinvalidvalues([39]).Suchsimplemethodscanbetlyappliedtobigdata. However,theseparameters(suchasmean)areinclinedtobebiasedbyabnormalitieswith largedeviations.Sinceitdoesnottakeintoaccountlocalcharacteristicsofdata,thereisa riskofmislabelingarangeofnormaldataasabnormalities,andviceversa.Second,combined dataminingtechniquesareusedtoidentifypatternsthatapplytomostresidualrecords. Apatternisbyagroupofresidualsthathavesimilarcharacteristics(behaviorfor certainpercentageoftheinthedataset).Outliersarethenidenasvaluesthatdo notconformtothepatternsinthedata.Amongthem,theHampelusesthemedian ofneighboringobservationsasareferencevalueandlooksforlocaloutliersinastreaming 10 datasequence([48,49]).WhiletheHampelissuitablefortemporaldatacleaning,it assumesthatthedataareindependentandidenticallydistributed,whichisnotvalidunder dynamicenvironmentalconditions.Butwhenthedataishighlyautocorrelated,Hampel mayfailtocaptureabnormalities. Itshouldbenotedthatwhilethegoobaseddatacleaningmodelsfocusonthe modelingofdeviation,theyarenotawarethatthetheoreticalcurve,whichisusedasthe reference,maynotalwaysbeprecise.Typically,theoreticalcurvesderivedfrombiological knowledgearesimplecomparedwiththerealworldsituation.Itisthereforeinappropriate todirectlyusetheimperfecttheoreticalcurvetoinferabnormalities. Inthispaper,wedeveloparesidualanalysismodelcalled Dynamic Filter toelyidentifyabnormalitiesinplantphotosynthesisphenotypedata.Our modelderivesatheoreticalcurvefromthephotosyntheticbiologicalconstraints;adjuststhe theoreticalcurvetothephenotypedataviaoptimization;andthenstudiesthedeviations ofindividualphenotypevaluesfromtheoreticalcurve.Theresultingpatternsinresiduals indicateabnormalities,whichisatypeoferrorsofdetection,andtheoptimizedtheoretical curvesrevealtruebiologicaloutliers. 2.2Inter-functionalanalysis Inthissection,weintroducethebiologicalbackgroundonmodelingtherelationshipbe- tweenlightintensityandtherateofphotosynthesis,aswellastheexistingcomputational approachesoncurveandregression. Thephotosynthesis-irradiance(PI)curveisagraphicalrepresentationoftheempirical relationshipbetweensolarirradianceandphotosynthesis[38].AsaderivationofMichaelis- 11 (a)HyperbolicPhotosynthesis-Irradiance(PI) curve. (b)MeasuredphotosyntheticratesofCol-0and thePIcurve. Figure2.1:Photosynthesis-Irradiance(PI)curve. Mentenkinetics,PIismodeledasahyperboliccurveasshowninFig.2.1(a)[42],indicating thatthereisagenerallypositivecorrelationbetweenlightintensityandphotosyntheticrate. ThePIcurvehasbeenappliedsuccessfullytocleanplantphotosynthesisphenotypedata,as wellastoexplainocean-dwellingphytoplanktonphotosyntheticresponsetochangesinlight intensity[30,69]. Let P bephotosyntheticrateatagivenlightintensity[ I ],thePIfunctionisinEq.2.1: P = P max [ I ] i 1 = 2 +[ I ] (2.1) where P max isthemaximumpotentialphotosyntheticrateperindividual,and i 1 = 2 ishalf- saturationparameterrepresentingtheamountoflighttoproducthalfofmaximumphoto- synthesis.Parameter i 1 = 2 isusuallylearnedwithaserialofvaluesof II andlight( i )using curvemethods. Bydescribingthephotosyntheticrate P usinglinearelectronw,wecanassociateboth 12 II and i withtime t usingEq.2.1: II ( t;i 1 = 2 )= II ) 1+ i ( t ) i 1 = 2 (2.2) where t isatimepointinatemporalregion R ; II ( t )and i ( t )represent II (steadystatequantumyieldofphotosystemsII)andlightintensityat t ;max II )isthe maximal II in R .Seemoredetailsin[69]. Duringshorttimeregion, i 1 = 2 inEq.2.2remainsconstant.However,itmaychange graduallyoverlongtimeregionwiththechangesofmultipleenvironmentalfactorssuch aslight,temperature, CO 2 ,and O 2 [14].Thisphenomenon,namelyphenotypicplasticity, indicatestheabilityofanorganismtoproducemorethanonephenotypewhenexposedto tenvironments[50,46]. 2.2.1Relatedwork Knowingthetrendof i 1 = 2 willgreatlyenhanceourunderstandingtothemechanismthat regulateplantsinresponsetoacontinuouslychangingandunpredictableenvironment.How- ever,tocapture i 1 = 2 bydirectlyapplyingthePIfunctiononthephenotypeandenvironment datamaybeinappropriateduetothehighnoiserateinrealdata[69]. 2.2.1.1SlidingWindowbasedCurveFittingMethods Sincethechangeof i 1 = 2 isassumedtobesmoothandcontinuous,aslidingwindowapproach maybemoreappropriatethandirectlyapplyingthePIfunctiononeverytimepoint.Us- ingtheslidingwindowapproach,wecandividethewholephenotypedataintooverlapped temporalregions,andthenemployacurvettingorregressionmethodtocomputealocal 13 i 1 = 2 valueforeachregion.Finallyallthelocal i 1 = 2 valuesaremergedtocapturetheglobal phenotype-environmentrelationship.Notethatthereisnoexplicitboundarybetweencurve andregression,whiletheformerismoreontheoptimizationitself,andlater focusesmoreonstatisticalinference[43]. Curvegisacommonlyusedmethodtomodeltherelationshipsamongtwoormore variables[43].Mathematically,itisaprocesstotunetheparametersofaknownmathemati- calfunction f ( x )toachievethebesttoaseriesofdatapoints.TheLevenberg-Marquardt algorithm(LMA),akathedampedleast-squaresmethod,hasbeenwidelyusedfornonlinear leastsquarescalculationsforsolvinggenericcurvproblems[35].LMAinterpolates betweentheGauss-Newtonalgorithmandthemethodofgradientdescent,aimingtoa localminimum[28,6]. Ifthefunction f ( x )isunknown,kernelsmoother,suchaslocallinearregression (LLR),isoftenusedforestimatingasmoothfunctionfromaserialsofnoisyobservations[26]. Inthisway,non-linearrelationshipsbetweenphenotypesandenvironmentalfactorscanbe learnedfromdata,eveniftheunderlyingbiologicalmechanismbetweenthemisunknown. 2.2.1.2BayesianLinearModelwithNormalInverseGammaPrior Whilethesliding-windowbasedcurvemethodsoptimizeeverylocalthey simplyignoretheglobalcontinuityoftheparameter i 1 = 2 .Thus,theymaybesensitiveto noiseandbiasinlocalregionsinrawphenotypedata,resultingininaccuratephenotype- environmentrelationships.Tothisend,performanceimprovementmaybeachievedbyusing Bayesianlinearmodelwithnormalinversegamma(NIG)prior[22]. 14 First,wetransferEq.2.2toitslinearformforeasierrepresentation: II ) II ( t i ) 1= 1 i 1 = 2 i ( t i )+ ( t i )(2.3) where ( t i )istheerrorassociatedwith t i ,distributedasnormaldistribution N (0 ;˙ 2 i ). Second,similartoSection2.2.1.1,weadoptaslidingwindowapproachtoestimate i 1 = 2 and ˙ 2 i foreachtemporalregionusinglinearregressionmethods[20].Givenathreshold onthelinearregressionreliability R 2 ,wecanclassifyallthetemporalregions D intotwo groups,i.e.reliabledata D r andunreliabledata D u [28]. Third,inordertoderivetheposteriorprobability p ( i 1 = 2 ; 2 j II ;i ),weassumethat ( i 1 = 2 ;˙ 2 i )followsthenormalinversegammadistribution,i.e. i 1 = 2 ;˙ 2 i ˘ NIG ( V;a;b ), where V;a;b isthesetofhyperparametersofNIG.Subsequently i 1 = 2 ; 2 j II ;i isalsodis- tributedasNIGyetwiththyperparameters( ;V ;a ;b ).Assumingthat D r and D u followthesinglemodeoftheNIGdistribution,weestimatethehyperparametersusing D r andapplythemon D u .Sp, i 1 = 2 ofeachtemporalregionin D u canbeobtained by1)obtainingthemarginalposteriordistributionbyintegratingout ˙ 2 i ,2)maximizinga posterioriprobabilityofthemarginalposteriordistribution,and3)optimizingaBayesian linearmodelusingboththesharedhyperparametersandlocaldata.Seemoredetailsat [22]and[8]. Inthismodel,theglobalcontinuityof i 1 = 2 isguaranteedduetoasinglemodeofthe normalinversegammadistribution(i.e. D r and D u sharethesamesetofcommonhyper parameters)[8].However,theassumptionsonconstanthyperparametersmaybetoorigid forrealphenotypeandenvironmentdata,inthat D r and D u maybeundertprior distributionsduetotenvironmentalconditions. 15 Chapter3 DatacleaningwithDynamicFilter AdatacleaningframeworkcalledDynamicFilterisproposedfordataqualitycontrolinthis Chapter. Theremainderofthechapterisorganizedasfollows.Section3.1motivatestheproblem andmainintuitionbehindtheproposedalgorithm,thedetaileddescriptionoftheproposed algorithm.Section3.2presentsthedetailedimplementationissues,theexperimentalsetup, resultsandanalysis. 3.1Methods Inthissection,weintroducethetheoreticalcurveoftime-seriessteady-statequantum yielddata,andthenintroduceaframeworkforabnormalitydetection. Inthispaper,thetime-seriessteadystatequantumyieldofphoto-systemsII(denoted as II )ischosenforabnormalitydetectionforthreereasons.First, II canbereadily measuredusingvideoimagingmakingitusefulforhighthroughputphenotyping. Second,becauseitlight-drivenelectrontransfer,itcanbeusedasanindicatorof photosyntheticratesand,albeitwiththecaveatthatitthesumof CO 2 photo-respirationandotherprocesses([47]).Finally, II isagooddemonstration oftheapproachbecauseittendstofollow,toareasonabledegree,relativelysimplesaturation behaviors.Givenanadequatemodel,thecleaningproceduredescribedinthemanuscript 16 Figure3.1:TheframeworkofDynamicFilter. mayalsobeappliedtootherphotosyntheticparameterslikenon-photochemicalquenching (NPQ),whichcandisplaycomplexbehaviors. 17 3.1.1TheoreticalPhotosyntheticCurve Anabnormalityinresidualanalysisisanobservationexhibitingalargerencebetween thetheoreticalvalueandtheobservedvalue,andmayindicateadataentryerrorfromthe phenotypingsensors.Toderivethetheoreticalcurve,wemodel II withthephotosynthesis- irradiance(PI)curve(seeFig2.1(a))([30,38]). AsaderivationofMichaelis-Mentenkinetics,oneofthebest-knownmodelsofenzyme kineticsinbiochemistry([42]),PIismodeledasahyperboliccurve(seeFig2.1(a))inEq. 2.1,revealingtheempiricalrelationshipbetweensolarirradianceandphotosynthesis([38]). where P isphotosyntheticrateatagivenlightintensity, P max isthemaximumpotential photosyntheticrateperindividual,[ I ]isagivenlightintensity,and i 1 = 2 ishalf-saturation constant.Fig2.1(a)showsthegenerallypositivecorrelationbetweenlightintensityand photosyntheticrate.ThePIcurvehasalreadybeenappliedsuccessfullytoexplainocean- dwellingphytoplanktonphotosyntheticresponsetochangesinlightintensity([30]),aswell asterrestrialandmarinereactions. Wedescribethephotosyntheticrate P intermsoflinearelectronw([32]),andasso- ciatebothtemporalsteadystatequantumyieldofphotosystemsII II andtemporallight intensity i withtime t ,asshowninEq.2.2:where t isatimepointinatemporal region T ( t 2 T ); II ( t )and i ( t )representthesteadystatequantumyieldofphotosystemsII andlightintensityat t ;m II )isthemaximal II in T ;andthehalf-saturationconstant i 1 = 2 isthelightintensityatwhichthephotosyntheticrateproceedsathalf P max .Seeproof insupplementarySection1. OnemayreasonablyaskiftheNPQorphotoinhibitionwouldthetheoreticalmodel forlightsaturation.Infact,NPQhas(surprisingly)littleontherelationshipbetween 18 II andlightintensityascanbereadilyseeninthefactthatthe II lightsaturation curvesforwildtypeandthe npq 4mutantofArabidopsisareessentiallyidenticaldespite largeinqE(i.e.,rapidlyreversiblephotoprotectionofNPQ)([36]).Thereason forthisapparentdisconnectisthat,athighlight,thesloweststepinthelightreactionsof photosynthesisoccurssubsequenttolightabsorptionatthecytochrome b 6 f complexandis regulatedbythepHofthelumen([60]).Lightabsorptionbecomeratelimitingonlyat NPQlevelsmuchhigherthanthoseobservedhere.ThebiologicalroleofNPQundermost conditionsappearstobeinregulatingelectrontransferbutinpreventingthebuildupof reactiveintermediateswithinthephotosystemIIreactioncenter([45]).Thus,theof moderatelevelsofNPQandphotoinhibitionshouldhavelittlethebehaviorofthewild typesystem.However,underextremeconditionsofinmutantlineswithalteredbehavior producinghighlevelsofNPQorphotoinhibitionweexpecttoseebehaviorthatdeviates fromthatproducedbythemodel.Theseinstanceswillbedetectedasoutliersand forfurtherinvestigationofpossiblebiologicaldiscoveries. Consequently,thehalf-saturationconstant i 1 = 2 canbelearnedusingall II ( t )and i ( t )in T withanonlinearregressionmethod([54]).Notethatthehalf-saturationconstantcanbe dramaticallytbetweenplantsandbetweenleavesinplants.Thus,thegeneralshape ofthecurveistypicallymaintained,butnotitsmaximalorhalf-saturationlightintensity. Finally,given i 1 = 2 ,theresidualvalueateachtimepoint t isas: rsd ( t )= II ( t ) 0 II ( t;i 1 = 2 )(3.1) where rsd ( t )istheresidualvalueatattime t ;and II ( t )istheobservedvalueand 0 II ( t ) isthetheoreticalvalueofsteadystatequantumyieldat t calculatedusingEq.2.2. 19 WenotethattherearemultiplemodelsforPIcurves,whichgivesimilarresponsesto light([37,23,73,34,15]).Inthispaper,wechosetheMichaelis-MentenKineticsmodel becauseitisconvenienttouse,andplantphotosynthesisratedatawell(seeFig2.1(b)). Itshouldbenotedthatanimportantfeatureofourapproachisthatthesealternativemodels canbeeasilyaddedorsubstitutedforcomparison. 3.1.2FrameworkofDynamicFilter DynamicFilterisacoarse-residualanalysisapproach,whichhasthreemajorsteps asshowninFig.3.1.Weabnormalitiesusingatothatproposedin[40]: 3.1. Abnormality. Let f nor g beasetofnormalphenotypedata,and f rsd nor g bethecorrespondingresidualset.Anabnormality abn isaphenotypevaluewhoseresidual fallingthe ceintervalofthemajornormaldistributionof f rsd nor g . Notethatinterval =99%iscommonlyusedinliterature([40,57]),but isadjustablebyusers.Inthispaper,byadoptingtheconceptofinterval,we assumethat1)themajorityofthephenotypevaluesarecorrect,and2)theyformthemajor distributionintheresidualdata,whichisalsodistinctlytfromthedistribution(s)of theresidualdataoftheabnormalities. 20 (a)GMMincoarseprocess (b)Candidatesincoarseprocess (c)Beforeregionalnt (d)ApplyEMonlocalregion (e)OutputofEM (f)Afterregionalt Figure3.2:AnexampleofDynamicFilter.Thesolidpointsareabnormalitiesandthehollow pointsarenormalvalues. 21 (a)Candidatesinprocess (b)Finaloutputs Figure3.2(cont'd) 22 3.1.2.1Step1.Coarseprocesstoidentifyabnormalcandidates. Givenasetofphenotypedata II ,weadoptEq.2.2togeneratethetheoreticalvaluesof steadystatequantumyieldsforeachplant,denotedas f 0 II g ,byusingthewholetime-serial astemporalregion T ,akathecoarselevel.ForthedatasetusedintheExperimentSection, thesmallestvalueoftimeintervalis10minutes,andthescaleof T inthewholedataset is3days.Consequently,wegeneratetheresidualdataofallplants f rsd g usingEq.3.1, andmodelthemusingaGaussianmixturemodel(GMM)(seedetailsinSection3.1.3and exampleinFig.3.2(a)).Finally,wegeneratetheabnormalitycandidateset f abn g with 3.1.InFig.3.2(b),solidpointsareabnormalitycandidatesinthecoarseprocess. Clearly,becauseofthePIcurvemodel,notalltheabnormalitycandidatesare correctlyiden 3.1.2.2Step2.KNNprocesstoabnormalityiden Abnormalitycandidatesmayhavecertainintrinsicpatternsofdistributionhighlyrelatedto certainrangesoffeaturespace.Forexample,accidentaldysfunctionofdata-capturingdevices maycauseabnormalitiesconcentratedaroundsomeregions,whichformstatisticalpatterns onthedistributionplotofthefeaturespace.Fromastatisticalviewpoint,abnormalities shouldbeawayfromnormalvaluesinthefeaturespace,andvalueswithsimilarfeatures tendtohavethesamelabels.Thisleadstoatprocesstoexploitthepatterns ofabnormalitiescandidatesonselectedfeaturespace,andtomakeuseofthesepatternsto abnormalityidenasdescribedinAlgorithm1. Sp,weselecttheoptimalfeaturesfrom II , rsd , i , t ,etc.,inwhichab- normalitiesandnormalvaluesaremaximallyseparated.Tosolvethisfeaturereduction problem,LinearDiscriminantAnalysis(LDA)isadoptedtogettheprincipalcomponents 23 oftheoptimalfeaturespace(Algorithm2,seedetailsinSection3.1.3).Second,weap- plyK-nearest-neighborapproachontheselectedfeaturespace,suchthateachabnormality candidatewillberelabeledasitsmajoritylabelof k nearestneighbors(Algorithm1)([4]). Algorithm1 KNNprocesstoresults Input: :originalfeaturespace C :thesetoflabels(abnormalityornormal) k :numberofnearestneighborsused 1: proj Feature ;C )2 2: for i in proj do 3: C i majoritylabelof k nearestneighbors 4: endfor 5: Output: C Algorithm2 FeatureSelection Input: :originalfeaturespace C :thesetoflabels(abnormalityornormal) 1: 1 j C j P 2: for i from1to2 do 3: i Featuresof i th label 4: n i j i j ; i 1 n i P i 5: SW i 1 n i P ( i i )( i i ) T 6: endfor 7: SW P 2 i =1 SW i 8: SB P 2 i =1 n i j C j ( i )( i ) T 9: proj eig ( SB=SW ) 10: Output: proj : proj isprojectedspace Step3.processtoidentifyabnormalitiesinlocalregions. Becausethetheoreticalvalues f 0 II g arelearnedwiththedPIcurvemodelatthe coarselevel,notalltheassignmentsoftheabnormalcandidatesarecorrect.Consequently, weseparatetheabnormalcandidates f abn g totemporalcheckingregions(seen 3.2)andabnormalityidenationineachregion. 24 Algorithm3 EMoptimizationoneachlocalregion r Input: phenotypevaluesinalocalregion i :lightintensity :interval 1: Let nor and abn benormalvaluesandabnormalitiesin 2: repeat 3: E-step: 4: [ rsd nor ;i 1 = 2 nor ] PI Curv nor ;i ) 5: [ nor ;˙ nor ] GMM ( rsd nor ) 6: [ rsd min ;rsd max ] terval( nor ;˙ nor ; ) 7: M-step: 8: rsd abn abn ;i 1 = 2 nor ) 9: nor ; abn ] UpdateCandidate( rsd nor ;rsd abn ;rsd min ;rsd max ) 10: until nor and abn arestable 11: Output: nor 3.2. TemporalCheckingRegion. Acheckingregion r consistsofatmost m normalvaluestheselectedabnormalcandidates,dependingondataavailability, denotedas f nor g ,andatmost n abnormalcandidatessuchthatthelastabnormalcandidate isconstrainedtobeatmost l -timepointsawayfromtheone,denotedas f abn g . In3.2, m , n and l areuserparametersthatdeterminethesizeofa temporalcheckingregion.Acheckregionhasatmost m + l n normalvaluesandatmost n abnormities.Notethatabnormalcandidatescanbecontinuousordiscontinuous,andtwo checkingregionsmaysharecommonnormalvalues. Intheprocess,anExpectation-Maximization(EM)processisemployedtorepeat- edlyoptimizetheresultsineachtemporalregion r .Pseudo-codeoftheEMprocessisshown inAlgorithm3.IntheEstep,usingthelocalnormalvalues f nor g incheckingregion r as inputs,weregeneratethetheoreticalvalues f 0 g withEq.2.2.Thentheresiduals f rsd g for boththeabnormalcandidates f abn g andthenormalvalues f nor g areregeneratedusing Eq.3.1(Algorithm3line6-8).IntheMstep,wetheabnormalcandidateset f abn g 25 withthestatisticaldistributionofthenewresidualdata f rsd g accordingto3.1. Sp,avaluefallstheintervalthresholdofthemajordistributionofthe normalresidualvalueswillbemovedto f abn g ;andifanabnormalcandidateiswithinthe intervalthresholdofthemajordistributionofthenormalresidualvalues,itwill belabeledasnormalandbemovedto f nor g (Algorithm3line10-11). TheEMprocesswillstopwhenthelabelassignmentisstable.Fig.3.2(c-f)showsthe iterativeprocessinacheckingregion.Sincecheckingregionsmaysharecommonvalues,the resultsfromtregionsmaybeForexample,aphenotypevalueisiden asanabnormalityinoneregionbutisconsideredanormalvalueinanotherregion.To solveandconsequentlyimproveperformance,weemployaninformationsharing processintheendoftheEMprocesstobroadcastallthelocalresultstoallthechecking regions.Ifexists,votingresultswillbeusedtoabnormalcandidatesinthe selectedfeaturespace(step2),andtheEMprocesswillrerunonthenewcheckingregions. Theprocesswillrepeattilltheresultsconverge.Fig.3.2(g,h)demonstratethatallthe abnormalitiesareiden 3.1.3RelatedWorks WeintroducetheGaussianMixtureModel(GMM)andtheLineardiscriminantanalysis (LDA)usedinSection3.1.2.1asfollows. 3.1.3.1GaussianMixtureModel. AGaussianMixtureModelisaparametricprobabilitydensityfunctionrepresentedasa weightedsumofGaussiancomponentdensities([53]).GMMsarecommonlyusedasapara- metricmodeloftheprobabilitydistributionofcontinuousfeatures([53]).Theprobability 26 densityfunctionisgivenbytheequation: p ( x j )= M X i =1 ! i g ( x j i ; i )(3.2) where x isaD-dimensionalcontinuous-valuedvector, ! i , i =1 ;:::;M ,arethemixture weights,and g ( x j i ; i ) ;i =1 ;:::;M arethecomponentGaussiandensities.Eachcompo- nentdensityisaD-variateGaussianfunctionoftheform: g ( x j i ; i )= 1 (2 ˇ ) D= 2 j i j 1 = 2 exp 1 2 ( x i ) T 1 i ( x i ) g (3.3) with i bethemeanvectorand i bethecovariancematrix( i =1 ;:::;M ).Themixture weightssatisfytheconstraintthat P M i =1 ! i =1.GMMparametersareestimatedfromtrain- ingdatausingtheMaximumLikelihoodParameterEstimationorMaximumAPosteriori Estimation([53]).Inthispaper,residualsareone-dimensionalscalardata,weuse i and ˙ i torepresentthemeanandvarianceofresiduals. 3.1.3.2LinearDiscriminantAnalysisforFeatureSelection. Lineardiscriminantanalysis(LDA)isamethodsusedinstatistics,patternrecognitionand machinelearningtoalinearcombinationoffeatures,whichcharacterizesorseparatestwo ormoreclassesofobjectsorevents,suchthattheinter-classvarianceismaximizedandthe intra-classvarianceisminimized([67]).Theresultingcombinationmaybeusedasalinear ormorecommonly,fordimensionalityreductionbeforelaterInthis paper,weseekcombinationoffeatures,withwhichnormalvalues(oneclass)arecentered aroundonearea,whileabnormalities(anotherclass)arecenteredaroundadistinctively separatedarea. 27 Supposethereare C classes,andeachclasshas n i points,mean i andintra-classvariance i .Thentheinter-classvariancemaybebythesamplecovarianceoftheclassmeans: SB = C X i =1 n i j C j ( i )( i ) T (3.4) andtheintra-classvarianceofwholedatasetis SW = P C i =1 SW i ([41]).Theclassseparation inadirection ~ ! inthiscasewillbegivenby: S = ~ ! T SB ~ ! T ~ ! T SW ~ ! T (3.5) Theobjectivefunctionistomaximize S anditcanbeshownthatwhen ~ ! istheeigenvector of SW 1 SB , S willhavemaximizedvaluecorrespondingtoeigenvalue([51]). 3.2Experiment WecomparedDynamicFilteronbothrealandsyntheticdatasetswithtwowidely-used datacleaningalgorithms:1)astatisticalapproachthatabnormalitiesbasedon standardvariance([39]),and2)Hampelthatidenabnormalitiesbasedondigress frommedianoftrends([48,49]).Notethatallthethreemethodswereappliedonthesame phenotyperesidualdataforafaircomparison. Forperformanceevaluation,weusedboththeprecision-recallcurveandtheMatthews correlationcot(MCC)([5]).TheMCCthatcanappropriatelyrepresentaconfusion matrixiscomputedwith: MCC = TP TN FP FN p ( TP + FP )( TP + FN )( TN + FP )( TN + FN ) (3.6) 28 (c)Precision-Recallcurves (d)MCCw.r.t.intervalthreshold Figure3.3:Performanceevaluationofprecision-recallandMatthewsCorrelationCot onrealdataset.DFrepresentsDynamicFilter. 3.2.1Realphenotypedataset WetestedtheperformanceofDynamicFilterusingtheplantphotosyntheticphenotype dataconsistingof106 Arabidopsisthaliana plantsedT-DNAinsertionmutantsand wild-types)sampledat64timepointsunderdynamiclightconditions([3,1]).Thephoto- 29 syntheticphenotypevaluesvarydramaticallyacrossplants,potentialdi indevelopment,stressresponses,orregulationofprocessessuchasstomatalconductance, photodamage,andstorageofphotosynthate([32]).Expertswentthroughthedataand manuallymarkedthegroundtruthofabnormalities,andfoundtheerrorrateis6.5%. TheexperimentalresultsshowninFig3.3(a)indicatedthatDynamicFilteristly betterthantheothertwoapproachesintheprecision-recallcurve.Sp,Dynamic FilteryieldsAUCashighas0.964,higherthantheAUCofsimplestatisticsandHampel Filter(0.147and0.543respectively).Fig3.3(b)showsourmodelisalsotlybetter accordingtoMCC.Furthermore,itshowsthatDynamicFilterisinsensitivetotheselection oftheintervalthreshold,whichisdistinctlytfromtheotheralgorithms thatrelyonwell-pickedparameters. NotethattheAUCofDynamicFilterwithoutKNNis0.862(Fig3.3(a)),implyingthe KNNt(step2)isakeycomponentofDynamicFilter.Sp,Fig3.4shows howtheKNNtimprovedtheperformanceofdatacleansing.Onthe II vs. residualplotshowninFig3.4(a)(detailedvisualizationonFig3.4(b)),someisolatednormal valuesareasabnormalities,andcertainabnormalitiesasnormal values.Clearly,thesevaluesdonotconformwiththemostnearbyvalues.Byapplyingthe KNNt,theseareelycorrected(Fig3.4(c,d)). WesystematicallytestedtheperformanceofthetcomponentsofDynamicFilter. Fig.3.5showstheperformanceimprovementbycomparingDynamicFilterwithamodel withoutKNNt(v5),iterationofEM(v4),consensusonallregions(v3),reassign- mentofnormalvaluesandabnormalitiesinEM(v2),orevenwithoutthewhole process(v1).Itimpliesthattheedprocess,especiallytheKNNandEMt,is thekeyofperformanceimprovement. 30 (a)Abnormalitiesincoarseprocess (b)Detailedviewof(a) (c)AbnormalitiesafterKNN (d)Detailedviewof(c) Figure3.4:PerformanceimprovementbyapplyingtheKNNtprocess. Fig3.6and3.7showcasestudiesontherealdata.InFig3.6,theexperimentwasrunon awild-typereferenceplant,ArabidopsisCol-0.Inthecoarseprocess,theresidualanalysis wasappliedtoidentifytheabnormalcandidates(Fig3.6(a)andsolidpointsinFig3.6(b)). Clearly,6solidpointsonthebottomwereincorrectlylabeledasabnormalities,whichwere graduallycorrectedintheprocess(Fig3.6(c,d)).Fig3.7(a)showsatruebiological discoveryontherealdata.OurscreenrevealedaccessionELYexhibitingphotosynthetic characteristicsmarkedlytfromthereference(Col-0).Itwould,however,belabeled asabnormalandsubsequentlydeletedbytheexistingoutlierdetectionbaseddatacleaning 31 Figure3.5:Performancecomparison.Eachversioncorrespondstoatversionof DynamicFilter(DF)without:KNNt(v5),iterationofEM(v4),consensuson regions(v3),reassignmentofnormal/abnormallabelsinEM(v2),orthewhole process(v1). methods,resultinginover-cleanproblem.DynamicFilteridenELYcorrectlyandsug- geststhattheinitsquantumyieldiscausedbythemonotonedecreaseof i 1 = 2 regardlessthechangeofsunlight(seeFig3.7(b)).Thenonnegligibledeviationbetweenthe observedvaluesandthetheoreticalcurvelearnedfromthecoarsephaseofDynamicFilter (seeFig3.7(c))impliesthetheoreticalmodelissimplecomparedwiththerealworldsitu- ation.InsteadofdirectlyusethePIcurvetoinferabnormalities,weoptimizethe resultsinthephaseofDynamicFilter,resultinginalmostperfectmatchbetweenthe observedvaluesandthetheoreticalcurve(seeFig3.7(d)). Furthermore,wevariedthesizeofthetemporalcheckingregionandcomparedtheper- formanceinFig3.8.TheresultsinFig3.8(a)revealthatDynamicFilterachievesthebest performancewhen m isbetween10and15.Thisnumberallowsenoughtrainingdataforthe tprocess,meanwhileavoidingNPQvariationoverlongtimeinterval.Fig3.8(b) 32 (a)GMMincoarseprocess (b)Candidatesincoarseprocess (c)Candidatesinprocess (d)Finaloutputs Figure3.6:AcasestudyontherealdatashowsthatDynamicFiltercorrectlyidenall theabnormalities. showsperformanceofDynamicFilterisrelativelystableagainstmaxnumberofabnormali- ties n ,implyingtherobustnessofDynamicFilterishigh. 3.2.2Syntheticdataset Sincethetruebiologicaldiscoveriesintherealdataareunknown,wefurthertestedDynamic Filteronserialsofsyntheticdatasets.Thesyntheticdatasetsweregeneratedbyvaryingfour parameterssystematically:lightsand i 1 = 2 beingsmoothlyorabruptlychanged,abnormal- itiesbeingcontinuouslyordiscontinuouslydistributed,anderrorratiobeingloworhigh. Furthermore,weaddedvariationsrepresentingabnormalitiesandbiologicaldiscoveries(dif- 33 (a)PhenotypeofELYandCol-0 (b) i 1 = 2 learnedbyourmethod (c)ELYobserved&modelleddatainthecoarse phase (d)ELYobserved&modelledinthephase Figure3.7:AcasestudyontherealdatashowsthatDynamicFilteridentruebiological discoveriesunderthediurnallightcondition.Lineswiththesamemarkerrepresentbiological replicates. ferent i 1 = 2 values)inthesyntheticdatasets.Intotal,63kindsofsyntheticdatasetsin9 groupsweregenerated,andforeachkindofsyntheticdata,werepeatedlygenerated100 datasets. Fig3.9showstherobustnessofDynamicFilterontsyntheticdatasetsgenerated under9tsettings.TheperformanceisevaluatedusingMatthewsCorrelationCo- tonbothabnormalitiesandonbiologicaloutliers.Eachrepresentssynthetic datageneratedundertsettings(seedetailsinsupplementarysection2).Eachpoint 34 (a)Performancevs.maxnumberofnormalvalues (b)Performancevs.maxnumberofabnormalities Figure3.8:Performancetestontemporalcheckingregionsize.(a)Fixingmaxnumberof abnormalities n ,andvarying m ;(b)Fixing m ,andvaryingmaxnumberofabnormalities n . Figure3.9:TheMatthewsCorrelationCotofbiologicaldiscoveriesandabnormalities onsyntheticdata. 35 inFig3.9representsaMCCscoreofbiologicaldiscoveryidenatx-axisandaMCC scoreofabnormalityidenaty-axis.Thehighestpossiblevalueis(1 : 0 ; 1 : 0).The experimentalresultsshowthatDynamicFilter(redcircle)isbetterthantheothertwometh- odsinalmostallthesyntheticdatasets.ThisisbecauseDynamicFiltercanidentifyand removeabnormalitieswhilereservingbiologicaldiscoveries(seesupplementaryTable1and Table2forperformancecomparisononMCCandtruepositiveraterespectively). 36 Chapter4 Inter-functionalanalysisby PhenoCurve InthisSection,weproposeamodelbasedBayescurvealgorithm,whichisableto studyboththevaluesandthechangingratesofthedynamicphenomicsdata.Inparticular, thisalgorithmisdesignedforstudyingthehiddenparameterofphenomics,butactuallyit isnotexclusivetophemomicsdatabutalsomaybeextendedtobeusedonothermodels. Therestofthechapterisarrangedasfollows.Section4.1motivatestheproblemandmain intuitionbehindtheproposedalgorithm,thedetaileddescriptionoftheproposedalgorithm. Section4.2presentsthedetailedimplementationissues,theexperimentalsetup,resultsand analysis. 4.1Method Inordertoexplorethedynamicphenotype-environmentrelationshipswithoutthetechnical limitationsincurveandBayesianNIGmethods,wepresentacomprehensivedata analysisapproach PhenoCurve basedonBayesiantheoremandpolynomialgeneralization. PhenoCurvehasfourcomponents.First,itsplitsthewholephenotypeandenvironment dataintohighlyoverlappedtemporalregionsusingaslidingwindowapproach,allowingfor modelingthegradualchangeof i 1 = 2 .Second,itemploysanon-linearcurvemethod 37 tocompute i 1 = 2 foreachtemporalregion,andtheresultsintotwogroups,i.e. reliable( D r )andunreliable( D u ),basedon R 2 .Third,usingdatafrom D r ,itestimates i 1 = 2 foreachunreliableregionin D u withpolynomialgeneralization.Finally,itoptimizes the i 1 = 2 valuesforallunreliableregionswithBayesiantheoremusinglocaldata,resulting inincreasedreliabilityincurveWewillintroduceallthefourstepsinthefollowing content.Anillustrativeexampleof II andlightdataisshowninFig.4.1AB. 4.1.1DataSeparationwithSlidingWindow Duetobiologicalconstraints,thesamplingrateofthephenotypeisusuallymuchlowerthan thatoftheenvironmentalfactors[14].Subsequently,wesplitthewholedatasetintohighly overlappedtemporalregionssolelybasedonthephenotypedata.Inthesliding windowapproach,eachtemporalregionhas d pairsofphenotypeandenvironmentvalues, andthevaluesineachpairareprobedatexactlythesametime(orarecloseenoughto eachother).Eachtemporalregionshare d 1valueswiththepreviousandthenextregion. Thewindowwidth d twoconditions:1) i 1 = 2 remainsrelativelyconstantwithineach temporalregion;and2)thereareenoughdataineachtemporalregionforinferring i 1 = 2 . 4.1.2LocalCurveFitting Thenextstepistoinfer i 1 = 2 foreverytemporalregion R n centeredaroundtimepoint t n .Its halflightparameter i n 1 = 2 canbeestimatedusingleastsquarecurve[20].Thegeneral ideaistoidentifyparametersinafunctiontominimizethesumofallthesquareof errorsbetweenthepredictedandtheobservedvalues.Inourcase,assumingthereare d 38 Figure4.1:AnillustrativeexampleofPhenoCurvethatoptimizesbothlocaland smoothnessof i 1 = 2 ,whichshowsA)dynamiclightvariationovertime,B)corresponding II values,C)anoptimizedonlocalregion,andD) i 1 = 2 valuesonalltemporalregions. 39 phenotype-environmentpairs f 1 II ;i 1 ) ; 2 II ;i 2 ) ;:::; d II ;i d ) g in R n ,weidentify i n 1 = 2 by: i n 1 = 2 =argmin i 1 = 2 d X k =1 0 B @ k II II ) 1+ i k i 1 = 2 1 C A 2 (4.1) TheprocedureinEq.4.1alsoyieldsa R 2 score,indicatingthelevelofreliability oftheBasedon R 2 ,wedivideallthetemporalregionsintotwogroups,i.e.reliable data D r andunreliabledata D u .Inourexperiment, R 2 thresholdat0 : 9isusedonboththe realandthesyntheticdatasets. 4.1.3PolynomialGeneralization Usingthe i 1 = 2 valuesforthereliabletemporalregionsin D r asinputs,webuildaregularized polynomiallinearregressionmodel,aimingtogeneralizea p -orderpolynomialsmoothcurve of i 1 = 2 [63].Inaregularizedpolynomiallinearregressionmodel,orderofonerepresents alinearmodel,orderoftworepresentsaquadraticform,andorderofthreeandabove worksforarbitraryshapes.Inourexperiments,thehighestorderofpolynomialtermwas settothreebyusingcross-validation.Notethatusinghigh-orderpolynomialwouldriskin ovtting[7]. Let X =(1 ;t;i; II ;t 2 ;i 2 ; 2 II ;t 3 ;i 3 ; 3 II ) T bethevectorofpolynomialterms,and W = ( w 0 ;w 1 ;:::;w 9 ) T bethecotvectorof X ,thegeneralizedpolynomiallinearregression modelis i 1 = 2 = W T X . 40 Theaimoftheregularizedistominimize E ( W )where E ( W )= j D r j X n =1 ( W T X n i n 1 = 2 ) 2 + j W j 2 2 = j W T X i 1 = 2 j 2 2 + j W j 2 2 (4.2) where j D r j isthenumberofreliabletemporalregionsidenedinSection4.1.2, X isthe polynomialfeaturematrix, i 1 = 2 isthevectorof i 1 = 2 valueslearnedfromthereliabletemporal regions,and jj 2 2 denotes L 2 normofavector. InordertosolveEq.4.2,weset @E ( W ) @ W =0,whichyields W =( I + X T X ) 1 X T i 1 = 2 (4.3) where isaregularizationtermprovidedbyuser. Applying i 1 = 2 = W T X todataineverytemporalregion,includingtheunreliableregions, wethereforeobtainageneralizedpredictedhalf-saturationparameterforeveryregion R n , denotedas ^ i n 1 = 2 = W T X n .Thegeneralized i 1 = 2 valuesensuresmoothhiddenstatevariables underdynamicenvironments. 4.1.4BayesianMLEoptimization Givenanunreliabletemporalregion R n in D u ,weoptimizeitslocalhalflightparameter ^ i n 1 = 2 usingBayesiantheorem,suchthat ^ i n 1 = 2 bestwithboththelocaldatain R n and ^ i 1 = 2 learnedfromthegeneralizedmodelinthepreviousstep.Mathematically,welookfor optimal ^ i n 1 = 2 that ^ i n 1 = 2 =argmax i 1 = 2 P ( i 1 = 2 j R n ; ^ i 1 = 2 )(4.4) 41 Since R n and ^ i 1 = 2 areindependent,weadopttheBayesiantheoremtosolveEq.4.4: P ( i 1 = 2 j R n ; ^ i 1 = 2 )= P ( i 1 = 2 ; ^ i 1 = 2 ) P ( R n j i 1 = 2 ; ^ i 1 = 2 ) P ( R n ; ^ i 1 = 2 ) = P ( i 1 = 2 ; ^ i 1 = 2 ) P ( R n j i 1 = 2 ; ^ i 1 = 2 ) P ( R n ) P ( ^ i 1 = 2 ) = P ( i 1 = 2 j ^ i 1 = 2 ) P ( R n j i 1 = 2 ) P ( R n ) = P ( i 1 = 2 j ^ i 1 = 2 ) Q d k =1 P ( d k j i 1 = 2 ) P ( R n ) / P ( i 1 = 2 j ^ i 1 = 2 ) d Y k =1 P ( d k j i 1 = 2 ) (4.5) where d k isthe k thpairofphenotype-environmentpairin R n and R n 2 D u . InEq.4.5, P ( i 1 = 2 j ^ i 1 = 2 )representstheprobabilityof i 1 = 2 given ^ i 1 = 2 .Weassuming i 1 = 2 followsGaussiandistributionwith i 1 = 2 ˘N ( ^ i 1 = 2 ;˙ 2 1 ),where ^ i 1 = 2 isthemeanand ˙ 2 1 isthe variance.InEq.4.5, P ( d k j i 1 = 2 )representstheofthephenotype-environmentdata given i 1 = 2 .Infact,when i 1 = 2 isprovided, II canbecomputedusingEq.2.2.Withthe sameGaussiandistributionassumption,wehave P ( d k j i 1 = 2 )= N k II ;˙ 2 2 ).Hereboth ˙ 2 1 and ˙ 2 2 areuserparameters. ByadoptingtheGaussiandistributionassumptionon P ( i 1 = 2 j ^ i 1 = 2 )and P ( d k j i 1 = 2 ),and bytakingalogtransformonEq.4.5,wehave: ln Pr ( i 1 = 2 j R n ; ^ i 1 = 2 ) / ln N ( t j ^ i 1 = 2 ;˙ 2 1 )+ d X k =1 ln N k II j f PI ( t k ;i 1 = 2 ) ;˙ 2 2 ) / ( i 1 = 2 ^ i 1 = 2 ) 2 2 ˙ 2 1 d X k =1 k II max II ) (1+ i k =i 1 = 2 ) ) 2 2 ˙ 2 2 (4.6) 42 SubstitutingtherightpartofEq.4.4withEq.4.6,theequationthatcanbesolved byadoptingthemaximumlikelihoodestimation(MLE)is: ^ i n 1 = 2 =argmin i 1 = 2 0 B B @ ( i 1 = 2 ^ i 1 = 2 ) 2 2 ˙ 2 1 + d X k =1 k II max II ) (1+ i k =i 1 = 2 ) ) 2 2 ˙ 2 2 1 C C A (4.7) Notethatinordertoincreasethereliabilityofcurveinunreliableregion R n , duringtheBayesianMLEoptimizationprocess,wesearchforthephenotype-environment pairsoutside R n thathavethebestmatchtothecurveof R n ,andaddtheminto R n .Fig.4.1CDshowsanexampleofthewhichoptimizesbothlocalandthe smoothnessof i 1 = 2 . TodemonstratethePhenoCurveprocedure,Fig.4.2showsthetstagesofcurve Fig.4.2(a)presentsthegroundtruthdataforandaswellastheobserved inputdatawhichiscomposedofnoiserateat0.1andbiasat0.1.Theresultfrom slidingwindowapproach(WIN)isshowninFig.4.2(b).Basedonslidingwindowresult,the dataregionisseparatedintoreliableregionsandunreliableregionsasshowninFig.4.2(c). Lastly,PhenoCurveusethereliableregionstogeneralizedhalf-saturationparameters(as showningreencurvewithmarkinFig.4.2(d))andmakethepredictionontheunreliable regions(asshowningreencurvewithoutmarkinFig.4.2(d)). 4.2ExperimentalResults WeevaluatedtheperformanceofPhenoCurveonboththerealandsimulatedphenotype dataintermsofperformanceandreliability.WealsocomparedPhenoCurve witheexistingmethodsi.e.1)thedirectcomputationwithPIfunction(PI),2)theone- 43 (a)GroundTruthandInputdata. (b)Slidingwindowapproach(WIN). (c)Separationoftrainingandtestingregions. (d)PhenoCurvegneralizationandBayesoptimiza- tion Figure4.2:DemonstrationofPhenoCurverunningexampleonsyntheticdata. windowcurve(ONE),3)thesliding-windowbasedcurve(WIN),4)thekernel smoothingmethodusinglocallinearregression(LLR),and5)theBayesianlinearmodel withnormalinversegammaprior(NIG).Allthesemethodshavebeenintroducedinthe RelatedWorkSection. 4.2.1ExperimentalData WetestedtheperformanceofPhenoCurveusingtherealplantphotosyntheticphenotype dataconsistingof375Arabidopsisthalianaplants(330coT-DNAinsertionmutants 44 and45wildtypeplants)[3,2].Duringtheexperiment,alltheplantswereevenlysampled at32timepoints.Alltheenvironmentalfactorsexceptlightareconstant.Followinga sinusoidalcurve,lightintensitychangesgraduallyfrom35 m 2 s 1 to500 m 2 s 1 thengoesbackto35 m 2 s 1 duringtheexperiment.Thephotosyntheticphenotype valuesvarydramaticallyacrossplants,potentialindevelopment,stress responsesorregulationofprocessessuchasstomatalconductance,photodamageandstorage ofphotosynthate[32,14]. Thesyntheticdataweregeneratedinthreesteps.First,werandomlyavectorof i 1 = 2 thatchangesgraduallyovertime.Second,wereconstructedavectorofphenotype II usingthevectorof i 1 = 2 ,thesamevectoroflightastherealdata,andthePIfunction[38]. Third,werandomlyaddednoiseandbias(levelsvaryfrom5%to15%)tothephenotype data.Theprocesswasrepeated4,000timestogeneratethefullsyntheticdata(seeTable S1). 4.2.2EvaluationCriteria Wefourperformanceevaluationcriteria,andappliedallofthemonboththeunreliable regions D u andthereliableregions D r ,aswellasthewholeregionsofthephenotypedata D . First,cotofdetermination,denotedas R 2 ,isoftenusedasthemaincriteriafor measuringwhetheracurveisadequate[28,10].Inourcase, R 2 isas: R 2 =1 P n i =1 II ( t i ) ^ II ( t i )) 2 P n i =1 II ( t i ) II ) 2 (4.8) where ^ II ( t i )isthe II valueattime t i ,and II istheaveraged II valuesina 45 temporalregion.InEq.4.8, R 2 measuresthefractionofthetotalvariationinthephenotype datathatcanbeexplainedbythecurve.Highervaluesindicatethatthecurvethedata better.If R 2 =1 : 0,allpointslieexactlyonthecurvewithnoscatter. Second,wecomputethesmoothnessofthe i 1 = 2 vector.Foracontinuouscurve,smooth- nesscanbemeasuredusinghighorderofderivatives.Fordiscretevalues(whichisourcase), wemeasurealltheanglesformedbyadjacenttemporalregions.Mathematically,wehave: smoothness ( i 1 = 2 )= 1 j D j j D j X n =1 [ n T ](4.9) where D isasetoftemporalregions, n = j arctan( i n +1 1 = 2 i n 1 = 2 ) arctan( i n 1 = 2 i n 1 1 = 2 ) j representstheanglecenteredaroundtemporalregion R n ( R n 2 D ), T isauser givenanglethreshold,and[ X ]returns1ifthecondition X isotherwiseitreturns 0.Inourexperiment, T =30 o . Finally,forsyntheticdata,wecomputedboth II and i 1 = 2 ,whicharethesumof alltheabsolutebetweeneveryphenotypevalueanditscorrespondingvalueon thecurveandthesumofalltheabsolutebetweenevery i 1 = 2 valueandits correspondingparameterofthePIcurverespectively. 4.2.3ExperimentalResultsonRealData WeranPhenoCurveontherealphotosynthesisphenotypedatausingthreetwindow sizes,i.e.6,12and18.WecomparedtheperformanceofPhenoCurvewitheexisting methodsusingthesamecommonparameters. TheexperimentalresultsindicatethatPhenoCurvehastlymorereliablecurve results.Fig.4.3showsthatontheunreliableregions,thecotofdetermination 46 (a) R 2 onunreliableregions. (b) R 2 onwholedata. (c)Smoothnessonunreliableregions. (d)Smoothnessonwholedata. Figure4.3:Cotofdetermination R 2 andsmoothnessof i 1 = 2 ontheunreliableregions andthewholerealphenotypedata. ( R 2 )increases23%from0.72to0.89whenwindowsizeis12,comparedwiththesliding- windowcurvemethod.Meanwhile,thesmoothnessofthecurveincreasesfrom 0.55insliding-windowmethodto0.63inPhenoCurveapproach.Amongalltestsonrealdata, PhenoCurveyieldshighest R 2 inbothunreliabledataandwholedataparts.PhenoCurve alsooutputssmoothestcurveinallcasesexcepttheonewithwindowsize=6.Overall, PhenoCurveperformsbestinexperimentsonrealdata. 47 (a) R 2 onunreliableregions. (b) R 2 onwholedata. (c)Smoothnessonunreliableregions. (d)Smoothnessonwholedata. Figure4.4:Cotofdetermination R 2 andsmoothnessof i 1 = 2 ontheunreliableregions andthewholesyntheticphenotypedata. 4.2.4ExperimentalResultsonSyntheticData Similarly,wecomparedtheperformanceofPhenoCurveandtheveexistingmethodsona syntheticphenotypedatausingthreetwindowsizes. TheexperimentalresultsindicatethatPhenoCurvehasmostreliableandsmoothestcurve results,andithaslowesterrorratefor II and HL .Fig.4.4showsthatontheunre- liableregions,thecotofdetermination( R 2 )increasestlyfrombarely0.09in sliding-windowmethodto0.67inPhenoCurvewhenwindowsizeis12.Besides,smoothness ofPhenoCurveisalwaysthebestamongallmethods.Fig.4.5showsthecomparisonof 48 (a) II onunreliableregions. (b) II onwholedata. (c) i 1 = 2 onunreliableregions. (d) i 1 = 2 onwholedata. Figure4.5: II and i 1 = 2 onthesyntheticphenotypedatawith0.10noiseandbiasrate. errorrate II and i 1 = 2 .Onunreliabledataregions,PhenoCurveperformsbetterthan allothermethodsintermof i 1 = 2 .Asfor II ,PhenoCurveandonewindowapproach hasbestperformanceinunreliableregions,whilePhenoCurveperformsbestinwholedata. ThisisduetothefactthatPhenoCurvenotonlyoptimizeunreliabledata,butalsooptimize thereliableregions.Overall,PhenoCurveperformsbestonsyntheticdata. 49 4.2.5BiologicalV Givenallthe i 1 = 2 valuesofalltheArabidopsismutantlines,wecomputedthemaximal relative i 1 = 2 as max j i 1 = 2 ( t m ) i 1 = 2 ( t n ) j foralltimepoints t m and t n satisfying i ( t m )= i ( t n ) and m 6 = n .Thepurposeistochecktherecoveryabilityofaplantbyprobingitbeforeand afterlightstress.ThedistributionofdatainFig.4.6showsthatwecaneasilyclusterall theArabidopsismutantlinesintothreegroups A , B , C asindicatedintheMutants ingroupAandBhavehighphotosynthesis,whereasgroupCmutantshaverelativelylow photosynthesis.MutantsingroupAandCaremoreresistanttodynamiclightstressthan groupBmutants. Inordertoverifythebiologicaldiscovery,abiologicalexperimentbymeasuring qI (pho- toinhibition)ofthesameplantsusingDEPI[14]maypotentiallybedesigned.Insummary, learning i 1 = 2 withPhenoCurvecangreatlysimplifytheprocesstoidentifythecomplicated phenotype-environmentrelationshipsandtovisualizetheresults,enablingbiologiststodis- coverthemechanismthatregulateplantsinresponsetodynamicenvironment. 50 Figure4.6:Phenotypeclusteringbasedonmaximalrelative i 1 = 2 andmaximal II . 51 Chapter5 SummaryandConclusions Inthisthesis,wedesignedtwotechniquesforcontroldataqualityandexploredataof phenotypingvalues.Theproposedapplications,namely,DynamicFilterandPhenoCurve, achievetheultimatetaskaroundtwoquestionsincluding Howcanwegethighqualitydatafromproblematicrawdata? Howcanwedocurvetingfordynamicallychangingphenotypingvalueswithhigh reliability? 5.1Contributions Thisthesismainlyanswersthetwoquestionsraisedabovebyproposingspecapplications asfollows,givingtheoreticalstudiesandprovidingrealcaseandsyntheticcasetests. 5.1.1Datacleaningforhighdataquality Withanaimtowardsidentifyingtargetsforimprovingenergyyield,advancedtechnologies inhigh-throughputplantphotosyntheticphenotypinghavebeendeveloped([29,14]).These systemscanbeusedtoquantifyphotosyntheticbehavioringeneticallydiversepopulations andtodrawrelationshipsamonggenotype,phenotypeandbiologicalfunction,leadingto betterunderstandingoftheunderlyingmechanismsthatcontrolthephotosyntheticproper- 52 tiesundervariousenvironmentalconditions([52,19]).Asaconsequenceofthelong-time high-throughputplantphenotyping,thescaleofplantphenomicsdatagrowsexponentially. However,thequalityofphenotypedatamaybeskewedbysourcesofnoisethatare toremoveinthedatacollectionstep. Thepurposeofplantphenotypingistodiscoverphenotypevaluesthataretly tfromareference.Butphenotypevaluesleadingtobiologicaldiscoveriesmaybe obscuredbyabnormalvaluescausedbyerrorsduringdetection.Toensurehighdataquality, edatacleaningshouldbeconsideredaprimarytask.However,sinceadvanceddata cleaningalgorithmsareprimarilybasedonindiscriminateoutlierdetection,theymayremove bothabnormalitiesandbiologicaldiscoveriesnotseparableinthedatadistribution. Wehavedevelopedanewmodelcalled DynamicFilter toely identifybothabnormalitiesandbiologicaldiscoveriesbyadoptingawidely-usedphotosyn- theticmodel.Spally,DynamicFilterisaresidualanalysisapproachbydynamically tracingstatisticaldistributionsofallsamplesratherthanindividuals,andincorporatingEM forperformanceoptimizationincheckingregions. Wenotethat,certainevents,suchastransientchangesingrowthenvironment,could introducesignalssimilartogrowthlightingmalfunction,whichcouldbewronglylabeled asabnormalitiesbyDynamicFilter.Therefore,insteadofautomaticallydeletingallthe predictedabnormalities,wesendallofthemtodomainexpertsfortion.Meanwhile, allrawdataarekeptforanyrollbackoperation. Experimentalresultsshowthatourmodelistlybetterthantheexistingdata cleaningtoolsonbothrealphenomicsdataandsyntheticdata.DynamicFiltermayhave awideimpactbecauseoftherapidincreaseoflarge-scalephenotypingtechnologies.It shouldbenotedthatalthoughweusedaphotosynthesis-spcurve,themodelitselfis 53 independentofactualbiologicalconstraints.Inprinciple,ourapproachcanbeusedtoclean dataforanynumberofphenotypesaslongassuitabletheoreticalcurvescanbederivedfor theirbehavior.Implementationfornewusecaseswouldinvolvesubstitutingtheappropriate theoreticalcurveintotheprogram,calculatingtheresidualsoftothedatasets,and optimizingtheprocedure(asinFig3.1). 5.1.2Inter-functionalanalysis Photosynthesisphenotypesareusuallymeasuredunderdynamicconditions,inarelatively longtimeperiod,andonplantswithvastlytgeneticbackgrounds.Tomeetthe growingneedstostudythedynamicrelationshipsbetweenphenotypesandenvironmental factorsusinglimitedbiologicalknowledge,wedevelopedanewtoolcalledPhenoCurve. PhenoCurvesplitsthewholephenotypeandenvironmentdataintohighlyoverlapped temporalregions,employsnon-linearcurvemethodstocalculate i 1 = 2 forthereliable regions,andthenoptimizethe i 1 = 2 fortheunreliableregionsusingpolynomialgeneralization andBayesianMLE.TheresultsonbothsyntheticandrealdatashowthatPhenoCurveos tlybetterthantheothereexistingmethods. Table5.1:TestingtherobustnessofPhenoCurveagainstwindowapproachwithmul- tiplenoiseandbiasrates. Unreliabledata Wholedata noiseandbias R 2 " smoothness " II # HL # R 2 " smoothness " II # HL # 0.05 67%7%83%35% 24%20%33%15% 0.10 87%7%54%50% 36%5%32%25% 0.15 118%3%46%42% 46%5%27%22% InordertotesttherobustnessofPhenoCurve,weranPhenoCurveonsyntheticdatawith tlevelsofnoiseandbias.TheresultsshowninTable.5.1indicatesthatPhenoCurveis moreethanwindowapproachwiththeincreaseofnoiseandbiasrate.Compared towindowapproach,PhenoCurveyieldshight R 2 ,smoothercurve,lowererrorin II 54 andhalflightparameters. 5.2Conclusions Thisdissertationanswersthetwoquestionsraisedinthebeginningofthischapterbypresent- ingmachinelearningtechniquesindataqualitycontrolanddataexplorationofdynamically changingphenotypingvalues.Toaddresstheseproblems,DynamicFilterandPhenoCurve areproposedtocleanrawdataandexplorecleaneddata. Theconcludedresearchmakestcontributionsto(i)therealisticsolutionfor maintainhighqualityandcleanproblematicdatafromrawinput,(ii)thechallengesof exploringandobtainingreliableandcomparablehiddenparametersfromphenotypingdata. 55 Chapter6 FutureWork 6.1FutureWork Thestudiespresentedinthethesisleadtoseveralimportantresearchquestions,whichmay beextendedinthefuture. 6.1.1DynamicFilterfordataqualitycontrol Futureworkmayincorporatesemi-supervisedlearningtechniquesintotheapplication.Cur- rentversionofDynamicFilterprocessresultwithonepass,anddoesnotinteractwith humanvtoautomaticallyadjusttheparametersetting. Asforapplicationpotential,aslongasasuitabletheoreticalcurveisgiven,thetheoretical photosyntheticcurvecanbeupdatedwiththenewone,andresidualsofthecurvecanbe learnedatthecoarselevel,andthenbeoptimizedinthelevel(seeFig3.1).Therefore, itispossibletoautomatethecleansingprocessonanylongtime-seriesdataforavarietyof applications. 6.1.2Inter-functionalanalysis Inthefuture,wewillextendPhenoCurvetomodelmultiplephenotypesandmultipleenvi- ronmentalfactors.Oneofthetechnicalchallengesisthatwiththeincreaseofdatatypes, 56 thenumberofparameterswillrapidlyincreased,whicheitherrequirestlymore samplingdata,orrequiresabetteralgorithmtomodelthedata. 57 APPENDIX 58 Pseudo-codeofDynamicFilter Algorithm4 DynamicFilter:DynamicFilterAlgorithm Input: f II g :arrayofphotosynthesisintensity,Light,99% light :arrayoflightintensity 99%:99%istheuser-spinterval 1: [ RSD;I i= 2 t ] PI CurveFitting( f II g ;Light )2.1 2: [ main ;˙ main ] GMM ( RSD ) 3: [ r min ;r max ] getInterval( main ;˙ main ; 99%)5 4: C UpdateCandidate( RSD;r min ;r max )6 5: repeat 6: ( f II g ; f RSD g ;Light ) 7: C ;C;k 0 )1 8: f r g UpdateWindow( C;num )7 9: for r in f r g do 10: r EM r ;Light; 99%)3 11: endfor 12: C Consensus(W)8 13: until C isstable 14: Output: C Algorithm5 Sub-procedures:getInterval Input: :meanofGaussiandistribution ˙ :standarddeviationofGaussiandistribution 99%:userchosencinterval,correspondsto3 ˙ 1: r min 3 ˙ 2: r max +3 ˙ 3: Output: [ r min ;r max ] 59 Algorithm6 Sub-procedures:UpdateCandidate Input: R :arrayofresidual r min :minimumresidualthreshold r max :maximumresidualthreshold 1: for r in R do 2: if r min r r max then 3: C r 0 4: else 5: C r 1 6: endif 7: endfor 8: Output: C Algorithm7 Sub-procedures:UpdateWindow Input: C :candidatearray num :maximumnumberofabnormalityineachregion 1: for abnormality c in C do 2: repeat 3: w:abnormality ( end +1) c 4: next c 5: until length ( w:abnormality ) > = num 6: w:normal [ num pointsahead, num pointsafter] 7: W i w 8: clear w 9: endfor 10: Output: W Algorithm8 Sub-procedures:Consensus Input: W :arrayofwindows 1: ( Count n ;Count ab ) ([] ; []) 2: for w in W do 3: for Idx n in w:normal do 4: Count n [ Idx n ] Count n [ Idx n ]+1 5: endfor 6: for Idx ab in w:normal do 7: Count ab [ Idx ab ] Count ab [ Idx ab ]+1 8: endfor 9: endfor 10: C [ i ] ( Count n [ i ]