DATAQUALITYCONTROLANDINTER-FUNCTIONALANALYSISONDYNAMIC
PHENOTYPE-ENVIRONMENTALRELATIONSHIPS
By
LeiXu
ATHESIS
Submittedto
MichiganStateUniversity
inpartialentoftherequirements
forthedegreeof
ComputerScience-MasterofScience
2016
ABSTRACT
DATAQUALITYCONTROLANDINTER-FUNCTIONALANALYSISONDYNAMIC
PHENOTYPE-ENVIRONMENTALRELATIONSHIPS
By
LeiXu
Plantphenomicshavebecomeessentialcomponentofmodernplantscience.Suchcom-
plexdatasetsarecriticalforunderstandingthemechanismsgoverningenergyintakeand
storageinplants.Large-scalephenotypingtechniqueshavebeendevelopedtoconducthigh-
throughputphenotypingonplants.However,amajorissuefacingtheseisthede-
terminationofthequalityofphenotypicdata.Automatedmethodsareneededtoidentify
andcharacterizealteractionscausedbysystemerrors,allofwhicharetoremovein
thedatacollectionstep.Anotherissueiswearelimitedbythetoolstoanalyzefullythe
phenomicsdata,
esp.
thedynamicrelationshipsbetweenenvironmentsandphenotypes.
Theoverarchinggoalofthisthesisistoexploredynamicphenotype-environmentaldata
viadatamining/machinelearningmethods.Rawdatameasuredfrombiologicaldevicesis
pre-processedtonumericaldata,thencleanedbyDynamicFiltertoensurehighdataquality
forfurtheranalysis.Thecleaneddataisfurtherexploredandappliedwithinter-functional
analysisinordertopatternsthatcomplywithbothmachinelearningmethodologies
andbiologicalconstraints.
Inthisthesiswedevelopedtwotoolstomakeexplorationofphenotypingdataavailable:
(1)Fordataqualitycontrol,wedevelopedamodelcalledDynamicFilter
toidentifyabnormalitiesinplantphotosynthesisphenotypedata.(2)Forinter-functional
phenomicsdataanalysis,wepresentanewalgorithmcalledPhenoCurveforinter-functional
phenomicsdataanalysis.
ACKNOWLEDGMENTS
Firstandforemost,Ifeelindebtedtomyadvisor,ProfessorJinChen,forhisguidance,
encouragement,andinspiringsupervisionthroughoutthecourseofthisresearchwork.His
patience,prudentialattitude,extensiveknowledge,andcreativethinkinghavebeenthe
sourceofinspirationforme.HewasavailableforadviceoracademichelpwheneverIneeded
andgentlyguidedmefordeeperunderstanding.WhenIhesitatedbetweenresearchprogram
andcourseprogramtwoyearsago,Iwasnotsureaboutmydecision,butnowIamsohappy
andproudtosaythatImadeasowisedecision.Itisextremelyhardtoexpresshowgrateful
Iamforhisunwaveringsupportoverthelasttwoyearsandinthecomingfuture.
IwouldliketotakeonthisopportunitytothankmythesiscommitteemembersProfessor
ArunRossandProfessorYanniSunwhohaveaccommodatedmytimingconstraintsdespite
theirfullschedules,andprovidedmewithpreciousfeedbackforthepresentationofthe
results,inbothwrittenandoralform.IwouldalsoliketothankProfessorRongJinforhis
helpandadviceduringmygraduateprogramstudies.
DuringmyMasterstudies,Ihadthepleasureofcollaboratingwithmanyresearchers
fromeachandeveryoneofwhichIhadthingstolearn,andthequalityofmyresearchwas
considerablyenhancedbytheseinteractions.IwouldliketothankDavidMKramer,
ACruz,LindaJSavage,QiaoziGao,OliverTessmer,YifanYang,andZheyunFengforall
thediscussionswehadandthefunmomentswespenttogetherondoingresearchandfuture
planning.
LivinginEastLansingwithoutmygoodfriendswouldnothavebeeneasy.Iwantto
thankallmyfriendsinthedepartmentandoutsidethedepartment.
Lastbutnotleast,Iwanttoexpressmydeepestgratitudetomybeloved
iii
parentsanddearestgirlfriend.Theirloveandunwaveringsupporthavebeencrucialtomy
success,andaconstantsourceofcomfortandcounsel.
iv
TABLEOFCONTENTS
LISTOFTABLES
....................................
vii
LISTOFFIGURES
...................................
viii
LISTOFALGORITHMS
...............................
x
Chapter1Introduction
................................
1
1.1Dataqualitycontrol...............................3
1.2Inter-functionalAnalysis.............................4
1.3ThesisContributions...............................6
1.4ThesisOverview..................................7
1.5BibliographicNotes................................8
Chapter2Background
................................
9
2.1Datacleaningforhighdataquality.......................9
2.1.1Relatedwork...............................10
2.2Inter-functionalanalysis.............................11
2.2.1Relatedwork...............................13
2.2.1.1SlidingWindowbasedCurveFittingMethods........13
2.2.1.2BayesianLinearModelwithNormalInverseGammaPrior.14
Chapter3DatacleaningwithDynamicFilter
..................
16
3.1Methods......................................16
3.1.1TheoreticalPhotosyntheticCurve....................18
3.1.2FrameworkofDynamicFilter......................20
3.1.2.1Step1.Coarseprocesstoidentifyabnormalcandidates...23
3.1.2.2Step2.KNNprocesstoabnormalityiden.23
3.1.3RelatedWorks..............................26
3.1.3.1GaussianMixtureModel....................26
3.1.3.2LinearDiscriminantAnalysisforFeatureSelection......27
3.2Experiment....................................28
3.2.1Realphenotypedataset..........................29
3.2.2Syntheticdataset.............................33
Chapter4Inter-functionalanalysisbyPhenoCurve
..............
37
4.1Method......................................37
4.1.1DataSeparationwithSlidingWindow.................38
4.1.2LocalCurveFitting............................38
4.1.3PolynomialGeneralization........................40
v
4.1.4BayesianMLEoptimization.......................41
4.2ExperimentalResults...............................43
4.2.1ExperimentalData............................44
4.2.2EvaluationCriteria............................45
4.2.3ExperimentalResultsonRealData...................46
4.2.4ExperimentalResultsonSyntheticData................48
4.2.5BiologicalV..........................50
Chapter5SummaryandConclusions
.......................
52
5.1Contributions...................................52
5.1.1Datacleaningforhighdataquality...................52
5.1.2Inter-functionalanalysis.........................54
5.2Conclusions....................................55
Chapter6FutureWork
................................
56
6.1FutureWork....................................56
6.1.1DynamicFilterfordataqualitycontrol.................56
6.1.2Inter-functionalanalysis.........................56
APPENDIX
..................................
58
BIBLIOGRAPHY
...................................
67
vi
LISTOFTABLES
Table5.1TestingtherobustnessofPhenoCurveagainstwindowapproach
withmultiplenoiseandbiasrates....................54
Table1Performancecomparisonon63typesofsyntheticdatasets.Eachpair
ofthescoreisaMCCscoreanditsstandarddeviation.Boldfont
indicatesthebestperformanceineachdataset.............63
Table2Performancecomparisonon63typesofsyntheticdatasets.Eachpair
ofthescoreisaTPRscoreanditsstandarddeviation........65
vii
LISTOFFIGURES
Figure2.1Photosynthesis-Irradiance(PI)curve..................12
Figure3.1TheframeworkofDynamicFilter....................17
Figure3.2AnexampleofDynamicFilter.Thesolidpointsareabnormalities
andthehollowpointsarenormalvalues................21
Figure3.3Performanceevaluationofprecision-recallandMatthewsCorrelation
Cotonrealdataset.DFrepresentsDynamicFilter.......29
Figure3.4PerformanceimprovementbyapplyingtheKNNtprocess.31
Figure3.5Performancecomparison.Eachversioncorrespondstoat
versionofDynamicFilter(DF)without:KNNt(v5),it-
erationofEM(v4),consensusonregions(v3),reassignmentofnor-
mal/abnormallabelsinEM(v2),orthewholeedprocess(v1)..32
Figure3.6AcasestudyontherealdatashowsthatDynamicFiltercorrectly
idenalltheabnormalities......................33
Figure3.7AcasestudyontherealdatashowsthatDynamicFilteriden
truebiologicaldiscoveriesunderthediurnallightcondition.Lines
withthesamemarkerrepresentbiologicalreplicates..........34
Figure3.8Performancetestontemporalcheckingregionsize.(a)Fixingmax
numberofabnormalities
n
,andvarying
m
;(b)Fixing
m
,andvarying
maxnumberofabnormalities
n
.....................35
Figure3.9TheMatthewsCorrelationCoentofbiologicaldiscoveriesand
abnormalitiesonsyntheticdata.....................35
Figure4.1AnillustrativeexampleofPhenoCurvethatoptimizesbothlocal
tingandsmoothnessof
i
1
=
2
,whichshowsA)dynamiclightvariation
overtime,B)corresponding
II
values,C)anoptimizedon
localregion,andD)
i
1
=
2
valuesonalltemporalregions.......39
Figure4.2DemonstrationofPhenoCurverunningexampleonsyntheticdata..44
Figure4.3Cotofdetermination
R
2
andsmoothnessof
i
1
=
2
ontheunre-
liableregionsandthewholerealphenotypedata............47
viii
Figure4.4Cotofdetermination
R
2
andsmoothnessof
i
1
=
2
ontheunre-
liableregionsandthewholesyntheticphenotypedata.........48
Figure4.5
II
and
i
1
=
2
onthesyntheticphenotypedatawith0.10noiseand
biasrate..................................49
Figure4.6Phenotypeclusteringbasedonmaximalrelative
i
1
=
2
andmaximal
II
.51
ix
LISTOFALGORITHMS
Algorithm1KNNprocesstoresults........................................24
Algorithm2FeatureSelection.....................................................24
Algorithm3EMoptimizationoneachlocalregion
r
..............................25
Algorithm4DynamicFilter:DynamicFilterAlgorithm...........................59
Algorithm5Sub-procedures:getInterval..............................59
Algorithm6Sub-procedures:UpdateCandidate..................................60
Algorithm7Sub-procedures:UpdateWindow....................................60
Algorithm8Consensus...........................................................60
x
Chapter1
Introduction
Plantscapturesunlightto
CO
2
intoenergyrichmolecules,thussupplyingourecosystem
with
O
2
andessentiallyallofitsbiologicalenergy,including100%ofourfood.Inplants,
photosynthesisistheprimaryenergysourceformetabolismandgrowth.Understandinghow
plantsoptimizeorregulateitinresponsetoacontinuouslychangingandunpredictableen-
vironmentisanessentialcomponentfordevelopingstrategiestoimprovecropyieldstomeet
ourgrowingneedsforfoodandfuel[32].Recentworkhasfocusedonimprovingthe
ofphotosynthesistomeetourgrowingneedsforfoodandfuel([9,65,32]).Inordertode-
velopoostingmechanismsthatreduceenergylossesorenhance
CO
2
deliveryto
cellsduringphotosynthesis,advancedtechnologiesinhigh-throughputplantphotosynthetic
phenotypingandpheno-informaticshavebeendeveloped([74],[29,61,14]).Thesetechnolo-
gieshaveallowedplantphotosynthesisphenotypicvariabilitytobecharacterizedandtobe
relatedtoputativebiologicalfunctions,leadingtoabetterunderstandingoftheunderlying
mechanismsthatcontrolphotosyntheticpropertiesundervariousenvironmentalconditions.
Plantphenomicsisaassetforunderstandingthemechanismsregulatingenergy
intakeinplants([52,19]).
Plantphenotypingsystemsmonitorphotosyntheticperformanceformanyplantsboth
continuouslyandsimultaneously.Phenomicsdatasetsarelargeandcontinuetogrowas
weincreasedurationofsamplingandresolution.Yetdespitethesizeandrichnessofthe
data,smallclustersoferroneousvalues,whichgivetheappearanceofrealin
1
biologicalresponses,canskewtheanalysistowardsaninvalidinterpretation([27]).Thereare
severalwaysinwhichameasurementcanbeinerror:errorsoriginatingfrominstrumentation
malfunctions,biasedvaluesfrommiscalibratedsensors,andinevitableerrorsofprecision.All
theseissuescompromisethedownstreamdataanalysistasks.Giventhevalueofcleandata
foranyoperation,theabilitytoimprovedataqualityisakeyrequirementfore
knowledgeminingfromlarge-scalephenotypedata.
Advancedtechnologiesinhigh-throughputplantphotosyntheticphenotypinghavebeen
recentlydeveloped,allowingvariousphotosynthesisparameters(suchas
II
,
q
E
,
q
L
,
q
I
,and
NPQ
)tobetocharacterizedandtoberelatedtoputativebiologicalfunctions[29,52,59,
14].Inparallel,machinelearningandcomputervisionalgorithmshavebeendevelopedfor
phenotypeinformationretrieval,dataqualitycontrol,andknowledgediscovery[24,62,71,
72,69,21].
Withthelargevolumeofplantphenotypedatabeengenerated,normalizedandcleaned,
biologistsexpecttoimmediatelyidentifymutantstrainswithtphotosynthesisma-
chinery,andquicklygenerateandtestbiologicalhypothesesthatmayleadtonewbreak-
throughinbio-energyresearch.Indeed,computationaltoolshavebeendevelopedtoidentify
temporalpatternsfromphenomicsdata[70],togroupplantsbyphenotypes[21],andto
predictunknowngenefunctions[64].Nevertheless,ithasbeenemphasizedinliteraturethat
itiscrucialtosimultaneouslymeasureandmodelmultiplekindsofphenotypesandenviron-
mentalfactorstoarriveataholisticcharacterizationofplantperformance[68,33,25,66].
Sp,photosynthesismustrespondtochangingenvironmenttoprovidetheoptimal
amountofenergytomeettheneedsoftheorganism,inthecorrectforms,withoutproducing
toxicbyproducts,e.g.reactiveoxygenspeciesorglycolate[32].Inthiscontext,photosyn-
thesisisasetofintegratedmodulesthatformaself-regulatingnetworkmodulatedbysignal
2
transduction.
Thischapterisdevotedtoanoverviewofthesetwobroadtopicsofexploringphenotyping
data,aimingtodevelopageneralcorrespondencefromdatacleaningtointer-functional
analysis.Herewemovetowardstotheinafairlynon-technicalmannerandthe
formaldetailedwillbegiveninChapter2.
1.1Dataqualitycontrol
Inthisthesis,wefocusondataabnormalitiesdetection,whichisatypeofmeasurement
error,inordertodemonstratehowcleanphenotypedatacanbeobtained.Similartosensor
data,abnormalitiesinplantphenotypedatadeviatetlyfromexpectedpatterns
andarevisibleoutliersinthewholedataset([56,58]).Themajorityofabnormalitiesin
plantphenotypingoriginatefrominstrumentationmalfunctions(e.g.,lossofsensorsynchro-
nizationduringmeasurement),ornon-biologicalstatisticaloutlierscausedbydatacollection
limitations(e.g.,deteriorationofsignal-to-noiseratioforasampleasitprogressesthrough
theexperiment).
Dataabnormalitiesareoftenviewedasoutliersinthewholedataset.Recentworkhas
showntheenessofapplyingdataminingtechniques,especiallyoutlierdetection,for
thepurposeofdatacleaning([39]),makingitpossibletoautomatethecleansingprocess
foravarietyofdomains([48,12,40,17]).Inthesemethods,bydetectingtheminoritiesof
valuesthatdonotconformtothegeneralcharacteristicsofagivendatacollection,outliers
areidenandareconsideredviolationsofassociationrulesorotherpatternsinthedata.
However,theexistingmodelsarenotsuitableforphenotypedatacleaning.Thesemethods,
whileappliedtophenotypedata,mayremoveoutliersincludingbothmeasurementerrors
3
andtruebiologicaldiscoveries,sincetruebiologicaldiscoveries,tosomeextent,areoutliers
aswell.Furthermore,detectingabnormalitiesfromlongtime-seriesphenotypedatarequires
handlingahightemporaldimension,whichincreasesthemodelcomplexity.
Inordertoidentifyandremoveabnormalitiesinphenotypedataandtominimizethe
deletionofbiologicaldiscoveries,wehavedevelopedadresidualanalysis
algorithm,called
DynamicFilter
.DynamicFilterhasthreekeysteps:1)identifyabnormal
candidatesatthecoarselevel,2)abnormalityideninaprojectedfeature
space,and3)iterativelyidentifyabnormalitiesatthelevel.DynamicFiltercan
speedupthedatapreparationprocessandmakeitmoree.Suchimprovementswill
minimizetimeconsumingandlaborintensivedatapreparationandincreasethe
andinbiologicaldiscoveries.Insummary,ourmodelhasthefollowingadvantages:
*
Toourknowledge,DynamicFilteristheworktointegratebiologicalconstraints
withtime-seriesphenotypedatafordatacleaning.
*
Ourmodelcanidentifybothabnormalitiesandbiologicaldiscoveries.
*
Dynamicteroutperformstheexistingsolutionsbyoptimizingthebetween
phenotypedataandbiologicalconstraints.
1.2Inter-functionalAnalysis
Studyingthecomplexrelationshipsbetweenphenotypeandenvironmentposesseveralcom-
putationalchallenges.First,giventhatthephenotype-environmentrelationshipislargely
unknown,peopletendtolearntheinter-functionalrelationshipusingdatadrivenapproaches
suchaslinearregressionorcurveting.However,1)itistochoosethebest
4
functionsincetherelationshipisusuallynon-linear[55];2)biologicalknowledgefordescrib-
ingtheorganismresponsestoenvironmentalchangescannotbeelyincorporatedinto
thesemodels;3)thepurelydatadrivenapproachesmaybetlybyoutliers
andbiasinphenomicsdata,resultingininaccuratephenotype-environmentrelationships.
Second,ifaphenotype-environmentrelationshiphasbeenwellstudied,peopletendtodi-
rectlydataontoaknownbiologicalmodel(suchasMichaelis-Mentenkinetics,oneofthe
bestknownmodelsofenzymekinetics)[16,11].Usingabiologicalmodel,phenotypedata
canbeconvertedintolesscomplicatedmodelparametervalues,whicharebiologicallymean-
ingfulandareeasiertointerpretthanrawdata[18].However,biologicalmodelsaresimple
comparedwiththerealworldsituation.Itisinappropriatetodirectlyuseastatictheoretical
modeltolearndynamicrelationshipsthatvaryovertimeandenvironmentalconditions[69].
Finally,photosynthesisphenotypesareusuallymeasuredunderdynamicenvironmentalcon-
ditions,inarelativelylongtimeperiod,andonmanyplantswithvastlytgenetic
backgrounds.Thebroadrangeofdatavariationaddsanotherlevelofcomplexitytothe
problem.Insummary,newinter-functionalalgorithmsarerequiredtoexplorethecomplex
phenotype-environmentrelationships.
Inthisarticle,wepresentacomprehensivedataanalysisapproachcalled
PhenoCurve
toexplorethedynamicphenotype-environmentrelationships.Comparingtotheexisting
methodsthatsolelymodelphenotypesorenvironmentalfactors,PhenoCurve,whichenables
researcherstomodelphenotypesandenvironmentalfactorssimultaneously,hasthreema-
joradvantages.First,althoughphenotypeandenvironmentaremeasuredseparatelywith
tsensortechniques,theyarebiologicallycorrelated.Studyingtheinter-functional
relationshipmayrevealpatternsthatcannotbediscoveredbyonlyusingphenotypeorenvi-
ronmentaldata.Second,theinter-functionalanalysisallowsustocondensethehugeamount
5
ofphenotypedataintoasuccinct,highlycompactedform,whichwillballthefollowed
dataminingprocessessuchasclusteringandgeneranking.Third,ttopurelydata
drivenapproaches,PhenoCurveisabletofurtherimproveitsperformancebyincorporating
preciousbiologicalknowledgeonplantphotosynthesisphenotypes.Inthefollowingcontent,
wedemonstratetheenessofPhenoCurvebyidentifyingthedynamicrelationshipsbe-
tweenakeyphotosynthesisphenotype
II
(steadystatequantumyieldofphotosystemsII)
andlightintensity(denotedas
i
).PhenoCurvecanbeeasilyextendedforotherphenotype
andenvironmentdatawithsimplemo
1.3ThesisContributions
Inthissectionweshallelaborateonthemainproblemsconsideredinthisthesisandourkey
contributionstoaddresstheseproblems.
Thisdissertationmainlydealswiththeexploringphenotypingdatausingmachinelearn-
ing/dataminingtechniques,includingdataqualitycontrolandinter-functionalanalysis.
Generally,weattempttodevelopanapplicationthatisabletoidentifyoutlierswithhigh
performance,andweattempttodesignanalgorithmforreliableandrobustcurvefor
phenotypingvalues.

DynamicFilter
.ThethesisproposesamodelcalledDynamicFil-
tertoidentifyabnormalitiesinplantphotosynthesisphenotypedatabycomparing
lightresponsesofphotosynthesisusingaedkineticmodelofphotosynthesis.
DynamicFilteremploysanExpectation-Maximizationprocesstoadjustthekinetic
modelincoarseandregionstoidentifybothabnormalitiesandbiologicalout-
liers.Theexperimentalresultsshowthatouralgorithmcanelyidentifymost
6
oftheabnormalitiesinboththerealandsyntheticdatasets.

Phenocurveforinter-functionalanalysis
.Thethesispresentsanewalgo-
rithmcalledPhenoCurveforinter-functionalphenomicsdataanalysis.PhenoCurveis
amodelbasedcurvealgorithmaimingtostudyboththevaluesandthechanging
ratesofthedynamicphenomicsdata.Theevaluationontherealandsimulatedphe-
notypedatashowedthatPhenoCurvehasthebestperformanceamongallthetested
methods.Itsapplicationonrealplantphotosynthesisdatarevealednewfunctionsof
coregenesandprocessesthatcontrolphotosynthesisinresponsetovarying
environmentalconditions,whicharecriticalforunderstandingplantenergystorage
andimprovingcropproductivity.
1.4ThesisOverview
Theremainderofthisdissertationisorganizedasfollows.Chapter2laysoutthefoundation
fortherestofthethesis.
Thepartofthethesisfocusesondataqualitycontrolproblem.InChapter3we
focusondataqualitycontrolandtheapplicationDynamicFilterdeveloped,investigatehow
itisproposedandtrytoaddressrealcaseproblemoflowdataquality.
Thesecondpartofthethesisdealswithinter-functionalanalysisproblem.Chapter4
discussesthemotivationandcontextofdoinginter-functionalanalysisforphenotypingdata,
especiallyafterweobtainedhighdataqualityusingDynamicFilter.
Finally,Chapter5summarizesthisworkbyconcludingthemaincontributions,some
potentialextensionsandthefutureresearchdirections.Inordertofacilitateindependent
readingofvariouschapters,someofthedarerepeatedformultipletimes.
7
1.5BibliographicNotes
Someoftheresultsinthisdissertationhaveappearedinpriorpublications.Thematerial
inChapter3isbasedonaworkpublishedintheBioinformaticsonOxfordJournal[69]
(Bioinformatics)andthecontentofChapter4comesfromcurrentworkwhichistobe
submitted.
8
Chapter2
Background
Thegoalofthischapteristogiveageneralandformaloverviewofthematerialsrelatedto
theworkthathasbeendoneinthisthesis.Inparticular,wewilldiscussthekeyconcepts
andquestionsrelevanttoproblemsofdatacleaningandcurveTheexpositiongiven
hereisnecessarilyverybriefandthedetaileddiscussionwillbeprovidedintherelevant
chapters.
2.1Datacleaningforhighdataquality
Datacleaningistheprocessofidentifyingincorrectorcorruptedrecordsinadataset.The
goalofdatacleaningistoensureanaccuraterepresentationofthereal-worldconstructsto
whichthedatarefer.Removingimpuritiesfromdataistraditionallyanengineeringproblem,
wheread-hoctoolsmadeupoflow-levelrules(suchasdetectingsyntaxerrors)andman-
uallytunedalgorithmsaredesignedforsptasks(suchastheeliminationofintegrity
constraintsviolations)([44]).Detectionandeliminationofcomplexerrorsrepresentingin-
validvalues,however,gobeyondthecheckingandenforcementofintegrityconstraints.They
ofteninvolverelationshipsbetweentwoormoreattributesthatareverytouncover
anddescribebyintegrityconstraints.Recentworkhasshowntheenessofapplying
techniquesfromstatisticallearningforthepurposeofdatacleaning.Inparticular,outlier
detectionmethodshavemadeitpossibletoautomatethecleansingprocessforavarietyof
9
domains([13,17,31,39,40,48]).
2.1.1Relatedwork
Noneoftheexistingoutlier-detectionbasedmethodsaresuitableforphenotypedataclean-
ing.First,bothbiologicaldiscoveriesanderrorsofdetectionaretoseparatefrom
distribution.Second,thecohesivenessruleusedintemporaldatacleaningisnotapplicable
forthephenotypedata,because1)anon-cohesivetime-serialcouldrepresentaninteresting
phenotypepatternratherthananerror;2)alltheobservationsatthesametimepointmay
besimilarlybyasystematicabnormalevent([44]).
Alternatively,ratherthancheckingtherawvalues,residueanalysiscanbeemployedto
modelthebetweentherealvaluesandthetheoreticalcurve,whichisusually
derivedfrombiologicalconstraintssuchasthegeneralizedlightreactions([30,38]).Thisis
oftencalledthe
goo
model.Thegoobaseddatacleaningmodelscanbe
intotwocategories.First,statisticaldistributioncharacterssuchasmean,standard
deviation,intervalorrangehavebeenusedtounexpectedvaluesindicating
possibleinvalidvalues([39]).Suchsimplemethodscanbetlyappliedtobigdata.
However,theseparameters(suchasmean)areinclinedtobebiasedbyabnormalitieswith
largedeviations.Sinceitdoesnottakeintoaccountlocalcharacteristicsofdata,thereisa
riskofmislabelingarangeofnormaldataasabnormalities,andviceversa.Second,combined
dataminingtechniquesareusedtoidentifypatternsthatapplytomostresidualrecords.
Apatternisbyagroupofresidualsthathavesimilarcharacteristics(behaviorfor
certainpercentageoftheinthedataset).Outliersarethenidenasvaluesthatdo
notconformtothepatternsinthedata.Amongthem,theHampelusesthemedian
ofneighboringobservationsasareferencevalueandlooksforlocaloutliersinastreaming
10
datasequence([48,49]).WhiletheHampelissuitablefortemporaldatacleaning,it
assumesthatthedataareindependentandidenticallydistributed,whichisnotvalidunder
dynamicenvironmentalconditions.Butwhenthedataishighlyautocorrelated,Hampel
mayfailtocaptureabnormalities.
Itshouldbenotedthatwhilethegoobaseddatacleaningmodelsfocusonthe
modelingofdeviation,theyarenotawarethatthetheoreticalcurve,whichisusedasthe
reference,maynotalwaysbeprecise.Typically,theoreticalcurvesderivedfrombiological
knowledgearesimplecomparedwiththerealworldsituation.Itisthereforeinappropriate
todirectlyusetheimperfecttheoreticalcurvetoinferabnormalities.
Inthispaper,wedeveloparesidualanalysismodelcalled
Dynamic
Filter
toelyidentifyabnormalitiesinplantphotosynthesisphenotypedata.Our
modelderivesatheoreticalcurvefromthephotosyntheticbiologicalconstraints;adjuststhe
theoreticalcurvetothephenotypedataviaoptimization;andthenstudiesthedeviations
ofindividualphenotypevaluesfromtheoreticalcurve.Theresultingpatternsinresiduals
indicateabnormalities,whichisatypeoferrorsofdetection,andtheoptimizedtheoretical
curvesrevealtruebiologicaloutliers.
2.2Inter-functionalanalysis
Inthissection,weintroducethebiologicalbackgroundonmodelingtherelationshipbe-
tweenlightintensityandtherateofphotosynthesis,aswellastheexistingcomputational
approachesoncurveandregression.
Thephotosynthesis-irradiance(PI)curveisagraphicalrepresentationoftheempirical
relationshipbetweensolarirradianceandphotosynthesis[38].AsaderivationofMichaelis-
11
(a)HyperbolicPhotosynthesis-Irradiance(PI)
curve.
(b)MeasuredphotosyntheticratesofCol-0and
thePIcurve.
Figure2.1:Photosynthesis-Irradiance(PI)curve.
Mentenkinetics,PIismodeledasahyperboliccurveasshowninFig.2.1(a)[42],indicating
thatthereisagenerallypositivecorrelationbetweenlightintensityandphotosyntheticrate.
ThePIcurvehasbeenappliedsuccessfullytocleanplantphotosynthesisphenotypedata,as
wellastoexplainocean-dwellingphytoplanktonphotosyntheticresponsetochangesinlight
intensity[30,69].
Let
P
bephotosyntheticrateatagivenlightintensity[
I
],thePIfunctionisinEq.2.1:
P
=
P
max
[
I
]
i
1
=
2
+[
I
]
(2.1)
where
P
max
isthemaximumpotentialphotosyntheticrateperindividual,and
i
1
=
2
ishalf-
saturationparameterrepresentingtheamountoflighttoproducthalfofmaximumphoto-
synthesis.Parameter
i
1
=
2
isusuallylearnedwithaserialofvaluesof
II
andlight(
i
)using
curvemethods.
Bydescribingthephotosyntheticrate
P
usinglinearelectronw,wecanassociateboth
12

II
and
i
withtime
t
usingEq.2.1:

II
(
t;i
1
=
2
)=

II
)
1+
i
(
t
)
i
1
=
2
(2.2)
where
t
isatimepointinatemporalregion
R
;
II
(
t
)and
i
(
t
)represent
II
(steadystatequantumyieldofphotosystemsII)andlightintensityat
t
;max
II
)isthe
maximal
II
in
R
.Seemoredetailsin[69].
Duringshorttimeregion,
i
1
=
2
inEq.2.2remainsconstant.However,itmaychange
graduallyoverlongtimeregionwiththechangesofmultipleenvironmentalfactorssuch
aslight,temperature,
CO
2
,and
O
2
[14].Thisphenomenon,namelyphenotypicplasticity,
indicatestheabilityofanorganismtoproducemorethanonephenotypewhenexposedto
tenvironments[50,46].
2.2.1Relatedwork
Knowingthetrendof
i
1
=
2
willgreatlyenhanceourunderstandingtothemechanismthat
regulateplantsinresponsetoacontinuouslychangingandunpredictableenvironment.How-
ever,tocapture
i
1
=
2
bydirectlyapplyingthePIfunctiononthephenotypeandenvironment
datamaybeinappropriateduetothehighnoiserateinrealdata[69].
2.2.1.1SlidingWindowbasedCurveFittingMethods
Sincethechangeof
i
1
=
2
isassumedtobesmoothandcontinuous,aslidingwindowapproach
maybemoreappropriatethandirectlyapplyingthePIfunctiononeverytimepoint.Us-
ingtheslidingwindowapproach,wecandividethewholephenotypedataintooverlapped
temporalregions,andthenemployacurvettingorregressionmethodtocomputealocal
13
i
1
=
2
valueforeachregion.Finallyallthelocal
i
1
=
2
valuesaremergedtocapturetheglobal
phenotype-environmentrelationship.Notethatthereisnoexplicitboundarybetweencurve
andregression,whiletheformerismoreontheoptimizationitself,andlater
focusesmoreonstatisticalinference[43].
Curvegisacommonlyusedmethodtomodeltherelationshipsamongtwoormore
variables[43].Mathematically,itisaprocesstotunetheparametersofaknownmathemati-
calfunction
f
(
x
)toachievethebesttoaseriesofdatapoints.TheLevenberg-Marquardt
algorithm(LMA),akathedampedleast-squaresmethod,hasbeenwidelyusedfornonlinear
leastsquarescalculationsforsolvinggenericcurvproblems[35].LMAinterpolates
betweentheGauss-Newtonalgorithmandthemethodofgradientdescent,aimingtoa
localminimum[28,6].
Ifthefunction
f
(
x
)isunknown,kernelsmoother,suchaslocallinearregression
(LLR),isoftenusedforestimatingasmoothfunctionfromaserialsofnoisyobservations[26].
Inthisway,non-linearrelationshipsbetweenphenotypesandenvironmentalfactorscanbe
learnedfromdata,eveniftheunderlyingbiologicalmechanismbetweenthemisunknown.
2.2.1.2BayesianLinearModelwithNormalInverseGammaPrior
Whilethesliding-windowbasedcurvemethodsoptimizeeverylocalthey
simplyignoretheglobalcontinuityoftheparameter
i
1
=
2
.Thus,theymaybesensitiveto
noiseandbiasinlocalregionsinrawphenotypedata,resultingininaccuratephenotype-
environmentrelationships.Tothisend,performanceimprovementmaybeachievedbyusing
Bayesianlinearmodelwithnormalinversegamma(NIG)prior[22].
14
First,wetransferEq.2.2toitslinearformforeasierrepresentation:

II
)

II
(
t
i
)

1=
1
i
1
=
2
i
(
t
i
)+

(
t
i
)(2.3)
where

(
t
i
)istheerrorassociatedwith
t
i
,distributedasnormaldistribution
N
(0
;˙
2
i
).
Second,similartoSection2.2.1.1,weadoptaslidingwindowapproachtoestimate
i
1
=
2
and
˙
2
i
foreachtemporalregionusinglinearregressionmethods[20].Givenathreshold
onthelinearregressionreliability
R
2
,wecanclassifyallthetemporalregions
D
intotwo
groups,i.e.reliabledata
D
r
andunreliabledata
D
u
[28].
Third,inordertoderivetheposteriorprobability
p
(
i
1
=
2
;
2
j

II
;i
),weassumethat
(
i
1
=
2
;˙
2
i
)followsthenormalinversegammadistribution,i.e.
i
1
=
2
;˙
2
i
˘
NIG
(
V;a;b
),
where
V;a;b
isthesetofhyperparametersofNIG.Subsequently
i
1
=
2
;
2
j

II
;i
isalsodis-
tributedasNIGyetwiththyperparameters(


;V

;a

;b

).Assumingthat
D
r
and
D
u
followthesinglemodeoftheNIGdistribution,weestimatethehyperparametersusing
D
r
andapplythemon
D
u
.Sp,
i
1
=
2
ofeachtemporalregionin
D
u
canbeobtained
by1)obtainingthemarginalposteriordistributionbyintegratingout
˙
2
i
,2)maximizinga
posterioriprobabilityofthemarginalposteriordistribution,and3)optimizingaBayesian
linearmodelusingboththesharedhyperparametersandlocaldata.Seemoredetailsat
[22]and[8].
Inthismodel,theglobalcontinuityof
i
1
=
2
isguaranteedduetoasinglemodeofthe
normalinversegammadistribution(i.e.
D
r
and
D
u
sharethesamesetofcommonhyper
parameters)[8].However,theassumptionsonconstanthyperparametersmaybetoorigid
forrealphenotypeandenvironmentdata,inthat
D
r
and
D
u
maybeundertprior
distributionsduetotenvironmentalconditions.
15
Chapter3
DatacleaningwithDynamicFilter
AdatacleaningframeworkcalledDynamicFilterisproposedfordataqualitycontrolinthis
Chapter.
Theremainderofthechapterisorganizedasfollows.Section3.1motivatestheproblem
andmainintuitionbehindtheproposedalgorithm,thedetaileddescriptionoftheproposed
algorithm.Section3.2presentsthedetailedimplementationissues,theexperimentalsetup,
resultsandanalysis.
3.1Methods
Inthissection,weintroducethetheoreticalcurveoftime-seriessteady-statequantum
yielddata,andthenintroduceaframeworkforabnormalitydetection.
Inthispaper,thetime-seriessteadystatequantumyieldofphoto-systemsII(denoted
as
II
)ischosenforabnormalitydetectionforthreereasons.First,
II
canbereadily
measuredusingvideoimagingmakingitusefulforhighthroughputphenotyping.
Second,becauseitlight-drivenelectrontransfer,itcanbeusedasanindicatorof
photosyntheticratesand,albeitwiththecaveatthatitthesumof
CO
2
photo-respirationandotherprocesses([47]).Finally,
II
isagooddemonstration
oftheapproachbecauseittendstofollow,toareasonabledegree,relativelysimplesaturation
behaviors.Givenanadequatemodel,thecleaningproceduredescribedinthemanuscript
16
Figure3.1:TheframeworkofDynamicFilter.
mayalsobeappliedtootherphotosyntheticparameterslikenon-photochemicalquenching
(NPQ),whichcandisplaycomplexbehaviors.
17
3.1.1TheoreticalPhotosyntheticCurve
Anabnormalityinresidualanalysisisanobservationexhibitingalargerencebetween
thetheoreticalvalueandtheobservedvalue,andmayindicateadataentryerrorfromthe
phenotypingsensors.Toderivethetheoreticalcurve,wemodel
II
withthephotosynthesis-
irradiance(PI)curve(seeFig2.1(a))([30,38]).
AsaderivationofMichaelis-Mentenkinetics,oneofthebest-knownmodelsofenzyme
kineticsinbiochemistry([42]),PIismodeledasahyperboliccurve(seeFig2.1(a))inEq.
2.1,revealingtheempiricalrelationshipbetweensolarirradianceandphotosynthesis([38]).
where
P
isphotosyntheticrateatagivenlightintensity,
P
max
isthemaximumpotential
photosyntheticrateperindividual,[
I
]isagivenlightintensity,and
i
1
=
2
ishalf-saturation
constant.Fig2.1(a)showsthegenerallypositivecorrelationbetweenlightintensityand
photosyntheticrate.ThePIcurvehasalreadybeenappliedsuccessfullytoexplainocean-
dwellingphytoplanktonphotosyntheticresponsetochangesinlightintensity([30]),aswell
asterrestrialandmarinereactions.
Wedescribethephotosyntheticrate
P
intermsoflinearelectronw([32]),andasso-
ciatebothtemporalsteadystatequantumyieldofphotosystemsII
II
andtemporallight
intensity
i
withtime
t
,asshowninEq.2.2:where
t
isatimepointinatemporal
region
T
(
t
2
T
);
II
(
t
)and
i
(
t
)representthesteadystatequantumyieldofphotosystemsII
andlightintensityat
t
;m
II
)isthemaximal
II
in
T
;andthehalf-saturationconstant
i
1
=
2
isthelightintensityatwhichthephotosyntheticrateproceedsathalf
P
max
.Seeproof
insupplementarySection1.
OnemayreasonablyaskiftheNPQorphotoinhibitionwouldthetheoreticalmodel
forlightsaturation.Infact,NPQhas(surprisingly)littleontherelationshipbetween
18

II
andlightintensityascanbereadilyseeninthefactthatthe
II
lightsaturation
curvesforwildtypeandthe
npq
4mutantofArabidopsisareessentiallyidenticaldespite
largeinqE(i.e.,rapidlyreversiblephotoprotectionofNPQ)([36]).Thereason
forthisapparentdisconnectisthat,athighlight,thesloweststepinthelightreactionsof
photosynthesisoccurssubsequenttolightabsorptionatthecytochrome
b
6
f
complexandis
regulatedbythepHofthelumen([60]).Lightabsorptionbecomeratelimitingonlyat
NPQlevelsmuchhigherthanthoseobservedhere.ThebiologicalroleofNPQundermost
conditionsappearstobeinregulatingelectrontransferbutinpreventingthebuildupof
reactiveintermediateswithinthephotosystemIIreactioncenter([45]).Thus,theof
moderatelevelsofNPQandphotoinhibitionshouldhavelittlethebehaviorofthewild
typesystem.However,underextremeconditionsofinmutantlineswithalteredbehavior
producinghighlevelsofNPQorphotoinhibitionweexpecttoseebehaviorthatdeviates
fromthatproducedbythemodel.Theseinstanceswillbedetectedasoutliersand
forfurtherinvestigationofpossiblebiologicaldiscoveries.
Consequently,thehalf-saturationconstant
i
1
=
2
canbelearnedusingall
II
(
t
)and
i
(
t
)in
T
withanonlinearregressionmethod([54]).Notethatthehalf-saturationconstantcanbe
dramaticallytbetweenplantsandbetweenleavesinplants.Thus,thegeneralshape
ofthecurveistypicallymaintained,butnotitsmaximalorhalf-saturationlightintensity.
Finally,given
i
1
=
2
,theresidualvalueateachtimepoint
t
isas:
rsd
(
t
)=
II
(
t
)


0
II
(
t;i
1
=
2
)(3.1)
where
rsd
(
t
)istheresidualvalueatattime
t
;and
II
(
t
)istheobservedvalueand
0
II
(
t
)
isthetheoreticalvalueofsteadystatequantumyieldat
t
calculatedusingEq.2.2.
19
WenotethattherearemultiplemodelsforPIcurves,whichgivesimilarresponsesto
light([37,23,73,34,15]).Inthispaper,wechosetheMichaelis-MentenKineticsmodel
becauseitisconvenienttouse,andplantphotosynthesisratedatawell(seeFig2.1(b)).
Itshouldbenotedthatanimportantfeatureofourapproachisthatthesealternativemodels
canbeeasilyaddedorsubstitutedforcomparison.
3.1.2FrameworkofDynamicFilter
DynamicFilterisacoarse-residualanalysisapproach,whichhasthreemajorsteps
asshowninFig.3.1.Weabnormalitiesusingatothatproposedin[40]:
3.1.
Abnormality.
Let
f

nor
g
beasetofnormalphenotypedata,and
f
rsd
nor
g
bethecorrespondingresidualset.Anabnormality

abn
isaphenotypevaluewhoseresidual
fallingthe

ceintervalofthemajornormaldistributionof
f
rsd
nor
g
.
Notethatinterval

=99%iscommonlyusedinliterature([40,57]),but
isadjustablebyusers.Inthispaper,byadoptingtheconceptofinterval,we
assumethat1)themajorityofthephenotypevaluesarecorrect,and2)theyformthemajor
distributionintheresidualdata,whichisalsodistinctlytfromthedistribution(s)of
theresidualdataoftheabnormalities.
20
(a)GMMincoarseprocess
(b)Candidatesincoarseprocess
(c)Beforeregionalnt
(d)ApplyEMonlocalregion
(e)OutputofEM
(f)Afterregionalt
Figure3.2:AnexampleofDynamicFilter.Thesolidpointsareabnormalitiesandthehollow
pointsarenormalvalues.
21
(a)Candidatesinprocess
(b)Finaloutputs
Figure3.2(cont'd)
22
3.1.2.1Step1.Coarseprocesstoidentifyabnormalcandidates.
Givenasetofphenotypedata
II
,weadoptEq.2.2togeneratethetheoreticalvaluesof
steadystatequantumyieldsforeachplant,denotedas
f

0
II
g
,byusingthewholetime-serial
astemporalregion
T
,akathecoarselevel.ForthedatasetusedintheExperimentSection,
thesmallestvalueoftimeintervalis10minutes,andthescaleof
T
inthewholedataset
is3days.Consequently,wegeneratetheresidualdataofallplants
f
rsd
g
usingEq.3.1,
andmodelthemusingaGaussianmixturemodel(GMM)(seedetailsinSection3.1.3and
exampleinFig.3.2(a)).Finally,wegeneratetheabnormalitycandidateset
f

abn
g
with
3.1.InFig.3.2(b),solidpointsareabnormalitycandidatesinthecoarseprocess.
Clearly,becauseofthePIcurvemodel,notalltheabnormalitycandidatesare
correctlyiden
3.1.2.2Step2.KNNprocesstoabnormalityiden
Abnormalitycandidatesmayhavecertainintrinsicpatternsofdistributionhighlyrelatedto
certainrangesoffeaturespace.Forexample,accidentaldysfunctionofdata-capturingdevices
maycauseabnormalitiesconcentratedaroundsomeregions,whichformstatisticalpatterns
onthedistributionplotofthefeaturespace.Fromastatisticalviewpoint,abnormalities
shouldbeawayfromnormalvaluesinthefeaturespace,andvalueswithsimilarfeatures
tendtohavethesamelabels.Thisleadstoatprocesstoexploitthepatterns
ofabnormalitiescandidatesonselectedfeaturespace,andtomakeuseofthesepatternsto
abnormalityidenasdescribedinAlgorithm1.
Sp,weselecttheoptimalfeaturesfrom
II
,
rsd
,
i
,
t
,etc.,inwhichab-
normalitiesandnormalvaluesaremaximallyseparated.Tosolvethisfeaturereduction
problem,LinearDiscriminantAnalysis(LDA)isadoptedtogettheprincipalcomponents
23
oftheoptimalfeaturespace(Algorithm2,seedetailsinSection3.1.3).Second,weap-
plyK-nearest-neighborapproachontheselectedfeaturespace,suchthateachabnormality
candidatewillberelabeledasitsmajoritylabelof
k
nearestneighbors(Algorithm1)([4]).
Algorithm1
KNNprocesstoresults
Input:

:originalfeaturespace

C
:thesetoflabels(abnormalityornormal)

k
:numberofnearestneighborsused
1:

proj
 
Feature
;C
)2
2:
for
 
i
in
proj
do
3:
C
i
 
majoritylabelof
k
nearestneighbors
4:
endfor
5:
Output:
C
Algorithm2
FeatureSelection
Input:

:originalfeaturespace

C
:thesetoflabels(abnormalityornormal)
1:

 
1
j
C
j
P
 
2:
for
i
from1to2
do
3:

i
 
Featuresof
i
th
label
4:
n
i
 j

i
j
;

i
 
1
n
i
P
 
i
5:
SW
i
 
1
n
i
P
(
 
i


i
)(
 
i


i
)
T
6:
endfor
7:
SW
 
P
2
i
=1
SW
i
8:
SB
 
P
2
i
=1
n
i
j
C
j
(

i


)(

i


)
T
9:

proj
 
eig
(
SB=SW
)


10:
Output:

proj
:
proj
isprojectedspace
Step3.processtoidentifyabnormalitiesinlocalregions.
Becausethetheoreticalvalues
f

0
II
g
arelearnedwiththedPIcurvemodelatthe
coarselevel,notalltheassignmentsoftheabnormalcandidatesarecorrect.Consequently,
weseparatetheabnormalcandidates
f

abn
g
totemporalcheckingregions(seen
3.2)andabnormalityidenationineachregion.
24
Algorithm3
EMoptimizationoneachlocalregion
r
Input:

phenotypevaluesinalocalregion

i
:lightintensity


:interval
1:
Let
nor
and
abn
benormalvaluesandabnormalitiesin
2:
repeat
3:
E-step:
4:
[
rsd
nor
;i
1
=
2
nor
]
 
PI
Curv
nor
;i
)
5:
[

nor
;˙
nor
]
 
GMM
(
rsd
nor
)
6:
[
rsd
min
;rsd
max
]
 
terval(

nor
;˙
nor
;
)
7:
M-step:
8:
rsd
abn
 

abn
;i
1
=
2
nor
)
9:

nor
;

abn
]
 
UpdateCandidate(
rsd
nor
;rsd
abn
;rsd
min
;rsd
max
)
10:
until

nor
and
abn
arestable
11:
Output:

nor
3.2.
TemporalCheckingRegion.
Acheckingregion
r
consistsofatmost
m
normalvaluestheselectedabnormalcandidates,dependingondataavailability,
denotedas
f

nor
g
,andatmost
n
abnormalcandidatessuchthatthelastabnormalcandidate
isconstrainedtobeatmost
l
-timepointsawayfromtheone,denotedas
f

abn
g
.
In3.2,
m
,
n
and
l
areuserparametersthatdeterminethesizeofa
temporalcheckingregion.Acheckregionhasatmost
m
+
l

n
normalvaluesandatmost
n
abnormities.Notethatabnormalcandidatescanbecontinuousordiscontinuous,andtwo
checkingregionsmaysharecommonnormalvalues.
Intheprocess,anExpectation-Maximization(EM)processisemployedtorepeat-
edlyoptimizetheresultsineachtemporalregion
r
.Pseudo-codeoftheEMprocessisshown
inAlgorithm3.IntheEstep,usingthelocalnormalvalues
f

nor
g
incheckingregion
r
as
inputs,weregeneratethetheoreticalvalues
f

0
g
withEq.2.2.Thentheresiduals
f
rsd
g
for
boththeabnormalcandidates
f

abn
g
andthenormalvalues
f

nor
g
areregeneratedusing
Eq.3.1(Algorithm3line6-8).IntheMstep,wetheabnormalcandidateset
f

abn
g
25
withthestatisticaldistributionofthenewresidualdata
f
rsd
g
accordingto3.1.
Sp,avaluefallstheintervalthresholdofthemajordistributionofthe
normalresidualvalueswillbemovedto
f

abn
g
;andifanabnormalcandidateiswithinthe
intervalthresholdofthemajordistributionofthenormalresidualvalues,itwill
belabeledasnormalandbemovedto
f

nor
g
(Algorithm3line10-11).
TheEMprocesswillstopwhenthelabelassignmentisstable.Fig.3.2(c-f)showsthe
iterativeprocessinacheckingregion.Sincecheckingregionsmaysharecommonvalues,the
resultsfromtregionsmaybeForexample,aphenotypevalueisiden
asanabnormalityinoneregionbutisconsideredanormalvalueinanotherregion.To
solveandconsequentlyimproveperformance,weemployaninformationsharing
processintheendoftheEMprocesstobroadcastallthelocalresultstoallthechecking
regions.Ifexists,votingresultswillbeusedtoabnormalcandidatesinthe
selectedfeaturespace(step2),andtheEMprocesswillrerunonthenewcheckingregions.
Theprocesswillrepeattilltheresultsconverge.Fig.3.2(g,h)demonstratethatallthe
abnormalitiesareiden
3.1.3RelatedWorks
WeintroducetheGaussianMixtureModel(GMM)andtheLineardiscriminantanalysis
(LDA)usedinSection3.1.2.1asfollows.
3.1.3.1GaussianMixtureModel.
AGaussianMixtureModelisaparametricprobabilitydensityfunctionrepresentedasa
weightedsumofGaussiancomponentdensities([53]).GMMsarecommonlyusedasapara-
metricmodeloftheprobabilitydistributionofcontinuousfeatures([53]).Theprobability
26
densityfunctionisgivenbytheequation:
p
(
x
j

)=
M
X
i
=1
!
i
g
(
x
j

i
;

i
)(3.2)
where
x
isaD-dimensionalcontinuous-valuedvector,
!
i
,
i
=1
;:::;M
,arethemixture
weights,and
g
(
x
j

i
;

i
)
;i
=1
;:::;M
arethecomponentGaussiandensities.Eachcompo-
nentdensityisaD-variateGaussianfunctionoftheform:
g
(
x
j

i
;

i
)=
1
(2
ˇ
)
D=
2
j

i
j
1
=
2
exp

1
2
(
x


i
)
T


1
i
(
x


i
)
g
(3.3)
with

i
bethemeanvectorand
i
bethecovariancematrix(
i
=1
;:::;M
).Themixture
weightssatisfytheconstraintthat
P
M
i
=1
!
i
=1.GMMparametersareestimatedfromtrain-
ingdatausingtheMaximumLikelihoodParameterEstimationorMaximumAPosteriori
Estimation([53]).Inthispaper,residualsareone-dimensionalscalardata,weuse

i
and
˙
i
torepresentthemeanandvarianceofresiduals.
3.1.3.2LinearDiscriminantAnalysisforFeatureSelection.
Lineardiscriminantanalysis(LDA)isamethodsusedinstatistics,patternrecognitionand
machinelearningtoalinearcombinationoffeatures,whichcharacterizesorseparatestwo
ormoreclassesofobjectsorevents,suchthattheinter-classvarianceismaximizedandthe
intra-classvarianceisminimized([67]).Theresultingcombinationmaybeusedasalinear
ormorecommonly,fordimensionalityreductionbeforelaterInthis
paper,weseekcombinationoffeatures,withwhichnormalvalues(oneclass)arecentered
aroundonearea,whileabnormalities(anotherclass)arecenteredaroundadistinctively
separatedarea.
27
Supposethereare
C
classes,andeachclasshas
n
i
points,mean

i
andintra-classvariance

i
.Thentheinter-classvariancemaybebythesamplecovarianceoftheclassmeans:
SB
=
C
X
i
=1
n
i
j
C
j
(

i


)(

i


)
T
(3.4)
andtheintra-classvarianceofwholedatasetis
SW
=
P
C
i
=1
SW
i
([41]).Theclassseparation
inadirection
~
!
inthiscasewillbegivenby:
S
=
~
!
T
SB
~
!
T
~
!
T
SW
~
!
T
(3.5)
Theobjectivefunctionistomaximize
S
anditcanbeshownthatwhen
~
!
istheeigenvector
of
SW

1
SB
,
S
willhavemaximizedvaluecorrespondingtoeigenvalue([51]).
3.2Experiment
WecomparedDynamicFilteronbothrealandsyntheticdatasetswithtwowidely-used
datacleaningalgorithms:1)astatisticalapproachthatabnormalitiesbasedon
standardvariance([39]),and2)Hampelthatidenabnormalitiesbasedondigress
frommedianoftrends([48,49]).Notethatallthethreemethodswereappliedonthesame
phenotyperesidualdataforafaircomparison.
Forperformanceevaluation,weusedboththeprecision-recallcurveandtheMatthews
correlationcot(MCC)([5]).TheMCCthatcanappropriatelyrepresentaconfusion
matrixiscomputedwith:
MCC
=
TP

TN

FP

FN
p
(
TP
+
FP
)(
TP
+
FN
)(
TN
+
FP
)(
TN
+
FN
)
(3.6)
28
(c)Precision-Recallcurves
(d)MCCw.r.t.intervalthreshold
Figure3.3:Performanceevaluationofprecision-recallandMatthewsCorrelationCot
onrealdataset.DFrepresentsDynamicFilter.
3.2.1Realphenotypedataset
WetestedtheperformanceofDynamicFilterusingtheplantphotosyntheticphenotype
dataconsistingof106
Arabidopsisthaliana
plantsedT-DNAinsertionmutantsand
wild-types)sampledat64timepointsunderdynamiclightconditions([3,1]).Thephoto-
29
syntheticphenotypevaluesvarydramaticallyacrossplants,potentialdi
indevelopment,stressresponses,orregulationofprocessessuchasstomatalconductance,
photodamage,andstorageofphotosynthate([32]).Expertswentthroughthedataand
manuallymarkedthegroundtruthofabnormalities,andfoundtheerrorrateis6.5%.
TheexperimentalresultsshowninFig3.3(a)indicatedthatDynamicFilteristly
betterthantheothertwoapproachesintheprecision-recallcurve.Sp,Dynamic
FilteryieldsAUCashighas0.964,higherthantheAUCofsimplestatisticsandHampel
Filter(0.147and0.543respectively).Fig3.3(b)showsourmodelisalsotlybetter
accordingtoMCC.Furthermore,itshowsthatDynamicFilterisinsensitivetotheselection
oftheintervalthreshold,whichisdistinctlytfromtheotheralgorithms
thatrelyonwell-pickedparameters.
NotethattheAUCofDynamicFilterwithoutKNNis0.862(Fig3.3(a)),implyingthe
KNNt(step2)isakeycomponentofDynamicFilter.Sp,Fig3.4shows
howtheKNNtimprovedtheperformanceofdatacleansing.Onthe
II
vs.
residualplotshowninFig3.4(a)(detailedvisualizationonFig3.4(b)),someisolatednormal
valuesareasabnormalities,andcertainabnormalitiesasnormal
values.Clearly,thesevaluesdonotconformwiththemostnearbyvalues.Byapplyingthe
KNNt,theseareelycorrected(Fig3.4(c,d)).
WesystematicallytestedtheperformanceofthetcomponentsofDynamicFilter.
Fig.3.5showstheperformanceimprovementbycomparingDynamicFilterwithamodel
withoutKNNt(v5),iterationofEM(v4),consensusonallregions(v3),reassign-
mentofnormalvaluesandabnormalitiesinEM(v2),orevenwithoutthewhole
process(v1).Itimpliesthattheedprocess,especiallytheKNNandEMt,is
thekeyofperformanceimprovement.
30
(a)Abnormalitiesincoarseprocess
(b)Detailedviewof(a)
(c)AbnormalitiesafterKNN
(d)Detailedviewof(c)
Figure3.4:PerformanceimprovementbyapplyingtheKNNtprocess.
Fig3.6and3.7showcasestudiesontherealdata.InFig3.6,theexperimentwasrunon
awild-typereferenceplant,ArabidopsisCol-0.Inthecoarseprocess,theresidualanalysis
wasappliedtoidentifytheabnormalcandidates(Fig3.6(a)andsolidpointsinFig3.6(b)).
Clearly,6solidpointsonthebottomwereincorrectlylabeledasabnormalities,whichwere
graduallycorrectedintheprocess(Fig3.6(c,d)).Fig3.7(a)showsatruebiological
discoveryontherealdata.OurscreenrevealedaccessionELYexhibitingphotosynthetic
characteristicsmarkedlytfromthereference(Col-0).Itwould,however,belabeled
asabnormalandsubsequentlydeletedbytheexistingoutlierdetectionbaseddatacleaning
31
Figure3.5:Performancecomparison.Eachversioncorrespondstoatversionof
DynamicFilter(DF)without:KNNt(v5),iterationofEM(v4),consensuson
regions(v3),reassignmentofnormal/abnormallabelsinEM(v2),orthewhole
process(v1).
methods,resultinginover-cleanproblem.DynamicFilteridenELYcorrectlyandsug-
geststhattheinitsquantumyieldiscausedbythemonotonedecreaseof
i
1
=
2
regardlessthechangeofsunlight(seeFig3.7(b)).Thenonnegligibledeviationbetweenthe
observedvaluesandthetheoreticalcurvelearnedfromthecoarsephaseofDynamicFilter
(seeFig3.7(c))impliesthetheoreticalmodelissimplecomparedwiththerealworldsitu-
ation.InsteadofdirectlyusethePIcurvetoinferabnormalities,weoptimizethe
resultsinthephaseofDynamicFilter,resultinginalmostperfectmatchbetweenthe
observedvaluesandthetheoreticalcurve(seeFig3.7(d)).
Furthermore,wevariedthesizeofthetemporalcheckingregionandcomparedtheper-
formanceinFig3.8.TheresultsinFig3.8(a)revealthatDynamicFilterachievesthebest
performancewhen
m
isbetween10and15.Thisnumberallowsenoughtrainingdataforthe
tprocess,meanwhileavoidingNPQvariationoverlongtimeinterval.Fig3.8(b)
32
(a)GMMincoarseprocess
(b)Candidatesincoarseprocess
(c)Candidatesinprocess
(d)Finaloutputs
Figure3.6:AcasestudyontherealdatashowsthatDynamicFiltercorrectlyidenall
theabnormalities.
showsperformanceofDynamicFilterisrelativelystableagainstmaxnumberofabnormali-
ties
n
,implyingtherobustnessofDynamicFilterishigh.
3.2.2Syntheticdataset
Sincethetruebiologicaldiscoveriesintherealdataareunknown,wefurthertestedDynamic
Filteronserialsofsyntheticdatasets.Thesyntheticdatasetsweregeneratedbyvaryingfour
parameterssystematically:lightsand
i
1
=
2
beingsmoothlyorabruptlychanged,abnormal-
itiesbeingcontinuouslyordiscontinuouslydistributed,anderrorratiobeingloworhigh.
Furthermore,weaddedvariationsrepresentingabnormalitiesandbiologicaldiscoveries(dif-
33
(a)PhenotypeofELYandCol-0
(b)
i
1
=
2
learnedbyourmethod
(c)ELYobserved&modelleddatainthecoarse
phase
(d)ELYobserved&modelledinthephase
Figure3.7:AcasestudyontherealdatashowsthatDynamicFilteridentruebiological
discoveriesunderthediurnallightcondition.Lineswiththesamemarkerrepresentbiological
replicates.
ferent
i
1
=
2
values)inthesyntheticdatasets.Intotal,63kindsofsyntheticdatasetsin9
groupsweregenerated,andforeachkindofsyntheticdata,werepeatedlygenerated100
datasets.
Fig3.9showstherobustnessofDynamicFilterontsyntheticdatasetsgenerated
under9tsettings.TheperformanceisevaluatedusingMatthewsCorrelationCo-
tonbothabnormalitiesandonbiologicaloutliers.Eachrepresentssynthetic
datageneratedundertsettings(seedetailsinsupplementarysection2).Eachpoint
34
(a)Performancevs.maxnumberofnormalvalues
(b)Performancevs.maxnumberofabnormalities
Figure3.8:Performancetestontemporalcheckingregionsize.(a)Fixingmaxnumberof
abnormalities
n
,andvarying
m
;(b)Fixing
m
,andvaryingmaxnumberofabnormalities
n
.
Figure3.9:TheMatthewsCorrelationCotofbiologicaldiscoveriesandabnormalities
onsyntheticdata.
35
inFig3.9representsaMCCscoreofbiologicaldiscoveryidenatx-axisandaMCC
scoreofabnormalityidenaty-axis.Thehighestpossiblevalueis(1
:
0
;
1
:
0).The
experimentalresultsshowthatDynamicFilter(redcircle)isbetterthantheothertwometh-
odsinalmostallthesyntheticdatasets.ThisisbecauseDynamicFiltercanidentifyand
removeabnormalitieswhilereservingbiologicaldiscoveries(seesupplementaryTable1and
Table2forperformancecomparisononMCCandtruepositiveraterespectively).
36
Chapter4
Inter-functionalanalysisby
PhenoCurve
InthisSection,weproposeamodelbasedBayescurvealgorithm,whichisableto
studyboththevaluesandthechangingratesofthedynamicphenomicsdata.Inparticular,
thisalgorithmisdesignedforstudyingthehiddenparameterofphenomics,butactuallyit
isnotexclusivetophemomicsdatabutalsomaybeextendedtobeusedonothermodels.
Therestofthechapterisarrangedasfollows.Section4.1motivatestheproblemandmain
intuitionbehindtheproposedalgorithm,thedetaileddescriptionoftheproposedalgorithm.
Section4.2presentsthedetailedimplementationissues,theexperimentalsetup,resultsand
analysis.
4.1Method
Inordertoexplorethedynamicphenotype-environmentrelationshipswithoutthetechnical
limitationsincurveandBayesianNIGmethods,wepresentacomprehensivedata
analysisapproach
PhenoCurve
basedonBayesiantheoremandpolynomialgeneralization.
PhenoCurvehasfourcomponents.First,itsplitsthewholephenotypeandenvironment
dataintohighlyoverlappedtemporalregionsusingaslidingwindowapproach,allowingfor
modelingthegradualchangeof
i
1
=
2
.Second,itemploysanon-linearcurvemethod
37
tocompute
i
1
=
2
foreachtemporalregion,andtheresultsintotwogroups,i.e.
reliable(
D
r
)andunreliable(
D
u
),basedon
R
2
.Third,usingdatafrom
D
r
,itestimates
i
1
=
2
foreachunreliableregionin
D
u
withpolynomialgeneralization.Finally,itoptimizes
the
i
1
=
2
valuesforallunreliableregionswithBayesiantheoremusinglocaldata,resulting
inincreasedreliabilityincurveWewillintroduceallthefourstepsinthefollowing
content.Anillustrativeexampleof
II
andlightdataisshowninFig.4.1AB.
4.1.1DataSeparationwithSlidingWindow
Duetobiologicalconstraints,thesamplingrateofthephenotypeisusuallymuchlowerthan
thatoftheenvironmentalfactors[14].Subsequently,wesplitthewholedatasetintohighly
overlappedtemporalregionssolelybasedonthephenotypedata.Inthesliding
windowapproach,eachtemporalregionhas
d
pairsofphenotypeandenvironmentvalues,
andthevaluesineachpairareprobedatexactlythesametime(orarecloseenoughto
eachother).Eachtemporalregionshare
d

1valueswiththepreviousandthenextregion.
Thewindowwidth
d
twoconditions:1)
i
1
=
2
remainsrelativelyconstantwithineach
temporalregion;and2)thereareenoughdataineachtemporalregionforinferring
i
1
=
2
.
4.1.2LocalCurveFitting
Thenextstepistoinfer
i
1
=
2
foreverytemporalregion
R
n
centeredaroundtimepoint
t
n
.Its
halflightparameter
i
n
1
=
2
canbeestimatedusingleastsquarecurve[20].Thegeneral
ideaistoidentifyparametersinafunctiontominimizethesumofallthesquareof
errorsbetweenthepredictedandtheobservedvalues.Inourcase,assumingthereare
d
38
Figure4.1:AnillustrativeexampleofPhenoCurvethatoptimizesbothlocaland
smoothnessof
i
1
=
2
,whichshowsA)dynamiclightvariationovertime,B)corresponding
II
values,C)anoptimizedonlocalregion,andD)
i
1
=
2
valuesonalltemporalregions.
39
phenotype-environmentpairs
f

1
II
;i
1
)
;

2
II
;i
2
)
;:::;

d
II
;i
d
)
g
in
R
n
,weidentify
i
n
1
=
2
by:
i
n
1
=
2
=argmin
i
1
=
2
d
X
k
=1
0
B
@

k
II


II
)
1+
i
k
i
1
=
2
1
C
A
2
(4.1)
TheprocedureinEq.4.1alsoyieldsa
R
2
score,indicatingthelevelofreliability
oftheBasedon
R
2
,wedivideallthetemporalregionsintotwogroups,i.e.reliable
data
D
r
andunreliabledata
D
u
.Inourexperiment,
R
2
thresholdat0
:
9isusedonboththe
realandthesyntheticdatasets.
4.1.3PolynomialGeneralization
Usingthe
i
1
=
2
valuesforthereliabletemporalregionsin
D
r
asinputs,webuildaregularized
polynomiallinearregressionmodel,aimingtogeneralizea
p
-orderpolynomialsmoothcurve
of
i
1
=
2
[63].Inaregularizedpolynomiallinearregressionmodel,orderofonerepresents
alinearmodel,orderoftworepresentsaquadraticform,andorderofthreeandabove
worksforarbitraryshapes.Inourexperiments,thehighestorderofpolynomialtermwas
settothreebyusingcross-validation.Notethatusinghigh-orderpolynomialwouldriskin
ovtting[7].
Let
X
=(1
;t;i;

II
;t
2
;i
2
;

2
II
;t
3
;i
3
;

3
II
)
T
bethevectorofpolynomialterms,and
W
=
(
w
0
;w
1
;:::;w
9
)
T
bethecotvectorof
X
,thegeneralizedpolynomiallinearregression
modelis
i
1
=
2
=
W
T
X
.
40
Theaimoftheregularizedistominimize
E
(
W
)where
E
(
W
)=
j
D
r
j
X
n
=1
(
W
T
X
n

i
n
1
=
2
)
2
+

j
W
j
2
2
=
j
W
T
X

i
1
=
2
j
2
2
+

j
W
j
2
2
(4.2)
where
j
D
r
j
isthenumberofreliabletemporalregionsidenedinSection4.1.2,
X
isthe
polynomialfeaturematrix,
i
1
=
2
isthevectorof
i
1
=
2
valueslearnedfromthereliabletemporal
regions,and
jj
2
2
denotes
L
2
normofavector.
InordertosolveEq.4.2,weset
@E
(
W
)
@
W
=0,whichyields
W

=(

I
+
X
T
X
)

1
X
T
i
1
=
2
(4.3)
where

isaregularizationtermprovidedbyuser.
Applying
i
1
=
2
=
W
T

X
todataineverytemporalregion,includingtheunreliableregions,
wethereforeobtainageneralizedpredictedhalf-saturationparameterforeveryregion
R
n
,
denotedas
^
i
n
1
=
2
=
W
T

X
n
.Thegeneralized
i
1
=
2
valuesensuresmoothhiddenstatevariables
underdynamicenvironments.
4.1.4BayesianMLEoptimization
Givenanunreliabletemporalregion
R
n
in
D
u
,weoptimizeitslocalhalflightparameter
^
i

n
1
=
2
usingBayesiantheorem,suchthat
^
i

n
1
=
2
bestwithboththelocaldatain
R
n
and
^
i
1
=
2
learnedfromthegeneralizedmodelinthepreviousstep.Mathematically,welookfor
optimal
^
i

n
1
=
2
that
^
i

n
1
=
2
=argmax
i
1
=
2
P
(
i
1
=
2
j
R
n
;
^
i
1
=
2
)(4.4)
41
Since
R
n
and
^
i
1
=
2
areindependent,weadopttheBayesiantheoremtosolveEq.4.4:
P
(
i
1
=
2
j
R
n
;
^
i
1
=
2
)=
P
(
i
1
=
2
;
^
i
1
=
2
)

P
(
R
n
j
i
1
=
2
;
^
i
1
=
2
)
P
(
R
n
;
^
i
1
=
2
)
=
P
(
i
1
=
2
;
^
i
1
=
2
)

P
(
R
n
j
i
1
=
2
;
^
i
1
=
2
)
P
(
R
n
)
P
(
^
i
1
=
2
)
=
P
(
i
1
=
2
j
^
i
1
=
2
)

P
(
R
n
j
i
1
=
2
)
P
(
R
n
)
=
P
(
i
1
=
2
j
^
i
1
=
2
)

Q
d
k
=1
P
(
d
k
j
i
1
=
2
)
P
(
R
n
)
/
P
(
i
1
=
2
j
^
i
1
=
2
)

d
Y
k
=1
P
(
d
k
j
i
1
=
2
)
(4.5)
where
d
k
isthe
k
thpairofphenotype-environmentpairin
R
n
and
R
n
2
D
u
.
InEq.4.5,
P
(
i
1
=
2
j
^
i
1
=
2
)representstheprobabilityof
i
1
=
2
given
^
i
1
=
2
.Weassuming
i
1
=
2
followsGaussiandistributionwith
i
1
=
2
˘N
(
^
i
1
=
2
;˙
2
1
),where
^
i
1
=
2
isthemeanand
˙
2
1
isthe
variance.InEq.4.5,
P
(
d
k
j
i
1
=
2
)representstheofthephenotype-environmentdata
given
i
1
=
2
.Infact,when
i
1
=
2
isprovided,
II
canbecomputedusingEq.2.2.Withthe
sameGaussiandistributionassumption,wehave
P
(
d
k
j
i
1
=
2
)=
N

k
II
;˙
2
2
).Hereboth
˙
2
1
and
˙
2
2
areuserparameters.
ByadoptingtheGaussiandistributionassumptionon
P
(
i
1
=
2
j
^
i
1
=
2
)and
P
(
d
k
j
i
1
=
2
),and
bytakingalogtransformonEq.4.5,wehave:
ln
Pr
(
i
1
=
2
j
R
n
;
^
i
1
=
2
)
/
ln
N
(
t
j
^
i
1
=
2
;˙
2
1
)+
d
X
k
=1
ln
N

k
II
j
f
PI
(
t
k
;i
1
=
2
)
;˙
2
2
)
/
(
i
1
=
2

^
i
1
=
2
)
2
2
˙
2
1

d
X
k
=1

k
II

max

II
)
(1+
i
k
=i
1
=
2
)
)
2
2
˙
2
2
(4.6)
42
SubstitutingtherightpartofEq.4.4withEq.4.6,theequationthatcanbesolved
byadoptingthemaximumlikelihoodestimation(MLE)is:
^
i

n
1
=
2
=argmin
i
1
=
2
0
B
B
@
(
i
1
=
2

^
i
1
=
2
)
2
2
˙
2
1
+
d
X
k
=1

k
II

max

II
)
(1+
i
k
=i
1
=
2
)
)
2
2
˙
2
2
1
C
C
A
(4.7)
Notethatinordertoincreasethereliabilityofcurveinunreliableregion
R
n
,
duringtheBayesianMLEoptimizationprocess,wesearchforthephenotype-environment
pairsoutside
R
n
thathavethebestmatchtothecurveof
R
n
,andaddtheminto
R
n
.Fig.4.1CDshowsanexampleofthewhichoptimizesbothlocalandthe
smoothnessof
i
1
=
2
.
TodemonstratethePhenoCurveprocedure,Fig.4.2showsthetstagesofcurve
Fig.4.2(a)presentsthegroundtruthdataforandaswellastheobserved
inputdatawhichiscomposedofnoiserateat0.1andbiasat0.1.Theresultfrom
slidingwindowapproach(WIN)isshowninFig.4.2(b).Basedonslidingwindowresult,the
dataregionisseparatedintoreliableregionsandunreliableregionsasshowninFig.4.2(c).
Lastly,PhenoCurveusethereliableregionstogeneralizedhalf-saturationparameters(as
showningreencurvewithmarkinFig.4.2(d))andmakethepredictionontheunreliable
regions(asshowningreencurvewithoutmarkinFig.4.2(d)).
4.2ExperimentalResults
WeevaluatedtheperformanceofPhenoCurveonboththerealandsimulatedphenotype
dataintermsofperformanceandreliability.WealsocomparedPhenoCurve
witheexistingmethodsi.e.1)thedirectcomputationwithPIfunction(PI),2)theone-
43
(a)GroundTruthandInputdata.
(b)Slidingwindowapproach(WIN).
(c)Separationoftrainingandtestingregions.
(d)PhenoCurvegneralizationandBayesoptimiza-
tion
Figure4.2:DemonstrationofPhenoCurverunningexampleonsyntheticdata.
windowcurve(ONE),3)thesliding-windowbasedcurve(WIN),4)thekernel
smoothingmethodusinglocallinearregression(LLR),and5)theBayesianlinearmodel
withnormalinversegammaprior(NIG).Allthesemethodshavebeenintroducedinthe
RelatedWorkSection.
4.2.1ExperimentalData
WetestedtheperformanceofPhenoCurveusingtherealplantphotosyntheticphenotype
dataconsistingof375Arabidopsisthalianaplants(330coT-DNAinsertionmutants
44
and45wildtypeplants)[3,2].Duringtheexperiment,alltheplantswereevenlysampled
at32timepoints.Alltheenvironmentalfactorsexceptlightareconstant.Followinga
sinusoidalcurve,lightintensitychangesgraduallyfrom35
m

2
s

1
to500
m

2
s

1
thengoesbackto35
m

2
s

1
duringtheexperiment.Thephotosyntheticphenotype
valuesvarydramaticallyacrossplants,potentialindevelopment,stress
responsesorregulationofprocessessuchasstomatalconductance,photodamageandstorage
ofphotosynthate[32,14].
Thesyntheticdataweregeneratedinthreesteps.First,werandomlyavectorof
i
1
=
2
thatchangesgraduallyovertime.Second,wereconstructedavectorofphenotype
II
usingthevectorof
i
1
=
2
,thesamevectoroflightastherealdata,andthePIfunction[38].
Third,werandomlyaddednoiseandbias(levelsvaryfrom5%to15%)tothephenotype
data.Theprocesswasrepeated4,000timestogeneratethefullsyntheticdata(seeTable
S1).
4.2.2EvaluationCriteria
Wefourperformanceevaluationcriteria,andappliedallofthemonboththeunreliable
regions
D
u
andthereliableregions
D
r
,aswellasthewholeregionsofthephenotypedata
D
.
First,cotofdetermination,denotedas
R
2
,isoftenusedasthemaincriteriafor
measuringwhetheracurveisadequate[28,10].Inourcase,
R
2
isas:
R
2
=1

P
n
i
=1

II
(
t
i
)

^

II
(
t
i
))
2
P
n
i
=1

II
(
t
i
)


II
)
2
(4.8)
where
^

II
(
t
i
)isthe
II
valueattime
t
i
,and

II
istheaveraged
II
valuesina
45
temporalregion.InEq.4.8,
R
2
measuresthefractionofthetotalvariationinthephenotype
datathatcanbeexplainedbythecurve.Highervaluesindicatethatthecurvethedata
better.If
R
2
=1
:
0,allpointslieexactlyonthecurvewithnoscatter.
Second,wecomputethesmoothnessofthe
i
1
=
2
vector.Foracontinuouscurve,smooth-
nesscanbemeasuredusinghighorderofderivatives.Fordiscretevalues(whichisourcase),
wemeasurealltheanglesformedbyadjacenttemporalregions.Mathematically,wehave:
smoothness
(
i
1
=
2
)=
1
j
D
j
j
D
j
X
n
=1
[

n

T

](4.9)
where
D
isasetoftemporalregions,

n
=
j
arctan(
i
n
+1
1
=
2

i
n
1
=
2
)

arctan(
i
n
1
=
2

i
n

1
1
=
2
)
j
representstheanglecenteredaroundtemporalregion
R
n
(
R
n
2
D
),
T

isauser
givenanglethreshold,and[
X
]returns1ifthecondition
X
isotherwiseitreturns
0.Inourexperiment,
T

=30
o
.
Finally,forsyntheticdata,wecomputedboth
II
and
i
1
=
2
,whicharethesumof
alltheabsolutebetweeneveryphenotypevalueanditscorrespondingvalueon
thecurveandthesumofalltheabsolutebetweenevery
i
1
=
2
valueandits
correspondingparameterofthePIcurverespectively.
4.2.3ExperimentalResultsonRealData
WeranPhenoCurveontherealphotosynthesisphenotypedatausingthreetwindow
sizes,i.e.6,12and18.WecomparedtheperformanceofPhenoCurvewitheexisting
methodsusingthesamecommonparameters.
TheexperimentalresultsindicatethatPhenoCurvehastlymorereliablecurve
results.Fig.4.3showsthatontheunreliableregions,thecotofdetermination
46
(a)
R
2
onunreliableregions.
(b)
R
2
onwholedata.
(c)Smoothnessonunreliableregions.
(d)Smoothnessonwholedata.
Figure4.3:Cotofdetermination
R
2
andsmoothnessof
i
1
=
2
ontheunreliableregions
andthewholerealphenotypedata.
(
R
2
)increases23%from0.72to0.89whenwindowsizeis12,comparedwiththesliding-
windowcurvemethod.Meanwhile,thesmoothnessofthecurveincreasesfrom
0.55insliding-windowmethodto0.63inPhenoCurveapproach.Amongalltestsonrealdata,
PhenoCurveyieldshighest
R
2
inbothunreliabledataandwholedataparts.PhenoCurve
alsooutputssmoothestcurveinallcasesexcepttheonewithwindowsize=6.Overall,
PhenoCurveperformsbestinexperimentsonrealdata.
47
(a)
R
2
onunreliableregions.
(b)
R
2
onwholedata.
(c)Smoothnessonunreliableregions.
(d)Smoothnessonwholedata.
Figure4.4:Cotofdetermination
R
2
andsmoothnessof
i
1
=
2
ontheunreliableregions
andthewholesyntheticphenotypedata.
4.2.4ExperimentalResultsonSyntheticData
Similarly,wecomparedtheperformanceofPhenoCurveandtheveexistingmethodsona
syntheticphenotypedatausingthreetwindowsizes.
TheexperimentalresultsindicatethatPhenoCurvehasmostreliableandsmoothestcurve
results,andithaslowesterrorratefor
II
and
HL
.Fig.4.4showsthatontheunre-
liableregions,thecotofdetermination(
R
2
)increasestlyfrombarely0.09in
sliding-windowmethodto0.67inPhenoCurvewhenwindowsizeis12.Besides,smoothness
ofPhenoCurveisalwaysthebestamongallmethods.Fig.4.5showsthecomparisonof
48
(a)
II
onunreliableregions.
(b)
II
onwholedata.
(c)
i
1
=
2
onunreliableregions.
(d)
i
1
=
2
onwholedata.
Figure4.5:
II
and
i
1
=
2
onthesyntheticphenotypedatawith0.10noiseandbiasrate.
errorrate
II
and
i
1
=
2
.Onunreliabledataregions,PhenoCurveperformsbetterthan
allothermethodsintermof
i
1
=
2
.Asfor
II
,PhenoCurveandonewindowapproach
hasbestperformanceinunreliableregions,whilePhenoCurveperformsbestinwholedata.
ThisisduetothefactthatPhenoCurvenotonlyoptimizeunreliabledata,butalsooptimize
thereliableregions.Overall,PhenoCurveperformsbestonsyntheticdata.
49
4.2.5BiologicalV
Givenallthe
i
1
=
2
valuesofalltheArabidopsismutantlines,wecomputedthemaximal
relative
i
1
=
2
as
max
j
i
1
=
2
(
t
m
)

i
1
=
2
(
t
n
)
j
foralltimepoints
t
m
and
t
n
satisfying
i
(
t
m
)=
i
(
t
n
)
and
m
6
=
n
.Thepurposeistochecktherecoveryabilityofaplantbyprobingitbeforeand
afterlightstress.ThedistributionofdatainFig.4.6showsthatwecaneasilyclusterall
theArabidopsismutantlinesintothreegroups
A
,
B
,
C
asindicatedintheMutants
ingroupAandBhavehighphotosynthesis,whereasgroupCmutantshaverelativelylow
photosynthesis.MutantsingroupAandCaremoreresistanttodynamiclightstressthan
groupBmutants.
Inordertoverifythebiologicaldiscovery,abiologicalexperimentbymeasuring
qI
(pho-
toinhibition)ofthesameplantsusingDEPI[14]maypotentiallybedesigned.Insummary,
learning
i
1
=
2
withPhenoCurvecangreatlysimplifytheprocesstoidentifythecomplicated
phenotype-environmentrelationshipsandtovisualizetheresults,enablingbiologiststodis-
coverthemechanismthatregulateplantsinresponsetodynamicenvironment.
50
Figure4.6:Phenotypeclusteringbasedonmaximalrelative
i
1
=
2
andmaximal
II
.
51
Chapter5
SummaryandConclusions
Inthisthesis,wedesignedtwotechniquesforcontroldataqualityandexploredataof
phenotypingvalues.Theproposedapplications,namely,DynamicFilterandPhenoCurve,
achievetheultimatetaskaroundtwoquestionsincluding

Howcanwegethighqualitydatafromproblematicrawdata?

Howcanwedocurvetingfordynamicallychangingphenotypingvalueswithhigh
reliability?
5.1Contributions
Thisthesismainlyanswersthetwoquestionsraisedabovebyproposingspecapplications
asfollows,givingtheoreticalstudiesandprovidingrealcaseandsyntheticcasetests.
5.1.1Datacleaningforhighdataquality
Withanaimtowardsidentifyingtargetsforimprovingenergyyield,advancedtechnologies
inhigh-throughputplantphotosyntheticphenotypinghavebeendeveloped([29,14]).These
systemscanbeusedtoquantifyphotosyntheticbehavioringeneticallydiversepopulations
andtodrawrelationshipsamonggenotype,phenotypeandbiologicalfunction,leadingto
betterunderstandingoftheunderlyingmechanismsthatcontrolthephotosyntheticproper-
52
tiesundervariousenvironmentalconditions([52,19]).Asaconsequenceofthelong-time
high-throughputplantphenotyping,thescaleofplantphenomicsdatagrowsexponentially.
However,thequalityofphenotypedatamaybeskewedbysourcesofnoisethatare
toremoveinthedatacollectionstep.
Thepurposeofplantphenotypingistodiscoverphenotypevaluesthataretly
tfromareference.Butphenotypevaluesleadingtobiologicaldiscoveriesmaybe
obscuredbyabnormalvaluescausedbyerrorsduringdetection.Toensurehighdataquality,
edatacleaningshouldbeconsideredaprimarytask.However,sinceadvanceddata
cleaningalgorithmsareprimarilybasedonindiscriminateoutlierdetection,theymayremove
bothabnormalitiesandbiologicaldiscoveriesnotseparableinthedatadistribution.
Wehavedevelopedanewmodelcalled
DynamicFilter
toely
identifybothabnormalitiesandbiologicaldiscoveriesbyadoptingawidely-usedphotosyn-
theticmodel.Spally,DynamicFilterisaresidualanalysisapproachbydynamically
tracingstatisticaldistributionsofallsamplesratherthanindividuals,andincorporatingEM
forperformanceoptimizationincheckingregions.
Wenotethat,certainevents,suchastransientchangesingrowthenvironment,could
introducesignalssimilartogrowthlightingmalfunction,whichcouldbewronglylabeled
asabnormalitiesbyDynamicFilter.Therefore,insteadofautomaticallydeletingallthe
predictedabnormalities,wesendallofthemtodomainexpertsfortion.Meanwhile,
allrawdataarekeptforanyrollbackoperation.
Experimentalresultsshowthatourmodelistlybetterthantheexistingdata
cleaningtoolsonbothrealphenomicsdataandsyntheticdata.DynamicFiltermayhave
awideimpactbecauseoftherapidincreaseoflarge-scalephenotypingtechnologies.It
shouldbenotedthatalthoughweusedaphotosynthesis-spcurve,themodelitselfis
53
independentofactualbiologicalconstraints.Inprinciple,ourapproachcanbeusedtoclean
dataforanynumberofphenotypesaslongassuitabletheoreticalcurvescanbederivedfor
theirbehavior.Implementationfornewusecaseswouldinvolvesubstitutingtheappropriate
theoreticalcurveintotheprogram,calculatingtheresidualsoftothedatasets,and
optimizingtheprocedure(asinFig3.1).
5.1.2Inter-functionalanalysis
Photosynthesisphenotypesareusuallymeasuredunderdynamicconditions,inarelatively
longtimeperiod,andonplantswithvastlytgeneticbackgrounds.Tomeetthe
growingneedstostudythedynamicrelationshipsbetweenphenotypesandenvironmental
factorsusinglimitedbiologicalknowledge,wedevelopedanewtoolcalledPhenoCurve.
PhenoCurvesplitsthewholephenotypeandenvironmentdataintohighlyoverlapped
temporalregions,employsnon-linearcurvemethodstocalculate
i
1
=
2
forthereliable
regions,andthenoptimizethe
i
1
=
2
fortheunreliableregionsusingpolynomialgeneralization
andBayesianMLE.TheresultsonbothsyntheticandrealdatashowthatPhenoCurveos
tlybetterthantheothereexistingmethods.
Table5.1:TestingtherobustnessofPhenoCurveagainstwindowapproachwithmul-
tiplenoiseandbiasrates.
Unreliabledata
Wholedata
noiseandbias
R
2
"
smoothness
"

II
#

HL
#
R
2
"
smoothness
"

II
#

HL
#
0.05
67%7%83%35%
24%20%33%15%
0.10
87%7%54%50%
36%5%32%25%
0.15
118%3%46%42%
46%5%27%22%
InordertotesttherobustnessofPhenoCurve,weranPhenoCurveonsyntheticdatawith
tlevelsofnoiseandbias.TheresultsshowninTable.5.1indicatesthatPhenoCurveis
moreethanwindowapproachwiththeincreaseofnoiseandbiasrate.Compared
towindowapproach,PhenoCurveyieldshight
R
2
,smoothercurve,lowererrorin
II
54
andhalflightparameters.
5.2Conclusions
Thisdissertationanswersthetwoquestionsraisedinthebeginningofthischapterbypresent-
ingmachinelearningtechniquesindataqualitycontrolanddataexplorationofdynamically
changingphenotypingvalues.Toaddresstheseproblems,DynamicFilterandPhenoCurve
areproposedtocleanrawdataandexplorecleaneddata.
Theconcludedresearchmakestcontributionsto(i)therealisticsolutionfor
maintainhighqualityandcleanproblematicdatafromrawinput,(ii)thechallengesof
exploringandobtainingreliableandcomparablehiddenparametersfromphenotypingdata.
55
Chapter6
FutureWork
6.1FutureWork
Thestudiespresentedinthethesisleadtoseveralimportantresearchquestions,whichmay
beextendedinthefuture.
6.1.1DynamicFilterfordataqualitycontrol
Futureworkmayincorporatesemi-supervisedlearningtechniquesintotheapplication.Cur-
rentversionofDynamicFilterprocessresultwithonepass,anddoesnotinteractwith
humanvtoautomaticallyadjusttheparametersetting.
Asforapplicationpotential,aslongasasuitabletheoreticalcurveisgiven,thetheoretical
photosyntheticcurvecanbeupdatedwiththenewone,andresidualsofthecurvecanbe
learnedatthecoarselevel,andthenbeoptimizedinthelevel(seeFig3.1).Therefore,
itispossibletoautomatethecleansingprocessonanylongtime-seriesdataforavarietyof
applications.
6.1.2Inter-functionalanalysis
Inthefuture,wewillextendPhenoCurvetomodelmultiplephenotypesandmultipleenvi-
ronmentalfactors.Oneofthetechnicalchallengesisthatwiththeincreaseofdatatypes,
56
thenumberofparameterswillrapidlyincreased,whicheitherrequirestlymore
samplingdata,orrequiresabetteralgorithmtomodelthedata.
57
APPENDIX
58
Pseudo-codeofDynamicFilter
Algorithm4
DynamicFilter:DynamicFilterAlgorithm
Input:

f

II
g
:arrayofphotosynthesisintensity,Light,99%

light
:arrayoflightintensity

99%:99%istheuser-spinterval
1:
[
RSD;I
i=
2
t
]
 
PI
CurveFitting(
f

II
g
;Light
)2.1
2:
[

main
;˙
main
]
 
GMM
(
RSD
)
3:
[
r
min
;r
max
]
 
getInterval(

main
;˙
main
;
99%)5
4:
C
 
UpdateCandidate(
RSD;r
min
;r
max
)6
5:
repeat
6:

 
(
f

II
g
;
f
RSD
g
;Light
)
7:
C
 

;C;k
0
)1
8:
f

r
g 
UpdateWindow(
C;num
)7
9:
for

r
in
f

r
g
do
10:

r
 
EM

r
;Light;
99%)3
11:
endfor
12:
C
 
Consensus(W)8
13:
until
C
isstable
14:
Output:
C
Algorithm5
Sub-procedures:getInterval
Input:


:meanofGaussiandistribution

˙
:standarddeviationofGaussiandistribution

99%:userchosencinterval,correspondsto3
˙
1:
r
min
 

3

˙
2:
r
max
 

+3

˙
3:
Output:
[
r
min
;r
max
]
59
Algorithm6
Sub-procedures:UpdateCandidate
Input:

R
:arrayofresidual

r
min
:minimumresidualthreshold

r
max
:maximumresidualthreshold
1:
for
r
in
R
do
2:
if
r
min

r

r
max
then
3:
C
r
 
0
4:
else
5:
C
r
 
1
6:
endif
7:
endfor
8:
Output:
C
Algorithm7
Sub-procedures:UpdateWindow
Input:

C
:candidatearray

num
:maximumnumberofabnormalityineachregion
1:
for
abnormality
c
in
C
do
2:
repeat
3:
w:abnormality
(
end
+1)
 
c
4:
next
c
5:
until
length
(
w:abnormality
)
>
=
num
6:
w:normal
 
[
num
pointsahead,
num
pointsafter]
7:
W
i
 
w
8:
clear
w
9:
endfor
10:
Output:
W
Algorithm8
Sub-procedures:Consensus
Input:

W
:arrayofwindows
1:
(
Count
n
;Count
ab
)
 
([]
;
[])
2:
for
w
in
W
do
3:
for
Idx
n
in
w:normal
do
4:
Count
n
[
Idx
n
]
 
Count
n
[
Idx
n
]+1
5:
endfor
6:
for
Idx
ab
in
w:normal
do
7:
Count
ab
[
Idx
ab
]
 
Count
ab
[
Idx
ab
]+1
8:
endfor
9:
endfor
10:
C
[
i
]
 
(
Count
n
[
i
]
<Count
ab
[
i
])foreach
i
11:
Output:
C
60
ProofofEquation2
Let
P
=
P
max

[
I
]
i
1
=
2
+[
I
]
,[
I
]=
i
(
t
)and
P
=
II
(
t
)

i
(
t
),wehave
II
(
t
)

i
(
t
)=
P
max

i
(
t
)
i
1
=
2
+
i
(
t
)
,so
that:

II
(
t
)=
P
max
i
1
=
2
+
i
(
t
)
Giventheof
P
max
(themaximumpotentialphotosyntheticrateperindividual)
and
i
1
=
2
(thelightintensityatwhichthephotosyntheticrateproceedsathalf
P
max
),we
have:
P
max
2
=

II
)
2

i
1
=
2
Bycombiningbothequations,wehave:

II
(
t
)=

II
)
1+
i
(
t
)
i
1
=
2
Done.
PerformancecomparisonofMCCandtruepositiverate
onsyntheticdatasets
ThissectionpresentstheperformancecomparisonofDynamicFilter,SimpleStatisticsand
HampelFilteronentsyntheticdatasetsgeneratedunderdtsettings.The
resultsareshowninFigure10ofthepaper,andthenumericalvaluesareinTable1and
Table2.
61
TheperformancecomparedareMatthewsCorrelationCotandTruePositiveRate
onabnormalitiesandonbiologicaloutliers.Heretheoutliersarevaluesgeneratedwith
half-lightfactorstlytfromtheaveragedone.
Thereare9groupsoftsettings,witheachgroupcontaining7tcases
varyingthedeviationdegreeofabnormalities.Detailsofthe9groupsofsettingsare:
1.
Lightintensitygeneratedfrom3frequencyFFT,halflightfactorchangessmoothly
alongtime,abnormalitiespointsaggregate,abnormalitiesrateis7.8%.
2.
Lightintensitygeneratedfrom3frequencyFFT,halflightfactorchangessmoothly
alongtime,abnormalitiespointsoccurseveryotherpoint,abnormalitiesrateis7.8%.
3.
Lightintensitygeneratedfrom3frequencyFFT,halflightfactortesalongtime,
abnormalitiespointsaggregate,abnormalitiesrateis7.8%.
4.
Lightintensitygeneratedfrom1frequencyFFT,halflightfactorchangessmoothly
alongtime,abnormalitiespointsaggregate,abnormalitiesrateis7.8%.
5.
Lightintensitygeneratedfrom3frequencyFFT,halflightfactorchangessmoothly
alongtime,abnormalitiespointsaggregate,abnormalitiesrateis14.1%.
6.
Lightintensitygeneratedfrom1frequencyFFT,halflightfactorchangessmoothly
alongtime,abnormalitiespointsaggregate,abnormalitiesrateis14.1%.
7.
Lightintensitygeneratedfrom3frequencyFFT,halflightfactortesalongtime,
abnormalitiespointsaggregate,abnormalitiesrateis14.1%.
62
8.
Lightintensitygeneratedfrom3frequencyFFT,halflightfactortesalongtime,
abnormalitiespointsaggregate,abnormalitiesrateis14.1%.
9.
Lightintensitygeneratedfrom3frequencyFFT,halflightfactortesalongtime,
abnormalitiespointsoccureveryotherpoint,abnormalitiesrateis7.8%.
Table1:Performancecomparisonon63typesofsyntheticdatasets.Eachpairofthescore
isaMCCscoreanditsstandarddeviation.Boldfontindicatesthebestperformanceineach
dataset.
Dataset
SimpleStatistics
HampelFilter
DynamicFilter
outlierStdabnormalityStd
outlierStdabnormalityStd
outlierStdabnormalityStd
1
0.860.030.700.05
0.910.060.640.12
0.99
0.00
0.93
0.09
2
0.890.000.700.02
0.950.020.710.12
0.99
0.01
0.95
0.04
3
0.890.010.720.04
0.950.030.800.05
1.00
0.01
0.99
0.02
4
0.920.010.710.03
0.930.040.720.09
1.00
0.00
0.98
0.01
5
0.920.010.700.02
0.960.030.700.08
1.00
0.00
0.99
0.01
6
0.900.010.730.05
0.980.020.810.08
0.99
0.01
0.97
0.02
7
0.880.010.720.03
0.940.030.730.07
0.99
0.01
0.98
0.02
8
0.870.030.690.03
1.00
0.00
0.99
0.00
0.990.010.950.04
9
0.870.030.710.03
1.00
0.00
0.99
0.01
0.990.010.940.05
10
0.890.010.740.02
0.990.010.970.02
1.00
0.00
0.98
0.01
11
0.900.010.730.04
1.00
0.00
0.99
0.01
1.000.000.980.01
12
0.900.020.730.06
0.99
0.01
0.99
0.01
1.000.010.970.02
13
0.880.020.750.03
1.00
0.00
0.99
0.01
1.000.000.980.01
14
0.880.020.710.05
1.00
0.00
1.00
0.01
1.000.000.990.02
15
0.850.020.630.03
0.780.090.270.08
0.96
0.02
0.76
0.09
16
0.870.010.670.05
0.760.100.440.11
0.96
0.04
0.90
0.07
17
0.890.050.640.04
0.780.060.380.13
0.98
0.02
0.93
0.02
18
0.900.010.750.05
0.860.070.620.11
0.99
0.01
0.98
0.02
19
0.910.020.750.05
0.880.070.540.18
0.98
0.02
0.97
0.04
20
0.920.030.700.03
0.820.070.500.11
0.97
0.02
0.93
0.04
21
0.890.030.720.02
0.850.030.470.05
0.99
0.02
0.97
0.03
22
0.830.030.690.02
0.900.020.730.06
0.96
0.04
0.89
0.08
23
0.840.040.730.03
0.910.020.770.05
0.99
0.01
0.98
0.02
24
0.860.010.760.01
0.940.020.840.07
1.00
0.00
1.00
0.00
25
0.900.010.770.02
0.960.010.920.02
1.00
0.00
0.99
0.01
26
0.900.010.820.02
0.970.010.940.02
0.99
0.01
0.95
0.04
27
0.890.010.820.02
0.970.020.930.03
1.00
0.00
1.00
0.01
28
0.900.020.830.02
0.970.020.940.03
0.97
0.04
0.94
0.08
29
0.710.010.560.03
0.930.030.730.05
0.97
0.02
0.94
0.05
30
0.720.010.580.03
0.910.030.730.06
0.99
0.01
0.98
0.01
31
0.760.030.550.06
0.950.020.800.06
0.98
0.01
0.98
0.02
32
0.750.040.580.05
0.960.010.870.07
1.00
0.00
0.99
0.00
33
0.760.030.560.07
0.980.010.890.04
0.99
0.01
0.99
0.01
34
0.740.030.580.06
0.960.010.880.04
0.99
0.02
0.97
0.03
35
0.760.060.530.05
0.950.020.850.02
0.98
0.01
0.97
0.02
36
0.670.010.580.03
0.930.010.740.02
0.98
0.02
0.93
0.09
37
0.700.020.620.02
0.930.010.790.04
0.99
0.02
0.99
0.02
38
0.700.030.670.02
0.950.030.840.03
0.99
0.02
0.98
0.02
39
0.720.030.720.02
0.960.010.850.04
0.99
0.02
0.98
0.02
40
0.750.030.730.01
0.970.010.940.02
1.00
0.00
1.00
0.01
41
0.770.030.720.02
0.980.010.950.03
0.99
0.02
0.99
0.02
42
0.780.020.730.03
0.980.010.920.06
1.00
0.00
0.97
0.05
63
Table.1(cont'd)
Dataset
SimpleStatistics
HampelFilter
DynamicFilter
outlierStdabnormalityStd
outlierStdabnormalityStd
outlierStdabnormalityStd
43
0.730.030.440.07
0.640.080.320.04
0.91
0.02
0.89
0.01
44
0.740.090.490.06
0.750.050.380.03
0.94
0.03
0.91
0.04
45
0.760.040.450.04
0.740.030.430.06
0.97
0.01
0.95
0.03
46
0.730.040.500.05
0.750.060.460.07
0.98
0.02
0.93
0.03
47
0.690.050.570.05
0.840.040.590.07
0.99
0.01
0.96
0.03
48
0.710.050.540.05
0.810.070.550.07
0.96
0.02
0.93
0.02
49
0.720.070.530.07
0.730.070.470.06
0.98
0.01
0.95
0.03
50
0.650.070.500.07
0.670.100.360.06
0.91
0.06
0.71
0.09
51
0.610.030.510.02
0.650.040.400.05
0.93
0.03
0.86
0.05
52
0.760.030.480.03
0.710.020.460.06
0.94
0.03
0.93
0.03
53
0.690.050.520.06
0.660.060.450.08
0.98
0.01
0.96
0.02
54
0.730.050.620.07
0.780.080.590.10
0.99
0.02
0.97
0.04
55
0.750.010.540.02
0.820.050.440.09
0.98
0.01
0.98
0.01
56
0.750.050.510.06
0.720.070.380.03
0.97
0.02
0.94
0.03
57
0.850.050.620.05
0.900.040.730.04
0.99
0.01
0.77
0.10
58
0.880.030.630.03
0.930.020.790.04
0.98
0.01
0.94
0.04
59
0.880.020.680.02
0.940.020.820.04
0.97
0.02
0.86
0.04
60
0.900.020.700.02
0.950.020.830.04
0.99
0.01
0.95
0.02
61
0.900.030.750.05
0.970.010.900.04
0.98
0.02
0.95
0.04
62
0.900.020.740.02
0.950.020.840.04
0.99
0.01
0.94
0.04
63
0.910.020.700.03
0.960.020.860.03
0.99
0.01
0.96
0.02
64
Table2:Performancecomparisonon63typesofsyntheticdatasets.Eachpairofthescore
isaTPRscoreanditsstandarddeviation.
SimpleStatistics
HampelFilter
DynamicFilter
bio
outlierStdabnormalityStd
bio
outlierStdabnormalityStd
bio
outlierStdabnormalityStd
0.850.030.900.03
0.840.060.920.05
0.970.051.000.01
0.850.030.900.05
0.870.060.880.06
0.980.020.990.01
0.870.030.880.04
0.920.020.930.04
0.990.011.000.00
0.880.020.830.05
0.860.060.870.03
0.990.010.990.01
0.860.060.840.03
0.860.050.830.03
1.000.001.000.00
0.890.020.880.05
0.930.030.890.04
0.990.010.990.02
0.880.010.860.05
0.880.050.870.04
0.990.011.000.01
0.830.030.910.05
1.000.001.000.00
0.970.021.000.01
0.850.010.880.02
0.990.011.000.00
0.950.061.000.01
0.890.010.890.01
0.990.010.990.01
0.990.011.000.00
0.860.020.890.03
0.990.001.000.00
1.000.000.980.02
0.860.030.900.07
0.990.011.000.01
0.980.010.990.01
0.880.020.900.03
0.990.011.000.00
0.990.010.990.01
0.850.030.900.04
1.000.001.000.00
0.990.011.000.01
0.790.050.880.06
0.580.100.630.07
0.860.110.870.10
0.820.030.880.04
0.660.070.730.09
0.960.040.950.08
0.770.050.880.04
0.640.100.700.10
0.950.021.000.00
0.870.040.900.03
0.810.060.830.09
0.990.011.000.01
0.870.050.870.07
0.710.110.820.08
0.980.021.000.00
0.850.040.820.05
0.710.090.770.08
0.970.020.990.02
0.850.020.900.04
0.680.050.830.05
0.980.021.000.00
0.810.030.950.04
0.880.030.940.04
0.950.031.000.01
0.830.030.980.03
0.900.020.960.03
0.990.011.000.00
0.860.010.960.01
0.930.030.970.02
1.000.001.000.00
0.890.020.940.04
0.960.011.000.00
1.000.001.000.00
0.900.010.990.02
0.970.011.000.00
0.960.041.000.00
0.890.011.000.00
0.970.021.000.00
1.000.001.000.00
0.900.021.000.00
0.970.021.000.00
0.970.041.000.00
0.710.010.800.02
0.850.040.910.02
0.970.020.950.08
0.720.010.810.04
0.860.020.900.05
0.980.011.000.00
0.730.020.790.06
0.920.020.900.04
0.980.011.000.01
0.730.020.820.07
0.940.020.950.03
1.000.001.000.00
0.710.010.810.08
0.940.020.940.02
0.990.010.990.01
0.730.020.800.05
0.930.020.950.02
0.980.030.990.01
0.720.020.740.04
0.930.010.920.02
0.980.010.990.01
0.670.010.840.04
0.900.010.940.02
0.970.030.940.12
0.700.020.860.03
0.910.020.930.02
0.990.021.000.00
0.700.030.930.03
0.940.030.940.02
0.990.021.000.00
0.720.030.980.01
0.950.010.980.03
0.990.021.000.00
0.750.030.970.04
0.970.011.000.00
1.000.001.000.00
0.780.020.910.06
0.980.011.000.00
0.990.021.000.00
0.780.020.950.04
0.980.010.990.02
1.000.000.960.08
0.610.030.700.03
0.560.080.600.09
0.910.020.970.02
0.650.040.710.03
0.600.040.660.13
0.940.020.950.08
0.610.020.750.05
0.630.030.700.04
0.960.010.980.04
0.670.030.690.07
0.660.030.670.06
0.940.021.000.01
0.640.040.830.08
0.760.060.750.02
0.960.031.000.00
0.670.050.770.09
0.750.070.720.05
0.940.020.990.02
0.650.030.770.05
0.650.060.700.02
0.970.020.980.02
65
Table.2(cont'd)
SimpleStatistics
HampelFilter
DynamicFilter
bio
outlierStdabnormalityStd
bio
outlierStdabnormalityStd
bio
outlierStdabnormalityStd
0.590.030.770.02
0.610.070.590.04
0.840.030.800.13
0.610.030.750.02
0.600.030.680.03
0.900.040.950.08
0.620.030.750.02
0.660.040.670.07
0.940.030.990.01
0.620.020.790.05
0.600.060.720.06
0.980.010.980.02
0.730.050.790.12
0.740.050.780.09
0.980.030.990.02
0.670.050.780.07
0.680.040.660.09
0.980.011.000.00
0.690.050.720.08
0.600.040.670.04
0.950.020.990.03
0.790.030.850.04
0.860.020.900.02
0.880.070.890.06
0.770.030.900.06
0.900.020.930.03
0.960.020.990.02
0.840.030.820.01
0.910.030.940.04
0.940.030.900.09
0.840.010.870.03
0.920.020.930.03
0.980.010.980.02
0.870.040.900.02
0.950.020.970.02
0.970.021.000.00
0.880.020.890.02
0.920.030.950.03
0.970.020.990.01
0.850.020.890.02
0.930.010.950.02
0.970.020.990.01
66
BIBLIOGRAPHY
67
BIBLIOGRAPHY
[1]
IAjjawi,YLu,LJSavage,SMBell,andRLLast.Large-scalereversegeneticsin
arabidopsis:Casestudiesfromthechloroplast2010project.
PlantPhysiol
,152(2):529{
540,2010.
[2]
IAjjawi,YLu,LJSavage,SMBell,andRLLast.Large-scalereversegeneticsinara-
bidopsis:casestudiesfromthechloroplast2010project.
Plantphysiology
,152(2):529{
540,2010.
[3]
JMAlonso,ANStepanova,TJLeisse,etal.Genome-wideinsertionalmutagenesisof
arabidopsisthaliana.
Science
,301(5633):653{657,2003.
[4]
NSAltman.Anintroductiontokernelandnearest-neighbornonparametricregression.
AmStat
,46(3):175{185,1992.
[5]
PBaldi,SBrunak,YChauvin,CAndersen,andHNielsen.Assessingtheaccuracy
ofpredictionalgorithmsforn:anoverview.
Bioinformatics
,16(5):412{424,
2000.
[6]
DMBatesandDGWatts.
Nonlinearregression:iterativeestimationandlinearapprox-
imations
.WileyOnlineLibrary,1988.
[7]
CMBishop.
Patternrecognitionandmachinelearning
.springer,2006.
[8]
WilliamMBolstad.
IntroductiontoBayesianstatistics
.JohnWiley&Sons,2013.
[9]
JBonner.Theupperlimitofcropyieldthisclassicalproblemmaybeanalyzedasone
ofthephotosyntheticcyofplantsinarrays.
Science
,137(3523):11{15,1962.
[10]
CACameronandFAGWindmeijer.Anr-squaredmeasureofgoodnessofforsome
commonnonlinearregressionmodels.
JournalofEconometrics
,77(2):329{342,1997.
[11]
TCChouandPTaLaLay.Generalizedequationsfortheanalysisofinhibitionsof
michaelis-mentenandhigher-orderkineticsystemswithtwoormoremutuallyexclusive
andnonexclusiveinhibitors.
EuropeanJournalofBiochemistry
,115(1):207{216,1981.
68
[12]
FChu,YWang,DSParker,andCZaniolo.Datacleaningusingbeliefpropagation.
IQ2S
,pages99{104,2005.
[13]
XChu,IFIlyas,andPPapotti.Holisticdatacleaning:Puttingviolationsintocontext.
ICDE
,pages458{469,2013.
[14]
JACruz,LJSavage,RZegarac,WKovac,CHall,JChen,andDMKramer.Dynamic
environmentalphotosyntheticimaging(depi)revealsemergentphenotypesrelatedto
theenvironmentalresponsesofphotosynthesis.
NatBiotech
,inrevision,2014.
[15]
FdeA.Lobo,MPdeBarros,HJDalmagro,etal.Fittingnetphotosyntheticlight-
responsecurveswithmicrosoftexcel
^
A<
^
A
a
acriticallookatthemodels.
Photosynthetica
,
51(3):445{456,2013.
[16]
JEDowdandDSRiggs.Acomparisonofestimatesofmichaelis-mentenkineticcon-
stantsfromvariouslineartransformations.
JournalofbiologicalChemistry
,240(2):863{
869,1965.
[17]
AEbaid,AElmagarmid,IFIlyas,MOuzzani,JAQuiane-Ruiz,NTang,andSYin.
Nadeef:ageneralizeddatacleaningsystem.
VLDBEndowment
,6(12):1218{1221,2013.
[18]
PEilersandJPeeters.Amodelfortherelationshipbetweenlightintensityandthe
rateofphotosynthesisinphytoplankton.
Ecologicalmodelling
,42(3):199{215,1988.
[19]
FFioraniandUSchurr.Futurescenariosforplantphenotyping.
AnnuRevPlantBiol
,
64:267{291,2013.
[20]
DAFreedman.
Statisticalmodels:theoryandpractice
.CambridgeUniversityPress,
2009.
[21]
QGao,EOstendorf,JACruz,RJin,DMKramer,andJChen.Inter-functionalanalysis
ofhigh-throughputphenotypedatabynonparametricclusteringanditsapplicationin
photosynthesis.
Bioinformatics
,btv515,2015.
[22]
AGelman.Priordistributionsforvarianceparametersinhierarchicalmodels(comment
onarticlebybrowneanddraper).
Bayesiananalysis
,1(3):515{534,2006.
[23]
Govindjee,BeattyJT,GestH,andAllenJF.
DiscoveriesinPhotosynthesis
.Springer,
2005.
69
[24]
JMGreen,HAppel,EMRehrig,JHarnsomburana,JFChang,PBalint-Kurti,and
CRShyu.Phenophyte:amethodtoquantify2dphenotypesfrom
imagery.
PlantMethods
,8(1):1{12,2012.
[25]
DKGroˇkinsky,JSvensgaard,SChristensen,andTRoitsch.Plantphenomicsandthe
needforphysiologicalphenotypingacrossscalestonarrowthegenotype-to-phenotype
knowledgegap.
Journalofexperimentalbotany
,66(18):5429{5440,2015.
[26]
MRGupta,EKGarcia,andEChin.Adaptivelocallinearregressionwithapplication
toprintercolormanagement.
ImageProcessing,IEEETransactionson
,17(6):936{945,
2008.
[27]
KGHerbert,NHGehani,WHPiel,JTLWang,andCHWu.Bio-ajax:anextensible
frameworkforbiologicaldatacleaning.
ACMSIGMODRecord
,33(2):51{57,2004.
[28]
PWHollandandREWelsch.Robustregressionusingiterativelyreweightedleast-
squares.
CommunicationsinStatistics-TheoryandMethods
,6(9):813{827,1977.
[29]
DHoule,DRGovindaraju,andSOmholt.Phenomics:thenextchallenge.
NatRev
Genet
,11(12):855{866,2010.
[30]
ADJassbyandTPlatt.Mathematicalformulationoftherelationshipbetweenphoto-
synthesisandlightforphytoplankton.
AmericanSocietyofLimnologyandOceanogra-
phy
,pages540{547,1976.
[31]
JLYKoh,MLLee,WHsu,andKTLam.Correlation-baseddetectionofattribute
outliers.In
AdvancesinDatabases:Concepts,SystemsandApplications
,pages164{
175.2007.
[32]
DMKramerandJREvans.Theimportanceofenergybalanceinimprovingphotosyn-
theticproductivity.
Plantphysiol
,155(1):70{78,2011.
[33]
MKutsukake,XYMeng,NKatayama,NNikoh,HShibao,andTFukatsu.Aninsect-
inducednovelplantphenotypeforsustainingsociallifeinaclosedsystem.
Nature
communications
,3:1187,2012.
[34]
HLambers,FSChapinIII,andTLPons.Mechanismsofphotosyntheticoxygenevo-
lutionandfundamentalhypothesesofphotosynthesis.In
PlantPhysiologicalEcology
,
pages26{47.Springer,2008.
[35]
KLevenberg.Amethodforthesolutionofcertainnon{linearproblemsinleastsquares.
(2):164{168,1944.
70
[36]
XPLietal.Apigment-bindingproteinessentialforregulationofphotosyntheticlight
harvesting.
Nature
,403(6768):391{395,2000.
[37]
SPLongandJEallgren.Measurementofco2assimilationbyplantsintheand
thelaboratory.InD.O.Hall,J.M.O.Scurlock,H.R.ar-Nordenkampf,etal.,editors,
PhotosynthesisandProductioninaChangingEnvironment
,pages129{167.Chapman
andHall,1993.
[38]
HLMacIntyre,TMKana,TAnning,andRJGeider.Photoacclimationofphoto-
synthesisirradianceresponsecurvesandphotosyntheticpigmentsinmicroalgaeand
cyanobacteria.
JPhycol
,38(1):17{38,2002.
[39]
JIMaleticandAMarcus.Datacleansing:Beyondintegrityanalysis.In
IQ
,pages
200{209,2000.
[40]
CMaJNeville,andSPrabhakar.Eracer:adatabaseapproachforstatistical
inferenceanddatacleaning.
ACMSIGMOD
,pages75{86,2010.
[41]
GMcLachlan.
Discriminantanalysisandstatisticalpatternrecognition
,volume544.
JohnWiley&Sons,2004.
[42]
LMentenandMIMichaelis.Diekinetikderinvertinwirkung.
BiochemZ
,49:333{369,
1913.
[43]
HMotulskyandAChristopoulos.
Fittingmodelstobiologicaldatausinglinearand
nonlinearregression:apracticalguidetocurve
.OxfordUniversityPress,2004.
[44]
HMullerandJCFreytag.
Problems,methods,andchallengesincomprehensivedata
cleansing
.ProfessorendesInst.FurInformatik,2005.
[45]
P.Muller,X.Li,andK.K.Niyogi.Non-photochemicalquenching.aresponsetoexcess
lightenergy.
PlantPhisiol
,125(4):1558{66,2001.
[46]
ABNicotra,OKAtkin,SPBonser,AMDavidson,EJFinnegan,UMathesius,PPoot,
MDPurugganan,CLRichards,FValladares,etal.Plantphenotypicplasticityina
changingclimate.
Trendsinplantscience
,15(12):684{692,2010.
[47]
E.

OgrenandJ.R.Evans.Photosyntheticlight-responsecurves.
Planta
,189(2):182{190,
1993.
71
[48]
RKPearson.Outliersinprocessmodelingandiden
IEEETContrSystT
,
10(1):55{63,2002.
[49]
RKPearson.
Miningimperfectdata:Dealingwithcontaminationandincompleterecords
.
Siam,2005.
[50]
TDPrice,AQvom,andDEIrwin.Theroleofphenotypicplasticityindriving
geneticevolution.
ProceedingsoftheRoyalSocietyofLondonB:BiologicalSciences
,
270(1523):1433{1440,2003.
[51]
CRRao.Theutilizationofmultiplemeasurementsinproblemsofbiological
tion.
JRStatSoc
,10(4):159{203,1948.
[52]
URascher,SBlossfeld,FFiorani,etal.Non-invasiveapproachesforphenotypingof
enhancedperformancetraitsinbean.
FunctPlantBiol
,38(12):968{983,2011.
[53]
DReynolds.Gaussianmixturemodels.
EncyclopediaofBiometrics
,pages659{663,
2009.
[54]
GSeberandCWild.
NonlinearRegression
.Wiley-Interscience,2003.
[55]
JSer^odioandJLavaud.Amodelfordescribingthelightresponseofthenonphoto-
chemicalquenchingofchlorophylluorescence.
Photosynthesisresearch
,108(1):61{76,
2011.
[56]
MShanahan.Perceptionasabduction:Turningsensordataintomeaningfulrepresen-
tation.
CognitiveSci
,29(1):103{134,2005.
[57]
HSohn,DWAllen,KWorden,andCRFarrar.Structuraldamageionusing
extremevaluestatistics.
Jdynsyst-tasme
,127(1):125{132,2005.
[58]
SSubramaniam,TPalpanas,DPapadopoulos,VKalogeraki,andDGunopulos.Online
outlierdetectioninsensordatausingnon-parametricmodels.
VLDB
,pages187{198,
2006.
[59]
RSubramanian,EPSpalding,andNJFerrier.Ahighthroughputrobotsystemfor
machinevisionbasedplantphenotypestudies.
MachineVisionandApplications
,
24(3):619{636,2013.
72
[60]
KTakizawa,JACruz,AKanazawa,andDMKramer.Thethylakoidprotonmotive
forceinvivo.quantitative,non-invasiveprobes,energetics,andregulatoryconsequences
oflight-inducedpmf.
Biochim.Biophys
,1767(10):1233{44,2007.
[61]
OLTessmer,YJiao,JACruz,DMKramer,andJChen.Functionalapproachto
high-throughputplantgrowthanalysis.
BMCSystBiol
,7(Suppl6):S17,2013.
[62]
OLTessmer,YJiao,JACruz,DMKramer,andJChen.Functionalapproachto
high-throughputplantgrowthanalysis.
BMCSystemBiology
,7(Suppl6):S17,2013.
[63]
CThrampoulidis,SOymak,andBHassibi.Regularizedlinearregression:Aprecise
analysisoftheestimationerror.In
ProceedingsofThe28thConferenceonLearning
Theory
,pages1683{1709,2015.
[64]
JVlasblom,KZuberi,HRodriguez,RArnold,AGagarinova,VDeineko,AKumar,
ELeung,KRizzolo,BSamanfar,etal.Novelfunctiondiscoverywithgenemania:a
newintegratedresourceforgenefunctionpredictioninescherichiacoli.
Bioinformatics
,
pagebtu671,2014.
[65]
SVonCaemmererandGDFarquhar.Somerelationshipsbetweenthebiochemistryof
photosynthesisandthegasexchangeofleaves.
Planta
,153(4):376{387,1981.
[66]
AWalter,FLiebisch,andAHund.Plantphenotyping:frombeanweighingtoimage
analysis.
Plantmethods
,11(1):14,2015.
[67]
AndrewR.Webb.
Fisher
^
A<
^
criterion
^
lineardiscriminantanalysis
.Wiley,2002.
[68]
MLWong,CDong,VAndreev,MArcos-Burgos,andJLicinio.Predictionofsuscep-
tibilitytomajordepressionbyamodelofinteractionsofmultiplefunctionalgenetic
variantsandenvironmentalfactors.
Molecularpsychiatry
,17(6):624{633,2012.
[69]
L.Xu,JA.Cruz,L.Savage,DM.Kramer,andJ.Chen.Plantphotosynthesisphenomics
dataqualitycontrol.
Bioinformatics
,31(11):1796{1804,2015.
[70]
JYangandJLeskovec.Patternsoftemporalvariationinonlinemedia.In
Proceedings
ofthefourthACMinternationalconferenceonWebsearchanddatamining
,pages
177{186,2011.
[71]
XYin,XLiu,JChen,andDMKramer.Multi-leafalignmentfromplant
images.In
ApplicationsofComputerVision(WACV),2014IEEEWinterConference
on
,pages437{444,2014.
73
[72]
XYin,XLiu,JChen,andDMKramer.Multi-leaftrackingfromplant
videos.In
ImageProcessing(ICIP),2014IEEEInternationalConferenceon
,pages
408{412,2014.
[73]
YZeinalov.Mechanismsofphotosyntheticoxygenevolutionandfundamentalhypothe-
sesofphotosynthesis.
[74]
XZhu,StephenP.Long,andDonaldR.Ort.Improvingphotosyntheticciencyfor
greateryield.
AnnuRevPlantBiol
,61:235{261,2010.
74