A SIMULATION STUDY FOR EVALUATING THE AREA UNDER THE ROC CURVE AND THE ERROR RATE IN BINARY CLASSIFICATIONS

By

Qinhua Huang

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Biostatistics - Master of Science

2015

ABSTRACT

The area under the ROC curve (AUC) and the error rate are two important criteria designed to measure the performance of classifiers. The maximum AUC and the minimum error rate indicate the best classifier. However, one cannot obtain the minimum error rate and the maximum AUC simultaneously under the same classifier. It is thus of interest to investigate the relationship between the AUC and the error rate. Studying this relationship, Cortes and Mohri (2004) provided an expression for the expected value of the AUC for a given error rate. In this thesis, I study the validity of the expression given by Cortes and Mohri (2004); after that, I investigate the error rate distribution under a fixed range of AUC.

My results show that Cortes and Mohri's expression is not valid in some specific situations, and that the expected average value of the AUC is always smaller than the estimate of the AUC from Monte Carlo samples. When the proportion of positive samples is not close to 0.5, the expected average value of the AUC calculated by Cortes and Mohri's expression deviates largely from the Monte Carlo samples of the AUC. This indicates that the expression for the expected average value of the AUC for a given error rate may not be accurate and should be used with caution. I also provide useful information on the quantiles of the error rate for a given fixed range of AUC, with the proportion of positive samples varying in [0.1, 0.5].

Copyright by QINHUA HUANG 2015

ACKNOWLEDGMENTS

I wish to express my sincere thanks to all my graduate and undergraduate professors. I am extremely thankful and indebted to them for sharing their expertise, and for the sincere and valuable guidance and encouragement extended to me. I also thank my parents for their unceasing encouragement, support and attention. I am also grateful to my friends who supported me through this venture. I also place on record my sense of gratitude to one and all who, directly or indirectly, have lent their hand in this venture.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Classifier performance
  1.4 Definition of ROC curve
  1.5 Definition of AUC
Chapter 2 Literature Review
  2.1 Methods of Classification
  2.2 Development of ROC Curve Analysis
  2.3 Study of Investigating the Relationship between AUC and the Misclassification Rate
Chapter 3 Different Approaches to Address the Relationship between the AUC and the Misclassification Rate
  3.1 The Expected Value of the AUC under Fixed Error Rate
  3.2 Simulating the AUC Distribution under Fixed Error Rate
    3.2.1 Generating Binary Distribution
    3.2.2 Using Logistic Regression as a Classifier
    3.2.3 Estimate of AUC and Expected AUC Calculation
    3.2.4 Comparing the Estimate of AUC versus Expected Average AUC
  3.3 Study the Error Rate Distribution Under the Fixed AUC
    3.3.1 CDF and PDF Plots for Error Rate Under the Fixed Range AUC
Chapter 4 Conclusion
  4.1 Summary
  4.2 Limitation
  4.3 Discussion
BIBLIOGRAPHY
LIST OF TABLES

Table 1.1: Classifier Performance
Table 3.1: Difference between estimate of AUC and expected average AUC (r=10%)
Table 3.2: Difference between estimate of AUC and expected average AUC (r=20%)
Table 3.3: Difference between estimate of AUC and expected average AUC (r=30%)
Table 3.4: Difference between estimate of AUC and expected average AUC (r=40%)
Table 3.5: Difference between estimate of AUC and expected average AUC (r=50%)
Table 3.6: The Difference between estimate of AUC and Expected Average AUC
Table 3.7: Descriptive Statistics of Error Rate under Fixed AUC (r=10%)
Table 3.8: Descriptive Statistics of Error Rate under Fixed AUC (r=20%)
Table 3.9: Descriptive Statistics of Error Rate under Fixed AUC (r=30%)
Table 3.10: Descriptive Statistics of Error Rate under Fixed AUC (r=40%)
Table 3.11: Descriptive Statistics of Error Rate under Fixed AUC (r=50%)

LIST OF FIGURES

Figure 1.1: ROC curve
Figure 3.1: AUC Expectation (m=100, n=900)
Figure 3.2: CDF of estimate of AUC (r=10%)
Figure 3.3: PDF of estimate of AUC (r=10%)
Figure 3.4: CDF of estimate of AUC (r=20%)
Figure 3.5: PDF of estimate of AUC (r=20%)
Figure 3.6: CDF of estimate of AUC (r=30%)
Figure 3.7: PDF of estimate of AUC (r=30%)
Figure 3.8: CDF of estimate of AUC (r=40%)
Figure 3.9: PDF of estimate of AUC (r=40%)
Figure 3.10: CDF of estimate of AUC (r=50%)
Figure 3.11: PDF of estimate of AUC (r=50%)
Figure 3.12: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=10%)
Figure 3.13: Error Rate Distribution Under Fixed AUC (r=10%)
Figure 3.14: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=20%)
Figure 3.15: Error Rate Distribution Under Fixed AUC (r=20%)
Figure 3.16: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=30%)
Figure 3.17: Error Rate Distribution Under Fixed AUC (r=30%)
Figure 3.18: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=40%)
Figure 3.19: Error Rate Distribution Under Fixed AUC (r=40%)
Figure 3.20: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=50%)
Figure 3.21: Error Rate Distribution Under Fixed AUC (r=50%)
Figure 3.22: Error Rate Distribution Under Fixed AUC (r=50%)

Chapter 1

Introduction

1.1 Background

Classification is a common task in many fields of application, such as health care, genetic analysis, and computer science. For example, Planet et al. (2001) proposed a molecular key-based classification method to classify putative NTPase genes precisely [2]. Another example is given by Pang et al. (2002), who studied how to classify documents by sentiment using machine learning techniques including Naive Bayes, maximum entropy classification and support vector machines [3]. Finally, Gorno-Tempini et al. (2011) studied how to classify primary progressive aphasia and its three main variants [1].

After classification, it is important to study the accuracy of the classifications. Two criteria are commonly used to measure the performance of classifiers: the error rate and the area under the receiver operating characteristic (ROC) curve (AUC). Recently, some researchers have pointed out that the AUC may be a more pertinent measure of classification performance than the error rate [4].

The ROC curve is a plot that shows the performance of a binary classifier system as its discrimination threshold is varied, so it can be used to select classifiers based on their performance. It has been used for a long time, and has been extended to visualize and analyze the behavior of diagnostic systems [43]. In addition, an increasing number of medical decisions are made on the basis of ROC graphs, and ROC curves are increasingly used in the machine learning community because of the realization that the error rate alone is not accurate enough to measure classifier performance [22]. Apart from being mainly a performance graphing method, the ROC graph also has properties that make it very useful for estimating error costs under skewed class distributions. These properties have become more and more important as research on cost-sensitive learning has gained attention [16].
1.2 Motivation

The area under the ROC curve (AUC) and the misclassification error rate are two important criteria designed to measure the performance of classifiers. For instance, Simon et al. (2003) used the misclassification rate to measure the performance of class prediction for DNA microarray data [42]. Golub et al. (1999) used the cumulative error rate to assess the accuracy of a gene expression based classifier for cancer [39]. Wang et al. (2007) also used the overall error rate to assess their classifier for rapid assignment of rRNA sequences into higher taxonomy [40]. Another example is given by Krizhevsky et al. (2012), who used the error rate to measure their classifier, which was designed to classify high-resolution images in the ImageNet LSVRC-2010 contest [9]. Furthermore, Statnikov et al. (2008) used the AUC to compare random forests and support vector machines for cancer classification based on microarray data [34]. Lee et al. (2008) used the AUC to evaluate the performance of a new classification method based on pathway activities inferred for each patient [35]. Finally, Ma et al. (2005) proposed a new method that uses a sigmoid approximation to the AUC as an objective function to select and classify biomarkers [8].

The most common measures of performance in a classification exercise are the error rate and the AUC. However, one cannot obtain the minimum error rate and the maximum AUC simultaneously under the same classifier. It is thus of interest to investigate the relationship between the AUC and the misclassification rate. Cortes and Mohri (2004) have provided an expression for the expected average value and the variance of the AUC given a fixed error rate. However, the authors warned that these equations require the classifications or rankings with $k$ errors to be equiprobable.
By equiprobable, they mean situations in which each test sample has an equal probability of being misclassified [4]. In this thesis, I study the expression provided by Cortes and Mohri (2004) and point out that the expression is inappropriate in some specific situations. I also conduct a simulation experiment to investigate the relationship between the estimate of the AUC and the estimate of the error rate. I conduct this experiment by simulating binary data and using logistic regression with a threshold as a classifier. I assume that the threshold for the classifier follows a uniform distribution from 0 to 1. Then I calculate the estimate of the AUC and the estimate of the error rate for each classification. To investigate how the estimate of the error rate is distributed within fixed ranges of the estimate of the AUC, I draw cumulative distribution function (CDF) plots and probability density function (PDF) plots of the estimate of the error rate.

1.3 Classifier performance

In this thesis, I study binary classification situations. In a binary classification exercise, every sample is assigned to the positive or the negative class. A classifier is used to predict which class a sample should be assigned to. Different classifiers produce different kinds of output to predict the sample's class: some produce discrete class labels, and others produce continuous outputs that are compared with a threshold. The threshold can vary from 0 to 1 for a binary classifier. If the output of a classifier is smaller than the threshold, the sample is classified as a negative; if the output of a classifier is larger than or equal to the threshold, the sample is classified as a positive.

Given a classification model and a test sample, there are four possible outcomes (Table 1.1). If the test sample is positive and is assigned correctly, it is a true positive; if it is positive but is assigned to negative, it is a false negative. If the test sample is negative and is assigned correctly, it is a true negative; if the test sample is negative but is assigned to positive, it is a false positive.

Table 1.1: Classifier Performance

                 Condition Positive   Condition Negative
Test Positive    True positive        False positive
Test Negative    False negative       True negative
Several functions of these classification outcomes are used to measure the performance of binary classifiers. The sensitivity (true positive rate) is estimated as

$$\text{True Positive Rate} = \frac{\text{True Positives}}{\text{Total Positives}}$$

The false positive rate is estimated as

$$\text{False Positive Rate} = \frac{\text{False Positives}}{\text{Total Negatives}}$$

The specificity is estimated as

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{False Positives} + \text{True Negatives}} = 1 - \text{False Positive Rate}$$

1.4 Definition of ROC curve

The ROC curve is a plot that demonstrates how a binary classifier performs. It is a two-dimensional graph with the true positive rate on the Y axis and the false positive rate on the X axis. The ROC curve illustrates the relationship between the false positive rate and the true positive rate. Figure 1.1 is a simple example of an ROC curve. Here, the diagonal represents the random classifier; in other words, the classification is conducted as a fair coin toss, and it is drawn as a reference. The point (0, 0), located at the lower left, represents the situation of never issuing a positive classification: the classifier commits no false positive errors but also gains no true positives. Conversely, the point (1, 1), located at the upper right corner, represents never issuing a negative classification. The point (0, 1) shows the perfect classification, with a false positive rate of zero and a true positive rate of one.

Intuitively, one point represents a better classification than another if it is located to the northwest of it, because it has a higher true positive rate and a lower false positive rate. Usually, classifiers that appear near the X axis on the left-hand side of an ROC graph are regarded as "conservative": they make positive classifications only with strong evidence, so they make few false positive errors, but their true positive rate is also low. Classifiers that appear on the upper right-hand side of an ROC graph are regarded as "liberal": they make positive classifications with weak evidence, which increases the true positive rate, but the high true positive rate comes with a high false positive rate.
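The rates defined above can be computed directly from the four outcome counts of Table 1.1. The following is a minimal sketch rather than code from the thesis: the function names `confusion_counts` and `rates` and the toy scores are illustrative assumptions, and a score is treated as positive when it is at least the threshold, as described in Section 1.3.

```python
from typing import Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN); a score >= threshold is classified as positive."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rates(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (true positive rate, false positive rate, specificity).

    Assumes at least one positive and one negative sample."""
    tpr = tp / (tp + fn)          # true positives / total positives
    fpr = fp / (fp + tn)          # false positives / total negatives
    return tpr, fpr, 1.0 - fpr    # specificity = 1 - false positive rate


# Toy data (hypothetical): classifier outputs and true classes.
scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for threshold in (0.5, 0.3):
    print(threshold, rates(*confusion_counts(scores, labels, threshold)))
```

Sweeping the threshold from above the largest score down to 0 moves the (false positive rate, true positive rate) point from (0, 0) to (1, 1), which is how the points of an ROC curve are obtained.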
Figure 1.1: ROC curve

1.5 Definition of AUC

As mentioned in the previous section, an ROC curve is a plot of the true positive rate as a function of the false positive rate. Reducing ROC performance from two dimensions to a single scalar value makes it easier to compare the performance of classifiers. The AUC, which is defined as the area under the ROC curve, is the most common way to summarize ROC performance. Since it is a portion of the area of the unit square, the value of the AUC always lies between 0 and 1. However, since the random classifier, which produces the diagonal line between (0, 0) and (1, 1), has an area of 0.5, no realistic classifier should have an AUC under 0.5.

The value of the AUC can be calculated by the expression given by Mann and Whitney (1947) and Wilcoxon (1945), which is called the Wilcoxon-Mann-Whitney statistic. The statistic is given by:

$$W = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} I(x_i > y_j)}{mn} \tag{1.1}$$

It is based on pairwise comparisons between a sample $x_i$, $i = 1, \ldots, m$, of a random variable $X$ and a sample $y_j$, $j = 1, \ldots, n$, of a random variable $Y$. We identify $x_1, x_2, \ldots, x_m$ as the classifier outputs for the $m$ positive samples, and $y_1, y_2, \ldots, y_n$ as the classifier outputs for the $n$ negative samples. The proof of this expression is based on the observation that the AUC value is exactly the probability $P(X > Y)$. So the AUC can be used as a measure based on pairwise comparisons between the classifications of the two classes. With a perfect ranking, all positive samples are ranked higher than the negative ones and AUC = 1.

Chapter 2

Literature Review

2.1 Methods of Classification

As a common task, classification has been studied in many settings. Researchers have studied various classification methods for different situations. Friedman (1989) studied how to use linear discriminant analysis and Fisher's linear discriminant method to classify multiple classes of samples [23]. Mika et al. (1999) stated that linear discriminant analysis is an appropriate method for classifying continuous observations, whereas discriminant correspondence analysis is more appropriate for classifying discrete variables [24]. Murthy (1998) studied how to apply decision tree methods in the machine learning area [25]. Decision trees are methods that classify samples by sorting them based on feature values. Each node in a decision tree stands for a feature of a sample to be classified, and each branch stands for a value that the node can assume [28].
Another well-known classifier is the Bayesian network. The naive Bayesian network is one of the simplest Bayesian networks. It is composed of a directed acyclic graph with one unobserved node and several observed nodes, together with the assumption that the observed nodes are independent of one another given the unobserved node (Good, 1950). Another statistical method for classification is instance-based learning. Mitchell (1997) indicated that instance-based learning algorithms delay the generalization process until classification is performed, and thus they are lazy-learning algorithms [27]. Although lazy-learning algorithms save time in the training phase, they require more time in the classification process [28].

2.2 Development of ROC Curve Analysis

The ROC curve first appeared during World War II, when it was developed by radar engineers to detect enemy objects. The ROC curve was then used in the field of psychology to account for the perceptual detection of stimuli. Since then, ROC analysis has become useful in many fields, such as medicine, radiology, biometrics and data mining research.

Metz (1978) discussed the basic principles of ROC analysis. He showed that ROC analysis combines the true positive fraction and the false positive fraction, and makes it easier to compare hypothetical tests based on basic classification performance [14]. To estimate the value of the AUC, Hanley et al. (1982) stated that the area under the ROC curve represents the probability that a randomly chosen positive sample is rated higher than a randomly chosen negative sample, and that this probability is the same quantity as that estimated by the nonparametric Wilcoxon statistic [21]. Moses (1993) proposed a construction for carrying out ROC analysis in four steps [29]. Bradley (1997) further investigated the use of ROC analysis as a measure of classifier performance in the area of machine learning algorithms. He stated that the AUC has many advantages compared to overall accuracy as a measure of classifier performance [18]. Metz et al. (1998) provided a new generalized method for ROC curve fitting. The new algorithm, named ROCKIT, conducts all analyses available in previous ROC software and provides a 95% confidence interval for each estimate [30].
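The link between the AUC and the Wilcoxon-Mann-Whitney statistic of Eq. (1.1) can be made concrete with a few lines of code. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function name `wmw_auc` and the toy scores are hypothetical, and ties contribute nothing, exactly as the strict indicator $I(x_i > y_j)$ in Eq. (1.1) prescribes.

```python
from typing import Sequence


def wmw_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Wilcoxon-Mann-Whitney statistic of Eq. (1.1).

    Counts the (positive, negative) pairs in which the positive sample
    receives the strictly larger output, divided by the number of pairs m*n."""
    m, n = len(pos_scores), len(neg_scores)
    wins = sum(1 for x in pos_scores for y in neg_scores if x > y)
    return wins / (m * n)


# Toy outputs (hypothetical): one positive sample is ranked below one negative.
positives = [0.9, 0.8, 0.35]
negatives = [0.4, 0.2, 0.1]
print(wmw_auc(positives, negatives))   # 8 of the 9 pairs are ordered correctly
```

The double loop costs m*n comparisons, which is fine for a sketch; a rank-based computation would be preferable for large samples.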
2.3 Study of Investigating the Relationship between AUC and the Misclassification Rate

In many classification exercises, researchers have chosen the misclassification rate to measure the performance of the classifier. For example, Kim et al. (2003) studied misclassification error rate estimation by bootstrap [36]. Another example is given by Och et al. (2003), who provided a new algorithm for the unsmoothed error count and studied different training criteria for statistical machine translation models in order to minimize the error rate [10]. Meanwhile, some researchers have proposed the area under the ROC curve as an alternative measure for evaluating classification models. Herschtal and Raskutti (2004) introduced a binary classifier called RankOpt that can optimise the AUC using gradient descent [7]. Agarwal (2005) studied generalization bounds for the AUC. In that paper, the authors defined the expected accuracy of a ranking function and derived distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function. Furthermore, they derived both a large deviation bound and a uniform convergence bound [31]. Thus it is of interest to study the error rate and the AUC together.

Cortes and Mohri (2004) conducted a statistical analysis to investigate how the AUC is related to the error rate. They derived an expression for the expected value of the AUC over all classifications with a fixed error rate. Given a fixed number of errors $k$, they pointed out that there are three situations: i) samples are classified correctly, ii) positive samples are misclassified as negative and iii) negative samples are misclassified as positive. They further computed the AUC for each situation, and provided an expression for the average value of the AUC given $k$ errors and $x$ false positive examples:

$$\langle AUC \rangle_x = 1 - \frac{\frac{x}{n} + \frac{k - x}{m}}{2} \tag{2.1}$$

In addition, they provided an expression for the variance of the AUC given $x$ false positive samples. One year later, they gave an expression for the confidence interval of the AUC given the number of errors $k$ and the numbers of positive and negative samples. Their analysis gives a good starting point for studying the relationship between the error rate and the AUC. However, the expressions given by Cortes and Mohri are only correct under the assumption that all classifications or rankings with $k$ errors are equiprobable, which means that each sample has the same probability of being misclassified. This condition is rarely met in realistic settings.
Chapter 3

Different Approaches to Address the Relationship between the AUC and the Misclassification Rate

3.1 The Expected Value of the AUC under Fixed Error Rate

Cortes and Mohri (2004) have provided an expression for the expected value of the AUC over all classifications with a fixed number of errors and compared it to the error rate.

Assume that the number of errors $k$ is fixed, and that a binary classification task with $m$ positive samples and $n$ negative samples is given. Under the assumption that all classifications or rankings with $k$ errors are equiprobable, the expected value of the AUC is given by Eq. (3.1) [4]:

$$\langle A \rangle_{m,n,k} = 1 - \frac{k}{m+n} - \frac{(n-m)^2 (m+n+1)}{4mn} \left( \frac{k}{m+n} - \frac{\sum_{x=0}^{k-1} \binom{m+n}{x}}{\sum_{x=0}^{k} \binom{m+n+1}{x}} \right) \tag{3.1}$$

which is equivalent to

$$\langle A \rangle = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} \left( 1 - \frac{\frac{x}{n} + \frac{k-x}{m}}{2} \right)}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}} \tag{3.2}$$

where $x$ is the number of false positive samples, $x'$ is the number of false negative samples, $N$ is the number of negative samples, and $N'$ is the number of positive samples. The proof of this expression is based on weighting expression (2.1) by the total number of possible classifications for a given $x$. Thus, there are $\binom{N}{x}$ possible ways of choosing $x$ false positive examples and $\binom{N'}{x'}$ possible ways of choosing $x'$ false negative examples. Here, the authors assumed the following condition: $0 \le x \le k$, and $x' = k - x$.

However, the authors did not consider the situation where the number of misclassifications $k$ is larger than the number of negative samples $n$; in that situation, the range of the number of false positives should be $0 \le x \le n$. Similarly, the number of false negatives $k - x$ cannot exceed $m$, which means $x \ge k - m$. Thus, $x$ should range over $[\max(0, k - m), \min(n, k)]$. If we still use expression (3.2) with $0 \le x \le k$ to calculate the expectation of the AUC when $k > n$ or $k > m$, the expectation of the AUC can be less than 0.5 or even negative.

For example, with 100 positive samples, 900 negative samples, and 200 misclassified samples, the expectation of the AUC is $-0.1429524$. This issue can also be seen by plotting the value of the AUC expectation calculated by expression (3.1). The plot of expression (3.1) with 100 positive examples and 900 negative examples is shown in Figure 3.1.
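Expression (3.1) is easy to evaluate numerically, which makes the problem described above visible without a plot. The following is a minimal sketch, assuming the closed form of expression (3.1) as written in this section; the function name `expected_auc` is a hypothetical choice, and exact rational arithmetic is used because the two binomial sums are far too large for floating point.

```python
from fractions import Fraction
from math import comb


def expected_auc(m: int, n: int, k: int) -> float:
    """Evaluate expression (3.1) for m positives, n negatives and k errors."""
    total = m + n
    # Sum of C(m+n, x) for x = 0..k-1 and of C(m+n+1, x) for x = 0..k.
    numerator = sum(comb(total, x) for x in range(k))
    denominator = sum(comb(total + 1, x) for x in range(k + 1))
    ratio = Fraction(numerator, denominator)
    error_rate = Fraction(k, total)
    weight = Fraction((n - m) ** 2 * (total + 1), 4 * m * n)
    return float(1 - error_rate - weight * (error_rate - ratio))


# With k larger than the number of positives the value leaves [0, 1].
for k in (0, 50, 100, 200):
    print(k, expected_auc(100, 900, k))
```

For balanced classes (m = n) the correction term vanishes and the sketch returns 1 - k/(m+n). For m = 100, n = 900 and k = 200 it returns a negative value, in line with the example quoted above.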
Figure 3.1: AUC Expectation (m=100, n=900)

Here, the red line is a reference line at AUC = 0. From the figure we can see that when the number of errors $k$ is much larger than the number of positive samples $m$, the expectation calculated by expression (3.1) is negative. As mentioned in previous sections, the AUC is the probability that a positive sample is ranked higher than a negative sample, which means that the AUC must be non-negative. In order to use expression (3.1) correctly, the assumption that the number of errors is less than both the number of negative samples and the number of positive samples should be added.

3.2 Simulating the AUC Distribution under Fixed Error Rate

The previous section showed that expression (3.1) is not valid when the number of errors $k$ is larger than the number of positive samples $m$ or the number of negative samples $n$. When $k$ is smaller than $\min(m, n)$, the derivation of expression (3.1) is correct. Since the assumption that each sample has the same probability of being misclassified is hard to achieve in realistic scenarios, I further conduct an extensive simulation experiment to evaluate the validity of this expression when $k \le \min(m, n)$ for moderate to large deviations from the equiprobable assumption.

3.2.1 Generating Binary Distribution

In order to investigate the validity of expression (3.1), I simulate binary data from a logistic regression model. Since expression (3.1) is conditioned on $m$, $n$ and $k$, I investigate five situations, corresponding to the ratio of positive samples $r = m/(m+n)$ being equal to 0.1, 0.2, 0.3, 0.4 and 0.5.

First, I generated 1000 values of each of $a_1, a_2, a_3, a_4, a_5 \sim N(0, 0.5)$, and I set $\beta_1 = 3$, $\beta_2 = 5.5$, $\beta_3 = 5$, $\beta_4 = 2.5$, $\beta_5 = 1$. Then I generated 1000 values $u \sim U[0, 1]$, independent of the $a_i$.
I define

$$q = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 a_1 + \beta_2 a_2 + \beta_3 a_3 + \beta_4 a_4 + \beta_5 a_5)\right)}$$

and label the examples 0 or 1 by the following rules:

if $u \ge q$ then $y = 0$
if $u < q$ then $y = 1$

To get binary data with 10% positive samples, I set $\beta_0 = 8$, and only use the simulated data sets that have 100 positive samples ($y = 1$) to conduct the simulation experiment. To get binary data with 20% positive samples, I set $\beta_0 = 4.5$, and only use the simulated data sets with 200 positive samples ($y = 1$) to conduct the simulation experiment. Similarly, I set $\beta_0 = 1.5, 1, 3.5$ to get the data sets with 30%, 40% and 50% positive samples, respectively. The reason I adjust $\beta_0$ is to adjust the probability $P(y = 1 \mid a)$. By adjusting $P(y = 1 \mid a)$, I can get data sets with around 10% to 50% positive samples.

3.2.2 Using Logistic Regression as a Classifier

Logistic regression is a direct probability model. It is a special case of the generalized linear model. For logistic regression, the conditional distribution of the binary data given the covariates is a Bernoulli distribution with a success probability bounded between 0 and 1. Thus one can use a binary logistic model to predict binary outcomes based on predictor variables.

Given a binary random variable $y$ and a vector of predictors $a$ (which may be continuous or discrete), logistic regression can be used to predict the success probability $P(y \mid a)$:

$$P(y = 1 \mid a) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{i=1}^{b} \beta_i a_i\right)\right)} \tag{3.3}$$

where $b$ is the number of predictors. $P(y \mid a)$ leads to a simple linear expression for classification. Generally, we assign the label $Y = 1$ if the following condition holds:

$$\frac{P(y = 0 \mid a)}{P(y = 1 \mid a)} < 1,$$

which is equivalent to

$$\exp\left(-\left(\beta_0 + \sum_{i=1}^{b} \beta_i a_i\right)\right) < 1.$$

After taking the natural logarithm of both sides, we assign $y = 1$ if

$$-\left(\beta_0 + \sum_{i=1}^{b} \beta_i a_i\right) < 0,$$

and assign $y = 0$ otherwise.
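The data-generating steps of Section 3.2.1 can be sketched as follows. This is an illustration under explicit assumptions, not the thesis's code: the names `simulate` and `simulate_with_positives` are hypothetical, 0.5 in $N(0, 0.5)$ is read as a variance, and the intercept is taken to be negative ($\beta_0 = -8$) because that is what gives roughly 10% positives under the standard logistic form of $q$; the sign should be treated as an assumption.

```python
import math
import random
from typing import List, Tuple

BETAS = (3.0, 5.5, 5.0, 2.5, 1.0)   # beta_1 .. beta_5 from Section 3.2.1
Row = Tuple[List[float], int]       # (covariates a_1..a_5, label y)


def simulate(beta0: float, n_samples: int = 1000, seed: int = 0) -> List[Row]:
    """Draw covariates, compute q and label y = 1 when u < q."""
    rng = random.Random(seed)
    sd = math.sqrt(0.5)             # assumption: 0.5 is the variance
    rows = []
    for _ in range(n_samples):
        a = [rng.gauss(0.0, sd) for _ in BETAS]
        eta = beta0 + sum(b * ai for b, ai in zip(BETAS, a))
        q = 1.0 / (1.0 + math.exp(-eta))
        u = rng.random()            # u ~ U[0, 1], independent of a
        rows.append((a, 1 if u < q else 0))
    return rows


def simulate_with_positives(beta0: float, target: int, n_samples: int = 1000,
                            seed: int = 0, max_tries: int = 5000) -> List[Row]:
    """Keep only a data set with exactly `target` positive samples."""
    for attempt in range(max_tries):
        rows = simulate(beta0, n_samples, seed + attempt)
        if sum(y for _, y in rows) == target:
            return rows
    raise RuntimeError("no data set with the requested number of positives")


data = simulate_with_positives(beta0=-8.0, target=100)
print(sum(y for _, y in data), "positives out of", len(data))
```

Rejecting data sets until the positive count matches exactly mirrors the description "only use the data sets which have 100 positive samples"; it fixes $m$ and $n$ so that expression (3.1) can be evaluated for each simulated classification.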
In the previous section, I simulated the binary data $(y, a)$. I then choose logistic regression as a binary classifier to classify the data. I choose $a_1, a_3, a_5$ as predictors, use expression (3.3) to get the success probability, and compare that probability with a threshold. I assume that the threshold follows a uniform distribution from 0 to 1, and I randomly select one threshold from $U(0, 1)$ for each classification. If the predicted probability is less than or equal to the threshold, then $\hat{y} = 0$; otherwise $\hat{y} = 1$.

3.2.3 Estimate of AUC and Expected AUC Calculation

As mentioned before, the AUC is the area under the ROC curve, a plot of the true positive rate as a function of the false positive rate. To calculate the area under the curve, I integrate under the ROC curve. Thus it is necessary to calculate the true positive rate and the false positive rate at each point of the ROC curve.

From my simulation, I obtain the true positive rate (TPR) and the false positive rate (FPR) by the following formulas:

$$TPR = \frac{\text{number of } (\hat{Y} = Y = 1)}{\text{number of } (Y = 1)}$$

$$FPR = \frac{\text{number of } (\hat{Y} = 1 \text{ and } Y = 0)}{\text{number of } (Y = 0)}$$

I choose 1000 points on the ROC curve and calculate the true positive rate and the false positive rate for each point. Then I calculate the integral under the ROC curve to get the AUC. Eq. (3.1), given by Cortes and Mohri, is the expected average AUC value under a fixed number of errors $k$, which means a fixed error rate. However, it is difficult to investigate the AUC distribution under every $k \in [0, \min(m, n)]$. So I investigate the AUC distribution under several small ranges of $k$ of width 10, and I compare the estimate of the AUC with the expected average AUC under each range. To calculate the expected average AUC, the number of errors $k$ needs to be known for each iteration. I define the number of errors $k$ as the number of samples with $\hat{Y}_i \ne Y_i$. Then I plug $m$, $n$, $k$ into expression (3.1) to get the expected average AUC.
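The estimation steps of Section 3.2.3 can be sketched as follows. This is a minimal illustration, not the thesis's code: the function names are hypothetical, the ROC curve is evaluated on an evenly spaced grid of thresholds, and the trapezoidal rule is an assumption, since the text does not state which integration rule was used. A sample is classified as positive when its predicted probability exceeds the threshold, as in Section 3.2.2.

```python
from typing import List, Sequence, Tuple


def roc_points(probs: Sequence[float], labels: Sequence[int],
               n_points: int = 1000) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs on a grid of thresholds in [0, 1].

    Assumes both classes are present in `labels`."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for i in range(n_points + 1):
        t = i / n_points
        tp = sum(1 for p, y in zip(probs, labels) if p > t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p > t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    points.sort()                    # order by FPR, then TPR
    return points


def trapezoid_auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the piecewise-linear curve through sorted (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def error_count(probs: Sequence[float], labels: Sequence[int],
                threshold: float) -> int:
    """Number of errors k: samples whose predicted label differs from y."""
    return sum(1 for p, y in zip(probs, labels)
               if (1 if p > threshold else 0) != y)


# Toy predicted probabilities (hypothetical) and true labels.
probs = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(trapezoid_auc(roc_points(probs, labels)), error_count(probs, labels, 0.5))
```

The AUC does not depend on the randomly drawn threshold, whereas the error count does; this is why a single fitted model yields one AUC estimate but many possible values of $k$.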
3.2.4 Comparing the Estimate of AUC versus Expected Average AUC

I have 1000 samples in total, of which $m$ are positive and $n$ are negative. I define the ratio of positive samples $r$ as $r = m/(m+n)$. I compare the estimate of the AUC and the expected average AUC under $r = 0.1, 0.2, 0.3, 0.4, 0.5$. First, I draw the empirical cumulative distribution function (CDF) plots of the AUC under specific ranges of $k$ to investigate the AUC distribution. After that, I draw the probability density function (PDF) plots of the estimate of the AUC under specific ranges of $k$, and add the lower bound and upper bound of the expected average AUC calculated by expression (3.1) as reference lines to compare the estimate of the AUC with the expected average AUC. Finally, I give descriptive statistics to show the difference between the estimate of the AUC and the expected average AUC under specific ranges of the number of errors $k$.

First, I look at the situation where $r = 10\%$, which means there are 100 positive samples and 900 negative samples. In this situation, I only investigate the distribution of the estimate of the AUC when $k \le 100$. I plot the CDF and PDF of the estimate of the AUC with the estimate of the error rate from 0.07 to 0.08, 0.08 to 0.09 and 0.09 to 0.1. In these ranges of error rate, I have 2,924, 70,609 and 312,565 estimates of the AUC, respectively.

Figure 3.2: CDF of estimate of AUC (r=10%)

Figure 3.2 shows that the estimate of the AUC takes higher values when the error rate is lower. The minimum and maximum of the estimate of the AUC with error rate from 0.07 to 0.08 are the largest among the three ranges. I then plot the PDF of the estimate of the AUC with error rate from 0.07 to 0.08, 0.08 to 0.09 and 0.09 to 0.1, and add the upper bound and lower bound of the expected average AUC calculated by expression (3.1). The PDF plot is shown in Figure 3.3. I then calculate the difference between the average estimate of the AUC and the bounds (upper and lower) corresponding to each range of error rate (Table 3.1) as a reference.
Figure 3.3: PDF of estimate of AUC (r=10%)

Table 3.1: Difference between estimate of AUC and expected average AUC (r=10%)

Error rate                               (0.07, 0.08]   (0.08, 0.09]   (0.09, 0.1]
Average estimate of AUC - lower bound    0.27612        0.31761        0.36258
Average estimate of AUC - upper bound    0.22388        0.2642         0.3079

Figure 3.3 indicates that the expected average AUC value is always lower than the estimate of the AUC. From Table 3.1 I find that when the error rate gets larger, the difference between the estimate of the AUC and the expected average AUC gets larger too.

When $r = 20\%$, there are 200 positive samples and 800 negative samples. In this case, I only investigate the distribution of the estimate of the AUC when $k \le 200$. I plot the CDF and PDF of the estimate of the AUC with the estimate of the error rate from 0.15 to 0.16, 0.17 to 0.18 and 0.19 to 0.2. In these ranges of error rate, I have 39,892, 100,487 and 110,114 estimates of the AUC, respectively. The CDF is shown in Figure 3.4.

Figure 3.4: CDF of estimate of AUC (r=20%)

From Figure 3.4 I can see that the estimate of the AUC has a similar distribution for error rates from 0.17 to 0.18 and from 0.19 to 0.20. I then draw the PDF plot of the estimates of the AUC to investigate the difference between the estimates of the AUC and the expected average AUC.

Figure 3.5: PDF of estimate of AUC (r=20%)

From Figure 3.5 I find that the expected average AUC is always lower than the estimates of the AUC. The estimates of the AUC have a similar unimodal distribution for error rates within $(0.17, 0.18]$ and $(0.19, 0.20]$. The estimate of the AUC with error rate from 0.15 to 0.16 is higher than the other two. The difference between the average estimate of the AUC and the bounds (upper and lower) corresponding to each range of error rate is shown in Table 3.2. The table shows that when the error rate gets larger, the difference between the estimate of the AUC and the expected average AUC gets larger too.
Table 3.2: Difference between estimate of AUC and expected average AUC (r=20%)

Error rate                               (0.15, 0.16]   (0.17, 0.18]   (0.19, 0.20]
Average estimate of AUC - lower bound    0.23484        0.28004        0.30864
Average estimate of AUC - upper bound    0.20752        0.25125        0.27902

When $r = 30\%$, there are 300 positive samples and 700 negative samples. In this situation, expression (3.1) is only valid when $k \le 300$. I draw the CDF and PDF plots of the estimates of the AUC with the estimate of the error rate from 0.19 to 0.20, 0.24 to 0.25 and 0.29 to 0.3. In these ranges of error rate, I have 7,060, 48,399 and 55,015 estimates of the AUC, respectively. The CDF plot of the estimates of the AUC under each error rate range is shown in Figure 3.6.

Figure 3.6: CDF of estimate of AUC (r=30%)

From Figure 3.6 I notice that the estimate of the AUC for error rates from 0.19 to 0.20 has the highest values. The distributions of the estimate of the AUC for error rates from 0.24 to 0.25 and from 0.29 to 0.30 are almost the same. To further investigate the difference between the estimate of the AUC and the expected average AUC within the same range of error rate, I draw the PDF plot of the estimate of the AUC with the bounds of the expected average AUC as reference (Figure 3.7).

Figure 3.7: PDF of estimate of AUC (r=30%)

From Figure 3.7 I find that the value of the expected average AUC is much smaller than the estimates of the AUC under each range of error rate. Table 3.3, which gives the difference between the average estimate of the AUC and the bounds (upper and lower) corresponding to each range of error rate, shows that when the error rate gets larger, the difference gets larger as well.

Table 3.3: Difference between estimate of AUC and expected average AUC (r=30%)

Error rate                               (0.19, 0.20]   (0.24, 0.25]   (0.29, 0.3]
Average estimate of AUC - lower bound    0.1507         0.21916        0.32513
Average estimate of AUC - upper bound    0.13375        0.20008        0.30225

For the situation where $r = 40\%$, there are 400 positive samples and 600 negative samples. In this situation, I only investigate the distribution of the estimate of the AUC when $k \le 400$. I plot the CDF and PDF of the estimate of the AUC with the estimate of the error rate from 0.24 to 0.25, 0.30 to 0.31, 0.34 to 0.35 and 0.39 to 0.4. In these ranges of error rate, I have 41,575, 26,702, 20,451 and 31,572 estimates of the AUC, respectively. The CDF plot of the estimates of the AUC under each error rate range is shown in Figure 3.8.
Figure 3.8: CDF of estimate of AUC (r=40%)

From this plot I notice that when the error rate is from 0.24 to 0.25, the estimate of the AUC is significantly higher than the others. When the error rate is from 0.39 to 0.4, the estimate of the AUC is the smallest. The distributions of the estimate of the AUC with error rate from 0.3 to 0.31, 0.34 to 0.35 and 0.39 to 0.4 are very similar to each other. The PDF plot of the estimate of the AUC is shown in Figure 3.9.

Figure 3.9: PDF of estimate of AUC (r=40%)

From Figure 3.9 I can clearly see that the estimate of the AUC with error rate from 0.24 to 0.25 has the largest values, and the difference between it and the expected average AUC is the smallest. For error rates from 0.3 to 0.31, 0.34 to 0.35 and 0.39 to 0.4, the estimate of the AUC has a similar distribution. However, the expected average AUC value differs a lot across these three ranges of error rate. The difference between the average estimate of the AUC and the bounds (upper and lower) corresponding to each range of error rate is shown in Table 3.4.

Table 3.4: Difference between estimate of AUC and expected average AUC (r=40%)

Error rate                               (0.24, 0.25]   (0.3, 0.31]   (0.34, 0.35]   (0.39, 0.4]
Average estimate of AUC - lower bound    0.11185        0.17863       0.23464        0.31688
Average estimate of AUC - upper bound    0.09987        0.16557       0.22012        0.29808

This table also shows that when the error rate gets larger, the difference between the estimate of the AUC and the expected average AUC gets larger.

When $r = 50\%$, there are 500 positive samples and 500 negative samples, so I investigate the situation where $k \le 500$. I plot the CDF and PDF of the estimate of the AUC with the estimate of the error rate from 0.24 to 0.25, 0.30 to 0.31, 0.34 to 0.35, 0.39 to 0.4, 0.44 to 0.45 and 0.49 to 0.50. In these ranges of error rate, I have 20,467, 28,456, 19,620, 15,038, 13,536 and 29,677 estimates of the AUC, respectively. The CDF plot of the estimate of the AUC is shown in Figure 3.10.

Figure 3.10: CDF of estimate of AUC (r=50%)

Figure 3.10 shows that the estimate of the AUC with error rate from 0.24 to 0.25 is significantly higher than the estimates of the AUC with higher error rates. The estimates of the AUC with error rate from 0.3 to 0.31, 0.34 to 0.35, 0.39 to 0.4, 0.44 to 0.45 and 0.49 to 0.5 have similar distributions. The PDF plot of those estimates of the AUC is shown in Figure 3.11.
Figure 3.11: PDF of estimate of AUC (r=50%)

The PDF plot (Figure 3.11) also shows that the estimate of the AUC with error rate from 0.24 to 0.25 has the highest values, and its range is much narrower than the others. For the estimates of the AUC with the other ranges of error rate, the distributions are similar. However, the expected average AUC differs a lot across those ranges of error rate. I calculate the difference between the average estimate of the AUC and the bounds (upper and lower) corresponding to each range of error rate (Table 3.5) as a reference. The table shows that the difference between the estimate of the AUC and the expected average AUC gets larger when the error rate gets larger. However, in this situation, the difference is smaller than in the previous situations with smaller proportions of positive samples.

Table 3.5: Difference between estimate of AUC and expected average AUC (r=50%)

Error rate                        (0.24, 0.25]   (0.3, 0.31]   (0.34, 0.35]   (0.39, 0.4]   (0.44, 0.45]   (0.49, 0.5]
Avg Est. of AUC - lower bound     0.0859         0.1304        0.171          0.2212        0.271          0.3194
Avg Est. of AUC - upper bound     0.0759         0.1204        0.161          0.2112        0.261          0.3094

In order to compare the difference between the estimate of the AUC and the expected average AUC across situations, I calculated the mean of the differences between the average estimate of the AUC and the bounds (upper and lower). The results are shown in Table 3.6.

Table 3.6: The Difference between estimate of AUC and Expected Average AUC

r = m/(m+n)                               0.1        0.2        0.3        0.4        0.5
mean of difference (with lower bound)     0.318771   0.274505   0.231665   0.2105     0.199911
mean of difference (with upper bound)     0.265325   0.245932   0.212028   0.195908   0.189911

Table 3.6 indicates that when $r$ gets larger, the difference between the expected average AUC and the estimate of the AUC gets smaller. That means that when the positive samples and negative samples are evenly distributed, the deviation of expression (3.1) is smallest, even though the equiprobable assumption is not satisfied. But when the positive samples and negative samples are not evenly distributed, we cannot use Eq. (3.1) as a reference.
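As a check on the arithmetic, the entries of Table 3.6 can be recomputed by averaging the per-range differences reported in Tables 3.1 to 3.5. The following is a minimal sketch: the dictionary and function names are illustrative assumptions, and the numbers are copied from those tables.

```python
from statistics import fmean
from typing import Dict, List

# Average AUC estimate minus the lower bound, per error-rate range (Tables 3.1-3.5).
LOWER: Dict[float, List[float]] = {
    0.1: [0.27612, 0.31761, 0.36258],
    0.2: [0.23484, 0.28004, 0.30864],
    0.3: [0.1507, 0.21916, 0.32513],
    0.4: [0.11185, 0.17863, 0.23464, 0.31688],
    0.5: [0.0859, 0.1304, 0.171, 0.2212, 0.271, 0.3194],
}
# Average AUC estimate minus the upper bound, per error-rate range.
UPPER: Dict[float, List[float]] = {
    0.1: [0.22388, 0.2642, 0.3079],
    0.2: [0.20752, 0.25125, 0.27902],
    0.3: [0.13375, 0.20008, 0.30225],
    0.4: [0.09987, 0.16557, 0.22012, 0.29808],
    0.5: [0.0759, 0.1204, 0.161, 0.2112, 0.261, 0.3094],
}


def mean_differences(table: Dict[float, List[float]]) -> Dict[float, float]:
    """Mean difference for each ratio of positive samples r."""
    return {r: fmean(diffs) for r, diffs in table.items()}


for name, table in (("lower", LOWER), ("upper", UPPER)):
    for r, value in mean_differences(table).items():
        print(f"{name} bound, r = {r}: {value:.4f}")
```

The recomputed means agree with Table 3.6 to about three decimal places and decrease as $r$ increases, which is the pattern described in the text.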
3.3 Study the Error Rate Distribution Under the Fixed AUC

Another objective of this thesis is to study how the error rate is distributed under a fixed range of AUC. To observe the distribution clearly, I draw the CDF plots of the error rate for the fixed AUC ranges (0.49, 0.51), (0.59, 0.61), (0.69, 0.71), (0.79, 0.81), and (0.89, 0.91). I study situations in which the ratio of positive samples r = m/(m + n) varies from 0.1 to 0.5 in steps of 0.1, to investigate whether the distribution depends on the ratio of positive examples.

In this section, I use the same Monte Carlo datasets as in the previous section. In order to cover a wide range of AUC estimates, I choose different predictors for the logistic regression classifier for each range of the estimate of AUC. The classification threshold is randomly chosen from a uniform distribution U(0, 1).

3.3.1 CDF and PDF Plots for Error Rate Under the Fixed Range AUC

First, I study the situation where r = 10%. The CDF plot and PDF plot are shown in Figures 3.12 and 3.13. For estimates of AUC from 0.49 to 0.51, I chose a_5 as the predictor, and I have 3,444 estimates of AUC in this range. For estimates of AUC from 0.59 to 0.61, I chose a_1 as the predictor, and I have 15,620 estimates of AUC in this range. For estimates of AUC from 0.69 to 0.71, I chose a_1, a_4, a_5 as predictors, and I have 35,709 estimates of AUC in this range. For estimates of AUC from 0.79 to 0.81, I chose a_1, a_3, a_5 as predictors, and I have 7,982 estimates of AUC in this range. And for estimates of AUC from 0.89 to 0.91, I chose a_2, a_4, a_5 as predictors, and I have 31,450 estimates of AUC in this range.

Figure 3.12: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 10%)

Figure 3.13: Error Rate Distribution Under Fixed AUC (r = 10%)

Figure 3.12 shows that when the estimate of AUC is from 0.49 to 0.51, the majority of the estimated error rates are around 0.1. When AUC gets larger, the range of the error rate becomes larger too; however, the maximum error rate is always around 0.9. The larger the AUC, the smaller the minimum error rate. From Figure 3.13 I find that the mode of the error rate is around 0.1, and that when AUC gets large, the mode of the density gets closer to 0. I also calculate the mean, median, and quartiles of the error rate under different AUC (Table 3.7).
Table 3.7: Descriptive Statistics of Error Rate under Fixed AUC (r = 10%)

AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.1        0.095      0.092      0.088      0.067
1st Quartile   0.1        0.1        0.1        0.1        0.089
Median         0.1        0.1        0.1        0.101      0.096
Mean           0.1817     0.1787     0.1691     0.156      0.1262
3rd Quartile   0.1        0.1783     0.116      0.127      0.107
Maximum        0.901      0.901      0.901      0.9        0.9

The descriptive statistics show that the mean of the error rate gets smaller when the AUC gets larger. The maximum error rate for each AUC range is around 0.9, and the median of the error rate is always around 0.1.

When r = 20%, there are 200 positive samples and 800 negative samples. For estimates of AUC from 0.49 to 0.51, I chose a_5 as the predictor, and I have 1,968 estimates of AUC in this range. For estimates of AUC from 0.59 to 0.61, I chose a_1 as the predictor, and I have 7,798 estimates of AUC in this range. For estimates of AUC from 0.69 to 0.71, I chose a_1, a_4, a_5 as predictors, and I have 31,384 estimates of AUC in this range. For estimates of AUC from 0.79 to 0.81, I chose a_3, a_4, a_5 as predictors, and I have 11,826 estimates of AUC in this range. And for estimates of AUC from 0.89 to 0.91, I chose a_1, a_2, a_5 as predictors, and I have 60,867 estimates of AUC in this range. The CDF plot and PDF plot are shown in Figures 3.14 and 3.15.

Figure 3.14: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 20%)

Figure 3.15: Error Rate Distribution Under Fixed AUC (r = 20%)

From the empirical cumulative distribution function plot (Figure 3.14) I find that when the estimate of AUC is from 0.49 to 0.51, the majority of the error rates are 0.2. When AUC gets larger, the minimum error rate gets smaller, but the maximum error rate is always around 0.8. The probability density plot (Figure 3.15) shows that the density has two modes. The higher one is around 0.2, and the lower one is around 0.8. When the AUC becomes larger, the higher mode moves closer to 0.1. The descriptive statistics are shown in Table 3.8.
Table 3.8: Descriptive Statistics of Error Rate under Fixed AUC (r = 20%)

AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.199      0.192      0.178      0.154      0.111
1st Quartile   0.2        0.2        0.2        0.185      0.144
Median         0.2        0.2        0.2        0.197      0.161
Mean           0.3226     0.3157     0.2917     0.2524     0.1925
3rd Quartile   0.2        0.276      0.28       0.239      0.192
Maximum        0.8        0.801      0.8        0.8        0.8

This table indicates that the maximum error rate under each AUC range is almost the same, around 0.8. For the first four ranges of the estimate of AUC, the first quartile of the error rate is around 0.2. But for estimates of AUC from 0.89 to 0.91, the first quartile of the error rate is much smaller, around 0.14. The minimum and mean of the error rate get smaller when the range of the AUC estimate gets larger.

When r = 30%, I have 300 positive samples and 700 negative samples. For estimates of AUC from 0.49 to 0.51, I chose a_5 as the predictor, and I have 1,488 estimates of AUC in this range. For estimates of AUC from 0.59 to 0.61, I chose a_1 as the predictor, and I have 4,156 estimates of AUC in this range. For estimates of AUC from 0.69 to 0.71, I chose a_1, a_4, a_5 as predictors, and I have 32,683 estimates of AUC in this range. For estimates of AUC from 0.79 to 0.81, I chose a_3, a_4, a_5 as predictors, and I have 35,089 estimates of AUC in this range. And for estimates of AUC from 0.89 to 0.91, I chose a_1, a_2, a_4 as predictors, and I have 20,490 estimates of AUC in this range. The CDF plot and PDF plot are shown in Figures 3.16 and 3.17.

Based on the empirical cumulative distribution function plot (Figure 3.16), I notice that when the range of the AUC estimate is 0.49 to 0.51, there are two jumps in the CDF plot, one around 0.3 and the other around 0.7.
When the AUC estimate gets larger, the minimum error rate gets smaller, but the maximum error rate is always around 0.7. From the probability density plot I can see that when the AUC estimate is small (ranges (0.49, 0.51), (0.59, 0.61), and (0.69, 0.71)), there are two modes of the error rate, one around 0.7 and the other around 0.3. When the AUC estimate gets larger, there are multiple modes. Both of the larger ranges have one mode around 0.7. For the AUC estimate range from 0.79 to 0.81, there is one mode around 0.3, one mode around 0.2, and another around 0.7. And for estimates of AUC from 0.89 to 0.91, there is one mode around 0.25 and another around 0.19. The descriptive statistics are shown in Table 3.9.

Figure 3.16: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 30%)

Figure 3.17: Error Rate Distribution Under Fixed AUC (r = 30%)

Table 3.9: Descriptive Statistics of Error Rate under Fixed AUC (r = 30%)

AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.298      0.286      0.253      0.202      0.143
1st Quartile   0.3        0.3        0.293      0.247      0.179
Median         0.3        0.3        0.3        0.276      0.204
Mean           0.4264     0.4078     0.375      0.3157     0.2339
3rd Quartile   0.7        0.534      0.414      0.309      0.256
Maximum        0.7        0.702      0.701      0.7        0.7

This table shows that the maximum error rate for each range of the AUC is around 0.7. The minimum and mean error rate get smaller when the AUC gets larger. The median and first quartile of the error rate for the first three ranges of the AUC estimate are around 0.3. For the AUC estimate from 0.89 to 0.91, the first quartile and median of the error rate are much smaller than 0.3.

For r = 40%, I have 400 positive samples and 600 negative samples. For estimates of AUC from 0.49 to 0.51, I chose a_5 as the predictor, and I have 1,488 estimates of AUC in this range. For estimates of AUC from 0.59 to 0.61, I chose a_1 as the predictor, and I have 5,870 estimates of AUC in this range. For estimates of AUC from 0.69 to 0.71, I chose a_1, a_4, a_5 as predictors, and I have 34,875 estimates of AUC in this range. For estimates of AUC from 0.79 to 0.81, I chose a_1, a_3, a_5 as predictors, and I have 16,566 estimates of AUC in this range. And for estimates of AUC from 0.89 to 0.91, I chose a_2, a_3, a_5 as predictors, and I have 9,290 estimates of AUC in this range.

The CDF plot and PDF plot are shown in Figures 3.18 and 3.19.
Figure 3.18: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 40%)

Figure 3.19: Error Rate Distribution Under Fixed AUC (r = 40%)

From the CDF plot I can see that when the estimate of AUC is from 0.49 to 0.51, there are two jumps in the CDF plot, one around 0.4 and the other around 0.6. When the AUC estimate gets larger, the minimum error rate gets smaller. For AUC estimates from 0.89 to 0.91, the minimum error rate is less than 0.2. However, the largest error rate is still around 0.6. From the probability density plot I notice that for estimates of AUC from 0.49 to 0.51 and from 0.59 to 0.61, the error rate has two modes, one around 0.6 and another around 0.4. For estimates of AUC from 0.69 to 0.71, the error rate has three modes, one around 0.4, one around 0.35, and the other around 0.6. For AUC estimates from 0.79 to 0.81, the error rate has three modes, one around 0.3, one around 0.4, and the other around 0.6. However, for estimates of AUC from 0.89 to 0.91, the error rate has only one mode, around 0.2. The descriptive statistics are shown in Table 3.10.

Table 3.10: Descriptive Statistics of Error Rate under Fixed AUC (r = 40%)

AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.395      0.359      0.305      0.238      0.152
1st Quartile   0.4        0.4        0.359      0.282      0.187
Median         0.4        0.4        0.396      0.325      0.214
Mean           0.4815     0.4643     0.4222     0.3505     0.2431
3rd Quartile   0.6        0.579      0.463      0.391      0.276
Maximum        0.6        0.602      0.601      0.6        0.6

This table shows that the maximum error rate for all ranges of the estimate of AUC is around 0.6. The mean and third quartile of the error rate get smaller as the AUC gets larger. The first quartile and minimum error rate are around 0.4 for estimates of AUC from 0.49 to 0.51 and from 0.59 to 0.61. When AUC gets larger, the first quartile and minimum error rate get smaller.
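One possible reading of why the error rates concentrate near r and 1 - r when the AUC is close to 0.5 is sketched below. It rests on an assumption that is not stated in the text: that a weak classifier's predicted probabilities all lie in a narrow band near r, so a threshold drawn from U(0, 1) usually falls above or below every score. The scores, the band (0.38, 0.42), and the function name are hypothetical. This is an illustration, not the simulation used in the thesis.

```python
import random
from statistics import median


def error_rate(scores, labels, threshold):
    """Misclassification rate when predicting positive iff score > threshold."""
    wrong = sum((s > threshold) != (y == 1) for s, y in zip(scores, labels))
    return wrong / len(labels)


rng = random.Random(0)            # fixed seed so the sketch is repeatable
m, n = 400, 600                   # r = m / (m + n) = 40%
labels = [1] * m + [0] * n
# Hypothetical weak classifier: scores stay close to r and carry no
# information about the label (so the AUC is near 0.5).
scores = [rng.uniform(0.38, 0.42) for _ in labels]

# Threshold above every score: all predicted negative, error rate = r.
all_negative = error_rate(scores, labels, 0.9)
# Threshold below every score: all predicted positive, error rate = 1 - r.
all_positive = error_rate(scores, labels, 0.1)

# Error rates under thresholds drawn from U(0, 1), as described in Section 3.3.
rates = [error_rate(scores, labels, rng.random()) for _ in range(2000)]
print(all_negative, all_positive)
print(min(rates), median(rates), max(rates))
```

Under these assumptions most thresholds give an error rate of exactly 0.4 or 0.6, and only thresholds inside the narrow score band give values in between. This is qualitatively the two-jump pattern described for the AUC range 0.49 to 0.51.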
For r = 50%, there are 500 positive samples and 500 negative samples. For estimates of AUC from 0.49 to 0.51, I chose a_5 as the predictor, and I have 1,089 estimates of AUC in this range. For estimates of AUC from 0.59 to 0.61, I chose a_4 as the predictor, and I have 14,419 estimates of AUC in this range. For estimates of AUC from 0.69 to 0.71, I chose a_1, a_4, a_5 as predictors, and I have 37,195 estimates of AUC in this range. For estimates of AUC from 0.79 to 0.81, I chose a_1, a_3, a_5 as predictors, and I have 19,311 estimates of AUC in this range. And for estimates of AUC from 0.89 to 0.91, I chose a_1, a_3, a_4 as predictors, and I have 5,969 estimates of AUC in this range.

Since the probability density function for estimates of AUC from 0.49 to 0.51 differs a lot from those for the other ranges of the estimate of AUC, I draw the PDF for these two situations separately. The CDF plot and PDF plots are shown in Figures 3.20, 3.21, and 3.22.

Figure 3.20: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 50%)

Figure 3.21: Error Rate Distribution Under Fixed AUC (r = 50%)

Figure 3.22: Error Rate Distribution Under Fixed AUC (r = 50%)

From the CDF plot I notice that when the estimate of AUC is from 0.49 to 0.51, the majority of the error rates are equal to 0.5. When AUC gets larger, the error rate varies from 0.16 to 0.5. When the estimate of AUC is from 0.89 to 0.91, the minimum error rate is less than 0.2. From Figure 3.21 I find that when AUC = 0.5, the probability that the error rate equals 0.5 is extremely large. Based on Figure 3.22 I find that when the estimate of AUC is from 0.59 to 0.61, the error rate has two modes, one around 0.5 and the other around 0.45. For estimates of AUC from 0.69 to 0.71, the error rate has two modes, around 0.5 and 0.3. For estimates of AUC from 0.79 to 0.81 and from 0.89 to 0.91, the error rate has only one mode, around 0.25 and 0.2 respectively. The descriptive statistics are shown in Table 3.11.
Table 3.11: Descriptive Statistics of Error Rate under Fixed AUC (r = 50%)

AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.477      0.4        0.325      0.252      0.167
1st Quartile   0.5        0.472      0.383      0.295      0.203
Median         0.5        0.499      0.445      0.344      0.234
Mean           0.4999     0.4836     0.437      0.3622     0.2631
3rd Quartile   0.5        0.5        0.494      0.425      0.304
Maximum        0.508      0.505      0.505      0.5        0.5

From this table I find that the maximum error rate is around 0.5. The mean error rate gets smaller when AUC gets larger. Only for estimates of AUC from 0.49 to 0.51 are the first quartile and the minimum error rate around 0.5. For the other, larger AUC ranges, the first quartile and minimum error rate are much smaller than 0.5.

Chapter 4 Conclusion

4.1 Summary

The objective of this thesis was to investigate the relationship between AUC and the misclassification rate. Cortes and Mohri (2004) have provided an expression for the expected value of AUC given the error number k that is only valid when all classifications or rankings with k errors are equiprobable. The assumption behind this equation is too strong to be met in real-life scenarios. I also found that Cortes and Mohri's expression is not valid in the situation where the error number k is larger than min(m, n). For their expression to be valid, the constraint k ≤ min(m, n) needs to be imposed.

I simulated a binary distribution using logistic regression and used a logistic regression model as a classifier to study the relationship between the misclassification rate and AUC. First, I compared the estimate of the AUC value to the expected average value of AUC calculated by equation 3.1, only for the situation where k ≤ min(m, n). The results showed that the expected average value of AUC is always lower than the estimate of AUC. When the positive samples and negative samples were evenly distributed, the difference between the estimate of AUC and the expected average value of AUC was smallest. Thus, one can use equation 3.1 as a reference to learn the relationship between the AUC and the error rate when there are equal proportions of positive and negative examples. But when the proportion of positive samples is extremely close to 0 or 1, this expression is very questionable to use.
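For readers who want to evaluate the expression numerically, the sketch below implements the form of the expected AUC published by Cortes and Mohri [4]. Equation 3.1 is not reproduced in this part of the thesis, so the formula in the docstring is written from that paper and should be checked against equation 3.1 before use. The function name `expected_auc` and the choice to reject k > min(m, n) are assumptions made for this illustration.

```python
from fractions import Fraction
from math import comb


def expected_auc(m, n, k):
    """Expected AUC over all classifications with k errors.

    Form assumed here (m positives, n negatives, N = m + n):
        1 - k/N - ((n - m)^2 (N + 1) / (4 m n))
              * ( k/N - sum_{x=0}^{k-1} C(N, x) / sum_{x=0}^{k} C(N + 1, x) )
    """
    if m < 1 or n < 1:
        raise ValueError("need at least one positive and one negative sample")
    if not 0 <= k <= min(m, n):
        raise ValueError("only evaluated here for 0 <= k <= min(m, n)")
    total = m + n
    numerator = sum(comb(total, x) for x in range(k))            # x = 0..k-1
    denominator = sum(comb(total + 1, x) for x in range(k + 1))  # x = 0..k
    # Exact rational arithmetic avoids overflow from the large binomial sums.
    correction = Fraction((n - m) ** 2 * (total + 1), 4 * m * n) * (
        Fraction(k, total) - Fraction(numerator, denominator)
    )
    return float(1 - Fraction(k, total) - correction)


print(expected_auc(500, 500, 250))  # balanced classes: 1 - 250/1000
print(expected_auc(400, 600, 250))  # unbalanced classes: below 0.75
```

When m = n the correction term vanishes and the expected AUC reduces to one minus the error rate. That is consistent with Table 3.5, where the lower-bound and upper-bound differences for each error-rate range differ by exactly the width of the range, 0.01.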
Furthermore, I studied the error rate distribution under a fixed range of AUC values when r varies from 0.1 to 0.5. The results showed that when r = 0.1, 0.2, the mode of the error rate is always around r or 1 - r. When AUC becomes larger, the distribution of the error rate becomes right-skewed, and the mean error rate becomes smaller.

4.2 Limitation

In this thesis, I did a simulation study to investigate the relationship between AUC and the misclassification rate for a binary distribution. Specifically, I tested the validity of the expression given by Cortes and Mohri (2004) and studied the distribution of the error rate under the estimate of AUC.

In the first analysis part, I calculated the estimate of AUC given a fixed range of the estimate of the error rate to validate the expression given by Cortes and Mohri (2004). I got the estimate of the error rate from a logistic regression classifier. Because the threshold for the classification is unknown, I simply assumed that the threshold follows a uniform distribution from 0 to 1. This assumption may be incorrect. In the second analysis part, I studied the distribution of the estimate of the error rate under a fixed range of the estimate of AUC. Because AUC is a continuous variable and I cannot get the error rate distribution under every single value of AUC, I looked into only 5 specific intervals of AUC. I did not use the same classifier to get these 5 intervals of AUC values because that would need an extremely large number of Monte Carlo samples. Instead, I used 5 different classifiers to get these 5 intervals of AUC values. This is not fully appropriate, and the distribution of the error rate under AUC would be more accurate if I had more Monte Carlo samples.

4.3 Discussion

In this thesis, I evaluated the validity of the expression provided by Cortes and Mohri (2004) for moderate to large deviations from the equiprobable assumption. My results showed that when the positive samples and negative samples are not evenly distributed, the expression is questionable. Based on my work, people can get a rough idea of how the error rate is distributed under a fixed range of AUC. To investigate the relationship more precisely, Bayesian inference methods could be applied. The difficulty of Bayesian inference is that both the error rate and the AUC are random variables, and it is hard to find the distribution of AUC and error rate.

BIBLIOGRAPHY

[1] Gorno-Tempini, M. L., et al. "Classification of primary progressive aphasia and its variants."
Neurology 76.11 (2011): 1006-1014.
[2] Planet, Paul J., et al. "Phylogeny of genes for secretion NTPases: identification of the widespread tadA subfamily and development of a diagnostic key for gene classification." Proceedings of the National Academy of Sciences 98.5 (2001): 2503-2508.
[3] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10. Association for Computational Linguistics, 2002.
[4] Cortes, Corinna, and Mehryar Mohri. "AUC optimization vs. error rate minimization." Advances in neural information processing systems 16.16 (2004): 313-320.
[5] Cortes, Corinna, and Mehryar Mohri. "Confidence intervals for the area under the ROC curve." Advances in neural information processing systems 17 (2005): 305.
[6] Hand, David J., and Robert J. Till. "A simple generalisation of the area under the ROC curve for multiple class classification problems." Machine learning 45.2 (2001): 171-186.
[7] Herschtal, Alan, and Bhavani Raskutti. "Optimising area under the ROC curve using gradient descent." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
[8] Ma, Shuangge, and Jian Huang. "Regularized ROC method for disease classification and biomarker selection with microarray data." Bioinformatics 21.24 (2005): 4356-4362.
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
[10] Och, Franz Josef. "Minimum error rate training in statistical machine translation." Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1. Association for Computational Linguistics, 2003.
[11] Ben-David, Shai, et al. "Minimizing the misclassification error rate using a surrogate convex loss." arXiv preprint arXiv:1206.6442 (2012).
[12] Murthy, Sreerama K. "Automatic construction of decision trees from data: A multi-disciplinary survey." Data mining and knowledge discovery 2.4 (1998): 345-389.
[13] Joachims, Thorsten. "A support vector method for multivariate performance measures." Proceedings of the 22nd international conference on Machine learning. ACM, 2005.
[14] Metz, Charles E. "Basic principles of ROC analysis." Seminars in nuclear medicine. Vol. 8. No. 4. WB Saunders, 1978.
[15] Ferri, Cesar, Jose Hernandez-Orallo, and R. Modroiu. "An experimental comparison of performance measures for classification." Pattern Recognition Letters 30.1 (2009): 27-38.
[16] Fawcett, Tom. "An introduction to ROC analysis." Pattern recognition letters 27.8 (2006): 861-874.
[17] Ling, Charles X., Jin Huang, and Harry Zhang. "AUC: a statistically consistent and more discriminating measure than accuracy." IJCAI. Vol. 3. 2003.
[18] Bradley, Andrew P. "The use of the area under the ROC curve in the evaluation of machine learning algorithms." Pattern recognition 30.7 (1997): 1145-1159.
[19] Mann, H. B., Whitney, D. R. (1947) On a test whether one of two random variables is stochastically larger than the other. Ann. Math. Statist., 18, pp. 50-60.
[20] Wilcoxon, F. (1945) Individual comparisons by ranking methods. Biometrics, 1, pp. 80-83.
[21] Hanley, James A., and Barbara J. McNeil. "The meaning and use of the area under a receiver operating characteristic (ROC) curve." Radiology 143.1 (1982): 29-36.
[22] Huang, Jin, and Charles X. Ling. "Using AUC and accuracy in evaluating learning algorithms." Knowledge and Data Engineering, IEEE Transactions on 17.3 (2005): 299-310.
[23] Breiman L., Friedman J. H., Olshen R. A., Stone C. J. (1984) Classification and Regression Trees, Wadsworth International Group.
[24] Mika, S., Ratsch, G., Weston, J., Scholkopf, B. and Muller, K.-R. (1999), Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41-48. IEEE.
[25] Murthy, Sreerama K. "Automatic construction of decision trees from data: A multi-disciplinary survey." Data mining and knowledge discovery 2.4 (1998): 345-389.
[26] Good I. J. (1950), Probability and the Weighing of Evidence, London, Charles Griffin.
[27] Mitchell, T. (1997). Machine Learning. McGraw Hill.
[28] Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. "Supervised machine learning: A review of classification techniques." (2007): 3-24.
[29] Moses, Lincoln E., David Shapiro, and Benjamin Littenberg. "Combining independent studies of a diagnostic test into a summary ROC curve: Data-analytic approaches and some additional considerations." Statistics in medicine 12.14 (1993): 1293-1316.
[30] Metz, Charles E., Benjamin A. Herman, and Cheryl A. Roe. "Statistical comparison of two ROC-curve estimates obtained from partially-paired datasets." Medical Decision Making 18.1 (1998): 110-121.
[31] Agarwal, Shivani, et al. "Generalization bounds for the area under the ROC curve." Journal of Machine Learning Research. 2005.
[32] Dreiseitl, Stephan, and Lucila Ohno-Machado. "Logistic regression and artificial neural network classification models: a methodology review." Journal of biomedical informatics 35.5 (2002): 352-359.
[33] Davis, Jesse, and Mark Goadrich. "The relationship between Precision-Recall and ROC curves." Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[34] Statnikov, Alexander, Lily Wang, and Constantin F. Aliferis. "A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification." BMC bioinformatics 9.1 (2008): 319.
[35] Lee, Eunjung, et al. "Inferring pathway activity toward precise disease classification." PLoS computational biology 4.11 (2008): e1000217.
[36] Kim, Ji-Hyun. "Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap." Computational Statistics and Data Analysis 53.11 (2009): 3735-3745.
[37] Lagreid, Astrid, et al. "Predicting gene ontology biological process from temporal gene expression patterns." Genome research 13.5 (2003): 965-979.
[38] Deselaers, Thomas, Daniel Keysers, and Hermann Ney. "Classification error rate for quantitative evaluation of content-based image retrieval systems." Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. Vol. 2. IEEE, 2004.
[39] Golub, Todd R., et al. "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring." Science 286.5439 (1999): 531-537.
[40] Wang, Qiong, et al. "Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy." Applied and environmental microbiology 73.16 (2007): 5261-5267.
[41] Kuncheva, Ludmila I., and Christopher J. Whitaker. "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy." Machine learning 51.2 (2003): 181-207.
[42] Simon, Richard, et al. "Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification." Journal of the National Cancer Institute 95.1 (2003): 14-18.
[43] Swets, John A. "Measuring the accuracy of diagnostic systems." Science 240.4857 (1988): 1285-1293.