A SIMULATION STUDY FOR EVALUATING THE AREA UNDER THE ROC CURVE AND THE ERROR RATE IN BINARY CLASSIFICATIONS
By
Qinhua Huang
A THESIS
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Biostatistics - Master of Science
2015
ABSTRACT
The area under the ROC curve (AUC) and the error rate are two important criteria designed to measure the performance of classifiers. The maximum AUC and the minimum error rate indicate the best classifier. However, one cannot get the minimum error rate and the maximum AUC simultaneously under the same classifier. It is thus of interest to investigate the relationship between the AUC and the error rate. Studying this relationship, Cortes and Mehryar (2004) have provided an expression of the expected value of the AUC for a given error rate. In this thesis, I study the validity of the expression given by Cortes and Mehryar (2004); after that, I investigate the error rate distribution under a fixed range of AUC.
My results show that Cortes and Mehryar's expression is not valid in some situations, and that the expected average value of AUC is always smaller than the estimate of AUC from Monte Carlo samples. When the proportion of positive samples is not close to 0.5, the expected average value of AUC calculated by Cortes and Mehryar's expression deviates largely from the Monte Carlo samples of AUC. This indicates that the expression of the expected average value of AUC for a given error rate may not be accurate and should be used with caution. I also provide useful information on the quantiles of the error rate for a given fixed range of AUC, with the proportion of positive samples varying in [0.1, 0.5].
Copyright by
QINHUA HUANG
2015
ACKNOWLEDGMENTS
I wish to express my sincere thanks to all my graduate and undergraduate professors. I am extremely thankful and indebted to them for sharing their expertise, and for the sincere and valuable guidance and encouragement extended to me. I also thank my parents for their unceasing encouragement, support and attention. I am also grateful to my friends who supported me through this venture. I also place on record my sense of gratitude to one and all who, directly or indirectly, have lent their hand in this venture.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
1.1 Background
1.2 Motivation
1.3 Classifier performance
1.4 Definition of ROC curve
1.5 Definition of AUC
Chapter 2 Literature Review
2.1 Methods of Classification
2.2 Development of ROC Curve Analysis
2.3 Study of Investigating the Relationship between AUC and the Misclassification Rate
Chapter 3 Different Approaches to Address the Relationship between the AUC and the Misclassification Rate
3.1 The Expected Value of the AUC under fixed error rate
3.2 Simulating the AUC Distribution under Fixed Error Rate
3.2.1 Generating Binary Distribution
3.2.2 Using Logistic Regression as a Classifier
3.2.3 Estimate of AUC and Expected AUC Calculation
3.2.4 Comparing the Estimate of AUC versus Expected Average AUC
3.3 Study the Error Rate Distribution Under the Fixed AUC
3.3.1 CDF and PDF Plots for Error Rate Under the Fixed Range AUC
Chapter 4 Conclusion
4.1 Summary
4.2 Limitation
4.3 Discussion
BIBLIOGRAPHY
LIST OF TABLES
Table 1.1: Classifier Performance
Table 3.1: Difference between estimate of AUC and expected average AUC (r=10%)
Table 3.2: Difference between estimate of AUC and expected average AUC (r=20%)
Table 3.3: Difference between estimate of AUC and expected average AUC (r=30%)
Table 3.4: Difference between estimate of AUC and expected average AUC (r=40%)
Table 3.5: Difference between estimate of AUC and expected average AUC (r=50%)
Table 3.6: The Difference between estimate of AUC and Expected Average AUC
Table 3.7: Descriptive Statistics of Error Rate under Fixed AUC (r=10%)
Table 3.8: Descriptive Statistics of Error Rate under Fixed AUC (r=20%)
Table 3.9: Descriptive Statistics of Error Rate under Fixed AUC (r=30%)
Table 3.10: Descriptive Statistics of Error Rate under Fixed AUC (r=40%)
Table 3.11: Descriptive Statistics of Error Rate under Fixed AUC (r=50%)
LIST OF FIGURES
Figure 1.1: ROC curve
Figure 3.1: AUC Expectation (m=100, n=900)
Figure 3.2: CDF of estimate of AUC (r=10%)
Figure 3.3: PDF of estimate of AUC (r=10%)
Figure 3.4: CDF of estimate of AUC (r=20%)
Figure 3.5: PDF of estimate of AUC (r=20%)
Figure 3.6: CDF of estimate of AUC (r=30%)
Figure 3.7: PDF of estimate of AUC (r=30%)
Figure 3.8: CDF of estimate of AUC (r=40%)
Figure 3.9: PDF of estimate of AUC (r=40%)
Figure 3.10: CDF of estimate of AUC (r=50%)
Figure 3.11: PDF of estimate of AUC (r=50%)
Figure 3.12: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=10%)
Figure 3.13: Error Rate Distribution Under Fixed AUC (r=10%)
Figure 3.14: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=20%)
Figure 3.15: Error Rate Distribution Under Fixed AUC (r=20%)
Figure 3.16: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=30%)
Figure 3.17: Error Rate Distribution Under Fixed AUC (r=30%)
Figure 3.18: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=40%)
Figure 3.19: Error Rate Distribution Under Fixed AUC (r=40%)
Figure 3.20: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r=50%)
Figure 3.21: Error Rate Distribution Under Fixed AUC (r=50%)
Figure 3.22: Error Rate Distribution Under Fixed AUC (r=50%)
Chapter 1
Introduction
1.1 Background
Classification is a common task in many fields of application such as health care, genetic analysis, and computer science. For example, Planet et al. (2001) proposed a molecular key-based method to classify putative NTPase genes precisely [2]. Another example is given by Pang et al. (2002), who studied how to classify documents by sentiment using machine learning, including Naive Bayes, maximum entropy and support vector machines [3]. Finally, Gorno-Tempini (2011) studied how to classify primary progressive aphasia and its three main variants [1].
After the classification, it is important to study the accuracy of the classifications. There are two common criteria used to measure the performance of the classifier: the error rate and the area under the receiver operating characteristic (ROC) curve (AUC). Recently, some researchers pointed out that the AUC may be a more pertinent measurement for classification than the error rate [4].
The ROC curve is a plot that tests the performance of a binary classifier system as its discrimination threshold is varied; thus it can be used to select classifiers based on their performance. It has been used for a long time, and has been extended to visualize and analyze the behavior of diagnostic systems [43]. Besides, an increasing number of medical decisions have been made based on the ROC graph, and a growing usage of ROC curves has been seen in the machine learning community because of the realization that the error rate is not accurate enough to measure classifier performance [22]. Apart from being mainly a performance graphing method, the ROC graph also has properties that make it very useful for estimating error costs under skewed class distributions. These properties have become more and more important because research on cost-sensitive learning has gained a lot of attention lately [16].
1.2 Motivation
The area under the ROC curve (AUC) and the error rate are two important criteria designed to measure the performance of classifiers. For instance, Simon et al. (2003) used the misclassification rate to measure the performance of a class prediction for DNA microarray data [42]. Golub et al. (1999) used the cumulative error rate to assess the accuracy of a gene expression based classification for cancer [39]. Wang et al. (2007) also used the overall error rate to assess their classifier for rapid assignment of rRNA sequences into higher taxonomy [40]. Another example is given by Krizhevsky et al. (2012), who used the error rate to measure their classifier, which was designed to classify high-resolution images in the ImageNet LSVRC-2010 contest [9]. Furthermore, Statnikov et al. (2008) used the AUC to compare random forests and support vector machines for cancer classification based on microarray data [34]. Lee et al. (2008) used AUC to evaluate the performance of a new method of classification which is based on pathway activities inferred for each patient [35]. Finally, Ma et al. (2005) proposed a new method that uses a sigmoid approximation to the AUC as an objective function to select and classify biomarkers [8].
The most common measures of the performance of a classification exercise are the error rate and the AUC. However, one cannot get the minimum error rate and the maximum AUC simultaneously under the same classifier. It is thus of interest to investigate the relationship between the AUC and the misclassification rate. Cortes and Mehryar (2004) have provided an expression of the expected average value and the variance of AUC given a fixed error rate. However, the authors warned that these equations require the classifications or rankings with $k$ errors to be equiprobable. By equiprobable, they mean situations in which each test sample has an equal probability of being misclassified [4]. In this thesis, I study the expression provided by Cortes and Mehryar (2004), and point out that the expression is inappropriate in some situations. I also conduct a simulation experiment to investigate the relationship between the estimate of AUC and the estimate of error rate. I conduct this experiment by simulating a binary distribution and using logistic regression with a threshold as a classifier. I assume that the threshold for the classifier follows a uniform distribution from 0 to 1. Then I calculate the estimate of AUC and the estimate of error rate for each classifier. To investigate how the estimate of error rate is distributed under fixed ranges of the estimate of AUC, I draw the cumulative distribution function (CDF) plots and probability density function (PDF) plots of the estimate of error rate.
1.3 Classifier performance
In this thesis, I study binary classification situations. In a binary classification exercise, every sample is assigned to the positive or the negative class. A classifier is used to predict which class the sample should be assigned to. Different classifiers produce different outcomes to predict the sample's class: some of them produce discrete class labels, and others produce continuous outputs that are compared with different thresholds. The thresholds can range from 0 to 1 for a binary classifier. If the output of a classifier is smaller than the threshold, then the sample is classified as a negative; if the output of a classifier is larger than or equal to the threshold, then the sample is classified as a positive.
Given a classification model and a test sample, there may be four different outcomes (Table 1.1). If the test sample is positive and is assigned correctly, it is a true positive; if it is positive but is assigned to negative, it is a false negative. If the test sample is negative and is assigned correctly, it is a true negative; if the test sample is negative but is assigned to positive, it is a false positive.
Table 1.1: Classifier Performance
                 Condition Positive   Condition Negative
Test Positive    True positive        False positive
Test Negative    False negative       True negative
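Some classification functions are used to measure the performance of a binary classifier. The sensitivity (true positive rate) is estimated as
\[ \text{True Positive Rate} = \frac{\text{True Positives}}{\text{Total Positives}} \]
The false positive rate is estimated as
\[ \text{False Positive Rate} = \frac{\text{False Positives}}{\text{Total Negatives}} \]
The specificity is estimated as
\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{False Positives} + \text{True Negatives}} = 1 - \text{False Positive Rate} \]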
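As a minimal sketch of the three rates above, the following Python function computes them from the four cells of the classifier performance table. The function name `classification_rates` and its argument names are illustrative assumptions, not part of the thesis.

```python
def classification_rates(tp, fp, fn, tn):
    """Return (true positive rate, false positive rate, specificity).

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives (hypothetical argument names).
    """
    total_positives = tp + fn   # all samples whose true class is positive
    total_negatives = fp + tn   # all samples whose true class is negative
    tpr = tp / total_positives
    fpr = fp / total_negatives
    specificity = tn / total_negatives  # equals 1 - fpr
    return tpr, fpr, specificity


tpr, fpr, spec = classification_rates(tp=40, fp=10, fn=20, tn=30)
```

With these illustrative counts, the false positive rate is 10/40 and the specificity is 30/40, so the two sum to one as the last identity states.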
1.4 Definition of ROC curve
The ROC curve is a plot that demonstrates how a binary classifier performs. It is a two-dimensional graph with the true positive rate on the Y axis and the false positive rate on the X axis. The ROC curve can illustrate the relationship between the false positive rate and the true positive rate. Figure (1.1) is a simple example of the ROC curve. Here, the diagonal represents random classification; in other words, the classification is conducted as a fair coin toss, and it is drawn as a reference. Point (0, 0) is located at the lower left and demonstrates the situation of never issuing a positive classification: the classifier commits no false positive errors but also gains no true positives. Conversely, the point (1, 1), which is located at the upper right corner, demonstrates never issuing a negative classification. The point (0, 1) shows the perfect classification, with a false positive rate of zero and a true positive rate of one.
Intuitively, one point performs a better classification if it is located to the northwest of another, because it has a higher true positive rate and a lower false positive rate. Usually, classifiers which appear near the X axis and on the left-hand side of an ROC curve are taken as "conservative" because they make positive classifications only with strong evidence, so they make few false positive errors; however, their true positive rate is also low. Classifiers which appear on the upper right-hand side of an ROC curve are taken as "liberal" because they make positive classifications with weak evidence to increase the true positive rate, but the high true positive rate comes with a high false positive rate.
Figure 1.1: ROC curve
1.5 Definition of AUC
As I mentioned in the previous section, an ROC curve is a plot of the true positive rate as a function of the false positive rate. Reducing ROC performance from two dimensions to a single scalar value may make it easier to compare the performances of classifiers. The AUC, which is defined as the area under the ROC curve, is the most common method to measure ROC performance. Since it is a portion of the area of the unit square, the value of AUC will always be between 0 and 1. However, since random classification produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5, no realistic classifier should have an AUC under 0.5.
The value of AUC can be calculated by the expression given by Mann and Whitney (1947) and Wilcoxon (1945), which is called the Wilcoxon-Mann-Whitney statistic. The statistic is given by:
\[ W = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} I(x_i > y_j)}{mn} \tag{1.1} \]
It is based on pairwise comparisons between a sample $x_i$, $i = 1, \ldots, m$, of random variable $X$ and a sample $y_j$, $j = 1, \ldots, n$, of random variable $Y$. We identify $x_1, x_2, \ldots, x_m$ as the outputs for $m$ positive samples, and $y_1, y_2, \ldots, y_n$ as the outputs for $n$ negative samples. The proof of this expression is based on the observation that the AUC value is exactly the probability $P(X > Y)$. So the AUC can be used as a measure of pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive samples are ranked higher than the negative ones and AUC = 1.
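The Wilcoxon-Mann-Whitney statistic in Eq. (1.1) can be sketched directly as a double loop over positive and negative outputs. The function name `wmw_auc` and the example scores are illustrative assumptions. Following the strict inequality in Eq. (1.1), tied pairs contribute zero here.

```python
def wmw_auc(positive_outputs, negative_outputs):
    """Wilcoxon-Mann-Whitney estimate of AUC as written in Eq. (1.1).

    Counts the pairs (x_i, y_j) with x_i > y_j and divides by m * n.
    """
    m = len(positive_outputs)
    n = len(negative_outputs)
    wins = sum(1 for x in positive_outputs
                 for y in negative_outputs if x > y)
    return wins / (m * n)


perfect = wmw_auc([0.9, 0.8], [0.1, 0.2])   # every positive outranks every negative
partial = wmw_auc([0.9, 0.3], [0.5, 0.1])   # three of the four pairs are ordered correctly
```

The double loop costs m * n comparisons, which is acceptable for the sample sizes discussed in this thesis; a rank-based formulation would be the usual choice for much larger data.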
Chapter 2
Literature Review
2.1 Methods of Classification
As a common task, classification has been studied in many cases. Researchers have studied various methods of classification for different situations. Friedman (1989) studied how to use linear discriminant analysis and Fisher's linear discriminant method to classify multiple classes of samples [23]. Mika et al. (1999) stated that linear discriminant analysis is an appropriate method to classify continuous observations. In contrast, discriminant correspondence analysis is more appropriate for classifying discrete variables [24]. Murthy (1998) studied how to apply decision tree methods in the machine learning area [25]. Decision trees are methods that classify samples by sorting them based on feature values. Each node in a decision tree stands for a feature in a sample to be classified, and each branch stands for a value that the node can assume [28].
Another well-known classifier is the Bayesian network. Naive Bayesian networks are among the simplest Bayesian networks. They consist of a directed acyclic graph with one unobserved node and several observed nodes, together with the assumption that the observed nodes are independent (Good, 1950). Another statistical method for classification is instance-based learning. Mitchell (1997) indicated that instance-based learning algorithms delay the generalization process until classification is performed, and thus they are lazy-learning algorithms [27]. Although lazy-learning algorithms save time in the training phase, they require more time in the classification process [28].
2.2 Development of ROC Curve Analysis
The ROC curve first appeared during World War II, when it was developed by radar engineers to detect enemy objects. The ROC curve was then used in the field of psychology to account for the perceptual detection of stimuli. Since then, ROC analysis has become useful in many fields such as medicine, radiology, biometrics and data mining research.
Metz (1978) discussed the basic principles of ROC analysis. They showed that ROC analysis could combine the true positive fraction and the false positive fraction, and make it easier to compare hypothetical tests based on basic classification performance [14]. To estimate the value of the AUC, Hanley et al. (1982) stated that the area under the ROC curve represents the probability that a randomly chosen positive sample is rated higher than a randomly chosen negative sample, and that this probability is the same quantity as that estimated by the nonparametric Wilcoxon statistic [21]. Moses (1993) proposed a construction to do ROC analysis in four steps [29]. Bradley (1997) further investigated the use of ROC analysis as a measure of classifier performance in the area of machine learning algorithms. They stated that AUC has many advantages compared to overall accuracy (misclassification rate) as a measure of classifier performance [18]. Metz et al. (1998) provided a new generalized method for ROC curve fitting. The new algorithm, named ROCKIT, conducts all analyses available from previous ROC software and provides a 95% confidence interval for each estimate [30].
2.3 Study of Investigating the Relationship between AUC and the Misclassification Rate
In many classification exercises, researchers chose the misclassification rate to measure the performance of the classifier. For example, Kim et al. (2003) studied error rate estimation by bootstrap [36]. Another example is given by Och et al. (2003), who provided a new algorithm for the unsmoothed error count and studied different training criteria of statistical machine translation models for optimizing the minimum error rate [10]. Meanwhile, some researchers proposed that the area under the ROC curve is an alternative measure to evaluate classification models. Herschtal and Raskutti (2004) introduced a binary classifier called RankOpt that can optimise AUC using gradient descent [7]. Agarwal (2005) studied the generalization bounds for AUC. In their paper, they defined the expected accuracy of a ranking function and derived distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function. Furthermore, they also derived both a large deviation bound and a uniform convergence bound [31]. Thus it is of interest to study the error rate and the AUC.
Cortes and Mehryar (2004) conducted a statistical analysis to investigate how AUC is related to error rate. They derived the expression to compute the expected value of AUC over all classifications with a fixed error rate. Given a fixed error $k$, they pointed out that there are three situations: i) samples are classified correctly, ii) positive samples are misclassified as negative and iii) negative samples are misclassified as positive. They further computed the AUC for each situation, and provided an expression to calculate the average value of the AUC given $k$ errors and $x$ false positive examples.
\[ \langle AUC \rangle_x = 1 - \frac{\frac{x}{n} + \frac{k - x}{m}}{2} \tag{2.1} \]
Besides, they have provided an expression to calculate the variance of AUC given $x$ false positive samples. One year later, they gave an expression to calculate the confidence interval for AUC given the error number $k$ and the numbers of positive samples and negative samples. Their analysis gave us a good starting point to study the relationship between the error rate and the AUC. However, these expressions given by Cortes and Mehryar are only correct under the assumption that all classifications or rankings with $k$ errors are equiprobable, which means each sample has the same probability of being misclassified. This condition is rarely met in realistic settings.
Chapter 3
Different Approaches to Address the Relationship between the AUC and the Misclassification Rate
3.1 The Expected Value of the AUC under fixed error rate
Cortes and Mehryar (2004) have provided an expression of the expected value of AUC over all classifications with a fixed number of errors and compared that to the error rate.
Assume that the number of errors $k$ is fixed, and a binary classification task with $m$ positive samples and $n$ negative samples is given. Under the assumption that all classifications or rankings with $k$ errors are equiprobable, the expected value of the AUC is given by Eq. (3.1) [4].
\[ \langle A \rangle_{m,n,k} = 1 - \frac{k}{m+n} - \frac{(n-m)^2 (m+n+1)}{4mn} \left( \frac{k}{m+n} - \frac{\sum_{x=0}^{k-1} \binom{m+n}{x}}{\sum_{x=0}^{k} \binom{m+n+1}{x}} \right), \tag{3.1} \]
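Which is equivalent to
\[ \langle A \rangle = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'} \left( 1 - \frac{\frac{x}{n} + \frac{k - x}{m}}{2} \right)}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{x'}} \tag{3.2} \]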
Where $x$ is the number of false positive samples, $x'$ is the number of false negative samples, $N$ is the number of negative samples, and $N'$ is the number of positive samples. The proof of this expression is based on weighting the expression (2.1) with the total number of possible classifications for a given $x$. Thus, there are $\binom{N}{x}$ possible ways of choosing $x$ false positive examples and $\binom{N'}{x'}$ possible ways of choosing $x'$ false negative examples. Here, the authors assumed the following condition: $0 \le x \le k$, and $x' = k - x$.
However, the authors did not consider the situation where the number of misclassifications $k$ is larger than the number of negative samples $n$; in that situation, the range of the number of false positives $x$ should be $0 \le x \le n$. Similarly, the number of false negatives $k - x$ should be less than $m$, which means $x \le m$. Thus, the value range of $x$ should be $[0, \min(m, n, k)]$. If we still use the expression (3.2) to calculate the expectation of AUC given $x$ when $k > n$ or $k > m$, the expectation of AUC can be less than 0.5 or even negative.
For example, if I have 100 positive samples, 900 negative samples, and 200 misclassified samples, the expectation of AUC is $-0.1429524$. I can also illustrate this issue by plotting the value of the AUC expectation calculated by expression (3.1). The plot of expression (3.1) with 100 positive examples and 900 negative examples is shown in figure (3.1).
Figure 3.1: AUC Expectation (m=100, n=900)
Here, the red line is shown as a reference line at AUC = 0. From the figure we can see that when the number of errors $k$ is much larger than the number of positive samples $m$, the expectation calculated by expression (3.1) is negative. As I mentioned in previous sections, AUC is the probability that a positive sample is correctly ranked higher than a negative sample, which indicates that AUC should have a positive value. In order to use the expression (3.1) correctly, the assumption that the number of errors is less than both the number of negative samples and the number of positive samples should be added.
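Expression (3.1) can be evaluated exactly with integer binomial coefficients, which avoids floating-point overflow for sums over 1000 samples. The sketch below is one possible way to do this in Python; the function name `expected_auc` is an illustrative assumption, and the code simply transcribes expression (3.1) without enforcing the extra condition on $k$ discussed above.

```python
from fractions import Fraction
from math import comb


def expected_auc(m, n, k):
    """Evaluate expression (3.1) for m positives, n negatives and k errors.

    Uses exact rational arithmetic and converts to float only at the end.
    No check is made that k <= min(m, n); outside that range the value
    can fall below 0.5 or become negative, as discussed in the text.
    """
    total = m + n
    numerator = sum(comb(total, x) for x in range(k))            # x = 0 .. k-1
    denominator = sum(comb(total + 1, x) for x in range(k + 1))  # x = 0 .. k
    bracket = Fraction(k, total) - Fraction(numerator, denominator)
    weight = Fraction((n - m) ** 2 * (total + 1), 4 * m * n)
    return float(1 - Fraction(k, total) - weight * bracket)


value = expected_auc(100, 900, 200)  # the case quoted in the text
balanced = expected_auc(500, 500, 100)  # n = m removes the correction term
```

For the case of 100 positive samples, 900 negative samples and 200 errors, this transcription should return a negative value in line with the figure of $-0.1429524$ quoted in the text. When $m = n$ the correction term vanishes and the expression reduces to one minus the error rate.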
3.2 Simulating the AUC Distribution under Fixed Error Rate
The previous section showed that the expression (3.1) is not valid when the error number $k$ is larger than the number of positive samples $m$ or the number of negative samples $n$. When the error number $k$ is smaller than $\min(m, n)$, the derivation of expression (3.1) is correct. Since the assumption that each sample has the same probability of being misclassified is hard to achieve in realistic scenarios, I further conduct an extensive simulation experiment to evaluate the validity of this expression when $k \le \min(m, n)$ for moderate to large deviations from the equiprobable assumption.
3.2.1 Generating Binary Distribution
In order to investigate the validity of expression (3.1), I simulate binary distributed data from a logistic regression model. Since expression (3.1) is conditioned on $m$, $n$ and $k$, I investigate five situations corresponding to the ratio of positive samples $r = m/(m+n)$ equal to 0.1, 0.2, 0.3, 0.4, 0.5.
First, I generated 1000 $a_1, a_2, a_3, a_4, a_5 \sim N(0, 0.5)$, and I set $\beta_1 = 3$, $\beta_2 = -5.5$, $\beta_3 = -5$, $\beta_4 = 2.5$, $\beta_5 = -1$. Then I generated 1000 $u \sim U[0, 1]$ independently of $a_i$. I define
\[ q = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 a_1 + \beta_2 a_2 + \beta_3 a_3 + \beta_4 a_4 + \beta_5 a_5)\right)} \]
I label examples as 0 and 1 by the following rules:
if $u \ge q$ then $y = 0$
if $u < q$ then $y = 1$
To get binary distributed data with 10% positive samples, I set $\beta_0 = -8$, and only use the binary distributed data sets which have 100 positive samples ($y = 1$) to conduct the simulation experiment. To get binary distributed data with 20% positive samples, I set $\beta_0 = -4.5$, and only use the binary distributed data sets with 200 positive samples ($y = 1$) to conduct the simulation experiment. Similarly, I set $\beta_0 = -1.5, 1, 3.5$ correspondingly to get the data sets with 30%, 40% and 50% positive samples. The reason I adjust $\beta_0$ is to adjust the probability $P(y = 1 \mid a)$. By adjusting $P(y = 1 \mid a)$, I can get data sets with around 10% to 50% positive samples.
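The data-generating steps of this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used for the thesis: the function names are hypothetical, the second parameter of $N(0, 0.5)$ is read as a variance, and the coefficient signs follow the values as recovered from the text.

```python
import math
import random

# beta_1 .. beta_5 as read from the text (signs are an assumption of this sketch)
BETAS = [3.0, -5.5, -5.0, 2.5, -1.0]


def simulate(n_samples, beta0, rng):
    """Draw predictors a_1..a_5 and binary labels y from the logistic model."""
    sd = math.sqrt(0.5)  # assumption: N(0, 0.5) denotes variance 0.5
    rows, labels = [], []
    for _ in range(n_samples):
        a = [rng.gauss(0.0, sd) for _ in BETAS]
        eta = beta0 + sum(b * ai for b, ai in zip(BETAS, a))
        q = 1.0 / (1.0 + math.exp(-eta))   # success probability
        u = rng.random()                   # u ~ U[0, 1), independent of a
        rows.append(a)
        labels.append(1 if u < q else 0)   # y = 1 if u < q, else y = 0
    return rows, labels


def simulate_with_m_positives(n_samples, beta0, m, rng, max_tries=100000):
    """Keep only a data set with exactly m positive samples, as in the text."""
    for _ in range(max_tries):
        rows, labels = simulate(n_samples, beta0, rng)
        if sum(labels) == m:
            return rows, labels
    return None  # no data set with exactly m positives was found


rng = random.Random(2015)
rows, labels = simulate(1000, -8.0, rng)
```

Passing an explicit `random.Random` instance keeps each Monte Carlo run reproducible. The rejection step in `simulate_with_m_positives` mirrors the rule of using only data sets with the required number of positive samples.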
3.2.2 Using Logistic Regression as a Classifier
Logistic regression is a direct probability model. It is a special case of generalized linear models. For logistic regression, the conditional distribution of binary data given covariates follows a Bernoulli distribution with success probability bounded between 0 and 1. Thus one can use a binary logistic model to predict binary outcomes based on predictor variables.
Given a binary random variable $y$ and a vector of predictors $a$ (which could be continuous or discrete), logistic regression can be used to predict the success probability $P(y \mid a)$.
\[ P(y = 1 \mid a) = \frac{1}{1 + \exp\left(\beta_0 + \sum_{i=1}^{b} \beta_i a_i\right)} \tag{3.3} \]
Where $b$ is the number of predictors. $P(y \mid a)$ leads to a simple linear expression for classification. Generally, we assign the label $Y = 1$ if the following condition holds:
\[ \frac{P(y = 0 \mid a)}{P(y = 1 \mid a)} < 1, \]
Which is equivalent to
\[ \exp\left(\beta_0 + \sum_{i=1}^{b} \beta_i a_i\right) < 1 \]
After taking the natural log of both sides, we can assign $y = 1$ if
\[ \beta_0 + \sum_{i=1}^{b} \beta_i a_i < 0, \]
and assign $y = 0$ otherwise.
In the previous section, I simulated the binary distributed data $(y, a)$. I then choose logistic regression as a binary classifier to classify the data. I choose $a_1, a_3, a_5$ as predictors, and use expression (3.3) to get the success probability, which is then compared with a threshold. I assume that the threshold follows a uniform distribution from 0 to 1, and I randomly select one threshold from $U(0, 1)$ for each classifier. If the predicted probability is less than or equal to the threshold, then $\hat{y} = 0$, else $\hat{y} = 1$.
3.2.3 Estimate of AUC and Expected AUC Calculation
As I mentioned before, the AUC is the area under the ROC curve, a plot of the true positive rate as a function of the false positive rate. To calculate the area under the curve, I can integrate under the ROC curve. Thus it is necessary to calculate the true positive rate and the false positive rate at each point of the ROC curve.
From my simulation, I can get the true positive rate (TPS) and the false positive rate (FPS) by the following formulas.
\[ TPS = \frac{\text{number of } (\hat{Y} = Y = 1)}{\text{number of } (Y = 1)} \]
\[ FPS = \frac{\text{number of } (\hat{Y} = 1 \text{ and } Y = 0)}{\text{number of } (Y = 0)} \]
I choose 1000 points on the ROC curve and calculate the true positive rate and false positive rate for each point. Then I calculate the integral under the ROC curve to get the AUC. Eq. (3.1) given by Cortes and Mehryar is the expected average AUC value under a fixed error number $k$, which means a fixed error rate. However, it is difficult to investigate the AUC distribution under every $k \in [0, \min(m, n)]$. So I investigate the AUC distribution under several small ranges of $k$, each spanning 10 errors (an error rate interval of width 0.01 for 1000 samples). I compare the estimate of AUC with the expected average AUC under each range. To calculate the expected average AUC, the number of errors $k$ needs to be known for each iteration. I define the error number $k$ as the count of samples with $\hat{Y}_i \neq Y_i$. Then I plug $m$, $n$, $k$ into expression (3.1) to get the expected average AUC.
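The procedure of this section (sweeping thresholds to trace the ROC curve, integrating under it, and counting errors at one threshold) can be sketched as below. The function names, the evenly spaced threshold grid and the trapezoidal rule are assumptions of this sketch; the thesis states only that 1000 points on the curve are used and that the curve is integrated.

```python
def roc_points(probs, labels, n_points=1000):
    """Return sorted (false positive rate, true positive rate) pairs.

    A sample is predicted positive when its probability exceeds the
    threshold, matching the rule "less than or equal to threshold -> 0".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for j in range(n_points + 1):
        t = j / n_points  # evenly spaced thresholds in [0, 1] (assumption)
        tp = sum(1 for p, y in zip(probs, labels) if p > t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p > t and y == 0)
        points.append((fp / negatives, tp / positives))
    return sorted(points)


def trapezoid_auc(points):
    """Integrate under the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def error_count(probs, labels, threshold):
    """Number of errors k: samples whose predicted label differs from y."""
    return sum(1 for p, y in zip(probs, labels)
               if (1 if p > threshold else 0) != y)


probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
auc = trapezoid_auc(roc_points(probs, labels))
k = error_count(probs, labels, 0.5)
```

In a full simulation, `k` would be computed at the randomly drawn threshold of each iteration and then passed, together with $m$ and $n$, to expression (3.1).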
3.2.4 Comparing the Estimate of AUC versus Expected Average AUC
I have 1000 samples in total, with $m$ positive samples and $n$ negative samples. I define the ratio of positive samples $r$ as $r = m/(m+n)$. I compare the estimate of AUC and the expected average AUC under $r = 0.1, 0.2, 0.3, 0.4, 0.5$. First, I draw the empirical cumulative distribution function (CDF) plots of AUC under specific ranges of $k$ to investigate the AUC distribution. After that, I draw the probability density function (PDF) plots of the estimate of AUC under specific ranges of $k$, and add the lower bound and upper bound of the expected average AUC calculated by expression (3.1) as reference lines to compare the estimate of AUC and the expected average AUC. Finally, I give descriptive statistics to show the difference between the estimate of AUC and the expected average AUC under specific ranges of the error number $k$.
First, I look at the situation where $r = 10\%$, which means there are 100 positive samples and 900 negative samples. In this situation, I only investigate the distribution of the estimate of AUC when $k \le 100$. Here, I plot the CDF and PDF of the estimate of AUC with the estimate of error rate from 0.07 to 0.08, 0.08 to 0.09 and 0.09 to 0.1. In these ranges of error rate, I have 2,924, 70,609 and 312,565 estimates of AUC values, respectively.
Figure 3.2: CDF of estimate of AUC (r=10%)
Figure (3.2) shows that the estimate of AUC with lower error rate has a higher value. The minimum and maximum of the estimate of AUC with error rate from 0.07 to 0.08 are the largest among the three groups of estimates. Then I plot the PDF of the estimate of AUC with error rate from 0.07 to 0.08, 0.08 to 0.09 and 0.09 to 0.1, and add the upper bound and lower bound of the expected average AUC calculated by expression (3.1). The PDF plot is shown in figure (3.3). Then I calculate the difference between the average estimate of AUC and the bounds (upper and lower) corresponding to each range of error rate (Table 3.1) as reference.
Figure 3.3: PDF of estimate of AUC (r=10%)
Table 3.1: Difference between estimate of AUC and expected average AUC (r=10%)
Error rate                               (0.07,0.08]   (0.08,0.09]   (0.09,0.1]
Average estimate of AUC - lower bound    0.27612       0.31761       0.36258
Average estimate of AUC - upper bound    0.22388       0.2642        0.3079
Figure (3.3) indicates that the expected average AUC value is always lower than the estimate of AUC. From table (3.1) I find that when the error rate gets larger, the difference between the estimate of AUC and the expected average AUC gets larger too.
When $r = 20\%$, there are 200 positive samples and 800 negative samples. In this case, I only investigate the distribution of the estimate of AUC when $k \le 200$. I plot the CDF and PDF of the estimate of AUC with the estimate of error rate from 0.15 to 0.16, 0.17 to 0.18 and 0.19 to 0.2. In these ranges of error rate, I have 39,892, 100,487 and 110,114 estimates of AUC values, respectively. The CDF is shown in figure (3.4).
Figure 3.4: CDF of estimate of AUC (r=20%)
From figure (3.4) I can see that the estimate of AUC has a similar distribution under the ranges of error rate from 0.17 to 0.18 and 0.19 to 0.20. Then I plot the PDF of the estimates of AUC to investigate the difference between the estimates of AUC and the expected average AUC.
Figure 3.5: PDF of estimate of AUC (r=20%)
From figure (3.5) I find that the expected average AUC is always lower than the estimates of AUC. The estimates of AUC have a similar unimodal distribution when the error rate is within the ranges $(0.17, 0.18]$ and $(0.19, 0.20]$. The estimate of AUC with error rate from 0.15 to 0.16 is higher than the other two. The difference between the average estimate of AUC and the bounds (upper and lower) corresponding to each range of error rate is shown in table (3.2). The table shows that when the error rate gets larger, the difference between the estimate of AUC and the expected average AUC gets larger too.
Table 3.2: Difference between estimate of AUC and expected average AUC (r=20%)
Error rate                               (0.15,0.16]   (0.17,0.18]   (0.19,0.20]
Average estimate of AUC - lower bound    0.23484       0.28004       0.30864
Average estimate of AUC - upper bound    0.20752       0.25125       0.27902
When $r = 30\%$, there are 300 positive samples and 700 negative samples. In this situation, expression (3.1) is only valid when $k \le 300$. I draw the CDF and PDF plots of the estimates of AUC with the estimate of error rate from 0.19 to 0.20, 0.24 to 0.25 and 0.29 to 0.3. In these ranges of error rate, I have 7,060, 48,399 and 55,015 estimates of AUC value, respectively. The plot of the CDF of the estimates of AUC under each error rate range is shown in figure 3.6.
Figure 3.6: CDF of estimate of AUC (r=30%)
From figure 3.6 I notice that the estimate of AUC under error rate from 0.19 to 0.20 has the highest value. The distributions of the estimate of AUC under error rates from 0.24 to 0.25 and 0.29 to 0.30 are almost the same. To further investigate the difference between the estimate of AUC and the expected average AUC within the same range of error rate, I plot the PDF of the estimate of AUC with the bounds of the expected average AUC as reference (figure 3.7).
Figure 3.7: PDF of estimate of AUC (r=30%)
From figure (3.7) I find that the value of the expected average AUC is much smaller than the estimates of AUC under each range of error rate. Table (3.3), which gives the difference between the average estimate of AUC and the bounds (upper and lower) corresponding to each range of error rate, shows that when the error rate gets larger, the difference gets larger as well.
Table 3.3: Difference between estimate of AUC and expected average AUC (r=30%)
Error rate                               (0.19,0.20]   (0.24,0.25]   (0.29,0.3]
Average estimate of AUC - lower bound    0.1507        0.21916       0.32513
Average estimate of AUC - upper bound    0.13375       0.20008       0.30225
For the situation where $r = 40\%$, there are 400 positive samples and 600 negative samples. In this situation, I only investigate the distribution of the estimate of AUC when $k \le 400$. I plot the CDF and PDF of the estimate of AUC with the estimate of error rate from 0.24 to 0.25, 0.30 to 0.31, 0.34 to 0.35 and 0.39 to 0.4. In these ranges of error rate, I have 41,575, 26,702, 20,451 and 31,572 estimates of AUC value, respectively. The plot of the CDF of the estimates of AUC under each error rate range is shown in figure 3.8.
Figure 3.8: CDF of estimate of AUC (r=40%)
From this plot I notice that when the error rate is from 0.24 to 0.25, the estimate of AUC is higher than the others. When the error rate is from 0.39 to 0.4, the estimate of AUC is smallest. The distributions of the estimate of AUC with error rates from 0.3 to 0.31, 0.34 to 0.35 and 0.39 to 0.4 are very similar to each other. The PDF plot of the estimate of AUC is shown in figure (3.9).
Figure 3.9: PDF of estimate of AUC (r=40%)
From figure (3.9) I can clearly see that the estimate of AUC with error rate from 0.24 to 0.25 has the largest value, and the difference between it and the expected average AUC is smallest. For error rates from 0.3 to 0.31, 0.34 to 0.35 and 0.39 to 0.4, the estimate of AUC has a similar distribution. However, the expected average AUC value differs a lot across these three ranges of error rate. The difference between the average estimate of AUC and the bounds (upper and lower) corresponding to each range of error rate is shown in table (3.4).
Table 3.4: Difference between estimate of AUC and expected average AUC (r=40%)
Error rate                               (0.24,0.25]   (0.3,0.31]   (0.34,0.35]   (0.39,0.4]
Average estimate of AUC - lower bound    0.11185       0.17863      0.23464       0.31688
Average estimate of AUC - upper bound    0.09987       0.16557      0.22012       0.29808
This table also shows that when the error rate gets larger, the difference between the estimate of AUC and the expected average AUC gets larger.
When
r
=
50%,thereare500positivesamplesand500negativesamples.SoIinvestigatethe
situationthat
k

500.IplottheCDFandPDFforestimateofAUCwiththeestimateoferrorrate
from0.24to0.25,0.30to0.31,0.34to0.35,0.39to0.4,0.44to0.45and0.49to0.50.Ineach
rangeoferrorrate,Ihave20,467,28,456,19,620,15,038,13,536and29,677estimatesofAUC
22
value.TheCDFplotofestimateofAUCisshownin(3.10).
Figure 3.10: CDF of estimate of AUC (r = 50%)
Figure 3.10 shows that the estimate of AUC with error rate from 0.24 to 0.25 is higher than the
other estimates of AUC with higher error rates. The estimates of AUC with error rate from 0.3 to
0.31, 0.34 to 0.35, 0.39 to 0.4, 0.44 to 0.45 and 0.49 to 0.5 have similar distributions. The PDF
plot of those estimates of AUC is shown in Figure 3.11.
Figure 3.11: PDF of estimate of AUC (r = 50%)
The PDF plot (Figure 3.11) also shows that the estimate of AUC with error rate from 0.24 to 0.25
has the highest value, and its range is much narrower than the others. For the estimates of AUC
with the other ranges of error rate, the distributions are similar. However, the expected average
AUC for those ranges of error rate differs considerably. I calculate the difference between the
average estimate of AUC and the bounds (upper and lower) corresponding to each range of error rate
(Table 3.5) as a reference. The table shows that the difference between the estimate of AUC and
the expected average AUC gets larger when the error rate gets larger. However, in this situation,
the difference is smaller than in the previous situations with a smaller proportion of positive
samples.
Table 3.5: Difference between estimate of AUC and expected average AUC (r = 50%)
Error rate                      (0.24, 0.25]  (0.3, 0.31]  (0.34, 0.35]  (0.39, 0.4]  (0.44, 0.45]  (0.49, 0.5]
Avg Est. of AUC - lower bound   0.0859        0.1304       0.171         0.2212       0.271         0.3194
Avg Est. of AUC - upper bound   0.0759        0.1204       0.161         0.2112       0.261         0.3094
In order to compare the difference between the estimate of AUC and the expected average AUC in
each situation, I calculated the mean of the differences between the average estimate of AUC and
the bounds (upper and lower). The results are shown in Table 3.6.
Table 3.6: The Difference between estimate of AUC and Expected Average AUC
r = m/(m+n)                             0.1       0.2       0.3       0.4       0.5
mean of difference (with lower bound)   0.318771  0.274505  0.231665  0.2105    0.199911
mean of difference (with upper bound)   0.265325  0.245932  0.212028  0.195908  0.189911
Table 3.6 indicates that when r gets larger, the difference between the expected average AUC and
the estimate of AUC gets smaller. That means that when the positive samples and negative samples
are evenly distributed, the deviation of expression (3.1) is smallest, even though the
equiprobable assumption is not satisfied. But when the positive samples and negative samples are
not evenly distributed, we cannot use equation 3.1 as a reference.
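The averaging behind Table 3.6 can be checked with a few lines of Python. This is a minimal sketch, assuming the entries of Table 3.6 are plain arithmetic means of the per-range differences in Tables 3.4 and 3.5; the dictionary layout and the names `differences` and `mean_difference` are illustrative and not part of the original analysis.

```python
from statistics import fmean

# Differences copied from Table 3.4 (r = 40%) and Table 3.5 (r = 50%).
# Keys are (r, bound); values are one difference per error-rate range.
differences = {
    (0.4, "lower"): [0.11185, 0.17863, 0.23464, 0.31688],
    (0.4, "upper"): [0.09987, 0.16557, 0.22012, 0.29808],
    (0.5, "lower"): [0.0859, 0.1304, 0.171, 0.2212, 0.271, 0.3194],
    (0.5, "upper"): [0.0759, 0.1204, 0.161, 0.2112, 0.261, 0.3094],
}

# Assumption: Table 3.6 reports the unweighted mean over the error-rate ranges.
mean_difference = {key: fmean(values) for key, values in differences.items()}

for (r, bound), value in sorted(mean_difference.items()):
    print(f"r = {r}, {bound} bound: mean difference = {value:.6f}")
```

Under this assumption the r = 0.4 means reproduce the Table 3.6 entries; the r = 0.5 means agree to about three decimal places, which is what one would expect if Table 3.5 reports rounded values.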
3.3 Study the Error Rate Distribution Under the Fixed AUC
Another objective of this thesis is to study how the error rate is distributed under a fixed range
of AUC. To observe the distribution clearly, I draw the CDF plots of the error rate for the fixed
AUC ranges (0.49, 0.51), (0.59, 0.61), (0.69, 0.71), (0.79, 0.81) and (0.89, 0.91). I study
situations in which the ratio of positive samples $r = m/(m+n)$ varies from 0.1 to 0.5 in steps of
0.1 to investigate whether the distribution is conditional on the ratio of positive examples.
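Conditioning on an AUC range and reading off an empirical CDF can be sketched as follows. This is an illustration only: the (AUC, error rate) pairs are hypothetical values, not the Monte Carlo output, and the helper names `errors_in_auc_range` and `ecdf` are assumptions introduced here.

```python
from bisect import bisect_right


def errors_in_auc_range(pairs, lo, hi):
    """Sorted error rates whose AUC estimate lies in the open interval (lo, hi)."""
    return sorted(err for auc, err in pairs if lo < auc < hi)


def ecdf(sorted_values, x):
    """Empirical CDF at x: the share of observations that are <= x."""
    if not sorted_values:
        raise ValueError("no observations in this AUC range")
    return bisect_right(sorted_values, x) / len(sorted_values)


# Hypothetical (AUC estimate, error rate estimate) pairs.
pairs = [(0.500, 0.10), (0.505, 0.10), (0.495, 0.90), (0.600, 0.12), (0.700, 0.09)]

errs = errors_in_auc_range(pairs, 0.49, 0.51)
for x in (0.1, 0.5, 0.9):
    print(f"F({x}) = {ecdf(errs, x):.3f}")
```

A jump in such a step function at one error rate means that many replicates share that value, which is how the jumps in the CDF plots of this section can be read.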
In this section, I use the same Monte Carlo data sets as in the previous section. In order to cover
a large range of AUC estimates, I choose different predictors for the logistic regression
classifier for each range of the estimate of AUC. The classification threshold is randomly chosen
from a uniform distribution $U(0, 1)$.
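One replicate of this procedure can be sketched in Python. The sketch makes several assumptions that are not taken from the thesis: the scores are drawn from two Beta distributions rather than produced by a fitted logistic regression, the AUC is computed with the Mann-Whitney pair-counting form, and the names `auc_mann_whitney` and `error_rate` are illustrative.

```python
import random


def auc_mann_whitney(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def error_rate(scores, labels, threshold):
    """Misclassification rate when a score >= threshold is predicted positive."""
    wrong = sum((s >= threshold) != (y == 1) for s, y in zip(scores, labels))
    return wrong / len(labels)


rng = random.Random(0)
labels = [1] * 100 + [0] * 900  # r = 10% positives, as in the first scenario
# Hypothetical scores standing in for fitted probabilities.
scores = [rng.betavariate(3, 2) if y == 1 else rng.betavariate(2, 3) for y in labels]

threshold = rng.uniform(0, 1)  # threshold drawn from U(0, 1)
auc = auc_mann_whitney(scores, labels)
err = error_rate(scores, labels, threshold)
print(f"threshold = {threshold:.3f}, AUC = {auc:.3f}, error rate = {err:.3f}")
```

The AUC does not depend on the threshold, while the error rate does, so repeating the draw of the threshold for a fixed set of scores spreads the error rate over a range of values at one AUC.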
3.3.1 CDF and PDF Plots for Error Rate Under the Fixed Range AUC
First, I study the situation where r = 10%. The CDF plot and PDF plot are shown in Figures 3.12
and 3.13. For the estimate of AUC from 0.49 to 0.51, I chose $a_5$ as predictor, and I have 3,444
estimates of AUC in this range. For the estimate of AUC from 0.59 to 0.61, I chose $a_1$ as
predictor, and I have 15,620 estimates of AUC in this range. For the estimate of AUC from 0.69 to
0.71, I chose $a_1, a_4, a_5$ as predictors, and I have 35,709 estimates of AUC in this range. For
the estimate of AUC from 0.79 to 0.81, I chose $a_1, a_3, a_5$ as predictors, and I have 7,982
estimates of AUC in this range. For the estimate of AUC from 0.89 to 0.91, I chose
$a_2, a_4, a_5$ as predictors, and I have 31,450 estimates of AUC in this range.
Figure 3.12 shows that when the estimate of AUC is from 0.49 to 0.51, the majority of the
estimated error rates are around 0.1. When the AUC gets larger, the range of the error rate
becomes larger too; however, the maximum error rate is always around 0.9. The larger the AUC, the
smaller the minimum error rate. From Figure 3.13 I find that the mode of the error rate is around
0.1, and that when the AUC gets larger, the mode of the density gets closer to 0. I also calculate
the mean, median, and quartiles of the error rate under different AUC (Table 3.7).
Figure 3.12: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 10%)
Figure 3.13: Error Rate Distribution Under Fixed AUC (r = 10%)
Table 3.7: Descriptive Statistics of Error Rate under Fixed AUC (r = 10%)
AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.1        0.095      0.092      0.088      0.067
1st Quartile   0.1        0.1        0.1        0.1        0.089
Median         0.1        0.1        0.1        0.101      0.096
Mean           0.1817     0.1787     0.1691     0.156      0.1262
3rd Quartile   0.1        0.1783     0.116      0.127      0.107
Maximum        0.901      0.901      0.901      0.9        0.9
The descriptive statistics show that the mean of the error rate gets smaller when the AUC gets
larger. The maximum error rate for each AUC range is around 0.9, and the median of the error rate
is always around 0.1.
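One possible reading of the recurring values 0.1 and 0.9, offered here as an interpretation rather than a result of the thesis, is that they are the error rates of the two degenerate decision rules: a threshold above every score predicts all samples negative, and a threshold below every score predicts all samples positive. The function name `degenerate_error_rates` is introduced for illustration.

```python
def degenerate_error_rates(m, n):
    """Error rates of the all-negative and all-positive rules for m positives, n negatives."""
    total = m + n
    all_negative = m / total  # every positive sample is misclassified: equals r
    all_positive = n / total  # every negative sample is misclassified: equals 1 - r
    return all_negative, all_positive


# Sample sizes of the five scenarios (1000 samples, r = 0.1, ..., 0.5).
for m in (100, 200, 300, 400, 500):
    low, high = degenerate_error_rates(m, 1000 - m)
    print(f"r = {m / 1000:.1f}: all-negative error = {low:.1f}, all-positive error = {high:.1f}")
```

With a threshold drawn from U(0, 1) and scores concentrated in a narrow band, many draws fall outside that band, so these two error rates would be observed often.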
When r = 20%, there are 200 positive samples and 800 negative samples. For the estimate of AUC
from 0.49 to 0.51, I chose $a_5$ as predictor, and I have 1,968 estimates of AUC in this range.
For the estimate of AUC from 0.59 to 0.61, I chose $a_1$ as predictor, and I have 7,798 estimates
of AUC in this range. For the estimate of AUC from 0.69 to 0.71, I chose $a_1, a_4, a_5$ as
predictors, and I have 31,384 estimates of AUC in this range. For the estimate of AUC from 0.79 to
0.81, I chose $a_3, a_4, a_5$ as predictors, and I have 11,826 estimates of AUC in this range. For
the estimate of AUC from 0.89 to 0.91, I chose $a_1, a_2, a_5$ as predictors, and I have 60,867
estimates of AUC in this range. The CDF plot and PDF plot are shown in Figures 3.14 and 3.15.
From the empirical cumulative distribution function plot (Figure 3.14) I find that when the
estimate of AUC is from 0.49 to 0.51, the majority of the error rates are 0.2. When the AUC gets
larger, the minimum error rate gets smaller, but the maximum error rate is always around 0.8. The
probability density plot (Figure 3.15) shows that there are two modes in this plot. The higher one
is around 0.2, and the lower one is around 0.8. When the AUC becomes larger, the higher mode moves
closer to 0.1. The descriptive statistics are shown in Table 3.8.
Figure 3.14: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 20%)
Figure 3.15: Error Rate Distribution Under Fixed AUC (r = 20%)
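How modes of this kind can be located is sketched below with a plain histogram. This is a simplified stand-in: the section does not say how the density plots were produced, the sample of error rates is hypothetical, and the names `histogram_density` and `local_modes` are assumptions.

```python
def histogram_density(values, n_bins=20):
    """Histogram density estimate on [0, 1] with n_bins equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    width = 1 / n_bins
    return [c / (len(values) * width) for c in counts]


def local_modes(density):
    """Centers of bins whose density is above the left neighbor and not below the right."""
    n_bins = len(density)
    modes = []
    for i, d in enumerate(density):
        left = density[i - 1] if i > 0 else 0.0
        right = density[i + 1] if i < n_bins - 1 else 0.0
        if d > left and d >= right:
            modes.append((i + 0.5) / n_bins)
    return modes


# Hypothetical error rates: a large cluster near 0.2 and a smaller one near 0.8.
errors = [0.21] * 6 + [0.23] * 2 + [0.17] + [0.81] * 3
modes = local_modes(histogram_density(errors))
print(modes)
```

The number of modes found this way depends on the bin width, so the counts of modes reported from a smoothed density plot should be read with the same caution.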
Table 3.8: Descriptive Statistics of Error Rate under Fixed AUC (r = 20%)
AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.199      0.192      0.178      0.154      0.111
1st Quartile   0.2        0.2        0.2        0.185      0.144
Median         0.2        0.2        0.2        0.197      0.161
Mean           0.3226     0.3157     0.2917     0.2524     0.1925
3rd Quartile   0.2        0.276      0.28       0.239      0.192
Maximum        0.8        0.801      0.8        0.8        0.8
This table indicates that the maximum error rate under each AUC range is almost the same, around
0.8. The mean error rate gets smaller when the AUC gets larger. For the first four ranges of the
estimate of AUC, the first quartile of the error rate is around 0.2. But for the estimate of AUC
from 0.89 to 0.91, the first quartile of the error rate is much smaller, around 0.14. The minimum
and mean of the error rate get smaller when the range of the AUC estimate gets larger.
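The six summary rows used in these tables can be computed as in the following sketch. The quartile convention is an assumption: the "inclusive" method of Python's `statistics.quantiles` interpolates linearly between order statistics, and the thesis does not state which convention it used. The sample values and the name `describe` are hypothetical.

```python
import statistics


def describe(error_rates):
    """Minimum, quartiles, median, mean, and maximum of a sample of error rates."""
    q1, median, q3 = statistics.quantiles(error_rates, n=4, method="inclusive")
    return {
        "Minimum": min(error_rates),
        "1st Quartile": q1,
        "Median": median,
        "Mean": statistics.fmean(error_rates),
        "3rd Quartile": q3,
        "Maximum": max(error_rates),
    }


# Hypothetical error rates for one AUC range.
sample = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.19, 0.28, 0.8, 0.8]
for name, value in describe(sample).items():
    print(f"{name:<13}{value:.4f}")
```

A sample like this one, with most mass at one value and a few large values, gives a mean well above the median, which is the pattern seen in the tables of this section.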
When r = 30%, I have 300 positive samples and 700 negative samples. For the estimate of AUC from
0.49 to 0.51, I chose $a_5$ as predictor, and I have 1,488 estimates of AUC in this range. For the
estimate of AUC from 0.59 to 0.61, I chose $a_1$ as predictor, and I have 4,156 estimates of AUC
in this range. For the estimate of AUC from 0.69 to 0.71, I chose $a_1, a_4, a_5$ as predictors,
and I have 32,683 estimates of AUC in this range. For the estimate of AUC from 0.79 to 0.81, I
chose $a_3, a_4, a_5$ as predictors, and I have 35,089 estimates of AUC in this range. For the
estimate of AUC from 0.89 to 0.91, I chose $a_1, a_2, a_4$ as predictors, and I have 20,490
estimates of AUC in this range. The CDF plot and PDF plot are shown in Figures 3.16 and 3.17.
Based on the empirical cumulative distribution function plot (Figure 3.16) I notice that when the
range of the AUC estimate is 0.49 to 0.51, there are two jumps in the CDF plot, one around 0.3 and
the other around 0.7. When the AUC estimate gets larger, the minimum error rate gets smaller, but
the maximum error rate is always around 0.7. From the probability density plot I can see that when
the AUC estimate is small (ranges (0.49, 0.51), (0.59, 0.61), (0.69, 0.71)), there are two modes
of the error rate, one around 0.7 and the other around 0.3. When the AUC estimate gets larger,
there are multiple modes. Both of the larger ranges have one mode around 0.7. For the AUC estimate
from 0.79 to 0.81, the modes are around 0.3, 0.2 and 0.7. For the estimate of AUC from 0.89 to
0.91, the other modes are around 0.25 and 0.19. The descriptive statistics are shown in Table 3.9.
Figure 3.16: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 30%)
Figure 3.17: Error Rate Distribution Under Fixed AUC (r = 30%)
Table 3.9: Descriptive Statistics of Error Rate under Fixed AUC (r = 30%)
AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.298      0.286      0.253      0.202      0.143
1st Quartile   0.3        0.3        0.293      0.247      0.179
Median         0.3        0.3        0.3        0.276      0.204
Mean           0.4264     0.4078     0.375      0.3157     0.2339
3rd Quartile   0.7        0.534      0.414      0.309      0.256
Maximum        0.7        0.702      0.701      0.7        0.7
This table shows that the maximum error rate for each range of the AUC is around 0.7. The minimum
and mean error rates get smaller when the AUC gets larger. The median and first quartile of the
error rate for the first three ranges of the AUC estimate are around 0.3. For the AUC estimate
from 0.89 to 0.91, the first quartile and median of the error rate are much smaller than 0.3.
For r = 40%, I have 400 positive samples and 600 negative samples. For the estimate of AUC from
0.49 to 0.51, I chose $a_5$ as predictor, and I have 1,488 estimates of AUC in this range. For the
estimate of AUC from 0.59 to 0.61, I chose $a_1$ as predictor, and I have 5,870 estimates of AUC
in this range. For the estimate of AUC from 0.69 to 0.71, I chose $a_1, a_4, a_5$ as predictors,
and I have 34,875 estimates of AUC in this range. For the estimate of AUC from 0.79 to 0.81, I
chose $a_1, a_3, a_5$ as predictors, and I have 16,566 estimates of AUC in this range. For the
estimate of AUC from 0.89 to 0.91, I chose $a_2, a_3, a_5$ as predictors, and I have 9,290
estimates of AUC in this range.
The CDF plot and PDF plot are shown in Figures 3.18 and 3.19.
Figure 3.18: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 40%)
Figure 3.19: Error Rate Distribution Under Fixed AUC (r = 40%)
From the CDF plot I can see that when the estimate of AUC is from 0.49 to 0.51, there are two
jumps in the CDF plot, one around 0.4 and the other around 0.6. When the AUC estimate gets larger,
the minimum error rate gets smaller. For the AUC estimate from 0.89 to 0.91, the minimum error
rate is less than 0.2. However, the largest error rate is still around 0.6. From the probability
density plot I notice that for estimates of AUC from 0.49 to 0.51 and 0.59 to 0.61, the error rate
has two modes, one around 0.6 and another around 0.4. For the estimate of AUC from 0.69 to 0.71,
the error rate has three modes, one around 0.4, one around 0.35 and the other around 0.6. For the
AUC estimate from 0.79 to 0.81, the error rate has three modes, one around 0.3, one around 0.4 and
the other around 0.6. However, for the estimate of AUC from 0.89 to 0.91, the error rate has only
one mode, around 0.2. The descriptive statistics are shown in Table 3.10.
Table 3.10: Descriptive Statistics of Error Rate under Fixed AUC (r = 40%)
AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.395      0.359      0.305      0.238      0.152
1st Quartile   0.4        0.4        0.359      0.282      0.187
Median         0.4        0.4        0.396      0.325      0.214
Mean           0.4815     0.4643     0.4222     0.3505     0.2431
3rd Quartile   0.6        0.579      0.463      0.391      0.276
Maximum        0.6        0.602      0.601      0.6        0.6
This table shows that the maximum error rate for all ranges of the estimate of AUC is around 0.6.
The mean and third quartile of the error rate get smaller as the AUC gets larger. The first
quartile and minimum error rate are around 0.4 for estimates of AUC from 0.49 to 0.51 and 0.59 to
0.61. When the AUC gets larger, the first quartile and minimum error rate get smaller.
For r = 50%, there are 500 positive samples and 500 negative samples. For the estimate of AUC from
0.49 to 0.51, I chose $a_5$ as predictor, and have 1,089 estimates of AUC in this range. For the
estimate of AUC from 0.59 to 0.61, I chose $a_4$ as predictor, and have 14,419 estimates of AUC in
this range. For the estimate of AUC from 0.69 to 0.71, I chose $a_1, a_4, a_5$ as predictors, and
have 37,195 estimates of AUC in this range. For the estimate of AUC from 0.79 to 0.81, I chose
$a_1, a_3, a_5$ as predictors, and I have 19,311 estimates of AUC in this range. For the estimate
of AUC from 0.89 to 0.91, I chose $a_1, a_3, a_4$ as predictors, and have 5,969 estimates of AUC
in this range.
Since the probability density function for the estimate of AUC from 0.49 to 0.51 differs a lot
from that for the other ranges of the estimate of AUC, I draw the PDF for these two situations
separately. The CDF plot and PDF plots are shown in Figures 3.20, 3.21, and 3.22.
From the CDF plot I notice that when the estimate of AUC is from 0.49 to 0.51, the majority of the
error rates are equal to 0.5. When the AUC gets larger, the error rate varies from 0.16 to 0.5.
When the estimate of AUC is from 0.89 to 0.91, the minimum error rate is less than 0.2. From
Figure 3.21 I find that when AUC = 0.5, the probability that the error rate equals 0.5 is
extremely large. Based on Figure 3.22 I find that when the estimate of AUC is from 0.59 to 0.61,
the error rate has two modes, one around 0.5 and the other around 0.45. For the estimate of AUC
from 0.69 to 0.71, the error rate has two modes, around 0.5 and 0.3. For estimates of AUC from
0.79 to 0.81 and from 0.89 to 0.91, the error rate has only one mode, around 0.25 and 0.2
respectively. The descriptive statistics are shown in Table 3.11.
Figure 3.20: Error Rate Empirical Cumulative Distribution Under Fixed AUC (r = 50%)
Figure 3.21: Error Rate Distribution Under Fixed AUC (r = 50%)
Figure 3.22: Error Rate Distribution Under Fixed AUC (r = 50%)
Table 3.11: Descriptive Statistics of Error Rate under Fixed AUC (r = 50%)
AUC            0.49-0.51  0.59-0.61  0.69-0.71  0.79-0.81  0.89-0.91
Minimum        0.477      0.4        0.325      0.252      0.167
1st Quartile   0.5        0.472      0.383      0.295      0.203
Median         0.5        0.499      0.445      0.344      0.234
Mean           0.4999     0.4836     0.437      0.3622     0.2631
3rd Quartile   0.5        0.5        0.494      0.425      0.304
Maximum        0.508      0.505      0.505      0.5        0.5
From this table I find that the maximum error rate is around 0.5. The mean error rate gets smaller
when the AUC gets larger. Only for the estimate of AUC from 0.49 to 0.51 are the first quartile
and the minimum error rate around 0.5. For the other, larger AUC ranges, the first quartile and
minimum error rate are much smaller than 0.5.
Chapter 4
Conclusion
4.1 Summary
The objective of this thesis was to investigate the relationship between AUC and misclassification
rate. Cortes and Mehryar (2004) have provided an expression for the expected value of AUC given
error number $k$ that is only valid when all classifications or rankings with $k$ errors are
equiprobable. The assumption of this equation is too strong to be met in real-life scenarios. I
also found that Cortes and Mehryar's expression is not valid in the situation where the error
number $k$ is larger than $\min(m, n)$. For their expression to be valid, the constraint
$k \le \min(m, n)$ needs to be imposed.
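For reference, the expected-AUC expression can be evaluated numerically. The sketch below uses the closed form as it is usually stated for the result in bibliography entry [4]; it is written from that published statement, not copied from equation 3.1, so it should be checked against equation 3.1 before being relied on. The function name `expected_auc` and the example values are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def expected_auc(k, m, n):
    """Average AUC over classifications with k errors, m positives and n negatives.

    Assumed closed form:
      1 - k/(m+n) - (n-m)^2 (m+n+1) / (4mn)
          * ( k/(m+n) - sum_{x=0}^{k-1} C(m+n, x) / sum_{x=0}^{k} C(m+n+1, x) )
    """
    if not 0 <= k <= min(m, n):
        raise ValueError("the expression is only used here for 0 <= k <= min(m, n)")
    total = m + n
    numerator = sum(comb(total, x) for x in range(k))            # x = 0, ..., k-1
    denominator = sum(comb(total + 1, x) for x in range(k + 1))  # x = 0, ..., k
    error = Fraction(k, total)
    weight = Fraction((n - m) ** 2 * (total + 1), 4 * m * n)
    return float(1 - error - weight * (error - Fraction(numerator, denominator)))


# Balanced classes: the correction term vanishes and the value is 1 - k/(m+n).
print(expected_auc(250, 500, 500))
# Unbalanced classes (r = 40%): the value falls below 1 - k/(m+n).
print(expected_auc(250, 400, 600))
```

Exact rational arithmetic is used because the two binomial sums are very large integers whose ratio is close to k/(m+n). For m = n the expression reduces to one minus the error rate, which agrees with the constant gap of 0.01 between the two rows of Table 3.5.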
I simulated a binary distribution using logistic regression and used the logistic regression model
as a classifier to study the relationship between misclassification rate and AUC. First, I
compared the estimate of the AUC value to the expected average value of AUC calculated by equation
3.1, only for the situation where $k \le \min(m, n)$. The results showed that the expected average
value of AUC is always lower than the estimate of AUC. When the positive samples and negative
samples were evenly distributed, the difference between the estimate of AUC and the expected
average value of AUC was smallest. Thus, one can use equation 3.1 as a reference to learn the
relationship between the AUC and the error rate when the proportions of positive examples and
negative examples are the same. But when the proportion of positive samples is extremely close to
0 or 1, it is very questionable to use this expression.
Furthermore, I studied the error rate distribution under a fixed range of AUC value when r varies
from 0.1 to 0.5. The results showed that when r = 0.1, 0.2, the mode of the error rate is always
around r or 1 - r. When the AUC becomes larger, the distribution of the error rate becomes right
skewed, and the mean error rate becomes smaller.
4.2 Limitation
In this thesis, I did a simulation study to investigate the relationship between AUC and
misclassification rate for binary distribution. Specifically, I tested the validity of the
expression given by Cortes and Mehryar (2004) and studied the distribution of the error rate under
the estimate of AUC.
In the first analysis part, I calculated the estimate of AUC given a fixed range of the estimate
of error rate to validate the expression given by Cortes and Mehryar (2004). I got the estimate of
error rate by a logistic regression classifier. Because the threshold for the classification is
unknown, I simply assumed that the threshold follows a uniform distribution from 0 to 1. This
assumption may be incorrect. In the second analysis part, I studied the distribution of the
estimate of error rate under a fixed range of the estimate of AUC. Because AUC is a continuous
variable and I cannot get the error rate distribution under every single value of AUC, I looked
into only 5 intervals of AUC. I did not use the same classifier to get these 5 intervals of AUC
value because that would need an extremely large number of Monte Carlo samples. Instead, I used 5
different classifiers to get these 5 intervals of AUC value. This is not ideal, and the
distribution of the error rate under AUC would be more accurate if I had more Monte Carlo samples.
4.3 Discussion
In this thesis, I evaluated the validity of the expression provided by Cortes and Mehryar (2004)
for moderate to large deviations from the equiprobable assumption. My results showed that when the
positive samples and negative samples are not evenly distributed, the expression is questionable.
Based on my work, readers can get a rough idea of how the error rate is distributed under a fixed
range of AUC. To investigate the relationship more precisely, Bayesian inference methods could be
applied. The difficulty of Bayesian inference is that both the error rate and the AUC are random
variables, and it is hard to find the distribution of AUC and error rate.
BIBLIOGRAPHY
[1] Gorno-Tempini, M. L., et al. "Classification of primary progressive aphasia and its variants."
Neurology 76.11 (2011): 1006-1014.
[2] Planet, Paul J., et al. "Phylogeny of genes for secretion NTPases: identification of the
widespread tadA subfamily and development of a diagnostic key for gene classification."
Proceedings of the National Academy of Sciences 98.5 (2001): 2503-2508.
[3] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. "Thumbs up?: sentiment classification
using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in
natural language processing - Volume 10. Association for Computational Linguistics, 2002.
[4] Cortes, Corinna, and Mehryar Mohri. "AUC optimization vs. error rate minimization." Advances
in neural information processing systems 16.16 (2004): 313-320.
[5] Mohri, C. "Confidence intervals for the area under the ROC curve." Advances in neural
information processing systems 17 (2005): 305.
[6] Hand, David J., and Robert J. Till. "A simple generalisation of the area under the ROC curve
for multiple class classification problems." Machine learning 45.2 (2001): 171-186.
[7] Herschtal, Alan, and Bhavani Raskutti. "Optimising area under the ROC curve using gradient
descent." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
[8] Ma, Shuangge, and Jian Huang. "Regularized ROC method for disease classification and
biomarker selection with microarray data." Bioinformatics 21.24 (2005): 4356-4362.
[9] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep
convolutional neural networks." Advances in neural information processing systems. 2012.
[10] Och, Franz Josef. "Minimum error rate training in statistical machine translation."
Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1.
Association for Computational Linguistics, 2003.
[11] Ben-David, Shai, et al. "Minimizing the misclassification error rate using a surrogate convex
loss." arXiv preprint arXiv:1206.6442 (2012).
[12] Murthy, Sreerama K. "Automatic construction of decision trees from data: A
multi-disciplinary survey." Data mining and knowledge discovery 2.4 (1998): 345-389.
[13] Joachims, Thorsten. "A support vector method for multivariate performance measures."
Proceedings of the 22nd international conference on Machine learning. ACM, 2005.
[14] Metz, Charles E. "Basic principles of ROC analysis." Seminars in nuclear medicine. Vol. 8.
No. 4. WB Saunders, 1978.
[15] Ferri, Cesar, Jose Hernandez-Orallo, and R. Modroiu. "An experimental comparison of
performance measures for classification." Pattern Recognition Letters 30.1 (2009): 27-38.
[16] Fawcett, Tom. "An introduction to ROC analysis." Pattern recognition letters 27.8 (2006):
861-874.
[17] Ling, Charles X., Jin Huang, and Harry Zhang. "AUC: a statistically consistent and more
discriminating measure than accuracy." IJCAI. Vol. 3. 2003.
[18] Bradley, Andrew P. "The use of the area under the ROC curve in the evaluation of machine
learning algorithms." Pattern recognition 30.7 (1997): 1145-1159.
[19] Mann, H. B., Whitney, D. R. (1947) On a test of whether one of two random variables is
stochastically larger than the other. Ann. Math. Statist., 18, pp. 50-60.
[20] Wilcoxon, F. (1945) Individual comparisons by ranking methods. Biometrics, 1, pp. 80-83.
[21] Hanley, James A., and Barbara J. McNeil. "The meaning and use of the area under a receiver
operating characteristic (ROC) curve." Radiology 143.1 (1982): 29-36.
[22] Huang, Jin, and Charles X. Ling. "Using AUC and accuracy in evaluating learning algorithms."
Knowledge and Data Engineering, IEEE Transactions on 17.3 (2005): 299-310.
[23] Breiman L., Friedman J. H., Olshen R. A., Stone C. J. (1984) Classification and Regression
Trees, Wadsworth International Group.
[24] Mika, S., Ratsch, G., Weston, J., Scholkopf, B. and Muller, K.-R. (1999), Fisher
discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors,
Neural Networks for Signal Processing IX, pages 41-48. IEEE.
[25] Murthy, Sreerama K. "Automatic construction of decision trees from data: A
multi-disciplinary survey." Data mining and knowledge discovery 2.4 (1998): 345-389.
[26] Good I. J. (1950), Probability and the Weighing of Evidence, London, Charles Griffin.
[27] Mitchell, T. (1997). Machine Learning. McGraw Hill.
[28] Kotsiantis, Sotiris B., I. Zaharakis, and P. Pintelas. "Supervised machine learning: A review
of classification techniques." (2007): 3-24.
[29] Moses, Lincoln E., David Shapiro, and Benjamin Littenberg. "Combining independent studies of
a diagnostic test into a summary ROC curve: Data-analytic approaches and some additional
considerations." Statistics in medicine 12.14 (1993): 1293-1316.
[30] Metz, Charles E., Benjamin A. Herman, and Cheryl A. Roe. "Statistical comparison of two
ROC-curve estimates obtained from partially-paired datasets." Medical Decision Making 18.1 (1998):
110-121.
[31] Agarwal, Shivani, et al. "Generalization bounds for the area under the ROC curve." Journal of
Machine Learning Research. 2005.
[32] Dreiseitl, Stephan, and Lucila Ohno-Machado. "Logistic regression and artificial neural
network classification models: a methodology review." Journal of biomedical informatics 35.5
(2002): 352-359.
[33] Davis, Jesse, and Mark Goadrich. "The relationship between Precision-Recall and ROC curves."
Proceedings of the 23rd international conference on Machine learning. ACM, 2006.
[34] Statnikov, Alexander, Lily Wang, and Constantin F. Aliferis. "A comprehensive comparison of
random forests and support vector machines for microarray-based cancer classification." BMC
bioinformatics 9.1 (2008): 319.
[35] Lee, Eunjung, et al. "Inferring pathway activity toward precise disease classification." PLoS
computational biology 4.11 (2008): e1000217.
[36] Kim, Ji-Hyun. "Estimating classification error rate: Repeated cross-validation, repeated
hold-out and bootstrap." Computational Statistics and Data Analysis 53.11 (2009): 3735-3745.
[37] Lagreid, Astrid, et al. "Predicting gene ontology biological process from temporal gene
expression patterns." Genome research 13.5 (2003): 965-979.
[38] Deselaers, Thomas, Daniel Keysers, and Hermann Ney. "Classification error rate for
quantitative evaluation of content-based image retrieval systems." Pattern Recognition, 2004. ICPR
2004. Proceedings of the 17th International Conference on. Vol. 2. IEEE, 2004.
[39] Golub, Todd R., et al. "Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring." Science 286.5439 (1999): 531-537.
[40] Wang, Qiong, et al. "Naive Bayesian classifier for rapid assignment of rRNA sequences into
the new bacterial taxonomy." Applied and environmental microbiology 73.16 (2007): 5261-5267.
[41] Kuncheva, Ludmila I., and Christopher J. Whitaker. "Measures of diversity in classifier
ensembles and their relationship with the ensemble accuracy." Machine learning 51.2 (2003):
181-207.
[42] Simon, Richard, et al. "Pitfalls in the use of DNA microarray data for diagnostic and
prognostic classification." Journal of the National Cancer Institute 95.1 (2003): 14-18.
[43] Swets, John A. "Measuring the accuracy of diagnostic systems." Science 240.4857 (1988):
1285-1293.