MACHINELEARNINGMETHODFORAUTHORSHIPATTRIBUTION By XianfengHu ADISSERTATION Submittedto MichiganStateUniversity inpartialentoftherequirements forthedegreeof AppliedMathematics|DoctorofPhilosophy 2015 ABSTRACT MACHINELEARNINGMETHODFORAUTHORSHIPATTRIBUTION By XianfengHu Broadlyspeaking,theauthorshipidenorauthorshipattributionproblemisto determinetheauthorshipofagivensamplesuchastext,paintingandsoon.Ourmain workistodevelopaneandmathe-soundapproachfortheanalysisofauthorshipof doubtedbooks. Inspiredbyvariousauthorshipattributionproblemsinthehistoryofliteratureandthe applicationofmachinelearninginthestudyofliterarystylometry,wedeveloparigorousnew methodforthemathematicalanalysisofauthorshipbytestingforaso-calledchrono-divide inwritingstyles.Ourmethodincorporatessomeofthelatestadvancesinthestudyofau- thorshipattribution,particularlytechniquesfromsupportvectormachines.Byintroducing thenotionofrelativefrequencyofwordandphrasesasfeaturerankingmetricsourmethod provestobehighlyeandrobust. ApplyingourmethodtotheclassicalChinesenovel DreamoftheRedChamber hasled toconvincingifnotirrefutableevidencethatthe80chaptersandthelast40chapters ofthebookwerewrittenbytwotauthors. AlsoapplyingourmethodtotheEnglishnovel Micro ,weareabletocontheexistence ofthechrono-divideandidentifyitslocationsothatwecantiatethecontributionof MichaelCrichtonandRichardPreston,theauthorsofthenovel. WehavealsotestedourmethodtotheotherthreeGreatClassicalNovelsinChinese.As expectednochrono-divideshavebeenfoundinthesenovels.Thisprovidesfurtherevidence oftherobustnessofourmethod. Wealsoproposedanewapproachtoauthorshipidentosolvetheopenclass problemwherethecandidategroupisnonexistentorverylarge,whichisreliablyscaled fromanewmethodwehavedevelopedforthecloseclassprobleminwhichthecandidates authorpoolissmall.Thisisattainedbyusingsupportvectormachinesandbyanalyzingthe relativefrequenciesofcommonwordsinthefunctionwordsdictionaryandmostfrequently usedwords.Thismethodscalesverynicelytotheopenclassproblemthroughanovel authorrandomizationtechnique ,whereanauthorinquestioniscomparedrepeatedlyto randomlyselectedauthors.Theauthorrandomizationtechniqueprovestobehighlyrobust ande.Usingourapproacheswehavefoundanswerstothreewellknownauthorship controversies:(1)DidRobertGalbraithwrite Cuckoo'sCalling ?(2)DidHarperLeewrite ToKillaMockingbird ordidherfriendTrumanCapotewriteit?(3)DidBillAyerswrite Obama'sautobiography DreamsFromMyFather ? ACKNOWLEDGMENTS FirstandforemostIwanttothankmyadvisorYangWang,thesmartestpersonIknow.It hasbeenanhonortobehisPh.D.student.Iwanttothankhimforhelpingmetoshapeand guidethedirectionoftheresearchinparticular.Hissupportandinsightfuldiscussions aboutresearchmademypursuitofPh.D.possible.Iappreciateallhiscontributionoftime, ideasandfundingtomakemyPh.D.productiveandstimulatingoverthepasteyears.The enthusiasmhehasforhisresearchwascontagiousandmotivationalforme,evenduringthe toughtimesinthePh.D.pursuit.Beyondhisscienguidance,hisadviceonnavigating anacademiccareerhasbeeninvaluable. IwillforeverbethankfultomyadvisorMinWuinSouthChinaUniversityofTechnology. Shewasandremainsmyexcellentexampleasasuccessfulwomanmathematician,mentor andteacher.SheisthereasonwhyIdecidedgotopursueacareerinresearch.Iamgrateful forhersupportandencouragementtopursuePh.D. IthankMarkIwenforsharinghisGFFTcode.Itishewholeadedmetousec++and showedtricksofdebug.Ireallyappreciatehispatientandtimeduringdiscussion.Thank himverymuchforhelpingmetopreparejobapplicationmaterialsandsuggestionsduring application.IalsowanttothankQiangWuforhelpingtostartmyprojectandmatlab. Ithankhimforhisinspiringdiscussionandwonderfulsuggestionsinlife.Ialsowantto thankAdityaViswanathanforhiswonderfulsuggestionsoftoolboxduringdiscussion.I wouldliketothankZhengfangZhou,AndrewChristliebandYingdaChengforservingon mydefensecommittee. Iwouldliketothankmyfellowmathstudentsfortheirfriendshipoverthepasteyears andforsharingthewonderfullifewithme,ZixuanWangandYuqiHonginparticular.The iv nightsspentplayingtennis,sharingdinner,orlaughingoverdesertmadetheexperienceof graduateschoolmorerewarding.IwouldalsothankmyfriendsLipingChenandLiangmin Zhouforhelpingmeduringstudyandteach.MyparentsJunlinandGenshui,mybrother MubiaoandsistersNian,XueandJie,havebeenincrediblysupportivethroughoutmy studentlife.ThankthemforgivingmeawarmfamilyandhappychildhoodandIdedicate thisdissertationtothem. v TABLEOFCONTENTS LISTOFTABLES .................................... viii LISTOFFIGURES ................................... x Chapter1Introduction ............................... 1 1.1AuthorshipAttribution..............................1 1.2RelatedWorks...................................4 1.3OrganizationoftheThesis............................6 Chapter2IntroductiontoMachineLearningTechniques .......... 7 2.1Introduction....................................7 2.2LinearSupportVectorMachine.........................8 2.2.1Separablecase...............................8 2.2.2Non-separablecase............................11 2.3NonlinearSupportVectormachine.......................12 2.4Multiclass.............................14 2.5CrossValidation..................................16 Chapter3StylometryAnaysis:DetectingChrono-DivideinWritingStyles18 3.1Chrono-divide...................................18 3.2Methodology...................................20 3.2.1Initialstylometricfeatureextraction..................20 3.2.2Datapreparation.............................21 3.2.3Featuresubsetselection.........................21 3.2.4Dataanalysis...............................23 3.2.5Thealgorithm...............................24 3.3CaseStudy:Analysisof DreamoftheRedChamber ..............25 3.3.1Background................................25 3.3.2SeparabilityofthechaptersbyCaoandGao..............29 3.3.3Non-separabilityofthe80chapters.................34 3.3.4Analysisofchapters81-120:stylechangeovertime..........35 3.3.5Comparisonwith ContinuedDreamoftheRedChamber ........37 3.4CaseStudy:AnalysisoftheotherthreeGreatClassicalNovels........38 3.5CaseStudy:Chrono-divideof Micro ......................39 3.6Conclusion.....................................42 Chapter4OpenClassAuthorshipIden ............... 43 4.1Introduction....................................43 4.2CloseClassProblems...............................45 4.2.1Methodology...............................45 vi 4.2.2Casestudies................................47 4.3OpenClassProblems...............................51 4.3.1Databaseanddatapreparation.....................51 4.3.2Methodology...............................52 4.3.3Casestudies................................56 4.4Conclusion.....................................59 BIBLIOGRAPHY .................................... 60 vii LISTOFTABLES Table3.1:Thefeaturesandvalidationerrorsoftheobtainedfromtwo randomlyselectedmodelingsubsets...................31 Table3.2:Relativefrequenciesofthetopranked8featuresineachofthefour GreatClassicalNovels..........................39 Table4.1: Cuckoo'sCalling :resultbytheobtainedusing onebookforeachofthetwoauthorsastrainingsamples.......48 Table4.2: Cuckoo'sCalling :resultbytheobtainedusing onebookforeachofthefourauthorsastrainingsamples,groupI..48 Table4.3: Cuckoo'sCalling :resultbytheobtainedusing onebookforeachofthefourauthorsastrainingsamples,groupII..48 Table4.4: Cuckoo'sCalling :resultbytheobtainedusing twobooksforeachofthefourauthorsastrainingsamples......49 Table4.5: ToKillAMockingbird :resultbytheobtained usingonebookforeachoftheeauthorsastrainingsamples....50 Table4.6: DreamsFromMyFather :resultbytheob- tainedusingonebookforeachofthefourauthorsastrainingsamples.50 Table4.7:Matchingrateformulti-classtrainedbyrandomlycho- senbooksbyeachauthor.........................53 Table4.8:Matchingrateformulti-classtrainedbyrandomlycho- sensamplesbyeachauthor.......................53 Table4.9:Matchingrateformulti-classbyrandomlychosenbooks byeachauthor..............................55 Table4.10:Matchingrateformulti-classwithrandomlychosen samplesbyeachauthor..........................55 Table4.11: Cuckoo'sCalling: mastersuspectedauthorsandtheaveragematch- ingrate..................................57 Table4.12: ToKillaMockingbird: theaveragematchingrate...........58 viii Table4.13: DreamsFromMyFather: theaveragematchingrate..........59 ix LISTOFFIGURES Figure2.1:Hyperplanes...............................9 Figure2.2:Optimalhyperplane...........................9 Figure2.3:Linearlynon-separabledataset.....................11 Figure2.4:LinearSVMinnewspace.......................13 Figure2.5:3-class(a)one-vs-one, H ij separatesclassiandclass j;(b)one-vs-rest, H i separatesclassifromotherclasses.......15 Figure2.6:Thek-foldcrossvalidation.......................17 Figure3.1:Experiment1:(a)Meancrossvalidationerrorrate;(b)Valuesof SVMonchapters60-90.....................34 Figure3.2:Experiment2:(a)Meancrossvalidationerrorrate;(b)Valuesof SVMonchapters31-50.Notethereisnochrono-divide...35 Figure3.3:Experiment3:(a)Meancrossvalidationerrorrate;(b)Valuesof SVMonchapters96-105,whichcorrespondtothesamples 31-50inall80samples.Notetwosamplescomefromonechapterin thisexperiment.............................37 Figure3.4:Distancesbetweenthe80chaptersoftheCheng-Gaoversion, thelast40chaptersoftheCheng-Gaoversion,and30chaptersof ContinuedDreamoftheRedChamber ..................38 Figure3.5:resultsonthetestingsamplesoftheotherthreeclassical novels:(a) RomanceoftheThreeKingdoms ;(b) WaterMargin ;(c) JourneytotheWest. ..........................40 Figure3.6: Micro :averagevaluesofthe7...............41 Figure3.7: Micro :resultbytheobtainedusingthe 22partsandthelast22partsofthebookastrainingsamples....42 x Chapter1 Introduction 1.1AuthorshipAttribution DidFrancisBaconorChristopherMarlowewritesomeoftheplaysthatwereattributed toWilliamShakespeare?DidCaoXueqinwritethelastfortychaptersof Dreamsofthe RedChamber ,oneofthegreatestmasterpiecesinthehistoryofChineseliterature?Who wrotethe Federalist papers?DidBillAyerswriteObama'sautobiography DreamsFromMy Father ?Therehavebeennoshortagesofauthorshipcontroversiesthroughoutthehistory, andthesearejustafewofthem.Thesequestionshaveoftengenerateddebates amonghistoriansandliteraryscholars.Historicallyconnoisseurship,disciplinaryknowledge andbackgroundhistoryhavebeencentralinsuchdebates.Inrecentyears,however,the \unemotionalapproach"usingquantitativeanalysisbasedonmathematics,statisticsand computationhasgainedprominenceinthestudyofauthorshipattribution. Theauthorshipattributionistodeterminetheauthorshipofworkslikebooks,visual artsandsoonwhoseauthorsareunknown.Therearetwoclassesofproblemsinauthorship attribution[21]ingeneral.Themostcommononeisthe closeclass ,whereitisknownthat thesampleoftextisbyoneoftheauthorsfromagivenset,usuallyaverysmallsetwithtwo tothreeauthors.Forexample, Federalist papersmentionedearlierisatypicalclosedclass problem,wheretheauthorsaretoAlexanderHamilton,JamesMadisonandJohn Jay.Farmorechallengingisthe openclass problem,wherethesampletextmaycomefrom 1 afarlargersetofauthorsorevennosuspectauthor.Anexampleofanopenclassproblem istodeterminetheauthorshipofbookswrittenanonymouslyorunderpseudonyms,e.g. ThePrimaryColor or ImperialHubris:WhytheWestisLosingtheWaronTerror .The studyofauthorshipattributionhasseenrapidgrowthinrecentyears.Still,thereremains manychallengesevenfortheclosedclassproblem.Accuracyisoneofthemostprominent concerns.Fortheopenclassproblemthechallengeisfargreater,andsofartherehavebeen fewattemptsandnoneoftheknowntechniquescanbereliablyscaledtohandletheopen classproblem. Westudybothopenclassandclosedclassproblemsinauthorshipattribution.The partofthisthesisfocusonanimportantauthorshipquestioninclosedclassproblem, whichwecallchrono-divide,namelytodeterminewhetherabodyoftextiswrittenbya singleauthor.Suchastudyisvaluableinanumberofways.Forexample,itiswidely speculatedamongtheShakespeareanscholarsthatsomeoftheShakespeareplayswerein factcollaborativelywrittenbyShakespeareandothers,e.g.MiddletonandFletcher.Itisof scholarlyinteresttonotonlyitbutalsotooutexactlywhichpartsoftheplays werewrittenbyplaywrightsotherthanShakespeare.Therewerealsohistoricalcontroversies involvingsuspectedfraud,asinthecaseof DreamoftheRedChamber ,whereaparticular bookorsequelattributedtocertainwellknownauthormightinfactbeperpetratedby someoneelse.Arelatedpracticewasforawellknownandauthortowriteonly thefewchaptersofabookandthenpassitontoaghostwriter.Thereareclear bbothethicallyandscholarlytodetectfraudsandghostwritings.Moderndayssee anexplosionofcoauthoredbooksandarticles.Itwouldbeinterestingtodetectstylistic inconsistenciesamongpartsofsuchbooks.Asweshallsee,mathematicscanplayacentral roleinthestudyofauthorship,leadingtotherapidgrowthofthe stylometry . 2 Wewillintroduceourmethodonthedetectionofchrono-divideinwritingstyleswith agivenbodyoftextinthispartandusetwocasestudiestoshowwaysitcanbedone, andtoillustratetheenessandrobustnessofourmethodsbycomparingthestudyof casesjustbyoneauthor.ThecasestudyistheclassicalChinesenovel Dreamofthe RedChamber .Thecontroversysurroundingthebookwaswellknownandintenselydebated intheChineseliterarycircleforover250years.Thesecondcasestudyisthebook Micro writtenbyMichaelCrichtonandRichardPreston.TheoneauthorcaseswillbeChina's otherthreeclassicalnovels, ThreeKingdoms , JourneytotheWest and TheWaterMargin . Theresearchforthecasestudyhasbeendoneinourwork[39],andmuchofwhatwe presenthereisreproducedfrom[39].Allmaterialsinthesecondcasestudyarenew. Thesecondpartofthisthesiswillbethestudyofopenprobleminauthorshipattribution, whichistodetecttheauthorofabodyoftextwhoseunknowntrueauthorisamongalarge setofcandidateauthors.Inmanycasesthereisn'tevenasuspectsothesetofcandidate authorsistheentiredatabase.Tosolveanopenclassidenproblem,wepropose anewmethodforsolvingthecloseclassproblemwherethenumberofcandidateauthorof atextissmall(e.g. 8).Wethenreliablyscaleittosolvetheopenclassproblemthrough theauthorrandomizationtechnique.Aslongasthetrueauthorisinourdatabasewecan reliablyidentifytheauthor.Furthermore,ifthetrueauthorisnotinthedatabasewecan alsoreliablyruleoutanyauthorinthedatabase. Wewilldescribeourtechniquesandthealgorithmofourapproach,anddemonstratethe enessthroughthreecasestudies:(1) Cuckoo'sCalling byRobertGalbraith,which turnedouttobethepseudonymforJ.K.Rowling;(2) ToKillaMockingbird byHarperLee in1960,whichsomespeculatedtobewrittenbyherlongtimefriendTrumanCapote;(3) DreamsFromMyFather byPresidentObama,whoseauthorshipisindoubteversinceBill 3 Ayers\confessed"thathewasthetrueauthor. 1.2RelatedWorks Theproblemofstylequanandauthorshipattributionintheliteraturegoesatleast asfarbackas1854bytheEnglishmathematicianAugustusDeMorgan[11],whoinaletter toaclergymanonthesubjectofGospelauthorship,suggestedthatthelengthsofwords mightbeusedtotiateauthors.In1897theterm stylometry wascoinedbythe historianofphilosophy,WincentyLutaslowski,asacatch-allforacollectionofstatistical techniquesappliedtoquestionsofauthorshipandevolutionofstyleintheliteraryarts(see e.g.[29]).Today,literarystylometryisawelldevelopedandhighlyinterdisciplinaryresearch areathatdrawsextensivelyfromanumberofdisciplinessuchasmathematicsandstatistics, literatureandlinguistics,computerscience,informationtheoryandothers.Itisacentral areaofresearchinstatisticallearning(seee.g.[18]). Themainassumptionofauthorshipattributionisthateverypersonhashisorherown styleofspeaking,writingandpainting.Thesestylescanbequanusingsomesocalled \authorialts"[21]orfeatures.Overthehistoryofauthorshipattributionstudy manytkindsoffeatureshavebeenproposedforauthorshipidenThegoal hasalwaysbeentocertainfeatureorfeaturesthatwilltiateoneauthorfrom another.Yule[41]suggestedthelengthofsentenceasafeatureofwritingstyleandapplied ittotwocasesofdisputedauthorships.Apopularclassictechniqueforstylometricanaly- sisofauthorshipinvolvescomparingfrequenciesoftheso-called functionwords ,aclassof wordsthatingeneralhavelittlecontentmeaning,butinsteadservetoexpressgrammatical relationshipswithotherwordswithinasentence.Thetpercentagesofnouns,verbs, 4 adjectives,adverbsandotherpartsofspeechcanalsobeusefulcharactersofwritingstyle, see[1,5,6].Ellegard[14]determinedtheauthorshipoftheJuniusLettersusingfunction wordsfrequenciesandsynonympairs(e.g.ratioof\big"vs\large").MostellerandWal- lace[28]focusedonthedistributionof30functionwordsfromthevariousFederalistpapers andanalyzedtheauthorshipofthem.Popularearlierfeaturesalsoincludeaveragelength ofwords,cumulativesums(Qsums)[3,4,15,27],n-grams.In[22]theauthorspresented amethodforcomputer-assistedauthorshipattributionbasedoncharacter-leveln-gramau- thorsee[21]and[33]foracomprehensivereview.Whilesomeofthesefeatures aree,manyhaveshowntohavelessdiscriminatingpower.Weshallnotdiscussthe advantagesoftfeatureshere.Insteadwereferallinterestedreaderstotheexcel- lentsurveyarticlesbyJuola[21]andStamatatos[33]foracomprehensivesummaryofthe prosandconsoftfeaturesinstylomteryanalysis.Ithasbeennoted,however,that frequenciesofcontentindependentfunctionwordsareamongthemostefeatures[7]. Theofliterarystylometryhasseenimpressiveadvancesinrecentyears,withmore andmorenewandsophisticatedmathematicaltechniquesaswellassoftwaresbeingdevel- oped.Machinelearninghasproventobeapowerfultoolintioninlastdecadeand researchersstartedtouseitinauthorshipattribution.Koppel[26]usedtheexponential Gradient(EG)algorithmofKivinenandWarmuthtoalinearseparatorbetweenmale- authoredandfemale-authoreddocuments.Burrows[7][8]appliedprincipalcomponents analysis(PCA)onwordfrequenciestoauthorshipattribution.Kjell[24,23,25]analyzed authorshipusingneuralnetworkBayesianandnearestneighborclassi- [12]appliedsupportvectormachinetoidentifyauthorsofGermannewspaperarticles. Jockers[20]evencomparedtheperformanceofemethods(Delta,k-nearestneighbors, thesupportvectormachine,nearestshrunkencentroids,regularizeddiscriminantanalysis) 5 intheclassicauthorshipattributionprobleminvolvingtheFederalistPapers. 1.3OrganizationoftheThesis Therestofthesisisorganizedasfollows:wewillpresentanoverviewofmachinelearning. WewillintroduceclassicSupportVectorMachineinparticularanditsgeneralizationto multi-classinchapter2.Chapter3willintroduceourstudyonthestylometry analysis,whichfocusondetectingchrono-divideinwritingstyles.Wewilldescribeour algorithmindetailandanalyzeexperimentalresults.Inchapter4,wewillshowourmethod forauthorshipattributionproblemswherethetrueauthorisknowninasmallgroupand extendthismethodtoopenclassproblemswherethecandidateauthorsofatextisinlarge setorevennosuchset.Wewillreporttheaccuracyofourmethodforcloseproblemsand showtherobustnessofourmethodforopenclassproblemsthroughextensiveexperiment andthetestingresultsofthreecasesinlastpartofthischapter. 6 Chapter2 IntroductiontoMachineLearning Techniques 2.1Introduction Themainobjectiveofmachinelearning,aofcomputerscience,istolearnthe patternsofthegivendatawhichisrepresentedbyso-calledfeaturesordescriptors.The underlyingprincipleofmachinelearningisthatthedatafromreallifeisnotgenerated randomly,therealwaysexistpatternsinit,althoughwedon'tknowexactlywhattheyare. Machinelearningcanexplorethepatternsofthegiventrainingdataandmakepredictionfor thenewdataset.Machinelearningiswidelyusedinallkindsofarealikefacerecognition, pagerankingandspeechrecognition.Therearethreetypesofproblemsinmachinelearning ingeneral: Supervisedmachinelearningwheretheoutputofthetrainingsampleisgiven,andthe goalistopredicttheoutputsofnewsamples; Unsupervisedmachinelearningwherenolabelisgivenforthetrainingsample,and thegoalistothestructureofthetrainingdata; Reinforcementwherethegoalisgivenandthetrainingdataisdynamic. 7 Generallyspeaking,machinelearningistheprocessofoptimization.Untilnow,many optimizationalgorithmshavebeenproposedforttypeofprobleminmachinelearn- ing,suchasNaiveBayes,NearestNeighbors,SupportVectorMachine(SVM)andsoon.In thisthesis,IwillfocusonSVM,whichisproposedbyVapnik[35]in1970s.Itisthemost popularmachinelearningalgorithmssincethen.Itisasupervisedlearningmethodwhichis usedinbothfeatureselectionandpatternrecognitions.SVMhasbeenhighlydevelopedin boththeoryandalgorithmandfamousforitscapabilityofdealinglargedataset. 2.2LinearSupportVectorMachine 2.2.1Separablecase Inthebinarysetting,giventhetrainingdataset f ( x 1 ;y 1 ) ; ( x 2 ;y 2 ) ;:::; ( x l ;y l ) g ;x i 2 R n ;y i 2f 1 ; 1 g ;i =1 ;:::;l wedenote I = f x i j y i =1 g II = f x i j y i = 1 g Wesaythat I and II arelinearseparableifthereisaunitvector w andconstant b suchthat x > i w + b> 1 8 x i 2 I x > i w + b< 1 8 x i 2 II; (2.1) 8 i.ethedatacanbeseparatedbythehyperplane x > w + b =0 Figure2.1:Hyperplanes Figure2.2:Optimalhyperplane Aswecanseein2.1,therewillbemanyhyperplaneswhichcanseparate thetwogroups.Whichoneshouldwechoose?Tobestseparatethetwogroups,weneed tomaximizethemargin ˆ (equation2.2)asin2.2,whichisthedistancebetweenthe 9 hyperplane b 11 and b 12 .Thehyperplanewhichseparatesthetwogroupsofthetrainingdata setandhasthemaximalmarginiscalledthe OptimalHyperplane anditisunique(see[36]). ˆ = 2 k w k 2 (2.2) Hence,theoptimalhyperplanecanbeconstructedasthefollowingPrimaloptimization problem: min w 2 H 1 2 k w k 2 s.t. y i ( x > i w + b ) 1 8 i 2f 1 ;:::;l g (2.3) Tosolvethisquadraticoptimizationproblem,weneedtothesaddlepointofthe Lagrangefunction: L ( w;b; )= 1 2 w > w l X i =1 i [ y i ( x > i w + b ) 1] ; (2.4) where i 0aretheLagrangemultipliers. AccordingtotheFermattheorem,theminimumizerofthisfunctionhastosatisfythe followingcondition: @L ( w ) @w = w l P i i y i x i =0 @L ( w ) @b = l P i i y i =0 : (2.5) Fromequation2.5,weget: w = l P i =1 i y i x i l P i =1 i y i =0 (2.6) Plugequation2.6intoequation2.4,wehavechangedtheprimalproblemtoits dual 10 problem: max W ( )= P i i 1 2 P i;j i j y i y j x > i x j s.t. l P i =1 i y i =0 i 0 8 i 2f 1 ;:::;l g (2.7) Thus,bysolvingequation2.7,wecanobtain i .Meanwhileweget w byequation2.6and b = 1 2 (min x i 2 I x > i w +max x i 2 II x > i w ).Theinstanceswith i > 0arecalled SupportVectors ,which areclosesttothehyperplanes b 11 and b 12 .Finally,weobtainthedecisionfunction: f ( x )= w > x + b andthesignof f ( x )isthepredictedlabelofinstance x . 2.2.2Non-separablecase Figure2.3:Linearlynon-separabledataset Inthepreviouspart,wehaveshowedhowtotheoptimalhyperplaneforlinearsep- arabledata,butinreallife,mostofdataareeithernotlinearlyseparable(e.g2.3),or onlyseparablewithverysmallmarginsbecauseofthepresenceofoutliers.Wewillgeneral- 11 izetheconceptofoptimalhyperplaneforthenon-separablecasebyrelaxingtheconstraints 2.1inthispart.Byintroducingnonnegative slack variables ˘ 1 ;:::;˘ l intoconstraints2.1, wecanconstructthehyperplanewiththesmallesterrors.Thus,theoptimizationproblem becomes: min w 2 H;˘ 2 R l 1 2 k w k 2 + C l P i =1 ˘ i s.t. y i ( x > i w + b ) 1 ˘ i ˘ i 0 8 i 2f 1 ;:::;l g (2.8) where C> 0istheparameterthatspthepenaltyofWhenweset C tobelarge,theaboveproblemisthesameasthelinearcase.Similartothelinear separablecase,weneedtothesaddlepointofthefollowingLagrangianinordertosolve theaboveoptimizationproblem: max W ( )= P i i 1 2 P i;j i j y i y j x > i x j s.t. l P i =1 i y i =0 0 i C 8 i 2f 1 ;:::;l g (2.9) Thisformulaissimilartothelinearcaseexceptfortheupperboundof i anditcanbe solvedinthesameway. 2.3NonlinearSupportVectormachine Untilnow,wehaveonlydiscussedhowtouseSVMstodealwithlinearlyseparabledatasets. However,forthedatain2.4,whichisnotlinearseparable,wecanextendthelinear SVMtomoregeneralformsusing\kernel"functionsandseparatethedatawithlinearSVM inanewspace.ThebasicideaofnonlinearSVMistomaptheinputvectors f x i g intothe 12 Figure2.4:LinearSVMinnewspace highdimensionalspacethroughnonlinearmapping ˚ .Inthenewspace,wecanconstruct theoptimalhyperplaneasinprevioussection,butnonlinearintheinputspace. Insteadofcomputinginnerproduct h x i ;x j i ,wenowcalculate K ( x i ;x j )= h ˚ ( x i ) ;˚ ( x j ) i , whichiscalledakernelfunction.Thustheoptimizationproblemis: max W ( )= P i i 1 2 P i;j i j y i y j K ( x i ;x j ) s.t. l P i =1 i y i =0 0 i C 8 i 2f 1 ;:::;l g (2.10) Similarly,thedecisionfunctionwillbe: f ( x )= w > ˚ ( x )+ b = l X i =1 i y i K ( x;x i )+ b (2.11) andthesignof f ( x )isthethelabelofx. Remark1 Akernelfunction K needstosatisfyMercer'stheorem,whichrequiresthatthe matrix K =( K ( x i ;x j )) l i;j =1 besymmetricandpositiveThefollowingthree typesaretypical: 1.LinearKernel: K ( x;y )= h x;y i ; 13 2.PolynomialKernel: K ( x;y )=( h x;y i ) d ; 3.RBFKernel: K ( x;y )=exp( k x y k 2 2 ˙ 2 ) . tkernelsareusedtodealwithdttypesofclascationproblems.Sofor agiventrainingdatasetwithoutanyknowledgeofitsgeometricstructure,itisbetterto trytmodelsbychoosingerentkernelsandtheirparameterscarefully.Weusually dividethetrainingdataintotwoparts,oneofwhichisusedtotrainthemodelandtheother one(calledvalidationdataset)totestthemodel.Wechoosethestablemodelwhichyields thehighestvalidationaccuracy.Forthedataweuseinourstudyofauthorshipiden linearSVMprovestobethesimplestaswellasthebestkernelcomparedtotheothertwo typesofkernels. 2.4Multiclass Intheprevioussection,wehavediscussedtheclassicalSVMwhichsolvesbinary problems.Butinmostscenarios,therearemultiplecategoriesforthetrainingset,i.e. f ( x 1 ;y 1 ) ; ( x 2 ;y 2 ) ;:::; ( x l ;y l ) g ;x i 2 R n ;y i 2f 1 ; 2 ;:::;N g ;i =1 ;:::;l howtoextendthebinarySVMtothemulti-classproblem? Thebasicandmoststraightforwardideaistodecomposetheproblemintoaseriesof binaryproblems,andthencombinetheresultstoobtainthemulticlassclassi- Therearemanytechniquesthathavebeendevelopedsofar,suchasone-vs-rest, one-vs-one,decisiondirectedacyclicgraph(DAG)in[34],errorcorrectingcodein[13],single machinein[38]andsoon.Crammer[10]andVapnik[36]evenconsideralltheclassesto- 14 (a)(b) Figure2.5:3-class(a)one-vs-one, H ij separatesclassiandclassj;(b)one- vs-rest, H i separatesclassifromotherclasses. getherinoneoptimizationformulation,butitiscomputationallyexpensiveandnotwidely used.Inthemeantime,theone-vs-restandone-vs-onearesosimpleandethatthey havealwaysbeenthemostpopulartechniquesforthemulticlassproblems. Herewewilldiscussone-vs-restandone-vs-oneindetail. One-vs-Rest istobuild N tbinary f f i g N i =1 ,eachofwhich distinguishesoneclassfromtheremaining N 1classestogetherviewedasasingleclass. Toconstructthe i -thbinaryweletthesamplesfromthe i -thclassbethepositive samples,andothersamplesfromtheother N 1classesbethenegativesamples.Fortest sample x ,weattributeittotheclass j 0 where j 0 =argmax f i ( x ).Theadvantageofthis methodisthatwejustneedtosolve N optimizationproblemstosolve N class problem.Buttherearethreemaindisadvantagesusingthismethod: thereisnoboundonthegeneralizationerror; thescaleofthedecisionvaluesmaybequitetint thedataisusuallyextremelyimbalanced,wherethenegativesampledominateseach 15 One-vs-One istobuildonecforeachpairofclasses,whichyields N ( N 1) = 2 f f ij g N i;j =1 (note f ij = f ji )intotal.Each f ij istodistinguish classes i and j ,anditcanbeviewedasthescoreofa\match"betweenclass i andclass j .A sample x isattributedtoclass i 0 throughsomevotingscheme,forexamplewemaychoose i 0 =argmax P j f ij ( x ),or i 0 betheclassthathaswonthemostnumberof\matches", i 0 =argmax P j sign( f ij ( x )).Althoughthismethodhastheadvantageofmorebalanced trainingsamplesizesforeachitalsohassomedisadvantages: Thereisnoboundsonthegeneralizationerror; computationallyitisexpensivebecauseitrequires N ( N 1) = 2 itcaneasilycauseov Bothstrategieshavetheiradvantagesanddisadvantages,soitishardtoconcludewhich oneisalwaysbetter.Theperformanceofone-vs-restandone-vs-onehasbeenobserved tobecomparableingeneral[19],sowechoosetheone-vs-reststrategyconsideringthe computationalcost.Becausetheobjectiveistheoverallaccuracy,wechooseonesingle parametersettomaximizeoverallaccuracyratherthanselectingatparameterset foreachThelatterispronetoov,inwhichcasethetrainingaccuracyis highbuttestingaccuracyisusuallylow. 2.5CrossValidation Crossvalidationisastrategytoassesstheaccuracyofamodelwhenthetestsetisnot available.Theideaofoneroundofcrossvalidationistoseparatethetrainingdataintotwo parts,oneofwhichisusedtotrainthemodel,theremainingpart(calledvalidationdata) 16 Figure2.6:Thek-foldcrossvalidation isusedtopredicttheperformanceofthemodel.In k -foldcrossvalidation(see2.6),the trainingsetispartitionedinto k parts,inwhich k 1partsareusedtotrainthemodel,and theremainingpartisusedtotesthowaccuratethemodelis.Eachpartworksastestset onceandonlyonce,sotherearekmodelsconstructedin k -foldcrossvalidation. Theadvantageof k -foldcrossvalidationisthateachsampletesthasoneandonlyone chancetotestthemodel.Whenthetrainingdatasetissmallorthesetofparameteris verylarge,onemodelcaneasilyleadtoovCrossvalidationcantlyavoid thisproblemandhelptoselectabetterandstablemodel.However,crossvalidationusually requiresadditionalcomputationalcost.Inthisthesis,weusea5-foldcrossvalidationinthe trainingprocessandapplythepluralityvotingschemetolabelthetestsamples. 17 Chapter3 StylometryAnaysis:Detecting Chrono-DivideinWritingStyles 3.1Chrono-divide Themainideabehindstatisticallyorcomputationally-supportedauthorshipattributionis thatbymeasuringsometextualfeatureswecandistinguishbetweentextswrittenbyt authors.Nearlyathousandtmeasuresincludingsentencelength,wordlength,word frequencies,characterfrequencies,andvocabularyrichnessfunctionshadbeenproposedthus far[32]overtheyears.Someofthesemeasures,suchasfrequenciesoffunctionwords,have provenewhileothers,suchaslengthofwords,haveprovenlesse[21].The ofliterarystylometryhasseenimpressiveadvancesovertheyears,andhasbecomean increasinglyimportantresearchinthedigitalagewiththeexplosionoftextsonline. Thispartfocusesonaparticularclassofauthorshipcontroversies,inwhichthereisa suspectedchangeofauthorshipatsomepointofabook.Inotherwords,onesuspectsthatthe X chaptersofabookwerewrittenbyoneauthorwhiletheremaining Y chapterswere writtenbyanother.Clearly,theauthorshipcontroversyfor DreamoftheRedChamber falls intothiscategory.Sincenotwoauthorshaveexactlythesamewritingstyle,nomatterhow similartheymightbe,abookwritteninsuchafashionwouldhaveastylisticdiscontinuity goingfromChapter X toChapter X +1.Ifwecanquantifythestylesofthetwoauthorsbya 18 stylometricfunction S ( n )(aclaser)where n denoteschapters,orchronologicallyordered samples,ofthebookinquestion,thisstylisticdiscontinuitywillappearasadividingpoint inthestylometricfunction S ( n )goingfrom n = X to n = X +1.Becausethesamplesare orderedbytime,weshallcallthisdivideinthestylometricfunction S ( n )a chrono-dividein style ,orsimplyachrono-divide. Theunderlyingprincipleofourstudyisthatifabookisinfactwrittenbytwoauthors AandB,thenthereshouldexistagroupoffeaturesthatcharacterizetheoftheir respectivestyles.Thesefeatureswillleadtoastylometricfunctionthatseparatethebook intotwotclasses.Intherestofthispartweshallusethemoreconventionalterm forsuchastylometricfunction.Thefoundationalprincipleforliterarystylometry isbuiltaroundsuchSupposethatachrono-divideinstyleexists.Then anewillshowabreakpointsomewhereinthemiddleofthebook,before andafterwhichthegivespositivevaluesandnegativevalues,respectively.Thusin analyzingabooksuspectedtobewrittenbytwoauthorswithachrono-divide,onecanlook forathatgivesrisetosuchabreakpoint.Theexistenceofsuchawill providestrongsupportforthetwo-authorhypothesis.Conversely,ifsuchacannot befoundthenwecancontlyrejectthetwo-authorwithachrono-dividehypothesis. Weusefunctioncharactersandwordstobuildandselectagroupofstylometricfeatures havingthehighestdiscriminativepower,andfromwhichweconstructourWe shalldetailourmethodinthefollowingsubsections. 19 3.2Methodology 3.2.1Initialstylometricfeatureextraction Supposethebookinquestionissuspectedtobewrittenbytwoauthors.Forsimplicitywe shallcallthepartwrittenbyauthorA PartA andthepartwrittenbyauthorB PartB .In manycases,suchaswith DreamoftheRedChamber ,bothPartAandpartBareknown. Insomecases,theyarenotpreciselyknownlike Micro .However,forbookssuspectedto haveachrono-dividefromauthorshipchange,thereisusuallyagoodestimateforwherethe divideis.TypicallythefewchapterscanbetlyattributedtoAandthelast fewchapterstoB. Webeginbychoosingafeaturesetconsistingofthekindsoffeaturesthatmightbeused consistentlybyasingleauthoroveravarietyofwritings.Typically,thesefeaturesinclude thefrequenciesofwords(orcharactersforbooksinChinese),phrases,meanandvariation ofsentencelength,andfrequenciesofdirectspeechesandexclamations,andothers.Inour analysis,togetabetterunderstandingofanauthor'swritingstyle,wethemost frequentlyusedcharactersandwordsinthebook,e.g.wewouldthe500mostfrequently usedcharactersinthewholebook,fromwhichwepickoutonly,say, n functioncharacters. Wechoose m words(combinationsofcharacters)amongthe300mostfrequentlyusedwords inthesameway.Animportantpointisthatbyselectingonlyfunctioncharactersandwords weobtainaselectionofcharactersandwordsthatare contentindependent .Thisleadsto aninitialsetoffeaturesconsistingofthefrequenciesofthe n charactersandthe m words, plusthemeanandvarianceofsentencelengthaswellasthefrequenciesofdirectspeeches andexclamations.Thesefeatureswillbecomputedovergivensampletextsofthebook (e.g.chapters).Wenormalizeeachsampletextinthefollowingway:setthemedianofthe 20 meanandvariationofsentencelengthandthefrequenciesofdirectspeeches,exclamations, n charactersand m wordsineachworkofAandBtobe1.Foreachsample,wenowget n + m +4features. 3.2.2Datapreparation Havingconstructedtheappropriatefeaturevectors,webuildadistinguishingmodelthrough amachinelearningalgorithm.Todosorequirescarefuldatapreparation.Sinceweusually haveinhandonlylimitedsampleswhilethenumberoffeatureswillbeverylarge,buildinga modeldirectlyontheentirebookwilleasilyleadtoovToovercometheov problem,weusethestandardtechniqueofseparatingthewholedataintosamplesconsisting oftrainingdataandtestdata.Ourmodelwillbeestablishedbasedonlyonthetrainingdata whileitsperformanceistestedovertheindependenttestdata.IfweknowPartAandPart Balreadythenasubsetofeachcanbedesignatedastrainingdata.Forbookssuspectedto haveachrono-divideinstyle,thetrainingdatawillconsistofthefewchaptersandthe lastfewchapters.Therestofthebookwillbeusedastestdata. Inordertoobtainmoretrainingsetsandtestingsetsweshallchunkthebookinquestion intosmallerpiecesofsampletextsofrelativelyuniformsizeandstyle.Inallthebookswe havestudied,wehavekeptthesampletextstobeatleast1000characterslong.Inthecase of DreamoftheRedChamber eachsampletextisachapter. 3.2.3Featuresubsetselection Whenwebuildauthorshipanalysismodelusingthetrainingdataonly,wedonotuseallthe features( n + m +4features).Insteadwestartoutwithallofthem,buteventuallyselecta 21 subsetoffeaturesthatachievesthehighestdiscriminativepowers.Featuresubsetselection hasbeenwellunderstoodforhighdimensionaldataanalysisinthemachinelearningcontext. First,thenumberofdiscriminativefeaturesmaybesmallbecausethenumberoffeatures anauthorusesinaconsistentlytwayfromothersisusuallynotverybig.Moreover, thecanperformverypoorlyiftoomanyirrelevantfeaturesareincludedintothe model.InthispaperwewilluseSupportVectorMachinesRecursiveFeatureElimination (SVM-RFE)introducedin[17]torealizefeatureselection. SVM-RFEisafeaturerankingmethod.GivenasetofsampleswecanuselinearSVMto buildalinearItrankstheimportanceofthefeaturesaccordingtotheirweights.As mentionedabove,becauseoflargefeaturesizeandsmallsamplesize,themightnot berobust.Inaddition,thehighcorrelationbetweenthefeaturesmayresultinsmallweights forrelevantfeatures.ThustherankingbySVMdirectlymaybeinaccurate.Inorder totheranking,theleastimportantfeatureisremovedandthelinearSVM isretrained.Thisnewprovidesarankingfortheremainingfeatures.The processisthenrepeateduntiltherankingofallfeaturesareThisistheSVM-RFE methodintroducedin[17].TheideaunderlyingSVM-RFEisthatineachrepeat,although theoverallrankingmaybepoor,theleastimportantfeatureisveryunlikelyarelevantone. Byiterativelyeliminatingtheleastimportantfeaturesthenewwillbecomemore andmorereliableandhencewillprovidebetterandbetterranking.Intheapplicationof geneexpressiondataanalysisSVM-RFEhasbeenproventobesubstantiallysuperiortothe SVMdirectrankingwithoutRFE. HoweveringeneralSVM-RFEisnotstableundertheperturbationofsamples.Asmall changeinsamplesmayresultinverytfeatureranking.Therearetwopossible reasons.Oneisthatthehighlycorrelatedvariablesaretoosensitiveandmayberankedin 22 tordersbytAnotheristhat,duetotherandomness,somesubset ofsamplesmightbesingularinthesensethattheyarelessrepresentativeforthewholedata structure.AsaresulttheSVMareovandthefeaturerankingbySVM- RFEisthereforeunreliable.Thesituationislessharmfulforperformance whilethesecondisvital.Toovercomethisphenomenonandguaranteethestabilityofthe ranking,weuseapseudo-aggregationtechnique.Werandomlychooseasubsetoftraining samplestorunSVM-RFEtoselectthetopimportantfeatures.Thisprocessisrepeated tensorhundredstimesandonlythosefeaturesthatappearimportantveryfrequentlyare deemedastrulyimportantones.Thisremovestherandomnessandresultsinamuchmore reliableranking. Withthisrankingoffeatures,wecanconcludewhichstatisticsareusefulforquantifying thewritingstyle.Weusecrossvalidationtoselectthenumberoffeaturesincludedinthe model.Thisgroupoffeaturesisastableandmostdiscriminativesubset offeatures.Aisbuilttoclassifythetestdata. 3.2.4Dataanalysis Thewehavebuiltisusedtoanalyzetheauthorshipquestion.Weexaminethe discriminativepoweroftheonthetrainingdata.Ifitcannotevenreliablyclassify thetrainingdatawecanconvincinglyrejectthetwo-authorhypothesis.Evenifitcanthe tellingstorywillbewhetheritcanclassify,ordetectachrono-divide,fromthetestdata. Ifitfailsthenagainweshouldrejectthetwo-authorhypothesis.Ontheotherhand,ifthe thetrainingdata,anditcanalsoclassifythetestdataaccuratelyordetect aclearchrono-divide,wecanthenconvincinglyconcludethatthebookdoescontaintwo twritingstylesandcanthereforebetlyattributedtotwotauthors. 23 Moreover,thefeaturesubsetandtheclaserdescribetheofthetwoauthors' writingstyles. 3.2.5Thealgorithm Inthefollowingwesummarizetheprocessofouralgorithm: 1.Initializethedata(thebook),whichcontainspartsAandBsuspectedtobewritten bytwotauthors. 2.SplitpartAandpartBintomanysectionsandextractthefeaturesforeachsectionas describedinsection3.2.1.Thisformsthewholedataset D ,containing D A and D B . 3.Chooseaportion(e.g.20%-30%)of D A and D B respectivelytoformthetestdataset andleavetheremainingasthetrainingdataset.Thetestdatawillnotbeuseduntil themodelisbuilt. 4.Randomlychooseasubsetfromthetrainingdataasmodelingdataandtherest(again 20%-30%)asthevalidationdata.RunSVM-RFEonthemodelingdataandusethe validationdatatodeterminealltheparametersused.Thisprovidesarankingofall the n + m +4featuresextractedinstep2. 5.For d rangefrom1to n + m +4,buildausingonlythetop d featuresand evaluatetheirperformanceonthevalidationdata.Thebestmodelistheonewith minimalvalidationerrorandminimalnumberoftopfeatures.Thefeaturesubsetof thisbestmodelisrecorded. 6.Repeat T timesstep4andstep5toobtain T bestmodelsand T subsetsofcorre- spondingimportantfeatures.Werecommend T tobelargerthan50.Rankallthe 24 featuresinthesesubsetsaccordingtotheirappearancefrequency.Denote N asthe totalnumberoffeaturesincluded. 7.For d =1 ;:::;N ,usingcrossvalidationtoselectthenumberoffeaturesthatshould beincludedintheDenoteitby d .Notewerequireboththecross validationerrorandthenumberoffeaturestobeassmall. 8.Retrainthemodelusingthewholetrainingsetbasedonthistop d importantfeatures. 9.Usingthetoclassifythetestdata.Drawtheconclusionaccordingtothe performance. Sinceourrankingprocessinvolvesaggregationoflargenumberofmodelsthataretrained usingSVM-RFEbasedontsubsetsofthesamedatasource,werefertoourapproach aspseudo-aggregationSVM-RFEmethod. 3.3CaseStudy:Analysisof DreamoftheRedCham- ber 3.3.1Background DreamoftheRedChamber byCaoXueqinisoneofChina'sFourGreatClassicalNovels.For morethanoneandahalfcenturiesithasbeenwidelyacknowledgedasthegreatestliterary masterpieceeverwritteninthehistoryofChineseliterature.Thenovelisremarkableforits vividlydetaileddescriptionsoflifeofChinainthe18thcenturyduringtheQingDynastyand thepsychologicalofitslargecastofcharacters.Thereisavastliteraturein Redology , atermdevotedexclusivelytothestudyof DreamoftheRedChamber ,thattouchesupon 25 virtuallyallaspectsofthebookonecanimagine,fromtheanalysisofevenminorcharacters inthebooktoin-depthliterarystudyofthebook.MuchofthescopeofRedologyisoutside thefocusofthispaper. Theoriginalmanuscriptof DreamoftheRedChamber begantocirculateintheyear 1759.Theproblemsconcerningthetextandauthorshipofthenovelareextremelycomplex andhaveremainedverycontroversialeventoday,andtheyremainanimportantpartof Redologystudies.Cao,whodiedin1763-4,didnotlivetopublishhisnovel.Onlyhand- copiedmanuscripts{some80chapters{hadbeencirculating.Itwasnotuntil1791that theprintedversionwaspublished,whichwasputtogetherbyChengWeiyuanandGao Eandwasknownasthe Cheng-Gaoversion .TheCheng-Gaoversionhas120chapters,40 morethanthevarioushand-copiedversionsthatwerecirculatingatthattime.Chengand Gaoclaimedthatthis\completeversion"wasbasedonpreviouslyunknownworkingpapers ofCao,whichtheyobtainedthroughtchannels.Itwastheselast40chaptersthat werethesubjectofintensedebateandscrutiny.Mostmodernscholarsbelievethatthese40 chapterswerenotwrittenbyCao.ManyviewthoselateadditionsastheworkofGaoE. Somecritics,suchastherenownedscholarHuShi,calledthemforgeriesperpetratedbyGao, whileothersbelievethatGaowasdupedintotakingsomeoneelse'sforgeryasanoriginal work.Thereis,however,aminorityofcriticswhoviewthelast40chaptersasgenuine. Theanalysisoftheauthenticityofthelast40chaptershaslargelybeenbasedonthe examinationofplotsandprosestylebyRedologyscholarsandconnoisseurs.Forexample, manyscholarsconsidertheplottingandproseofthelast40chapterstobeinferiortothe 80chapters.Othershavearguedthatthefatesofmanycharactersintheendwereincon- sistentwithwhatearlierchaptershavebeenforeshadowing.Anaturalquestioniswhether amathematicalstylometryanalysisofthebookcanshedsomelightonthisauthenticity 26 debate. AlthoughthereisavastRedologyliteraturegoingbackover100years,thenumberof studiesofthebookbasedonmathematicalandstatisticaltechniquesissurprisinglysmall, particularlyinviewofthefactthatsuchtechniqueshavebeenusedwidelyintheWestfor settlingauthorshipquestions.AmongthenotableCao[30]meticulouslybrokedown anumberoffunctioncharactersandwordsintoclassesaccordingtotheirfunctions.By analyzingtheirfrequenciesinthe40chapters,themiddle40chaptersandthelast40 chapters,Caoconcludedthatthe80chaptersandthelast40chapterswerewrittenby tauthors.Zhang&Liu[37]examinedtheoccurrenceof240charactersinthebook thatareoutsidetheGB2312encodingsystem,andfoundthat210ofthemhaveappeared exclusivelyinthe80chapterswhileonly20ofthemhaveappearedexclusivelyinthe last40chapters.Thisledtothesameconclusionbytheotherauthors.Yu[31]studiedthe authorshipbycombiningbothhistoricalknowledgeandstatisticaltools.InthestudyYue testedtwohypotheses,thatthelast40chapterswerenotwrittenbythesameauthororthey werewrittenbythesameauthor.Hisstudyfocusedonthefrequenciesof5particularfunction characters,theproportionoftextstopoemsineachchapter,andafewotherstylometric peculiaritiessuchasthenumberofsentencesendedwiththecharacter\Ma".Usingvarious statisticaltechniquesthecomparisonsledthepapertodrawtheconclusionthatitisunlikely thatthe80chaptersandthelast40chapterswerewrittenbythesameauthor.Atthe sametime,usinghistoricknowledgeaboutthebookandtheoriginalauthorCaoXueqin,the paperalsospeculatedthatitwasnotlikelythatthelast40chapterswerecreatedentirely byasingletauthorsuchasGaoE.Intheoppositedirection,thestudiesofChan [9]andLi&Li[16]concludedthattheentirebookwaslikelywrittenbyasingleauthor. Thestudy[16]focusedontheusageoffunctionalcharacterswhile[9]examinedtheusage 27 ofsomeeightythousandcharacters.Bothstudiestabulatedthefrequenciesoftheselected characters,whichledtoafrequencyvectorforeachofthe40chapters,themiddle40 chaptersandthelast40chapters.Thecorrelationsofthesefrequencyvectorswerecomputed. In[16]thecorrelationswerefoundtobelargeenoughfortheauthorstoconcludethatthe entire120chaptersofthebookwerewrittenbythesameauthor.In[9]afourthfrequency vectorusingpartsofthebook TheGallantOnes wasaddedforcomparison.Theauthor foundtlyhighercorrelationsamongthethreefrequencyvectorsfromchapters of DreamoftheRedChamber thanthecorrelationsbetweenthefourthfrequencyvector andthethree.Thisfactformedthebasisoftheconclusionbytheauthorthatall120 chapterswerewrittenbyasingleauthor.AtconclusionwasreachedbyLi[40].By analyzingthefrequenciesof47functionalcharactersandapplyingseveralstatisticaltests theauthorconjecturedthatthelast40chapterswereputtogetherbyGaoEusingunedited andmanuscriptsbyCaoXueqinandhisfamilymembers. Althoughsomeoftheseaforementionedstudiesareimpressiveintheirscopes,missing conspicuouslyfromtheRedologyliteraturearestudiesbasedonthelatestadvancesinliterary stylometry,particularlysomeofthenewandpowerfulmethodsfrommachinelearningtheory. Whilecomparingthefrequenciesoffunctioncharactersandwordsisclearlyaviablewayto analyzetheauthorshipquestion,careneedstobetakentoaccountforrandom ofthesefrequencies,especiallywhensomeofthefunctioncharactersandwordsusedfor comparisonhavelimitedoccurrencesoverallinthebookandsometimesnotatallinsome chapters.Noneoftheaforementionedstudiesemployedcrossvalidationtoaddressrandom Wehavesubstantialreservationsaboutdrawingconclusionsfromcorrelations aloneasinthestudiesofChan[9]andLi&Li[16],becausethetiatingpowerof anysinglevariablesuchascorrelationisratherlimited.Itwouldbeinterestingtoseea 28 morecomprehensivestudyofcorrelationsonalargecorpusoftextsinChinesetodetermine itsectivenessasametricforauthorshipattribution,somethingtheauthorsfailedtodo inbothstudies.Theuseofthebook TheGallantOnes in[9]forbenchmarkcomparison iscurioustousinparticular,especiallyconsideringthattheauthordidnotlimittojust functioncharacters.Thetwobooksareoftwotgenresandaretintheir respectivebackgroundsettings.Consideringthese and thefactthat TheGallant Ones isknown not tobewrittenbyCaoXueqin,itwouldbeashockifthecorrelation betweenthelast40chaptersof DreamoftheRedChamber andthe80chaptersis not higherthanthecorrelationbetweenthelast40chaptersand TheGallantOnes .Itispossible thatthecorrelationcomputedin[9]tellsmoreaboutthegenrethantheauthorshipofthe books.Again,withoutextensiveevidencethatusingthesametechniquethecorrelation betweentwobodiesoftextswrittenbytauthorsisgenerallylowevenwhentheplots arecloselyrelated,theargumentmadein[9]isunconvincingatbest. Havingestablishedarigorousprotocolforchrono-divides,wearenowinposition toapplythisprotocoltoinvestigatetheauthorshipcontroversyoftheCheng-Gaoversionof DreamoftheRedChamber .Inparticularweinvestigatetheexistenceofachrono-divideat Chapter80. 3.3.2SeparabilityofthechaptersbyCaoandGao Thebookisdividedintosamples.Tobalancethenumberofsamples,wegenerateone sampleforeachofthe80chapterswhileusingtheconventionalpracticeofduplicating eachofthelast40chaptersintotwochapterstoobtain80samples.Fromthosesampleswe extractthefeaturesbycalculatingthestatisticsproposedinsubsection3.2.1.Thesefeatures arethennormalizedforfaircomparison.Intotalwehave196variables.Theyarethe144 29 charactersand48words,thenormalizedmeanandvariationofsentencelength,andthe frequenciesofdirectspeechesandexclamations. Toinvestigatetheauthorshipcontroversyweperformthreeseparatetests.Firstwebuild aforthewholebookandlookfortheexistenceofachrono-divideatChapter80. Foraddedrobustnessandreliabilitywealsoperformthesametestsonlyonthe80 chaptersandthelast40chapters. IntheexperimentweapplyourmethodtothewholeChen-Gaoversionof Dream oftheRedChamber .Samplesfromthe60chaptersaredesignatedastrainingsamples foroneclasswhilesamplesfromthelast30chaptersaredesignatedastrainingsamples foranotherclass.Theremainingsamples,fromChapter61to90,areheldoutastesting samples.Thetrainingsamplesarefurtherrandomlysplitintomodelingdataof80samples andvalidationdataof40samples.TheSVM-RFEisrepeated100timesand d ischosen using50crossvalidationruns.Wehavethefollowingobservations. InstabilityofSVM-RFE. Therandomnessofthemodelingsethasresultedinvery substantialinthenumberoffeaturesselectedaswellasfeaturerankings.The resultedmayalsoperformquitetly.Table3.1liststhefeaturesselected usingtworentmodelingdatasets.Oneselects11featuresandtheotherselectsonly4, withonlyonefeatureincommon.Thealsoperformtly.Theexperiments clearlyestablishtheinstabilityofSVM-REF. Givensuchinstabilityonecannotreliablydrawanyconclusionsfromanysinglerun.For example,ifamodelingdatasetseparatesthetrainingdatawellitmightbeduetoover- ConverselyifitseparatespoorlyitmightbeduetoThisproblemis overcomewithourPseudoAggregateSVM-RFEmethod. 30 Modelingset FeaturesSelected ValidationError 1 qu,de,jiu,hui,zhi,dao,shi,ne,bie,zuo 5/40 2 hui,fang,mei,haoxie 1/40 Table3.1:Thefeaturesandvalidationerrorsoftheobtainedfromtworandomly selectedmodelingsubsets. StabilityofPseudoAggregateSVM-RFE. OurpseudoaggregateSVM-RFEapproach repeatsSVM-RFE100timesusingrandomizeddatasets.Thedatasetfromeachrepeatis usedtoselectasetoffeatures,fromwhichaisbeingbuilt.Forsimplicityweshall refertothedataset,featuresandtheresultingclasrtogetherfromarepeatasa model . Tocounterrandomctuationsweconsiderimportantfeaturestobethosethatappear frequentlyamongthe100s.Thisreducedtheinstabilitycausedbyrandomness.In fact,ourbeliefisasfollows:ifthetwoclassesarewellseparated,thereshouldexistasetof featuresthathelptobuildagoodMostmodelingsubsetsshouldbeabletoselect thesefeaturesoutandonlyalimitednumberofmodelingsetsmightbesingularandmiss them.Conversely,ifthetwoclassescannotbewellseparated,noconsistentlydiscriminative featuresexist.tmodelingsetmayleadtototallytfeaturesubset.Asaresult, nofeatureappearswithhighfrequencyinall100models.Thisphilosophy,however,isonly partiallytrue.Whenthetwoclassescannotbeseparated,themodelingprocesssometimes canovthedatabyselectingalotofvariableswhichresultsinhighabsolutefrequencies forsomelessimportantorirrelevantfeatures.Suchaphenomenonisusuallyaccompanied bylargenumberofvariablesandlowvalidationaccuracy.Toimprovetheprocesswepropose amoreappropriatemetric,whichwecall relativefrequency .Inrelativefrequencyweweight thefrequencybytwocriteria.Inthecriteriaavariableappearinginshortmodelsis 31 weightedmorethanthevariablesappearinginlongmodels.Thisleadstoaweightof h ( n j ) foravariableinthe j thmodel,with n j beingthenumberofvariablesinthe j thmodel. Inthesecondcriteriaavariableinamodelwithhighpredictiveaccuracyisweightedmore thanavariablewithpoorpredictiveaccuracy.Thisprovidesanotherweight g ( A j )fora variableinthe j thmodel,where A j denotestheaccuracyofthe j thmodelcomputedfrom thevalidationprocess.Mathematicallytherelativefrequencyforavariable x i inatestrun of M repeatsisas rf ( x i )= 1 M M X j =1 g ( A j ) h ( n j ) 1 ( x i appearsinmodel j ) : (3.1) Inourstudywealwaysset M =100.Also,weset g ( A j )=exp( A j 1 [2 A j 1] + )where[ t ] + = max f 0 ;t g and h ( n j )=[1 cn j ] + forsomeconstant c .For g ( A j )theideaisthatifthe weightshoulddecayfastiftheaccuracyiscloseto50%orlessbecauseitindicatesthatthe issimplynote.For h ( n j )weputinapenaltyforthenumberofvariables usedinamodel.Inourexperimentswehavechosen c =1 = 30,whichseemstoworkwell. Ourexperimentsshowthatfeaturesyieldedfromrelativefrequencyrankingsarevery stableandconsistent.Wehaveperformedrunsof100repeatsusingtrandomseeds inMATLAB,andtheresultsarealwayssimilar.Anadditionalbofusingrelative frequencyinsteadofabsolutefrequencyisthattheexistenceofaneis typicallyaccompaniedbyhighrelativefrequenciesforthetopfeatures,whilelowrelative frequenciesforthetopfeaturesusuallyimplypoorseparability.Hencewecanuserelative frequencyasasimpleguideontheseparabilityofthesamples.Wewillshowsomeexamples inthenextsection. ResultsandConclusion. Inexperiment1wehaveperformedarunof100repeatson 32 theentireCheng-Gaoversionof DreamoftheRedChamber .Altogether70featureshave appearedinatleastonemodel.However,ofthoseonlyasmallnumberofthemhave appearedwithhighenoughfrequencytobeviewedasbeingimportant.Weapplycross validationtoselectthenumberoffeatures,andthemeancrossvalidationerrorrateagainst tnumberoffeaturesisplottedinFigure3.1(a).Thetellsusthat10to50 featuresareenoughtotellthestylebetweenthetwoparts.Usinglesscharacters andwordsist,whileusingmoredegradestheperformancealsobybringingintoo muchnoise.Thesmallcrossvalidationerrorrateisencouraging,anditisalreadyhinting astrongpossibilitythatthetwotrainingsamplesetshavetstylisticto supportthetwo-authorhypothesis. Tosettlethetwo-authorhypothesismoreelyweapplyouronthetest data,whichuntilnowhasneverbeenusedduringthefeatureselectionandmodeling process.Inparticularweinvestigatetheexistenceofachrono-divideinthevaluesobtained throughclassiFigure3.1(b),whichplotsthesevalues,clearlyshowsachrono-divideat Chapter80:ForChapter81-90theyieldsallnegativevalueswhileforChapters 61-80theryieldsallpositivevalueswiththeexceptionofChapter67.Allowing somestatisticalabberationstooccur,ourresultsprovideanextremelyconvincingifnot irrefutableevidencethatthereexistclearstylometricbetweenthewritingsofthe 80chaptersandthelast40chapters.Thisstronglysupportsthetwo-author hypothesisfor DreamoftheRedChamber .Wealsonotethatourinvestigationdidnotneed toassumethattheknowledgethatthestylisticchangeshouldbeatChapter80.Thefact thatthechrono-dividewehavedetectedisindeedatChapter80lendsevenstrongersupport tothetwo-authorhypothesis. 33 (a)(b) Figure3.1:Experiment1:(a)Meancrossvalidationerrorrate;(b)ValuesofSVM onchapters60-90. Interestingly,thefactthatChapter67appearedasan\outlier"inourserves asfurtherevidencetothevalidityofouranalysis.Itwasonlyafterthetestswerealizedthat theauthorshipofChapter67itselfisoneofthecontroversiesinRedology.Unlikethemain controversyabouttheauthorshipofthe80chaptersandthelast40chapters,experts arelessintheirpositionshere.Again,ourresultsstronglysuggeststhatChapter67 isstylisticallytfromtherestofthe80chapters,anditmaynotbewrittenby Cao.Ourisconsistentwiththeconclusionof[2]. 3.3.3Non-separabilityofthe80chapters Tofurthervalidateourmethodweapplythesameteststothe80chaptersof Dream oftheRedChamber toseewhetherwecangetachrono-divide(Experiment2).Weusethe 30andlast30chaptersasthetrainingdataandleavechapters31-50asthetestdata. 34 Figure3.2showsthemeancrossvalidationerrorandthevaluesofSVMonthe testdatachapters31-50.Theexperimentshowsmuchmorefeatureshavebeenselectedin the100repeats,implyingtheyofaconsistentsubsetofdiscriminativefeatures. Thelargeerrorsonthetrainingdataalsoindicatetheyforseparation.Whenthe isappliedtothetestdata,thereisclearlynochrono-divide.Thissuggeststhatour methodyieldsaconclusionthatiscompletelyconsistentwithwhatisknown. (a)(b) Figure3.2:Experiment2:(a)Meancrossvalidationerrorrate;(b)ValuesofSVM onchapters31-50.Notethereisnochrono-divide. 3.3.4Analysisofchapters81-120:stylechangeovertime Wenextapplyourmethodtothelast40chapters(Experiment3).Ourexperiment hasalreadythattheyareunlikelytobewrittenbyCao.However,therearestill debatesonwhetherthesewerewrittenentirelybyoneauthor(mostlikelyGaohimself),or bymorethanoneauthor.Ourmathematicalanalysismaysomeinsighthere. 35 Wesplitthe40chaptersintotwosubsetsasbefore.ThetrainingdataincludeChapters 81-95asoneclassandChapters106-120asanother.Thetestdataarethemiddle10chapters. Becauseoftherelativelysmallnumberofsampleswehavesubdividedeachchapterinto2 sectionstoincreasethesamplesize.Asaresultwenowhave60samplesinthetraining dataand20intestdata,with2samplescorrespondingtoonechapter.Themeancross validationerroroftheanditsclassivaluesonthetestsamplesare showninFigures3.3(a)and(b)respectively. Inthisexperimentweobservethattheperformanceintermsofboththeand featurerankingisnoticeablyworsethanthatinExperiment1butsubstantiallybetterthan thatinExperiment2.Furthermore,unliketheresultsfromthetwoexperiments,the valuesfromtheshowaninterestingtrend.ComparedwithFigure3.2(b)where thevaluesappearedtolackanyorder,thevalueshereexhibitacleargradualdownward shift.Ontheotherhand,comparedtoFigure3.1(b)thevaluesplottedinFigure3.3(b)do notshowaclearsharpchrono-divide,eventhoughthevalueschangegraduallyfrombeing positivetobeingnegative.Whatittellsusisthatthewritingstyleofthelast80chaptershad undergoneagraduatechange,butthischangeisunlikelytobeduetochangeofauthorship. Ourresultsherecouldbesubjecttoseveralinterpretations.Oneplausibleinterpretation isthatGaomightindeedobtainedsomeincompletesetofmanuscriptsbyCao,andtriedto completethenovelbasedonwhathehadobtained.Thestylechangeisaresultofthelack ofgenuineworkbyCaoasthestorydeveloped.Amoreplausibleinterpretationisthatthe last40chapterswerewrittenbysomeonesuchasGaotryingtoimitateCao'sstyle,andover timetheauthorbecamesloppierandreturnedmoreandmoretohisownstyle. 36 (a)(b) Figure3.3:Experiment3:(a)Meancrossvalidationerrorrate;(b)ValuesofSVM onchapters96-105,whichcorrespondtothesamples31-50inall80samples.Notetwo samplescomefromonechapterinthisexperiment. 3.3.5Comparisonwith ContinuedDreamoftheRedChamber Itisworthmentioningthatthereareseveralotherattemptstocomplete DreamoftheRed Chamber fromits80chapters,amongthemis ContinuedDreamoftheRedChamber byQiZichen.UsingthesamefeaturesforbuildingthecinExperiment1,wecan computetheEuclideandistancesbetweenallchaptersandtheirdistancesofchaptersfrom ContinuedDreamoftheRedChamber ,seeFigure3.4.Surprisingly,althoughthesefeatures areobtainedinfavourofthebetweenCaoandCheng-Gao,theyleadtoeven largerdistancebetweenthe80chaptersandthosechaptersof ContinuedDreamofthe RedChamber .Itobviouslyimpliesthatthestyleofthe40chaptersbyCheng-Gaoismore similartothe80chaptersbyCaocomparedto ContinueddreamoftheRedChamber .Maybe that'swhytheCheng-Gaoversionismorepopularthanotherversions. 37 Figure3.4:Distancesbetweenthe80chaptersoftheCheng-Gaoversion,thelast40 chaptersoftheCheng-Gaoversion,and30chaptersof ContinuedDreamoftheRedChamber . 3.4CaseStudy:AnalysisoftheotherthreeGreat ClassicalNovels TofurtherbolsterthecredibilityofourapproachwetestourmethodontheotherthreeGreat ClassicalNovelsinChineseliterature, RomanceoftheThreeKingdoms , WaterMargin ,and JourneytotheWest .Unlike DreamoftheRedChamber ,thereisnoauthorshipcontroversy forthesethreenovels.Thusifourmethodisindeedrobustweshouldexpectnegativeanswers forthetwo-authorhypothesesforallofthembynochrono-divides. Aswith DreamoftheRedChamber ,wespliteachofthethreenovelsintotraining samplesandtestsamples.Both RomanceoftheThreeKingdoms and WaterMargin have 120chapters.Inbothcaseswedesignatethe30chaptersandthelast30chaptersas thetwoclassesoftrainingdata,andthemiddle60chaptersastestdata.For Journeytothe 38 West thetwoclassesoftrainingdataaretheandlast25chaptersrespectively,withthe middle50chaptersastestdata. Weusethesameproceduretotestforchrono-dividesonthethreenovels.Compared to DreamoftheRedChamber ,theselectedfeaturesshowmuchlowerrelativefrequencies, indicatingyintiatingbetweenthewritingstyles.Table3.2showtherelative frequencies(with c =1 = 30)ofthetop8featuresforeachofthefourGreatClassicalNovels. Alsoofnoteisthatinthecaseof WaterMargin ,51featuresareusedtobuilda fromthe60trainingdata,whichisclearlyanotherstrongindicationoftheculty. Novel Relativefrequenciesoftop8features DreamoftheRedChamber 0.570.460.430.360.310.300.290.19 RomanceoftheThreeKingdoms 0.310.270.260.250.230.220.170.15 WaterMargin 0.180.170.160.160.140.110.110.10 JourneytotheWest 0.030.030.020.020.020.020.020.02 Table3.2:Relativefrequenciesofthetopranked8featuresineachofthefourGreatClassical Novels. Figure3.5plotsthevaluesfromtheforallthreenovels.Inallcasesthevalues insuchawaythatitisquiteclearthatnochrono-dividesexist,asexpected. Thisanalysisshowsthatourapproachcanreliablyrejectthetwo-authorhypothesiswhen itisfalse,lendingfurthersupporttotheenessandrobustnessofourmethod. 3.5CaseStudy:Chrono-divideof Micro Micro ,atechno-thrillerpublishedposthumouslyin2011,isMichaelCrichton'snovel. Itwasfoundonhiscomputeruponhisdeathin2008asanmanuscript.Harper Collinscommissionedscience-writerRichardPrestontocompletethenovelfromCrichton's notesandresearch.Although DreamoftheRedChamber isinChinese,theprincipleofour 39 (a)(b)(c) Figure3.5:resultsonthetestingsamplesoftheotherthreeclassicalnovels: (a) RomanceoftheThreeKingdoms ;(b) WaterMargin ;(c) JourneytotheWest. methodshouldapplytobooksinotherlanguages. Micro thusprovidesuswithagoodtest example.Inthiscasestudy,wewilluseourapproachtocanddetectthechrono- divideof Micro .Wewillalsoperformatnewtestusingbuiltdirectlyfrom otherbookswrittenbyCrichtonandPrestonforcomparison.Thenewtestservesbothasa validationofourmethodandasacomparison.Notethatthesecondoptionisnotavailable for DreamoftheRedChamber . InthedirecttestweusetheotherbookswrittenbyCrichtonandPrestonto generatethetrainingdata.Atotalof17bookswrittenbyCrichtonand2booksbyPreston wereusedfortraining.Theinitialfeaturesconsistofthefrequencyof241mostfrequently usedwordsinthesebooks.Tobuildceachbookwasdividedintomultiplepieces witheachpiececontainingapproximately2000words.Thefrequenciesofthethe241selected wordsofeachpieceformasamplingpoint.Overall782datapointsforCrichtonand104 datapointsfromPrestonweregenerated.Toovercometheimbalanceofthesamplingpoints forCrichtonandRichard,weonlyused728samplesforCrichtonandtheyaresplitinto7 subsets.EachsubsetiscombinedwiththesamplesforPrestontoformatrainingdataset, fromwhichwebuildalinearclaser.Sototally7wereconstructed.Applying thetodetectthechrono-dividein Micro ,wechunkthebookinto56parts,each 40 containingabout2000words.Eachpartprovidesatestingsamplepoint.Weappliedthe7 tothistestingdata.Theaverageofthe7sareplottedinFigure3.6.The resultshowsabreakpointataroundthe15th-16thsamplepoints. Figure3.6: Micro :averagevaluesofthe7 Wecannowcomparetheabovemethodtotheearliermethodfor DreamofRedChamber . Weassumethat,comparedwiththeoverallstyleofanauthoracrossmultiplebooks,the styleoftheauthorinasinglebookwouldbemoreconsistent.Asaresultwedivided Micro into112partsofapproximately1000wordseach.Themostfrequentlyused265content independentwordsfromthebookwereusedastheinitialfeatures.Weusethe22 samplepointsandthelast22samplepointsastrainingandvalidationdataandthemiddle 68samplepointsastestdata.TheresultsareshowninFigure3.7.Aclear breakpointcanbeseenaroundthe29th-30thsamples. Thetwoexperimentsctheexistenceofachrono-dividein Micro ,andprovide furtherevidenceofthevalidityofouroriginalapproachfordiscoveringandlocatingchrono- divides.Asabyproduct,ourresultsshowthatthechangeofauthorshipfor Micro had occurredbetween1/4and1/3ofthebook.ThisisconsistentwithwhatRichardPreston hadindicatedinseveralinterviewsaboutthebook. 41 Figure3.7: Micro :ionresultbythecobtainedusingthe22partsand thelast22partsofthebookastrainingsamples. 3.6Conclusion Inspiredbyauthorshipcontroversyof DreamoftheRedChamber andtheapplicationof SVMinthestudyofliterarystylometry,wehavedevelopedamathematicallyrigorousnew methodfortheanalysisofauthorshipbytestingforachrono-divideinwritingstyles.We haveshownthatthemethodishighlyeandrobust. ApplyingourmethodtotheCheng-Gaoversionof DreamoftheRedChamber hasledto convincingifnotirrefutableevidencethatthe80chaptersandthelast40chaptersof thebookwerewrittenbytwotauthors.Furthermore,ouranalysishasunexpectedly providedstrongsupporttothehypothesisthatChapter67wasnottheworkofCaoXueqin either. Applyingourmethodto Micro ,weareabletotheexistenceofchrono-divideand identifyitslocation.Itprovidesstrongevidenceforustoattributethe1/4ofthework toMichaelCrichtonandtheleft3/4toRichardPreston. Therobustnessofourapproachisalsoevidencedbyitsabilitytorejectthemultiple authorhypothesiswhenthereisnochrono-divide,aswehavedonefortheotherthree classicalChinesenovels. 42 Chapter4 OpenClassAuthorshipIden 4.1Introduction Whowrotethe FightingforOurLives ,whowrotethe RecipesforDisater andwhowrote TheAnimatedSkeleton ?Therearenumerousbooksthatwerepublishedanonymouslyor underpseudonyms.Althoughauthorshipattributionhasalonghistory,mostofthestudies aretothecloseauthorshipidencaseswheretheauthorsarelimitedtoa smallsetofpotentialcandidateauthors.Forexample,the Federalist paperswereknownto bewrittenbythreeauthors(Hamilton,MadisonandJay),anditisonlyamatterofdeciding whichpaperwaswrittenbywhichoneofthethreeauthors.Incontrast,theaforementioned booksdidn'thaveanyreliablesetof\suspects",andtheidenoftheirauthorship becomesanopencase.Theclosecaseidenproblemisgenerallyfarlesschallenging. Withasmallsetofpotentialauthorsitisoftenviabletosimplylookfortheclosestmatch insomefeatures,andthisisinfacthowmostoftheseclosecaseswerestudied.However, thisapproachcannotbeappliedifthenumberofcandidateauthorsisverylarge(insome casesitwillbetheentiredatabase).Thusforopencaseidenproblemswewillneed completelynewapproaches,approachesthatwillallowustoreliablypickoutthetrueauthor amongalargenumberofauthors.Sofar,therehasnotbeenareliablemethodtohandle openclassauthorshipiden Authorshipattributionisoneofthemostprominentproblemsinthebroadresearcharea 43 ofliterarystylometry,whichhasseenratherexplosivegrowthinrecentyears.Thebasicidea behindtheauthorshipattributionisthatpeoplecanalwayssomemeasurementsthat candistinguishthestylesofditauthors.Althoughthestyleofanauthorint booksmayvaryduetotgenres,itshouldn'tdtoomuch.Theinstyle betweenauthorsshouldbemuchlargercomparedtothatwithinanauthorifweuseproper measurements.Thus,withthesemeasurements,totheauthorofabookwhosesetof candidateauthorsisknown,wecanusesomebooksfromacandidateauthorastraining samplesandbuildbetweentauthors.Thepurposeofpartofthis chapteristosolvecloseclassproblems,wherethesetofcandidateauthorissmall.In anotherword,ifonebookissuspectedtobewritteneitherbyauthor A 1 ;A 2 ,or A 3 ,wecan alwayssomefeaturesthatcandiscriminatethestyleoftheseauthorsandattributeitto oneofthecandidateauthors.Thisformsthebasisforclosecaseauthorshipident Oneofthefundamentalrequirementforconductingopencaseidenistheac- cesstoalargedatabaseofauthorsandtheirwork.Fortunately,therearemanydatabases thatareavailableeitherinthepublicdomain(e.g.ProjectGutenberg)orthroughlibrary subscription(e.g.theHathiTrustcollection).Butevenwithalargedatabase,the authorofabookssuchas FightingforOurLives whenwehavenoclueaboutitsauthorship ischallenging.Theauthorshipinanopenclassproblemismuchhardertostudy,andvery fewworkhasbeendonesofar.Harderstillistobeabletodoittlywithinalarge database.Inthischapter,weproposeanewmethodtoopenclassproblemsinauthorshipat- tributionusingtheideaofrandomizationtogetherwiththemethodforcloseclassproblems, bywhichwecandetectthetrueauthorofanytextinalargedatabaseofauthorsanddoit tly.Wecanattributethetexttoitstrueauthorifheorsheisinthedatabase,and otherwisewecanconcludethatthetrueauthorisnotinthedatabase.Wewilldescribeand 44 demonstratethereliabilityofouralgorithminthenextsections.Ourmethodisbasedon anewmethodforclosedcaseidenandanauthorrandomizationtechnique.Inthe partofthechapterweconductthreecasestudiesbyanalyzingthreebooks: Cuckoo's Calling , ToKillaMockingbird and DreamsfromMyFather .Allthreebooksareinvolved incontroversies,someofwhicharestillongoing.Weshowthatourmethodcanalmost unequivocallysettlethecontroversies. 4.2CloseClassProblems 4.2.1Methodology Aswementionedearlier,theobjectiveofclosecaseidenistoconstruct fortheseauthorsandthebestmatchamongthegroupusingcertainmetrics.Herewe buildSVMusingfrequenciesofwordsasfeatures.Insteadofusingonlyfunction words,weusethemostcommonlyappearedwordsinthesamplesweareanalyzing.This turnsouttobemoreeinourtesting,anditisalsopointedoutin[7]. Let A 1 ;A 2 ;:::;A L betheauthorsinagroupwithinwhichwewouldliketothe closestmatchforcertaintextinquestion.Webreakdownallthetrainingandtestingtext into samples .Eachsampleisasegmentoftextconsistingof K words.Inourstudywe haveset K =2000.Thisprovestoworkverywellandsmaller K suchas K =1000still workswell.Thisisimportantinsituationswheretheavailabilityoftextbycertainauthor islimited.Atypicalbookisbrokendownintobetween25to100samples.Thenumberof samplesthat canbeattributedtoagivenauthor thusdependsonhowthe authoris.Onaverageanauthorwillhaveacorpusofbetweentwohundredstoathousand samples. 45 For L> 2,therearetwopopularwayswecanperformthismulti-class task.Onewaywouldbetobuildforeachpair,andinessenceconductaseries ofhead-to-headcontest.Theclosestmatchwouldbetheauthorwiththemostwins.One potentialproblemisthatas L increases,thenumberofpairwisewouldincrease intheorderof L 2 ,makingitcomputationallylesst.Hereweemploythe one-vs-rest method,inwhichwebuild L f 1 ;f 2 ;:::;f L .The f j isbuiltusingauthor A j asoneclassandallotherauthorsgroupedintoasingleclass.Thisisshowntoperform aswellasthepairwiseapproachandinourtesting,itactuallyoutperformsslightlyinour setting. Assumethatthetextinquestionisbrokendowninto N samples X 1 ;X 2 ;:::;X N .We nowapplythe f j toeachsample X n ,where f j isthesignfunctionofdecision function2.11.Thetestisbeingmatchedwithauthor j 0 (maximumwin)if j 0 =argmax j X n f j ( X n ) : Inourimplementationwealsogainsomeimprovementbyrunningseveralfoldsof andtaketheiraveragetominimizethechanceforoutliers,withmatchingbeingdecidedby avotingschemesuchasmaximumwinorBORDA.Weshallcall q j =N the matchingrate for author A j ,where q j isthenumberofsamplesamong f X n g thathavematchedwithauthor A j .Theauthorwiththehighestmatchingrateisdeclaredtobetheclosestmatchwithin thegroup. 46 4.2.2Casestudies Casestudy1{ Cuckoo'sCalling . Acontroversyreceivingagreatdealofmedia attentioninthesummerof2013wastherumorallegingthattheauthorRobertGalbraithof thecrime Cuckoo'sCalling wasactuallythepseudonymforJ.K.Rowling.Inan to\out"Rowling,asastoryintheNationalGeographicdetailed,the SundayTimes sent thebookalongwithbooksbyRowlingandthreeothercrimeauthorstoPatrickJuola andPeterMillican,twoprominentexpertsinauthorshipattribution,forcomparison.Both researcherswereabletoidentifyRowlingastheclosestmatchamongthe4authors.Butto highlightthegeneralyoftrueauthorshipidenon,thestoryalsomentionedthat Juola\wasn'ttotallytintheresult.Afterall,hehadnowayofknowingwhether therealauthorwassomebodywhowasn'tinthecomparisonsetofbookswhohappenedto writelikeRowlingdoes." Wehavetested Cuckoo'sCalling withinagroupof L candidates,where L canbe2,3 ormoreandoneortwobooksfromeachauthorareusedinconstructingtheTo makeourresultcomparable,thebookswechoosearesimilartotheonesthatJuolaand Millicanhadusedintheirtrainingprocess.Toshowtheaccuracyofourwehave alsotestedotherbooksbyalltheauthors.Thematchingratesarelistedtogetherwith thatof Cuckoo'sCalling .Wehaveconductedseveralgroupsofexperimentsandobtained thefollowingresults(seetables(4.14.24.34.4)).Asonecansee,althoughtherearesome whenttrainingsamplesareused,thetestedbooksfromeachauthor matchedtheirownworkwithveryhighmatchingrates.Furthermore,allthetableshere indicatethat Cuckoo'sCalling istheclosestmatchtotheworkofJ.K.Rowing.Wecan easilydrawthesameconclusionasJuolaandMillicandid,namelyifRobertGalbraithis 47 amongtheauthorsinthetraininglist,thenitisthepseudonymforJ.K.Rowling. Testing n Training TheCasualVacancy-J.K.Rowling ThePrivatePatient-P.D.James Cuckoo'sCalling 72/72 0/72 Deathly-JKR 92/100 8/100 Death-PDJ 1/45 44/45 Table4.1: Cuckoo'sCalling :resultbytheobtainedusingonebook foreachofthetwoauthorsastrainingsamples. Testing n Training TheCasualVacancy J.K.Rowling ThePrivatePatient P.D.James NoMoreDying R.Rendell DeadBeat V.McDermid Cuckoo'sCalling 65/72 0/72 1/72 6/72 Deathly-JKR 64/100 1/100 8/100 27/100 Death-PDJ 0/45 44/45 0/45 0/45 SomeLie-RR 2/28 0/28 25/28 1/28 KickBack-VM 0/37 0/37 0/37 37/37 Table4.2: Cuckoo'sCalling :resultbytheobtainedusingonebook foreachofthefourauthorsastrainingsamples,groupI. Testing n Training Half-Blood J.K.Rowling ColdBlood CapoteTruman UnbearableLightness Alexander.McCall MurderofRoger AgathaChristie Cuckoo'sCalling 57/72 11/72 3/72 1/72 Deathly-JKR 92/100 8/100 0/100 0/100 Prayer-CT 0/22 22/22 0/22 0/22 Lady-AM 1/32 0/32 31/32 0/32 Appoint-AC 5/26 0/26 4/26 17/26 Table4.3: Cuckoo'sCalling :resultbytheobtainedusingonebook foreachofthefourauthorsastrainingsamples,groupII. Casestudy2{ ToKillaMockingbird ToKillaMockingbird wasasouthern dramabyHarperLeepublishedin1960.AyearlateritwonthePulitzerPrizeandhad quicklybecomeoneoftheclassicsinAmericanliterature.Eversinceitspublication, theauthorshipofthebookhasbeenasubjectofcontroversybecausethiswastheoneand onlyonepublishedworkbyHarperLeeuntil2013.Somefoundithardtobelievethat anunknownauthorwouldwriteasingleandgreatnovelandthenstopwriting.Rumors persistedthetrueauthorswasTrumanCapote,achildhoodfriendofLeeandtheauthorof 48 Testing n Training J.K.Rowling P.D.James RuthRendell ValMacDermid Cuckoo'sCalling 72/72 0/72 0/72 0/72 Chamb-JKR 42/42 0/42 0/42 0/42 Light-PDJ 1/60 59/60 0/60 0/60 Crack-VM 0/37 0/37 1/37 36/37 Table4.4: Cuckoo'sCalling :resultbytheobtainedusingtwobooks foreachofthefourauthorsastrainingsamples. InColdBlood ,despitethedenialsofbothLeeandCapote.PearlBelle,afamousliterary criticandeditorinCambridge,Massachusetts,exposedthatCapoteimpliedtoherthathe pennedorheavilyeditedthebook.In2001JimGilbert,anAlabamawriter,addedtothe rumorbyclaimingthat ToKillaMockingbird wastheworkofCapoteafterhecomparedit with InColdBlood intermsofliterarystructure.ButHarperLeehasherowndefendersas well.Dr.WayneFlyt,aretiredprofessorofhistoryfromAuburnUniversity,claimedthat HarperLeeisthetrueauthorofthisbookafteranalyzingthevoicesofthecharactersand foundthestylestobequitetfromthestyleofCapote. Tosolvethisauthorshipmystery,wehavecompared ToKillAMockingbird withthe workbyTrumanCapoteusingtheproposedmethodandauthorrandomizationforclose classproblems.Wehavetrainedtheusingonebookbyeachoftheecandidate authors(seetable4.5).Wehavethentested ToKillAMockingbird togetherwiththeother booksbytheselectedcandidateauthors,thematchingratesareshownintable4.5.The resultsshowthatthetestingworksfromcandidateauthorsmatchedcorrectlyinhighrates, whilethematchingrateof ToKillAMockingbird isdistributedalmostrandomlyinthethree oftheeauthors.Thus,weconcludethatitisunlikelythat ToKillAMockingbird was writtenbyCapoteTruman. Casestudy3{ DreamsFromMyFather . DreamsfromMyFather isamemoir byBarackObamapublishedin1995.In2008WilliamAyersstirredupacontroversyby 49 Testing n Training ADarkerDomain ValMcDermid TheExecutionersSong MailerNorman InColdBlood CapoteTruman TheMurderofFrorence DouglasPreston MidnightGarden JohnBerendt ToKillaMockingbird 13/49 3/49 15/49 1/49 17/49 Bleeding-VM 63/66 2/66 0/66 0/66 1/66 Castle-MN 9/75 53/75 3/75 10/75 0/75 Voices-CT 1/27 0/27 24/27 0/37 2/27 Prayer-CT 1/22 0/22 21/22 0/22 0/22 Table4.5: ToKillAMockingbird :resultbytheobtainedusingone bookforeachoftheeauthorsastrainingsamples. \confessing"thathewasthetrueauthorofObama'smemoir.SomeofObama'scriticswere eagertopointoutthatObamadidn'tpossessthewritingskilltowritethisbest-sellingbook becausehecouldn'tevenwritea30secondspeech(his2012acceptancespeechwaswritten byBillClinton).Hisdefenders,ontheotherhand,werebelieversthatObamahimself wasthetrueauthor.Afterreadingthebook,JackCashillarguedthat DreamsfromMy Father wasthematicallyandsemanticallytooclosetoAyers'searliermemoir, FugitiveDays . PatrickJuolacouldn'tmaketheconclusionandsaid:\Theaccuracysimplyisn'tthere" usinghistools. ToinvestigatetheAyers'\confession",weconstructedclasserswithinasmallgroup, whichincludedthebook FugitiveDays , AudacityofHope byObama,andothertwobooks fromotherauthors,andtestedthematchingfor DreamsFromMyFather andotherbooks knowntobebythechosenauthors.Aswecanseeintable4.6,wematchedallthebooksto thetrueauthorswithhighaccuracy,whilethebestmatchfor DreamsFromMyFather was withObama's AudacityofHope .ThisresultleadsustoconcludethatAyerslikelyliedin his\confession". Testing n Training AudacityofHope BarackObama FugitiveDays BillAyers ADistantMirror BarbaraTuchman TheStoryofMyLife ClarenceDarrow Dreams-Obama 54/74 10/74 3/74 7/74 Confession-Ayers 1/42 40/42 0/42 1/42 Folly-Tuchman 2/80 0/80 78/80 0/80 Table4.6: DreamsFromMyFather :resultbytheobtainedusingone bookforeachofthefourauthorsastrainingsamples. 50 4.3OpenClassProblems Openclassproblemsaretocheckwhethertheauthorisinthetrainingauthorpoolornot, ifyes,whoitis.Largecandidategroupandcientalgorithmtodealwithbigdataarethe twomainchallengesinsolvingopenproblems.Wewillmoveonestepfurtherinauthorship attributionandreliablydetectwhethertheauthorisinthetrainingauthorpoolornot.Our approachisbasedonthemethodinsolvingcloseclassproblemandtheideaofrandomization. Theideaisthatiftheauthorisinasmallgroup,wewillalwaysgethighmatchingrate forthetrueauthorwhenweconstructinthegroup.Whenweseparatecandidate authorsintosmallgroupsrandomly,thetrueauthorwouldalwaysobtainhighmatchingrate. Byrecordingtheoccurrenceoftheauthorswhogetthehighestmatchingrate,wewould obtainthehighfrequencyofthetrueauthor.IfweauthorAandconstructby randomlychoosingotherauthorinthetrainingmanytimes,theaveragematchingratefor authorAshouldbemuchhigherifAisthetrueauthor.Iftheaveragematchingratesfor allthecandidateauthorsarenothighenough,thenwecanconcludethatthetrueauthoris notinthecandidateset. Theopenclassauthorshipidentiproblemisdividedintoseveralcomponents. Herewebreakdownthesecomponentsanddiscusstheirdetails. 4.3.1Databaseanddatapreparation Naturally,toperformopenclassauthorshipiden,onewouldneedasizabledigi- taldatabaseofauthors.Fortunatelythisisnotaproblemtoday.Thepubliclyavailable ProjectGutenberghasover45,000titlesasofMay2014.MoreimpressiveistheHathiTrust collection,whichhasover10milliontitlesandisavailablefromseveraluniversitylibraries. 51 TheHathitrustcontainsbothpublicdomainandcopyrightedtitles.Itisanidealdatabase aroundwhichtobuildalargescaleauthorshipstylometryanalysisproject.Forauthorin thetraining,weneedenoughsamplestotrainthethus,Wedon'tuseallthe authorsinthementioneddatabase.Wejustchoosesomeauthorswhohaveatleast3books andmorethan100samples(oflength2000)fromthesebooks.Finally,wegotadatabase whichcontains200authorsandeachauthorhasatleast3bookswhichcontainatleast100 samples. Sinceourmethodisbasedonsupervisedmachinelearning,trainingsamplesinthe databasewillbeusedtotrainforIntheorywecanuseallavailablesamples inthetrainingprocess.Inpractice,wenoticedthatresultsdonotimprovetlywith morethan100samplesperauthor.Infactweobtaingoodresultsevenwithonly30-60 trainingsamplesperauthor.Byrandomlyonebookastestandconstructingtheclas- betweenthetrueauthorandothers,weobtaintheaveragematchingrateforthetrue authorbyrepeatingthisprocess50times.Theoretically,thewritingstyleoftheauthorscan bebetterifthetrainingsamplesarechosenfromtbooksbyeachauthor ratherthanfromoneortwowholebooksbytheauthors.Thefollowingtwotables4.7and 4.8showthatwhenthesamplesizeincreasefrom30to90,theaveragetruematchingrate isgettinghigher.Theyalsovthatwhenthetrainingsamplesarefromtbooks ratherthanoneortwobooksbytheauthor,thetruematchingrateisbetter.Thus,allour casestudiesinthisparthaveused90trainingsamplesfromtbooksbyeachauthor. 4.3.2Methodology ImprovedReliabilityandRobustnessthroughAuthorRandomization. Good resultscanoftenbeattainedifthereiscertaintythattheauthorofthetextinquestionis 52 L=2 L=3 L=4 L=5 30 0.93 0.87 0.86 0.84 60 0.94 0.93 0.89 0.88 90 0.94 0.93 0.90 0.90 Table4.7:Matchingrateformulti-classtrainedbyrandomlychosenbooksby eachauthor. L=2 L=3 L=4 L=5 30 0.95 0.92 0.91 0.88 60 0.96 0.95 0.93 0.92 90 0.97 0.96 0.94 0.93 Table4.8:Matchingrateformulti-classtrainedbyrandomlychosensamples byeachauthor. indeedinthegroup(suchasinthecaseof Federalist papers).Theproblemisthattheclosest matchcanalwaysbefoundevenwhentherealauthorofthetextinquestionisinfactnot amemberofthegroup.Soingeneraloneoftencanonlyclaimtohavefoundtheclosest matchwithinthegroupwithouttrulyidentifyingtheauthorship. OurSVMapproachisn'timmunetothisconcern.Infact,itcanhappenwheninasmall groupamatchingrateofover90%isobtainedbysomeoneotherthanthetrueauthorof sometextinquestion.Thisusuallyoccurswhenthetrueauthorisnotinthegroup.Sohow canoneelyminimizesuchfalsematching? Akeyingredientforourmethodis authorrandomization ,whichprovestobeahighly ewaytoidentifyafalsematching.Hereforeach potentialcandidate authorweperform aseriesofsmallgrouponsagainstrandomlychosenauthorsfromadatabase. Moreprecisely,let A 1 betheauthorwewishtotestformatchingforthetextsamples X 1 ;X 2 ;:::;X N .Wenowrandomlyselectauthors A 2 ;:::;A L foradatabaseandperform then.Thisisdoneoverseveraltrials.Intheendweobtainanaveragematching 53 rateforauthor A 1 .Theoutlineofouralgorithmforopenclassproblemsisasfollowing: 1Initializethedata,whichcontainsatleast3booksbyeachauthorinthedatabase. 2Spliteachbookintopartsof K words(usuallyweset K =2000),andextractthe featuresforeachpart.Werandomlychooseonebookbyeachauthorinthedatabase astest. 3Randomlydividetheauthorsintogroupsof L 1 authors,andconstructfor eachgroup.Recordtheauthorswhoreceivesthehighestmatchingrates. 4Repeat T 1 timesstep3andrecordtheauthorshavingthehighestmatchingrates. 5Removetheauthorswhoseaveragematchingrateisbelowthethreshold S 1 .Those authorswhoarestillleftaredeemedaspotentialtrueauthors,or\suspectauthors". 6Fixeachsuspectauthorfromstep5andrandomlychoose L 1authorsfromthe databaseandconstructaRecordthematchingratesforthesuspectauthors. 7Repeat T 2 timesstep6toobtaintheaveragematchingrateforeachsuspectauthor. 8Ifthemaximumaveragematchingratefromstep7isabovesomethreshold S 2 ,we identifythatauthorasthetrueauthor.Otherwisethetrueauthorisnotinthe dataset. Wehavedoneextensivetestofthisapproach,andthetable4.9showsthematchingrates forbothauthorsandnon-authors.Thelastrowdenotesthemaximalaveragematchingrate foranon-authorinourtest.Asonecansee,theaveragematchingratesofthetrueauthors areabove90%evenfor5-classwhiletheaveragematchingrateofthenon- authorisalwaysonaverage.What'smore,eventhemaximalaveragematchingratefora 54 non-authorisnotevenclosetotheaveragematchingrateforthetrueauthor.Thisisperhaps themostimportantattributethatallowsustoidentifyauthorshipusingourapproach. L=2 L=3 L=4 L=5 Author 0.94 0.93 0.90 0.90 Non-author 0.48 0.32 0.24 0.20 Maxnon-author 0.70 0.62 0.59 0.51 Table4.9:Matchingrateformulti-classbyrandomlychosenbooksbyeach author. L=2 L=3 L=4 L=5 Author 0.97 0.96 0.94 0.93 Non-author 0.51 0.34 0.25 0.20 Maxnon-author 0.66 0.59 0.52 0.49 Table4.10:Matchingrateformulti-classwithrandomlychosensamplesby eachauthor. Theimportanceoftheaboveresultsisthatitprovidesanextremelyrobustwaytonot onlytelluswhichauthoristhebestmatchforagivenbodyoftext,butalsohowwell thematchingis.Thegapbetweenmatchingratefortrueauthorsandnon-authorsisso tthatitcanbescaledupintoalegitimateopenclassauthorshipiden method,whichwediscussnext. OpenClassAuthorshipation. Toscaleourmethodforcloseclassproblems upforopenclassauthorshipidenwedivideourdatabaseofauthorsintosmall groups G 1 ;G 2 ;:::;G M of L authors(wechoose L tobe4or5.Itworksfor L aslargeas 10).Withineachgroup G m webuildwhichcanhaveseveralfoldsasmentioned earlier.Formorerobustnesswemayalsoestablishseveralsuchgroupingsfromrandom assignments.Notethatthisisperhapsthemostcomputationallydemandingpartwitha largedatabase,butitcanbedoneentirelyasaonetimetask. 55 Givenabodyoftextinquestionanditssamples X 1 ;:::;X N ,wefeedthesamplesinto theforeachgroup.Eachgroupwillhaveanauthorthathasthehighestmatching rate,buteventhehighestmatchingratecanbelow.Weonlyselectthosewhosematching ratesexceedathreshold(e.g.50%)ascandidatesforthenextroundoftesting.Intheory,the trueauthorshouldalwaysgetthehighestmatchingrateinallgroups,whileotherauthors won'talwaysobtainhighmatchingrate.Assumethatafterthisinitialroundweareleft withagroupofpotentialauthors B 1 ;B 2 ;:::;B q forthetextinquestion.If q isstillrather large,asitispossiblegiventhesizeofthedatabasewehopetobuild,wewouldthentreat B 1 ;B 2 ;:::;B q asaseparatedatabaseofauthorsanditeratethisprocessuntilasmallnumber ofcandidateauthors C 1 ;:::;C p remain. Nowheretheauthorrandomizationmethodshowsitspower.Foreachremainingcandi- dateauthor C j ,werandomlyadd L 1authorsfromtheentiredatabasetoformagroup G of L authors.Within G amatchingratefor C j isobtained.Werepeatthisprocessfor C j manytimestoobtainanaveragematchingratefor C j .Weidentifytheauthoramong C j withthehighestaveragematchingrateastheauthorofthetextinquestion,provided thattheaveragematchingrateishigherthancertainthreshold.Ifeventhehighestaverage matchingrateislow,wemayconcludethatthetextiswrittenbyanauthoroutsideour database. Inthefollowingcasestudiesweset L 1 =4or5, T 1 = T 2 =50, S 1 =0 : 5or0 : 6.Wealso set S 2 =0 : 8. 4.3.3Casestudies Wehavebuiltatestdatabaseconsistingof200authorsandeachauthorhasmorethan100 samples.Thisdatabase,althoughsmall,alreadyallowsustotestourmethodandstudy 56 someoftherealworldcontroversies.WeareworkingwiththeMichiganStateLibraryto tlyaugmentthesizeofthedatabase.ItisconceivablethatusingtheHathitrust wecanbuildadatabaseconsistingofseveralhundredsofthousandsofauthors. Casestudy: Cuckoo'sCalling . Asatestcaseweranouropenclassmethodon Cuckoo'sCalling usingthedatabasewehadbuilt.Theresultisquitestrikinglyeventhough ourdatabaseissmall.Inthestagewherewedividetheauthorsintogroupsoflength5, sotherearenow40groups.Werecordtheauthorineachgroupwiththehighestmatching rate.Butofthe40leadersineachgrouponly7hasamatchingrateover60%.Ifwe lowerthethresholdto50%thenthereare16candidatesremaining.However,ifwedothis 50timesthenonly3candidateshavematchingrateof50%orhigherinatleast30trials (notsurprisinglyRowlingpassedthethresholdinalltrials.).Wealsotriedamorerobust approachinthestage,inwhichwegrouptheauthorsandkeepthoseaspotentialauthors (orsuspectauthors).Thenextstageisauthorrandomization,whereweeachsuspect authorandrandomlychoose L 1authorsfromthenon-suspectauthorsandbuildthe Weobtainedtheaveragematchingrateforeachbyrepeating T 2 times inthisrandomizationprocess( T 2 =50).Table4.11showsthatthematchingrateforeach suspectauthorishigherthantheaverage.Stillthematchingratein T 2 =50repeatsfor RowlingwasmuchhigherandwecaneasilyandalmostelyidentifyRowlingasthe trueauthor. masterauthors frequency L=2 L=3 L=4 L=5 80 50/50 0.96 0.93 0.91 0.91 163 34/50 0.64 0.58 0.50 0.41 46 33/50 0.65 0.57 0.44 0.33 83 28/50 0.63 0.56 0.42 0.31 174 28/50 0.62 0.53 0.40 0.32 Table4.11: Cuckoo'sCalling: mastersuspectedauthorsandtheaveragematchingrate. 57 Casestudy2{ ToKillaMockingbird . Asanothercasestudy,wehaveapplied ourmethodofauthorrandomizationtostudytheauthorshipof ToKillAMockingbird .We testthebookformatchingagainstCapoteand L 1otherrandomlyselectedauthorsfrom ourdatabase.Additionallywehavealsotested AnsweredPrayers byCapoteinparallel.The trialisrepeated50times.Aswecanseeintable4.12,theaveragematchingrateof ToKill aMockingbird withCapoteislowerthan50%.Incontrast,thematchingratefor Answered Prayers isalwaysabove80%.TheresultshowsvirtuallyirrefutablythatCapotedidnot write ToKillaMockingbird . testbooks n matchingrate L=2 L=3 L=4 L=5 ToKillaMockingbird 0.49 0.40 0.39 0.36 Voices-CT 0.89 0.84 0.85 0.81 Prayer-CT 0.98 0.96 0.95 0.96 Table4.12: ToKillaMockingbird: theaveragematchingrate. Casestudy3{ DreamsFromMyFather . Tostrengthenourconclusion,wehave againappliedourmethodbymatchingup DreamsFromMyFather againstWilliamAyers and L 1randomlyselectedauthors.Thisisrepeated50times.Testingresultsintable 4.13aretheaveragematchingrateforthe50trials.TheaveragematchingrateforAyers islowerthan25%.Incontrastinthesametestbutwith PublicEnemy:ConfessionsofAn AmericanDissident byAyersinplaceof DreamsfrommyFather ,weareabletoobtainover 98%averagematchingrate!ConstructingtheusingotherbooksbyAyersyield verysimilarlyhighmatchingrates,whicharealwaysabove95%.Changingthetrainingbook forAyersgivesalmostthesameresults,wherethematchingratefor DreamsfromMyfather againstAyershasalwaysstayedbelow50%.ThisisvirtuallyirrefutableevidencethatAyers lied inhis\confession"aboutwritingObama'sautobiography. Aswecanseeintheresultsofthethreeexperiments,ourmethodforopenclassproblems 58 testbooks n matchingrate L=2 L=3 L=4 L=5 Dreams-Obama 0.24 0.25 0.23 0.23 Confession-Ayers 1.0 0.98 0.99 0.99 Table4.13: DreamsFromMyFather: theaveragematchingrate. canalwaysdetectthetrueauthorsuccessfully,althoughthematchingratesofthetrue authorsvaryinrentexperiment.Whenthetrueauthorisnotinthecandidatesetlike ToKillaMockingbird ,theresultsfromourmethodcanleadusclaimthatthetrueauthor isnotinthecandidateset. 4.4Conclusion Westudiedthecloseclassauthorshipattributionproblemsusingmachinelearningtechniques andstudiedtheauthorshipproblemsof Cuckoo'sCalling , ToKillaMockingbird and Dreams FromMyFather .Alltheexperimentsshowedhighmatchingrateforthetrueauthors. Weproposedanewalgorithmtosolveopenclassauthorshipattributionproblemsby extendingthemethodforcloseclassproblemsandrandomizingauthorsinabigauthor pool.Bycomparingthematchingrateoftrueauthorsandnon-authors,weshowedthat thisalgorithmisveryreliableandstable.Tofurtherverifyourresult,westudiedthethree particularcasesasintheclosecase. Applyingourmethodsto Cuckoo'sCalling ,wecanthatitiswrittenbyJ.K.Rowling andpublishedwithhispseudonymRobertGalbraithwithstrongevidence.Studyingtheau- thorof ToKillaMockingbird withourmethods,weconcludedthatitsauthorisnotCapote Truman,althoughwecan'tconcludeitisbyHarperLeebecausenotrainingsamplesfrom her.Furthermore,wecantellthatAyersliedtothemediathatitwashewhowrote Dreams FromMyFather bycomparingthematchingratesbetweenbooksbyObamaandAyers. 59 BIBLIOGRAPHY 60 BIBLIOGRAPHY [1]FriederikeAntosch. Thediagnosisofliterarystylewiththeverb-adjectiveratio .New York,1969. [2]YanAnzheng.Onepieceofevidencethatchapters64and67arenottheoriginalversion. JournalofXianyangNormalUniversity ,24:3,2009. [3]RonaldEBee.Statisticalmethodsinthestudyofthemasoretictextoftheoldtes- tament. JournaloftheRoyalStatisticalSociety.SeriesA(General) ,pages611{622, 1971. [4]RonaldEBee.Astatisticalstudyofthesinaipericope. JournaloftheRoyalStatistical Society.SeriesA(General) ,pages406{421,1972. [5]BarronBrainerd.Onthedistinctionbetweenanovelandaromance:adiscriminant analysis. ComputersandtheHumanities ,7(5):259{270,1973. [6]BarronBrainerd.Weighingevidenceinlanguageandliterature:Astatisticalapproach. AMC ,10:12,1974. [7]JohnFBurrows.Word-patternsandstory-shapes:Thestatisticalanalysisofnarrative style. LiteraryandlinguisticComputing ,2(2):61{70,1987. [8]JohnFBurrows.anoceanwhereeachkind...:Statisticalanalysisandsomemajor determinantsofliterarystyle. ComputersandtheHumanities ,23(4-5):309{321,1989. [9]B.C.Chan.Authorshipofthedreamoftheredchamber:Acomputerizedstatisti- calstudyofitsvocabulary. DissertationAbstractsInternationalPartA:Humanities and[DISS.ABST.INT.PT.A-HUM.&SOC.SCI.] ,42(2):1981,1981. [10]KobyCrammerandYoramSinger.Onthelearnabilityanddesignofoutputcodesfor multiclassproblems.In ProceedingsoftheThirteenthAnnualConferenceonComputa- tionalLearningTheory ,COLT'00,pages35{46,SanFrancisco,CA,USA,2000.Morgan KaufmannPublishersInc. 61 [11]A.deMorgan.Lettertorev.heald18/08/1851.InS.ElizabethandD.Morgan,editors, MemoirsofAugustusdeMorganbyhiswifeSophiaElizabethdeMorganwithSelections fromhisLetters .London:Longman'sGreenandCo,1851/1882. [12]JoachimDiederich,orgKindermann,EddaLeopold,andGerhardPaass.Authorship attributionwithsupportvectormachines. Appliedintelligence ,19(1-2):109{123,2003. [13]ThomasG.DietterichandGhulumBakiri.Solvingmulticlasslearningproblemsvia error-correctingoutputcodes. arXivpreprintcs/9501101 ,1995. [14]AlvarEllegard. AStatisticalmethodfordeterminingauthorship:theJuniusLetters, 1769-1772 ,volume13.oteborg:ActaUniversitatisGothoburgensis,1962. [15]JillianMFarringdon,AndrewQueenMorton,MichaelGFarringdon,andMDavid Baker. AnalysingforAuthorship:AGuidetotheCusumTechnique .Universityof WalesPress1996. [16]LiGuoqiangandLiRui-fang.StudyBasedonStatisticsofwordFrequencyResearch onOnlyAuthorofthe"DreamoftheRedChamber"[J]. JournalofShenyangInstitute ofChemicalTechnology ,4,2006. [17]I.Guyon,J.Weston,S.Barnhill,andV.Vapnik.Geneselectionforcancer usingsupportvectormachines. Machinelearning ,46(1):389{422,2002. [18]D.I.HolmesandJ.Kardos.Whowastheauthor?anintroductiontostylometry. CHANCE-BERLINTHENNEWYORK- ,16(2):5{8,2003. [19]Chih-WeiHsuandChih-JenLin.Acomparisonofmethodsformulticlasssupportvector machines. NeuralNetworks,IEEETransactionson ,13(2):415{425,2002. [20]MatthewLJockersandDanielaMWitten.Acomparativestudyofmachinelearning methodsforauthorshipattribution. LiteraryandLinguisticComputing ,pagefqq001, 2010. [21]PatrickJuola.Authorshipattribution. FoundationsandTrendsininformationRe- trieval ,1(3):233{334,2006. [22]VladoKeselj,FuchunPeng,NickCercone,andCalvinThomas.N-gram-basedauthor forauthorshipattribution.In Proceedingsoftheconferencepassociation forcomputationallinguistics,PACLING ,volume3,pages255{264,2003. 62 [23]BradleyKjell.Authorshipattributionoftextsamplesusingneuralnetworksand bayesianIn Systems,Man,andCybernetics,1994.Humans,Information andTechnology.,1994IEEEInternationalConferenceon ,volume2,pages1660{1664. IEEE,1994. [24]BradleyKjell.Authorshipdeterminationusingletterpairfrequencyfeatureswithneural network LiteraryandLinguisticComputing ,9(2):119{124,1994. [25]BradleyKjell,WAddisonWoods,andOphirFrieder.Informationretrievalusingletter tupleswithneuralnetworkandnearestneighborIn Systems,ManandCyber- netics,1995.IntelligentSystemsforthe21stCentury.,IEEEInternationalConference on ,volume2,pages1222{1226.IEEE,1995. [26]MosheKoppel,ShlomoArgamon,andAnatRachelShimoni.Automaticallycategorizing writtentextsbyauthorgender. LiteraryandLinguisticComputing ,17(4):401{412,2002. [27]AndrewQueenMorton. Literarydetection:Howtoproveauthorshipandfraudin literatureanddocuments .BowkerLondon,1978. [28]FrederickMostellerandDavidWallace.Inferenceanddisputedauthorship:Thefeder- alist.1964. [29]A.Pawlowski.Wincentylutoslawski-aforgottenfatherofstylometry. Glottometrics , 8:83{89,2004. [30]CaoQingfu.Thelast40chaptersofdreamoftheredchamberwerenotwrittenbycao xueqincomparisonoffunctionwords,phrasesandchaptertitlesinthe80chapters andthelast40chapters. ADreamofRedMansions ,01:288{319,1985. [31]YuQingxiang.Applicationofstatisticsindreamoftheredchamber. JournalofUni- versityofPolitics ,76:303{327,1998. [32]J.Rudman.Thestateofauthorshipattributionstudies:Someproblemsandsolutions. ComputersandtheHumanities ,31(4):351{365,1997. [33]EfstathiosStamatatos.Asurveyofmodernauthorshipattributionmethods. J.Am. Soc.Inf.Sci.Technol. ,60(3):538{556,March2009. [34]FumitakeTakahashiandShigeoAbe.Optimizingdirectedacyclicgraphsupportvector machines. ANeuralNetworksinPatternRecognition(ANNPR) ,pages166{173, 2003. 63 [35]VVapnik.Estimationofdependenciesbasedonempiricaldata.1979. [36]VladimirNaumovichVapnikandVlamimirVapnik. Statisticallearningtheory ,volume2. WileyNewYork,1998. [37]ZhangWei-dongandLiuLi-chuan.Investigationofthelinguisticstylesofthe80 chaptersandthelast40chaptersofdreamoftheredchamber. JournalofShenzhen University ,1,1986. [38]JasonWeston,ChrisWatkins,etal.Supportvectormachinesformulti-classpattern recognition.In ESANN ,volume99,pages219{224,1999. [39]HuXiangfeng,WangYang,andWuQiang.Multipleauthorsdetection:aquantitative analysisof DreamoftheRedChamber . AdvancesinAdaptiveDataAnalysis ,6(4):Article ID1450012,18pages,2014. [40]LiXianping.Anewhypothesisonthewritingandpublicationofdreamofthered chamber. FudanJournalofSocialScienceEdition ,5,1987. [41]GUdnyYule.Onsentence-lengthasastatisticalcharacteristicofstyleinprose:With applicationtotwocasesofdisputedauthorship. Biometrika ,pages363{390,1939. 64