MACHINELEARNINGMETHODFORAUTHORSHIPATTRIBUTION
By
XianfengHu
ADISSERTATION
Submittedto
MichiganStateUniversity
inpartialentoftherequirements
forthedegreeof
AppliedMathematics|DoctorofPhilosophy
2015
ABSTRACT
MACHINELEARNINGMETHODFORAUTHORSHIPATTRIBUTION
By
XianfengHu
Broadlyspeaking,theauthorshipidenorauthorshipattributionproblemisto
determinetheauthorshipofagivensamplesuchastext,paintingandsoon.Ourmain
workistodevelopaneandmathe-soundapproachfortheanalysisofauthorshipof
doubtedbooks.
Inspiredbyvariousauthorshipattributionproblemsinthehistoryofliteratureandthe
applicationofmachinelearninginthestudyofliterarystylometry,wedeveloparigorousnew
methodforthemathematicalanalysisofauthorshipbytestingforaso-calledchrono-divide
inwritingstyles.Ourmethodincorporatessomeofthelatestadvancesinthestudyofau-
thorshipattribution,particularlytechniquesfromsupportvectormachines.Byintroducing
thenotionofrelativefrequencyofwordandphrasesasfeaturerankingmetricsourmethod
provestobehighlyeandrobust.
ApplyingourmethodtotheclassicalChinesenovel
DreamoftheRedChamber
hasled
toconvincingifnotirrefutableevidencethatthe80chaptersandthelast40chapters
ofthebookwerewrittenbytwotauthors.
AlsoapplyingourmethodtotheEnglishnovel
Micro
,weareabletocontheexistence
ofthechrono-divideandidentifyitslocationsothatwecantiatethecontributionof
MichaelCrichtonandRichardPreston,theauthorsofthenovel.
WehavealsotestedourmethodtotheotherthreeGreatClassicalNovelsinChinese.As
expectednochrono-divideshavebeenfoundinthesenovels.Thisprovidesfurtherevidence
oftherobustnessofourmethod.
Wealsoproposedanewapproachtoauthorshipidentosolvetheopenclass
problemwherethecandidategroupisnonexistentorverylarge,whichisreliablyscaled
fromanewmethodwehavedevelopedforthecloseclassprobleminwhichthecandidates
authorpoolissmall.Thisisattainedbyusingsupportvectormachinesandbyanalyzingthe
relativefrequenciesofcommonwordsinthefunctionwordsdictionaryandmostfrequently
usedwords.Thismethodscalesverynicelytotheopenclassproblemthroughanovel
authorrandomizationtechnique
,whereanauthorinquestioniscomparedrepeatedlyto
randomlyselectedauthors.Theauthorrandomizationtechniqueprovestobehighlyrobust
ande.Usingourapproacheswehavefoundanswerstothreewellknownauthorship
controversies:(1)DidRobertGalbraithwrite
Cuckoo'sCalling
?(2)DidHarperLeewrite
ToKillaMockingbird
ordidherfriendTrumanCapotewriteit?(3)DidBillAyerswrite
Obama'sautobiography
DreamsFromMyFather
?
ACKNOWLEDGMENTS
FirstandforemostIwanttothankmyadvisorYangWang,thesmartestpersonIknow.It
hasbeenanhonortobehisPh.D.student.Iwanttothankhimforhelpingmetoshapeand
guidethedirectionoftheresearchinparticular.Hissupportandinsightfuldiscussions
aboutresearchmademypursuitofPh.D.possible.Iappreciateallhiscontributionoftime,
ideasandfundingtomakemyPh.D.productiveandstimulatingoverthepasteyears.The
enthusiasmhehasforhisresearchwascontagiousandmotivationalforme,evenduringthe
toughtimesinthePh.D.pursuit.Beyondhisscienguidance,hisadviceonnavigating
anacademiccareerhasbeeninvaluable.
IwillforeverbethankfultomyadvisorMinWuinSouthChinaUniversityofTechnology.
Shewasandremainsmyexcellentexampleasasuccessfulwomanmathematician,mentor
andteacher.SheisthereasonwhyIdecidedgotopursueacareerinresearch.Iamgrateful
forhersupportandencouragementtopursuePh.D.
IthankMarkIwenforsharinghisGFFTcode.Itishewholeadedmetousec++and
showedtricksofdebug.Ireallyappreciatehispatientandtimeduringdiscussion.Thank
himverymuchforhelpingmetopreparejobapplicationmaterialsandsuggestionsduring
application.IalsowanttothankQiangWuforhelpingtostartmyprojectandmatlab.
Ithankhimforhisinspiringdiscussionandwonderfulsuggestionsinlife.Ialsowantto
thankAdityaViswanathanforhiswonderfulsuggestionsoftoolboxduringdiscussion.I
wouldliketothankZhengfangZhou,AndrewChristliebandYingdaChengforservingon
mydefensecommittee.
Iwouldliketothankmyfellowmathstudentsfortheirfriendshipoverthepasteyears
andforsharingthewonderfullifewithme,ZixuanWangandYuqiHonginparticular.The
iv
nightsspentplayingtennis,sharingdinner,orlaughingoverdesertmadetheexperienceof
graduateschoolmorerewarding.IwouldalsothankmyfriendsLipingChenandLiangmin
Zhouforhelpingmeduringstudyandteach.MyparentsJunlinandGenshui,mybrother
MubiaoandsistersNian,XueandJie,havebeenincrediblysupportivethroughoutmy
studentlife.ThankthemforgivingmeawarmfamilyandhappychildhoodandIdedicate
thisdissertationtothem.
v
TABLEOFCONTENTS
LISTOFTABLES
....................................
viii
LISTOFFIGURES
...................................
x
Chapter1Introduction
...............................
1
1.1AuthorshipAttribution..............................1
1.2RelatedWorks...................................4
1.3OrganizationoftheThesis............................6
Chapter2IntroductiontoMachineLearningTechniques
..........
7
2.1Introduction....................................7
2.2LinearSupportVectorMachine.........................8
2.2.1Separablecase...............................8
2.2.2Non-separablecase............................11
2.3NonlinearSupportVectormachine.......................12
2.4Multiclass.............................14
2.5CrossValidation..................................16
Chapter3StylometryAnaysis:DetectingChrono-DivideinWritingStyles18
3.1Chrono-divide...................................18
3.2Methodology...................................20
3.2.1Initialstylometricfeatureextraction..................20
3.2.2Datapreparation.............................21
3.2.3Featuresubsetselection.........................21
3.2.4Dataanalysis...............................23
3.2.5Thealgorithm...............................24
3.3CaseStudy:Analysisof
DreamoftheRedChamber
..............25
3.3.1Background................................25
3.3.2SeparabilityofthechaptersbyCaoandGao..............29
3.3.3Non-separabilityofthe80chapters.................34
3.3.4Analysisofchapters81-120:stylechangeovertime..........35
3.3.5Comparisonwith
ContinuedDreamoftheRedChamber
........37
3.4CaseStudy:AnalysisoftheotherthreeGreatClassicalNovels........38
3.5CaseStudy:Chrono-divideof
Micro
......................39
3.6Conclusion.....................................42
Chapter4OpenClassAuthorshipIden
...............
43
4.1Introduction....................................43
4.2CloseClassProblems...............................45
4.2.1Methodology...............................45
vi
4.2.2Casestudies................................47
4.3OpenClassProblems...............................51
4.3.1Databaseanddatapreparation.....................51
4.3.2Methodology...............................52
4.3.3Casestudies................................56
4.4Conclusion.....................................59
BIBLIOGRAPHY
....................................
60
vii
LISTOFTABLES
Table3.1:Thefeaturesandvalidationerrorsoftheobtainedfromtwo
randomlyselectedmodelingsubsets...................31
Table3.2:Relativefrequenciesofthetopranked8featuresineachofthefour
GreatClassicalNovels..........................39
Table4.1:
Cuckoo'sCalling
:resultbytheobtainedusing
onebookforeachofthetwoauthorsastrainingsamples.......48
Table4.2:
Cuckoo'sCalling
:resultbytheobtainedusing
onebookforeachofthefourauthorsastrainingsamples,groupI..48
Table4.3:
Cuckoo'sCalling
:resultbytheobtainedusing
onebookforeachofthefourauthorsastrainingsamples,groupII..48
Table4.4:
Cuckoo'sCalling
:resultbytheobtainedusing
twobooksforeachofthefourauthorsastrainingsamples......49
Table4.5:
ToKillAMockingbird
:resultbytheobtained
usingonebookforeachoftheeauthorsastrainingsamples....50
Table4.6:
DreamsFromMyFather
:resultbytheob-
tainedusingonebookforeachofthefourauthorsastrainingsamples.50
Table4.7:Matchingrateformulti-classtrainedbyrandomlycho-
senbooksbyeachauthor.........................53
Table4.8:Matchingrateformulti-classtrainedbyrandomlycho-
sensamplesbyeachauthor.......................53
Table4.9:Matchingrateformulti-classbyrandomlychosenbooks
byeachauthor..............................55
Table4.10:Matchingrateformulti-classwithrandomlychosen
samplesbyeachauthor..........................55
Table4.11:
Cuckoo'sCalling:
mastersuspectedauthorsandtheaveragematch-
ingrate..................................57
Table4.12:
ToKillaMockingbird:
theaveragematchingrate...........58
viii
Table4.13:
DreamsFromMyFather:
theaveragematchingrate..........59
ix
LISTOFFIGURES
Figure2.1:Hyperplanes...............................9
Figure2.2:Optimalhyperplane...........................9
Figure2.3:Linearlynon-separabledataset.....................11
Figure2.4:LinearSVMinnewspace.......................13
Figure2.5:3-class(a)one-vs-one,
H
ij
separatesclassiandclass
j;(b)one-vs-rest,
H
i
separatesclassifromotherclasses.......15
Figure2.6:Thek-foldcrossvalidation.......................17
Figure3.1:Experiment1:(a)Meancrossvalidationerrorrate;(b)Valuesof
SVMonchapters60-90.....................34
Figure3.2:Experiment2:(a)Meancrossvalidationerrorrate;(b)Valuesof
SVMonchapters31-50.Notethereisnochrono-divide...35
Figure3.3:Experiment3:(a)Meancrossvalidationerrorrate;(b)Valuesof
SVMonchapters96-105,whichcorrespondtothesamples
31-50inall80samples.Notetwosamplescomefromonechapterin
thisexperiment.............................37
Figure3.4:Distancesbetweenthe80chaptersoftheCheng-Gaoversion,
thelast40chaptersoftheCheng-Gaoversion,and30chaptersof
ContinuedDreamoftheRedChamber
..................38
Figure3.5:resultsonthetestingsamplesoftheotherthreeclassical
novels:(a)
RomanceoftheThreeKingdoms
;(b)
WaterMargin
;(c)
JourneytotheWest.
..........................40
Figure3.6:
Micro
:averagevaluesofthe7...............41
Figure3.7:
Micro
:resultbytheobtainedusingthe
22partsandthelast22partsofthebookastrainingsamples....42
x
Chapter1
Introduction
1.1AuthorshipAttribution
DidFrancisBaconorChristopherMarlowewritesomeoftheplaysthatwereattributed
toWilliamShakespeare?DidCaoXueqinwritethelastfortychaptersof
Dreamsofthe
RedChamber
,oneofthegreatestmasterpiecesinthehistoryofChineseliterature?Who
wrotethe
Federalist
papers?DidBillAyerswriteObama'sautobiography
DreamsFromMy
Father
?Therehavebeennoshortagesofauthorshipcontroversiesthroughoutthehistory,
andthesearejustafewofthem.Thesequestionshaveoftengenerateddebates
amonghistoriansandliteraryscholars.Historicallyconnoisseurship,disciplinaryknowledge
andbackgroundhistoryhavebeencentralinsuchdebates.Inrecentyears,however,the
\unemotionalapproach"usingquantitativeanalysisbasedonmathematics,statisticsand
computationhasgainedprominenceinthestudyofauthorshipattribution.
Theauthorshipattributionistodeterminetheauthorshipofworkslikebooks,visual
artsandsoonwhoseauthorsareunknown.Therearetwoclassesofproblemsinauthorship
attribution[21]ingeneral.Themostcommononeisthe
closeclass
,whereitisknownthat
thesampleoftextisbyoneoftheauthorsfromagivenset,usuallyaverysmallsetwithtwo
tothreeauthors.Forexample,
Federalist
papersmentionedearlierisatypicalclosedclass
problem,wheretheauthorsaretoAlexanderHamilton,JamesMadisonandJohn
Jay.Farmorechallengingisthe
openclass
problem,wherethesampletextmaycomefrom
1
afarlargersetofauthorsorevennosuspectauthor.Anexampleofanopenclassproblem
istodeterminetheauthorshipofbookswrittenanonymouslyorunderpseudonyms,e.g.
ThePrimaryColor
or
ImperialHubris:WhytheWestisLosingtheWaronTerror
.The
studyofauthorshipattributionhasseenrapidgrowthinrecentyears.Still,thereremains
manychallengesevenfortheclosedclassproblem.Accuracyisoneofthemostprominent
concerns.Fortheopenclassproblemthechallengeisfargreater,andsofartherehavebeen
fewattemptsandnoneoftheknowntechniquescanbereliablyscaledtohandletheopen
classproblem.
Westudybothopenclassandclosedclassproblemsinauthorshipattribution.The
partofthisthesisfocusonanimportantauthorshipquestioninclosedclassproblem,
whichwecallchrono-divide,namelytodeterminewhetherabodyoftextiswrittenbya
singleauthor.Suchastudyisvaluableinanumberofways.Forexample,itiswidely
speculatedamongtheShakespeareanscholarsthatsomeoftheShakespeareplayswerein
factcollaborativelywrittenbyShakespeareandothers,e.g.MiddletonandFletcher.Itisof
scholarlyinteresttonotonlyitbutalsotooutexactlywhichpartsoftheplays
werewrittenbyplaywrightsotherthanShakespeare.Therewerealsohistoricalcontroversies
involvingsuspectedfraud,asinthecaseof
DreamoftheRedChamber
,whereaparticular
bookorsequelattributedtocertainwellknownauthormightinfactbeperpetratedby
someoneelse.Arelatedpracticewasforawellknownandauthortowriteonly
thefewchaptersofabookandthenpassitontoaghostwriter.Thereareclear
bbothethicallyandscholarlytodetectfraudsandghostwritings.Moderndayssee
anexplosionofcoauthoredbooksandarticles.Itwouldbeinterestingtodetectstylistic
inconsistenciesamongpartsofsuchbooks.Asweshallsee,mathematicscanplayacentral
roleinthestudyofauthorship,leadingtotherapidgrowthofthe
stylometry
.
2
Wewillintroduceourmethodonthedetectionofchrono-divideinwritingstyleswith
agivenbodyoftextinthispartandusetwocasestudiestoshowwaysitcanbedone,
andtoillustratetheenessandrobustnessofourmethodsbycomparingthestudyof
casesjustbyoneauthor.ThecasestudyistheclassicalChinesenovel
Dreamofthe
RedChamber
.Thecontroversysurroundingthebookwaswellknownandintenselydebated
intheChineseliterarycircleforover250years.Thesecondcasestudyisthebook
Micro
writtenbyMichaelCrichtonandRichardPreston.TheoneauthorcaseswillbeChina's
otherthreeclassicalnovels,
ThreeKingdoms
,
JourneytotheWest
and
TheWaterMargin
.
Theresearchforthecasestudyhasbeendoneinourwork[39],andmuchofwhatwe
presenthereisreproducedfrom[39].Allmaterialsinthesecondcasestudyarenew.
Thesecondpartofthisthesiswillbethestudyofopenprobleminauthorshipattribution,
whichistodetecttheauthorofabodyoftextwhoseunknowntrueauthorisamongalarge
setofcandidateauthors.Inmanycasesthereisn'tevenasuspectsothesetofcandidate
authorsistheentiredatabase.Tosolveanopenclassidenproblem,wepropose
anewmethodforsolvingthecloseclassproblemwherethenumberofcandidateauthorof
atextissmall(e.g.

8).Wethenreliablyscaleittosolvetheopenclassproblemthrough
theauthorrandomizationtechnique.Aslongasthetrueauthorisinourdatabasewecan
reliablyidentifytheauthor.Furthermore,ifthetrueauthorisnotinthedatabasewecan
alsoreliablyruleoutanyauthorinthedatabase.
Wewilldescribeourtechniquesandthealgorithmofourapproach,anddemonstratethe
enessthroughthreecasestudies:(1)
Cuckoo'sCalling
byRobertGalbraith,which
turnedouttobethepseudonymforJ.K.Rowling;(2)
ToKillaMockingbird
byHarperLee
in1960,whichsomespeculatedtobewrittenbyherlongtimefriendTrumanCapote;(3)
DreamsFromMyFather
byPresidentObama,whoseauthorshipisindoubteversinceBill
3
Ayers\confessed"thathewasthetrueauthor.
1.2RelatedWorks
Theproblemofstylequanandauthorshipattributionintheliteraturegoesatleast
asfarbackas1854bytheEnglishmathematicianAugustusDeMorgan[11],whoinaletter
toaclergymanonthesubjectofGospelauthorship,suggestedthatthelengthsofwords
mightbeusedtotiateauthors.In1897theterm
stylometry
wascoinedbythe
historianofphilosophy,WincentyLutaslowski,asacatch-allforacollectionofstatistical
techniquesappliedtoquestionsofauthorshipandevolutionofstyleintheliteraryarts(see
e.g.[29]).Today,literarystylometryisawelldevelopedandhighlyinterdisciplinaryresearch
areathatdrawsextensivelyfromanumberofdisciplinessuchasmathematicsandstatistics,
literatureandlinguistics,computerscience,informationtheoryandothers.Itisacentral
areaofresearchinstatisticallearning(seee.g.[18]).
Themainassumptionofauthorshipattributionisthateverypersonhashisorherown
styleofspeaking,writingandpainting.Thesestylescanbequanusingsomesocalled
\authorialts"[21]orfeatures.Overthehistoryofauthorshipattributionstudy
manytkindsoffeatureshavebeenproposedforauthorshipidenThegoal
hasalwaysbeentocertainfeatureorfeaturesthatwilltiateoneauthorfrom
another.Yule[41]suggestedthelengthofsentenceasafeatureofwritingstyleandapplied
ittotwocasesofdisputedauthorships.Apopularclassictechniqueforstylometricanaly-
sisofauthorshipinvolvescomparingfrequenciesoftheso-called
functionwords
,aclassof
wordsthatingeneralhavelittlecontentmeaning,butinsteadservetoexpressgrammatical
relationshipswithotherwordswithinasentence.Thetpercentagesofnouns,verbs,
4
adjectives,adverbsandotherpartsofspeechcanalsobeusefulcharactersofwritingstyle,
see[1,5,6].Ellegard[14]determinedtheauthorshipoftheJuniusLettersusingfunction
wordsfrequenciesandsynonympairs(e.g.ratioof\big"vs\large").MostellerandWal-
lace[28]focusedonthedistributionof30functionwordsfromthevariousFederalistpapers
andanalyzedtheauthorshipofthem.Popularearlierfeaturesalsoincludeaveragelength
ofwords,cumulativesums(Qsums)[3,4,15,27],n-grams.In[22]theauthorspresented
amethodforcomputer-assistedauthorshipattributionbasedoncharacter-leveln-gramau-
thorsee[21]and[33]foracomprehensivereview.Whilesomeofthesefeatures
aree,manyhaveshowntohavelessdiscriminatingpower.Weshallnotdiscussthe
advantagesoftfeatureshere.Insteadwereferallinterestedreaderstotheexcel-
lentsurveyarticlesbyJuola[21]andStamatatos[33]foracomprehensivesummaryofthe
prosandconsoftfeaturesinstylomteryanalysis.Ithasbeennoted,however,that
frequenciesofcontentindependentfunctionwordsareamongthemostefeatures[7].
Theofliterarystylometryhasseenimpressiveadvancesinrecentyears,withmore
andmorenewandsophisticatedmathematicaltechniquesaswellassoftwaresbeingdevel-
oped.Machinelearninghasproventobeapowerfultoolintioninlastdecadeand
researchersstartedtouseitinauthorshipattribution.Koppel[26]usedtheexponential
Gradient(EG)algorithmofKivinenandWarmuthtoalinearseparatorbetweenmale-
authoredandfemale-authoreddocuments.Burrows[7][8]appliedprincipalcomponents
analysis(PCA)onwordfrequenciestoauthorshipattribution.Kjell[24,23,25]analyzed
authorshipusingneuralnetworkBayesianandnearestneighborclassi-
[12]appliedsupportvectormachinetoidentifyauthorsofGermannewspaperarticles.
Jockers[20]evencomparedtheperformanceofemethods(Delta,k-nearestneighbors,
thesupportvectormachine,nearestshrunkencentroids,regularizeddiscriminantanalysis)
5
intheclassicauthorshipattributionprobleminvolvingtheFederalistPapers.
1.3OrganizationoftheThesis
Therestofthesisisorganizedasfollows:wewillpresentanoverviewofmachinelearning.
WewillintroduceclassicSupportVectorMachineinparticularanditsgeneralizationto
multi-classinchapter2.Chapter3willintroduceourstudyonthestylometry
analysis,whichfocusondetectingchrono-divideinwritingstyles.Wewilldescribeour
algorithmindetailandanalyzeexperimentalresults.Inchapter4,wewillshowourmethod
forauthorshipattributionproblemswherethetrueauthorisknowninasmallgroupand
extendthismethodtoopenclassproblemswherethecandidateauthorsofatextisinlarge
setorevennosuchset.Wewillreporttheaccuracyofourmethodforcloseproblemsand
showtherobustnessofourmethodforopenclassproblemsthroughextensiveexperiment
andthetestingresultsofthreecasesinlastpartofthischapter.
6
Chapter2
IntroductiontoMachineLearning
Techniques
2.1Introduction
Themainobjectiveofmachinelearning,aofcomputerscience,istolearnthe
patternsofthegivendatawhichisrepresentedbyso-calledfeaturesordescriptors.The
underlyingprincipleofmachinelearningisthatthedatafromreallifeisnotgenerated
randomly,therealwaysexistpatternsinit,althoughwedon'tknowexactlywhattheyare.
Machinelearningcanexplorethepatternsofthegiventrainingdataandmakepredictionfor
thenewdataset.Machinelearningiswidelyusedinallkindsofarealikefacerecognition,
pagerankingandspeechrecognition.Therearethreetypesofproblemsinmachinelearning
ingeneral:

Supervisedmachinelearningwheretheoutputofthetrainingsampleisgiven,andthe
goalistopredicttheoutputsofnewsamples;

Unsupervisedmachinelearningwherenolabelisgivenforthetrainingsample,and
thegoalistothestructureofthetrainingdata;

Reinforcementwherethegoalisgivenandthetrainingdataisdynamic.
7
Generallyspeaking,machinelearningistheprocessofoptimization.Untilnow,many
optimizationalgorithmshavebeenproposedforttypeofprobleminmachinelearn-
ing,suchasNaiveBayes,NearestNeighbors,SupportVectorMachine(SVM)andsoon.In
thisthesis,IwillfocusonSVM,whichisproposedbyVapnik[35]in1970s.Itisthemost
popularmachinelearningalgorithmssincethen.Itisasupervisedlearningmethodwhichis
usedinbothfeatureselectionandpatternrecognitions.SVMhasbeenhighlydevelopedin
boththeoryandalgorithmandfamousforitscapabilityofdealinglargedataset.
2.2LinearSupportVectorMachine
2.2.1Separablecase
Inthebinarysetting,giventhetrainingdataset
f
(
x
1
;y
1
)
;
(
x
2
;y
2
)
;:::;
(
x
l
;y
l
)
g
;x
i
2
R
n
;y
i
2f
1
;

1
g
;i
=1
;:::;l
wedenote
I
=
f
x
i
j
y
i
=1
g
II
=
f
x
i
j
y
i
=

1
g
Wesaythat
I
and
II
arelinearseparableifthereisaunitvector
w
andconstant
b
suchthat
x
>
i
w
+
b>
1
8
x
i
2
I
x
>
i
w
+
b<

1
8
x
i
2
II;
(2.1)
8
i.ethedatacanbeseparatedbythehyperplane
x
>
w
+
b
=0
Figure2.1:Hyperplanes
Figure2.2:Optimalhyperplane
Aswecanseein2.1,therewillbemanyhyperplaneswhichcanseparate
thetwogroups.Whichoneshouldwechoose?Tobestseparatethetwogroups,weneed
tomaximizethemargin
ˆ
(equation2.2)asin2.2,whichisthedistancebetweenthe
9
hyperplane
b
11
and
b
12
.Thehyperplanewhichseparatesthetwogroupsofthetrainingdata
setandhasthemaximalmarginiscalledthe
OptimalHyperplane
anditisunique(see[36]).
ˆ
=
2
k
w
k
2
(2.2)
Hence,theoptimalhyperplanecanbeconstructedasthefollowingPrimaloptimization
problem:
min
w
2
H
1
2
k
w
k
2
s.t.
y
i
(
x
>
i
w
+
b
)

1
8
i
2f
1
;:::;l
g
(2.3)
Tosolvethisquadraticoptimizationproblem,weneedtothesaddlepointofthe
Lagrangefunction:
L
(
w;b;
)=
1
2
w
>
w

l
X
i
=1

i
[
y
i
(
x
>
i
w
+
b
)

1]
;
(2.4)
where

i

0aretheLagrangemultipliers.
AccordingtotheFermattheorem,theminimumizerofthisfunctionhastosatisfythe
followingcondition:
@L
(
w
)
@w
=
w

l
P
i

i
y
i
x
i
=0
@L
(
w
)
@b
=
l
P
i

i
y
i
=0
:
(2.5)
Fromequation2.5,weget:
w
=
l
P
i
=1

i
y
i
x
i
l
P
i
=1

i
y
i
=0
(2.6)
Plugequation2.6intoequation2.4,wehavechangedtheprimalproblemtoits
dual
10
problem:
max

W
(

)=
P
i

i

1
2
P
i;j

i

j
y
i
y
j
x
>
i
x
j
s.t.
l
P
i
=1

i
y
i
=0

i

0
8
i
2f
1
;:::;l
g
(2.7)
Thus,bysolvingequation2.7,wecanobtain

i
.Meanwhileweget
w
byequation2.6and
b
=
1
2
(min
x
i
2
I
x
>
i
w
+max
x
i
2
II
x
>
i
w
).Theinstanceswith

i
>
0arecalled
SupportVectors
,which
areclosesttothehyperplanes
b
11
and
b
12
.Finally,weobtainthedecisionfunction:
f
(
x
)=
w
>
x
+
b
andthesignof
f
(
x
)isthepredictedlabelofinstance
x
.
2.2.2Non-separablecase
Figure2.3:Linearlynon-separabledataset
Inthepreviouspart,wehaveshowedhowtotheoptimalhyperplaneforlinearsep-
arabledata,butinreallife,mostofdataareeithernotlinearlyseparable(e.g2.3),or
onlyseparablewithverysmallmarginsbecauseofthepresenceofoutliers.Wewillgeneral-
11
izetheconceptofoptimalhyperplaneforthenon-separablecasebyrelaxingtheconstraints
2.1inthispart.Byintroducingnonnegative
slack
variables
˘
1
;:::;˘
l
intoconstraints2.1,
wecanconstructthehyperplanewiththesmallesterrors.Thus,theoptimizationproblem
becomes:
min
w
2
H;˘
2
R
l
1
2
k
w
k
2
+
C
l
P
i
=1
˘
i
s.t.
y
i
(
x
>
i
w
+
b
)

1

˘
i
˘
i

0
8
i
2f
1
;:::;l
g
(2.8)
where
C>
0istheparameterthatspthepenaltyofWhenweset
C
tobelarge,theaboveproblemisthesameasthelinearcase.Similartothelinear
separablecase,weneedtothesaddlepointofthefollowingLagrangianinordertosolve
theaboveoptimizationproblem:
max

W
(

)=
P
i

i

1
2
P
i;j

i

j
y
i
y
j
x
>
i
x
j
s.t.
l
P
i
=1

i
y
i
=0
0


i

C
8
i
2f
1
;:::;l
g
(2.9)
Thisformulaissimilartothelinearcaseexceptfortheupperboundof

i
anditcanbe
solvedinthesameway.
2.3NonlinearSupportVectormachine
Untilnow,wehaveonlydiscussedhowtouseSVMstodealwithlinearlyseparabledatasets.
However,forthedatain2.4,whichisnotlinearseparable,wecanextendthelinear
SVMtomoregeneralformsusing\kernel"functionsandseparatethedatawithlinearSVM
inanewspace.ThebasicideaofnonlinearSVMistomaptheinputvectors
f
x
i
g
intothe
12
Figure2.4:LinearSVMinnewspace
highdimensionalspacethroughnonlinearmapping
˚
.Inthenewspace,wecanconstruct
theoptimalhyperplaneasinprevioussection,butnonlinearintheinputspace.
Insteadofcomputinginnerproduct
h
x
i
;x
j
i
,wenowcalculate
K
(
x
i
;x
j
)=
h
˚
(
x
i
)
;˚
(
x
j
)
i
,
whichiscalledakernelfunction.Thustheoptimizationproblemis:
max

W
(

)=
P
i

i

1
2
P
i;j

i

j
y
i
y
j
K
(
x
i
;x
j
)
s.t.
l
P
i
=1

i
y
i
=0
0


i

C
8
i
2f
1
;:::;l
g
(2.10)
Similarly,thedecisionfunctionwillbe:
f
(
x
)=
w
>
˚
(
x
)+
b
=
l
X
i
=1

i
y
i
K
(
x;x
i
)+
b
(2.11)
andthesignof
f
(
x
)isthethelabelofx.
Remark1
Akernelfunction
K
needstosatisfyMercer'stheorem,whichrequiresthatthe
matrix
K
=(
K
(
x
i
;x
j
))
l
i;j
=1
besymmetricandpositiveThefollowingthree
typesaretypical:
1.LinearKernel:
K
(
x;y
)=
h
x;y
i
;
13
2.PolynomialKernel:
K
(
x;y
)=(
h
x;y
i
)
d
;
3.RBFKernel:
K
(
x;y
)=exp(

k
x

y
k
2
2
˙
2
)
.
tkernelsareusedtodealwithdttypesofclascationproblems.Sofor
agiventrainingdatasetwithoutanyknowledgeofitsgeometricstructure,itisbetterto
trytmodelsbychoosingerentkernelsandtheirparameterscarefully.Weusually
dividethetrainingdataintotwoparts,oneofwhichisusedtotrainthemodelandtheother
one(calledvalidationdataset)totestthemodel.Wechoosethestablemodelwhichyields
thehighestvalidationaccuracy.Forthedataweuseinourstudyofauthorshipiden
linearSVMprovestobethesimplestaswellasthebestkernelcomparedtotheothertwo
typesofkernels.
2.4Multiclass
Intheprevioussection,wehavediscussedtheclassicalSVMwhichsolvesbinary
problems.Butinmostscenarios,therearemultiplecategoriesforthetrainingset,i.e.
f
(
x
1
;y
1
)
;
(
x
2
;y
2
)
;:::;
(
x
l
;y
l
)
g
;x
i
2
R
n
;y
i
2f
1
;
2
;:::;N
g
;i
=1
;:::;l
howtoextendthebinarySVMtothemulti-classproblem?
Thebasicandmoststraightforwardideaistodecomposetheproblemintoaseriesof
binaryproblems,andthencombinetheresultstoobtainthemulticlassclassi-
Therearemanytechniquesthathavebeendevelopedsofar,suchasone-vs-rest,
one-vs-one,decisiondirectedacyclicgraph(DAG)in[34],errorcorrectingcodein[13],single
machinein[38]andsoon.Crammer[10]andVapnik[36]evenconsideralltheclassesto-
14
(a)(b)
Figure2.5:3-class(a)one-vs-one,
H
ij
separatesclassiandclassj;(b)one-
vs-rest,
H
i
separatesclassifromotherclasses.
getherinoneoptimizationformulation,butitiscomputationallyexpensiveandnotwidely
used.Inthemeantime,theone-vs-restandone-vs-onearesosimpleandethatthey
havealwaysbeenthemostpopulartechniquesforthemulticlassproblems.
Herewewilldiscussone-vs-restandone-vs-oneindetail.
One-vs-Rest
istobuild
N
tbinary
f
f
i
g
N
i
=1
,eachofwhich
distinguishesoneclassfromtheremaining
N

1classestogetherviewedasasingleclass.
Toconstructthe
i
-thbinaryweletthesamplesfromthe
i
-thclassbethepositive
samples,andothersamplesfromtheother
N

1classesbethenegativesamples.Fortest
sample
x
,weattributeittotheclass
j
0
where
j
0
=argmax
f
i
(
x
).Theadvantageofthis
methodisthatwejustneedtosolve
N
optimizationproblemstosolve
N
class
problem.Buttherearethreemaindisadvantagesusingthismethod:

thereisnoboundonthegeneralizationerror;

thescaleofthedecisionvaluesmaybequitetint

thedataisusuallyextremelyimbalanced,wherethenegativesampledominateseach

15
One-vs-One
istobuildonecforeachpairofclasses,whichyields
N
(
N

1)
=
2
f
f
ij
g
N
i;j
=1
(note
f
ij
=

f
ji
)intotal.Each
f
ij
istodistinguish
classes
i
and
j
,anditcanbeviewedasthescoreofa\match"betweenclass
i
andclass
j
.A
sample
x
isattributedtoclass
i
0
throughsomevotingscheme,forexamplewemaychoose
i
0
=argmax
P
j
f
ij
(
x
),or
i
0
betheclassthathaswonthemostnumberof\matches",
i
0
=argmax
P
j
sign(
f
ij
(
x
)).Althoughthismethodhastheadvantageofmorebalanced
trainingsamplesizesforeachitalsohassomedisadvantages:

Thereisnoboundsonthegeneralizationerror;

computationallyitisexpensivebecauseitrequires
N
(
N

1)
=
2

itcaneasilycauseov
Bothstrategieshavetheiradvantagesanddisadvantages,soitishardtoconcludewhich
oneisalwaysbetter.Theperformanceofone-vs-restandone-vs-onehasbeenobserved
tobecomparableingeneral[19],sowechoosetheone-vs-reststrategyconsideringthe
computationalcost.Becausetheobjectiveistheoverallaccuracy,wechooseonesingle
parametersettomaximizeoverallaccuracyratherthanselectingatparameterset
foreachThelatterispronetoov,inwhichcasethetrainingaccuracyis
highbuttestingaccuracyisusuallylow.
2.5CrossValidation
Crossvalidationisastrategytoassesstheaccuracyofamodelwhenthetestsetisnot
available.Theideaofoneroundofcrossvalidationistoseparatethetrainingdataintotwo
parts,oneofwhichisusedtotrainthemodel,theremainingpart(calledvalidationdata)
16
Figure2.6:Thek-foldcrossvalidation
isusedtopredicttheperformanceofthemodel.In
k
-foldcrossvalidation(see2.6),the
trainingsetispartitionedinto
k
parts,inwhich
k

1partsareusedtotrainthemodel,and
theremainingpartisusedtotesthowaccuratethemodelis.Eachpartworksastestset
onceandonlyonce,sotherearekmodelsconstructedin
k
-foldcrossvalidation.
Theadvantageof
k
-foldcrossvalidationisthateachsampletesthasoneandonlyone
chancetotestthemodel.Whenthetrainingdatasetissmallorthesetofparameteris
verylarge,onemodelcaneasilyleadtoovCrossvalidationcantlyavoid
thisproblemandhelptoselectabetterandstablemodel.However,crossvalidationusually
requiresadditionalcomputationalcost.Inthisthesis,weusea5-foldcrossvalidationinthe
trainingprocessandapplythepluralityvotingschemetolabelthetestsamples.
17
Chapter3
StylometryAnaysis:Detecting
Chrono-DivideinWritingStyles
3.1Chrono-divide
Themainideabehindstatisticallyorcomputationally-supportedauthorshipattributionis
thatbymeasuringsometextualfeatureswecandistinguishbetweentextswrittenbyt
authors.Nearlyathousandtmeasuresincludingsentencelength,wordlength,word
frequencies,characterfrequencies,andvocabularyrichnessfunctionshadbeenproposedthus
far[32]overtheyears.Someofthesemeasures,suchasfrequenciesoffunctionwords,have
provenewhileothers,suchaslengthofwords,haveprovenlesse[21].The
ofliterarystylometryhasseenimpressiveadvancesovertheyears,andhasbecomean
increasinglyimportantresearchinthedigitalagewiththeexplosionoftextsonline.
Thispartfocusesonaparticularclassofauthorshipcontroversies,inwhichthereisa
suspectedchangeofauthorshipatsomepointofabook.Inotherwords,onesuspectsthatthe

X
chaptersofabookwerewrittenbyoneauthorwhiletheremaining
Y
chapterswere
writtenbyanother.Clearly,theauthorshipcontroversyfor
DreamoftheRedChamber
falls
intothiscategory.Sincenotwoauthorshaveexactlythesamewritingstyle,nomatterhow
similartheymightbe,abookwritteninsuchafashionwouldhaveastylisticdiscontinuity
goingfromChapter
X
toChapter
X
+1.Ifwecanquantifythestylesofthetwoauthorsbya
18
stylometricfunction
S
(
n
)(aclaser)where
n
denoteschapters,orchronologicallyordered
samples,ofthebookinquestion,thisstylisticdiscontinuitywillappearasadividingpoint
inthestylometricfunction
S
(
n
)goingfrom
n
=
X
to
n
=
X
+1.Becausethesamplesare
orderedbytime,weshallcallthisdivideinthestylometricfunction
S
(
n
)a
chrono-dividein
style
,orsimplyachrono-divide.
Theunderlyingprincipleofourstudyisthatifabookisinfactwrittenbytwoauthors
AandB,thenthereshouldexistagroupoffeaturesthatcharacterizetheoftheir
respectivestyles.Thesefeatureswillleadtoastylometricfunctionthatseparatethebook
intotwotclasses.Intherestofthispartweshallusethemoreconventionalterm

forsuchastylometricfunction.Thefoundationalprincipleforliterarystylometry
isbuiltaroundsuchSupposethatachrono-divideinstyleexists.Then
anewillshowabreakpointsomewhereinthemiddleofthebook,before
andafterwhichthegivespositivevaluesandnegativevalues,respectively.Thusin
analyzingabooksuspectedtobewrittenbytwoauthorswithachrono-divide,onecanlook
forathatgivesrisetosuchabreakpoint.Theexistenceofsuchawill
providestrongsupportforthetwo-authorhypothesis.Conversely,ifsuchacannot
befoundthenwecancontlyrejectthetwo-authorwithachrono-dividehypothesis.
Weusefunctioncharactersandwordstobuildandselectagroupofstylometricfeatures
havingthehighestdiscriminativepower,andfromwhichweconstructourWe
shalldetailourmethodinthefollowingsubsections.
19
3.2Methodology
3.2.1Initialstylometricfeatureextraction
Supposethebookinquestionissuspectedtobewrittenbytwoauthors.Forsimplicitywe
shallcallthepartwrittenbyauthorA
PartA
andthepartwrittenbyauthorB
PartB
.In
manycases,suchaswith
DreamoftheRedChamber
,bothPartAandpartBareknown.
Insomecases,theyarenotpreciselyknownlike
Micro
.However,forbookssuspectedto
haveachrono-dividefromauthorshipchange,thereisusuallyagoodestimateforwherethe
divideis.TypicallythefewchapterscanbetlyattributedtoAandthelast
fewchapterstoB.
Webeginbychoosingafeaturesetconsistingofthekindsoffeaturesthatmightbeused
consistentlybyasingleauthoroveravarietyofwritings.Typically,thesefeaturesinclude
thefrequenciesofwords(orcharactersforbooksinChinese),phrases,meanandvariation
ofsentencelength,andfrequenciesofdirectspeechesandexclamations,andothers.Inour
analysis,togetabetterunderstandingofanauthor'swritingstyle,wethemost
frequentlyusedcharactersandwordsinthebook,e.g.wewouldthe500mostfrequently
usedcharactersinthewholebook,fromwhichwepickoutonly,say,
n
functioncharacters.
Wechoose
m
words(combinationsofcharacters)amongthe300mostfrequentlyusedwords
inthesameway.Animportantpointisthatbyselectingonlyfunctioncharactersandwords
weobtainaselectionofcharactersandwordsthatare
contentindependent
.Thisleadsto
aninitialsetoffeaturesconsistingofthefrequenciesofthe
n
charactersandthe
m
words,
plusthemeanandvarianceofsentencelengthaswellasthefrequenciesofdirectspeeches
andexclamations.Thesefeatureswillbecomputedovergivensampletextsofthebook
(e.g.chapters).Wenormalizeeachsampletextinthefollowingway:setthemedianofthe
20
meanandvariationofsentencelengthandthefrequenciesofdirectspeeches,exclamations,
n
charactersand
m
wordsineachworkofAandBtobe1.Foreachsample,wenowget
n
+
m
+4features.
3.2.2Datapreparation
Havingconstructedtheappropriatefeaturevectors,webuildadistinguishingmodelthrough
amachinelearningalgorithm.Todosorequirescarefuldatapreparation.Sinceweusually
haveinhandonlylimitedsampleswhilethenumberoffeatureswillbeverylarge,buildinga
modeldirectlyontheentirebookwilleasilyleadtoovToovercometheov
problem,weusethestandardtechniqueofseparatingthewholedataintosamplesconsisting
oftrainingdataandtestdata.Ourmodelwillbeestablishedbasedonlyonthetrainingdata
whileitsperformanceistestedovertheindependenttestdata.IfweknowPartAandPart
Balreadythenasubsetofeachcanbedesignatedastrainingdata.Forbookssuspectedto
haveachrono-divideinstyle,thetrainingdatawillconsistofthefewchaptersandthe
lastfewchapters.Therestofthebookwillbeusedastestdata.
Inordertoobtainmoretrainingsetsandtestingsetsweshallchunkthebookinquestion
intosmallerpiecesofsampletextsofrelativelyuniformsizeandstyle.Inallthebookswe
havestudied,wehavekeptthesampletextstobeatleast1000characterslong.Inthecase
of
DreamoftheRedChamber
eachsampletextisachapter.
3.2.3Featuresubsetselection
Whenwebuildauthorshipanalysismodelusingthetrainingdataonly,wedonotuseallthe
features(
n
+
m
+4features).Insteadwestartoutwithallofthem,buteventuallyselecta
21
subsetoffeaturesthatachievesthehighestdiscriminativepowers.Featuresubsetselection
hasbeenwellunderstoodforhighdimensionaldataanalysisinthemachinelearningcontext.
First,thenumberofdiscriminativefeaturesmaybesmallbecausethenumberoffeatures
anauthorusesinaconsistentlytwayfromothersisusuallynotverybig.Moreover,
thecanperformverypoorlyiftoomanyirrelevantfeaturesareincludedintothe
model.InthispaperwewilluseSupportVectorMachinesRecursiveFeatureElimination
(SVM-RFE)introducedin[17]torealizefeatureselection.
SVM-RFEisafeaturerankingmethod.GivenasetofsampleswecanuselinearSVMto
buildalinearItrankstheimportanceofthefeaturesaccordingtotheirweights.As
mentionedabove,becauseoflargefeaturesizeandsmallsamplesize,themightnot
berobust.Inaddition,thehighcorrelationbetweenthefeaturesmayresultinsmallweights
forrelevantfeatures.ThustherankingbySVMdirectlymaybeinaccurate.Inorder
totheranking,theleastimportantfeatureisremovedandthelinearSVM
isretrained.Thisnewprovidesarankingfortheremainingfeatures.The
processisthenrepeateduntiltherankingofallfeaturesareThisistheSVM-RFE
methodintroducedin[17].TheideaunderlyingSVM-RFEisthatineachrepeat,although
theoverallrankingmaybepoor,theleastimportantfeatureisveryunlikelyarelevantone.
Byiterativelyeliminatingtheleastimportantfeaturesthenewwillbecomemore
andmorereliableandhencewillprovidebetterandbetterranking.Intheapplicationof
geneexpressiondataanalysisSVM-RFEhasbeenproventobesubstantiallysuperiortothe
SVMdirectrankingwithoutRFE.
HoweveringeneralSVM-RFEisnotstableundertheperturbationofsamples.Asmall
changeinsamplesmayresultinverytfeatureranking.Therearetwopossible
reasons.Oneisthatthehighlycorrelatedvariablesaretoosensitiveandmayberankedin
22
tordersbytAnotheristhat,duetotherandomness,somesubset
ofsamplesmightbesingularinthesensethattheyarelessrepresentativeforthewholedata
structure.AsaresulttheSVMareovandthefeaturerankingbySVM-
RFEisthereforeunreliable.Thesituationislessharmfulforperformance
whilethesecondisvital.Toovercomethisphenomenonandguaranteethestabilityofthe
ranking,weuseapseudo-aggregationtechnique.Werandomlychooseasubsetoftraining
samplestorunSVM-RFEtoselectthetopimportantfeatures.Thisprocessisrepeated
tensorhundredstimesandonlythosefeaturesthatappearimportantveryfrequentlyare
deemedastrulyimportantones.Thisremovestherandomnessandresultsinamuchmore
reliableranking.
Withthisrankingoffeatures,wecanconcludewhichstatisticsareusefulforquantifying
thewritingstyle.Weusecrossvalidationtoselectthenumberoffeaturesincludedinthe
model.Thisgroupoffeaturesisastableandmostdiscriminativesubset
offeatures.Aisbuilttoclassifythetestdata.
3.2.4Dataanalysis
Thewehavebuiltisusedtoanalyzetheauthorshipquestion.Weexaminethe
discriminativepoweroftheonthetrainingdata.Ifitcannotevenreliablyclassify
thetrainingdatawecanconvincinglyrejectthetwo-authorhypothesis.Evenifitcanthe
tellingstorywillbewhetheritcanclassify,ordetectachrono-divide,fromthetestdata.
Ifitfailsthenagainweshouldrejectthetwo-authorhypothesis.Ontheotherhand,ifthe
thetrainingdata,anditcanalsoclassifythetestdataaccuratelyordetect
aclearchrono-divide,wecanthenconvincinglyconcludethatthebookdoescontaintwo
twritingstylesandcanthereforebetlyattributedtotwotauthors.
23
Moreover,thefeaturesubsetandtheclaserdescribetheofthetwoauthors'
writingstyles.
3.2.5Thealgorithm
Inthefollowingwesummarizetheprocessofouralgorithm:
1.Initializethedata(thebook),whichcontainspartsAandBsuspectedtobewritten
bytwotauthors.
2.SplitpartAandpartBintomanysectionsandextractthefeaturesforeachsectionas
describedinsection3.2.1.Thisformsthewholedataset
D
,containing
D
A
and
D
B
.
3.Chooseaportion(e.g.20%-30%)of
D
A
and
D
B
respectivelytoformthetestdataset
andleavetheremainingasthetrainingdataset.Thetestdatawillnotbeuseduntil
themodelisbuilt.
4.Randomlychooseasubsetfromthetrainingdataasmodelingdataandtherest(again
20%-30%)asthevalidationdata.RunSVM-RFEonthemodelingdataandusethe
validationdatatodeterminealltheparametersused.Thisprovidesarankingofall
the
n
+
m
+4featuresextractedinstep2.
5.For
d
rangefrom1to
n
+
m
+4,buildausingonlythetop
d
featuresand
evaluatetheirperformanceonthevalidationdata.Thebestmodelistheonewith
minimalvalidationerrorandminimalnumberoftopfeatures.Thefeaturesubsetof
thisbestmodelisrecorded.
6.Repeat
T
timesstep4andstep5toobtain
T
bestmodelsand
T
subsetsofcorre-
spondingimportantfeatures.Werecommend
T
tobelargerthan50.Rankallthe
24
featuresinthesesubsetsaccordingtotheirappearancefrequency.Denote
N
asthe
totalnumberoffeaturesincluded.
7.For
d
=1
;:::;N
,usingcrossvalidationtoselectthenumberoffeaturesthatshould
beincludedintheDenoteitby
d

.Notewerequireboththecross
validationerrorandthenumberoffeaturestobeassmall.
8.Retrainthemodelusingthewholetrainingsetbasedonthistop
d

importantfeatures.
9.Usingthetoclassifythetestdata.Drawtheconclusionaccordingtothe
performance.
Sinceourrankingprocessinvolvesaggregationoflargenumberofmodelsthataretrained
usingSVM-RFEbasedontsubsetsofthesamedatasource,werefertoourapproach
aspseudo-aggregationSVM-RFEmethod.
3.3CaseStudy:Analysisof
DreamoftheRedCham-
ber
3.3.1Background
DreamoftheRedChamber
byCaoXueqinisoneofChina'sFourGreatClassicalNovels.For
morethanoneandahalfcenturiesithasbeenwidelyacknowledgedasthegreatestliterary
masterpieceeverwritteninthehistoryofChineseliterature.Thenovelisremarkableforits
vividlydetaileddescriptionsoflifeofChinainthe18thcenturyduringtheQingDynastyand
thepsychologicalofitslargecastofcharacters.Thereisavastliteraturein
Redology
,
atermdevotedexclusivelytothestudyof
DreamoftheRedChamber
,thattouchesupon
25
virtuallyallaspectsofthebookonecanimagine,fromtheanalysisofevenminorcharacters
inthebooktoin-depthliterarystudyofthebook.MuchofthescopeofRedologyisoutside
thefocusofthispaper.
Theoriginalmanuscriptof
DreamoftheRedChamber
begantocirculateintheyear
1759.Theproblemsconcerningthetextandauthorshipofthenovelareextremelycomplex
andhaveremainedverycontroversialeventoday,andtheyremainanimportantpartof
Redologystudies.Cao,whodiedin1763-4,didnotlivetopublishhisnovel.Onlyhand-
copiedmanuscripts{some80chapters{hadbeencirculating.Itwasnotuntil1791that
theprintedversionwaspublished,whichwasputtogetherbyChengWeiyuanandGao
Eandwasknownasthe
Cheng-Gaoversion
.TheCheng-Gaoversionhas120chapters,40
morethanthevarioushand-copiedversionsthatwerecirculatingatthattime.Chengand
Gaoclaimedthatthis\completeversion"wasbasedonpreviouslyunknownworkingpapers
ofCao,whichtheyobtainedthroughtchannels.Itwastheselast40chaptersthat
werethesubjectofintensedebateandscrutiny.Mostmodernscholarsbelievethatthese40
chapterswerenotwrittenbyCao.ManyviewthoselateadditionsastheworkofGaoE.
Somecritics,suchastherenownedscholarHuShi,calledthemforgeriesperpetratedbyGao,
whileothersbelievethatGaowasdupedintotakingsomeoneelse'sforgeryasanoriginal
work.Thereis,however,aminorityofcriticswhoviewthelast40chaptersasgenuine.
Theanalysisoftheauthenticityofthelast40chaptershaslargelybeenbasedonthe
examinationofplotsandprosestylebyRedologyscholarsandconnoisseurs.Forexample,
manyscholarsconsidertheplottingandproseofthelast40chapterstobeinferiortothe
80chapters.Othershavearguedthatthefatesofmanycharactersintheendwereincon-
sistentwithwhatearlierchaptershavebeenforeshadowing.Anaturalquestioniswhether
amathematicalstylometryanalysisofthebookcanshedsomelightonthisauthenticity
26
debate.
AlthoughthereisavastRedologyliteraturegoingbackover100years,thenumberof
studiesofthebookbasedonmathematicalandstatisticaltechniquesissurprisinglysmall,
particularlyinviewofthefactthatsuchtechniqueshavebeenusedwidelyintheWestfor
settlingauthorshipquestions.AmongthenotableCao[30]meticulouslybrokedown
anumberoffunctioncharactersandwordsintoclassesaccordingtotheirfunctions.By
analyzingtheirfrequenciesinthe40chapters,themiddle40chaptersandthelast40
chapters,Caoconcludedthatthe80chaptersandthelast40chapterswerewrittenby
tauthors.Zhang&Liu[37]examinedtheoccurrenceof240charactersinthebook
thatareoutsidetheGB2312encodingsystem,andfoundthat210ofthemhaveappeared
exclusivelyinthe80chapterswhileonly20ofthemhaveappearedexclusivelyinthe
last40chapters.Thisledtothesameconclusionbytheotherauthors.Yu[31]studiedthe
authorshipbycombiningbothhistoricalknowledgeandstatisticaltools.InthestudyYue
testedtwohypotheses,thatthelast40chapterswerenotwrittenbythesameauthororthey
werewrittenbythesameauthor.Hisstudyfocusedonthefrequenciesof5particularfunction
characters,theproportionoftextstopoemsineachchapter,andafewotherstylometric
peculiaritiessuchasthenumberofsentencesendedwiththecharacter\Ma".Usingvarious
statisticaltechniquesthecomparisonsledthepapertodrawtheconclusionthatitisunlikely
thatthe80chaptersandthelast40chapterswerewrittenbythesameauthor.Atthe
sametime,usinghistoricknowledgeaboutthebookandtheoriginalauthorCaoXueqin,the
paperalsospeculatedthatitwasnotlikelythatthelast40chapterswerecreatedentirely
byasingletauthorsuchasGaoE.Intheoppositedirection,thestudiesofChan
[9]andLi&Li[16]concludedthattheentirebookwaslikelywrittenbyasingleauthor.
Thestudy[16]focusedontheusageoffunctionalcharacterswhile[9]examinedtheusage
27
ofsomeeightythousandcharacters.Bothstudiestabulatedthefrequenciesoftheselected
characters,whichledtoafrequencyvectorforeachofthe40chapters,themiddle40
chaptersandthelast40chapters.Thecorrelationsofthesefrequencyvectorswerecomputed.
In[16]thecorrelationswerefoundtobelargeenoughfortheauthorstoconcludethatthe
entire120chaptersofthebookwerewrittenbythesameauthor.In[9]afourthfrequency
vectorusingpartsofthebook
TheGallantOnes
wasaddedforcomparison.Theauthor
foundtlyhighercorrelationsamongthethreefrequencyvectorsfromchapters
of
DreamoftheRedChamber
thanthecorrelationsbetweenthefourthfrequencyvector
andthethree.Thisfactformedthebasisoftheconclusionbytheauthorthatall120
chapterswerewrittenbyasingleauthor.AtconclusionwasreachedbyLi[40].By
analyzingthefrequenciesof47functionalcharactersandapplyingseveralstatisticaltests
theauthorconjecturedthatthelast40chapterswereputtogetherbyGaoEusingunedited
andmanuscriptsbyCaoXueqinandhisfamilymembers.
Althoughsomeoftheseaforementionedstudiesareimpressiveintheirscopes,missing
conspicuouslyfromtheRedologyliteraturearestudiesbasedonthelatestadvancesinliterary
stylometry,particularlysomeofthenewandpowerfulmethodsfrommachinelearningtheory.
Whilecomparingthefrequenciesoffunctioncharactersandwordsisclearlyaviablewayto
analyzetheauthorshipquestion,careneedstobetakentoaccountforrandom
ofthesefrequencies,especiallywhensomeofthefunctioncharactersandwordsusedfor
comparisonhavelimitedoccurrencesoverallinthebookandsometimesnotatallinsome
chapters.Noneoftheaforementionedstudiesemployedcrossvalidationtoaddressrandom
Wehavesubstantialreservationsaboutdrawingconclusionsfromcorrelations
aloneasinthestudiesofChan[9]andLi&Li[16],becausethetiatingpowerof
anysinglevariablesuchascorrelationisratherlimited.Itwouldbeinterestingtoseea
28
morecomprehensivestudyofcorrelationsonalargecorpusoftextsinChinesetodetermine
itsectivenessasametricforauthorshipattribution,somethingtheauthorsfailedtodo
inbothstudies.Theuseofthebook
TheGallantOnes
in[9]forbenchmarkcomparison
iscurioustousinparticular,especiallyconsideringthattheauthordidnotlimittojust
functioncharacters.Thetwobooksareoftwotgenresandaretintheir
respectivebackgroundsettings.Consideringthese
and
thefactthat
TheGallant
Ones
isknown
not
tobewrittenbyCaoXueqin,itwouldbeashockifthecorrelation
betweenthelast40chaptersof
DreamoftheRedChamber
andthe80chaptersis
not
higherthanthecorrelationbetweenthelast40chaptersand
TheGallantOnes
.Itispossible
thatthecorrelationcomputedin[9]tellsmoreaboutthegenrethantheauthorshipofthe
books.Again,withoutextensiveevidencethatusingthesametechniquethecorrelation
betweentwobodiesoftextswrittenbytauthorsisgenerallylowevenwhentheplots
arecloselyrelated,theargumentmadein[9]isunconvincingatbest.
Havingestablishedarigorousprotocolforchrono-divides,wearenowinposition
toapplythisprotocoltoinvestigatetheauthorshipcontroversyoftheCheng-Gaoversionof
DreamoftheRedChamber
.Inparticularweinvestigatetheexistenceofachrono-divideat
Chapter80.
3.3.2SeparabilityofthechaptersbyCaoandGao
Thebookisdividedintosamples.Tobalancethenumberofsamples,wegenerateone
sampleforeachofthe80chapterswhileusingtheconventionalpracticeofduplicating
eachofthelast40chaptersintotwochapterstoobtain80samples.Fromthosesampleswe
extractthefeaturesbycalculatingthestatisticsproposedinsubsection3.2.1.Thesefeatures
arethennormalizedforfaircomparison.Intotalwehave196variables.Theyarethe144
29
charactersand48words,thenormalizedmeanandvariationofsentencelength,andthe
frequenciesofdirectspeechesandexclamations.
Toinvestigatetheauthorshipcontroversyweperformthreeseparatetests.Firstwebuild
aforthewholebookandlookfortheexistenceofachrono-divideatChapter80.
Foraddedrobustnessandreliabilitywealsoperformthesametestsonlyonthe80
chaptersandthelast40chapters.
IntheexperimentweapplyourmethodtothewholeChen-Gaoversionof
Dream
oftheRedChamber
.Samplesfromthe60chaptersaredesignatedastrainingsamples
foroneclasswhilesamplesfromthelast30chaptersaredesignatedastrainingsamples
foranotherclass.Theremainingsamples,fromChapter61to90,areheldoutastesting
samples.Thetrainingsamplesarefurtherrandomlysplitintomodelingdataof80samples
andvalidationdataof40samples.TheSVM-RFEisrepeated100timesand
d

ischosen
using50crossvalidationruns.Wehavethefollowingobservations.
InstabilityofSVM-RFE.
Therandomnessofthemodelingsethasresultedinvery
substantialinthenumberoffeaturesselectedaswellasfeaturerankings.The
resultedmayalsoperformquitetly.Table3.1liststhefeaturesselected
usingtworentmodelingdatasets.Oneselects11featuresandtheotherselectsonly4,
withonlyonefeatureincommon.Thealsoperformtly.Theexperiments
clearlyestablishtheinstabilityofSVM-REF.
Givensuchinstabilityonecannotreliablydrawanyconclusionsfromanysinglerun.For
example,ifamodelingdatasetseparatesthetrainingdatawellitmightbeduetoover-
ConverselyifitseparatespoorlyitmightbeduetoThisproblemis
overcomewithourPseudoAggregateSVM-RFEmethod.
30
Modelingset
FeaturesSelected
ValidationError
1
qu,de,jiu,hui,zhi,dao,shi,ne,bie,zuo
5/40
2
hui,fang,mei,haoxie
1/40
Table3.1:Thefeaturesandvalidationerrorsoftheobtainedfromtworandomly
selectedmodelingsubsets.
StabilityofPseudoAggregateSVM-RFE.
OurpseudoaggregateSVM-RFEapproach
repeatsSVM-RFE100timesusingrandomizeddatasets.Thedatasetfromeachrepeatis
usedtoselectasetoffeatures,fromwhichaisbeingbuilt.Forsimplicityweshall
refertothedataset,featuresandtheresultingclasrtogetherfromarepeatasa
model
.
Tocounterrandomctuationsweconsiderimportantfeaturestobethosethatappear
frequentlyamongthe100s.Thisreducedtheinstabilitycausedbyrandomness.In
fact,ourbeliefisasfollows:ifthetwoclassesarewellseparated,thereshouldexistasetof
featuresthathelptobuildagoodMostmodelingsubsetsshouldbeabletoselect
thesefeaturesoutandonlyalimitednumberofmodelingsetsmightbesingularandmiss
them.Conversely,ifthetwoclassescannotbewellseparated,noconsistentlydiscriminative
featuresexist.tmodelingsetmayleadtototallytfeaturesubset.Asaresult,
nofeatureappearswithhighfrequencyinall100models.Thisphilosophy,however,isonly
partiallytrue.Whenthetwoclassescannotbeseparated,themodelingprocesssometimes
canovthedatabyselectingalotofvariableswhichresultsinhighabsolutefrequencies
forsomelessimportantorirrelevantfeatures.Suchaphenomenonisusuallyaccompanied
bylargenumberofvariablesandlowvalidationaccuracy.Toimprovetheprocesswepropose
amoreappropriatemetric,whichwecall
relativefrequency
.Inrelativefrequencyweweight
thefrequencybytwocriteria.Inthecriteriaavariableappearinginshortmodelsis
31
weightedmorethanthevariablesappearinginlongmodels.Thisleadstoaweightof
h
(
n
j
)
foravariableinthe
j
thmodel,with
n
j
beingthenumberofvariablesinthe
j
thmodel.
Inthesecondcriteriaavariableinamodelwithhighpredictiveaccuracyisweightedmore
thanavariablewithpoorpredictiveaccuracy.Thisprovidesanotherweight
g
(
A
j
)fora
variableinthe
j
thmodel,where
A
j
denotestheaccuracyofthe
j
thmodelcomputedfrom
thevalidationprocess.Mathematicallytherelativefrequencyforavariable
x
i
inatestrun
of
M
repeatsisas
rf
(
x
i
)=
1
M
M
X
j
=1
g
(
A
j
)
h
(
n
j
)
1
(
x
i
appearsinmodel
j
)
:
(3.1)
Inourstudywealwaysset
M
=100.Also,weset
g
(
A
j
)=exp(
A
j

1
[2
A
j

1]
+
)where[
t
]
+
=
max
f
0
;t
g
and
h
(
n
j
)=[1

cn
j
]
+
forsomeconstant
c
.For
g
(
A
j
)theideaisthatifthe
weightshoulddecayfastiftheaccuracyiscloseto50%orlessbecauseitindicatesthatthe
issimplynote.For
h
(
n
j
)weputinapenaltyforthenumberofvariables
usedinamodel.Inourexperimentswehavechosen
c
=1
=
30,whichseemstoworkwell.
Ourexperimentsshowthatfeaturesyieldedfromrelativefrequencyrankingsarevery
stableandconsistent.Wehaveperformedrunsof100repeatsusingtrandomseeds
inMATLAB,andtheresultsarealwayssimilar.Anadditionalbofusingrelative
frequencyinsteadofabsolutefrequencyisthattheexistenceofaneis
typicallyaccompaniedbyhighrelativefrequenciesforthetopfeatures,whilelowrelative
frequenciesforthetopfeaturesusuallyimplypoorseparability.Hencewecanuserelative
frequencyasasimpleguideontheseparabilityofthesamples.Wewillshowsomeexamples
inthenextsection.
ResultsandConclusion.
Inexperiment1wehaveperformedarunof100repeatson
32
theentireCheng-Gaoversionof
DreamoftheRedChamber
.Altogether70featureshave
appearedinatleastonemodel.However,ofthoseonlyasmallnumberofthemhave
appearedwithhighenoughfrequencytobeviewedasbeingimportant.Weapplycross
validationtoselectthenumberoffeatures,andthemeancrossvalidationerrorrateagainst
tnumberoffeaturesisplottedinFigure3.1(a).Thetellsusthat10to50
featuresareenoughtotellthestylebetweenthetwoparts.Usinglesscharacters
andwordsist,whileusingmoredegradestheperformancealsobybringingintoo
muchnoise.Thesmallcrossvalidationerrorrateisencouraging,anditisalreadyhinting
astrongpossibilitythatthetwotrainingsamplesetshavetstylisticto
supportthetwo-authorhypothesis.
Tosettlethetwo-authorhypothesismoreelyweapplyouronthetest
data,whichuntilnowhasneverbeenusedduringthefeatureselectionandmodeling
process.Inparticularweinvestigatetheexistenceofachrono-divideinthevaluesobtained
throughclassiFigure3.1(b),whichplotsthesevalues,clearlyshowsachrono-divideat
Chapter80:ForChapter81-90theyieldsallnegativevalueswhileforChapters
61-80theryieldsallpositivevalueswiththeexceptionofChapter67.Allowing
somestatisticalabberationstooccur,ourresultsprovideanextremelyconvincingifnot
irrefutableevidencethatthereexistclearstylometricbetweenthewritingsofthe
80chaptersandthelast40chapters.Thisstronglysupportsthetwo-author
hypothesisfor
DreamoftheRedChamber
.Wealsonotethatourinvestigationdidnotneed
toassumethattheknowledgethatthestylisticchangeshouldbeatChapter80.Thefact
thatthechrono-dividewehavedetectedisindeedatChapter80lendsevenstrongersupport
tothetwo-authorhypothesis.
33
(a)(b)
Figure3.1:Experiment1:(a)Meancrossvalidationerrorrate;(b)ValuesofSVM
onchapters60-90.
Interestingly,thefactthatChapter67appearedasan\outlier"inourserves
asfurtherevidencetothevalidityofouranalysis.Itwasonlyafterthetestswerealizedthat
theauthorshipofChapter67itselfisoneofthecontroversiesinRedology.Unlikethemain
controversyabouttheauthorshipofthe80chaptersandthelast40chapters,experts
arelessintheirpositionshere.Again,ourresultsstronglysuggeststhatChapter67
isstylisticallytfromtherestofthe80chapters,anditmaynotbewrittenby
Cao.Ourisconsistentwiththeconclusionof[2].
3.3.3Non-separabilityofthe80chapters
Tofurthervalidateourmethodweapplythesameteststothe80chaptersof
Dream
oftheRedChamber
toseewhetherwecangetachrono-divide(Experiment2).Weusethe
30andlast30chaptersasthetrainingdataandleavechapters31-50asthetestdata.
34
Figure3.2showsthemeancrossvalidationerrorandthevaluesofSVMonthe
testdatachapters31-50.Theexperimentshowsmuchmorefeatureshavebeenselectedin
the100repeats,implyingtheyofaconsistentsubsetofdiscriminativefeatures.
Thelargeerrorsonthetrainingdataalsoindicatetheyforseparation.Whenthe
isappliedtothetestdata,thereisclearlynochrono-divide.Thissuggeststhatour
methodyieldsaconclusionthatiscompletelyconsistentwithwhatisknown.
(a)(b)
Figure3.2:Experiment2:(a)Meancrossvalidationerrorrate;(b)ValuesofSVM
onchapters31-50.Notethereisnochrono-divide.
3.3.4Analysisofchapters81-120:stylechangeovertime
Wenextapplyourmethodtothelast40chapters(Experiment3).Ourexperiment
hasalreadythattheyareunlikelytobewrittenbyCao.However,therearestill
debatesonwhetherthesewerewrittenentirelybyoneauthor(mostlikelyGaohimself),or
bymorethanoneauthor.Ourmathematicalanalysismaysomeinsighthere.
35
Wesplitthe40chaptersintotwosubsetsasbefore.ThetrainingdataincludeChapters
81-95asoneclassandChapters106-120asanother.Thetestdataarethemiddle10chapters.
Becauseoftherelativelysmallnumberofsampleswehavesubdividedeachchapterinto2
sectionstoincreasethesamplesize.Asaresultwenowhave60samplesinthetraining
dataand20intestdata,with2samplescorrespondingtoonechapter.Themeancross
validationerroroftheanditsclassivaluesonthetestsamplesare
showninFigures3.3(a)and(b)respectively.
Inthisexperimentweobservethattheperformanceintermsofboththeand
featurerankingisnoticeablyworsethanthatinExperiment1butsubstantiallybetterthan
thatinExperiment2.Furthermore,unliketheresultsfromthetwoexperiments,the
valuesfromtheshowaninterestingtrend.ComparedwithFigure3.2(b)where
thevaluesappearedtolackanyorder,thevalueshereexhibitacleargradualdownward
shift.Ontheotherhand,comparedtoFigure3.1(b)thevaluesplottedinFigure3.3(b)do
notshowaclearsharpchrono-divide,eventhoughthevalueschangegraduallyfrombeing
positivetobeingnegative.Whatittellsusisthatthewritingstyleofthelast80chaptershad
undergoneagraduatechange,butthischangeisunlikelytobeduetochangeofauthorship.
Ourresultsherecouldbesubjecttoseveralinterpretations.Oneplausibleinterpretation
isthatGaomightindeedobtainedsomeincompletesetofmanuscriptsbyCao,andtriedto
completethenovelbasedonwhathehadobtained.Thestylechangeisaresultofthelack
ofgenuineworkbyCaoasthestorydeveloped.Amoreplausibleinterpretationisthatthe
last40chapterswerewrittenbysomeonesuchasGaotryingtoimitateCao'sstyle,andover
timetheauthorbecamesloppierandreturnedmoreandmoretohisownstyle.
36
(a)(b)
Figure3.3:Experiment3:(a)Meancrossvalidationerrorrate;(b)ValuesofSVM
onchapters96-105,whichcorrespondtothesamples31-50inall80samples.Notetwo
samplescomefromonechapterinthisexperiment.
3.3.5Comparisonwith
ContinuedDreamoftheRedChamber
Itisworthmentioningthatthereareseveralotherattemptstocomplete
DreamoftheRed
Chamber
fromits80chapters,amongthemis
ContinuedDreamoftheRedChamber
byQiZichen.UsingthesamefeaturesforbuildingthecinExperiment1,wecan
computetheEuclideandistancesbetweenallchaptersandtheirdistancesofchaptersfrom
ContinuedDreamoftheRedChamber
,seeFigure3.4.Surprisingly,althoughthesefeatures
areobtainedinfavourofthebetweenCaoandCheng-Gao,theyleadtoeven
largerdistancebetweenthe80chaptersandthosechaptersof
ContinuedDreamofthe
RedChamber
.Itobviouslyimpliesthatthestyleofthe40chaptersbyCheng-Gaoismore
similartothe80chaptersbyCaocomparedto
ContinueddreamoftheRedChamber
.Maybe
that'swhytheCheng-Gaoversionismorepopularthanotherversions.
37
Figure3.4:Distancesbetweenthe80chaptersoftheCheng-Gaoversion,thelast40
chaptersoftheCheng-Gaoversion,and30chaptersof
ContinuedDreamoftheRedChamber
.
3.4CaseStudy:AnalysisoftheotherthreeGreat
ClassicalNovels
TofurtherbolsterthecredibilityofourapproachwetestourmethodontheotherthreeGreat
ClassicalNovelsinChineseliterature,
RomanceoftheThreeKingdoms
,
WaterMargin
,and
JourneytotheWest
.Unlike
DreamoftheRedChamber
,thereisnoauthorshipcontroversy
forthesethreenovels.Thusifourmethodisindeedrobustweshouldexpectnegativeanswers
forthetwo-authorhypothesesforallofthembynochrono-divides.
Aswith
DreamoftheRedChamber
,wespliteachofthethreenovelsintotraining
samplesandtestsamples.Both
RomanceoftheThreeKingdoms
and
WaterMargin
have
120chapters.Inbothcaseswedesignatethe30chaptersandthelast30chaptersas
thetwoclassesoftrainingdata,andthemiddle60chaptersastestdata.For
Journeytothe
38
West
thetwoclassesoftrainingdataaretheandlast25chaptersrespectively,withthe
middle50chaptersastestdata.
Weusethesameproceduretotestforchrono-dividesonthethreenovels.Compared
to
DreamoftheRedChamber
,theselectedfeaturesshowmuchlowerrelativefrequencies,
indicatingyintiatingbetweenthewritingstyles.Table3.2showtherelative
frequencies(with
c
=1
=
30)ofthetop8featuresforeachofthefourGreatClassicalNovels.
Alsoofnoteisthatinthecaseof
WaterMargin
,51featuresareusedtobuilda
fromthe60trainingdata,whichisclearlyanotherstrongindicationoftheculty.
Novel
Relativefrequenciesoftop8features
DreamoftheRedChamber
0.570.460.430.360.310.300.290.19
RomanceoftheThreeKingdoms
0.310.270.260.250.230.220.170.15
WaterMargin
0.180.170.160.160.140.110.110.10
JourneytotheWest
0.030.030.020.020.020.020.020.02
Table3.2:Relativefrequenciesofthetopranked8featuresineachofthefourGreatClassical
Novels.
Figure3.5plotsthevaluesfromtheforallthreenovels.Inallcasesthevalues
insuchawaythatitisquiteclearthatnochrono-dividesexist,asexpected.
Thisanalysisshowsthatourapproachcanreliablyrejectthetwo-authorhypothesiswhen
itisfalse,lendingfurthersupporttotheenessandrobustnessofourmethod.
3.5CaseStudy:Chrono-divideof
Micro
Micro
,atechno-thrillerpublishedposthumouslyin2011,isMichaelCrichton'snovel.
Itwasfoundonhiscomputeruponhisdeathin2008asanmanuscript.Harper
Collinscommissionedscience-writerRichardPrestontocompletethenovelfromCrichton's
notesandresearch.Although
DreamoftheRedChamber
isinChinese,theprincipleofour
39
(a)(b)(c)
Figure3.5:resultsonthetestingsamplesoftheotherthreeclassicalnovels:
(a)
RomanceoftheThreeKingdoms
;(b)
WaterMargin
;(c)
JourneytotheWest.
methodshouldapplytobooksinotherlanguages.
Micro
thusprovidesuswithagoodtest
example.Inthiscasestudy,wewilluseourapproachtocanddetectthechrono-
divideof
Micro
.Wewillalsoperformatnewtestusingbuiltdirectlyfrom
otherbookswrittenbyCrichtonandPrestonforcomparison.Thenewtestservesbothasa
validationofourmethodandasacomparison.Notethatthesecondoptionisnotavailable
for
DreamoftheRedChamber
.
InthedirecttestweusetheotherbookswrittenbyCrichtonandPrestonto
generatethetrainingdata.Atotalof17bookswrittenbyCrichtonand2booksbyPreston
wereusedfortraining.Theinitialfeaturesconsistofthefrequencyof241mostfrequently
usedwordsinthesebooks.Tobuildceachbookwasdividedintomultiplepieces
witheachpiececontainingapproximately2000words.Thefrequenciesofthethe241selected
wordsofeachpieceformasamplingpoint.Overall782datapointsforCrichtonand104
datapointsfromPrestonweregenerated.Toovercometheimbalanceofthesamplingpoints
forCrichtonandRichard,weonlyused728samplesforCrichtonandtheyaresplitinto7
subsets.EachsubsetiscombinedwiththesamplesforPrestontoformatrainingdataset,
fromwhichwebuildalinearclaser.Sototally7wereconstructed.Applying
thetodetectthechrono-dividein
Micro
,wechunkthebookinto56parts,each
40
containingabout2000words.Eachpartprovidesatestingsamplepoint.Weappliedthe7
tothistestingdata.Theaverageofthe7sareplottedinFigure3.6.The
resultshowsabreakpointataroundthe15th-16thsamplepoints.
Figure3.6:
Micro
:averagevaluesofthe7
Wecannowcomparetheabovemethodtotheearliermethodfor
DreamofRedChamber
.
Weassumethat,comparedwiththeoverallstyleofanauthoracrossmultiplebooks,the
styleoftheauthorinasinglebookwouldbemoreconsistent.Asaresultwedivided
Micro
into112partsofapproximately1000wordseach.Themostfrequentlyused265content
independentwordsfromthebookwereusedastheinitialfeatures.Weusethe22
samplepointsandthelast22samplepointsastrainingandvalidationdataandthemiddle
68samplepointsastestdata.TheresultsareshowninFigure3.7.Aclear
breakpointcanbeseenaroundthe29th-30thsamples.
Thetwoexperimentsctheexistenceofachrono-dividein
Micro
,andprovide
furtherevidenceofthevalidityofouroriginalapproachfordiscoveringandlocatingchrono-
divides.Asabyproduct,ourresultsshowthatthechangeofauthorshipfor
Micro
had
occurredbetween1/4and1/3ofthebook.ThisisconsistentwithwhatRichardPreston
hadindicatedinseveralinterviewsaboutthebook.
41
Figure3.7:
Micro
:ionresultbythecobtainedusingthe22partsand
thelast22partsofthebookastrainingsamples.
3.6Conclusion
Inspiredbyauthorshipcontroversyof
DreamoftheRedChamber
andtheapplicationof
SVMinthestudyofliterarystylometry,wehavedevelopedamathematicallyrigorousnew
methodfortheanalysisofauthorshipbytestingforachrono-divideinwritingstyles.We
haveshownthatthemethodishighlyeandrobust.
ApplyingourmethodtotheCheng-Gaoversionof
DreamoftheRedChamber
hasledto
convincingifnotirrefutableevidencethatthe80chaptersandthelast40chaptersof
thebookwerewrittenbytwotauthors.Furthermore,ouranalysishasunexpectedly
providedstrongsupporttothehypothesisthatChapter67wasnottheworkofCaoXueqin
either.
Applyingourmethodto
Micro
,weareabletotheexistenceofchrono-divideand
identifyitslocation.Itprovidesstrongevidenceforustoattributethe1/4ofthework
toMichaelCrichtonandtheleft3/4toRichardPreston.
Therobustnessofourapproachisalsoevidencedbyitsabilitytorejectthemultiple
authorhypothesiswhenthereisnochrono-divide,aswehavedonefortheotherthree
classicalChinesenovels.
42
Chapter4
OpenClassAuthorshipIden
4.1Introduction
Whowrotethe
FightingforOurLives
,whowrotethe
RecipesforDisater
andwhowrote
TheAnimatedSkeleton
?Therearenumerousbooksthatwerepublishedanonymouslyor
underpseudonyms.Althoughauthorshipattributionhasalonghistory,mostofthestudies
aretothecloseauthorshipidencaseswheretheauthorsarelimitedtoa
smallsetofpotentialcandidateauthors.Forexample,the
Federalist
paperswereknownto
bewrittenbythreeauthors(Hamilton,MadisonandJay),anditisonlyamatterofdeciding
whichpaperwaswrittenbywhichoneofthethreeauthors.Incontrast,theaforementioned
booksdidn'thaveanyreliablesetof\suspects",andtheidenoftheirauthorship
becomesanopencase.Theclosecaseidenproblemisgenerallyfarlesschallenging.
Withasmallsetofpotentialauthorsitisoftenviabletosimplylookfortheclosestmatch
insomefeatures,andthisisinfacthowmostoftheseclosecaseswerestudied.However,
thisapproachcannotbeappliedifthenumberofcandidateauthorsisverylarge(insome
casesitwillbetheentiredatabase).Thusforopencaseidenproblemswewillneed
completelynewapproaches,approachesthatwillallowustoreliablypickoutthetrueauthor
amongalargenumberofauthors.Sofar,therehasnotbeenareliablemethodtohandle
openclassauthorshipiden
Authorshipattributionisoneofthemostprominentproblemsinthebroadresearcharea
43
ofliterarystylometry,whichhasseenratherexplosivegrowthinrecentyears.Thebasicidea
behindtheauthorshipattributionisthatpeoplecanalwayssomemeasurementsthat
candistinguishthestylesofditauthors.Althoughthestyleofanauthorint
booksmayvaryduetotgenres,itshouldn'tdtoomuch.Theinstyle
betweenauthorsshouldbemuchlargercomparedtothatwithinanauthorifweuseproper
measurements.Thus,withthesemeasurements,totheauthorofabookwhosesetof
candidateauthorsisknown,wecanusesomebooksfromacandidateauthorastraining
samplesandbuildbetweentauthors.Thepurposeofpartofthis
chapteristosolvecloseclassproblems,wherethesetofcandidateauthorissmall.In
anotherword,ifonebookissuspectedtobewritteneitherbyauthor
A
1
;A
2
,or
A
3
,wecan
alwayssomefeaturesthatcandiscriminatethestyleoftheseauthorsandattributeitto
oneofthecandidateauthors.Thisformsthebasisforclosecaseauthorshipident
Oneofthefundamentalrequirementforconductingopencaseidenistheac-
cesstoalargedatabaseofauthorsandtheirwork.Fortunately,therearemanydatabases
thatareavailableeitherinthepublicdomain(e.g.ProjectGutenberg)orthroughlibrary
subscription(e.g.theHathiTrustcollection).Butevenwithalargedatabase,the
authorofabookssuchas
FightingforOurLives
whenwehavenoclueaboutitsauthorship
ischallenging.Theauthorshipinanopenclassproblemismuchhardertostudy,andvery
fewworkhasbeendonesofar.Harderstillistobeabletodoittlywithinalarge
database.Inthischapter,weproposeanewmethodtoopenclassproblemsinauthorshipat-
tributionusingtheideaofrandomizationtogetherwiththemethodforcloseclassproblems,
bywhichwecandetectthetrueauthorofanytextinalargedatabaseofauthorsanddoit
tly.Wecanattributethetexttoitstrueauthorifheorsheisinthedatabase,and
otherwisewecanconcludethatthetrueauthorisnotinthedatabase.Wewilldescribeand
44
demonstratethereliabilityofouralgorithminthenextsections.Ourmethodisbasedon
anewmethodforclosedcaseidenandanauthorrandomizationtechnique.Inthe
partofthechapterweconductthreecasestudiesbyanalyzingthreebooks:
Cuckoo's
Calling
,
ToKillaMockingbird
and
DreamsfromMyFather
.Allthreebooksareinvolved
incontroversies,someofwhicharestillongoing.Weshowthatourmethodcanalmost
unequivocallysettlethecontroversies.
4.2CloseClassProblems
4.2.1Methodology
Aswementionedearlier,theobjectiveofclosecaseidenistoconstruct
fortheseauthorsandthebestmatchamongthegroupusingcertainmetrics.Herewe
buildSVMusingfrequenciesofwordsasfeatures.Insteadofusingonlyfunction
words,weusethemostcommonlyappearedwordsinthesamplesweareanalyzing.This
turnsouttobemoreeinourtesting,anditisalsopointedoutin[7].
Let
A
1
;A
2
;:::;A
L
betheauthorsinagroupwithinwhichwewouldliketothe
closestmatchforcertaintextinquestion.Webreakdownallthetrainingandtestingtext
into
samples
.Eachsampleisasegmentoftextconsistingof
K
words.Inourstudywe
haveset
K
=2000.Thisprovestoworkverywellandsmaller
K
suchas
K
=1000still
workswell.Thisisimportantinsituationswheretheavailabilityoftextbycertainauthor
islimited.Atypicalbookisbrokendownintobetween25to100samples.Thenumberof
samplesthat
canbeattributedtoagivenauthor
thusdependsonhowthe
authoris.Onaverageanauthorwillhaveacorpusofbetweentwohundredstoathousand
samples.
45
For
L>
2,therearetwopopularwayswecanperformthismulti-class
task.Onewaywouldbetobuildforeachpair,andinessenceconductaseries
ofhead-to-headcontest.Theclosestmatchwouldbetheauthorwiththemostwins.One
potentialproblemisthatas
L
increases,thenumberofpairwisewouldincrease
intheorderof
L
2
,makingitcomputationallylesst.Hereweemploythe
one-vs-rest
method,inwhichwebuild
L

f
1
;f
2
;:::;f
L
.The
f
j
isbuiltusingauthor
A
j
asoneclassandallotherauthorsgroupedintoasingleclass.Thisisshowntoperform
aswellasthepairwiseapproachandinourtesting,itactuallyoutperformsslightlyinour
setting.
Assumethatthetextinquestionisbrokendowninto
N
samples
X
1
;X
2
;:::;X
N
.We
nowapplythe
f
j
toeachsample
X
n
,where
f
j
isthesignfunctionofdecision
function2.11.Thetestisbeingmatchedwithauthor
j
0
(maximumwin)if
j
0
=argmax
j
X
n
f
j
(
X
n
)
:
Inourimplementationwealsogainsomeimprovementbyrunningseveralfoldsof
andtaketheiraveragetominimizethechanceforoutliers,withmatchingbeingdecidedby
avotingschemesuchasmaximumwinorBORDA.Weshallcall
q
j
=N
the
matchingrate
for
author
A
j
,where
q
j
isthenumberofsamplesamong
f
X
n
g
thathavematchedwithauthor
A
j
.Theauthorwiththehighestmatchingrateisdeclaredtobetheclosestmatchwithin
thegroup.
46
4.2.2Casestudies
Casestudy1{
Cuckoo'sCalling
.
Acontroversyreceivingagreatdealofmedia
attentioninthesummerof2013wastherumorallegingthattheauthorRobertGalbraithof
thecrime
Cuckoo'sCalling
wasactuallythepseudonymforJ.K.Rowling.Inan
to\out"Rowling,asastoryintheNationalGeographicdetailed,the
SundayTimes
sent
thebookalongwithbooksbyRowlingandthreeothercrimeauthorstoPatrickJuola
andPeterMillican,twoprominentexpertsinauthorshipattribution,forcomparison.Both
researcherswereabletoidentifyRowlingastheclosestmatchamongthe4authors.Butto
highlightthegeneralyoftrueauthorshipidenon,thestoryalsomentionedthat
Juola\wasn'ttotallytintheresult.Afterall,hehadnowayofknowingwhether
therealauthorwassomebodywhowasn'tinthecomparisonsetofbookswhohappenedto
writelikeRowlingdoes."
Wehavetested
Cuckoo'sCalling
withinagroupof
L
candidates,where
L
canbe2,3
ormoreandoneortwobooksfromeachauthorareusedinconstructingtheTo
makeourresultcomparable,thebookswechoosearesimilartotheonesthatJuolaand
Millicanhadusedintheirtrainingprocess.Toshowtheaccuracyofourwehave
alsotestedotherbooksbyalltheauthors.Thematchingratesarelistedtogetherwith
thatof
Cuckoo'sCalling
.Wehaveconductedseveralgroupsofexperimentsandobtained
thefollowingresults(seetables(4.14.24.34.4)).Asonecansee,althoughtherearesome
whenttrainingsamplesareused,thetestedbooksfromeachauthor
matchedtheirownworkwithveryhighmatchingrates.Furthermore,allthetableshere
indicatethat
Cuckoo'sCalling
istheclosestmatchtotheworkofJ.K.Rowing.Wecan
easilydrawthesameconclusionasJuolaandMillicandid,namelyifRobertGalbraithis
47
amongtheauthorsinthetraininglist,thenitisthepseudonymforJ.K.Rowling.
Testing
n
Training
TheCasualVacancy-J.K.Rowling
ThePrivatePatient-P.D.James
Cuckoo'sCalling
72/72
0/72
Deathly-JKR
92/100
8/100
Death-PDJ
1/45
44/45
Table4.1:
Cuckoo'sCalling
:resultbytheobtainedusingonebook
foreachofthetwoauthorsastrainingsamples.
Testing
n
Training
TheCasualVacancy
J.K.Rowling
ThePrivatePatient
P.D.James
NoMoreDying
R.Rendell
DeadBeat
V.McDermid
Cuckoo'sCalling
65/72
0/72
1/72
6/72
Deathly-JKR
64/100
1/100
8/100
27/100
Death-PDJ
0/45
44/45
0/45
0/45
SomeLie-RR
2/28
0/28
25/28
1/28
KickBack-VM
0/37
0/37
0/37
37/37
Table4.2:
Cuckoo'sCalling
:resultbytheobtainedusingonebook
foreachofthefourauthorsastrainingsamples,groupI.
Testing
n
Training
Half-Blood
J.K.Rowling
ColdBlood
CapoteTruman
UnbearableLightness
Alexander.McCall
MurderofRoger
AgathaChristie
Cuckoo'sCalling
57/72
11/72
3/72
1/72
Deathly-JKR
92/100
8/100
0/100
0/100
Prayer-CT
0/22
22/22
0/22
0/22
Lady-AM
1/32
0/32
31/32
0/32
Appoint-AC
5/26
0/26
4/26
17/26
Table4.3:
Cuckoo'sCalling
:resultbytheobtainedusingonebook
foreachofthefourauthorsastrainingsamples,groupII.
Casestudy2{
ToKillaMockingbird
ToKillaMockingbird
wasasouthern
dramabyHarperLeepublishedin1960.AyearlateritwonthePulitzerPrizeandhad
quicklybecomeoneoftheclassicsinAmericanliterature.Eversinceitspublication,
theauthorshipofthebookhasbeenasubjectofcontroversybecausethiswastheoneand
onlyonepublishedworkbyHarperLeeuntil2013.Somefoundithardtobelievethat
anunknownauthorwouldwriteasingleandgreatnovelandthenstopwriting.Rumors
persistedthetrueauthorswasTrumanCapote,achildhoodfriendofLeeandtheauthorof
48
Testing
n
Training
J.K.Rowling
P.D.James
RuthRendell
ValMacDermid
Cuckoo'sCalling
72/72
0/72
0/72
0/72
Chamb-JKR
42/42
0/42
0/42
0/42
Light-PDJ
1/60
59/60
0/60
0/60
Crack-VM
0/37
0/37
1/37
36/37
Table4.4:
Cuckoo'sCalling
:resultbytheobtainedusingtwobooks
foreachofthefourauthorsastrainingsamples.
InColdBlood
,despitethedenialsofbothLeeandCapote.PearlBelle,afamousliterary
criticandeditorinCambridge,Massachusetts,exposedthatCapoteimpliedtoherthathe
pennedorheavilyeditedthebook.In2001JimGilbert,anAlabamawriter,addedtothe
rumorbyclaimingthat
ToKillaMockingbird
wastheworkofCapoteafterhecomparedit
with
InColdBlood
intermsofliterarystructure.ButHarperLeehasherowndefendersas
well.Dr.WayneFlyt,aretiredprofessorofhistoryfromAuburnUniversity,claimedthat
HarperLeeisthetrueauthorofthisbookafteranalyzingthevoicesofthecharactersand
foundthestylestobequitetfromthestyleofCapote.
Tosolvethisauthorshipmystery,wehavecompared
ToKillAMockingbird
withthe
workbyTrumanCapoteusingtheproposedmethodandauthorrandomizationforclose
classproblems.Wehavetrainedtheusingonebookbyeachoftheecandidate
authors(seetable4.5).Wehavethentested
ToKillAMockingbird
togetherwiththeother
booksbytheselectedcandidateauthors,thematchingratesareshownintable4.5.The
resultsshowthatthetestingworksfromcandidateauthorsmatchedcorrectlyinhighrates,
whilethematchingrateof
ToKillAMockingbird
isdistributedalmostrandomlyinthethree
oftheeauthors.Thus,weconcludethatitisunlikelythat
ToKillAMockingbird
was
writtenbyCapoteTruman.
Casestudy3{
DreamsFromMyFather
.
DreamsfromMyFather
isamemoir
byBarackObamapublishedin1995.In2008WilliamAyersstirredupacontroversyby
49
Testing
n
Training
ADarkerDomain
ValMcDermid
TheExecutionersSong
MailerNorman
InColdBlood
CapoteTruman
TheMurderofFrorence
DouglasPreston
MidnightGarden
JohnBerendt
ToKillaMockingbird
13/49
3/49
15/49
1/49
17/49
Bleeding-VM
63/66
2/66
0/66
0/66
1/66
Castle-MN
9/75
53/75
3/75
10/75
0/75
Voices-CT
1/27
0/27
24/27
0/37
2/27
Prayer-CT
1/22
0/22
21/22
0/22
0/22
Table4.5:
ToKillAMockingbird
:resultbytheobtainedusingone
bookforeachoftheeauthorsastrainingsamples.
\confessing"thathewasthetrueauthorofObama'smemoir.SomeofObama'scriticswere
eagertopointoutthatObamadidn'tpossessthewritingskilltowritethisbest-sellingbook
becausehecouldn'tevenwritea30secondspeech(his2012acceptancespeechwaswritten
byBillClinton).Hisdefenders,ontheotherhand,werebelieversthatObamahimself
wasthetrueauthor.Afterreadingthebook,JackCashillarguedthat
DreamsfromMy
Father
wasthematicallyandsemanticallytooclosetoAyers'searliermemoir,
FugitiveDays
.
PatrickJuolacouldn'tmaketheconclusionandsaid:\Theaccuracysimplyisn'tthere"
usinghistools.
ToinvestigatetheAyers'\confession",weconstructedclasserswithinasmallgroup,
whichincludedthebook
FugitiveDays
,
AudacityofHope
byObama,andothertwobooks
fromotherauthors,andtestedthematchingfor
DreamsFromMyFather
andotherbooks
knowntobebythechosenauthors.Aswecanseeintable4.6,wematchedallthebooksto
thetrueauthorswithhighaccuracy,whilethebestmatchfor
DreamsFromMyFather
was
withObama's
AudacityofHope
.ThisresultleadsustoconcludethatAyerslikelyliedin
his\confession".
Testing
n
Training
AudacityofHope
BarackObama
FugitiveDays
BillAyers
ADistantMirror
BarbaraTuchman
TheStoryofMyLife
ClarenceDarrow
Dreams-Obama
54/74
10/74
3/74
7/74
Confession-Ayers
1/42
40/42
0/42
1/42
Folly-Tuchman
2/80
0/80
78/80
0/80
Table4.6:
DreamsFromMyFather
:resultbytheobtainedusingone
bookforeachofthefourauthorsastrainingsamples.
50
4.3OpenClassProblems
Openclassproblemsaretocheckwhethertheauthorisinthetrainingauthorpoolornot,
ifyes,whoitis.Largecandidategroupandcientalgorithmtodealwithbigdataarethe
twomainchallengesinsolvingopenproblems.Wewillmoveonestepfurtherinauthorship
attributionandreliablydetectwhethertheauthorisinthetrainingauthorpoolornot.Our
approachisbasedonthemethodinsolvingcloseclassproblemandtheideaofrandomization.
Theideaisthatiftheauthorisinasmallgroup,wewillalwaysgethighmatchingrate
forthetrueauthorwhenweconstructinthegroup.Whenweseparatecandidate
authorsintosmallgroupsrandomly,thetrueauthorwouldalwaysobtainhighmatchingrate.
Byrecordingtheoccurrenceoftheauthorswhogetthehighestmatchingrate,wewould
obtainthehighfrequencyofthetrueauthor.IfweauthorAandconstructby
randomlychoosingotherauthorinthetrainingmanytimes,theaveragematchingratefor
authorAshouldbemuchhigherifAisthetrueauthor.Iftheaveragematchingratesfor
allthecandidateauthorsarenothighenough,thenwecanconcludethatthetrueauthoris
notinthecandidateset.
Theopenclassauthorshipidentiproblemisdividedintoseveralcomponents.
Herewebreakdownthesecomponentsanddiscusstheirdetails.
4.3.1Databaseanddatapreparation
Naturally,toperformopenclassauthorshipiden,onewouldneedasizabledigi-
taldatabaseofauthors.Fortunatelythisisnotaproblemtoday.Thepubliclyavailable
ProjectGutenberghasover45,000titlesasofMay2014.MoreimpressiveistheHathiTrust
collection,whichhasover10milliontitlesandisavailablefromseveraluniversitylibraries.
51
TheHathitrustcontainsbothpublicdomainandcopyrightedtitles.Itisanidealdatabase
aroundwhichtobuildalargescaleauthorshipstylometryanalysisproject.Forauthorin
thetraining,weneedenoughsamplestotrainthethus,Wedon'tuseallthe
authorsinthementioneddatabase.Wejustchoosesomeauthorswhohaveatleast3books
andmorethan100samples(oflength2000)fromthesebooks.Finally,wegotadatabase
whichcontains200authorsandeachauthorhasatleast3bookswhichcontainatleast100
samples.
Sinceourmethodisbasedonsupervisedmachinelearning,trainingsamplesinthe
databasewillbeusedtotrainforIntheorywecanuseallavailablesamples
inthetrainingprocess.Inpractice,wenoticedthatresultsdonotimprovetlywith
morethan100samplesperauthor.Infactweobtaingoodresultsevenwithonly30-60
trainingsamplesperauthor.Byrandomlyonebookastestandconstructingtheclas-
betweenthetrueauthorandothers,weobtaintheaveragematchingrateforthetrue
authorbyrepeatingthisprocess50times.Theoretically,thewritingstyleoftheauthorscan
bebetterifthetrainingsamplesarechosenfromtbooksbyeachauthor
ratherthanfromoneortwowholebooksbytheauthors.Thefollowingtwotables4.7and
4.8showthatwhenthesamplesizeincreasefrom30to90,theaveragetruematchingrate
isgettinghigher.Theyalsovthatwhenthetrainingsamplesarefromtbooks
ratherthanoneortwobooksbytheauthor,thetruematchingrateisbetter.Thus,allour
casestudiesinthisparthaveused90trainingsamplesfromtbooksbyeachauthor.
4.3.2Methodology
ImprovedReliabilityandRobustnessthroughAuthorRandomization.
Good
resultscanoftenbeattainedifthereiscertaintythattheauthorofthetextinquestionis
52
L=2
L=3
L=4
L=5
30
0.93
0.87
0.86
0.84
60
0.94
0.93
0.89
0.88
90
0.94
0.93
0.90
0.90
Table4.7:Matchingrateformulti-classtrainedbyrandomlychosenbooksby
eachauthor.
L=2
L=3
L=4
L=5
30
0.95
0.92
0.91
0.88
60
0.96
0.95
0.93
0.92
90
0.97
0.96
0.94
0.93
Table4.8:Matchingrateformulti-classtrainedbyrandomlychosensamples
byeachauthor.
indeedinthegroup(suchasinthecaseof
Federalist
papers).Theproblemisthattheclosest
matchcanalwaysbefoundevenwhentherealauthorofthetextinquestionisinfactnot
amemberofthegroup.Soingeneraloneoftencanonlyclaimtohavefoundtheclosest
matchwithinthegroupwithouttrulyidentifyingtheauthorship.
OurSVMapproachisn'timmunetothisconcern.Infact,itcanhappenwheninasmall
groupamatchingrateofover90%isobtainedbysomeoneotherthanthetrueauthorof
sometextinquestion.Thisusuallyoccurswhenthetrueauthorisnotinthegroup.Sohow
canoneelyminimizesuchfalsematching?
Akeyingredientforourmethodis
authorrandomization
,whichprovestobeahighly
ewaytoidentifyafalsematching.Hereforeach
potentialcandidate
authorweperform
aseriesofsmallgrouponsagainstrandomlychosenauthorsfromadatabase.
Moreprecisely,let
A
1
betheauthorwewishtotestformatchingforthetextsamples
X
1
;X
2
;:::;X
N
.Wenowrandomlyselectauthors
A
2
;:::;A
L
foradatabaseandperform
then.Thisisdoneoverseveraltrials.Intheendweobtainanaveragematching
53
rateforauthor
A
1
.Theoutlineofouralgorithmforopenclassproblemsisasfollowing:
1Initializethedata,whichcontainsatleast3booksbyeachauthorinthedatabase.
2Spliteachbookintopartsof
K
words(usuallyweset
K
=2000),andextractthe
featuresforeachpart.Werandomlychooseonebookbyeachauthorinthedatabase
astest.
3Randomlydividetheauthorsintogroupsof
L
1
authors,andconstructfor
eachgroup.Recordtheauthorswhoreceivesthehighestmatchingrates.
4Repeat
T
1
timesstep3andrecordtheauthorshavingthehighestmatchingrates.
5Removetheauthorswhoseaveragematchingrateisbelowthethreshold
S
1
.Those
authorswhoarestillleftaredeemedaspotentialtrueauthors,or\suspectauthors".
6Fixeachsuspectauthorfromstep5andrandomlychoose
L

1authorsfromthe
databaseandconstructaRecordthematchingratesforthesuspectauthors.
7Repeat
T
2
timesstep6toobtaintheaveragematchingrateforeachsuspectauthor.
8Ifthemaximumaveragematchingratefromstep7isabovesomethreshold
S
2
,we
identifythatauthorasthetrueauthor.Otherwisethetrueauthorisnotinthe
dataset.
Wehavedoneextensivetestofthisapproach,andthetable4.9showsthematchingrates
forbothauthorsandnon-authors.Thelastrowdenotesthemaximalaveragematchingrate
foranon-authorinourtest.Asonecansee,theaveragematchingratesofthetrueauthors
areabove90%evenfor5-classwhiletheaveragematchingrateofthenon-
authorisalwaysonaverage.What'smore,eventhemaximalaveragematchingratefora
54
non-authorisnotevenclosetotheaveragematchingrateforthetrueauthor.Thisisperhaps
themostimportantattributethatallowsustoidentifyauthorshipusingourapproach.
L=2
L=3
L=4
L=5
Author
0.94
0.93
0.90
0.90
Non-author
0.48
0.32
0.24
0.20
Maxnon-author
0.70
0.62
0.59
0.51
Table4.9:Matchingrateformulti-classbyrandomlychosenbooksbyeach
author.
L=2
L=3
L=4
L=5
Author
0.97
0.96
0.94
0.93
Non-author
0.51
0.34
0.25
0.20
Maxnon-author
0.66
0.59
0.52
0.49
Table4.10:Matchingrateformulti-classwithrandomlychosensamplesby
eachauthor.
Theimportanceoftheaboveresultsisthatitprovidesanextremelyrobustwaytonot
onlytelluswhichauthoristhebestmatchforagivenbodyoftext,butalsohowwell
thematchingis.Thegapbetweenmatchingratefortrueauthorsandnon-authorsisso
tthatitcanbescaledupintoalegitimateopenclassauthorshipiden
method,whichwediscussnext.
OpenClassAuthorshipation.
Toscaleourmethodforcloseclassproblems
upforopenclassauthorshipidenwedivideourdatabaseofauthorsintosmall
groups
G
1
;G
2
;:::;G
M
of
L
authors(wechoose
L
tobe4or5.Itworksfor
L
aslargeas
10).Withineachgroup
G
m
webuildwhichcanhaveseveralfoldsasmentioned
earlier.Formorerobustnesswemayalsoestablishseveralsuchgroupingsfromrandom
assignments.Notethatthisisperhapsthemostcomputationallydemandingpartwitha
largedatabase,butitcanbedoneentirelyasaonetimetask.
55
Givenabodyoftextinquestionanditssamples
X
1
;:::;X
N
,wefeedthesamplesinto
theforeachgroup.Eachgroupwillhaveanauthorthathasthehighestmatching
rate,buteventhehighestmatchingratecanbelow.Weonlyselectthosewhosematching
ratesexceedathreshold(e.g.50%)ascandidatesforthenextroundoftesting.Intheory,the
trueauthorshouldalwaysgetthehighestmatchingrateinallgroups,whileotherauthors
won'talwaysobtainhighmatchingrate.Assumethatafterthisinitialroundweareleft
withagroupofpotentialauthors
B
1
;B
2
;:::;B
q
forthetextinquestion.If
q
isstillrather
large,asitispossiblegiventhesizeofthedatabasewehopetobuild,wewouldthentreat
B
1
;B
2
;:::;B
q
asaseparatedatabaseofauthorsanditeratethisprocessuntilasmallnumber
ofcandidateauthors
C
1
;:::;C
p
remain.
Nowheretheauthorrandomizationmethodshowsitspower.Foreachremainingcandi-
dateauthor
C
j
,werandomlyadd
L

1authorsfromtheentiredatabasetoformagroup
G
of
L
authors.Within
G
amatchingratefor
C
j
isobtained.Werepeatthisprocessfor
C
j
manytimestoobtainanaveragematchingratefor
C
j
.Weidentifytheauthoramong
C
j
withthehighestaveragematchingrateastheauthorofthetextinquestion,provided
thattheaveragematchingrateishigherthancertainthreshold.Ifeventhehighestaverage
matchingrateislow,wemayconcludethatthetextiswrittenbyanauthoroutsideour
database.
Inthefollowingcasestudiesweset
L
1
=4or5,
T
1
=
T
2
=50,
S
1
=0
:
5or0
:
6.Wealso
set
S
2
=0
:
8.
4.3.3Casestudies
Wehavebuiltatestdatabaseconsistingof200authorsandeachauthorhasmorethan100
samples.Thisdatabase,althoughsmall,alreadyallowsustotestourmethodandstudy
56
someoftherealworldcontroversies.WeareworkingwiththeMichiganStateLibraryto
tlyaugmentthesizeofthedatabase.ItisconceivablethatusingtheHathitrust
wecanbuildadatabaseconsistingofseveralhundredsofthousandsofauthors.
Casestudy:
Cuckoo'sCalling
.
Asatestcaseweranouropenclassmethodon
Cuckoo'sCalling
usingthedatabasewehadbuilt.Theresultisquitestrikinglyeventhough
ourdatabaseissmall.Inthestagewherewedividetheauthorsintogroupsoflength5,
sotherearenow40groups.Werecordtheauthorineachgroupwiththehighestmatching
rate.Butofthe40leadersineachgrouponly7hasamatchingrateover60%.Ifwe
lowerthethresholdto50%thenthereare16candidatesremaining.However,ifwedothis
50timesthenonly3candidateshavematchingrateof50%orhigherinatleast30trials
(notsurprisinglyRowlingpassedthethresholdinalltrials.).Wealsotriedamorerobust
approachinthestage,inwhichwegrouptheauthorsandkeepthoseaspotentialauthors
(orsuspectauthors).Thenextstageisauthorrandomization,whereweeachsuspect
authorandrandomlychoose
L

1authorsfromthenon-suspectauthorsandbuildthe
Weobtainedtheaveragematchingrateforeachbyrepeating
T
2
times
inthisrandomizationprocess(
T
2
=50).Table4.11showsthatthematchingrateforeach
suspectauthorishigherthantheaverage.Stillthematchingratein
T
2
=50repeatsfor
RowlingwasmuchhigherandwecaneasilyandalmostelyidentifyRowlingasthe
trueauthor.
masterauthors
frequency
L=2
L=3
L=4
L=5
80
50/50
0.96
0.93
0.91
0.91
163
34/50
0.64
0.58
0.50
0.41
46
33/50
0.65
0.57
0.44
0.33
83
28/50
0.63
0.56
0.42
0.31
174
28/50
0.62
0.53
0.40
0.32
Table4.11:
Cuckoo'sCalling:
mastersuspectedauthorsandtheaveragematchingrate.
57
Casestudy2{
ToKillaMockingbird
.
Asanothercasestudy,wehaveapplied
ourmethodofauthorrandomizationtostudytheauthorshipof
ToKillAMockingbird
.We
testthebookformatchingagainstCapoteand
L

1otherrandomlyselectedauthorsfrom
ourdatabase.Additionallywehavealsotested
AnsweredPrayers
byCapoteinparallel.The
trialisrepeated50times.Aswecanseeintable4.12,theaveragematchingrateof
ToKill
aMockingbird
withCapoteislowerthan50%.Incontrast,thematchingratefor
Answered
Prayers
isalwaysabove80%.TheresultshowsvirtuallyirrefutablythatCapotedidnot
write
ToKillaMockingbird
.
testbooks
n
matchingrate
L=2
L=3
L=4
L=5
ToKillaMockingbird
0.49
0.40
0.39
0.36
Voices-CT
0.89
0.84
0.85
0.81
Prayer-CT
0.98
0.96
0.95
0.96
Table4.12:
ToKillaMockingbird:
theaveragematchingrate.
Casestudy3{
DreamsFromMyFather
.
Tostrengthenourconclusion,wehave
againappliedourmethodbymatchingup
DreamsFromMyFather
againstWilliamAyers
and
L

1randomlyselectedauthors.Thisisrepeated50times.Testingresultsintable
4.13aretheaveragematchingrateforthe50trials.TheaveragematchingrateforAyers
islowerthan25%.Incontrastinthesametestbutwith
PublicEnemy:ConfessionsofAn
AmericanDissident
byAyersinplaceof
DreamsfrommyFather
,weareabletoobtainover
98%averagematchingrate!ConstructingtheusingotherbooksbyAyersyield
verysimilarlyhighmatchingrates,whicharealwaysabove95%.Changingthetrainingbook
forAyersgivesalmostthesameresults,wherethematchingratefor
DreamsfromMyfather
againstAyershasalwaysstayedbelow50%.ThisisvirtuallyirrefutableevidencethatAyers
lied
inhis\confession"aboutwritingObama'sautobiography.
Aswecanseeintheresultsofthethreeexperiments,ourmethodforopenclassproblems
58
testbooks
n
matchingrate
L=2
L=3
L=4
L=5
Dreams-Obama
0.24
0.25
0.23
0.23
Confession-Ayers
1.0
0.98
0.99
0.99
Table4.13:
DreamsFromMyFather:
theaveragematchingrate.
canalwaysdetectthetrueauthorsuccessfully,althoughthematchingratesofthetrue
authorsvaryinrentexperiment.Whenthetrueauthorisnotinthecandidatesetlike
ToKillaMockingbird
,theresultsfromourmethodcanleadusclaimthatthetrueauthor
isnotinthecandidateset.
4.4Conclusion
Westudiedthecloseclassauthorshipattributionproblemsusingmachinelearningtechniques
andstudiedtheauthorshipproblemsof
Cuckoo'sCalling
,
ToKillaMockingbird
and
Dreams
FromMyFather
.Alltheexperimentsshowedhighmatchingrateforthetrueauthors.
Weproposedanewalgorithmtosolveopenclassauthorshipattributionproblemsby
extendingthemethodforcloseclassproblemsandrandomizingauthorsinabigauthor
pool.Bycomparingthematchingrateoftrueauthorsandnon-authors,weshowedthat
thisalgorithmisveryreliableandstable.Tofurtherverifyourresult,westudiedthethree
particularcasesasintheclosecase.
Applyingourmethodsto
Cuckoo'sCalling
,wecanthatitiswrittenbyJ.K.Rowling
andpublishedwithhispseudonymRobertGalbraithwithstrongevidence.Studyingtheau-
thorof
ToKillaMockingbird
withourmethods,weconcludedthatitsauthorisnotCapote
Truman,althoughwecan'tconcludeitisbyHarperLeebecausenotrainingsamplesfrom
her.Furthermore,wecantellthatAyersliedtothemediathatitwashewhowrote
Dreams
FromMyFather
bycomparingthematchingratesbetweenbooksbyObamaandAyers.
59
BIBLIOGRAPHY
60
BIBLIOGRAPHY
[1]FriederikeAntosch.
Thediagnosisofliterarystylewiththeverb-adjectiveratio
.New
York,1969.
[2]YanAnzheng.Onepieceofevidencethatchapters64and67arenottheoriginalversion.
JournalofXianyangNormalUniversity
,24:3,2009.
[3]RonaldEBee.Statisticalmethodsinthestudyofthemasoretictextoftheoldtes-
tament.
JournaloftheRoyalStatisticalSociety.SeriesA(General)
,pages611{622,
1971.
[4]RonaldEBee.Astatisticalstudyofthesinaipericope.
JournaloftheRoyalStatistical
Society.SeriesA(General)
,pages406{421,1972.
[5]BarronBrainerd.Onthedistinctionbetweenanovelandaromance:adiscriminant
analysis.
ComputersandtheHumanities
,7(5):259{270,1973.
[6]BarronBrainerd.Weighingevidenceinlanguageandliterature:Astatisticalapproach.
AMC
,10:12,1974.
[7]JohnFBurrows.Word-patternsandstory-shapes:Thestatisticalanalysisofnarrative
style.
LiteraryandlinguisticComputing
,2(2):61{70,1987.
[8]JohnFBurrows.anoceanwhereeachkind...:Statisticalanalysisandsomemajor
determinantsofliterarystyle.
ComputersandtheHumanities
,23(4-5):309{321,1989.
[9]B.C.Chan.Authorshipofthedreamoftheredchamber:Acomputerizedstatisti-
calstudyofitsvocabulary.
DissertationAbstractsInternationalPartA:Humanities
and[DISS.ABST.INT.PT.A-HUM.&SOC.SCI.]
,42(2):1981,1981.
[10]KobyCrammerandYoramSinger.Onthelearnabilityanddesignofoutputcodesfor
multiclassproblems.In
ProceedingsoftheThirteenthAnnualConferenceonComputa-
tionalLearningTheory
,COLT'00,pages35{46,SanFrancisco,CA,USA,2000.Morgan
KaufmannPublishersInc.
61
[11]A.deMorgan.Lettertorev.heald18/08/1851.InS.ElizabethandD.Morgan,editors,
MemoirsofAugustusdeMorganbyhiswifeSophiaElizabethdeMorganwithSelections
fromhisLetters
.London:Longman'sGreenandCo,1851/1882.
[12]JoachimDiederich,orgKindermann,EddaLeopold,andGerhardPaass.Authorship
attributionwithsupportvectormachines.
Appliedintelligence
,19(1-2):109{123,2003.
[13]ThomasG.DietterichandGhulumBakiri.Solvingmulticlasslearningproblemsvia
error-correctingoutputcodes.
arXivpreprintcs/9501101
,1995.
[14]AlvarEllegard.
AStatisticalmethodfordeterminingauthorship:theJuniusLetters,
1769-1772
,volume13.oteborg:ActaUniversitatisGothoburgensis,1962.
[15]JillianMFarringdon,AndrewQueenMorton,MichaelGFarringdon,andMDavid
Baker.
AnalysingforAuthorship:AGuidetotheCusumTechnique
.Universityof
WalesPress1996.
[16]LiGuoqiangandLiRui-fang.StudyBasedonStatisticsofwordFrequencyResearch
onOnlyAuthorofthe"DreamoftheRedChamber"[J].
JournalofShenyangInstitute
ofChemicalTechnology
,4,2006.
[17]I.Guyon,J.Weston,S.Barnhill,andV.Vapnik.Geneselectionforcancer
usingsupportvectormachines.
Machinelearning
,46(1):389{422,2002.
[18]D.I.HolmesandJ.Kardos.Whowastheauthor?anintroductiontostylometry.
CHANCE-BERLINTHENNEWYORK-
,16(2):5{8,2003.
[19]Chih-WeiHsuandChih-JenLin.Acomparisonofmethodsformulticlasssupportvector
machines.
NeuralNetworks,IEEETransactionson
,13(2):415{425,2002.
[20]MatthewLJockersandDanielaMWitten.Acomparativestudyofmachinelearning
methodsforauthorshipattribution.
LiteraryandLinguisticComputing
,pagefqq001,
2010.
[21]PatrickJuola.Authorshipattribution.
FoundationsandTrendsininformationRe-
trieval
,1(3):233{334,2006.
[22]VladoKeselj,FuchunPeng,NickCercone,andCalvinThomas.N-gram-basedauthor
forauthorshipattribution.In
Proceedingsoftheconferencepassociation
forcomputationallinguistics,PACLING
,volume3,pages255{264,2003.
62
[23]BradleyKjell.Authorshipattributionoftextsamplesusingneuralnetworksand
bayesianIn
Systems,Man,andCybernetics,1994.Humans,Information
andTechnology.,1994IEEEInternationalConferenceon
,volume2,pages1660{1664.
IEEE,1994.
[24]BradleyKjell.Authorshipdeterminationusingletterpairfrequencyfeatureswithneural
network
LiteraryandLinguisticComputing
,9(2):119{124,1994.
[25]BradleyKjell,WAddisonWoods,andOphirFrieder.Informationretrievalusingletter
tupleswithneuralnetworkandnearestneighborIn
Systems,ManandCyber-
netics,1995.IntelligentSystemsforthe21stCentury.,IEEEInternationalConference
on
,volume2,pages1222{1226.IEEE,1995.
[26]MosheKoppel,ShlomoArgamon,andAnatRachelShimoni.Automaticallycategorizing
writtentextsbyauthorgender.
LiteraryandLinguisticComputing
,17(4):401{412,2002.
[27]AndrewQueenMorton.
Literarydetection:Howtoproveauthorshipandfraudin
literatureanddocuments
.BowkerLondon,1978.
[28]FrederickMostellerandDavidWallace.Inferenceanddisputedauthorship:Thefeder-
alist.1964.
[29]A.Pawlowski.Wincentylutoslawski-aforgottenfatherofstylometry.
Glottometrics
,
8:83{89,2004.
[30]CaoQingfu.Thelast40chaptersofdreamoftheredchamberwerenotwrittenbycao
xueqincomparisonoffunctionwords,phrasesandchaptertitlesinthe80chapters
andthelast40chapters.
ADreamofRedMansions
,01:288{319,1985.
[31]YuQingxiang.Applicationofstatisticsindreamoftheredchamber.
JournalofUni-
versityofPolitics
,76:303{327,1998.
[32]J.Rudman.Thestateofauthorshipattributionstudies:Someproblemsandsolutions.
ComputersandtheHumanities
,31(4):351{365,1997.
[33]EfstathiosStamatatos.Asurveyofmodernauthorshipattributionmethods.
J.Am.
Soc.Inf.Sci.Technol.
,60(3):538{556,March2009.
[34]FumitakeTakahashiandShigeoAbe.Optimizingdirectedacyclicgraphsupportvector
machines.
ANeuralNetworksinPatternRecognition(ANNPR)
,pages166{173,
2003.
63
[35]VVapnik.Estimationofdependenciesbasedonempiricaldata.1979.
[36]VladimirNaumovichVapnikandVlamimirVapnik.
Statisticallearningtheory
,volume2.
WileyNewYork,1998.
[37]ZhangWei-dongandLiuLi-chuan.Investigationofthelinguisticstylesofthe80
chaptersandthelast40chaptersofdreamoftheredchamber.
JournalofShenzhen
University
,1,1986.
[38]JasonWeston,ChrisWatkins,etal.Supportvectormachinesformulti-classpattern
recognition.In
ESANN
,volume99,pages219{224,1999.
[39]HuXiangfeng,WangYang,andWuQiang.Multipleauthorsdetection:aquantitative
analysisof
DreamoftheRedChamber
.
AdvancesinAdaptiveDataAnalysis
,6(4):Article
ID1450012,18pages,2014.
[40]LiXianping.Anewhypothesisonthewritingandpublicationofdreamofthered
chamber.
FudanJournalofSocialScienceEdition
,5,1987.
[41]GUdnyYule.Onsentence-lengthasastatisticalcharacteristicofstyleinprose:With
applicationtotwocasesofdisputedauthorship.
Biometrika
,pages363{390,1939.
64