KERNEL-BASED CLUSTERING OF BIG DATA

By

Radha Chitta

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2015

ABSTRACT

KERNEL-BASED CLUSTERING OF BIG DATA

By

Radha Chitta

There has been a rapid increase in the volume of digital data over the recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10²¹ bytes) of digital data by the year 2020. Analysis of this massive amount of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in the data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data is linearly separable in the input space, and use measures such as the Euclidean distance to define the inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are not linearly separable. Kernel-based clustering algorithms employ non-linear similarity measures to define the inter-point similarities. As a result, they are able to identify clusters of arbitrary shapes and densities. However, kernel-based clustering techniques suffer from two major limitations:

(i) Their running time and memory complexity increase quadratically with the increase in the size of the data set. They cannot scale up to data sets containing billions of data points.

(ii) The performance of kernel-based clustering algorithms is highly sensitive to the choice of the kernel similarity function. Ad hoc approaches, relying on prior domain knowledge, are currently employed to choose the kernel function, and it is difficult to determine the appropriate kernel similarity function for the given data set.

In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques. They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices corresponding to big data sets can usually be well-approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm, which uses a low-rank approximate kernel matrix, constructed from a uniformly sampled small subset of the data, to perform clustering. We show that the proposed algorithm has linear running time complexity and low memory requirements, and also achieves high cluster quality, when provided with a sufficient number of data samples. We also demonstrate that the proposed algorithm can be easily parallelized to handle distributed data sets. We then employ non-linear random feature maps to approximate the kernel similarity function, and design clustering algorithms which enhance the efficiency of kernel-based clustering, as well as label assignment for previously unseen data points.

Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded stream data in real-time. It intelligently samples the data stream and finds the cluster labels using these sampled points. The proposed scheme is more effective than the current kernel-based and linear stream clustering techniques, both in terms of efficiency and cluster quality.

We finally address the issues of high dimensionality and scalability to data sets containing a large number of clusters. Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix. The combination of sampling and sparsity further reduces the running time and memory complexity.
The proposed clustering algorithms can be applied in a number of real-world applications. We demonstrate the efficacy of our algorithms using several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g., Tiny) containing up to 80 million images. The proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter, for hashtag recommendation.

To My Family

ACKNOWLEDGMENTS

"Life is a continuous learning process. Each day presents an opportunity for learning." - Lailah Gifty Akita, Think Great: Be Great

Every day during my PhD studies has been a great opportunity for learning, thanks to my advisors, colleagues, friends, and family. I am very grateful to my thesis advisor Prof. Anil K. Jain, who has been a wonderful mentor. His ability to identify good research problems has always been my inspiration. I am motivated by his energy, discipline, meticulousness and passion for research. He has taught me to plan and prioritize my work, and present it in a convincing manner.

I am also very thankful to Prof. Rong Jin, with whom I had the privilege of working closely. Under his guidance, I have learnt how to formalize a problem, and develop coherent solutions to the problem, using different machine learning tools. I am inspired by his extensive knowledge and hard-working nature.

I would like to thank my PhD committee members, Prof. Pang-Ning Tan, Prof. Shantanu Chakrabartty, and Prof. Selin Aviyente for their valuable comments and suggestions. Prof. Pang-Ning Tan was always available when I needed help, and provided very useful suggestions.

I am grateful to several other researchers who have mentored me at various stages of my research. I have had the privilege of working with Dr. Suvrit Sra and Dr. Francesco Dinuzzo, at the Max Planck Institute for Intelligent Systems, Germany. I would like to thank them for giving me an insight into several emerging problems in machine learning. I thank Dr. Ganesh Ramesh from Edmodo for providing me the opportunity to learn more about natural language processing, and building scalable solutions. Dr. Timothy Havens was very helpful when we were working together during the first year of my PhD.

I would like to thank my labmates and friends: Shalini, Soweon, Serhat, Zheyun, Jinfeng, Mehrdad, Kien, Alessandra, Abhishek, Brendan, Jung-Eun, Sunpreet, Inci, Scott, Lacey, Charles, and Keyur. They made my life at MSU very memorable. I would like to specially thank Serhat for all the helpful discussions, and Soweon for her support and encouragement. I am thankful to Linda Moore, Cathy Davison, Norma Teague, Katie Trinklein, Courtney Kosloski and Debbie Kruch for their administrative support. Many thanks to the CSE and HPCC administrators, specially Kelly Climer, Adam Pitcher, Dr. Dirk Colbry, and Dr. Benjamin Ong.

Last but not the least, I would like to thank my family. I am deeply indebted to my husband Praveen, without whose support and motivation, I would not have been able to pursue and complete my PhD. My parents, my sister and my parents-in-law have been very supportive throughout the past five years. I was inspired by my father Ramamurthy to pursue higher studies, and strive to make him proud. I would like to specially mention my mother Sudha Lakshmi, who has been my role model and inspiration. I can always count on her to encourage me and uplift my spirits.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction
1.1 Data Analysis
1.1.1 Data Representation
1.1.2 Learning
1.1.3 Inference
1.2 Clustering
1.2.1 Clustering Algorithms
1.2.2 Challenges in Data Clustering
1.3 Clustering Big Data
1.3.1 Clustering with k-means
1.4 Kernel Based Clustering
1.4.1 Kernel k-means
1.4.2 Challenges
1.4.2.1 Scalability
1.4.2.2 Choice of kernel
1.5 Thesis Contributions
1.6 Datasets and Evaluation Metrics
1.6.1 Datasets
1.6.2 Evaluation Metrics
1.7 Thesis Overview

Chapter 2  Approximate Kernel-based Clustering
2.1 Introduction
2.2 Related Work
2.2.1 Low-rank Matrix Approximation
2.2.1.1 CUR matrix approximation
2.2.1.2 Nyström matrix approximation
2.2.2 Kernel-based Clustering for Large Datasets
2.3 Approximate Kernel k-means
2.3.1 Parameters
2.3.1.1 Sample size
2.3.1.2 Sampling strategies
2.3.2 Analysis
2.3.2.1 Computational complexity
2.3.2.2 Approximation error
2.3.3 Distributed Clustering
2.4 Experimental Results
2.4.1 Datasets
2.4.2 Baselines
2.4.3 Parameters
2.4.4 Results
2.4.4.1 Running time
2.4.4.2 Cluster quality
2.4.4.3 Parameter sensitivity
2.4.4.4 Sampling strategies
2.4.4.5 Scalability analysis
2.4.5 Distributed Approximate Kernel k-means
2.5 Summary

Chapter 3  Kernel-based Clustering Using Random Feature Maps
3.1 Introduction
3.2 Background
3.3 Kernel Clustering using Random Fourier Features
3.3.1 Analysis
3.3.1.1 Computational complexity
3.3.1.2 Approximation error
3.4 Kernel Clustering using Random Fourier Features in Constrained Eigenspace
3.4.1 Analysis
3.4.1.1 Computational complexity
3.4.1.2 Approximation error
3.4.2 Out-of-sample Clustering
3.5 Experimental Results
3.5.1 Datasets
3.5.2 Baselines
3.5.3 Parameters
3.5.4 Results
3.5.4.1 Running time
3.5.4.2 Cluster quality
3.5.4.3 Parameter sensitivity
3.5.4.4 Scalability
3.5.4.5 Out-of-sample clustering
3.6 Summary

Chapter 4  Stream Clustering
4.1 Introduction
4.2 Background
4.3 Approximate Kernel k-means for Streams
4.3.1 Sampling
4.3.2 Clustering
4.3.3 Label Assignment
4.4 Implementation and Complexity
4.5 Experimental Results
4.5.1 Datasets
4.5.2 Baselines
4.5.3 Parameters
4.5.4 Results
4.5.4.1 Clustering efficiency and quality
4.5.4.2 Parameter sensitivity
4.6 Applications: Twitter Stream Clustering
4.7 Summary

Chapter 5  Kernel-Based Clustering for Large Number of Clusters
5.1 Introduction
5.2 Background
5.3 Sparse Kernel k-means
5.4 Analysis
5.4.1 Computational Complexity
5.4.2 Approximation Error
5.5 Experimental Results
5.5.1 Datasets
5.5.2 Baselines and Parameters
5.5.3 Results
5.5.3.1 Running time
5.5.3.2 Cluster quality
5.5.3.3 Parameter sensitivity
5.5.3.4 Scalability
5.6 Summary

Chapter 6  Summary and Future Work
6.1 Contributions
6.2 Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Notation.

Table 1.2: Clustering techniques for Big Data.

Table 1.3: Popular kernel functions.
Table 1.4: Comparison of the running times of k-means and kernel k-means on a 100-dimensional synthetic data set containing 10 clusters and exponentially increasing number of data points, on a 2.8 GHz processor with 40 GB memory.

Table 1.5: Description of data sets used for evaluation of the proposed algorithms.

Table 2.1: Comparison of the confusion matrices of the approximate kernel k-means, kernel k-means and k-means algorithms for the two-dimensional semi-circles data set, containing 500 points (250 points in each of the two clusters). The approximate kernel k-means algorithm achieves cluster quality comparable to that of the kernel k-means algorithm.

Table 2.2: Running time (in seconds) of the proposed approximate kernel k-means and the baseline algorithms. The sample size m is set to 2,000, for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. An approximate value of the running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Table 2.3: Effect of the sample size m on the running time (in seconds) of the proposed approximate kernel k-means clustering algorithm.

Table 2.4: Comparison of sampling times (in milliseconds) of the uniform, column-norm and k-means sampling strategies on the CIFAR-10 and MNIST data sets. Parameter m represents the sample size.

Table 2.5: Performance of the distributed approximate kernel k-means algorithm on the Tiny image data set and the concentric circles data set, with parameters m = 1,000 and P = 1024.

Table 3.1: Comparison of the confusion matrices of the RFF, kernel k-means, and k-means algorithms for the two-dimensional semi-circles data set, containing 500 points (250 points in each of the two clusters).

Table 3.2: Running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. The parameter m, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms, is set to m = 2,000. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. An approximate value of the running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Table 3.3: Effect of the number of Fourier components m on the running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. Parameter m represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms.

Table 3.4: Running time (in seconds) and prediction accuracy (in %) for out-of-sample data points. Parameter m represents the sample size for the approximate kernel k-means algorithm and the number of Fourier components for the SV clustering algorithm. The value of m is set to 1,000 for both the algorithms. It is not feasible to execute the WKPCA algorithm on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size.
Table 4.1: Major published approaches to stream clustering.

Table 4.2: Effect of the maximum buffer size M on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, τ = 1.

Table 4.3: Effect of the maximum buffer size M on the Silhouette coefficient of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, τ = 1.

Table 4.4: Effect of the maximum buffer size M on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, τ = 1.

Table 4.5: Effect of the cluster lifetime threshold exp(−τ) on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, M = 20,000.

Table 4.6: Effect of the cluster lifetime threshold exp(−τ) on the Silhouette coefficient of the proposed approximate stream kernel k-means algorithm. Parameters: m = 5,000, M = 20,000.

Table 4.7: Effect of the cluster lifetime threshold exp(−τ) on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameters: m = 5,000, M = 20,000.

Table 4.8: Comparison of the performance of the approximate stream kernel k-means algorithm with importance sampling and Bernoulli sampling.

Table 5.1: Complexity of popular partitional clustering algorithms: n and d represent the size and dimensionality of the data respectively, and C represents the number of clusters. Parameter m > C represents the size of the sampled subset for the sampling-based approximate clustering algorithms. n_sv ≥ C represents the number of support vectors. DBSCAN and Canopy algorithms are dependent on user-defined intra-cluster and inter-cluster distance thresholds, so their complexity is not directly dependent on C.

Table 5.2: Running time (in seconds) of the proposed sparse kernel k-means and the three baseline algorithms on the four data sets. The parameters of the proposed algorithm were set to m = 20,000, M = 50,000, and p = 1,000. The sample size m for the approximate kernel k-means algorithm was set equal to 20,000 for the CIFAR-100 data set and 10,000 for the remaining data sets. It is not feasible to execute kernel k-means on the Imagenet-164, Youtube and Tiny data sets due to their large size. The approximate running time of kernel k-means on these data sets is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.
Table 5.3: Silhouette coefficient (×10⁻²) of the proposed sparse kernel k-means and the three baseline algorithms on the CIFAR-100 data set. The parameters of the proposed algorithm were set to m = 20,000, M = 50,000, and p = 1,000. The sample size m for the approximate kernel k-means algorithm was set equal to m = 20,000.

Table 5.4: Comparison of the running time (in seconds) of the proposed sparse kernel k-means algorithm and the approximate kernel k-means algorithm on the CIFAR-100 and the Imagenet-164 data sets. Parameter m represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel k-means algorithm. The remaining parameters of the proposed algorithm are set to M = 50,000, and p = 1,000. Approximate kernel k-means is infeasible for the Imagenet-164 data set when m > 10,000 due to its large size.

Table 5.5: Comparison of the silhouette coefficient (×10⁻²) of the proposed sparse kernel k-means algorithm and the approximate kernel k-means algorithm on the CIFAR-100 data set. Parameter m represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel k-means algorithm. The remaining parameters of the proposed algorithm were set to M = 50,000, and p = 1,000.

Table 5.6: Effect of the size of the neighborhood p on the running time (in seconds), the silhouette coefficient and NMI (in %) of the proposed sparse kernel k-means algorithm on the CIFAR-100 and Imagenet-164 data sets. The remaining parameters of the proposed algorithm were set to m = 20,000, and M = 50,000.

LIST OF FIGURES

Figure 1.1: Emerging size of the digital world. Image from [2].

Figure 1.2: Growth of Targeted Display Advertising. Image from [59].

Figure 1.3: A two-dimensional example to demonstrate hierarchical and partitional clustering techniques. Figure (a) shows a set of points in two-dimensional space, containing three clusters. Hierarchical clustering generates a dendrogram for the data. Figure (b) shows a dendrogram generated using the complete-link agglomerative hierarchical clustering algorithm. The horizontal axis represents the data points and the vertical axis represents the distance between the clusters when they first merge. By applying a threshold on the distance at 4 units (shown by the black dotted line), we can obtain the three clusters. Partitional clustering directly finds the C clusters in the data set. Figure (c) shows the three clusters, represented by the blue, green and red points, obtained using the k-means algorithm. The starred points in black represent the cluster centers.

Figure 1.4: A two-dimensional example that demonstrates the limitations of k-means clustering. 500 two-dimensional points containing two semi-circular clusters are shown in Figure (a). Points numbered 1-250 belong to the first cluster and points numbered 251-500 belong to the second cluster. The clusters obtained using k-means (using the Euclidean distance measure) do not reflect the true underlying clusters (shown in Figure (b)), because the clusters are not linearly separable as expected by the k-means algorithm. On the other hand, the kernel k-means algorithm using the RBF kernel (with kernel width σ² = 0.4) reveals the true clusters (shown in Figure (c)). Figures (d) and (e) show the 500 × 500 similarity matrices corresponding to the Euclidean distance and the RBF kernel similarity, respectively. The RBF kernel similarity matrix contains distinct blocks which distinguish between the points from different clusters. The similarity between the points in the same true cluster is higher than the similarity between points in different clusters.
The Euclidean distance matrix, on the other hand, does not contain such distinct blocks, which explains the failure of the k-means algorithm on this data.

Figure 1.5: Similarity of images expressed through gray level histograms. The histogram of the intensity values of the image of a website (Figure (b)) is very different from the histograms of the images of butterflies (Figures (d) and (f)). The histograms of the two butterfly images are similar to each other.

Figure 1.6: Sensitivity of the kernel k-means algorithm to the choice of kernel function. The semi-circles data set (shown in Figure (a)) is clustered using kernel k-means with the RBF kernel. When the kernel width is set to 0.4, the two clusters are correctly detected (shown in Figure (b)), whereas when the kernel width is set to 0.1, the points are clustered incorrectly (shown in Figure (c)). Figure (d) shows the variation in the clustering error of kernel k-means, defined in (1.10), with respect to the kernel width.

Figure 1.7: Scalability of clustering algorithms in terms of n, d and C, and the contribution of the proposed algorithms in improving the scalability of kernel-based clustering. The plot shows the maximum size of the data set that can be clustered with less than 100 GB memory on a 2.8 GHz processor with a reasonable amount of clustering time (less than 10 hours). The linear clustering algorithms are represented in blue, current kernel-based clustering algorithms are shown in green, parallel clustering algorithms are shown in magenta, and the proposed clustering algorithms are represented in red. Existing kernel-based clustering algorithms can cluster only up to the order of 10,000 points with 100 features into 100 clusters. The proposed batch clustering algorithms (approximate kernel k-means, RFF clustering, and SV clustering algorithms) are capable of performing kernel-based clustering on data sets as large as 10 million, with the same resource constraints. The proposed online clustering algorithms (approximate stream kernel k-means and sparse kernel k-means algorithms) can cluster arbitrarily-sized data sets with dimensionality in the order of 1,000 and the number of clusters in the order of 10,000.

Figure 2.1: Illustration of the approximate kernel k-means algorithm on the two-dimensional semi-circles data set containing 500 points (250 points in each of the two clusters). Figure (a) shows all the data points (in red) and the uniformly sampled points (in blue). Figures (b)-(e) show the process of discovery of the two clusters in the data set and their centers in the input space (represented by x) by the approximate kernel k-means algorithm.

Figure 2.2: Example images from three clusters in the Imagenet-34 data set. The clusters represent (a) butterfly, (b) odometer, and (c) website images.

Figure 2.3: Silhouette coefficient values of the partitions obtained using approximate kernel k-means, compared to those of the partitions obtained using the baseline algorithms. The sample size m is set to 2,000, for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm.

Figure 2.4: NMI values (in %) of the partitions obtained using approximate kernel k-means, with respect to the true class labels. The sample size m is set to 2,000, for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.
Figure 2.5: Example images from the clusters found in the CIFAR-10 data set using approximate kernel k-means. The clusters represent the following objects: (a) airplane, (b) automobile, (c) bird, (d) cat, (e) deer, (f) dog, (g) frog, (h) horse, (i) ship, and (j) truck.

Figure 2.6: Effect of the sample size m on the NMI values (in %) of the partitions obtained using approximate kernel k-means, with respect to the true class labels.

Figure 2.7: Effect of the sample size m on the Silhouette coefficient values of the partitions obtained using approximate kernel k-means.

Figure 2.8: Comparison of Silhouette coefficient values of the partitions obtained from approximate kernel k-means using the uniform, column-norm and k-means sampling strategies, on the CIFAR-10 and MNIST data sets. Parameter m represents the sample size.

Figure 2.9: Comparison of NMI values (in %) of the partitions obtained from approximate kernel k-means using the uniform, column-norm and k-means sampling strategies, on the CIFAR-10 and MNIST data sets. Parameter m represents the sample size.

Figure 2.10: Running time of the approximate kernel k-means algorithm for different values of (a) n, (b) d and (c) C.

Figure 3.1: A simple example to illustrate the RFF clustering algorithm. (a) Two-dimensional data set with 500 points from two clusters (250 points in each cluster), (b) Plot of the matrix H obtained by sampling m = 1 Fourier component. (c) Clusters obtained by executing k-means on H.

Figure 3.2: Silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms. The parameter m, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms, is set to m = 2,000.

Figure 3.3: NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, with respect to the true class labels. The parameter m, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms, is set to m = 2,000. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 3.4: Effect of the number of Fourier components m on the silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms. Parameter m represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms.
Figure 3.5: Effect of the number of Fourier components m on the NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, on the six benchmark data sets. Parameter m represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms.

Figure 3.6: Running time of the RFF clustering algorithm for different values of (a) n, (b) d and (c) C.

Figure 3.7: Running time of the SV clustering algorithm for different values of (a) n, (b) d and (c) C.

Figure 4.1: Schema of the proposed approximate stream kernel k-means algorithm.

Figure 4.2: Illustration of importance sampling on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles (100 points in each cluster), represented by "o" in Figure (a). Figure (b) shows 50 points sampled using importance sampling, and Figures (c) and (d) show 50 and 100 points selected using Bernoulli sampling, respectively. The sampled points are represented using "*". All the 10 clusters are well-represented by just 50 points sampled using importance sampling. On the other hand, 50 points sampled using Bernoulli sampling are not adequate to represent these 10 clusters (Cluster 4 in red has no representatives). At least 100 points are needed to represent all the clusters.

Figure 4.3: Running time (in milliseconds) of the stream clustering algorithms. The parameters for the proposed approximate stream kernel k-means algorithm are set to m = 5,000, M = 20,000, and τ = 1. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.4: Silhouette coefficient values of the partitions obtained using the proposed approximate stream kernel k-means algorithm. The parameters for the proposed algorithm were set to m = 5,000, M = 20,000, and τ = 1. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm were set to 5,000.

Figure 4.5: NMI (in %) of the clustering algorithms with respect to the true class labels.
The parameters for the proposed approximate stream kernel k-means algorithm are set to m = 5,000, M = 20,000, and τ = 1. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.6: Change in the NMI (in %) of the proposed approximate stream kernel k-means algorithm over time. The parameters m, M and τ were set to m = 5,000, M = 20,000 and τ = 1, respectively.

Figure 4.7: Effect of the initial sample size m on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter m represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters M and τ are set to M = 20,000 and τ = 1, respectively.

Figure 4.8: Effect of the initial sample size m on the silhouette coefficient values of the proposed approximate stream kernel k-means algorithm. Parameter m represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters M and τ are set to M = 20,000 and τ = 1, respectively.

Figure 4.9: Effect of the initial sample size m on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameter m represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters M and τ are set to M = 20,000 and τ = 1, respectively.

Figure 4.10: Sample tweets from the ASP.NET cluster.

Figure 4.11: Sample tweets from the HTML cluster.

Figure 4.12: Trending clusters in Twitter. The horizontal axis represents the timeline in days and the vertical axis represents the percentage ratio of the number of tweets in the cluster to the total number of tweets obtained on the day. Figure (a) shows the trends obtained by the proposed approximate stream kernel k-means algorithm, and Figure (b) shows the true trends.

Figure 5.1: Illustration of kernel sparsity on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles. Figure (a) shows all the data points (represented by "o") and Figure (b) shows the RBF kernel matrix corresponding to this data. Neighboring points have the same cluster label when the kernel is defined correctly for the data set.

Figure 5.2: Sample images from three of the 100 clusters in the CIFAR-100 data set obtained using the proposed algorithm.

Figure 5.3: NMI (in %) of the proposed sparse kernel k-means and the three baseline algorithms on the CIFAR-100 and Imagenet-164 data sets. The parameters of the proposed algorithm were set to m = 20,000, M = 50,000, and p = 1,000. The sample size m for the approximate kernel k-means algorithm was set equal to 20,000 for the CIFAR-100 data set and 10,000 for the Imagenet-164 data set. It is not feasible to execute kernel k-means on the Imagenet-164 data set, due to its large size. The approximate NMI value achieved by kernel k-means on the Imagenet-164 data set is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.
Figure 5.4: Comparison of the NMI (in %) of the proposed sparse kernel k-means algorithm and the approximate kernel k-means algorithm on the CIFAR-100 and the Imagenet-164 data sets. Parameter m represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel k-means algorithm. The remaining parameters of the proposed algorithm were set to M = 50,000, and p = 1,000. Approximate kernel k-means is infeasible for the Imagenet-164 data set when m > 10,000 due to its large size.

Figure 5.5: Effect of the number of clusters C on the running time (in seconds) of the proposed sparse kernel k-means algorithm.

Figure 5.6: Effect of the number of clusters C on the NMI (in %) of the proposed sparse kernel k-means algorithm.

Figure 5.7: Running time of the sparse kernel k-means clustering algorithm for different values of (a) n, (b) d and (c) C.

LIST OF ALGORITHMS

Algorithm 1: k-means
Algorithm 2: Kernel k-means
Algorithm 3: Approximate Kernel k-means
Algorithm 4: Distributed Approximate Kernel k-means
Algorithm 5: Meta-Clustering Algorithm
Algorithm 6: RFF Clustering
Algorithm 7: SV Clustering
Algorithm 8: Approximate Stream Kernel k-means
Algorithm 9: Sparse Kernel k-means
Algorithm 10: Approximate k-means

Chapter 1

Introduction

Over the past couple of decades, great advancements have been made in data generation, collection and storage technologies. This has resulted in a digital data explosion. Data is uploaded every day by billions of users to the web in the form of text, image, audio and video, through various media such as blogs, e-mails, social networks, photo and video hosting services. It is estimated that 204 million e-mail messages are exchanged every minute¹; over a billion users on Facebook share 4.75 billion pieces of content every half hour, including 350 million photos and 4 million videos²; and 300 hours of videos are uploaded to YouTube every minute³. In addition, a large amount of data about the web users and their web activity is collected by a host of companies like Google, Microsoft, Facebook and Twitter. This data is now popularly termed as Big Data [105].

Big data is formally defined as "high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization". It is characterized by the 3 V's: Volume, Velocity, and Variety. Volume indicates the scale of the data. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10²¹ bytes) of digital data by the year 2020 (see Figure 1.1) [2].

¹ http://mashable.com/2014/04/23/data-online-every-minute
² http://www.digitaltrends.com/social-media/according-to-facebook-there-are-350-million-photos-uploaded-on-the-social-network-daily-and-thats-just-crazy
³ https://www.youtube.com/yt/press/statistics.html

Figure 1.1: Emerging size of the digital world. Image from [2].
This boils down to about 2.3 zettabytes of data generated every day. Velocity relates to real-time processing of streaming data in applications like computer networks and stock exchanges. The New York Stock Exchange captures about 1 TB of trade information during each trading session. Real-time processing of this data can aid a trader in making important trade decisions. Variety pertains to the heterogeneity of the digital data. Both structured data such as census records and legal records, and unstructured data like text, images and videos from the web form part of big data. Specialized techniques may be needed to handle different formats of the data. Other attributes such as reliability, volatility and usefulness of the data have been added to the definition of big data over the years. Virtually every large business is interested in gathering large amounts of data from its customers and mining it to extract useful information in a timely manner. This information helps the business provide better service to its customers and increase its profitability.

About 23% of this humongous amount of digital data is believed to contain useful information that can be leveraged by companies, government agencies and individual users⁴. For instance, a partial "blueprint" of every user on the web can be created by combining the information from their Facebook/Google profiles, status updates, Twitter tweets, metadata of their photo and video uploads, web page visits, and all sorts of other minute data. This gives an insight into the interests and needs of the users, thereby allowing companies to target a select group of users for their products. Users prefer online advertisements that match their interests over random advertisements. Figure 1.2 shows the tremendous growth that has been achieved in targeted advertising over the years, as a consequence of using data analytics⁵ to understand the behavior of web users [59].

⁴ http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
⁵ Data analytics is the science of examining data with the purpose of inferring useful information, and making decisions and predictions based on the inferences. It encompasses a myriad of methodologies and tools to perform automated analysis of data [1].

Figure 1.2: Growth of Targeted Display Advertising. Image from [59].

Big Data analytics has also led to the development of new applications and services like Microsoft's HealthVault⁶, a platform that enables patients to compile personal health information from multiple sources into a single online repository, and coordinate their health management with other users. Applications such as Google Flu Trends⁷ and Dengue Trends⁸ predicted disease outbreaks well before the official CDC (US Centers for Disease Control and Prevention) and EISS (European Influenza Surveillance Scheme) reports were published, based on aggregated search activity, reducing the number of people affected by the disease [71].

⁶ https://www.healthvault.com/us/en/overview
⁷ http://www.google.org/flutrends
⁸ http://www.google.org/denguetrends

1.1 Data Analysis

Data analysis is generally divided into exploratory and confirmatory data analysis [174]. The purpose of exploratory analysis is to discover patterns and model the data. Exploratory data analysis is usually followed by a phase of confirmatory data analysis which aims at model validation. Several statistical methods have been proposed to perform data analysis. Statistical pattern recognition and machine learning is concerned with predictive analysis, which involves discovering relationships between objects and predicting future events, based on the knowledge obtained. Pattern recognition comprises three phases: data representation, learning and inference.

1.1.1 Data Representation

Data representation involves selecting a set of features to denote the objects in the data set. A d-dimensional vector x = (x_1, ..., x_d)^T denotes each object, where x_p, p ∈ [d], represents a feature. The features may be numerical, categorical or ordinal. For instance, a document may be represented using the words in the document, in which case each x_p denotes a word in the document. An image may be represented using the pixel intensity values; in this case, x_p is the numerical intensity value at the p-th pixel. The representation employed dictates the kind of analysis that can be performed on the data set, and the interpretation of the results of the analysis. Therefore, it is important to select the correct representation. In most applications, prior domain knowledge is useful in selecting the object representation. Recently, deep learning techniques have been employed to automatically learn the representation for objects [20].
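As a small illustration of the representation choices described above, the sketch below (a minimal example in plain Python; the two toy documents and the bag-of-words scheme are invented here for illustration, not taken from the thesis) maps each document to a d-dimensional count vector x = (x_1, ..., x_d)^T over a fixed vocabulary.

```python
from collections import Counter

# Toy corpus; each document is represented as a d-dimensional
# count vector over a fixed vocabulary (bag-of-words).
docs = ["big data needs scalable clustering",
        "kernel clustering finds non-linear clusters"]

# Build the vocabulary: one feature dimension per distinct word.
vocab = sorted({w for doc in docs for w in doc.split()})

def to_vector(doc):
    """Map a document to x = (x_1, ..., x_d), where x_p counts
    how often vocabulary word p occurs in the document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for doc in docs:
    print(to_vector(doc))
```

An image could be vectorized analogously by flattening its pixel intensities; in either case the chosen features determine which similarities the later analysis can express.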
1.1.1DataRepresentation Datarepresentationinvolvesselectingasetoffeaturesto denotetheobjectsinthedataset.A d -dimensionalvector x =( x 1 ;:::;x d ) > denoteseachobject,where x p ;p 2 [d ]representsa feature.Thefeaturesmaybenumerical,categoricalorordi nal.Forinstance,adocumentmay berepresentedusingthewordsinthedocument;inwhichcase each x p denotesawordinthe document.Animagemayberepresentedusingthepixelintens ityvalues.Inthiscase, x p is thenumericalintensityvalueatthe p th pixel.Therepresentationemployeddictatesthekindof analysisthecanbeperformedonthedataset,andtheinterpr etationoftheresultsofanalysis. Therefore,itisimportanttoselectthecorrectrepresenta tion.Inmostapplications,priordomain knowledgeisusefulinselectingtheobjectrepresentation .Recently,deeplearningtechniqueshave beenemployedtoautomaticallylearntherepresentationfo robjects[20]. 4 1.1.2Learning Afterasuitablerepresentationischosen,thedataisinput toalearningalgorithmwhichtsamodel tothedata. Thesimplestlearningtaskisthatof supervisedlearning ,alsotermedasclassication[97]. Thegoalofsupervisedlearningistoderiveafunctionthatm apsthesetofinputobjectstoasetof targets(classes),using labeled trainingdata.Forinstance,givenasetoftaggedimages,th elearner analyzestheimagesandlearnsafunctionmappingtheimages totheirtags.Supervisedlearning ndsuseinmanyapplicationssuchasobjectrecognition,sp amdetection,intrusiondetection,and machinetranslation. Unfortunately,onlyabout 3% ofthepotentiallyusefuldataonthewebislabeled(e.g.tag sfor objectsinimages),anditisextremelyexpensivetoobtaint helabelsforthemassiveamountofdata, makingsupervisedlearningdifcultinmostbigdataapplic ations[2].Oflate,crowdsourcingtools suchasAmazonMechanicalTurk 9 havebeenusedtoobtainlabelsforthedataitems,frommulti ple usersovertheweb[29].However,labelsobtainedthroughsu chapproachescanbeunreliableand ambiguous.Forexample,inthetaskofimagetaggingthrough crowdsourcing,oneusermaytag theimageofapoodlewiththelabelfidogfl,whereasanotherus ermaylabelitasfianimalfl(i.e. usageofhypernymsversushyponyms).Thesametagfijaguarflc ouldapplytoboththecaraswell astheanimal(polysemy).Spammerscanintentionallygener atewronglabelsleadingtonoisein thedata.Additionaleffortsareneededtohandletheseissu es[138,185]. Semi-supervisedlearning techniquesalleviatetheneedforlabelinglargedatasetsb yutiliz- ingalargepoolofunlabeledobjectsinconjunctionwithare lativelysmallsetoflabeledobjectsto learnaclassier[189].Ithasbeenfoundthattheclassier slearntthroughsemi-supervisedlearn- ingmethodscanbemoreaccuratethanthoselearntusinglabe leddataalone,becausetheunlabeled dataallowsthelearnertoexploretheunderlyingstructure ofthedata.Thoughsemi-supervised learningmethodsmitigatethelabelingproblemassociated withsupervisedlearningmethodsto 9 https://www.mturk.com/mturk 5 someextent,theyarestillsusceptibletosameissuesasthe supervisedlearningtechniques.More- over,itisexpensivetoobtainsupervisioninapplications suchasstockmarketanalysis,wherehigh levelofexpertiseisrequiredtoidentifythestocktrends[ 130]. Unsupervisedlearning tasksinvolvendingthehiddenstructureindata.Unlikesu pervised andsemi-supervisedlearning,thesetasksdonotrequireth edatatobelabeled,therebyavoiding thecostoftaggingthedataandallowingonetoleveragethea bundantdatacorpus.Examplesof unsupervisedlearningtasksincludedensityestimation,d imensionalityreduction,featureselection andextraction,andclustering[83]. 
Clustering,alsoknownasunsupervisedclassication,iso neoftheprimaryapproachesto unsupervisedlearning.Thepurposeofclusteringistodisc overthenaturalgroupingoftheinput objects.Oneofthegoalsofclusteringistosummarizeandco mpressthedata,leadingtoefcient organizationandconvenientaccessofthedata.Itisoftene mployedasaprecursortoclassication. Thedataisrstcompressedusingclustering,andasupervis edlearningmodelisbuiltusingonly thecompresseddata.Forinstance,intheimagetaggingprob lem,ifthelearnerwasonlyprovided withalargenumberofuntaggedimages,theimagescanbegrou pedintoclustersbasedonapre- denedsimilarity.Eachclustercanberepresentedbyasmal lsetofprototypeimages,andthe labelsfortheserepresentativeimagesobtainedthroughcr owdsourcing,whichcanthenbeusedto learnataggingfunctioninasupervisedmanner.Thisproces sischeaperandmorereliablethan obtainingthelabelsforalltheimages.Clusteringndsuse inamultitudeofapplicationssuch aswebsearch,socialnetworkanalysis,imageretrieval,ge neexpressionanalysis,marketanalysis andrecommendationsystems[90]. 1.1.3Inference Inthisphase,thelearntmodelisusedfordecisionmakingan dprediction,asrequiredbytheap- plication.Forexample,intheimagetaggingproblem,themo delcomprisingthemappingfunction canbeusedtopredictthetagscorrespondingtoanimagethel earnerhasnotseenpreviously.In 6 Table1.1Notation. Symbol Description D = f x 1 ;:::; x n g Inputdatasettobeclustered x i i th datapoint ˜ Inputspace H Featurespace/ReproducingKernelHilbertSpace(RKHS) kk H FunctionalnorminRKHS d Dimensionalityoftheinputspace n Numberofpointsinthedataset C Numberofclusters U =( u 1 ;:::; u C ) > Clustermembershipmatrix ( C n ) P = f U 2f 0 ;1 g C n :U > 1 = 1 g Setofvalidclustermembershipmatrices C k k th cluster c k k th clustercenter n k Numberofpointsinthe k th cluster ' Mappingfunctionfrom ˜ to H ( ; ) Kernelfunction K Kernelmatrix ( n n ) socialnetworks,clusteringisemployedtogroupusersbase dontheirgender,occupation,web activity,andotherattributes,toautomaticallynduserc ommunities[128].Basedonthecommu- nitiesidentied,recommendationsfornewconnectionsand contentcanbemadetotheusers. Inthisthesis,wefocusontheclusteringproblem.Notation susedthroughoutthisthesisare summarizedinTable1.1. 1.2Clustering Clustering,oneoftheprimaryapproachestounsupervisedl earning,isthetaskofgroupingaset ofobjectsintoclustersbasedonsomeuser-denedsimilari ty.Givenasetof n objectsrepresented by D = f x 1 ;:::; x n g ,whereeachpoint x i 2 ˜ and ˜ < d ,theobjectiveofclustering,inmost applications,istogroupthepointsinto C clusters,representedby fC 1 ;:::; C C g ,suchthatthe clustersreectthenaturalgroupingoftheobjects.Thede nitionofnaturalgroupingissubjective, 7 (a) 1 5231826 71322 412 215 616 31110 9 81417192520212430272928123456(b) (c) Figure1.3Atwo-dimensionalexampletodemonstratehierar chicalandpartitionalclusteringtech- niques.Figure(a)showsasetofpointsintwo-dimensionals pace,containingthreeclusters.Hier- archicalclusteringgeneratesadendrogramforthedata.Fi gure(b)showsadendrogramgenerated usingthecomplete-linkagglomerativehierarchicalclust eringalgorithm.Thehorizontalaxisrep- resentsthedatapointsandtheverticalaxisrepresentsthe distancebetweentheclusterswhenthey rstmerge.Byapplyingathresholdonthedistanceat 4 units(shownbytheblackdottedline), wecanobtainthethreeclusters.Partitionalclusteringdi rectlyndsthe C clustersinthedataset. Figure(c)showsthethreeclusters,representedbytheblue ,greenandredpoints,obtainedusing the k -meansalgorithm.Thestarredpointsinblackrepresentthe clustercenters. 
anddependentonanumberoffactorsincludingtheobjectsin thedataset,theirrepresentation, andthegoalofclusteranalysis.Themostcommonobjectivei stogroupthepointssuchthatthe similaritybetweenthepointswithinthesameclusterisgre aterthanthesimilaritybetweenthe pointsindifferentclusters.Thestructureoftheclusters obtainedisdeterminedbythedenitionof thesimilarity.Itisusuallydenedintermsofadistancefu nction d :˜ ˜ !< . 1.2.1ClusteringAlgorithms Historically,twotypeofclusteringalgorithmshavebeend eveloped:hierarchicalandparti- tional[88]. Hierarchicalclusteringalgorithms,asthenamesuggests, buildahierarchyofclusters;the rootofthetreecontainsallthe n pointsinthedataset,andtheleavescontaintheindividual points.Agglomerativehierarchicalclusteringalgorithm sstartwith n clusters,eachwithone 8 point,andrecursivelymergetheclusterswhicharemostsim ilartoeachother.Divisive hierarchicalclusteringalgorithms,ontheotherhand,sta rtwiththerootcontainingallthe datapoints,andrecursivelysplitthedataintoclustersin atop-downmanner.Themost well-knownhierarchicalclusteringalgorithmsarethesin gle-link,complete-linkandWard's algorithms[88].Thesingle-linkalgorithmdenesthesimi laritybetweentwoclustersasthe similaritybetweentheirmostsimilarmembers,whereasthe complete-linkalgorithmdenes thesimilaritybetweentwoclustersasthesimilarityofthe irmostdissimilarmembers.The Ward'sclusteringalgorithmrecursivelymergesthecluste rsthatleadstotheleastpossible increaseintheintra-clustervarianceaftermerging.Figu re1.3(b)showsthecomplete-link dendrogramcorrespondingtotheclustersinthetwo-dimens ionaldatasetinFigure1.3(a). Partitionalclusteringalgorithms,directlypartitionth edatainto C clusters,asshowninFig- ure1.3(c).Popularpartitionalclusteringalgorithmsinc ludecentroid-based( k -means, k - medoids)[87,94],model-based(Mixturemodels,LatentDir ichletAllocation)[24],graph- theoretic(MinimumSpanningTrees,Normalized-cut,Spect ralclustering)[77,161],and densityandgrid-based(DBSCAN,OPTICS,CLIQUE)algorithm s[61]. Fromastatisticalviewpoint,clusteringtechniquescanal sobecategorizedasparametricand non-parametric[127].Parametricapproachestoclusterin gassumethatthedataisdrawnfrom adensity p ( x ) whichisamixtureofparametricdensities,andthegoalofcl usteringistoiden- tifythecomponentdensities.Thecentroid-basedandmodel -basedclusteringalgorithmsfallin thiscategory.Non-parametricapproachesarebasedonthep remisethattheclustersrepresentthe modesofthedensity p ( x ) ,andtheaimofclusteringistodetectthehigh-densityregi onsinthe data.Themodalstructureof p ( x ) canbesummarizedina clustertree .Eachlevelinthecluster treerepresentsthefeaturespace L ( ;p )= f x j p ( x ) > g .Clustertreescanbeconstructedusing thesingle-linkclusteringalgorithmtobuildneighborhoo dgraphs,andndingtheconnectedcom- ponentsintheneighborhoodgraphs.Density-basedpartiti onalclusteringalgorithmssuchasDB- 9 SCANandOPTICS,arespecializednon-parametricclusterin gtechniques,whichndthemodes ataxeduser-deneddensitythreshold.Mean-shiftcluste ringalgorithmsestimatethedensity locallyateach x ,andndthemodesusingagradientascentprocedureonthelo caldensity. 1.2.2ChallengesinDataClustering Dataclusteringisadifcultproblem,asreectedbythehun dredsofclusteringalgorithmsthat havebeenpublished,andthenewonesthatcontinuetoappear .Duetotheinherentunsupervised natureofclustering,thereareseveralfactorsthataffect theclusteringprocess. Datarepresentation. 
Thedatacanbeinputtoclusteringalgorithmsintwoforms:( i)the n d patternmatrix containingthe d featurevaluesforeachofthe n objects,and(ii) the n n proximitymatrix ,whoseentriesrepresentthesimilarity/dissimilaritybe tweenthe correspondingobjects.Givenasuitablesimilaritymeasur e,itiseasytoconvertapattern matrixtotheproximitymatrix.Similarly,methodslikesin gularvaluedecompositionand multi-dimensionalscalingcanbeenusedtoapproximatethe patternmatrixcorrespondingto thegivenproximitymatrix[47].Conventionally,hierarch icalclusteringalgorithmsassume inputintheformoftheproximitymatrix,whereaspartition alclusteringalgorithmsaccept thepatternmatrixasinput. Thefeaturesusedtorepresentthedatainthepatternmatrix playanimportantroleinclus- tering.Iftherepresentationisgood,theclusteringalgor ithmwillbeabletondcompact clustersinthedata.Dimensionalityofthedatasetisalsoc rucialtothequalityofclustersob- tained.High-dimensionalrepresentationswithredundant andnoisyfeaturesnotonlyleadto longclusteringtimes,butmayalsodeterioratethecluster structureinthedata.Featureselec- tionandextractiontechniquessuchasforward/backwardse lectionandprincipalcomponent analysisareusedtodeterminethemostdiscriminativefeat ures,andreducethedimensional- ityofthedataset[89].Deeplearningtechniques[20]andke rnellearningtechniques[112] 10 canbeemployedtolearnthedatarepresentationfromthegiv endataset. Numberofclusters. Mostclusteringalgorithmsrequirethespecicationofthe numberof clusters C .Whilecentroid-based,model-basedandgraph-theoretica lgorithmsdirectlyac- ceptthenumberofclustersasinput,densityandgrid-based algorithmsacceptotherparam- eterssuchasthemaximuminter-clusterdistance,whichare indirectlyrelatedtothenumber ofclusters.Automaticallydeterminingthenumberofclust ersisadifcultproblemand, inpractice,domainknowledgeisusedtodeterminethispara meter.Severalheuristicshave beenproposedtoestimatethenumberofclusters.In[172],t henumberofclustersisdeter- minedbyminimizingthefigapflbetweentheclusteringerror 10 foreachvalueof C ,andthe expectedclusteringerrorofareferencedistribution.Cro ss-validationtechniquescanbeused tondthevalueof C atwhichtheerrorcurvecorrespondingtothevalidationdat aexhibits asharpchange[68]. ClusteringAlgorithm. Theobjectiveofclusteringdictatesthealgorithmchosenf orclus- tering,andinturn,thequalityandthestructureoftheclus tersobtained.Centroid-based clusteringalgorithmssuchas k -meansaimatminimizingthesumofthedistancesbetween thepointsandtheirrepresentativecentroids.Thisobject iveissuitableforapplicationswhere theclustersarecompactandhyper-sphericalorhyper-elli psoidal.Densitybasedalgorithms aimatndingthedenseregionsinthedata.Thesingle-linkh ierarchicalclusteringalgorithm ndslongelongatedclusterscalledfichainsfl,asthecriter ionformergingclustersislocal, whereasthecomplete-linkhierarchicalclusteringalgori thmndslargecompactclusters. Eachclusteringalgorithmisassociatedwithadifferentsi milaritymeasure. Similaritymeasures. Thesimilaritymeasureemployedbytheclusteringalgorith miscrucial tothestructureoftheclustersobtained.Thechoiceofthes imilarityfunctiondependsonthe datarepresentationscheme,andtheobjectiveofclusterin g.Apopulardistancefunctionis 10 RefertoSection1.3.1forthedenitionofclusteringerror . 11 thesquaredEuclideandistancedenedby d 2 ( x a ;x b )= jj x a x b jj 2 2 ;(1.1) where x a ;x b 2D .However,theEuclideandistanceisnotsuitableforallapp lications. 
OtherdistancemeasuressuchasMahalanobis,Minkowskiand non-lineardistancemeasures havebeenappliedintheliteraturetoimprovetheclusterin gperformanceinmanyapplica- tions[171](SeeSection1.4). ClusteringTendency,QualityandStability. Mostclusteringalgorithmswillndclustersin thegivendataset,evenifthedatadoesnotcontainanynatur alclusters.Thestudyofclus- teringtendencydealswithexaminingthedatabeforeexecut ingtheclusteringalgorithm,to determineifthedatacontainsanyclusters.Clusteringten dencyisusuallyassessedthrough visualassessmenttechniqueswhichreorderthesimilarity matrixtoexaminewhetherornot thedatacontainsclusters[85].Thesetechniquescanalsob eusedtodeterminethenumber ofclustersinthedataset. Afterobtainingtheclusters,weneedtoevaluatethevalidi tyandqualityoftheclusters. Severalmeasureshavebeenidentiedtoevaluatethecluste rsobtained,andthechoiceofthe qualitycriteriondependsontheapplication.Clustervali ditymeasuresarebroadlyclassied aseitherinternalorexternalmeasures[88].Internalmeas uressuchasthevalueofthe clusteringalgorithm'sobjectivefunctionandtheinter-c lusterdistancesassessthesimilarity betweentheclusterstructureandthedata.Asclusteringis anunsupervisedtask,itislogical toemployinternalmeasurestoevaluatethepartitions.How ever,thesemeasuresaredifcult tointerpretandoftenvaryfromoneclusteringalgorithmto another.Ontheotherhand, externalmeasuressuchaspredictionaccuracyandclusterp urityusepriorinformationlike thetrueclasslabelstoassesstheclusterquality.Externa lmeasuresaremorepopularlyused toevaluateandcomparetheclusteringresultsofdifferent clusteringalgorithms,astheyare 12 easiertointerpretthaninternalvaliditymeasures. Clusterstabilitymeasuresthesensitivityoftheclusters tosmallperturbationsinthedata set[119].Itisdependentonboththedatasetandthealgorit hmusedtoperformclustering. Clusteringalgorithmswhichgeneratestableclustersarep referredastheywillberobustto noiseandoutliersinthedata.Stabilityistypicallymeasu redusingdataresamplingtech- niquessuchasbootstrapping.Multipledatasetsofthesame size,generatedfromthesame probabilitydistribution,areclusteredusingthesamealg orithmandthesimilaritybetween thepartitionsofthesedatasetsisusedasameasureoftheal gorithm'sstability. Scalability. Inadditiontotheclusterquality,thechoiceofthecluster ingalgorithmisalso determinedbythescalabilityofthealgorithm.Thisfactor becomesallthemorecrucial whendesigningsystemsforbigdataanalysis.Twoimportant factorsthatdeterminethescal- abilityofaclusteringalgorithmareitsrunningtimecompl exityanditsmemoryfootprint. Clusteringalgorithmswhichhavelinearorsub-linearrunn ingtimecomplexity,andrequire minimumamountofmemoryaredesirable. 1.3ClusteringBigData Whenthesizeofthedataset n isintheorderofbillionsandthedimensionalityofthedata d is intheorderofthousands,asisthecaseinmanybigdataanaly ticsproblems,thescalabilityof thealgorithmbecomesanimportantfactorwhilechoosingac lusteringalgorithm.Hierarchical clusteringalgorithmsareassociatedwithatleast O ( n 2 d + n 2 log( n )) runningtimeand O ( n 2 ) memorycomplexity,whichrenderstheminfeasibleforlarge datasets.Thesameholdsformany ofthepartitionalclusteringalgorithmssuchasthemodelb asedalgorithmslikeLatentDirichlet Allocation,graph-basedalgorithmssuchasspectralclust eringanddensity-basedalgorithmslike DBSCAN.Theyhaverunningtimecomplexitiesrangingfrom O ( n log( n )) to O ( n 3 ) intermsofthe numberofpointsinthedata,andatleastlineartimecomplex itywithrespecttothedimensionality 13 Table1.2ClusteringtechniquesforBigData. 
Table 1.2 Clustering techniques for Big Data.

  Clustering approach                          | Running time complexity            | Memory complexity
  Linear clustering: k-means                   | O(nCd)                             | O(nd)
  Sampling-based clustering (sample size m << n):
      CLARA [94]                               | O(Cm^2 + C(n - C))                 | O(n^2)
      CURE [80]                                | O(m^2 log(m))                      | O(md)
      Coreset [82]                             | O(n + C polylog(n))                | O(nd)
  Compression:
      BIRCH [197]                              | O(nd)                              | M†
      CLARANS [136]                            | O(n^2)                             | O(n^2)
  Stream clustering:
      Stream [79], ClusTree [98]               | O(nCd)                             | M†
      Scalable k-means [30],
      Single-pass k-means [62]                 | O(nd)                              | M†
      StreamKM++ [6]                           | O(dns)*                            | O(ds log(n/s))*
  Distributed clustering (with P tasks):
      Parallel k-means [60, 199]               | O(nCd)                             | O(PC^2 n^δ), δ > 0
      MapReduce-based spectral clustering [35] | O(n^2 d / P + r^3 + nr + nC^2)**   | O(n^2 / P)
      Nearest-neighbor clustering [115]        | O(n log(n) / P)                    | O(n / P)

  * s = O(dC log(n) log^{d/2}(C log(n)))
  ** r represents the rank of the affinity matrix
  † M is a user-defined parameter representing the amount of memory available

Several clustering algorithms have been modified, and special algorithms have been developed in the literature, to scale up to large data sets. Most of these algorithms involve a preprocessing phase to compress or distribute the data before clustering is performed. Some of the popular methods to efficiently cluster large data sets (listed in Table 1.2) can be classified based on their preprocessing approach, as follows:

Sampling-based methods reduce the computation time by first choosing a subset of the given data set and then using this subset to find the clusters. The key idea behind all sampling-based clustering techniques is to obtain the cluster representatives using only the sampled subset, and then assign the remaining data points to the closest representative. The success of these techniques depends on the premise that the selected subset is an unbiased sample that is representative of the entire data set. This subset is chosen either randomly (CLARA [94], CURE [80]) or through an intelligent sampling scheme such as coreset sampling [82, 183]. Coreset-based clustering first finds a small set of weighted data points called the coreset, which approximates the given data set within a user-defined error margin, and then obtains the cluster centers using this coreset. In [63], it is proved that a coreset of size O(C^2/ε^4) is sufficient to obtain an O(1 + ε) approximation, where ε is the error parameter.

Clustering algorithms such as BIRCH [197] and CLARANS [136] improve the clustering efficiency by encapsulating the data set in special data structures, like trees and graphs, for efficient data access. For instance, BIRCH defines a data structure called the Clustering-Feature Tree (CF-Tree). Each leaf node in this tree summarizes a set of points whose inter-point distances are less than a user-defined threshold, by the sum of the points, the sum of the squares of the data points, and the number of points; each non-leaf node summarizes the same statistics for all its child nodes (a sketch of these summary statistics appears at the end of this section). The points in the data set are added incrementally to the CF-Tree. The leaf entries of the tree are then clustered using an agglomerative hierarchical clustering algorithm to obtain the final data partition. Other approaches summarize the data into kd-trees and R-trees for fast k-nearest neighbor search [115].

Stream clustering algorithms [8] are designed to operate in a single pass over an arbitrary-sized data set. Only the sufficient statistics (such as the mean and variance of the clusters, when the clusters are assumed to be drawn from a Gaussian mixture) of the data seen so far are retained, thereby reducing the memory requirements. One of the first stream clustering algorithms was proposed by Guha et al. [79]. They first summarize the data stream into a larger number of clusters than desired, and then cluster the centroids obtained in the first step.
Stream clustering algorithms such as CluStream [8], ClusTree [98], scalable k-means [30], and single-pass k-means [62] were built using a similar idea, containing an online phase to summarize the incoming data, and an offline phase to cluster the summarized data. The summarization is usually in the form of trees [8, 30], grids [32, 36] or coresets [6, 63]. For instance, the CluStream algorithm summarizes the data set into a CF-Tree, in which each node stores the linear sum and the squared sum of a set of points which are within a user-defined distance from each other. Each node represents a micro-cluster, whose center and radius can be found using the linear and squared sum values. The k-means algorithm is the algorithm of choice for the offline phase to obtain the final clusters.

With the evolution of cloud computing, parallel processing techniques for clustering have gained popularity [48, 60]. These techniques speed up the clustering process by first dividing the task into a number of independent sub-tasks that can be performed simultaneously, and then efficiently merging these solutions into the final solution. For instance, in [60], the MapReduce framework [148] is employed to speed up the k-means and the k-medians clustering algorithms. The data set is split among many processors and a small representative data sample is obtained from each of the processors. These representative data points are then clustered to obtain the cluster centers or medians. In parallel latent Dirichlet allocation, each task finds the latent variables corresponding to a different component of the mixture [133]. The Mahout platform [143] implements a number of parallel clustering algorithms, including parallel k-means, latent Dirichlet allocation, and mean-shift clustering [37, 133, 135, 199]. Billions of images were clustered using an efficient parallel nearest-neighbor clustering in [115].

Data sets of sizes close to a billion have been clustered using the parallelized versions of the k-means, nearest neighbor and spectral clustering algorithms. To the best of our knowledge, based on the published articles, the largest data set that has been clustered consisted of 1.5 billion images, each represented by a 100-dimensional vector containing the Haar wavelet coefficients [115]. They were clustered into 50 million clusters using the distributed nearest neighbor algorithm in 10 hours using 2,000 CPUs. Data sets that are big in both size (n) and dimensionality (d), like social-network graphs and web graphs, were clustered using subspace clustering algorithms and parallel spectral clustering algorithms [35, 181].
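To make the summarization idea used by BIRCH and CluStream concrete, the following minimal sketch shows how a clustering feature, assumed here to consist of exactly the count, linear sum and squared sum described above, supports incremental updates and recovery of the centroid and radius. It is our own illustration, not the actual BIRCH or CluStream implementation.

```python
import numpy as np

class ClusteringFeature:
    """The (N, LS, SS) summary stored in a CF-Tree node: the number of
    points, their linear sum, and the sum of their squared norms."""

    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)  # linear sum of the points
        self.ss = 0.0          # sum of squared norms of the points

    def add(self, x):
        """Absorb one point; CF summaries are also additive across nodes."""
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Root-mean-square distance of the points to the centroid,
        recovered from the stored statistics alone."""
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))
```

Because two CF summaries can be merged by adding their components, a node never needs to store its individual points, which is what keeps the memory footprint bounded.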
1.3.1 Clustering with k-means

Among the various O(n) running time clustering algorithms in Table 1.2, the most popular algorithm for clustering large scale data sets is the k-means algorithm [87]. It is easy to implement, simple and efficient. It is easy to parallelize, has relatively few parameters when compared to the other algorithms, and yields clustering results similar to many other clustering algorithms [192]. Millions of points can be clustered using k-means within minutes. Extensive research has been performed to solve the k-means problem and obtain strong theoretical guarantees with respect to its convergence and accuracy. For these reasons, we focus on the k-means algorithm in this thesis.

The key idea behind k-means is to minimize the clustering error, defined as the sum of the squared distances between the data points and the center of the cluster to which each point is assigned. This can be posed as the following min-max optimization problem:

min_{U ∈ P} max_{c_k ∈ X} Σ_{k=1}^{C} Σ_{i=1}^{n} U_{k,i} d^2(c_k, x_i),    (1.2)

where U = (u_1, ..., u_C)^T is the cluster membership matrix, c_k ∈ X, k ∈ [C] are the cluster centers, and the domain P = {U ∈ {0,1}^{C×n} : U^T 1 = 1}, where 1 is a vector of all ones. The most commonly used distance measure d(·,·) is the squared Euclidean distance measure, defined in (1.1). The k-means problem with the squared Euclidean distance measure is defined as

min_{U ∈ P} max_{c_k ∈ X} Σ_{k=1}^{C} Σ_{i=1}^{n} U_{k,i} ||c_k - x_i||_2^2.    (1.3)

Algorithm 1 k-means
1: Input:
   D = {x_1, ..., x_n}, x_i ∈ R^d: the set of n d-dimensional data points to be clustered
   C: the number of clusters
2: Output: Cluster membership matrix U ∈ {0,1}^{C×n}
3: Randomly initialize the membership matrix U with zeros and ones, ensuring that U^T 1 = 1.
4: repeat
5:    Compute the cluster centers c_k = (1 / u_k^T 1) Σ_{i=1}^{n} U_{k,i} x_i, k ∈ [C].
6:    for i = 1, ..., n do
7:       Find the closest cluster center k* for x_i, by solving
            k* = argmin_{k ∈ [C]} ||c_k - x_i||_2^2.
8:       Update the i-th column of U by U_{k,i} = 1 for k = k* and zero, otherwise.
9:    end for
10: until convergence is reached

The above problem (1.3) is an NP-complete integer programming problem, due to which it is difficult to solve [121]. A greedy approximate algorithm, proposed by Lloyd, solves (1.3) iteratively [116]. The centers are initialized randomly. In each iteration, every data point is assigned to the cluster whose center is closest to it, and then the cluster centers are recalculated as the means of the points assigned to the cluster, i.e. the k-th center c_k is obtained as

c_k = (1 / n_k) Σ_{i=1}^{n} U_{k,i} x_i,  k ∈ [C],    (1.4)

where n_k = u_k^T 1 is the number of points assigned to the k-th cluster. These two steps are repeated until the cluster labels of the data points do not change in consecutive iterations. This procedure is described in Algorithm 1. It has O(ndCl) running time complexity and O(nd) memory complexity, where l is the number of iterations required for convergence. Several methods have been developed in the literature to initialize the algorithm intelligently and ensure that the solution obtained is a (1 + ε)-approximation of the optimal solution of (1.3) [12, 101].
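For reference, the following is a minimal NumPy rendering of Algorithm 1. It is a sketch rather than the implementation used in our experiments; initializing the centers by sampling data points, as done here, is one common choice and stands in for the random membership initialization of step 3.

```python
import numpy as np

def kmeans(X, C, max_iter=100, seed=0):
    """Lloyd's algorithm (Algorithm 1): alternate between assigning each
    point to its nearest center (step 7) and recomputing every center as
    the mean of its points, eq. (1.4), until the labels stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=C, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # squared Euclidean distances of all points to all centers (n x C)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # step 10: convergence
        labels = new_labels
        for k in range(C):
            if np.any(labels == k):  # keep old center if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```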
1.4 Kernel Based Clustering

The issue of scalability can be addressed by using the large scale clustering algorithms described in Section 1.3. However, most of these algorithms, including k-means, are linear clustering algorithms, i.e. they assume that the clusters are linearly separable in the input space (e.g. the data set shown in Figure 1.3(a)) and define the inter-point similarities using measures such as the Euclidean distance. They suffer from the following two main drawbacks:

(i) Data sets that contain clusters that cannot be separated by a hyperplane in the input space cannot be clustered by linear clustering algorithms. For this reason, all the clustering algorithms in Table 1.2, with the exception of spectral clustering, are only able to find compact, well-separated clusters in the data. They are also not robust to noise and outliers in the data. Consider the example shown in Figure 1.4. The data set in Figure 1.4(a) contains 500 points in the form of two semi-circles. We expect a clustering algorithm to group the points in each semi-circle, and detect the two semi-circular clusters. The clusters resulting from k-means with Euclidean distance are shown in Figure 1.4(b). Due to the use of the Euclidean distance, the two-dimensional space is divided into two half-spaces, and the resulting clusters are separated by the black dotted line. Other Euclidean-distance based partitional algorithms also find similar incorrect partitions.

(ii) Non-linear similarity measures can be used to find arbitrarily shaped clusters, and are more suitable for real-world applications. For example, suppose two images are represented by their pixel intensity values. The images may be considered more similar to each other if they comprise similar pixel values, as shown in Figure 1.5. Thus, the difference between the images is reflected better by the dissimilarity of the image histograms than by the Euclidean distance between the pixel values [14, 106].

Figure 1.4 A two-dimensional example that demonstrates the limitations of k-means clustering. 500 two-dimensional points containing two semi-circular clusters are shown in Figure (a). Points numbered 1-250 belong to the first cluster and points numbered 251-500 belong to the second cluster. The clusters obtained using k-means (using the Euclidean distance measure) do not reflect the true underlying clusters (shown in Figure (b)), because the clusters are not linearly separable, as expected by the k-means algorithm. On the other hand, the kernel k-means algorithm using the RBF kernel (with kernel width σ^2 = 0.4) reveals the true clusters (shown in Figure (c)). Figures (d) and (e) show the 500 × 500 similarity matrices corresponding to the Euclidean distance and the RBF kernel similarity, respectively. The RBF kernel similarity matrix contains distinct blocks which distinguish between the points from different clusters. The similarity between the points in the same true cluster is higher than the similarity between points in different clusters. The Euclidean distance matrix, on the other hand, does not contain such distinct blocks, which explains the failure of the k-means algorithm on this data.

Figure 1.5 Similarity of images expressed through gray level histograms. The histogram of the intensity values of the image of a website (Figure (b)) is very different from the histograms of the images of butterflies (Figures (d) and (f)). The histograms of the two butterfly images are similar to each other.

The issue of non-linear separability is tackled using kernel functions (see http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html for a catalog of kernel functions). The key behind the success of kernel-based learning algorithms is the fact that any data set becomes linearly separable when projected to an appropriate high dimensional space.
Consider a non-linear function φ : X → H, which maps the points in the input space X to a high dimensional feature space H. The distance between the data points in this feature space can be defined in terms of the dot products of the projected points. For instance, the Euclidean distance between two points x_a and x_b in H is defined as

||φ(x_a) - φ(x_b)||_2^2 = <φ(x_a), φ(x_a)> + <φ(x_b), φ(x_b)> - 2 <φ(x_a), φ(x_b)>.

In practical applications, the dimensionality of H is extremely high, possibly infinite. Hence, the explicit computation of the mapping φ is highly computationally intensive and, in most cases, infeasible. This computation is avoided by replacing the dot product with a non-linear kernel function κ(·,·) : X × X → R. The distance between any two points is now defined in terms of the kernel function κ as

d^2(x_a, x_b) = κ(x_a, x_a) + κ(x_b, x_b) - 2 κ(x_a, x_b).    (1.5)

A kernel function κ is admissible if and only if it satisfies Mercer's condition [159, Theorem 2.10]. Informally stated, Mercer's theorem asserts that there exists a mapping φ and an expansion κ(x_a, x_b) = φ(x_a)^T φ(x_b) if and only if, for any function g(x) such that ∫ g(x)^2 dx is finite, we have

∫∫ κ(x_a, x_b) g(x_a) g(x_b) dx_a dx_b ≥ 0.

Such a kernel is known as a Mercer kernel or Reproducing Kernel, and the feature space H is called the Reproducing Kernel Hilbert Space (RKHS). The matrix K = [κ(x_i, x_j)], x_i, x_j ∈ D, is known as the kernel matrix or Gram matrix. The simplest kernel functions are positive definite kernels, whose corresponding kernel matrix is Hermitian and positive-definite. The Radial Basis Function (RBF) kernel, defined by

κ(x_a, x_b) = exp(-||x_a - x_b||_2^2 / (2σ^2)),  σ > 0,    (1.6)

is a popular positive-definite kernel function. It performs well on a large number of benchmark data sets. The parameter σ^2, known as the kernel width, scales the distance between the points. Table 1.3 lists some of the popular kernel functions. The chi-square kernel, the histogram intersection kernel and their variants are commonly used in image and video-related applications. String kernels are popular in text-mining applications. The remaining kernels in Table 1.3 are generic kernels. Using the linear kernel is the same as using the Euclidean distance measure.

Table 1.3 Popular kernel functions.

  Linear                  κ(x_a, x_b) = x_a^T x_b + c, for constant c
  Polynomial              κ(x_a, x_b) = (x_a^T x_b + c)^d, where d is the degree of the polynomial kernel
  RBF                     κ(x_a, x_b) = exp(-||x_a - x_b||_2^2 / (2σ^2)), where σ > 0 is the kernel width parameter
  Laplacian               κ(x_a, x_b) = exp(-||x_a - x_b|| / σ)
  Chi-square              κ(x_a, x_b) = 1 - Σ_{i=1}^{d} (x_a^i - x_b^i)^2 / (0.5 (x_a^i + x_b^i))
  Histogram intersection  κ(x_a, x_b) = Σ min(hist(x_a), hist(x_b))
  String kernel           Number of common subsequences between string sequences x_a and x_b

Kernel based clustering techniques use (1.5) to define the similarity between objects. Consequently, when provided with the appropriate kernel function, they have the ability to capture the non-linear structure in real world data sets and, thus, usually perform better than the linear clustering algorithms in terms of cluster quality [95]. Various kernel-based clustering algorithms have been developed, including kernel k-means, spectral clustering, support vector clustering, maximum margin clustering, kernel self-organizing maps and kernel neural gas [65].
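The following toy sketch illustrates (1.5) and (1.6): the feature-space distance is evaluated through kernel calls alone, without ever constructing the mapping φ. The names are illustrative only.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma2=0.4):
    """RBF kernel (1.6): kappa(xa, xb) = exp(-||xa - xb||^2 / (2 sigma^2))."""
    diff = xa - xb
    return np.exp(-float(diff @ diff) / (2.0 * sigma2))

def kernel_distance_sq(xa, xb, kappa):
    """Squared feature-space distance via the kernel trick, eq. (1.5):
    d^2 = kappa(xa, xa) + kappa(xb, xb) - 2 kappa(xa, xb)."""
    return kappa(xa, xa) + kappa(xb, xb) - 2.0 * kappa(xa, xb)

xa, xb = np.array([0.0, 1.0]), np.array([1.0, 0.0])
print(kernel_distance_sq(xa, xb, rbf_kernel))  # phi is never materialized
```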
Spectral clustering [118] is based on the idea of spectral graph partitioning. The data points are represented as nodes in a graph, and the affinity between the nodes is defined by the kernel similarity between the points. The graph is partitioned into C components by first computing the graph Laplacian matrix and the eigenvectors corresponding to its smallest C eigenvalues, and then clustering the eigenvectors into C clusters using k-means. The data partition is obtained via the graph partition. Spectral clustering is widely employed for image segmentation and graph partitioning problems.

Support vector clustering [19] involves projecting the data to a high dimensional feature space and searching for a minimum enclosing sphere in this space. This enclosing sphere is projected back into the input space, and the support vectors are used to define the cluster boundaries. All points that lie within a cluster boundary are assigned to the same cluster. The maximum margin clustering technique [190] finds the cluster labeling which, when used to find a maximum margin classifier (e.g. a Support Vector Machine) for the given data, results in a margin that is maximal over all possible cluster labelings. A convex optimization problem is formulated with the cluster labels and the margin of the Support Vector Machine as variables, and with constraints on the number of points per cluster and the difference in the cluster sizes. The labels and the classifier margin are optimized simultaneously to find the optimal cluster labels.

The kernel self-organizing map [120] algorithm extends the self-organizing map [96] algorithm to use kernel distance measures. The key idea behind this algorithm is to construct a low-dimensional (typically two-dimensional) topology-preserving map of the input data set through competitive learning. A self-organizing map, also known as the Kohonen map, consists of a two-layer network: an input layer containing d nodes and an output layer containing at least C nodes. Each output node is randomly initialized with a weight. When a new data point is input to the network, the node whose weight is closest to the input data point, in terms of the kernel distance, is determined. This node is called the Best Matching Unit. The weights of the best matching unit and its neighboring nodes are updated, based on a pre-defined neighborhood function. After a number of passes over the data set, the weights of the nodes converge to form distinctive regions in the output layer, from which the clusters in the data can be read off.

The kernel neural gas algorithm [145], inspired by the self-organizing map algorithm, also creates a map of the input data. The difference between the two methods is that while only the weights of a few neighboring nodes of the best matching unit are updated in a self-organizing map, the weights of all the nodes are updated in the neural gas algorithm. The nodes are ranked based on their proximity to the best matching unit, and their weights are updated on the basis of their rank: the nearest node is updated by a higher factor than the farthest node. This update mechanism leads to neural gas converging faster than self-organizing maps.

Similar to the k-means algorithm, the kernel k-means algorithm [160] is the most popular kernel-based clustering algorithm, due to its simplicity. Several studies have also established the theoretical equivalence of kernel k-means and other kernel-based clustering methods, suggesting that they yield similar results [51, 52].

1.4.1 Kernel k-means

The kernel k-means algorithm can be viewed as a non-linear extension of the k-means algorithm. It replaces the Euclidean distance function (1.1) employed in the k-means algorithm with the non-linear kernel distance function defined in (1.5).

Let K ∈ R^{n×n} be the kernel matrix with K_{i,j} = κ(x_i, x_j), where κ(·,·) is the kernel function.
Let H be the Reproducing Kernel Hilbert Space (RKHS) endowed by the kernel function κ(·,·), and let ||·||_H be the functional norm for H. Similar to the k-means problem, the objective of kernel k-means is to minimize the clustering error. Hence, the kernel k-means problem can be cast as the following optimization problem:

min_{U ∈ P} max_{{c_k(·) ∈ H}} Σ_{k=1}^{C} Σ_{i=1}^{n} U_{k,i} ||c_k(·) - κ(x_i, ·)||_H^2,    (1.7)

where U = (u_1, ..., u_C)^T is the cluster membership matrix, c_k(·) ∈ H, k ∈ [C] are the cluster centers, and the domain P = {U ∈ {0,1}^{C×n} : U^T 1 = 1}, where 1 is a vector of all ones. The above problem is also NP-complete. A simplified version of the problem, which relaxes the constraints on U, is solved to obtain the solution [72, 192]. Let n_k = u_k^T 1 be the number of data points assigned to the k-th cluster, and let

Û = (û_1, ..., û_C)^T = [diag(n_1, ..., n_C)]^{-1} U,
Ũ = (ũ_1, ..., ũ_C)^T = [diag(√n_1, ..., √n_C)]^{-1} U    (1.8)

denote the ℓ1 and ℓ2 normalized membership matrices, respectively.

Algorithm 2 Kernel k-means
1: Input:
   D = {x_1, ..., x_n}, x_i ∈ R^d: the set of n d-dimensional data points to be clustered
   κ(·,·) : X × X → R: the kernel function
   C: the number of clusters
2: Output: Cluster membership matrix U ∈ {0,1}^{C×n}
3: Compute the kernel matrix K = [κ(x_i, x_j)]_{n×n}.
4: Randomly initialize the membership matrix U with zeros and ones, ensuring that U^T 1 = 1.
5: repeat
6:    for i = 1, ..., n do
7:       Find the closest cluster center k* for x_i, by solving
            k* = argmin_{k ∈ [C]} ||c_k(·) - κ(x_i, ·)||_H^2
               = argmin_{k ∈ [C]} (u_k^T K u_k) / (u_k^T 1)^2 - 2 (u_k^T K_i) / (u_k^T 1),
         where u_k is the k-th column of U^T, and K_i is the i-th column of K.
8:       Update the i-th column of U by U_{k,i} = 1 for k = k* and zero otherwise.
9:    end for
10: until convergence is reached

It is easy to verify that, given the C × n cluster membership matrix U, the optimal solution for the cluster centers is

c_k(·) = Σ_{i=1}^{n} Û_{k,i} κ(x_i, ·),  k ∈ [C].    (1.9)

As a result, we can formulate (1.7) as the following optimization problem over U:

min_{U ∈ P} tr(K) - tr(Ũ K Ũ^T),    (1.10)

which can be further reformulated as the following trace maximization problem:

max_{U ∈ P} tr(Ũ K Ũ^T).    (1.11)

Note that the k-means optimization problem in (1.3) can also be written as the following trace maximization problem:

max_{U ∈ P} tr(Ũ X X^T Ũ^T),    (1.12)

where X = (x_1, ..., x_n)^T is the n × d pattern matrix corresponding to the data set D. Therefore, a greedy iterative algorithm similar to the k-means algorithm can be employed to solve (1.11), with the Euclidean distance function replaced by the kernel similarity function.

The kernel k-means algorithm is described in Algorithm 2. Figure 1.4(c) shows the result of applying the kernel k-means algorithm to the synthetic semi-circles data set in Figure 1.4(a), using the RBF kernel function in (1.6), with the kernel width σ^2 set to 0.4. It can be observed that kernel k-means is able to detect the two semi-circles correctly, unlike the k-means algorithm.
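A compact NumPy sketch of Algorithm 2, operating on a precomputed kernel matrix, is given below. It implements the distance expression of step 7 (the constant κ(x_i, x_i) term is dropped, since it does not affect the argmin) and is meant as an illustration rather than the exact implementation used in our experiments.

```python
import numpy as np

def kernel_kmeans(K, C, max_iter=100, seed=0):
    """Kernel k-means (Algorithm 2) on a precomputed n x n kernel matrix.
    Distances to centers are computed purely from K, as in step 7; the
    constant K[i, i] term is dropped since it does not affect the argmin."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, C, size=n)  # random initial memberships
    for _ in range(max_iter):
        U = np.zeros((C, n))
        U[labels, np.arange(n)] = 1.0          # C x n membership matrix
        nk = np.maximum(U.sum(axis=1), 1.0)    # cluster sizes u_k^T 1
        quad = np.einsum('ki,ij,kj->k', U, K, U) / nk ** 2  # u_k^T K u_k / n_k^2
        cross = 2.0 * (U @ K) / nk[:, None]                 # 2 u_k^T K_i / n_k
        new_labels = (quad[:, None] - cross).argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```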
1.4.2 Challenges

Though kernel based clustering algorithms achieve better cluster quality, they suffer from two major limitations.

1.4.2.1 Scalability

A naive implementation of kernel k-means requires the computation of the n × n kernel matrix K (Step 3 in Algorithm 2), which takes O(n^2) time and memory. Clustering millions of objects using kernel k-means requires more than 8,000 GB of memory and a large amount of computing resources. Table 1.4 compares the running times of the k-means and the kernel k-means algorithms on a 100-dimensional synthetic data set containing 10 clusters and an exponentially increasing number of points. The algorithms were executed on a 2.8 GHz processor with 40 GB memory. It can be seen that running kernel k-means is far more expensive than running k-means, especially on large data sets.

Table 1.4 Comparison of the running times of k-means and kernel k-means on a 100-dimensional synthetic data set containing 10 clusters and an exponentially increasing number of data points, on a 2.8 GHz processor with 40 GB memory.

  Data set size                        10^4    10^5     10^6         10^7     10^8
  Running time    k-means              0.03    0.17     2.30         34.90    5508.50
  in seconds      Kernel k-means       3.09    320.10   > 48 hours   -        -

It is also expensive to assign previously unseen data points to clusters using kernel k-means, a task often termed the out-of-sample problem. To find the cluster label for a new data point x, we need to compute the distance between x and all the cluster centers as follows:

d^2(x, c_k) = ||c_k(·) - κ(x, ·)||_H^2
            = (u_k^T K u_k) / (u_k^T 1)^2 - 2 (u_k^T K_x) / (u_k^T 1),  k ∈ [C],    (1.13)

where K_x = (κ(x, x_1), ..., κ(x, x_n))^T. This requires the computation of the O(n)-sized vector K_x, in addition to the kernel matrix K. This is due to the fact that there is no explicit representation for the cluster centers. If there was a d-dimensional representation for the cluster centers c_k (as in the case of k-means), the distance d^2(x, c_k) could be computed in O(d) time (a sketch illustrating this cost appears at the end of this subsection).

Clearly, scalability is a major challenge faced by kernel k-means. Other kernel-based algorithms also have high running time complexity. For instance, spectral clustering involves the computation of the top C eigenvectors of the kernel matrix, which is of O(n^3) complexity.

In the literature, the issue of scalability has largely been addressed through the use of cloud computing and parallel algorithms. The Mahout platform [143] implements the parallel spectral clustering algorithm, which uses the distributed Lanczos eigensolver to obtain the eigenvectors of the Laplacian matrix [35]. Distributed implementations of Support Vector Machines have been developed to perform clustering [78, 169]. However, parallelization of kernel based algorithms is not simple, due to their non-linear nature [15]. For instance, in order to parallelize kernel k-means, one must replicate the data to all the tasks, leading to large resource and communication overheads.

Approximate clustering techniques are useful in alleviating this issue. Sampling methods, such as the Nystrom method [187], have been employed to obtain a low rank approximation of the kernel matrix to address this challenge [67, 113]. Low-dimensional projection combined with sampling has been used to further improve the clustering efficiency and tackle the out-of-sample problem [11, 153].
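As referenced above, the following sketch illustrates the cost described in (1.13): assigning a single unseen point requires n kernel evaluations, because the centers exist only implicitly through the membership matrix. The names are illustrative.

```python
import numpy as np

def assign_new_point(x, X, U, K, kappa):
    """Label an unseen point via (1.13). The centers have no explicit
    representation, so the full n-vector K_x of similarities between x
    and every stored training point must be evaluated first."""
    Kx = np.array([kappa(x, xi) for xi in X])   # O(n) kernel evaluations
    nk = np.maximum(U.sum(axis=1), 1.0)
    quad = np.einsum('ki,ij,kj->k', U, K, U) / nk ** 2
    cross = 2.0 * (U @ Kx) / nk
    return int((quad - cross).argmin())         # kappa(x, x) is constant in k
```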
1.4.2.2 Choice of kernel

The role of the kernel function is to reflect the true structure of the data set. However, if the kernel function is chosen wrongly, the performance of the clustering algorithm degrades. The RBF kernel defined in (1.6) performs well on most benchmark data sets; however, even for the RBF kernel, the kernel width parameter has to be chosen carefully. Figure 1.6 demonstrates the sensitivity of kernel k-means to the kernel width parameter. Kernel k-means is executed on the semi-circles data set shown in Figure 1.4(a), using the RBF kernel with kernel width values 0.4 and 0.1. The clusters obtained are shown in Figures 1.6(b) and 1.6(c), respectively. When σ^2 = 0.4, the true clusters are revealed. On the other hand, when σ^2 = 0.1, the clusters are distorted. Figure 1.6(d) plots the clustering error of kernel k-means, defined in (1.10), against the RBF kernel width. It is clear that the performance depends on the choice of the kernel width. Hence, another challenge associated with kernel based algorithms is the choice of the kernel function and the kernel parameters.

Figure 1.6 Sensitivity of the kernel k-means algorithm to the choice of kernel function. The semi-circles data set (shown in Figure (a)) is clustered using kernel k-means with the RBF kernel. When the kernel width is set to 0.4, the two clusters are correctly detected (shown in Figure (b)), whereas when the kernel width is set to 0.1, the points are clustered incorrectly (shown in Figure (c)). Figure (d) shows the variation in the clustering error of kernel k-means, defined in (1.10), with respect to the kernel width.

Kernel learning techniques aim at learning a positive semi-definite kernel matrix that reflects the true similarity between the points in the data set [4]. In the supervised learning setting, the kernel is optimized to align with the true class structure of the data. This is achieved by either minimizing the error of the classifier for the chosen kernel, or maximizing the similarity between the kernel and the class matrix. As the class labels are not available in the setting of unsupervised learning, other criteria, such as the compactness of the clusters in the feature space and the degree of alignment with the structure of the data, are utilized [112, 177, 200].

1.5 Thesis Contributions

The objective of this thesis is to design clustering algorithms that can accurately identify the clusters in data sets containing billions of points, thousands of features and thousands of clusters. As kernel-based clustering algorithms generally achieve high cluster quality, provided the correct kernel function is chosen, we address the scalability challenge associated with kernel based clustering algorithms. Our main contribution is the development of efficient approximations of the kernel k-means algorithm to enable kernel based clustering of large data sets. We demonstrate analytically and empirically that the proposed approximate algorithms are comparable to kernel k-means in terms of accuracy and, at the same time, comparable to k-means in terms of efficiency, achieving the desired trade-off between scalability and accuracy. We then extend the proposed approximate algorithms to handle distributed and streaming data, pushing the limits on the number of objects that can be clustered accurately with limited computing and memory resources. Figure 1.7 shows the scalability of some of the popular linear and kernel-based clustering algorithms in terms of n, d and C, and the contribution of the proposed clustering algorithms in improving the scalability of kernel-based clustering.
In the following, we describe the specific contributions of each chapter:

Figure 1.7 Scalability of clustering algorithms in terms of n, d and C, and the contribution of the proposed algorithms in improving the scalability of kernel-based clustering. The plot shows the maximum size of the data set that can be clustered with less than 100 GB memory on a 2.8 GHz processor with a reasonable amount of clustering time (less than 10 hours). The linear clustering algorithms are represented in blue, current kernel-based clustering algorithms are shown in green, parallel clustering algorithms are shown in magenta, and the proposed clustering algorithms are represented in red. Existing kernel-based clustering algorithms can cluster only up to the order of 10,000 points with 100 features into 100 clusters. The proposed batch clustering algorithms (approximate kernel k-means, RFF clustering, and SV clustering algorithms) are capable of performing kernel-based clustering on data sets as large as 10 million, with the same resource constraints. The proposed online clustering algorithms (approximate stream kernel k-means and sparse kernel k-means algorithms) can cluster arbitrarily-sized data sets with dimensionality in the order of 1,000 and the number of clusters in the order of 10,000.

Chapters 2 and 3 address the scalability of kernel k-means using kernel approximation techniques. The computational demand of kernel k-means stems from the fact that it computes an n × n kernel matrix K, leading to O(n^2) running time and memory complexity. This can be alleviated by replacing the kernel matrix K with an approximate matrix which can be computed more efficiently. In Chapter 2, we first present a randomized algorithm, called approximate kernel k-means, which replaces K with a low rank approximate kernel matrix. Its complexity is linear in terms of n, while its clustering performance is equivalent to that of kernel k-means. We then extend the proposed approximate algorithm to handle large data sets in a distributed environment. In Chapter 3, we propose two clustering algorithms, RFF clustering and SV clustering, which employ random feature maps [92, 147] to obtain low-dimensional representations for the data points, such that the dot product of any two points in the low-dimensional space approximates the kernel similarity between them. This allows us to execute a linear clustering algorithm on the transformed data points. The SV clustering algorithm has a lower running time than the approximate kernel k-means algorithm. It also allows the explicit computation of the cluster centers, leading to an efficient solution to the out-of-sample clustering problem. We demonstrate that it is possible to cluster billions of data points efficiently and accurately using the algorithms proposed in these two chapters. For instance, we were able to cluster a synthetic data set containing 1 billion 10-dimensional points using the distributed approximate kernel k-means algorithm in 15 minutes (on a computing cluster with 1,024 2.8 GHz processors and shared 40 GB memory), with high cluster quality (80% accuracy in terms of NMI; see Section 1.6.2 for the definition of NMI). It would take many days to cluster this data set using kernel k-means and other kernel-based clustering algorithms, while linear clustering algorithms like k-means cannot achieve comparable accuracy.

Batch clustering algorithms such as k-means and kernel k-means are iterative in nature, and need to access the input data points multiple times.
However, many data sets are too large to load into the memory, so it would not only be prohibitively expensive to perform multiple passes over the data, but also infeasible to compute the kernel matrix. Some applications, such as social network analysis and intrusion detection in networks, involve potentially unbounded sequences of data points called data streams. Only a small subset of the data can be stored, depending on the size of the data buffer; due to this, each data point can be accessed at most once. This data also evolves over time, so the data points that arrived recently have higher relevance than the older data. There have been relatively few efforts to apply kernel based clustering to data streams, due to the cost of computing the kernel. In Chapter 4, we present an efficient algorithm called approximate stream kernel k-means, to perform kernel clustering on stream data. The key idea is to construct the kernel matrix dynamically using importance sampling, and assign labels to the incoming data points in real-time. We use several benchmark data sets to simulate stream data sets, and evaluate the performance of the proposed algorithm on these data sets. We demonstrate that our algorithm is able to cluster stream data sets in real-time, with speeds up to 8 MBps.

Document and image data sets contain millions of high-dimensional points and usually belong to a large number of categories. Finding clusters in such data sets is computationally expensive using kernel-based clustering techniques, because they have quadratic running time complexity in terms of the number of data points, and linear time complexity in terms of the number of dimensions and the number of clusters. Although the approximate kernel clustering algorithms discussed in Chapters 2-4 reduce the running time complexity in terms of the number of data points, their clustering time grows linearly with the number of clusters. In Chapter 5, we present the sparse kernel k-means algorithm, which can efficiently cluster large data sets into thousands of clusters, with significantly lower processing and memory requirements and high clustering accuracy. It assumes that the kernel matrix is sparse when the number of clusters is large, and constructs a sparse kernel matrix for a subset of the data set, sampled incrementally using importance sampling. Cluster labels are obtained by clustering this sparse kernel matrix in a low dimensional space spanned by its top eigenvectors. This algorithm has running time complexity linear in the size and the dimensionality of the data set, and logarithmic in the number of clusters.

1.6 Datasets and Evaluation Metrics

1.6.1 Datasets

To demonstrate the effectiveness of the proposed algorithms, we use several benchmark data sets of different sizes and dimensionalities, from several domains. The description of the data sets is summarized in Table 1.5.

Table 1.5 Description of data sets used for evaluation of the proposed algorithms.

  Dataset                   | Number of data points n | Dimensionality d | Number of clusters C
  CIFAR-10 [99]             | 60,000                  | 384              | 10
  CIFAR-100 [99]            | 60,000                  | 384              | 100
  MNIST [108]               | 70,000                  | 784              | 10
  Forest Cover Type [23]    | 581,012                 | 54               | 7
  Imagenet-34 [49]          | 949,401                 | 900              | 34
  Imagenet-164 [49]         | 1,262,102               | 900              | 164
  Poker [33]                | 1,025,010               | 30               | 10
  Network Intrusion [167]   | 4,897,988               | 50               | 10
  Youtube                   | 10,143,254              | 6,647            | N/A
  Tiny [173]                | 79,302,017              | 384              | N/A
  Twitter                   | 1,000,000,000           | 8,042            | N/A
  Concentric circles        | 100 to 1,000,000,000    | 10 to 1,000      | 10 to 1,000

MNIST [108]: The MNIST data set is a subset of the database of handwritten digits available from NIST. It contains 70,000 images from 10 classes, each class representing one of the digits 0-9. Each image is represented as a 784-dimensional feature vector containing the pixel intensity values.
Forest Cover Type [23]: This data set is composed of cartographic variables obtained from the US Geological Survey (USGS) and the US Forest Service (USFS) data. Each of the 581,012 data points represents the attributes of a 30 × 30 square meter cell of the forest floor. There are a total of 12 attributes, including qualitative measures like soil type and wilderness area, and quantitative measures like slope, elevation, and distance to hydrology. These 12 attributes are represented using 54 features. The data are grouped into 7 classes, each representing a different forest cover type. The true cover type was determined from the USFS Region 2 Resource Information System (RIS) data.

Imagenet [49]: The Imagenet data set contains about 14 million images organized according to the Wordnet hierarchy [64]. Each node in this hierarchy represents a concept known as a "synset". We downloaded 1,262,102 images from 1,000 synsets, and merged the leaf nodes in the synset tree based on their similarity to form a 164-class data set. We call this data set Imagenet-164 and use it to demonstrate the effectiveness of the sparse kernel k-means algorithm in Chapter 5. By filtering out the classes with fewer than 500 images, we formed a balanced data set containing 949,401 images from 34 classes, which we call the Imagenet-34 data set. This data set is used to evaluate the remaining clustering algorithms. We computed the Scale Invariant Feature Transform (SIFT) descriptors [117] of the images using the VLFeat library [178], and clustered a randomly chosen subset of 10 million SIFT features to form a visual vocabulary. Each SIFT descriptor was then quantized into a visual word using the nearest cluster center. We obtained a 900-dimensional vector representation for each image, which was then normalized to lie in the range [0, 1].

Poker [33]: This data set, available in the UCI repository [13], contains 1,025,010 data points. Each data point is an example of a "hand" consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes: suit and rank. These attributes are represented using a 30-dimensional categorical feature vector. There are 10 classes in the data set, each depicting a type of poker hand.

Network Intrusion [167]: The Network Intrusion data set contains 4,898,431
Youtube 13 : Youtubeisavideohostingwebsitewhichallowsuserstouplo ad,viewandshare videosovertheweb.Ithasoveronebillionusersuploadingo ver 300 hoursofvideosev- eryminute,onawiderangeoftopics.WeusedtheYoutubeSear chAPI 14 todownload themeta-datacorrespondingto 10 ;143 ;254 videosusing 26 ;000 non-abstractnounsfrom Wordnet[64]assearchqueries.Weusedthevideotitle,desc riptionandthevideothumbnail (whichusuallycontainsthekeyframeinthevideo)toextrac tfeaturesforeachrecord.For eachvideo,weeliminatedstopwordsfromthetitleanddescr iptiontoobtainavocabulary containing 6 ;135 terms,andextractedthecorrespondingtf-idf(termfreque ncy-inversedoc- umentfrequency)features[125].Featurevalue x r;t ,representingtheweightassignedtothe term t inrecord r ,measureshowimportantthetermistotherecordinthedatas et.Itis denedas x r;t = tf ( r;t ) idf ( t; D ) (1.14) = 8 > < > : 1+log f ( r;t ) log n log f ( t ) if f ( r;t ) > 0 0 otherwise ;(1.15) where f ( r;t ) representsthenumberoftimestheterm t occursintherecord r and f ( t ) representsthenumberofrecordscontainingtheterm t .Wethendownloadedthethumbnail ofthevideoandextractedtheglobalGISTfeatures[141]oft heimage.Thenal 6 ;647 - dimensionalfeaturevectorwasobtainedbyconcatenatingt hetf-idfandGISTfeatures.We 13 www.youtube.com 14 https://developers.google.com/youtube/v3 37 usethisdatasettoevaluatetheperformanceofthesparseke rnel k -meansalgorithmproposed inChapter5onlargehighdimensionaldatasets. Tiny,CIFAR-10andCIFAR-100[99,173]: TheTinyImagedatasetcontains 79 ;302 ;017 unique 32 32 colorimages,downloadedfromtheInternet.Theywereobtai nedbyextract- ing 75 ;062 non-abstractEnglishnounsfromtheWordnetdatabase[64]a ndusingthemto searchforimagesin 7 independentimagesearchengines.Theseimagesweredownlo aded anddown-sampledto 32 32 .Werepresentedeachimageusinga 384 -dimensionalGIST descriptor[141].Thoughthesearchqueriescanbeusedtolo oselylabeltheimages,these labelsareunreliable.Toevaluatetheaccuracyofthepropo sedalgorithms,weusedthe CIFAR-10andCIFAR-100datasets,manuallylabeledsubsets oftheTinydataset.The CIFAR-10datasetcontains 60 ;000 imagesfrom 10 classes(bird,truck,deer,dog,cat,frog, car,plane,horseandship).TheCIFAR-100alsocontains 60 ;000 imagesfrom 100 classes. Twitter 15 : Twitterisasocialnetworkwithover100millionactiveuser spostingover 100 ;000 shortmessages(called tweets )perminute.Thetweetscontainpersonalupdates, real-timeinformationaboutevents,newsetc.Eachtweetco ntainsatextmessagelimited to 140 charactersandcanincludeuser-mentions,links,emoticon s,andhashtagsinaddi- tiontoplaintext.Wedownloadedoverabilliontweetsusing theTwitterstreamingsearch APIusing 20 programminglanguages(Python,Perl,C#,Java,Ruby,C++,J avaScript,VB- Script,Scala,ObjectiveC,PHP,SQL,Postgresql,GO,Julia ,Erlang,HTML,XML,Swift, andASP.NET)assearchterms.Welteredoutthenon-English tweets,removedthehash- tags,eliminatedthestopwordsandrepresentedeachtweetw iththetf-idffeatures,dened in(1.15),correspondingto 8 ;042 terms.Weusethisdatasettodemonstratetheefciencyof theapproximatestreamkernel k -meansalgorithminChapter4onfaststreamingdatasets. 
In addition to the above real-world data sets, we use a synthetic data set, which we call the concentric circles data set, to demonstrate the scalability of the proposed algorithms. The data set, containing circular clusters of varying radii, was generated with different numbers of points, ranging from 100 to 1 billion. The data dimensionality ranges from 10 to 1,000, and the number of clusters ranges from 10 to 1,000. Each cluster contains the same number of points. An example data set containing 1,000 two-dimensional points along 10 concentric circles (100 points in each cluster) is shown in Figure 4.2(a).

1.6.2 Evaluation Metrics

The goal of our research is to reduce the resources needed for kernel clustering, with minimal reduction in the cluster quality. In order to evaluate the reduction in running time and memory complexity, we measured the time taken for clustering the data points, and the amount of memory used.

The cluster quality of the proposed algorithms was evaluated using two types of measures: (a) internal measures evaluate the structure and compactness of the clusters, while (b) external measures evaluate how well the cluster labels match with the true class labels. We used the internal Silhouette coefficient [151] and the external Normalized Mutual Information (NMI) [104] measures to evaluate the cluster quality of our algorithms.

The Silhouette coefficient measures the compactness of the clusters. Let d_{k,i} represent the average dissimilarity between data point x_i and all the points assigned to the cluster C_k, i.e.

d_{k,i} = (1 / n_k) Σ_{x_j ∈ C_k, x_j ≠ x_i} d^2(x_i, x_j),

where n_k is the number of points (except for x_i) assigned to cluster C_k. For each data point x_i, define the coefficients a_i and b_i as follows:

a_i = d_{k*,i}  and  b_i = min_{k ≠ k*} d_{k,i},

where k* is the index of the cluster to which x_i is assigned. The coefficient a_i represents the average dissimilarity of x_i with all other points within the same cluster, and the coefficient b_i represents the average dissimilarity between x_i and all the points in the neighboring cluster. The Silhouette coefficient is defined as

Silhouette = (1/n) Σ_{i=1}^{n} (b_i - a_i) / max(a_i, b_i).    (1.16)

The value of the Silhouette coefficient lies in the range [-1, 1], and a value close to 1 is desired. When the coefficient is close to 1, it implies that a_i << b_i for a large number of points, i.e. many of the points are well-matched to the cluster to which they were assigned. On the other hand, when the Silhouette coefficient value is close to -1, a_i >> b_i for a large number of data points, which implies that many of the points are more similar to the neighboring clusters than the cluster to which they have been assigned. A value close to 0 denotes that many data points lie on the boundaries of their natural clusters.

The Normalized Mutual Information with respect to the true class labels of the data points is defined as follows. Let U^a and U^b be the cluster membership matrices corresponding to two partitions a and b of the same data set. Let n_i^a represent the number of data points that have been assigned label i in partition a, and let n_{i,j}^{a,b} represent the number of data points that have been assigned label i in partition a and label j in partition b. We have

NMI(a, b) = [ Σ_{i=1}^{C} Σ_{j=1}^{C} n_{i,j}^{a,b} log( n n_{i,j}^{a,b} / (n_i^a n_j^b) ) ] / sqrt( [ Σ_{i=1}^{C} n_i^a log(n_i^a / n) ] [ Σ_{j=1}^{C} n_j^b log(n_j^b / n) ] ),    (1.17)

where a represents the partition obtained from the clustering algorithm, and b represents the partition based on the true classes. An NMI value of 1 indicates perfect matching with the true class distribution, whereas 0 indicates perfect mismatch. The true class labels are available for most of the data sets (except the Tiny image and Youtube data sets). We used the CIFAR-10 data set, a labeled subset of the Tiny image data set, to evaluate the performance on the Tiny data set.
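A direct transcription of (1.17) into NumPy is shown below; it is a sketch for verifying the definition on small label vectors, not an optimized implementation.

```python
import numpy as np

def nmi(a, b):
    """Normalized Mutual Information of two label vectors, following (1.17)."""
    n = len(a)
    labels_a, labels_b = np.unique(a), np.unique(b)
    # n_{i,j}^{a,b}: contingency counts between the two partitions
    joint = np.array([[np.sum((a == i) & (b == j)) for j in labels_b]
                      for i in labels_a], dtype=float)
    na, nb = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    num = np.sum(joint[mask] * np.log(n * joint[mask] / np.outer(na, nb)[mask]))
    den = np.sqrt(np.sum(na * np.log(na / n)) * np.sum(nb * np.log(nb / n)))
    return num / den

print(nmi(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # -> 1.0
```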
1.7 Thesis Overview

Kernel-based clustering algorithms, which perform well on real-world data sets, are not scalable to big data sets, containing billions of high-dimensional points from thousands of clusters. We propose scalable approximate kernel-based clustering algorithms, and demonstrate their efficiency and effectiveness on several diverse large-scale data sets. The remainder of this thesis is organized as follows: Chapters 2 and 3 describe the approximate batch clustering algorithms (approximate kernel k-means, and kernel-based clustering using random Fourier features), based on work published in [40] and [42], respectively. These algorithms can cluster up to 10 million data points with thousands of features, and achieve high cluster quality. Chapter 4, based on the publication [43], describes the approximate stream kernel k-means algorithm, which can cluster streaming data of arbitrary size in real-time. The sparse kernel k-means algorithm, discussed in Chapter 5, can cluster arbitrarily-sized high-dimensional data sets into thousands of clusters. It is applicable to large document and image repositories. This work was published in [38]. We conclude our study and present directions for future work in Chapter 6.

Chapter 2

Approximate Kernel-based Clustering

2.1 Introduction

As discussed in Chapter 1, kernel k-means achieves better clustering performance than k-means, because it explores the non-linear structure in the data using complex non-linear similarity measures. However, it has running time and memory complexity quadratic in the number of data points n, leading to its non-scalability to big data sets.

To address this issue, we propose an approximate kernel clustering algorithm called Approximate Kernel k-means [40], based on random sampling. We sample m points from the data set of n points, and express the cluster centers as linear combinations of vectors in the space spanned by this subset. The weights of the sampled points in the cluster centers, and the cluster labels of the points, are obtained simultaneously using iterative optimization. Only a small n × m portion of the kernel matrix needs to be computed using the proposed algorithm, thereby reducing the running time complexity of clustering from O(n^2) to O(nm). When n is in the order of millions, the sample size m is much smaller than n. Hence, the proposed algorithm is comparable to k-means in terms of efficiency. We show analytically and empirically that the cluster quality achieved by the proposed approximate kernel k-means is comparable to that of kernel k-means.

This chapter is organized as follows: In Section 2.2, we briefly review some of the popular approximate kernel-based clustering schemes developed in the literature. We formally describe the proposed approximate kernel k-means algorithm in Section 2.3. The key parameters which determine the success of the proposed algorithm are the number of samples m and the sampling strategy; we discuss these issues in Section 2.3.1. In Section 2.3.2, we analyze the proposed algorithm's running time and memory complexity. We also show that the difference between the performance of the approximate kernel k-means and the kernel k-means algorithms, in terms of the clustering error (defined as the sum of the squared distances between the data points and the center of the cluster to which each data point is assigned; see Section 1.3.1 for the formal definition), reduces as the number of samples m increases, at the rate of O(1/m). In Section 2.3.3, we present the distributed approximate kernel k-means algorithm [41], which parallelizes the proposed approximate kernel k-means algorithm in order to scale up to data sets containing billions of data points. Finally, in Section 2.4, we demonstrate empirically that the proposed approximate clustering algorithm is an efficient and accurate variant of the kernel k-means algorithm, and can be used to cluster large data sets containing billions of points.
2.2 Related Work

Large matrices, like the kernel matrices corresponding to large data sets, have fast decaying eigenspectrums [187]. Therefore, the computational requirements of operations involving such matrices can be reduced by replacing them with their low-rank approximations. Most of the scalable kernel-based learning algorithms, including the proposed approximate kernel k-means algorithm, take advantage of this fact in their design.

Below, we first briefly review the low-rank matrix approximation literature, and then describe some of the large-scale kernel-based clustering algorithms developed in the literature.

2.2.1 Low-rank Matrix Approximation

Given an n × m matrix A, the objective of low-rank approximation is to find a rank-r matrix A_r that minimizes the error defined by ||A - A_r||_p, where ||·||_p represents either the spectral norm or the Frobenius norm. The optimal solution to this problem is given by

A_r = Σ_{k=1}^{r} σ_k u_k v_k^T,

where {σ_k}_{k=1}^{r} represent the largest r singular values of A, and {u_k}_{k=1}^{r} and {v_k}_{k=1}^{r} are the corresponding left and right singular vectors [58]. The time required to estimate the singular vectors is O(mn min{m, n}), which can be prohibitive when m and n are large.

Several efficient algorithms have been proposed in the literature to approximate the Singular Value Decomposition [129]. One of the earliest algorithms, by Frieze et al., involves independently sampling s rows and columns from A to form an s × s matrix S; A is then projected onto the span of the dominant eigenvectors of S. They showed that when the columns and rows are sampled with probability proportional to the column and row norms respectively, and the sample size s = O(max{r^4/ε^3, r^2/ε^4}), the approximation error can be bounded, with high probability, as

||A - Ã_r||_F^2 ≤ ||A - A_r||_F^2 + ε ||A||_F^2,    (2.1)

where Ã_r is the rank-r approximation obtained and ε > 0 is an error parameter [70]. Achlioptas et al. obtained a similar result by designing a random matrix R, with entries dependent on the values in the matrix A, such that the matrix A + R is sparse and the expectation E[R] = 0. The top r singular vectors of the sparse matrix A + R are used in place of the singular vectors of A to find the low rank approximation of A [5]. The singular vectors of a sparse matrix can be computed efficiently using the Lanczos bidiagonalization method and its variants [163].
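The optimal rank-r approximation above can be computed directly with a full SVD, as the following sketch shows; the point of the approximation literature surveyed in this section is precisely to avoid this O(mn min{m, n}) computation on large matrices.

```python
import numpy as np

def best_rank_r(A, r):
    """Optimal rank-r approximation A_r: keep the r largest singular
    values and the corresponding singular vectors (Section 2.2.1)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

A = np.random.rand(100, 40)
for r in (1, 5, 20):
    print(r, round(np.linalg.norm(A - best_rank_r(A, r), 'fro'), 3))
    # the Frobenius-norm error decreases as the rank r grows
```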
proposedefcientlinear-timealgorithmswhichrandomlys amplethecolumnsandrows,and obtain U throughthesingularvaluedecompositionofasmall s s matrix[55].In[122],they improvedtheapproximationbysampling C and R basedontheimportanceofthecolumnsand rows,measuredintermsoftheirstatisticalleveragescore s 2 .Wang etal. augmentedasparsica- 2 Thestatisticalleveragescoreofthe i th columnofarank- rn m matrix A withsingularvaluedecomposition A = U V > isdenedas ˇ i = 1 r V ( i ) 2 2 ,where V ( i ) representsthe i th rowin V .Itisameasureoftheindependence ofthecolumnanditsinuenceonthematrix. 45 tionproceduretotheCURdecompositionalgorithmtofurthe rimproveitsefciency[184].The CURdecompositionispreferredovertheSVDdecompositioni nmanyapplicationsbecauseitcan beinterpretedmoreeasily. 2.2.1.2Nystrommatrixapproximation TheNystromapproximationcanbeviewedasaspecialization oftheCURdecompositionforsym- metricpositivesemi-denite(SPSD)matrices 3 likekernelsimilaritymatrices[17,57,103,109,111, 170,187,195].Itwasrstusedin[187]toperformclassica tionandregressionusingGaussian processes.Itwasthenadoptedinmanykernel-basedlearnin gtaskssuchasclassication[187], regression[46,187],clustering[35,67,113],manifoldle arning[194]anddimensionalityreduc- tion[7]. Let K representan n n SPSDmatrixand K r representitsbestrank- r approximation.The NystromapproximationstudiedbyWilliams etal. in[187]samples m ˝ n columnsuniformly withoutreplacementfrom K ,toformthe n m matrix K B .Let b K bethe m m intersection betweenthesampledcolumnsandthecorrespondingrows. K isapproximatedby e K = K B b K 1 K > B :(2.3) Drineas etal. developedavariantwhichuses b K r ,thebestrank- r approximationof b K ,inplace of b K in(2.3),andsamplesthecolumnsuniformly with replacement,toobtainapproximation errorboundsoftheform(2.1)[57].Severalnon-uniformsam plingtechniqueshavebeenexplored in[17,74,103,170,195]toobtainimprovederrorbounds. 3 An n n matrix K ispositivesemi-deniteif x > K x 0 ,forallnon-zero x 2< n . 46 2.2.2Kernel-basedClusteringforLargeDatasets Samplingandsparsicationtechniqueshavebeenemployedt odevelopefcientkernel-basedclus- teringalgorithms. Thespectralclusteringalgorithmisagraph-basedcluster ingtechnique[118].The n points inthedatasetarerepresentedasnodesofagraph.Eachedgei nthegraphisweightedbythe similaritybetweenthepointsconnectedbytheedge.Let K denotethe n n similaritymatrix. Thespectralclusteringalgorithmusestherst C eigenvectorsoftheLaplacianmatrixdenedby L = I diag ( K > 1 ) 1 = 2 K diag ( K > 1 ) 1 = 2 ;tondtheclusters.Theobviouscomputationalbottlenecks inthisalgorithmarethecalculationof theLaplacianmatrixandthecomputationofitseigenvector s,whichrequire O ( n 2 ) and O ( n 3 ) time respectively. ThecolumnsamplingmethodbyFrieze etal. andNystromapproximationmethodhavebeen usedintheliteraturetospeedupspectralclustering[18,6 7,102].Thekeyideaistoapproximate theLaplacianmatrix,andusetheapproximateeigenvectors tondtheclusters.Therunningtime complexityoftheseapproximatespectralclusteringalgor ithmsis O ( nm + m 3 ) ,where m isthe numberofcolumnssampledfromthekernelmatrix.As m isusuallymuchlowerthan n ,these algorithmsrunmuchfasterthanspectralclustering.Rando mprojectioncanbecombinedwith samplingtofurtherimprovetheclusteringefciency[73,1 53].Alow-dimensionalprojection ofthesimilaritymatrix,obtainedbymultiplyingitwithar andomGaussianmatrix,isusedto constructthegraphLaplacianmatrix.Theseapproximatesp ectralclusteringalgorithmshavebeen appliedsuccessfullyinimagesegmentationproblems[113] . 
Nystromapproximationwasalsousedtoacceleratethekerne lneuralgasalgorithm[145].The objectiveofkernelneuralgasistond C prototypestorepresentthedata.Eachprototypeis associatedwitharandomlyinitializedweight,whichisupd atedbyafactorproportionaltothe 47 similaritybetweentheprototypeandtheinputdatapoint.S imilartothekernel k -means,the prototypesareexpressedaslinearcombinationsofthedata pointsintheHilbertspace,soeach weightupdateinvolvesthe n n kernelmatrix.TheNystromapproximationofthekernelmatr ix wasusedin[156]toperformtheweightupdates,therebyredu cingitsrunningtimecomplexity. Inadditiontotheaboveapproximatemethods,severalheuri sticandapplication-specical- gorithmshavebeenproposedtoperformefcientkernel-bas edclustering.ZhangandRudnicky reducedthememoryrequirementsofthekernel k -meansalgorithmbychangingtheorderin whichclusteringisperformed.Thekernelmatrixiscompute dblockwise,andtheclusterlabels areobtainedbyexaminingonlyoneblockatatime[196].TheK ASPalgorithmrstclusters theinputdatasetinto m clustersusing k -means,andthenexecutesspectralclusteringonthe m ( C ˝ m ˝ n ) clustercenterstoobtainthe C clusters[191].TheRASPclusteringmethod rstpartitionsthedataspaceusingRandomProjection(RP) trees[191].RPtreesaredatastruc- turesthatpartitionthedataspaceinto m cells,bysplittingrecursivelyalongonerandomlychosen coordinateatatime.Eachcellinthepartitionisrepresent edbyitscenter,andspectralclustering isexecutedonthe m representativecenters.Thesemethodsreducetherunningt imecomplexity ofclusteringto O ( nm + m 3 ) .Chen etal. sparsiedthesimilaritymatrixbyretainingonlythe similarityvaluescorrespondingtothenearest p neighborsforeachnode,andproposedasimple schemetoparallelizethesimilaritycomputationandclust ering[35].Thenearestneighborsare foundusing kd -trees[131]andmetrictrees[176],therebyreducingtheov erallmemoryrequire- mentto O ( np ) ,althoughtherunningtimecomplexityisstill O ( n 2 log( p )) .TheGEM(Graph Extraction+weightedkernel k -Means)algorithmproposedin[186]speedsupkernel k -meansfor socialnetworkgraphsbyeliminatingthenodeswithlowdegr ee.Itmakesuseofthepowerlaw distributionofsocialnetworks,whichindicatesthatasma llsetofhighdegreeverticescovera largeportionofthenetwork. 48 2.3ApproximateKernelk-means Givenadataset D = f x 1 ;:::; x n g ,andakernelfunction ( ; ) ,kernel k -meansnds C clusters, whosecenters c k ( ) arerepresentedaslinearcombinationsofallthepointsint hedataset,in accordancewiththerepresentertheorem[158],i.e. c k ( )= n X i =1 b U k;i ( x i ; ) ;k 2 [C ];(2.4) where b U istheclustermembershipmatrixnormalizedbythenumberof pointsineachcluster,as denedin(2.8).Inotherwords,theclustercenterslieinth esubspacespannedbyallthedata points,i.e. c k ( ) 2H = span ( ( x 1 ; ) ;:::; ( x n ; )) ;k 2 [C ].Asaconsequence,thekernel k -meansalgorithmrequiresthecomputationof O ( n 2 ) kernelsimilarityvalues,leadingtoitsnon- scalability. Wecanavoidcomputingthefullkernelmatrixifwerestrictt hesolutionfortheclustercenters toasmallersubspace H a ˆH . H a shouldbeconstructedsuchthat (i) H a issmallenoughtoallowefcientcomputation,and (ii) H a isrichenoughtoyielddatapartitionssimilartothoseobta inedusing H . Weemployasimplerandomizedapproachforconstructing H a :werandomlysample m datapoints ( m ˝ n ),denotedby b D = f b x 1 ;:::; b x m g ,andconstructthesubspace H a = span ( b x 1 ;:::; b x m ) . 
Giventhesubspace H a ,wemodifythekernel k -meansoptimizationproblem(1.7)as min U 2P max f c k ( ) 2H a g C k =1 C X k =1 n X i =1 U k;i jj c k ( ) ( x i ; ) jj 2 H ;(2.5) where U =( u 1 ;:::; u C ) > istheclustermembershipmatrix, c k ( ) 2H a ;k 2 [C ]arethecluster centers,anddomain P = f U 2f 0 ;1 g C n :U > 1 = 1 g ,where 1 isavectorofallones.Let K B 2< n m representthekernelsimilaritymatrixbetweendatapoints in D andthesampleddata 49 points b D ,and b K 2< m m representthekernelsimilaritybetweenthesampleddatapo ints.The followinglemmaallowsustoreduce(2.5)toanoptimization probleminvolvingonlythecluster membershipmatrix U . Lemma1. Giventheclustermembershipmatrix U ,theoptimalclustercentersin (2.5) aregiven by c k ( )= m X i =1 k;i ( b x i ; ) ;(2.6) where = b UK B b K 1 .Theoptimizationproblemfor U isgivenby min U tr ( K ) tr ( e UK B b K 1 K > B e U > ) ;(2.7) where b U and e U aredenedby b U =( b u 1 ;:::; b u C ) > =[ diag ( n 1 ;:::;n C )] 1 U; e U =( e u 1 ;:::; e u C ) > =[ diag ( p n 1 ;:::; p n C )] 1 U; and n k = u > k 1 ;k 2 [C ]:(2.8) Proof. Let ' i =( ( x i ;b x 1 ) ;:::; ( x i ;b x m )) and i =( i; 1 ;:::; i;m ) bethe i th rowsofmatrices K B and respectively.As c k ( ) 2H a = span ( b x 1 ;:::; b x m ) ,wecanexpress c k ( ) as c k ( )= m X i =1 k;i ( b x i ; ) ;50 andwritetheobjectivefunctionin(2.5)as C X k =1 n X i =1 U k;i jj c k ( ) ( x i ; ) jj 2 H C X k =1 n X i =1 U k;i m X j =1 k;j ( b x j ; ) ( x i ; ) 2 H = tr ( K )+ C X k =1 n k > k b K k 2 u > k K B k :(2.9) Byminimizingtheaboveexpressionwithrespectto k ,wehave k = b K 1 K > B b u k ;k 2 [C ](2.10) andtherefore, = b UK B b K 1 .Wecompletetheproofbysubstitutingtheexpressionfor into (2.9). AsindicatedbyLemma1,weneedtocomputeonly K B forndingtheclustermemberships. b K ispartof K B andthereforedoesnotneedtobecomputedseparately.When m ˝ n ,this computationalcostwouldbesignicantlysmallerthanthat ofcomputingthefullmatrix. Werefertotheproposedalgorithmas ApproximateKernel k -means ,outlinedinAlgorithm3. Figure2.1illustratesthealgorithmonatwo-dimensionals yntheticdatasetcontainingtwosemi- circles.Exceptforafewpointswhicharemisclustered,the resultissimilartothatofkernel k -means.Table2.1comparestheconfusionmatricesofthepar titionsobtainedusingtheapproxi- matekernel k -meansalgorithm,withthoseofthekernel k -meansandthe k -meansalgorithms.A confusionmatrixshowsthemappingbetweenthetrueclassla belsandtheclusterlabels.Each clusterisassignedaclasslabel,correspondingtothetrue labelofthemajorityofthedatapointsin thecluster.Eachentry ( k;c ) intheconfusionmatrixrepresentthenumberofdatapointsf romclass c assignedtocluster k .Thediagonalentriesrepresentthenumberofpointsthatha vebeenassigned tothecorrectcluster.ItisclearfromTable2.1thatthepro posedalgorithmachievesclusterquality 51 comparabletothatofthekernel k -meansalgorithm,andismuchmoreaccuratethanthe k -means algorithm. Algorithm3 ApproximateKernel k -means 1: Input : D = f x 1 ;:::; x n g ;x i 2< d :thesetof nd -dimensionaldatapointstobeclustered ( ; ): < d < d 7!< :thekernelfunction C :thenumberofclusters m :thenumberofrandomlysampleddatapoints( C 1 = 1 . 6: Set t =0 . 7: repeat 8: Set t = t +1 . 9: Computethe ` 1 normalizedmembershipmatrix b U by b U =[ diag ( U 1 )] 1 U . 10: Calculate = b UT . 11: for i =1 ;:::;n do 12: Findtheclosestclustercenter k for x i by k =argmin k 2 [ C ] > k b K k 2 ' > i k ;where k and ' i arethe k th and i th rowsofmatrices and K B ,respectively. 13: Updatethe i th columnof U by U k;i =1 for k = k andzerootherwise. 
2.3.1 Parameters

In addition to the kernel function and the number of clusters, the approximate kernel $k$-means is parameterized by the sample size $m$ and the random sampling technique employed to obtain the subset $\hat{\mathcal{D}}$. These parameters play a crucial role in determining the clustering performance of the proposed algorithm.

[Figure 2.1 (five panels (a)-(e)): Illustration of the approximate kernel $k$-means algorithm on the two-dimensional semi-circles data set containing 500 points (250 points in each of the two clusters). Panel (a) shows all the data points (in red) and the uniformly sampled points (in blue). Panels (b)-(e) show the process of discovery of the two clusters in the data set and their centers in the input space (represented by x) by the approximate kernel $k$-means algorithm.]

Table 2.1 Comparison of the confusion matrices of the approximate kernel $k$-means, kernel $k$-means and $k$-means algorithms for the two-dimensional semi-circles data set, containing 500 points (250 points in each of the two clusters). The approximate kernel $k$-means algorithm achieves cluster quality comparable to that of the kernel $k$-means algorithm.

(a) Approximate kernel $k$-means        (b) Kernel $k$-means        (c) $k$-means
            Class 1  Class 2                     Class 1  Class 2             Class 1  Class 2
Cluster 1   245      4                 Cluster 1  250      0       Cluster 1  132      129
Cluster 2   5        246               Cluster 2  0        250     Cluster 2  118      121

2.3.1.1 Sample size

By comparing the optimization problem of approximate kernel $k$-means in (2.7) with the kernel $k$-means problem in (1.10), we can observe that the approximate kernel $k$-means problem can be viewed as the kernel $k$-means problem in which the kernel matrix $K$ is replaced by its Nyström approximation $K_B \hat{K}^{-1} K_B^\top$. Therefore, the clustering performance of the approximate kernel $k$-means problem will be close to the clustering performance of kernel $k$-means if the approximation error $\| K - K_B \hat{K}^{-1} K_B^\top \|$ is small. The following lemma, adapted from [74], characterizes the number of samples required to obtain a good approximation.

Lemma 2. Let $\{\lambda_k, v_k\}_{k=1}^n$ denote the eigenvalues and eigenvectors of the kernel matrix $K$. Let $V_C = (v_1,\ldots,v_C)$ denote the eigenvectors corresponding to the dominant $C$ eigenvalues of $K$. Define the "coherence" of the dominant $C$-dimensional invariant subspace of $K$ as
\[
\tau = \frac{n}{C} \max_{1 \le i \le n} \big\| V_C^{(i)} \big\|_2^2, \qquad (2.11)
\]
where $V_C^{(i)}$ is the $i$th row of $V_C$. Assume that the eigengap $\lambda_C - \lambda_{C+1}$ is sufficiently large. For any $\delta \in (0,1)$, we have
\[
\big\| K - K_B \hat{K}^{-1} K_B^\top \big\|_2 \le \lambda_{C+1} \Big( 1 + \frac{2n}{m} \Big)
\]
with probability $1 - \delta$, provided $m \ge 8\,\tau C \log(C/\delta)$.

The coherence $\tau$ of a matrix is a measure of the number of informative columns in the matrix. When the coherence is low, few columns are sufficient to obtain an accurate approximation. Lemma 2 indicates that the approximation error reduces at a rate of $O(1/m)$ with increasing $m$. In our experiments, we examined the performance of our algorithm for different sample sizes $m$, ranging from 0.001% to 15% of the data set size $n$, and observed that setting $m$ equal to 0.01% to 0.05% of $n$ leads to a satisfactory performance.
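The $O(1/m)$ behavior suggested by Lemma 2 is easy to observe on toy data. The check below is an illustration only (the proposed algorithm never forms the full $K$); it reuses the hypothetical helpers from the sampling sketch in this section:

```python
# Toy check of the Nystrom approximation error as the sample size m grows;
# the small ridge keeps the inverse stable when K_hat is near-singular.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
K = rbf_kernel(X, X, sigma=1.0)          # feasible only at this toy scale
for m in (20, 50, 100, 200, 400):
    K_B, K_hat, _ = sample_subspace_matrices(X, m, 1.0, rng)
    K_approx = K_B @ np.linalg.solve(K_hat + 1e-8 * np.eye(m), K_B.T)
    print(m, np.linalg.norm(K - K_approx, 2))   # spectral-norm error shrinks with m
```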
2.3.1.2 Sampling strategies

Another important factor that influences the proposed approximate kernel $k$-means algorithm is the sampling distribution employed to construct the kernel approximation. The simplest sampling technique is uniform random sampling, i.e. each point is selected with a probability $1/n$. Several non-uniform sampling and greedy approaches to perform low-rank matrix approximation have been studied in the literature:

(i) Diagonal sampling involves choosing a data point $x_i$ with a probability proportional to the diagonal element $\kappa(x_i, x_i)$ [17,57]. This distribution is the same as the uniform distribution for exponential kernels of the form
\[
\kappa(x_a, x_b) = \exp\big( -\|x_a - x_b\|_p^q \big), \quad p, q > 0,
\]
such as the RBF kernel and the Laplacian kernel, because all the diagonal entries are equal to one another.

(ii) Column-norm sampling involves choosing $x_i$ with a probability proportional to the $\ell_2$ norm of the column vector $K(\cdot, x_i)$ [69].

(iii) In [195], $k$-means is applied to the data set, and the cluster centers obtained are used in place of the sampled data set $\hat{\mathcal{D}}$.

(iv) Adaptive sampling techniques involve selecting data points sequentially, to ensure maximum coverage of the data [50,102,114,142]. For example, a greedy selection procedure which selects the point that is farthest from the currently selected set of points is employed in [142]. Liu et al. propose selecting the data point which would form a subspace with the previously chosen points, such that the total distance of the unsampled data points to this subspace is minimized [114].

(v) Sampling based on the importance of the data point, in terms of the statistical leverage scores and the coherence of the data, is employed in [170].

The non-uniform sampling techniques like column-norm sampling, adaptive sampling and importance sampling have $O(n^2)$ running time complexity. Hence, they are infeasible for large data sets. Sampling using $k$-means can be performed in $O(nm)$ time. Uniform and diagonal sampling have linear time complexity. Kumar et al. compared the diagonal and column sampling techniques with uniform sampling, and showed that uniform sampling without replacement is more effective than the non-uniform sampling techniques [103]. We explore some of these techniques empirically in Section 2.4.

2.3.2 Analysis

In this section, we first analyze the computational complexity of the proposed approximate kernel $k$-means algorithm, and then examine the quality of the data partitions generated by the proposed algorithm.

2.3.2.1 Computational complexity

Assuming a uniform sampling strategy, sampling can be performed in $O(n)$ time. The most expensive operations in the proposed algorithm are the matrix inversion $\hat{K}^{-1}$ and the calculation of the matrix $T = K_B \hat{K}^{-1}$, which have a total computational cost of $O(m^3 + m^2 n)$. The cost of computing $\alpha$ and updating the membership matrix $U$ is $O(mnCl)$, where $l$ is the number of iterations needed for convergence. Hence, the overall running time complexity of the approximate kernel $k$-means algorithm is $O(m^3 + m^2 n + mnCl)$. We can further reduce the computational complexity by avoiding the matrix inversion $\hat{K}^{-1}$ and formulating the calculation of $\alpha = \hat{U} T = \hat{U} K_B \hat{K}^{-1}$ as the following optimization problem:
\[
\min_{\alpha \in \Re^{C\times m}}\; \frac{1}{2}\, tr(\alpha \hat{K} \alpha^\top) - tr(\hat{U} K_B \alpha^\top). \qquad (2.12)
\]
If $\hat{K}$ is well conditioned (i.e. the minimum eigenvalue of $\hat{K}$ is significantly larger than zero), we can solve the optimization problem in (2.12) by a simple gradient descent method with a convergence rate of $O(\log(1/\varepsilon))$, where $\varepsilon$ is the desired accuracy. As the computational cost of each step in the gradient descent method is $O(m^2 C)$, the overall computational cost is only $O(m^2 C l \log(1/\varepsilon)) \ll O(m^3)$ when $Cl \ll m$. This reduces the overall computational cost to $O(m^2 Cl + mnCl + m^2 n)$. As the largest matrix that needs to be stored in memory is $K_B$, the memory requirement is only $O(mn)$. This is a dramatic decrease in the running time and memory requirements for large data sets when compared to the $O(n^2)$ complexity of kernel $k$-means. The running time complexity of approximate kernel $k$-means is also lower than that of the Nyström approximation based spectral clustering algorithm, which needs to compute the eigenvectors of the $m \times m$ matrix $\hat{K}$ in $O(m^3)$ time.
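As an illustration of this inversion-free variant, the sketch below minimizes (2.12) by plain gradient descent; the gradient is $\alpha\hat{K} - \hat{U}K_B$, and deriving the step size from the largest eigenvalue of $\hat{K}$ is a conservative choice made here for concreteness:

```python
# A minimal gradient-descent sketch for (2.12), avoiding the explicit
# inversion of K_hat; names follow the earlier sketches.
import numpy as np

def solve_alpha(U_hat, K_B, K_hat, n_steps=200):
    """Minimize 0.5 tr(alpha K_hat alpha^T) - tr(U_hat K_B alpha^T)."""
    G = U_hat @ K_B                             # constant part of the gradient, C x m
    alpha = np.zeros_like(G)
    eta = 1.0 / np.linalg.eigvalsh(K_hat)[-1]   # step size 1 / lambda_max(K_hat)
    for _ in range(n_steps):
        alpha -= eta * (alpha @ K_hat - G)      # gradient step on (2.12)
    return alpha
```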
2.3.2.2 Approximation error

In this section, we compare the clustering error of approximate kernel $k$-means with that of kernel $k$-means. The only difference between the two algorithms is the fact that approximate kernel $k$-means restricts the cluster centers to a small subspace $\mathcal{H}_a$, constructed using the sampled data points. Our analysis will therefore be focused on bounding the expected error due to this constraint.

Let the binary random variables $\xi = (\xi_1, \xi_2, \ldots, \xi_n)^\top \in \{0,1\}^n$ represent the sampling vector, i.e. $\xi_i = 1$ if $x_i \in \hat{\mathcal{D}}$ and zero otherwise. The following proposition allows us to write the clustering error in terms of the random variable $\xi$ (here $\circ$ denotes the element-wise product):

Proposition 1. Given the cluster membership matrix $U = (u_1,\ldots,u_C)^\top$, the clustering error can be expressed in $\xi$ as
\[
\mathcal{L}(U,\xi) = tr(K) + \sum_{k=1}^{C} \mathcal{L}_k(U,\xi), \qquad (2.13)
\]
where $\mathcal{L}_k(U,\xi)$ is
\[
\mathcal{L}_k(U,\xi) = \min_{\alpha_k \in \Re^n}\; -2\, u_k^\top K (\alpha_k \circ \xi) + n_k\, (\alpha_k \circ \xi)^\top K (\alpha_k \circ \xi). \qquad (2.14)
\]

Note that approximate kernel $k$-means becomes equivalent to kernel $k$-means when $\xi = \mathbf{1}$, where $\mathbf{1}$ is a vector of all ones, implying that all the data points are selected for constructing the subspace $\mathcal{H}_a$. As a result, $\mathcal{L}(U,\mathbf{1})$ is the clustering error of the kernel $k$-means algorithm.

The following lemma relates the expected clustering error of approximate kernel $k$-means with that of kernel $k$-means.

Lemma 3. Given the membership matrix $U$, the expectation of $\mathcal{L}(U,\xi)$ is bounded as follows:
\[
E_\xi[\mathcal{L}(U,\xi)] \le \mathcal{L}(U,\mathbf{1}) + tr\Big( \tilde{U} \Big[ K^{-1} + \frac{m}{n}\, [diag(K)]^{-1} \Big]^{-1} \tilde{U}^\top \Big), \qquad (2.15)
\]
where $\mathcal{L}(U,\mathbf{1}) = tr(K) - tr(\tilde{U} K \tilde{U}^\top)$.

Proof. We first bound $E_\xi[\mathcal{L}_k(U,\xi)]$ as
\[
\frac{1}{n_k} E_\xi[\mathcal{L}_k(U,\xi)]
= E_\xi\Big[ \min_\alpha\; -2\, \hat{u}_k^\top K (\alpha \circ \xi) + (\alpha \circ \xi)^\top K (\alpha \circ \xi) \Big]
\le \min_\alpha\; E_\xi\Big[ -2\, \hat{u}_k^\top K (\alpha \circ \xi) + (\alpha \circ \xi)^\top K (\alpha \circ \xi) \Big]
\]
\[
= \min_\alpha\; -\frac{2m}{n}\, \hat{u}_k^\top K \alpha + \frac{m^2}{n^2}\, \alpha^\top K \alpha + \frac{m}{n}\Big(1 - \frac{m}{n}\Big) \alpha^\top diag(K)\, \alpha
\le \min_\alpha\; -\frac{2m}{n}\, \hat{u}_k^\top K \alpha + \frac{m}{n}\, \alpha^\top \Big[ \frac{m}{n} K + diag(K) \Big] \alpha.
\]
By minimizing the above expression with respect to $\alpha$, we obtain
\[
\alpha = \Big[ \frac{m}{n} K + diag(K) \Big]^{-1} K \hat{u}_k.
\]
Therefore,
\[
\frac{1}{n_k} E_\xi[\mathcal{L}_k(U,\xi)] \le -\frac{m}{n}\, \hat{u}_k^\top K \Big[ \frac{m}{n} K + diag(K) \Big]^{-1} K \hat{u}_k.
\]
$E_\xi[\mathcal{L}_k(U,\xi)]$ can then be bounded as
\[
E_\xi[\mathcal{L}_k(U,\xi)] + n_k\, \hat{u}_k^\top K \hat{u}_k
\le n_k\, \hat{u}_k^\top \Big( K - K \Big[ K + \frac{n}{m}\, diag(K) \Big]^{-1} K \Big) \hat{u}_k
= \tilde{u}_k^\top \Big[ K^{-1} + \frac{m}{n}\, [diag(K)]^{-1} \Big]^{-1} \tilde{u}_k.
\]
We complete the proof by adding up $E_\xi[\mathcal{L}_k(U,\xi)]$ over $k$, and using the fact that
\[
\mathcal{L}_k(U,\mathbf{1}) = \min_\alpha\; -2\, u_k^\top K \alpha + n_k\, \alpha^\top K \alpha = -\tilde{u}_k^\top K \tilde{u}_k.
\]

The above result can be interpreted in terms of the eigenvalues of the kernel matrix.

Corollary 1. Assume $\kappa(x,x) \le 1$ for any $x$. Let $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n \ge 0$ be the eigenvalues of the matrix $K$. Given the membership matrix $U$, we have
\[
E_\xi[\mathcal{L}(U,\xi)] \le \mathcal{L}(U,\mathbf{1}) \left( 1 + \frac{ \sum_{i=1}^{C} \lambda_i / [1 + \lambda_i m/n] }{ tr(K) - \sum_{i=1}^{C} \lambda_i } \right)
\le \mathcal{L}(U,\mathbf{1}) \left( 1 + \frac{ C/m }{ \sum_{i=C+1}^{n} \lambda_i / n } \right).
\]

Proof. As $\kappa(x,x) \le 1$ for any $x$, we have $diag(K) \preceq I$, where $I$ is an identity matrix. As $\tilde{U}$ is an $\ell_2$ normalized matrix, we have
\[
tr\Big( \tilde{U} \Big[ K^{-1} + \frac{m}{n}[diag(K)]^{-1} \Big]^{-1} \tilde{U}^\top \Big)
\le tr\Big( \tilde{U} \Big[ K^{-1} + \frac{m}{n} I \Big]^{-1} \tilde{U}^\top \Big)
\le \sum_{i=1}^{C} \frac{\lambda_i}{1 + m\lambda_i/n} \le \frac{Cn}{m},
\]
and
\[
\mathcal{L}(U,\mathbf{1}) = tr(K - \tilde{U} K \tilde{U}^\top) \ge tr(K) - \sum_{i=1}^{C} \lambda_i.
\]
We complete the proof by combining the above inequalities.

To illustrate the result of Corollary 1, consider a special kernel matrix $K$ that has its first $a$ eigenvalues equal to $n/a$ and the remaining eigenvalues equal to zero, i.e. $\lambda_1 = \ldots = \lambda_a = n/a$ and $\lambda_{a+1} = \ldots = \lambda_n = 0$. We further assume $a > 2C$, i.e. the number of non-zero eigenvalues of $K$ is larger than twice the number of clusters. Then, according to Corollary 1, we have
\[
\frac{ E_\xi[\mathcal{L}(U,\xi)] - \mathcal{L}(U,\mathbf{1}) }{ \mathcal{L}(U,\mathbf{1}) } \le \frac{Ca}{m(a - C)} \le \frac{2C}{m},
\]
indicating that when the number of non-zero eigenvalues of $K$ is significantly larger than the number of clusters, the difference in the clustering errors of kernel $k$-means and our approximation scheme will decrease at the rate of $O(1/m)$. This result concurs with the result of Lemma 2.

2.3.3 Distributed Clustering

As the proposed approximate kernel $k$-means algorithm has $O(nm)$ running time complexity, it is easier to parallelize than the kernel $k$-means algorithm. In Algorithm 4, we propose a scheme to parallelize approximate kernel $k$-means. The key idea is to distribute the kernel computation, and perform approximate clustering using a relatively smaller matrix.
Algorithm 4 Distributed Approximate Kernel $k$-means
1: Input:
   $\mathcal{D} = \{x_1,\ldots,x_n\}$, $x_i \in \Re^d$: the set of $n$ $d$-dimensional data points to be clustered
   $\kappa(\cdot,\cdot): \Re^d \times \Re^d \mapsto \Re$: the kernel function
   $C$: the number of clusters
   $m$: the number of randomly sampled data points ($C < m \ll n$)
   $P$: the number of tasks
2: Output: cluster membership matrix $U$
3: Randomly sample $m$ data points from $\mathcal{D}$, denoted by $\hat{\mathcal{D}} = \{\hat{x}_1,\ldots,\hat{x}_m\}$, and compute the matrix $\hat{K}$.
4: Randomly split the remaining $n - m$ points into $P$ parts $\{\mathcal{D}^1,\ldots,\mathcal{D}^P\}$ of $s$ points each, and assign part $\mathcal{D}^l$ to task $l$.
// Parallel execution for each task $l \in [P]$
5: Compute the kernel matrix $K_B^l$ between the points in $\mathcal{D}^l$ and the sampled points $\hat{\mathcal{D}}$.
6: Compute $T^l = K_B^l \hat{K}^{-1}$.
7: Randomly initialize the membership matrix $U^l$, ensuring $[U^l]^\top \mathbf{1} = \mathbf{1}$.
8: Set $t = 0$.
9: repeat
10:  Set $t = t + 1$.
11:  Calculate $\alpha^l = [diag(U^l \mathbf{1})]^{-1} U^l T^l$.
12:  for $i = 1,\ldots,s$ do
13:    Find the closest cluster center $k^*$ for $x_i \in \mathcal{D}^l$ by
       $k^* = \arg\min_{k\in[C]}\; [\alpha_k^l]^\top \hat{K} \alpha_k^l - 2\, [\varphi_i^l]^\top \alpha_k^l$,
       where $\alpha_k^l$ and $\varphi_i^l$ are the $k$th and $i$th rows of the matrices $\alpha^l$ and $K_B^l$, respectively.
14:    Update the $i$th column of $U^l$ by $U^l_{k,i} = 1$ for $k = k^*$ and zero otherwise.
15:  end for
16: until the membership matrix $U^l$ does not change or $t > MAXITER$
17: for each point $x_i \notin \mathcal{D}^l$ do
18:   Find the closest cluster center $k^*$ by
      $k^* = \arg\min_{k\in[C]}\; [\alpha_k^l]^\top \hat{K} \alpha_k^l - 2\, [\varphi_i^l]^\top \alpha_k^l$,
      where $\alpha_k^l$ is the $k$th row of $\alpha^l$ and $\varphi_i^l = (\kappa(x_i,\hat{x}_1),\ldots,\kappa(x_i,\hat{x}_m))$.
19:   Update the $i$th column of $U^l$ by $U^l_{k,i} = 1$ for $k = k^*$ and zero otherwise.
20: end for
21: end parallel execution
// Master task
22: Randomly select an index $l$ and set $U = U^l$, or combine the matrices $\{U^l\}_{l=1}^P$ using an ensemble clustering algorithm (e.g. the Meta-clustering algorithm described in Algorithm 5).

Algorithm 5 Meta-Clustering Algorithm
1: Input: cluster membership matrices $\{U^l\}_{l=1}^P$, $U^l \in \{0,1\}^{C\times n}$
2: Output: consensus cluster membership matrix $U^*$
3: Concatenate the membership matrices $\{U^l\}_{l=1}^P$ to obtain a $PC \times n$ matrix $\mathcal{U} = (u_1, u_2, \ldots, u_{PC})^\top$.
4: Compute the Jaccard similarity $s_{i,j}$ between the vectors $u_i$ and $u_j$, $i,j \in [PC]$, using
\[
s_{i,j} = \frac{u_i^\top u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^\top u_j}.
\]
5: Construct a complete weighted meta-graph $G = (V,E)$, where the vertex set $V = \{u_1, u_2,\ldots, u_{PC}\}$ and each edge $(u_i, u_j)$ is weighted by $s_{i,j}$.
6: Partition $G$ into $C$ meta-clusters $\{\pi_k\}_{k=1}^C$, where $\pi_k = \{ u_k^{(1)}, u_k^{(2)}, \ldots, u_k^{(s_k)} \}$.
7: Compute the mean vector $\mu_k$ for each meta-cluster using
\[
\mu_k = \frac{1}{s_k} \sum_{i=1}^{s_k} u_k^{(i)}, \quad k \in [C].
\]
8: for $i = 1,\ldots,n$ do
9:   Update the $i$th column of $U^*$ as
\[
U^*_{k,i} = \begin{cases} 1 & \text{if } k = \arg\max_{k' \in [C]} \mu_{k',i} \\ 0 & \text{otherwise} \end{cases}
\]
10: end for

We first sample $m$ points $\hat{\mathcal{D}} = \{\hat{x}_1,\ldots,\hat{x}_m\}$ from the data set, and randomly split the remaining $n - m$ data points into $P$ parts $\mathcal{D}^1,\ldots,\mathcal{D}^P$. Let the matrix $\hat{K} = [\kappa(\hat{x}_i, \hat{x}_j)]$, where $\hat{x}_i, \hat{x}_j \in \hat{\mathcal{D}}$. We then map each partition to a processing node. Each node computes the kernel matrix $K_B^l = [\kappa(x_i, \hat{x}_j)]$, where $x_i \in \mathcal{D}^l$, the set of points assigned to the node, and finds the cluster labels for the $s$ points in $\mathcal{D}^l$ and the corresponding cluster centers, using the matrices $K_B^l$ and $\hat{K}$. Each point $x_i \notin \mathcal{D}^l$ is then assigned to the cluster whose center is closest. This process generates $P$ cluster membership matrices $\{U^l\}_{l=1}^P$, $U^l \in \{0,1\}^{C\times n}$. To obtain the final cluster membership matrix $U$, we can either randomly choose one index $l$ and set $U = U^l$, or combine them using an ensemble clustering algorithm.
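A serial stand-in may clarify the data flow of Algorithm 4: the loop body below is what each of the $P$ tasks would execute in parallel, and the master step simply keeps one partition at random (MCLA would combine all $P$ of them instead). The helpers are the hypothetical sketches from earlier in this section:

```python
# A serial illustration of Algorithm 4; each pass of the loop corresponds
# to one parallel task. Only the m sampled points (via K_hat) are shared.
import numpy as np

def distributed_approx_kernel_kmeans(X, m, P, C, sigma, rng):
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)          # shared sampled subset D_hat
    K_B = rbf_kernel(X, X[idx], sigma)                  # n x m block (computed blockwise in practice)
    K_hat = K_B[idx]
    parts = np.array_split(rng.permutation(np.setdiff1d(np.arange(n), idx)), P)
    memberships = []
    for part in parts:                                  # each pass = one parallel task
        _, alpha_l = approximate_kernel_kmeans(K_B[part], K_hat, C, rng=rng)
        # steps 17-20 of Algorithm 4: assign the points outside D^l as well
        d = np.einsum('km,ml,kl->k', alpha_l, K_hat, alpha_l)[None, :] - 2.0 * K_B @ alpha_l.T
        memberships.append(d.argmin(axis=1))            # a C-way partition of all n points
    # master step: keep one partition at random; MCLA (Algorithm 5) could
    # combine all P partitions instead
    return memberships[rng.integers(len(memberships))]
```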
The objective of ensemble clustering [180] is to combine multiple partitions of the given data set. A popular ensemble clustering algorithm is the Meta-Clustering algorithm (MCLA) [168], described in Algorithm 5. It maximizes the average normalized mutual information between the partitions using hypergraph partitioning. Given $P$ cluster membership matrices $\{U^1,\ldots,U^P\}$, where $U^l = (u_1^l,\ldots,u_C^l)^\top$, the objective of this algorithm is to find a consensus membership matrix $U^*$ that maximizes the Average Normalized Mutual Information, defined as
\[
ANMI = \frac{1}{P} \sum_{l=1}^{P} NMI(U^*, U^l), \qquad (2.16)
\]
where $NMI(U^a, U^b)$, the Normalized Mutual Information (NMI) [104] between two partitions $a$ and $b$, represented by the membership matrices $U^a$ and $U^b$ respectively, is defined by
\[
NMI(U^a, U^b) = \frac{ \displaystyle\sum_{i=1}^{C} \sum_{j=1}^{C} n_{i,j}^{a,b}\, \log\Big( \frac{n\, n_{i,j}^{a,b}}{n_i^a\, n_j^b} \Big) }{ \sqrt{ \Big( \displaystyle\sum_{i=1}^{C} n_i^a \log \frac{n_i^a}{n} \Big) \Big( \displaystyle\sum_{j=1}^{C} n_j^b \log \frac{n_j^b}{n} \Big) } }. \qquad (2.17)
\]
In equation (2.17), $n_i^a$ represents the number of data points that have been assigned label $i$ in partition $a$, and $n_{i,j}^{a,b}$ represents the number of data points that have been assigned label $i$ in partition $a$ and label $j$ in partition $b$. NMI values lie in the range $[0,1]$. An NMI value of 1 indicates perfect matching between the two partitions, whereas 0 indicates perfect mismatch. Maximizing (2.16) is a combinatorial optimization problem, and solving it exhaustively is computationally infeasible. MCLA obtains an approximate consensus solution by representing the set of partitions as a hypergraph. Each vector $u_k^l$, $k \in [C]$, $l \in [P]$, represents a vertex in a regular undirected graph, called the meta-graph. Vertex $u_i$ is connected to vertex $u_j$ by an edge whose weight is proportional to the Jaccard similarity between the two vectors $u_i$ and $u_j$:
\[
s_{i,j} = \frac{u_i^\top u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^\top u_j}. \qquad (2.18)
\]
This meta-graph is partitioned using a graph partitioning algorithm such as METIS [93] to obtain $C$ balanced meta-clusters $\{\pi_1, \pi_2, \ldots, \pi_C\}$. Each meta-cluster $\pi_k = \{u_k^{(1)}, u_k^{(2)}, \ldots, u_k^{(s_k)}\}$, containing $s_k$ vertices, is represented by the mean vector
\[
\mu_k = \frac{1}{s_k} \sum_{i=1}^{s_k} u_k^{(i)}. \qquad (2.19)
\]
The value $\mu_{k,i}$ represents the association between data point $x_i$ and the $k$th cluster. Each data point $x_i$ is assigned to the meta-cluster with which it is associated the most, breaking ties randomly, i.e.
\[
U^*_{k,i} = \begin{cases} 1 & \text{if } k = \arg\max_{k' \in [C]} \mu_{k',i} \\ 0 & \text{otherwise.} \end{cases} \qquad (2.20)
\]

By parallelizing the approximate kernel $k$-means algorithm, the running time complexity of the kernel calculation and the clustering reduces to $O(nm/P)$ and $O(m^2 C + mnC/P + m^2 n/P)$, respectively. If the ensemble clustering algorithm is employed to combine the partitions in the last step, an additional cost of $O(nC^2 P^2)$ is incurred. The communication overhead is minimal: only the $m$ sampled data points need to be replicated, in contrast to the $n$ data points that would need to be replicated across all the nodes in parallel kernel $k$-means.
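Since NMI is used both inside MCLA and as the evaluation measure in the experiments that follow, a direct transcription of (2.17) may be useful. The sketch assumes integer label vectors (numpy arrays) and is written for clarity rather than speed:

```python
# NMI between two partitions, following (2.17); labels_a and labels_b are
# integer label vectors of length n.
import numpy as np

def nmi(labels_a, labels_b):
    n = len(labels_a)
    counts = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for i, j in zip(labels_a, labels_b):
        counts[i, j] += 1                          # n_{i,j}^{a,b}
    n_a, n_b = counts.sum(axis=1), counts.sum(axis=0)
    nz = counts > 0
    num = (counts[nz] * np.log(n * counts[nz] / np.outer(n_a, n_b)[nz])).sum()
    den = np.sqrt((n_a[n_a > 0] * np.log(n_a[n_a > 0] / n)).sum() *
                  (n_b[n_b > 0] * np.log(n_b[n_b > 0] / n)).sum())
    return num / den
```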
2.4 Experimental Results

In this section, we show that the approximate kernel $k$-means algorithm is an efficient and scalable variant of the kernel $k$-means algorithm. It has lower running time and memory requirements, but is on par with kernel $k$-means in terms of the clustering quality.

2.4.1 Datasets

We use the medium-sized CIFAR-10 and MNIST data sets, for which it is feasible but expensive to compute the $n \times n$ kernel matrix, to demonstrate that the proposed algorithm's clustering performance is similar to that of the kernel $k$-means algorithm, in terms of the cluster quality. We then demonstrate the efficiency of the proposed algorithm on large data sets, on a single processor, using the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets. We analyze the scalability of our algorithm using the synthetic concentric circles data set. We finally execute the distributed approximate kernel $k$-means on the Tiny data set and on the concentric circles data set containing a billion points.

2.4.2 Baselines

We first compared the proposed technique with the kernel $k$-means algorithm, to show that similar performance is achieved by our algorithm. We also gauged our algorithm's performance against that of the Nyström spectral clustering algorithm [67], which clusters the top $C$ eigenvectors of a low-rank approximate kernel matrix obtained through the Nyström approximation technique, and against the $k$-means algorithm, to show that our algorithm achieves better cluster quality.

2.4.3 Parameters

To define the inter-point similarity, we used the universal RBF kernel, with the kernel width parameter set equal to $\rho \bar{d}$, where $\bar{d}$ is the average pairwise Euclidean distance between the data points and the parameter $\rho$ is a value in the range $[0,1]$. (The average pairwise similarity was used only as a heuristic to set the RBF kernel width, and is not required by the proposed algorithm. Other techniques may be employed to choose the kernel and the kernel parameters.) The value which achieved the best NMI was employed. We evaluated the efficiency of the proposed algorithm for different sample sizes, ranging from $m = 100$ to $m = 2{,}000$. We selected these sample sizes to ensure that the true clusters in each data set are sufficiently represented in the sample, with high probability. For the purpose of evaluation, the number of clusters $C$ was set equal to the number of true classes in the data set.

All algorithms were implemented in MATLAB and run on a 2.8 GHz processor. (We used the $k$-means implementation in the MATLAB Statistics Toolbox and the Nyström approximation based spectral clustering implementation [35] available at http://alumni.cs.ucsb.edu/wychen/sc.html. The remaining algorithms were implemented in-house.) The memory used was explicitly limited to 40 GB. We executed each algorithm 10 times, and present the results averaged over these runs. Different permutations of the data set were input to the algorithm in each run.
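For concreteness, the kernel-width heuristic can be sketched as follows; subsampling the pairwise distances is an assumption made here to keep the estimate cheap, since the exact average over all pairs would itself cost $O(n^2)$:

```python
# Estimate the RBF kernel width as rho times the average pairwise
# Euclidean distance, computed on a random subsample.
import numpy as np
from scipy.spatial.distance import pdist

def rbf_width(X, rho, rng, n_sub=1000):
    sub = X[rng.choice(X.shape[0], size=min(n_sub, X.shape[0]), replace=False)]
    return rho * pdist(sub).mean()   # rho in [0, 1], tuned for best NMI
```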
2.4.4 Results

2.4.4.1 Running time

Table 2.2 Running time (in seconds) of the proposed approximate kernel $k$-means and the baseline algorithms. The sample size $m$ is set to 2,000 for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel $k$-means on the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets due to their large size. An approximate value of the running time of kernel $k$-means on these data sets is obtained by first executing kernel $k$-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center. Standard deviations are shown in parentheses.

Dataset             Approx. kernel       Nyström approx. based   Kernel                   k-means
                    k-means (proposed)   spectral clustering     k-means
CIFAR-10            37.01 (6.52)         116.13 (1.97)           725.32 (7.39)            159.22 (75.81)
MNIST               57.73 (12.94)        4,186.02 (386.17)       914.59 (235.14)          448.69 (177.24)
Forest Cover Type   157.48 (27.37)       573.55 (327.49)         4,721.03 (504.21)        40.88 (6.40)
Imagenet-34         1,261.02 (37.39)     1,841.47 (123.82)       154,416.48 (32,302.44)   31,076.41 (9,355.41)
Poker               256.26 (44.84)       520.48 (51.29)          9,942.40 (1,476.00)      40.88 (6.40)
Network Intrusion   891.08 (237.17)      1,682.46 (235.70)       34,784.56 (1,493.59)     953.41 (169.38)

[Figure 2.2: Example images from three clusters in the Imagenet-34 data set. The clusters represent (a) butterfly, (b) odometer, and (c) website images.]

The running times of the proposed algorithm for sample size $m = 2{,}000$ and of the baseline algorithms are recorded in Table 2.2. We observed that a speedup of over 90% was achieved by our algorithm when compared to kernel $k$-means on the CIFAR-10 and MNIST data sets. It is infeasible to calculate the $n \times n$ kernel for the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets. To gauge the efficiency of our algorithm against kernel $k$-means on these data sets, we randomly selected a set of 50,000 points from each of them, executed kernel $k$-means on this subset, and assigned cluster labels to the remaining points by finding the cluster whose center is closest. Our algorithm was faster than this version of the kernel $k$-means algorithm as well. Even the $k$-means algorithm was slower than the proposed approximate kernel $k$-means algorithm on most of the data sets, due to their high dimensionality. Our algorithm was also faster than the spectral clustering algorithm based on the Nyström approximation, because spectral clustering requires the eigendecomposition of the similarity matrix. The most time-consuming operation in our algorithm, the computation of the inverse matrix $\hat{K}^{-1}$, heavily influenced the clustering time.

2.4.4.2 Cluster quality

Figures 2.2 and 2.5 show examples of the clusters obtained from the Imagenet-34 and CIFAR-10 data sets, respectively, using the approximate kernel $k$-means algorithm. We assigned a class label to each cluster, based on the true class of the majority of the objects in the cluster.

The silhouette coefficients of the proposed algorithm are compared with those of the baseline algorithms on the CIFAR-10 and MNIST data sets in Figure 2.3. Computing the silhouette coefficient values for the partitions of the remaining data sets is computationally prohibitive. On both the CIFAR-10 and MNIST data sets, the silhouette coefficient values achieved by the proposed algorithm are close to those of the kernel $k$-means algorithm, showing that the two algorithms yield similar partitions. The Nyström approximation based spectral clustering algorithm achieves lower silhouette values, while the $k$-means algorithm achieves values close to 0, showing that the clusters obtained are not compact.

[Figure 2.3 (two panels: (a) CIFAR-10, (b) MNIST): Silhouette coefficient values of the partitions obtained using approximate kernel $k$-means, compared to those of the partitions obtained using the baseline algorithms. The sample size $m$ is set to 2,000 for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm.]
The NMI values achieved by the proposed algorithm against the baseline algorithms are shown in Figure 2.4. Due to the small size of the images in the CIFAR-10 data set, it is difficult to obtain a high clustering accuracy on this data set. Despite this difficulty, our algorithm partitioned the images into clusters similar to those obtained by using kernel $k$-means. The MNIST data set was also clustered into partitions similar to those obtained from kernel $k$-means. Our algorithm's prediction accuracy, in terms of the NMI with respect to the true class labels, is comparable to that of kernel $k$-means. The proposed algorithm's NMI values are marginally better than those of the approximate spectral clustering algorithm, because the spectral clustering algorithm uses only the top $C$ eigenvectors of the kernel matrix to determine the clusters, which may be too restrictive for these data sets. As expected, all the kernel-based algorithms performed better than $k$-means.

[Figure 2.4 (six panels: (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, (f) Network Intrusion): NMI values (in %) of the partitions obtained using approximate kernel $k$-means, with respect to the true class labels. The sample size $m$ is set to 2,000 for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel $k$-means on the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets due to their large size. The approximate NMI values of kernel $k$-means on these data sets are obtained by first executing kernel $k$-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.]

[Figure 2.5 (ten panels (a)-(j)): Example images from the clusters found in the CIFAR-10 data set using approximate kernel $k$-means. The clusters represent the following objects: (a) airplane, (b) automobile, (c) bird, (d) cat, (e) deer, (f) dog, (g) frog, (h) horse, (i) ship, and (j) truck.]

2.4.4.3 Parameter sensitivity

The proposed approximate kernel $k$-means algorithm is dependent on one crucial parameter: the sample size $m$. We study the effect of varying this parameter on the running time of the algorithm in Table 2.3, and on the cluster quality in Figure 2.6 (NMI values) and Figure 2.7 (silhouette coefficient values). We compare the performance of our algorithm against the Nyström approximation based spectral clustering algorithm, which also depends on the same parameter. In Table 2.3, the execution time is split into the time taken for computing the kernel matrix and the time taken for clustering the data points. The kernel computation time is common to the proposed algorithm and the Nyström approximation based spectral clustering algorithm. More time was spent in clustering than in kernel calculation, due to the simplicity of the RBF kernel. Though our algorithm took longer than the approximate spectral clustering algorithm for small sample sizes ($m \le 1{,}000$), the running time of the spectral clustering algorithm increased cubically with the number of samples. Our algorithm was faster for large sample sizes, when high cluster quality was achieved. The running time of our algorithm also increased as the sample size $m$ increased, but at a lower rate. The silhouette coefficient values of the proposed algorithm increased marginally as the sample size increased, and were higher than those achieved by the Nyström approximation based spectral clustering algorithm. The NMI values achieved by our algorithm were also higher than those achieved by the Nyström approximation based spectral clustering algorithm, especially when the sample size is large and spectral clustering is computationally expensive. Only on the Imagenet-34 data set does our algorithm perform marginally worse than the spectral clustering algorithm. There is a marginal improvement in the NMI of our algorithm as the sample size increases.
Table 2.3 Effect of the sample size $m$ on the running time (in seconds) of the proposed approximate kernel $k$-means clustering algorithm. Each sub-table reports the approximate kernel calculation time (common to both methods), the clustering time of the proposed algorithm, and the clustering time of the Nyström approximation based spectral clustering algorithm; standard deviations are shown in parentheses.

(a) CIFAR-10
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     0.34 (0.04)                  11.95 (4.62)                        0.57 (0.12)
200     0.87 (0.07)                  39.04 (15.04)                       0.99 (0.13)
500     1.36 (0.03)                  11.84 (2.11)                        4.25 (1.86)
1,000   3.63 (0.23)                  45.87 (21.94)                       22.61 (5.03)
2,000   4.60 (0.20)                  32.41 (6.32)                        111.53 (1.77)

(b) MNIST
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     0.65 (0.06)                  25.91 (3.05)                        7.20 (1.00)
200     1.06 (0.18)                  14.54 (7.85)                        49.56 (9.19)
500     1.99 (0.34)                  21.36 (8.35)                        348.86 (107.43)
1,000   3.32 (0.44)                  25.78 (6.78)                        920.34 (219.62)
2,000   5.81 (0.35)                  51.92 (12.59)                       4,180.21 (385.82)

(c) Forest Cover Type
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     1.40 (0.29)                  17.70 (6.06)                        10.35 (1.44)
200     1.64 (0.09)                  22.57 (12.39)                       16.83 (2.38)
500     3.82 (0.03)                  28.56 (11.61)                       50.11 (10.83)
1,000   11.14 (0.68)                 55.01 (18.57)                       137.26 (40.88)
2,000   22.80 (1.27)                 134.68 (26.10)                      550.75 (326.22)

(d) Imagenet-34
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     47.29 (1.12)                 504.41 (119.41)                     78.53 (7.14)
200     68.15 (0.16)                 608.24 (10.78)                      115.16 (4.47)
500     168.83 (0.27)                737.24 (209.26)                     292.69 (7.21)
1,000   181.93 (11.95)               847.06 (22.88)                      404.73 (79.77)
2,000   344.39 (3.77)                916.63 (33.62)                      1,497.08 (120.05)

(e) Poker
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     2.85 (0.36)                  53.02 (10.86)                       10.88 (1.65)
200     7.31 (1.40)                  81.83 (30.72)                       46.78 (4.21)
500     12.74 (2.41)                 104.83 (17.76)                      90.57 (18.57)
1,000   31.29 (2.64)                 171.55 (41.61)                      261.14 (20.51)
2,000   40.75 (3.83)                 215.51 (41.01)                      479.73 (47.46)

(f) Network Intrusion
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     7.52 (0.64)                  729.07 (237.67)                     241.84 (65.00)
200     13.82 (4.15)                 683.22 (438.10)                     200.48 (45.24)
500     41.36 (10.75)                339.77 (119.48)                     436.79 (206.47)
1,000   87.24 (10.54)                551.39 (78.01)                      668.91 (49.37)
2,000   115.14 (7.06)                775.94 (230.11)                     1,567.32 (228.64)

[Figure 2.6 (six panels: (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, (f) Network Intrusion; sample size $m$ on the horizontal axis): Effect of the sample size $m$ on the NMI values (in %) of the partitions obtained using approximate kernel $k$-means, with respect to the true class labels.]

[Figure 2.7 (two panels: (a) CIFAR-10, (b) MNIST; sample size $m$ on the horizontal axis): Effect of the sample size $m$ on the silhouette coefficient values of the partitions obtained using approximate kernel $k$-means.]

2.4.4.4 Sampling strategies

In our implementation of the proposed algorithm, we employed uniform random sampling to select the subset of data using which the kernel matrix is constructed. Other sampling strategies, such as column-norm sampling, diagonal sampling and $k$-means based sampling, may be used to select the samples. Table 2.4, Figure 2.8 and Figure 2.9 compare the running time, silhouette coefficient and NMI values, respectively, of the column-norm sampling and $k$-means sampling strategies with uniform random sampling. For column-norm sampling, we assume that the $n \times n$ kernel matrix is pre-computed, and only record the time taken for computing the column norms and the time taken for choosing the first $m$ indices as the sampling time. For $k$-means sampling, we record the time taken to execute $k$-means and find the representative samples. As expected, the sampling time for both the non-uniform sampling techniques was greater than the time required for uniform random sampling. Column-norm sampling is more expensive than $k$-means sampling, after the kernel computation time is taken into account. Both the non-uniform sampling techniques are only as accurate as uniform random sampling for substantially large sample sizes, both in terms of the silhouette coefficient values as well as the NMI values. This shows that the additional time spent on non-uniform sampling does not lead to a significant improvement in the performance, aligning with the results of earlier works such as [103].

Table 2.4 Comparison of the sampling times (in milliseconds) of the uniform, column-norm and $k$-means sampling strategies on the CIFAR-10 and MNIST data sets. The parameter $m$ represents the sample size; standard deviations are shown in parentheses.

        CIFAR-10                                                      MNIST
m       Uniform random   Column norm (x10^3)   k-means (x10^6)        Uniform random   Column norm (x10^3)   k-means (x10^6)
100     9.62 (1.62)      67.62 (2.31)          1.68 (0.43)            9.41 (1.74)      94.22 (3.97)          3.83 (0.542)
200     4.24 (1.12)      68.21 (3.49)          1.90 (0.20)            9.34 (1.16)      88.92 (4.44)          2.62 (0.254)
500     3.99 (0.65)      64.54 (4.26)          2.14 (0.14)            11.10 (3.81)     86.27 (0.94)          7.82 (3.42)
1,000   5.43 (0.87)      67.42 (5.59)          2.44 (0.16)            8.41 (1.38)      86.15 (0.70)          5.88 (1.78)
2,000   4.62 (2.20)      70.43 (7.20)          2.66 (0.03)            9.53 (1.94)      86.66 (0.85)          4.91 (0.207)

[Figure 2.8 (two panels: (a) CIFAR-10, (b) MNIST; sample size $m$ on the horizontal axis): Comparison of the silhouette coefficient values of the partitions obtained from approximate kernel $k$-means using the uniform, column-norm and $k$-means sampling strategies, on the CIFAR-10 and MNIST data sets.]

[Figure 2.9 (two panels: (a) CIFAR-10, (b) MNIST; sample size $m$ on the horizontal axis): Comparison of the NMI values (in %) of the partitions obtained from approximate kernel $k$-means using the uniform, column-norm and $k$-means sampling strategies, on the CIFAR-10 and MNIST data sets.]

2.4.4.5 Scalability analysis

We analyze the scalability of the proposed approximate kernel $k$-means for different values of $n$, $d$ and $C$, using the synthetic concentric circles data set. We employed the RBF kernel function to compute the approximate kernel matrices, and set the number of sampled points to $m = 1{,}000$ when $C < 100$, and to $m = 10C$ when $C \ge 100$. This was done in order to ensure that the condition imposed by Lemma 2 is satisfied.

Figure 2.10(a) shows that the running time of the algorithm varies nearly linearly as the number of points in the data set $n$ varies from 100 to 10 million, with the dimensionality $d = 100$ and the number of clusters $C = 10$. This concurs with our complexity analysis in Section 2.3.2.

We set the number of data points to $n = 10^6$ and the number of clusters to $C = 10$, and studied the effect of the data dimensionality on the performance of the proposed algorithm in Figure 2.10(b). The dimensionality of the data set plays an important role only in the calculation of the kernel. The RBF kernel is simple, and takes only a few hundred seconds to calculate, even for $n = 10^6$. The running time is dominated by the time taken for clustering. As a result, the running time varies minimally when the dimensionality of the data set varies from $d = 10$ to $d = 1{,}000$.
[Figure 2.10 (three panels; horizontal axes on log scale): Running time of the approximate kernel $k$-means algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.]

We fixed $n = 10^6$ and $d = 100$, increased the number of clusters in the data set from $C = 10$ to $C = 1{,}000$, and recorded the running time of our algorithm in Figure 2.10(c). As expected, the running time increases almost linearly with $C$. When $C < 100$, the number of samples $m$ is fixed to 1,000; therefore, the number of clusters has a significant effect only on the clustering time. When $C \ge 100$, the number of samples $m$ also needs to be increased, thereby affecting both the kernel calculation time and the clustering time.

2.4.5 Distributed Approximate Kernel $k$-means

On data sets of sizes greater than 10 million, the execution of approximate kernel $k$-means on a single processor is highly time-consuming. We employed the distributed approximate kernel $k$-means to cluster the Tiny images data set and the synthetic concentric circles data set.

We set the sample size to $m = 1{,}000$ and the number of tasks to $P = 1{,}024$. Each task was run on a 2.8 GHz processor, with a total of 100 GB shared memory. The RBF kernel was used for both data sets. The number of clusters was set to $C = 100$ and $C = 10$ for the Tiny images and concentric circles data sets, respectively.

The clustering performance of the distributed algorithm on the two data sets is presented in Table 2.5. When approximate kernel $k$-means was executed on the Tiny images data set on a single processor, it took about 8.5 hours. The distributed algorithm is able to cluster this data set in under 2 minutes. The concentric circles data set containing 1 billion points was also clustered in less than 15 minutes. The true class labels are not available for the Tiny image data set, so it was not possible to evaluate the cluster quality. On the concentric circles data set, an NMI of about 78% was achieved.

Table 2.5 Performance of the distributed approximate kernel $k$-means algorithm on the Tiny image data set and the concentric circles data set, with parameters $m = 1{,}000$ and $P = 1{,}024$. Running times are in seconds; standard deviations are shown in parentheses.
Dataset                      Tiny             Concentric circles
n                            79,302,017       1,000,000,000
d                            384              10
C                            100              10
Running time: kernel calc.   0.21 (0.07)      1.17 (0.09)
Running time: clustering     94.03 (6.58)     876.75 (163.06)
NMI (%)                      N/A              77.80 (0.10)

2.5 Summary

In this chapter, we presented the approximate kernel $k$-means algorithm, an efficient approximation of the kernel $k$-means clustering algorithm, suitable for big data sets. The key to the efficiency of approximate kernel $k$-means is the fact that it does not require the calculation of the pairwise similarities between all the data points. By restricting the cluster centers to lie in a subspace spanned by a small set of randomly sampled data points, it is able to compute the clusters using only a small portion of the kernel matrix. Consequently, it has lower running time and memory complexity than kernel $k$-means and other kernel-based clustering algorithms. We have shown theoretically that the difference in the clustering errors of the approximate kernel $k$-means and the kernel $k$-means algorithms reduces linearly as the number of sampled points increases. Experimental results also show that the performance of approximate kernel $k$-means is comparable to that of kernel $k$-means and other state-of-the-art approximate kernel clustering algorithms in terms of the cluster quality, while its running time is close to that of linear clustering algorithms such as $k$-means. Though not as easily parallelizable as $k$-means, it requires less data replication and communication than kernel $k$-means. Hence, it can handle distributed data sets more efficiently than kernel $k$-means. The proposed approximate kernel $k$-means achieves our objective of clustering big data sets efficiently and accurately.

Chapter 3

Kernel-based Clustering Using Random Feature Maps

3.1 Introduction

Although the approximate kernel $k$-means algorithm is accurate and scalable, it has the following limitations:

- The approximate kernel $k$-means algorithm samples a subset of $m$ points from the data set, and constructs an $n \times m$ kernel matrix $K_B$ between the $n$ points in the data set and the sampled points. When $n$ is in the order of billions, and the number of clusters is also comparably large, calculating the $O(nm)$ matrix $K_B$ may be infeasible. For instance, if we were to cluster the Tiny image data set containing 80 million images into 75,062 clusters (the true number of classes in the data set), approximate kernel $k$-means would require about $m = 10^5$ samples. This would boil down to calculating about 8 trillion similarity values, which is computationally expensive.

- Approximate kernel $k$-means cannot efficiently handle out-of-sample clustering, i.e. the problem of assigning new data points to clusters after the clustering is complete. In order to find the cluster label for a new point $x$, we need to compute
\[
\|c_k(\cdot) - \kappa(x,\cdot)\|_{\mathcal{H}}^2 = \alpha_k^\top \hat{K} \alpha_k - 2\, \varphi^\top \alpha_k, \quad k \in [C],
\]
where $\varphi = [\kappa(x,\hat{x}_1),\ldots,\kappa(x,\hat{x}_m)]$ and $\alpha_k$ is the $k$th row of the $C \times m$ matrix $\alpha$, containing the weights of the sampled points in each of the $C$ clusters. This operation has $O(m^2 C + mC^2 + md)$ running time complexity, and can be inefficient for large $m$.
To address the above limitations, we propose two algorithms which use random feature maps to obtain an $O(m)$-dimensional embedding of the Hilbert space associated with the kernel $\kappa(\cdot,\cdot)$, where $m \ll n$ [42]. Our first algorithm, called the RFF clustering algorithm, obtains vector representations of the data points to form an $n \times 2m$ pattern matrix. This pattern matrix is clustered using a linear clustering algorithm like $k$-means to obtain the data partitions. This algorithm, like the approximate kernel $k$-means, has $O(nm)$ running time complexity and memory requirements. The second algorithm, which we call the SV clustering algorithm, is designed along the lines of spectral clustering. It approximates the eigenvectors of the $n \times n$ kernel matrix by the dominant $C$ singular vectors of the pattern matrix, and obtains the data partition by clustering these singular vectors in $O(nC^2)$ time. The SV clustering algorithm provides a $C$-dimensional representation of the cluster centers, using which previously unseen data points can be assigned to clusters efficiently.

3.2 Background

The matrix approximation methods discussed in Section 2.2 essentially factorize the kernel matrix to obtain a low-dimensional representation of the data. Another form of kernel approximation, initially proposed for supervised kernel-based learning by Rahimi and Recht in [147], involves factorizing the kernel function instead of the kernel matrix, by mapping the data explicitly into a low-dimensional randomized feature space.

A kernel function $\kappa(\cdot,\cdot)$ is shift-invariant if $\kappa(x,y) = \kappa(x - y)$ for all $x, y \in \Re^d$. Popular examples of shift-invariant kernels are the RBF and Laplacian kernels. Let $p(w)$ denote the Fourier transform of such a kernel function $\kappa(x - y)$, i.e.
\[
\kappa(x - y) = \int_{\Re^d} p(w)\, \exp\big( j\, w^\top (x - y) \big)\, dw.
\]
According to the following theorem from harmonic analysis, $p(w)$ is a valid probability density function, provided the kernel function is continuous, positive-definite and scaled appropriately.

Theorem 2. (Bochner's theorem [152]) A continuous kernel $\kappa(x,y) = \kappa(x - y)$ on $\Re^d$ is positive definite if and only if $\kappa(\cdot)$ is the Fourier transform of a non-negative measure.

For instance, the Fourier transform [26] of the RBF kernel function is the Gaussian probability distribution function. Let $w$ be a $d$-dimensional vector sampled from $p(w)$. The kernel function can be approximated as
\[
\kappa(x,y) = E_w\big[ f(w,x)^\top f(w,y) \big], \qquad (3.1)
\]
where
\[
f(w,x) = \big( \cos(w^\top x),\, \sin(w^\top x) \big)^\top.
\]
We can approximate the expectation in (3.1) with the empirical mean over $m$ Fourier components $\{w_1,\ldots,w_m\}$, sampled from the distribution $p(w)$, and obtain the following representation for the point $x$:
\[
z(x) = \frac{1}{\sqrt{m}} \big( \cos(w_1^\top x), \ldots, \cos(w_m^\top x),\, \sin(w_1^\top x), \ldots, \sin(w_m^\top x) \big). \qquad (3.2)
\]
The features $z(x)$ are called the Random Fourier Features.
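A short numpy sketch of the map in (3.2), specialized to the RBF kernel $\kappa(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$ (whose Fourier transform is a Gaussian with covariance $I/\sigma^2$), may make the construction concrete; all names are illustrative:

```python
# Random Fourier feature map for the RBF kernel; the sampled components
# are kept so that new points can be mapped consistently later.
import numpy as np

def sample_fourier(d, m, sigma, rng):
    """Draw the Fourier components w_1, ..., w_m from p(w)."""
    return rng.standard_normal((d, m)) / sigma

def rff_map(X, W):
    """Map the rows of X (n x d) to the 2m-dimensional features z(x) of (3.2)."""
    proj = X @ W                                          # w_k^T x_i for all i, k
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[1])

# sanity check: z(x)^T z(y) concentrates around the true kernel value
rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
W = sample_fourier(d=5, m=2000, sigma=1.0, rng=rng)
Z = rff_map(np.vstack([x, y]), W)
print(Z[0] @ Z[1], np.exp(-((x - y) ** 2).sum() / 2))     # the two agree closely
```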
Thiskernelapproximationhasbeenemployedinseverallarg e-scalelearningtaskssuchas classication[25,147,182],regression[123],datacompr ession[146]andnoveltydetection[164]. Randomfeaturemapshavebeenextendedtoshift-variantker nelssuchasintersectionker- nels[110,179]andotherpositivedenitekernelsusingMac laurinandTaylorexpansionsofthe kernelfunction[81,92]. 3.3KernelClusteringusingRandomFourierFeatures Randomfeaturemapscanbeusedforclusteringbigdatasetse fciently.Weproposeanalgo- 83 (a) (b) (c) Figure3.1AsimpleexampletoillustratetheRFFclustering algorithm.(a)Two-dimensionaldata setwith 500 pointsfromtwoclusters( 250 pointsineachcluster),(b)Plotofthematrix H obtained bysampling m =1 Fouriercomponent.(c)Clustersobtainedbyexecuting k -meanson H . Table3.1ComparisonoftheconfusionmatricesoftheRFF,ke rnel k -means,and k -meansalgo- rithmsforthetwo-dimensionalsemi-circlesdataset,cont aining 500 points( 250 pointsineachof thetwoclusters). Class1 Class2 Cluster1 220 41 Cluster2 30 209 (a)RFFclustering Class1 Class2 Cluster1 250 0 Cluster2 0 250 (b)Kernel k -means Class1 Class2 Cluster1 132 129 Cluster2 118 121 (c) k -means rithmcalledthe RFFclustering algorithm,whichrstprojectsthedatasetintoalow-dimen sional spaceusingrandomFourierfeaturemaps,andthenexecutes k -meansonthetransformeddata. Let D = f x 1 ;:::; x n g representtheinputdataset,and ( ; ) bethekernelfunction.We assumethat ( ; ) isshift-invariant 1 andsatisesthecondition ( x ;x )= (0)=1 .Let K = [ ( x i ;x j )] n n denotethekernelmatrix.Thematrix H = z ( x 1 ) > ;:::; z ( x n ) > (3.4) denotesthedatamatrixobtainedbymappingeachpoint x 2D usingtherandomfeaturemap z ( ) . 1 Theassumptionofshift-invarianceismadeonlyforsimplic ity.Randomfeaturemapscanbeusedforother positivesemi-denitekernelsaswell,asdemonstratedin[ 81,92]. 84 Using(3.3),wecanapproximatethekernelmatrix K by b K = H > H: (3.5) Wecanreplacethekernelmatrix K inthekernel k -meansoptimizationproblem(1.11)with theapproximatekernelmatrix b K in(3.5),leadingtothefollowingoptimizationproblem: max U 2P tr ( e UH > H e U > ) ;(3.6) where U =( u 1 ;:::; u C ) > istheclustermembershipmatrix, P = f U 2f 0 ;1 g C n :U > 1 = 1 g , e U =[ diag ( U 1 )] 1 = 2 U ,and 1 isavectorofallones.Bycomparingtheaboveproblemtothe k -meansoptimizationproblem(1.12),itbecomesevidenttha ttheproblemin(3.6)canbesolved byexecuting k -meansonthematrix H .Algorithm6describestheRFFclusteringalgorithmfor clusteringusingtherandomFourierfeaturesobtainedfrom theRBFkernel.Weillustratethe algorithminFigure3.1.Figure3.1(a)showsatwo-dimensio naldatasetcontaining 500 points fromtwosemi-circularclusters.Thetwoclustersareident iedperfectlywhenthekernel k -means algorithmisexecutedonthisdataset.Forthepurposeofill ustration,wesampledoneFourier component(i.e. 
m =1 )andgeneratedatwo-dimensionalmatrix H torepresentthedata.Aplotof thisrepresentationisshowninFigure3.1(b).Notethatthe twoclustersaremoreseparatedinthis spacethanintheoriginalfeaturespace.Figure3.1(c)show stheclustersobtainedwhen k -meansis executedon H .Theerror,intermsofthenumberofpointsthataregroupedi ntothewrongcluster, isabout 14% ,asshownintheconfusionmatricesinTable3.1.Aconfusion matrixshowsthe mappingbetweenthetrueclasslabelsandtheclusterlabels .Eachclusterisassignedaclasslabel, correspondingtothetruelabelofthemajorityofthedatapo intsinthecluster.Eachentry ( k;c ) in theconfusionmatrixrepresentthenumberofdatapointsfro mclass c assignedtocluster k .The diagonalentriesrepresentthenumberofpointsthathavebe enassignedtothecorrectcluster.The confusionmatricesshowthattheaccuracyoftheRFFcluster ingalgorithmisclosetothatofthe 85 kernel k -meansalgorithm,andhigherthanthatofthe k -meansalgorithm. 3.3.1Analysis Inthissection,werstanalyzethecomputationalcomplexi tyoftheRFFclusteringalgorithm,and thenexaminethequalityofthedatapartitionsgenerated. 3.3.1.1Computationalcomplexity SamplingfromtheFouriertransformofthekernelfunctioni sarelativelyinexpensiveoperation formostshift-invariantkernels.Forinstance,severalef cienttechniqueshavebeenproposedfor samplingfromaGaussiandistributionintheliterature[53 ].ThecruxoftheproposedRFFclus- teringalgorithmthusliesincomputingthelow-dimensiona lrandomFourierfeatures H .Given md -dimensionalFouriercomponents,themappingtothematrix H canbeperformedin O ( ndm ) time.Le etal. proposedtheFastfoodalgorithmwhichreducestherunningt imecomplexityof thisoperationto O ( nm log( d )) [107].Insteadofdirectlymultiplyingthedatamatrix X withthe randomGaussianmatrix W toobtainthematrix H ,theycombine W withaWalsh-Hadamardma- trix.MultiplicationwithHadamardmatricescanbeperform edinloglineartime,therebyreducing therunningtime.AsGaussianmatricescombinedwithHadama rdmatricesbehavelikeGaussian matrices,thisdoesnotaffectthekernelmatrixapproximat ionsignicantly.Executing k -meanson H takes O ( nmCl ) time,where l isthenumberofiterationsrequiredforconvergence.Thus, the overallrunningtimecomplexityoftheRFFclusteringalgor ithmis O ( nm log( d )+ nmCl ) .Only O ( nm ) memoryisrequiredtostorethematrix H . 3.3.1.2Approximateerror Toexaminethedifferencebetweentheclusteringsolutions ofthekernel k -meansalgorithmand theRFFclusteringalgorithm,wemustrstboundthekernela pproximationerror K b K F .In thefollowingtheorem,weshowthatthiserrordecreasesatt herateof O (1 = p m ) : 86 Theorem3. Forany 2 (0 ;1) ,withprobability 1 ,wehave b K K F 2ln(2 = ) m + r 2ln(2 = ) m = O 1 p m :(3.7) Proof. Weusethefollowingresultfrom[165]toprovethistheorem: Lemma4. Let H beaHilbertspaceand ˘ bearandomvariableon ( Z;ˆ ) withvaluesin H . Assume k ˘ k M< 1 almostsurely.Denote ˙ 2 ( ˘ )= E ( k ˘ k 2 ) .Let f z i g m i =1 beindependent randomdrawersof ˆ .Forany 0 << 1 ,withcondence 1 , 1 m m X i =1 ( ˘ i E [˘ i ]) 2 M ln(2 = ) m + r 2 ˙ 2 ( ˘ )ln(2 = ) m :(3.8) Dene a ( w )= 1 p n (cos( w > x 1 ) ;:::; cos( w > x n )) > and b ( w )= 1 p n (sin( w > x 1 ) ;:::; sin( w > x n )) > :Let ˘ i = a ( w i ) a ( w i ) > + b ( w i ) b ( w i ) > .Wehave E [˘ i ]= E [a ( w i ) a ( w i ) > + b ( w i ) b ( w i ) > ]= K and jj ˘ i jj 2 F = jj a ( w i ) j 2 + j b ( w i ) jj 2 =1 ,whichimplies M = ˙ 2 =1 .Weobtaintheresult(3.7) bysubstitutingthesevaluesin(3.8). 
3.3.1.2 Approximation error

To examine the difference between the clustering solutions of the kernel $k$-means algorithm and the RFF clustering algorithm, we must first bound the kernel approximation error $\|K - \hat{K}\|_F$. In the following theorem, we show that this error decreases at the rate of $O(1/\sqrt{m})$:

Theorem 3. For any $\delta \in (0,1)$, with probability $1 - \delta$, we have
\[
\|\hat{K} - K\|_F \le \frac{2\ln(2/\delta)}{m} + \sqrt{\frac{2\ln(2/\delta)}{m}} = O\Big(\frac{1}{\sqrt{m}}\Big). \qquad (3.7)
\]

Proof. We use the following result from [165] to prove this theorem:

Lemma 4. Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable on $(Z, \rho)$ with values in $\mathcal{H}$. Assume $\|\xi\| \le M < \infty$ almost surely. Denote $\sigma^2(\xi) = E(\|\xi\|^2)$. Let $\{z_i\}_{i=1}^m$ be independent random draws of $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$,
\[
\Big\| \frac{1}{m} \sum_{i=1}^m (\xi_i - E[\xi_i]) \Big\| \le \frac{2M \ln(2/\delta)}{m} + \sqrt{\frac{2\sigma^2(\xi)\ln(2/\delta)}{m}}. \qquad (3.8)
\]

Define
\[
a(w) = \frac{1}{\sqrt{n}} \big( \cos(w^\top x_1), \ldots, \cos(w^\top x_n) \big)^\top \quad \text{and} \quad
b(w) = \frac{1}{\sqrt{n}} \big( \sin(w^\top x_1), \ldots, \sin(w^\top x_n) \big)^\top.
\]
Let $\xi_i = a(w_i)\, a(w_i)^\top + b(w_i)\, b(w_i)^\top$. We have $E[\xi_i] = K$ and $\|\xi_i\|_F^2 = \|a(w_i)\|^2 + \|b(w_i)\|^2 = 1$, which implies $M = \sigma^2 = 1$. We obtain the result (3.7) by substituting these values in (3.8).

$\hat{K}$ is a good approximation of $K$, provided that the number of Fourier components $m$ is sufficiently large. We can now obtain an upper bound on the difference between the solutions of the kernel $k$-means optimization problem in (1.11) and the optimization problem in (3.6):

Theorem 4. Let $U^*$ and $U_m^*$ be the optimal solutions of (1.11) and (3.6), respectively. Let $\tilde{U}^* = [D^*]^{-1/2} U^*$ and $\tilde{U}_m^* = [D_m^*]^{-1/2} U_m^*$ denote the normalized versions of $U^*$ and $U_m^*$, where $D^* = diag(U^* \mathbf{1})$ and $D_m^* = diag(U_m^* \mathbf{1})$. For any $\delta \in (0,1)$, with probability $1 - \delta$, we have
\[
tr\big( [\tilde{U}^* - \tilde{U}_m^*]\, K\, [\tilde{U}^* - \tilde{U}_m^*]^\top \big) \le \frac{4\ln(2/\delta)}{m} + \sqrt{\frac{8\ln(2/\delta)}{m}} = O\Big(\frac{1}{\sqrt{m}}\Big).
\]

Proof. We have
\[
tr(\tilde{U}^* K [\tilde{U}^*]^\top) \le tr(\tilde{U}^* \hat{K} [\tilde{U}^*]^\top) + \|K - \hat{K}\|_F
\le tr(\tilde{U}_m^* \hat{K} [\tilde{U}_m^*]^\top) + \|K - \hat{K}\|_F
\le tr(\tilde{U}_m^* K [\tilde{U}_m^*]^\top) + 2\,\|K - \hat{K}\|_F.
\]
Since $tr(\tilde{U}^* K [\tilde{U}^*]^\top) \ge tr(\tilde{U}_m^* K [\tilde{U}_m^*]^\top)$, we have
\[
\big| tr(\tilde{U}^* K [\tilde{U}^*]^\top) - tr(\tilde{U}_m^* K [\tilde{U}_m^*]^\top) \big| \le 2\,\|K - \hat{K}\|_F.
\]
We complete the proof by using the result from Theorem 3 and the strong convexity property of $tr(\tilde{U} K \tilde{U}^\top)$.

3.4 Kernel Clustering using Random Fourier Features in Constrained Eigenspace

Despite its simplicity, RFF clustering may suffer from high computational cost. As seen in Theorem 4, a large number of random Fourier components may be required to achieve a low approximation error. As a consequence, we need to execute $k$-means over a high-dimensional space, leading to high runtime complexity. To address this problem, we propose using an idea similar to that in the approximate kernel $k$-means algorithm, and constrain the cluster centers to lie in the subspace spanned by the top eigenvectors of the kernel matrix. Let $\{(\lambda_i, v_i)\}_{i=1}^n$ denote the eigenvalues and eigenvectors of the kernel matrix $K$, ranked in the descending order of the eigenvalues. Let $\mathcal{H}_a = span(v_1,\ldots,v_C)$ represent the space spanned by the dominant $C$ eigenvectors. The kernel $k$-means problem in (1.7) can be approximated as
\[
\min_{U \in \mathcal{P}}\; \max_{\{c_k(\cdot) \in \mathcal{H}_a\}_{k=1}^C}\; \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U_{k,i}}{n}\, \|c_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2, \qquad (3.9)
\]
where $c_k(\cdot)$ represent the cluster centers, $U = (u_1,\ldots,u_C)^\top$ is the cluster membership matrix, $\mathcal{P} = \{U \in \{0,1\}^{C\times n} : U^\top \mathbf{1} = \mathbf{1}\}$, and $\mathbf{1}$ is a vector of all ones. The above problem (3.9) can be solved by executing $k$-means on the top eigenvectors of $K$, i.e. by solving the following optimization problem:
\[
\max_{U \in \mathcal{P}}\; tr(\tilde{U}\, [V_C V_C^\top]\, \tilde{U}^\top), \qquad (3.10)
\]
where $V_C = (v_1,\ldots,v_C)$ and $\tilde{U} = [diag(U\mathbf{1})]^{-1/2} U$. This method leads to a significant reduction in computational cost when compared to the RFF clustering algorithm, as each data point is represented by a $C$-dimensional vector, and $k$-means needs to be executed over a lower dimensional space.
However, computing the eigenvectors of $K$ requires the computation of the $n \times n$ kernel matrix, which is infeasible when $n$ is large. We circumvent this issue by approximating the eigenvectors of $K$ using the singular vectors of the random Fourier features, and thereby avoid computing the full kernel matrix. More specifically, we compute the top $C$ singular values and the corresponding left singular vectors of $H$, denoted by $\{(\hat{\lambda}_i, \hat{v}_i)\}_{i=1}^C$, and represent the data points in $\mathcal{D}$ by the matrix $\hat{V}_C = (\hat{v}_1,\ldots,\hat{v}_C)$. We then solve the approximate optimization problem
\[
\max_{U \in \mathcal{P}}\; tr(\tilde{U}\, [\hat{V}_C \hat{V}_C^\top]\, \tilde{U}^\top), \qquad (3.11)
\]
by executing $k$-means on the matrix $\hat{V}_C$, to obtain the $C$ clusters. This procedure, named the SV clustering algorithm, is outlined in Algorithm 7. It has the same input and output as the RFF clustering algorithm, but differs in the final two steps. As the dimensionality of the input to the $k$-means clustering step in the SV clustering algorithm is significantly smaller than that in the RFF clustering algorithm, SV clustering is more efficient than RFF clustering, despite the overhead of computing the singular vectors.

Algorithm 7 SV Clustering
1: Input:
   $\mathcal{D} = \{x_1,\ldots,x_n\}$, $x_i \in \Re^d$: the set of $n$ $d$-dimensional data points to be clustered
   $\sigma$: the RBF kernel width parameter
   $C$: the number of clusters
   $m$: the number of Fourier components ($C < m \ll n$)
2: Output: cluster membership matrix $U$
3: Sample $m$ Fourier components $\{w_1,\ldots,w_m\}$ from $p(w)$, the Fourier transform of the RBF kernel.
4: Compute the pattern matrix $H = \big( z(x_1),\ldots,z(x_n) \big)^\top$, where $z(\cdot)$ is defined in (3.2).
5: Compute the left singular vectors of $H$ corresponding to its top $C$ singular values, to obtain the matrix $\hat{V}_C = (\hat{v}_1,\ldots,\hat{v}_C)$.
6: Run the $k$-means algorithm (Algorithm 1) on $\hat{V}_C$, with the number of clusters set to $C$, and obtain the membership matrix $U$.

3.4.1 Analysis

In this section, we discuss the computational complexity of the SV clustering algorithm and bound its approximation error.

3.4.1.1 Computational complexity

As the initial steps in the SV clustering algorithm are the same as in the RFF clustering algorithm, these steps have the same running time complexity. In addition, the algorithm involves performing the singular value decomposition of $H$. If the top singular vectors of $H$ are found using conventional methods, the runtime complexity of the SVD step would be $O(nm^2)$. We reduce this complexity in our implementation by using the approximate SVD technique proposed in [70]. We sample $s$ rows from $H$ to form an $s \times 2m$ matrix $S$. The top eigenvectors of $S^\top S$, denoted by $\tilde{V} = (\tilde{v}_1,\ldots,\tilde{v}_C)$, are close to the top eigenvectors of $H^\top H$, and the singular vectors of $H$ can be recovered from these eigenvectors as $H\tilde{V}$. Using this approximation, the runtime complexity of the SVD step is reduced to $O(sm \min\{s,m\})$. The time taken to execute $k$-means on the singular vectors is $O(nC^2 l)$.

When $\max(m,s,l,C) \ll n$, both the RFF and SV clustering algorithms have linear time complexity. However, the time taken by the $k$-means step in the SV clustering algorithm is $O(nC^2 l)$, as opposed to $O(nmCl)$, the time taken by the $k$-means step in the RFF clustering algorithm. As $C$ is usually much smaller than $m$, the SV algorithm is much more efficient than the RFF clustering algorithm.

The values chosen for $m$ and $s$ introduce a trade-off between the clustering quality and the efficiency. Higher values result in better clustering quality, but a smaller speedup. In our implementation, we found that a reasonably good accuracy can be achieved by setting the value of $m$ to range between 1% and 2% of $n$, and setting $s$ to around 2% of $n$. Lower $m/n$ ratio values work well as $n$ increases.
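The whole SV pipeline with the row-sampling SVD approximation fits in a few lines. As above, this is an illustrative sketch (sample_fourier and rff_map are the hypothetical helpers from Section 3.2), not the MATLAB implementation used in the experiments:

```python
# SV clustering sketch: eigenvectors of S^T S stand in for the right
# singular vectors of H, and H @ V_tilde recovers (up to scaling) the
# top C left singular vectors used as the C-dimensional representation.
import numpy as np
from scipy.cluster.vq import kmeans2

def sv_clustering(X, C, m, s, sigma, rng):
    W = sample_fourier(X.shape[1], m, sigma, rng)
    H = rff_map(X, W)                                       # n x 2m pattern matrix
    S = H[rng.choice(X.shape[0], size=s, replace=False)]    # s x 2m row sample
    _, eigvecs = np.linalg.eigh(S.T @ S)                    # eigenvalues in ascending order
    V_tilde = eigvecs[:, -C:]                               # top C eigenvectors of S^T S
    V_hat = H @ V_tilde                # approximate top C left singular vectors of H
    _, labels = kmeans2(V_hat, C, minit='++')               # k-means in C dimensions
    return labels, W, V_tilde, V_hat
```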
3.4.1.2 Approximation error

The SV clustering algorithm relies on the assumption of the existence of a large eigengap. This theory, which has been adopted by many earlier kernel-based algorithms relying on the spectral embedding of the data [118], essentially implies that most attributes of the data can be well approximated by vectors in the low-dimensional space spanned by the top eigenvectors.

The following theorem proves that when the last $n - C$ eigenvalues $\{\lambda_i\}_{i=C+1}^n$ of $K$ are sufficiently small, the subspace $\mathcal{H}$ can be well approximated by the subspace $\mathcal{H}_a$ spanned by the top $C$ eigenvectors of $K$.

Theorem 5. Let $E^*$ and $E_a^*$ represent the optimal clustering errors in the kernel $k$-means problem (1.7) and the optimization problem (3.9), respectively. We have
\[
| E^* - E_a^* | \le \sum_{i=C+1}^{n} \lambda_i.
\]

Proof. Let $\{c_k^*(\cdot)\}_{k=1}^C$ and $U^*$ be the optimal solutions to (1.7). Let $c_k^a(\cdot)$ represent the projection of $c_k^*$ into the subspace $\mathcal{H}_a$. For any $\kappa(x_i,\cdot)$, let $g_i(\cdot)$ and $h_i(\cdot)$ be the projections of $\kappa(x_i,\cdot)$ into the subspace $\mathcal{H}_a$ and into $span(v_{C+1},\ldots,v_n)$, respectively. We have
\[
E_a^* = \min_{U} \max_{c_k(\cdot) \in \mathcal{H}_a} \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U_{k,i}}{n}\, \|c_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2
\le \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U^*_{k,i}}{n}\, \|c_k^a(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2
\]
\[
= \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U^*_{k,i}}{n} \Big( \|c_k^a(\cdot) - g_i(\cdot)\|_{\mathcal{H}}^2 + \|h_i(\cdot)\|_{\mathcal{H}}^2 \Big)
\le E^* + \frac{1}{n} \sum_{i=1}^{n} \|h_i(\cdot)\|_{\mathcal{H}}^2
\le E^* + \sum_{i=C+1}^{n} \lambda_i.
\]

We prove a set of preliminary lemmas before presenting our main result in Theorem 6, which bounds the clustering error of the SV clustering algorithm.

Lemma 5. (Result from matrix perturbation theory [166]) Let $(\lambda_i, v_i)$, $i \in [n]$, be the eigenvalues and eigenvectors of a symmetric matrix $A \in \Re^{n\times n}$, ranked in the descending order of the eigenvalues. Set $X = (v_1,\ldots,v_C)$ and $Y = (v_{C+1},\ldots,v_n)$. Given a symmetric perturbation matrix $E$, let
\[
(X, Y)^\top E\, (X, Y) = \begin{pmatrix} E_{11} & E_{12} \\ E_{21} & E_{22} \end{pmatrix}.
\]
Let $\|\cdot\|$ represent a consistent family of norms, and let
\[
\gamma = \|E_{21}\|, \qquad \delta = \lambda_C - \lambda_{C+1} - \|E_{11}\| - \|E_{22}\|.
\]
If $\delta > 0$ and $\gamma/\delta < 1/2$, then there exists a unique matrix $P \in \Re^{(n-C)\times C}$ satisfying $\|P\| < 2\gamma/\delta$, such that
\[
X' = (X + YP)(I + P^\top P)^{-1/2} \quad \text{and} \quad Y' = (Y - XP^\top)(I + PP^\top)^{-1/2}
\]
are the eigenvectors of $A + E$.

Lemma 6. Given $\delta \in (0,1)$, assume $(\lambda_C - \lambda_{C+1}) \ge 3\Delta$, where
\[
\Delta = \frac{2\ln(2/\delta)}{m} + \sqrt{\frac{2\ln(2/\delta)}{m}}. \qquad (3.12)
\]
Then there exists, with probability $1 - \delta$, a matrix $P \in \Re^{(n-C)\times C}$ satisfying
\[
\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1}},
\]
such that $\hat{V}_C = (V_C + \bar{V}_C P)(I + P^\top P)^{-1/2}$, where $\hat{V}_C = (\hat{v}_1,\ldots,\hat{v}_C)$, $V_C = (v_1,\ldots,v_C)$, and $\bar{V}_C = (v_{C+1},\ldots,v_n)$.

Proof. Let $E = \hat{K} - K$. Using Theorem 3 and Lemma 5, we have
\[
\gamma = \| \bar{V}_C^\top E\, V_C \| \le \|E\|, \qquad
\delta = \lambda_C - \lambda_{C+1} - \| V_C^\top E\, V_C \| - \| \bar{V}_C^\top E\, \bar{V}_C \| \ge \lambda_C - \lambda_{C+1} - 2\|E\| > 0.
\]
As $\lambda_C - \lambda_{C+1} \ge 3\Delta$, we also have $\gamma/\delta < 1/2$, allowing us to apply Lemma 5 and obtain the required result.

Lemma 7. Under the assumptions of Lemma 6, with probability $1 - \delta$, we have
\[
\sum_{i=1}^{C} \|\hat{v}_i - v_i\|^2 \le 2\,\|P\|_F^2 \le \frac{18\,\Delta^2}{(\lambda_C - \lambda_{C+1})^2},
\]
where $\Delta$ is defined in (3.12).

Proof. Define $A = P(I + P^\top P)^{-1/2}$ and $B = I - (I + P^\top P)^{-1/2}$. Let $\{\gamma_i\}_{i=1}^C$ be the eigenvalues of $P^\top P$. Using the result of Lemma 6, we have
\[
\sum_{i=1}^{C} \|\hat{v}_i - v_i\|^2 = \| \bar{V}_C A \|_F^2 + \| V_C B \|_F^2 \le \|A\|_F^2 + \|B\|_F^2
\le \|P\|_F^2 + \sum_{i=1}^{C} \frac{\gamma_i}{(1 + \sqrt{\gamma_i})^2} \le 2\,\|P\|_F^2.
\]
We complete the proof by using the fact that
\[
\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1}} \le \frac{3\Delta}{\lambda_C - \lambda_{C+1}}.
\]

In the following theorem, we bound the approximation error of the SV clustering algorithm, and show that it yields a better approximation of kernel clustering than the RFF clustering algorithm, provided there is a sufficiently large gap in the eigenspectrum.

Theorem 6. Let $U^*$ and $U_m^*$ be the optimal solutions of (3.10) and (3.11), and let $\tilde{U}^*$ and $\tilde{U}_m^*$ represent their normalized versions (as defined in Theorem 4), respectively. Given $\delta \in (0,1)$, assume $(\lambda_C - \lambda_{C+1}) \ge 3\Delta$, where $\Delta$ is defined in (3.12). With probability $1 - \delta$, we have
\[
tr\big( [\tilde{U}^* - \tilde{U}_m^*]\, [\tilde{U}^* - \tilde{U}_m^*]^\top \big) \le \frac{18\,\Delta^2}{(\lambda_C - \lambda_{C+1})^2} = O\Big(\frac{1}{m}\Big).
\]

Proof. This theorem is a direct result of Lemmas 6 and 7.

Theorem 6 shows that, like the RFF clustering algorithm, the SV clustering algorithm's approximation error reduces as the number of Fourier components increases, albeit at a faster rate of $O(1/m)$.
3.4.2 Out-of-sample Clustering

The SV clustering algorithm can be used to efficiently assign cluster labels to data points that were not seen previously. The cluster centers in the SV clustering algorithm lie in the subspace $\mathcal{H}_a = span(\hat{v}_1,\ldots,\hat{v}_C)$, and can be expressed as linear combinations of these vectors:
\[
\tilde{c}_k = \frac{1}{n_k} \sum_{i=1}^{n} U_{k,i}\, \hat{V}_C^{(i)},
\]
where $n_k$ is the number of data points in the $k$th cluster and $\hat{V}_C^{(i)}$ denotes the $i$th row of $\hat{V}_C$. Given a data point $x \in \Re^d$, we can obtain its cluster label using the following double projection scheme:
(i) Compute the random Fourier features
\[
z(x) = \frac{1}{\sqrt{m}} \big( \cos(w_1^\top x), \ldots, \cos(w_m^\top x),\, \sin(w_1^\top x), \ldots, \sin(w_m^\top x) \big).
\]
(ii) Project $z(x)$ into the subspace $\mathcal{H}_a$ to obtain $\hat{v}$.
(iii) Assign $x$ to the cluster $k^*$ which minimizes $\| \tilde{c}_k - \hat{v} \|_2^2$.
Using this process, cluster labels can be assigned in $O(md)$ time.
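A sketch of the double projection scheme, reusing the hypothetical quantities returned by the SV clustering sketch above (the Fourier components W, the projection V_tilde, and the per-point representations V_hat); note that the same W must be used for new points as for the original data:

```python
# Out-of-sample assignment via the double projection scheme.
import numpy as np

def sv_centers(V_hat, labels, C):
    """The C x C matrix of cluster centers c_tilde_k."""
    return np.vstack([V_hat[labels == k].mean(axis=0) for k in range(C)])

def assign_new_point(x, W, V_tilde, centers):
    z = rff_map(x[None, :], W)        # step (i): random Fourier features of x
    v = z @ V_tilde                   # step (ii): projection into H_a
    return ((centers - v) ** 2).sum(axis=1).argmin()   # step (iii): nearest center
```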
3.5 Experimental Results

3.5.1 Datasets

We evaluated the performance of the RFF and SV clustering algorithms on the CIFAR-10, MNIST, Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets. The medium-sized CIFAR-10 and MNIST data sets are used to compare the performance of the proposed algorithms with the kernel k-means algorithm. The remaining data sets are used to demonstrate the scalability of the algorithms to large data sets.

3.5.2 Baselines

Using the medium-sized CIFAR-10 and MNIST data sets, we compared the proposed algorithms with the kernel k-means algorithm, to demonstrate that their clustering performance is close to that of kernel k-means in terms of cluster quality. We also compared their performance with the approximate kernel k-means algorithm and the Nystrom approximation based spectral clustering algorithm. We also gauged the performance of our algorithms against that of the k-means algorithm, to show that they achieve better cluster quality.

3.5.3 Parameters

We used the RBF kernel for all the kernel-based algorithms on all the data sets. We set the kernel width equal to $\rho \bar d$, where $\bar d$ is the average pairwise Euclidean distance between the data points, and the parameter $\rho$ is tuned in the range $[0,1]$ to obtain optimal performance. (The average pairwise similarity was used only as a heuristic to set the RBF kernel width, and is not required by the proposed algorithms; other techniques may be employed to choose the kernel and the kernel parameters.) We varied the number of Fourier components $m$ from 100 to 2,000. For the approximate kernel k-means and spectral clustering algorithms, $m$ represents the size of the sample drawn from the data set. The value of $s$, the number of rows sampled from $H$ to compute the approximate singular vectors, was set to 2% of the total number of data points $n$. The number of clusters $C$ was set equal to the true number of classes in the data set.

All algorithms were implemented in MATLAB and run on a 2.8 GHz processor using 40 GB RAM. (We used the k-means implementation in the MATLAB Statistics Toolbox, and the Nystrom approximation based spectral clustering implementation [35] available at http://alumni.cs.ucsb.edu/wychen/sc.html; the remaining algorithms were implemented in-house.) All results are averaged over 10 runs of the algorithms. In each run of the proposed algorithms, we used a different set of randomly sampled Fourier components. For the baseline algorithms which use a subset of the data, we used different randomly sampled subsets in each run.

3.5.4 Results

3.5.4.1 Running time

The running times of the baseline algorithms and the proposed RFF and SV algorithms are recorded in Table 3.2. The number of Fourier components $m$ for the RFF and SV clustering algorithms was set to 2,000, and the sample set size for the approximate kernel k-means and the Nystrom approximation based spectral clustering algorithms was also set to 2,000.

Table 3.2 Running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. The parameter $m$, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, is set to $m = 2{,}000$. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. An approximation of the running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center. Standard deviations are shown in parentheses.

Dataset | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering | Kernel k-means | k-means
CIFAR-10 | 3,418.21 (907.14) | 58.32 (38.68) | 37.01 (6.52) | 116.13 (1.97) | 725.32 (7.39) | 159.22 (75.81)
MNIST | 1,089.26 (483.63) | 39.94 (5.64) | 57.73 (12.94) | 4,186.02 (386.17) | 914.59 (235.14) | 448.69 (177.24)
Forest Cover Type | 2,078.63 (617.22) | 76.99 (17.04) | 157.48 (27.37) | 573.55 (327.49) | 4,721.03 (504.21) | 40.88 (6.40)
Imagenet-34 | 1,333.85 (6.53) | 212.32 (4.75) | 1,261.02 (37.39) | 1,841.47 (123.82) | 154,416 (32,302) | 31,076 (9,355)
Poker | 4,530.44 (276.37) | 41.08 (2.57) | 256.26 (44.84) | 520.48 (51.29) | 9,942 (1,476) | 40.88 (6.40)
Network Intrusion | 24,151 (6,351.34) | 435.53 (189.07) | 891.08 (237.17) | 1,682.46 (235.70) | 34,784 (1,493) | 953.41 (169.38)

We first observe that the RFF clustering algorithm took longer than the SV clustering algorithm on all the data sets. Though both algorithms require the computation of the data matrix $H$, the time taken to perform this computation was insignificant when compared to the k-means clustering time. RFF clustering involves running k-means on a $2m$-dimensional matrix, which takes longer than running k-means on a $C$-dimensional matrix. Although the SV clustering algorithm includes computing the singular vectors of $H$, the overhead of performing SVD is small, rendering it more efficient than the RFF clustering algorithm. On the CIFAR-10 data set, the SV clustering algorithm was at least 15 times faster than the RFF clustering algorithm. On the MNIST data set, the SV clustering algorithm was about 20 times faster than the RFF clustering algorithm. Similar speedups were obtained for the other data sets as well. We will see later that the SV clustering algorithm achieves similar clustering accuracy as the RFF clustering algorithm, so we conclude that the SV clustering algorithm is more suitable for large-scale kernel clustering than the RFF clustering algorithm.

The Nystrom approximation based spectral clustering algorithm finds the clusters by executing k-means on the top eigenvectors of a low-rank approximate kernel matrix derived from a randomly sampled data subset of size $m$. It first obtains the eigenvectors of an $m \times m$ matrix and then extrapolates them to the top eigenvectors of the $n \times n$ kernel matrix. As the SV clustering algorithm only finds the top singular vectors of an $s \times m$ matrix, it is more efficient than the Nystrom approximation based spectral clustering algorithm. The SV clustering algorithm was also faster than approximate kernel k-means on all the data sets.
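The source of this efficiency difference can be seen in a minimal sketch: RFF clustering runs k-means on the $2m$-dimensional matrix $H$, whereas SV clustering runs k-means on the $C$-dimensional embedding given by the top singular vectors. The fragment below (in Python, with scikit-learn's k-means) uses a random stand-in for $H$ and a full SVD for brevity; the chapter instead computes the singular vectors from a sampled $s$-row submatrix of $H$, so this is only an illustration of the dimensionality contrast, not the thesis implementation.

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, m, C = 10000, 1000, 10
H = rng.standard_normal((n, 2 * m)) / np.sqrt(m)   # stand-in for the RFF matrix

t0 = time.perf_counter()
labels_rff = KMeans(n_clusters=C, n_init=1).fit_predict(H)        # 2m dims
t1 = time.perf_counter()

U, s, _ = np.linalg.svd(H, full_matrices=False)
labels_sv = KMeans(n_clusters=C, n_init=1).fit_predict(U[:, :C])  # C dims
t2 = time.perf_counter()

print(f"RFF clustering: {t1 - t0:.1f}s, SV clustering (incl. SVD): {t2 - t1:.1f}s")
```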
As expected, the SV algorithm was faster than the kernel k-means algorithm on the CIFAR-10 and MNIST data sets. As it is prohibitive to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets, we randomly selected a subset of 50,000 points from these data sets, executed kernel k-means on this subset to obtain the cluster centers, and assigned the remaining points to the closest center. We recorded the time taken for this procedure as the time taken by kernel k-means on these large data sets. The SV algorithm was faster than this approximate version of kernel k-means as well on all the data sets. When the dimensionality of the data set was greater than the number of clusters in it, the SV clustering algorithm ran faster than the k-means algorithm.

3.5.4.2 Cluster quality

Figure 3.2 records the silhouette coefficient values of the proposed and baseline algorithms on the CIFAR-10 and MNIST data sets.

Figure 3.2 Silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms on (a) CIFAR-10 and (b) MNIST. The parameter $m$, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, is set to $m = 2{,}000$.

The proposed algorithms achieved values comparable to the kernel k-means algorithm and the approximate kernel k-means algorithm, showing that they yield similar partitions. The silhouette coefficient values of the Nystrom approximation based spectral clustering algorithm were marginally lower than those of the remaining kernel-based clustering algorithms. The k-means algorithm yielded non-compact partitions with silhouette values closer to 0.

Figure 3.3 shows the NMI values achieved by the proposed algorithms and the baseline algorithms. We first observe that the accuracy of all the kernel-based algorithms, including the proposed algorithms, was better than that of the k-means algorithm, demonstrating the fact that incorporating a non-linear similarity function improves the clustering performance. On the CIFAR-10 and MNIST data sets, we observed that the performance of both our algorithms was similar to that of kernel k-means. Comparison with kernel k-means is not feasible on the remaining data sets due to their large size. The proposed algorithms outperformed the approximate version of kernel k-means in which a subset of the data was clustered and the remaining points were assigned to the closest center. The proposed algorithms' performance was significantly better than that of the Nystrom approximation based spectral clustering algorithm on all data sets. They performed only marginally worse than the approximate kernel k-means algorithm. The difference in the NMI values of the RFF clustering algorithm and the SV clustering algorithm is minimal for most data sets.

Figure 3.3 NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, with respect to the true class labels, on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. The parameter $m$, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, is set to $m = 2{,}000$. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

3.5.4.3 Parameter sensitivity

The number of Fourier components $m$ plays a crucial role in the performance of the RFF and SV clustering algorithms. The running time of the algorithms is compared with the approximate kernel k-means and the Nystrom spectral clustering algorithms for different values of $m$ in Table 3.3.
In the table, $m$ represents the number of Fourier components in the context of the RFF and SV clustering methods, and it represents the size of the sample drawn from the data set in the context of approximate kernel k-means and Nystrom approximation based spectral clustering. As observed earlier, the SV algorithm is faster than the RFF clustering algorithm. For instance, the SV algorithm is about 15 times faster than the RFF algorithm on the CIFAR-10 data set when $m = 100$. The speedup factor increased as the number of Fourier components $m$ increased. We note that the speedup on the Network Intrusion data set became significant only when $m \ge 500$. The SV clustering algorithm was also faster than approximate kernel k-means for all values of $m$, due to the fact that, unlike in the approximate kernel k-means algorithm, the dimensionality of the input to the k-means step (which dominates the running time) remains constant despite the increase in $m$. The dimensionality of the input kernel in the approximate kernel k-means algorithm increases linearly with $m$.

The silhouette coefficient values achieved by the algorithms on the CIFAR-10 and MNIST data sets, for different values of $m$, are shown in Figure 3.4. We first observe that the silhouette values achieved by the proposed RFF and SV clustering algorithms increased significantly as $m$ increased. The values were initially much lower than those achieved by the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, but became comparable when $m \ge 1{,}000$.

The NMI values achieved by the algorithms for different values of $m$ are shown in Figure 3.5. We note that, although the SV clustering algorithm performed worse than the RFF clustering algorithm in terms of NMI when $m$ is small, it yielded similar performance as the RFF clustering algorithm when $m$ was substantially large. On the MNIST data set, as the value of $m$ increased from 100 to 2,000, the average NMI achieved by the RFF clustering algorithm increased by about 15%, whereas the SV clustering algorithm achieved an increase of 20%. Similar rates of increase were observed on the other data sets also. This verifies our claim that the approximation error of the SV clustering algorithm decreases at a higher rate with respect to the parameter $m$ than that of the RFF clustering algorithm. While the NMI values of the SV clustering method are higher than those of the Nystrom spectral clustering method for all $m$ values on most data sets, they are only marginally lower than those of the approximate kernel k-means algorithm for small $m$, and become close to the approximate kernel k-means values as $m$ increases.
Figure 3.4 Effect of the number of Fourier components $m$ on the silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms on (a) CIFAR-10 and (b) MNIST. Parameter $m$ represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms.

Table 3.3 Effect of the number of Fourier components $m$ on the running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. Parameter $m$ represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms. Standard deviations are shown in parentheses.

(a) CIFAR-10
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 89.94 (18.96) | 5.39 (1.63) | 12.29 (4.66) | 0.91 (0.16)
200 | 176.47 (47.59) | 6.09 (1.76) | 39.91 (15.11) | 1.86 (0.20)
500 | 449.23 (103.61) | 10.71 (3.32) | 13.20 (2.14) | 5.61 (1.89)
1,000 | 1,176.74 (276.07) | 16.46 (6.54) | 49.50 (22.17) | 26.24 (5.26)
2,000 | 3,418.21 (907.14) | 58.32 (38.68) | 37.01 (6.52) | 116.13 (1.97)

(b) MNIST
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 85.36 (25.64) | 3.85 (2.37) | 26.57 (3.12) | 6.00 (0.89)
200 | 122.31 (48.31) | 4.66 (1.78) | 17.98 (7.99) | 46.70 (8.51)
500 | 272.57 (111.25) | 9.22 (1.22) | 24.72 (8.46) | 342.38 (105.80)
1,000 | 517.48 (44.60) | 17.46 (1.43) | 36.34 (6.92) | 914.18 (215.77)
2,000 | 1,089.26 (483.63) | 39.94 (5.64) | 86.43 (12.71) | 4,163.76 (383.37)

(c) Forest Cover Type
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 154.97 (65.72) | 9.62 (2.57) | 19.10 (6.35) | 11.75 (1.73)
200 | 174.88 (65.36) | 10.77 (1.67) | 24.21 (12.48) | 13.65 (1.59)
500 | 534.01 (216.18) | 22.15 (6.08) | 32.48 (11.64) | 41.92 (7.89)
1,000 | 1,032.58 (221.56) | 35.46 (5.20) | 66.15 (19.25) | 124.83 (38.32)
2,000 | 2,078.63 (617.22) | 76.99 (17.04) | 157.48 (27.37) | 534.77 (323.76)

(d) Imagenet-34
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 24.43 (0.92) | 17.72 (1.09) | 551.70 (120.53) | 125.82 (8.26)
200 | 57.66 (2.15) | 33.82 (0.96) | 676.39 (10.94) | 183.31 (4.63)
500 | 163.74 (5.54) | 84.34 (4.62) | 906.07 (209.53) | 461.52 (7.48)
1,000 | 340.23 (11.30) | 160.89 (5.65) | 1,028.99 (34.83) | 586.66 (91.72)
2,000 | 1,333.85 (6.53) | 212.32 (4.75) | 1,261.02 (37.39) | 1,841.47 (123.82)

(e) Poker
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 144.22 (11.88) | 12.32 (1.70) | 55.57 (11.22) | 10.88 (1.65)
200 | 411.32 (34.34) | 17.35 (2.07) | 89.14 (32.12) | 46.78 (4.21)
500 | 654.98 (132.70) | 22.82 (2.48) | 117.57 (20.17) | 90.57 (18.57)
1,000 | 2,287.53 (159.06) | 27.37 (2.09) | 202.84 (44.25) | 261.14 (20.51)
2,000 | 4,530.44 (276.37) | 41.08 (2.57) | 256.26 (44.84) | 479.73 (47.46)
(f) Network Intrusion
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 2,252.44 (465.94) | 147.93 (62.03) | 736.59 (238.31) | 145.21 (22.76)
200 | 5,371.85 (1,765.02) | 258.86 (41.32) | 697.04 (442.25) | 169.27 (38.15)
500 | 5,296.87 (3,321.66) | 245.37 (158.57) | 586.14 (130.23) | 366.42 (175.57)
1,000 | 24,151.47 (6,351.34) | 435.53 (189.07) | 763.75 (88.55) | 589.57 (54.14)

Figure 3.5 Effect of the number of Fourier components $m$ on the NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, on the six benchmark data sets: (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. Parameter $m$ represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms.

3.5.4.4 Scalability

We analyze the scalability of the proposed RFF and SV clustering algorithms for different values of $n$, $d$, and $C$ using the synthetic concentric circles data set. We set the number of Fourier features $m$ to 1,000. Figures 3.6(a) and 3.7(a) show that the running times of the RFF and SV clustering algorithms vary nearly linearly as the number of points in the data set varies from $n = 100$ to $n = 10^7$, with dimensionality $d = 100$ and number of clusters $C = 10$. The scalability plots of the RFF and SV clustering algorithms are similar to the scalability plots of the approximate kernel k-means algorithm, because all three algorithms have linear time complexity with respect to $n$.

The dimensionality of the data set affects the time taken for calculation of the Fourier features. The order of increase in the running times of the two algorithms as $d$ varies from $d = 10$ to $d = 1{,}000$, with $n = 10^6$ and $C = 10$, is shown in Figures 3.6(b) and 3.7(b).

As the number of clusters was increased from $C = 10$ to $C = 1{,}000$, with $n = 10^5$ and $d = 100$, the running times of the RFF and SV algorithms increased almost linearly with $C$, as shown in Figures 3.6(c) and 3.7(c). We note that the number of clusters affects the running time of the SV clustering algorithm more than that of the RFF clustering algorithm, because the SV clustering algorithm projects the data into a $C$-dimensional space before clustering.

Figure 3.6 Running time of the RFF clustering algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.

Figure 3.7 Running time of the SV clustering algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.

3.5.4.5 Out-of-sample clustering

To evaluate the performance of our algorithm on out-of-sample data points, we divided each data set into two parts, one containing 80% of the data, and the other containing the remaining 20%. We call the first part the training set and the second part the test set, in accordance with the convention followed in supervised learning problems. We computed the cluster centers using the training set, and assigned each test point to the closest cluster center, using the SV clustering algorithm. The class assignment of a test point was determined by the majority class in the cluster to which it was assigned.
We compared the performance of our algorithm with the weighted kernel principal component analysis (WKPCA) extension for out-of-sample data points, proposed in [11]. This method first finds the eigenvectors $Z = (\mathbf{z}_1,\ldots,\mathbf{z}_C)$ of the matrix $D^{-1}MK$ corresponding to its smallest $C$ eigenvalues, where $D = \mathrm{diag}(K^\top \mathbf{1})$ is the degree matrix and

$$M = I - \frac{1}{\mathbf{1}^\top D^{-1} \mathbf{1}}\, \mathbf{1}\mathbf{1}^\top D^{-1}$$

is a centering matrix, and then encodes the eigenvectors into binary codewords based on their sign. These codewords are clustered to obtain $C$ binary codewords $\{\mathbf{c}_1,\ldots,\mathbf{c}_C\}$. The following procedure is employed to obtain the cluster label for a new point $\mathbf{x}$:

(i) Project $\mathbf{x}$ onto the space spanned by the eigenvectors of the training set as $\varphi^\top Z$, where $\varphi = (\kappa(\mathbf{x},\mathbf{x}_1),\ldots,\kappa(\mathbf{x},\mathbf{x}_n))^\top$.

(ii) Compute the codeword $\mathbf{c}^* = \mathrm{sign}(\varphi^\top Z)$.

(iii) Assign $\mathbf{x}$ to the cluster $k^*$ which minimizes $d_{HM}(\mathbf{c}^*, \mathbf{c}_k)$, where $d_{HM}$ represents the Hamming distance [66] between two codewords, defined as $d_{HM}(\mathbf{x}_a, \mathbf{x}_b) = |\mathbf{x}_a - \mathbf{x}_b|$.

The WKPCA extension requires the eigendecomposition of an $n \times n$ matrix, which takes $O(n^3)$ time. In addition, an $O(n)$ vector needs to be computed to perform label assignment.

We also compare the performance of the proposed algorithm with the approximate kernel k-means algorithm. The test point $\mathbf{x}$ is added to the closest cluster $k$, measured (up to a constant) by

$$\alpha_k^\top \widehat K \alpha_k - 2\, \varphi^\top \alpha_k,$$

where $\varphi = \left[\kappa(\mathbf{x},\widehat{\mathbf{x}}_1),\ldots,\kappa(\mathbf{x},\widehat{\mathbf{x}}_m)\right]$, $\{\widehat{\mathbf{x}}_1,\ldots,\widehat{\mathbf{x}}_m\}$ is the set of sampled data points, $\widehat K$ is the kernel similarity between the sampled points, and $\alpha_k$ is the $k$th row of the cluster center coefficient matrix $\alpha$, given by (2.10).

We report the running time and accuracy on the six data sets in Table 3.4. The running time is divided into training time and testing time. The training time for WKPCA includes the time taken to compute the kernel matrix for the training data and its eigenvectors, and the time taken to convert the eigenvectors to the cluster codewords. The testing time is the time taken for data projection and Hamming distance computation for all the test data points. For the approximate kernel k-means algorithm, the training time includes the time to cluster the training data and obtain the cluster center coefficient matrix $\alpha$. The testing time includes the time taken to compute the similarity between the test data points and the sampled data points, and the time to assign the cluster labels to the test data points. For SV clustering, the training time is defined as the time taken to compute the random Fourier features and the singular vectors for the training data, and the testing time is defined as the time taken to assign labels to the test data.

The WKPCA method took about 40 seconds, on average, to assign labels to the 12,000 test images in the CIFAR-10 data set, whereas our method took less than 5 seconds, for $m = 1{,}000$. On the MNIST data set, the WKPCA method took about 940 seconds to cluster the test set containing 14,000 data points, significantly longer than the proposed algorithm, which took around 60 seconds, for $m = 1{,}000$. It is infeasible to evaluate the performance of WKPCA on the large data sets. We observed that both the proposed algorithm and the WKPCA method achieved similar classification performance on the CIFAR-10 and MNIST data sets. A reasonably good accuracy was achieved on the remaining large data sets also. The proposed algorithm also runs faster than the approximate kernel k-means algorithm, and achieves comparable test accuracy.
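The codeword-based assignment step of WKPCA described above can be sketched as follows. This Python fragment is illustrative only: the eigenvector matrix `Z`, the cluster codewords, and the kernel function are assumed to be precomputed during training, and the Hamming distance is computed by counting disagreeing signs, which is equivalent (up to a constant factor) to $|\mathbf{x}_a - \mathbf{x}_b|$ for $\pm 1$ codes.

```python
import numpy as np

def wkpca_assign(x, X_train, Z, codewords, kernel):
    """Assign x to the cluster whose codeword is closest in Hamming distance.
    Z: (n, C) eigenvectors of D^{-1} M K for the training set;
    codewords: (num_clusters, C) cluster codewords in {-1, +1};
    kernel(a, B): kernel similarities between a point and the rows of B."""
    phi = kernel(x, X_train)               # (n,) similarities to training data
    proj = phi @ Z                         # projection onto the eigenvectors
    c = np.sign(proj)                      # binary codeword for x
    ham = np.sum(codewords != c, axis=1)   # Hamming distance to each cluster
    return int(np.argmin(ham))

# Toy usage with random placeholder training quantities:
rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 4))
Z = rng.standard_normal((50, 3))
codes = np.sign(rng.standard_normal((6, 3)))
rbf = lambda a, B: np.exp(-np.sum((B - a) ** 2, axis=1))
print(wkpca_assign(rng.standard_normal(4), X_train, Z, codes, rbf))
```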
Table 3.4 Running time (in seconds) and prediction accuracy (in %) for out-of-sample data points. Parameter $m$ represents the sample size for the approximate kernel k-means algorithm and the number of Fourier components for the SV clustering algorithm. The value of $m$ is set to 1,000 for both the algorithms. It is not feasible to execute the WKPCA algorithm on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. Standard deviations are shown in parentheses.

Measure: Method | CIFAR-10 | MNIST | Forest Cover Type | Imagenet-34 | Poker | Network Intrusion
Training time: WKPCA | 755.02 (91.35) | 910.90 (84.37) | - | - | - | -
Training time: Approx. kernel k-means | 26.24 (2.36) | 62.11 (3.58) | 39.38 (3.93) | 1,913 (414) | 391.04 (120.1) | 998.36 (812.73)
Training time: SV clustering | 5.96 (0.83) | 10.48 (0.51) | 25.28 (1.61) | 155.89 (4.77) | 49.75 (6.09) | 115.73 (3.50)
Testing time: WKPCA | 39.68 (2.77) | 29.50 (4.69) | - | - | - | -
Testing time: Approx. kernel k-means | 22.47 (2.05) | 55.38 (1.75) | 26.76 (0.97) | 1,543 (412) | 373.45 (119.5) | 213.68 (29.28)
Testing time: SV clustering | 5.33 (2.25) | 2.12 (0.57) | 5.97 (2.33) | 80.24 (0.02) | 14.24 (0.51) | 121.35 (32.50)
Accuracy: WKPCA | 80.70 | 84.84 | - | - | - | -
Accuracy: Approx. kernel k-means | 83.08 (0.01) | 88.76 (0.001) | 59.39 (0.10) | 88.50 (0.002) | 55.40 (0.001) | 57.30 (0.03)
Accuracy: SV clustering | 83.13 (0.04) | 88.33 (0.52) | 58.42 (0.64) | 80.56 (0.01) | 55.41 (0.04) | 59.03 (0.03)

3.6 Summary

The RFF clustering and SV clustering algorithms, proposed in this chapter, use random Fourier features to obtain a good approximation of kernel clustering using an efficient linear clustering algorithm. We have analytically bounded the approximation error of both these methods. We have shown that, when there is a large gap in the eigenspectrum of the kernel matrix, as is the case in most big data sets, the SV clustering algorithm, which clusters the singular vectors of the random Fourier features, is a more effective and scalable approximation of kernel clustering, allowing large data sets with millions of data points to be clustered using kernel-based clustering. It also solves the out-of-sample clustering problem efficiently. The RFF clustering algorithm can be trivially parallelized by replicating the random Gaussian matrix across the computing nodes, calculating the random Fourier features for a subset of the data in each node, and employing the parallel k-means algorithm to cluster the random Fourier feature matrix and obtain the cluster labels. The SV clustering algorithm can be similarly parallelized, by using the distributed Lanczos eigensolver to obtain the eigenvectors of the random Fourier feature matrix.

The approximate kernel k-means algorithm in Chapter 2 and the random Fourier features-based algorithms in this chapter are all based on sampling the data set and using the samples as basis functions for the cluster centers. While approximate kernel k-means employs the data-dependent Nystrom kernel approximation, and obtains the basis functions by factorizing the kernel matrix, the basis functions in the RFF and SV clustering algorithms depend only on the kernel function. Therefore, these algorithms require a large number of Fourier components to achieve cluster quality equivalent to that of the approximate kernel k-means algorithm. Kernel selection is also very crucial in the RFF and SV clustering algorithms. We have focused on using shift-invariant kernel functions in our work, but these algorithms can be extended to polynomial and intersection kernels using the schemes prescribed in [92] and references therein, to obtain the basis functions.
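The parallelization strategy described in this summary can be illustrated on a single machine. The sketch below (in Python, using multiprocessing as a stand-in for a cluster of computing nodes) replicates the random Gaussian matrix across workers and computes the random Fourier features of each data chunk independently; the chunk sizes, worker count, and function names are illustrative choices, not the thesis implementation.

```python
import numpy as np
from multiprocessing import Pool

def rff_chunk(args):
    """Compute random Fourier features for one chunk, given the shared W."""
    X, W = args
    P = X @ W.T
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(W.shape[0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8000, 16))
    W = rng.standard_normal((64, 16))      # replicated across all workers
    with Pool(4) as pool:
        H = np.vstack(pool.map(rff_chunk, [(c, W) for c in np.array_split(X, 4)]))
    # H would then be fed to a parallel k-means (RFF clustering) or to a
    # distributed eigensolver followed by k-means (SV clustering).
```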
Chapter 4

Stream Clustering

4.1 Introduction

In addition to the large volume, big data is also characterized by "velocity", the continuous pace at which data flows in from sources such as sensors, machines, networks, and user interaction with web sites. Analysis of this real-time data can help in making valuable decisions. For instance, intrusions can be detected in IP networks by analyzing the network traffic.

Clustering streaming data is challenging due to the following two reasons:

(i) Streaming data sets are often too large to load in memory; they could potentially be unbounded. Only a small subset of the data may be stored, depending on the amount of memory available, so the data can be accessed at most once; and

(ii) the data is non-stationary, i.e., the distribution of the data changes over time. The data that arrived more recently has higher relevance than the older data.

Batch clustering algorithms such as k-means and kernel k-means assume that the data is completely available in memory at the time of clustering. They also assume that the input data is drawn from a mixture of a fixed set of distributions, and the aim of clustering is to identify these component distributions. Therefore, batch clustering algorithms cannot be directly used to cluster streaming data. Stream clustering algorithms model the data dynamically. Cluster labels are assigned to data points as they arrive, in an online manner. Stream clustering algorithms generally consist of two stages: (i) an online phase, where the stream data is summarized into "prototypes" as it arrives, and (ii) an offline phase, where these prototypes are used to obtain the clusters. The set of prototypes is dynamically updated to account for the evolution of the clusters in the streaming data.

Many stream clustering algorithms use measures such as the Euclidean distance to define the pairwise similarity. As demonstrated in the earlier chapters, kernel-based algorithms achieve better clustering quality than linear clustering algorithms. However, kernel-based clustering algorithms are ill-suited to streams because of their high computational complexity. In this chapter, we adapt the kernel k-means algorithm to efficiently handle streaming data. The proposed algorithm samples the data points as they arrive and constructs an approximate kernel matrix using the sampled points. The sampling is performed with probability proportional to the statistical leverage scores [34] of this matrix, a measure of the importance of the data points. The sampled data points are stored in memory and used to determine the cluster labels of the incoming data points. We show that only a small subset of the data needs to be stored in memory, thereby enhancing the efficiency of kernel clustering for data streams.

4.2 Background

Data stream clustering has been studied extensively in the pattern recognition and data mining literature. Most stream clustering algorithms summarize the data stream using special data structures, and obtain the cluster representatives using this summary. They differ by the data structures used to summarize the data; common data structures are trees, coresets, and grids (see Table 4.1).

Table 4.1 Major published approaches to stream clustering.

Approaches for stream clustering | Examples
CF-Trees | Stream [79], StreamLSearch [140], Scalable k-means [30], Single-pass k-means [62]
Microcluster trees | CluStream [8], ClusTree [98], ClusTrel [124], DenStream [32], HPStream [9]
Coresets | StreamKM++ [6]
Grids | D-Stream [36], ODAC [149]
Approximate clustering | Streaming k-means approximation [10], Fast streaming k-means [162]
Kernel-based | Incremental spectral clustering [139], Adaptive non-linear clustering [86], sKKM [84], TechnoStream [134]

The Stream and LSearch algorithms split the incoming data into chunks, cluster the chunks individually to find the cluster prototypes, and then cluster these prototypes to obtain the final clusters [79, 140]. These algorithms cannot be used to perform real-time clustering. The Clustering Feature (CF) Tree was introduced by Zhang et al.
as a part of the BIRCH clustering algorithm [197]. A CF-Tree summarizes the data stream into a hierarchy of nodes. Each node contains a set of CF-vectors comprising the linear sum and the squared sum of a set of points which are close to each other. The CF-Tree has been used in several stream clustering algorithms such as the scalable k-means and single-pass k-means algorithms [30, 62]. The idea of CF-vectors was then extended to "micro-clusters", which include the temporal information about the data [32, 98, 124]. This information is used to detect evolutionary changes in the data stream. For instance, the CluStream algorithm stores the linear and squared sums of the timestamps of the data points in the microcluster, in addition to the linear sum and the squared sum of the data points. These timestamp values are used to assign weights to the data points, thereby giving more importance to the new data than older data while clustering. Similarly, the HPStream algorithm weights the clusters using the temporal information and assigns data to more recent clusters [9].

A coreset is a weighted subset of points that approximates the input data set up to a pre-defined error margin. The StreamKM++ algorithm summarizes the data stream into a set of coresets organized into a hierarchy known as the coreset tree [6]. Each node in the tree contains a subset of points represented by a set of prototypes. The final clusters are obtained by grouping the coreset representatives in the root node of the coreset tree. Grid-based algorithms such as D-Stream and DGClust partition the $d$-dimensional feature space into grid cells [36, 149]. Each cell is represented by a tuple containing the timestamps, a cluster label and the density of the grid. Data points are added to the grids, and the grid summaries are updated incrementally as the data points arrive. Approximate clustering algorithms such as streaming k-means [10, 162] choose a subset of the points from the stream, ensuring that the selected points are as distant from each other as possible, and execute k-means on the data subset.

To the best of our knowledge, based on the published literature, very few attempts have been made to use non-linear similarity measures for clustering data streams. The agglomerative hierarchical clustering algorithm is adapted to use kernel distance measures in [193]. The incremental spectral clustering algorithm [139] extends spectral clustering to stream data by treating each new edge in the graph as a vector appended to the similarity matrix. The graph Laplacian, its eigenvalues and eigenvectors are updated incrementally with the new edges.

The stream kernel k-means algorithm [84] divides the data set into windows of fixed time-steps, and performs clustering using the data points in every two consecutive windows. Information from the current time-step is passed on to the next time-step in the form of meta-vectors containing weights for each of the $C$ clusters. Jain et al. proposed a two-tier system called the adaptive non-linear clustering algorithm to perform stream clustering using non-linear similarity [86]. In the first tier, the incoming data points are partitioned into segments, separated from each other by novel data points. A data point $\mathbf{x}$ is considered novel if the kernel-based distance from $\mathbf{x}$ to the mean of the data points in the current segment is greater than a user-defined threshold. In the second tier, the representative segments are identified and projected into a low-dimensional space spanned by the dominant principal coordinates of the data in the kernel space [76]. The cluster labels for the data points are obtained by clustering the low-dimensional representations of the data. This technique requires the eigendecomposition of a large number of points in the second tier. Unlike these existing methods, the proposed method uses the complete history of the data, and does not require complex operations.

4.3 Approximate Kernel k-means for Streams

Figure 4.1 Schema of the proposed approximate stream kernel k-means algorithm.

In Chapter 2, we presented the approximate kernel k-means algorithm, which constrained the cluster centers to the span of a subset of the data points. We employ a similar strategy to cluster streaming data. The key idea is to sample the data points as they arrive and construct the kernel matrix incrementally using the sampled points. This approximate kernel matrix is used to cluster the sampled points. The cluster labels are assigned to the unsampled data points using their kernel similarity with the sampled points. A high-level overview of the proposed clustering framework is presented in Figure 4.1. Our framework consists of three primary components, working in tandem: (i) importance sampling, (ii) clustering, and (iii) cluster label assignment. The sampling component samples the points from the stream, and constructs the approximate kernel matrix. The clustering and label assignment components update the clusters and the number of clusters dynamically, and assign cluster labels to all the data points in the stream.
InChapter2,wepresentedtheapproximatekernel k -meansalgorithmwhichconstrainedthe clustercenterstothespanofasubsetofthedatapoints.Wee mployasimilarstrategytocluster streamingdata.Thekeyideaistosamplethedatapointsasth eyarriveandconstructthekernel matrixincrementallyusingthesampledpoints.Thisapprox imatekernelmatrixisusedtocluster thesampledpoints.Theclusterlabelsareassignedtotheun sampleddatapointsusingtheirkernel similaritywiththesampledpoints.Ahighleveloverviewof theproposedclusteringframework ispresentedinFigure4.1.Ourframeworkconsistsofthreep rimarycomponents,workingin tandem:(i)importancesampling,(ii)clustering,and(iii )clusterlabelassignment.Thesampling componentsamplesthepointsfromthestream,andconstruct stheapproximatekernelmatrix. Theclusteringandlabelassignmentcomponentsupdatethec lustersandthenumberofclusters dynamically,andassignclusterlabelstoallthedatapoint sinthestream. 117 Wedescribeeachofthesecomponentsinthefollowingsectio ns: 4.3.1Sampling Oneoftheobstaclestousingkernel k -meansforclusteringstreamdataisthatitrequiresthe computationofthe n n kernelmatrix,where n isthenumberofpointsinthedataset.Itis infeasibletocomputethefullkernelmatrixforstreamdata because n ispotentiallyunbounded. Theapproximatekernel-basedclusteringalgorithmspropo sedinChapters2and3alsoneedto storetheentiredatainmemory,beforeconstructingtheapp roximatekernelmatrices.Thestream clusteringalgorithmproposedinthischapteralleviatest hisissuebyincrementallysamplinga subsetofthepointsfromthestream,andusingonlythissubs ettoconstructthekernelmatrix. Wemaintainabuffer S inmemorytostorethesampledpoints;thenumberofpoints s in S is constrainedbytheuser-denedparameters m and M ( m s M ).Let K t 1 representthekernel matrixattime ( t 1) with K 1 = ( x 1 ;x 1 ) .Whenadatapoint x t arrivesattime t ,weupdatethe kernelmatrixas K t = 8 > > > > < > > > > : 2 6 4 K t 1 ' > ' ( x t ;x t ) 3 7 5 withprobability p t ;K t 1 withprobability 1 p t , (4.1) where K t 1 =[ ( x i ;x j )] ;x i ;x j 2 S ,and ' =( ( x t ;x 1 ) ;:::; ( x t ;x s )) > . Thesimplestmethodofdeterminingwhetherornottoaddadat apoint x t to S ,istoper- formindependentBernoullitrials,i.e. x t isstoredin S withprobability p t = 1 2 .However, Bernoullisamplingresultsinalargekernelapproximation error,andrequiresalargenumberof pointstobestoredinmemory 1 .Toalleviatethisissue,weperformimportancesamplingin stead ofBernoullisampling.Thesamplingprobability p t foreachpoint x t isbasedonitsfiimpor- 1 WedemonstratethisusingasyntheticdatasetinFigure4.2, andusingfourlargebenchmarkdatasetsinSec- tion4.5. 118 (a) (b) (c) (d) Figure4.2Illustrationofimportancesamplingonatwo-dim ensionalsyntheticdatasetcontaining 1 ;000 pointsalong 10 concentriccircles( 100 pointsineachcluster),representedbyfioflinFig- ure(a).Figure(b)shows 50 pointssampledusingimportancesampling,andFigures(c)a nd(d) show 50 and 100 pointsselectedusingBernoullisampling,respectively.T hesampledpointsare representedusingfi*fl.Allthe 10 clustersarewell-representedbyjust 50 pointssampledusing importancesampling.Ontheotherhand, 50 pointssampledusingBernoullisamplingarenotad- equatetorepresentthese 10 clusters(Cluster 4 inredhasnorepresentatives).Atleast 100 points areneededtorepresentalltheclusters. 
To alleviate this issue, we perform importance sampling instead of Bernoulli sampling. The sampling probability $p_t$ for each point $\mathbf{x}_t$ is based on its "importance", defined in terms of the statistical leverage scores [56]. Let the kernel matrix $K_t$ at time $t$ be decomposed as $K_t \simeq V_C \Sigma_C V_C^\top$, where $C$ represents the number of active clusters at time $t$ (we refer to the set of clusters that the data points in the buffer $S$ belong to at time $t$ as the set of active clusters), $\Sigma_C = \mathrm{diag}(\lambda_1,\ldots,\lambda_C)$ contains the highest $C$ eigenvalues of $K_t$, and $V_C = (\mathbf{v}_1,\ldots,\mathbf{v}_C)$ contains the corresponding eigenvectors. The probability of adding point $\mathbf{x}_t$ to $S$ is defined by

$$p_t = \frac{1}{C} \left\| V_C^{(t)} \right\|_2^2, \qquad (4.2)$$

where $V_C^{(j)}$ is the $j$th row of $V_C$. Statistical leverage scores measure the correlation between the eigenvectors of the matrix $K_t$ and the standard basis. A high score indicates that the corresponding data point has a large influence in the approximation of the kernel matrix. The subset of data corresponding to the largest statistical leverage values is the most informative, and can represent the distribution of the entire data. By performing importance sampling on the data stream, the samples that have not been adequately represented by the existing samples are added to the buffer.

Statistical leverage scores have been used successfully to obtain low-rank matrix approximations of large matrices, and to perform large-scale regression and other large-scale data analysis operations [28, 34]. The following lemma, adapted from [74], shows that, at time $t$, the approximation error between the true kernel matrix for the $t$ points $\{\mathbf{x}_1, \mathbf{x}_2,\ldots,\mathbf{x}_t\}$ and the low-rank kernel matrix constructed using this sampling scheme is minimized when the number of samples in $S$ at time $t$ is $s = \Omega(C \ln C)$:

Lemma 8. Let $K$ be a $t \times t$ SPSD matrix, and let $V_C = (\mathbf{v}_1,\ldots,\mathbf{v}_C)$ represent the eigenvectors corresponding to the top $C$-dimensional eigenspace of $K$. Let $K_B$ represent the $t \times s$ matrix obtained by sampling the columns of $K$ with the probability defined in (4.2), and let $\widehat K$ be the $s \times s$ submatrix of $K_B$ corresponding to the sampled columns. For a given failure probability $\delta \in (0,1]$ and approximation factor $\epsilon \in (0,1]$, if $s \ge 3200\,\epsilon^{-2}\, C \ln(4C/\delta)$, we have

$$\left\| K - K_B \widehat K^{-1} K_B^\top \right\|_2 \le \left\| K - K_C \right\|_2 + \epsilon^2 \left\| K - K_C \right\|_*,$$

where $K_C$ is the best $C$-rank approximation of $K$, and $\|\cdot\|_2$ and $\|\cdot\|_*$ represent the spectral norm and trace norm, respectively. (Lemma 8 bounds the error between the approximate kernel and the true kernel for a set of $t$ data points. We demonstrate empirically in Section 4.5 that the accumulated error as time $t$ increases is well-bounded.)

By using importance sampling, we obtain a good approximation of the true kernel by sampling just a fraction of the data set. Figures 4.2(a)-(d) illustrate the advantage of importance sampling over Bernoulli sampling on a two-dimensional data set containing 1,000 points from 10 clusters. Each true cluster is a concentric circle of varying radius, with 100 points, as shown in Figure 4.2(a). Figure 4.2(b) also shows 50 points sampled using importance sampling. We observe that all the 10 clusters are adequately represented by the 50 sampled points. Figure 4.2(c) shows that 50 points sampled from the data using Bernoulli sampling do not represent all the clusters, as the probability of sampling data points from all the clusters is low. All the clusters are represented only when 100 points are sampled, as shown in Figure 4.2(d).

Figure 4.2 Illustration of importance sampling on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles (100 points in each cluster), represented by "o" in Figure (a). Figure (b) shows 50 points sampled using importance sampling, and Figures (c) and (d) show 50 and 100 points selected using Bernoulli sampling, respectively. The sampled points are represented using "*". All the 10 clusters are well-represented by just 50 points sampled using importance sampling. On the other hand, 50 points sampled using Bernoulli sampling are not adequate to represent these 10 clusters (Cluster 4, in red, has no representatives). At least 100 points are needed to represent all the clusters.
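The probability in (4.2) is simply the normalized squared row norm of the top-$C$ eigenvector matrix. The sketch below computes these leverage-score probabilities for the buffered points; because the columns of $V_C$ are orthonormal, the squared row norms sum to $C$ and the probabilities sum to one. How the score of a newly arriving point is obtained from the incrementally updated eigensystem is an implementation detail not shown here.

```python
import numpy as np

def leverage_probabilities(V_C):
    """p_i = ||V_C^{(i)}||^2 / C for each row i of V_C, as in (4.2)."""
    return np.sum(V_C ** 2, axis=1) / V_C.shape[1]

# Toy usage on an SPSD matrix standing in for the sampled kernel matrix:
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
K = A @ A.T
vals, vecs = np.linalg.eigh(K)
V_C = vecs[:, -5:]                  # top C = 5 eigenvectors
p = leverage_probabilities(V_C)
print(p.shape, p.sum())             # (100,) 1.0
```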
4.3.2 Clustering

Let $s$ be the number of points in the buffer $S$, and let $C$ be the number of active clusters at time $t$. After the kernel matrix $K_t$ is constructed in accordance with (4.1), the data points in $S$ can be partitioned into $C$ clusters by solving the kernel k-means problem

$$\max_{U \in \mathcal{P}} \ \mathrm{tr}(\widetilde U K_t \widetilde U^\top), \qquad (4.3)$$

where $U = (\mathbf{u}_1,\ldots,\mathbf{u}_C)^\top$ is the cluster membership matrix, $\widetilde U = [\mathrm{diag}(U\mathbf{1})]^{-1/2}\, U$, the domain $\mathcal{P} = \{U \in \{0,1\}^{C \times s} : U^\top \mathbf{1} = \mathbf{1}\}$, and $\mathbf{1}$ is a vector of all ones. The running time complexity of this step would be $O(s^2)$. We further reduce this complexity by constraining the cluster centers to a smaller subspace, spanning the top $C$ eigenvectors of the kernel matrix $K_t$, along the lines of the spectral clustering algorithm. We pose the clustering problem as the following optimization problem:

$$\min_{U \in \mathcal{P}} \ \max_{\{c_k(\cdot) \in \mathcal{H}_a\}_{k=1}^C} \ \sum_{k=1}^C \sum_{i=1}^s \frac{U_{k,i}}{s} \left\| c_k(\cdot) - \kappa(\mathbf{x}_i, \cdot) \right\|_{\mathcal{H}_\kappa}^2, \qquad (4.4)$$

where $\mathcal{H}_a = \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_C)$. The cluster centers can be expressed as linear combinations of the eigenvectors of the kernel matrix:

$$c_k(\cdot) = \sum_{i=1}^s \sum_{j=1}^C \frac{U_{k,i}}{n_k} \sqrt{\lambda_j}\, v_{ij} = \frac{\mathbf{u}_k}{n_k}\, V_C\, \Sigma_C^{1/2}, \quad k \in [C], \qquad (4.5)$$

where $n_k$ is the number of points in the $k$th cluster, and $\mathbf{u}_k = (U_{k,1}, U_{k,2},\ldots,U_{k,s})^\top$. By substituting (4.5) in (4.4), we obtain the following trace maximization problem:

$$\max_{U \in \mathcal{P}} \ \mathrm{tr}(\widetilde U V_C \Sigma_C V_C^\top \widetilde U^\top). \qquad (4.6)$$

The above problem can be solved efficiently by executing k-means on the matrix $V_C \Sigma_C^{1/2}$. In the following lemma, we show that the error incurred due to the approximation (4.4) is bounded, when the lowest eigenvalues of the kernel matrix have small magnitudes, which is true for most real data sets [45]:

Lemma 9. Let $\mathcal{E}$ and $\mathcal{E}_a$ represent the optimal clustering errors in (4.3) and (4.6), respectively. We have

$$|\mathcal{E} - \mathcal{E}_a| \le \sum_{i=C+1}^s \lambda_i.$$

Proof. Let $\{c_k^*(\cdot)\}_{k=1}^C$ and $U^*$ be the optimal solution to (4.3). Let $c_k^a(\cdot)$ represent the projection of $c_k^*$ into the subspace $\mathcal{H}_a$. For any $\kappa(\mathbf{x}_i,\cdot)$, let $g_i(\cdot)$ and $h_i(\cdot)$ be the projections of $\kappa(\mathbf{x}_i,\cdot)$ into the subspace $\mathcal{H}_a$ and $\mathrm{span}(\mathbf{v}_{C+1},\ldots,\mathbf{v}_s)$, respectively. We have

$$\begin{aligned}
\mathcal{E}_a &= \min_{U \in \mathcal{P}}\ \max_{c_k(\cdot)\in\mathcal{H}_a}\ \sum_{k=1}^C \sum_{i=1}^s \frac{U_{k,i}}{s} \left\|c_k(\cdot) - \kappa(\mathbf{x}_i,\cdot)\right\|_{\mathcal{H}_\kappa}^2
\le \sum_{k=1}^C \sum_{i=1}^s \frac{U^*_{k,i}}{s} \left\|c_k^a(\cdot) - \kappa(\mathbf{x}_i,\cdot)\right\|_{\mathcal{H}_\kappa}^2 \\
&\le \sum_{k=1}^C \sum_{i=1}^s \frac{U^*_{k,i}}{s} \left( \left\|c_k^a(\cdot) - g_i(\cdot)\right\|_{\mathcal{H}_\kappa}^2 + \left\|h_i(\cdot)\right\|_{\mathcal{H}_\kappa}^2 \right)
\le \mathcal{E} + \frac{1}{s} \sum_{k=1}^C \sum_{i=1}^s \left\|h_i(\cdot)\right\|_{\mathcal{H}_\kappa}^2
\le \mathcal{E} + \sum_{i=C+1}^s \lambda_i.
\end{aligned}$$

We note that the eigenvalues and eigenvectors do not need to be re-computed for clustering, as they were already computed while calculating the leverage scores. This eliminates the need for computing and storing the kernel matrix $K_t$, as only its top eigenvalues and the corresponding eigenvectors are required for both sampling and clustering. Starting with $V_C = 1$ and $\Sigma_C = \kappa(\mathbf{x}_1, \mathbf{x}_1)$, we can update the eigensystem incrementally as the data points arrive. Efficient methods to update the eigenvectors and eigenvalues incrementally are discussed in Section 4.4.
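Solving (4.6) amounts to running k-means on the rows of the spectral embedding $V_C \Sigma_C^{1/2}$. A minimal sketch, assuming scikit-learn for the k-means step (the thesis implementation is in MATLAB):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_buffer(V_C, lam_C, C):
    """Approximately solve (4.6): k-means on the rows of V_C * sqrt(lam_C)."""
    Y = V_C * np.sqrt(lam_C)           # (s, C) spectral embedding of the buffer
    return KMeans(n_clusters=C, n_init=10).fit_predict(Y)

# Toy usage on an SPSD matrix standing in for K_t:
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
K = A @ A.T
vals, vecs = np.linalg.eigh(K)
C = 4
labels = cluster_buffer(vecs[:, -C:], vals[-C:], C)
print(np.bincount(labels))
```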
4.3.3 Label Assignment

Data points are assigned cluster labels using the cluster centers obtained from the sampled data points, in a manner similar to the SV clustering algorithm in Chapter 3, and the active clusters are updated using a fading cluster mechanism, similar to that used by the adaptive non-linear clustering algorithm [86]. Each cluster $k$ is associated with a timestamp $t_k$, representing the last time a data point was assigned the $k$th cluster label, and a recency value defined by a monotonic function

$$f_k(t) = \exp(-\lambda (t - t_k)), \qquad (4.7)$$

where $\lambda$ is a user-defined parameter representing the decay rate of a cluster [9]. A data point $\mathbf{x}_t$ is added to cluster $k^*$ if

$$k^* = \arg\min_{k \in [C]} \left\| c_k(\cdot) - g_t(\cdot) \right\|_{\mathcal{H}_\kappa}^2 \quad\text{and}\quad f_{k^*}(t) > \eta, \qquad (4.8)$$

where $c_k(\cdot)$ is the cluster center given by (4.5), $g_t(\cdot)$ is the projection of $\kappa(\mathbf{x}_t,\cdot)$ into the subspace spanned by the eigenvectors $V_C$, and $\eta$ is a user-defined lifetime threshold which determines how long a cluster remains active. If the recency $f_{k^*}(t)$ of the closest cluster $k^*$ is less than $\eta$, then a new cluster is created with the data point $\mathbf{x}_t$. After the cluster assignment is made, the timestamp and the recency value of the assigned cluster are updated. Clusters whose recency is less than $\eta$ (called stale clusters) are deleted, and the data points in the buffer that belong to these stale clusters are removed from the buffer.

Algorithm 8 describes the proposed stream clustering method. The input to the algorithm is the data stream $\mathcal{D}$, the kernel function $\kappa(\cdot,\cdot)$, the initial number of clusters $C$, the buffer size parameters ($m$ and $M$), and the cluster fading mechanism parameters ($\lambda$ and $\eta$). Selection of the kernel function and the initial number of clusters $C$ is based on domain knowledge. Several articles in the literature describe techniques to learn the kernel function from the data [112, 177, 200]. The parameters $m$ and $M$ should be set such that the initial and final sample sets contain sufficient representatives from all the clusters. The parameters $\lambda$ and $\eta$ should be selected based on how fast the categories are expected to change in the stream. Heuristics to set these parameters are discussed further in Section 4.5.

Algorithm 8 Approximate Stream Kernel k-means
1: Input:
   $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots\},\ \mathbf{x}_i \in \Re^d$: the data stream to be clustered;
   $\kappa(\cdot,\cdot): \Re^d \times \Re^d \mapsto \Re$: the kernel function;
   $C$: the initial number of clusters;
   $m$: the initial number of points to be sampled ($m > C$);
   $M$: the maximum number of points allowed in the sample set ($m < M$);
   $\lambda$, $\eta$: the cluster decay rate and lifetime threshold.
2: Initialize: add the first $m$ points of $\mathcal{D}$ to the buffer $S$, and compute $K = [\kappa(\mathbf{x}_i, \mathbf{x}_j)],\ \mathbf{x}_i, \mathbf{x}_j \in S$.
3: Compute the top $C$ eigenvalues $\Sigma_C$ and eigenvectors $V_C$ of $K$.
4: Cluster the rows of $V_C \Sigma_C^{1/2}$ into $C$ clusters by solving (4.6), and set the cluster timestamps $t_k$.
5: for $t = m+1, m+2, \ldots$ do
6:   Compute the sampling probability $p_t$ using (4.2).
7:   Update the kernel matrix $K_t$ according to (4.1), adding $\mathbf{x}_t$ to $S$ with probability $p_t$.
8:   if $\mathbf{x}_t$ was added to $S$ then
9:     Update the eigenvalues $\Sigma_C$ and eigenvectors $V_C$ (see Section 4.4).
10:    Re-cluster the points in $S$ by solving (4.6).
11:  end if
12:  Project $\kappa(\mathbf{x}_t, \cdot)$ into the subspace spanned by $V_C$ to obtain $g_t(\cdot)$.
13:  Find the closest cluster $k^*$ using (4.8).
14:  If $f_{k^*}(t) > \eta$, assign $\mathbf{x}_t$ to $k^*$; otherwise create a new cluster with $\mathbf{x}_t$ and set $C = C + 1$.
15:  Find the clusters whose recency $f_k(t) < \eta,\ k \in [C]$, and remove these stale clusters. Set $C = C - c$, where $c$ is the number of stale clusters.
16:  If $\mathrm{card}(S) \ge M$, find the index $q = \arg\min_l \| V_C^{(l)} \|_2^2$ and remove the data point $\mathbf{x}_q$ from $S$.
17: end for

4.4 Implementation and Complexity

The two major operations in the proposed algorithm are: the computation of the leverage scores, and the clustering of the top $C$ eigenvectors of the approximate kernel matrix using k-means. Both operations require the eigenvalues and eigenvectors of the kernel matrix. Let $s$ be the number of points in the sample set $S$ at time $t$. Eigendecomposition of an $s \times s$ kernel matrix $K_t$ takes $O(s^3)$ time, if performed naively. However, we can update the eigensystem incrementally using the fast rank-one update mechanism proposed in [31]. Given the eigendecomposition $K_t = V \Sigma V^\top$ and a vector $\varphi \in \Re^s$, this method finds the eigendecomposition of $K_t + \varphi\varphi^\top$ as

$$K_t + \varphi\varphi^\top = \begin{bmatrix} V & \dfrac{\mathbf{w}}{\|\mathbf{w}\|} \end{bmatrix} \Sigma' \begin{bmatrix} V & \dfrac{\mathbf{w}}{\|\mathbf{w}\|} \end{bmatrix}^\top, \qquad (4.9)$$

where $\mathbf{w} = (I - VV^\top)\varphi$ is the component of $\varphi$ that is orthogonal to $V$, and $\Sigma'$ contains the dominant eigenvalues of the sparse matrix

$$\begin{bmatrix} \Sigma & V^\top \varphi \\ \varphi^\top V & \|\mathbf{w}\| \end{bmatrix}.$$

This operation, repeated every time a new data point is input to the system, can be performed in $O(sC + C^3)$ time.
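The rank-one update can be sketched as follows: given the current top-$C$ eigensystem $(V, \Sigma)$ and the new similarity vector $\varphi$, the problem reduces to a small $(C+1) \times (C+1)$ eigendecomposition. The Python fragment below follows the standard rank-one update construction for $K \leftarrow K + \varphi\varphi^\top$ at a fixed size; handling the growth of $K_t$ by one row and column (for example, by zero-padding $V$ first) is omitted, and the function name is our own.

```python
import numpy as np

def rank_one_update(V, lam, phi, C):
    """Update the top-C eigensystem of K under K + phi phi^T.
    V: (s, C) eigenvectors; lam: (C,) eigenvalues; phi: (s,)."""
    b = V.T @ phi                          # component of phi inside span(V)
    w = phi - V @ b                        # orthogonal residual
    nw = np.linalg.norm(w)
    if nw < 1e-12:                         # phi already lies in span(V)
        U = V
        M = np.diag(lam) + np.outer(b, b)
    else:
        U = np.hstack([V, (w / nw)[:, None]])
        r = np.append(b, nw)
        M = np.diag(np.append(lam, 0.0)) + np.outer(r, r)
    vals, vecs = np.linalg.eigh(M)         # small (C+1) x (C+1) problem
    idx = np.argsort(vals)[::-1][:C]       # keep the dominant C eigenpairs
    return U @ vecs[:, idx], vals[idx]
```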
Clustering is performed every time a point is added to the sample set $S$, which takes $O(sC^2 l)$ time, where $l$ is the number of iterations required to reach convergence. In order to reduce the running time, we can employ a lazy reclustering approach, by which we perform the clustering after every $T$ data point additions. To further enhance the efficiency of the algorithm, the data points can also be processed in batches of size $B$.

In summary, the time taken by the proposed approximate stream kernel k-means algorithm to cluster a data set of size $n$ is $O(ndM + nCM + nC^3 + M^2C^2 l) \approx O(nd + nC)$, when $\max(C, d, M, l) \ll n$. This contrasts with the $O(n^2)$ running time complexity of typical kernel-based clustering.

4.5 Experimental Results

4.5.1 Datasets

The proposed stream clustering algorithm inputs the data set in batches, and can handle potentially unbounded data sets, hence the size of the data set is not significant. The dimensionality of the data set plays an important role in the kernel similarity computation and the eigensystem update. We demonstrate the effectiveness of the proposed algorithm on the CIFAR-10, MNIST, Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets.

4.5.2 Baselines

We compared the performance of the proposed algorithm with two recent stream clustering algorithms (StreamKM++ and sKKM), which have been shown to perform better than the other stream clustering algorithms. The StreamKM++ algorithm [6] is a linear stream clustering algorithm which, in the same spirit as the proposed algorithm, extracts the core points in the streaming data, and uses these core points to determine the cluster centers. The algorithm maintains a set of buckets, each of size $m$. Data points are added to the first bucket until $m$ points are received. They are then recursively merged with the points in the subsequent buckets to form a coreset of $m$ points, using a coreset tree. The coresets are finally clustered using the k-means++ algorithm [12] to obtain the cluster centers. The performance of this algorithm depends on the coreset size $m$. The streaming kernel k-means (sKKM) algorithm proposed in [84] processes the data in chunks of size $m$. The initial data chunk is clustered using kernel k-means. Weighted kernel k-means is used to cluster the subsequent data chunks; the cluster centers from the preceding data chunk are used to obtain the weights. We show that the proposed approximate stream kernel k-means is more effective than these algorithms. We also compare the performance of the proposed algorithm with (i) the batch k-means algorithm, to show that our algorithm achieves higher accuracy, and (ii) the batch kernel k-means algorithm, to evaluate the loss in the cluster quality. We could execute the kernel k-means algorithm only on the medium-sized CIFAR-10 and MNIST data sets, due to its quadratic time complexity. For the remaining data sets, we executed kernel k-means on a 50,000-sized randomly selected subset of the data, and assigned the remaining points to the closest cluster centers. This gives us an approximation of the time taken to execute kernel k-means on the full data set. We finally evaluate the performance of the proposed approximate stream kernel k-means algorithm when each data point is sampled with probability $1/2$, and show that importance sampling plays a significant role in reducing the memory requirements and enhancing the clustering accuracy.
4.5.3 Parameters

We used the universal RBF kernel for the proposed algorithm and the kernel-based baseline algorithms on all the data sets. We tuned the kernel width using grid search in the range $[0,1]$ to obtain the best performance. For the proposed approximate stream kernel k-means algorithm, we varied the initial sample size from $m = 1{,}000$ to $m = 5{,}000$ in multiples of 1,000, and the maximum buffer size from $M = 5{,}000$ to $M = 20{,}000$ in multiples of 5,000, to constrain the memory used to 4 GB. We employed the lazy reclustering approach with $T$ set to 50, and processed the data in batches of size $B = 10{,}000$. We set the cluster decay factor $\lambda = 0.5$, as suggested in [86], and varied the lifetime threshold $\eta$ as $\eta = \exp(-\lambda\tau)$, where $\tau = \{1, 2, \ldots, 5\}$. The coreset size and chunk size parameters for the StreamKM++ and sKKM algorithms were varied from 1,000 to 5,000. The initial number of clusters $C$ was set equal to the true number of classes in the data set, for all the algorithms.

We obtained the code for the StreamKM++ algorithm from the authors (available at http://www.algorithm-engineering.de/software-projects?view=project&task=show&id=17), and implemented the other algorithms in MATLAB. We executed each algorithm 10 times on a 2.8 GHz processor, with the memory constrained to 4 GB for the stream clustering algorithms, and to 40 GB for the batch clustering algorithms. We present the mean and variance of the time taken for clustering (in milliseconds) and the clustering quality, measured in terms of the silhouette coefficient and NMI [104], over these 10 runs. Different permutations of the data set were input to the clustering algorithms in each run.

4.5.4 Results

4.5.4.1 Clustering efficiency and quality

The clustering time for our algorithm is computed as the average time taken to assign a label to each data point. For the baseline algorithms, we computed this time by dividing the total time taken to cluster the data set by the number of points in the data set. Figures 4.3, 4.4 and 4.5 compare the running time, silhouette coefficient and NMI values, respectively, of the proposed algorithm with the baseline algorithms, when the parameters are $m = 5{,}000$, $M = 20{,}000$ and $\tau = 1$.

Figure 4.3 Running time (in milliseconds) of the stream clustering algorithms on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. The parameters for the proposed approximate stream kernel k-means algorithm are set to $m = 5{,}000$, $M = 20{,}000$, and $\tau = 1$. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm, are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.4 Silhouette coefficient values of the partitions obtained using the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10 and (b) MNIST. The parameters for the proposed algorithm were set to $m = 5{,}000$, $M = 20{,}000$, and $\tau = 1$. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm, were set to 5,000.

As expected, the proposed algorithm was faster than the batch kernel k-means algorithm and its approximation (described in Section 4.5.2) on most of the data sets, but took longer than the k-means algorithm, because our algorithm has to compute the kernel similarity and its top eigenvectors, unlike the k-means algorithm.
The silhouette coefficient values of the proposed algorithm are comparable to those of kernel k-means, showing that they yielded similar partitions. The NMI achieved by our algorithm is higher than that of k-means because of the use of non-linear similarity measures. The proposed algorithm also outperforms the approximate variant of the kernel k-means algorithm described in Section 4.5.2. On the CIFAR-10 data set, the batch kernel k-means achieved an NMI value of 16.9%; the proposed algorithm achieves comparable NMI values (15.5%).

Compared to the StreamKM++ algorithm, the proposed algorithm achieves higher clustering quality, both in terms of the silhouette coefficient and NMI, although it takes slightly longer to assign cluster labels to the points. This is due to the fact that our algorithm needs to update and cluster the eigenvectors of the approximate kernel matrix for each batch of data points. The proposed algorithm offers the advantage that the cluster labels can be obtained in real-time, unlike the StreamKM++ algorithm, which needs to process all the data points before assigning the cluster labels. For instance, the proposed algorithm was able to cluster about 2,700 images from the CIFAR-10 data set per second, which is equivalent to a speed of about 8 MBps. On the remaining three data sets, the clustering speed ranges from 30 KBps to 700 KBps. Our algorithm also outperforms the sKKM clustering algorithm in terms of clustering quality. While the sKKM algorithm is slower than the proposed algorithm on the CIFAR-10 data set, its speed is on par with the proposed algorithm on the remaining data sets. The StreamKM++ algorithm obtains clusters from coresets which summarize all the points in the data set. The sKKM algorithm relies on the information from only two time steps and discards most of the historical information. The proposed approximate stream kernel k-means algorithm finds the middle ground by retaining potentially useful data points using importance sampling, and discarding the rest of the data points. This is reflected in the silhouette and NMI values achieved by the algorithms.

Figure 4.5 NMI (in %) of the clustering algorithms with respect to the true class labels on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. The parameters for the proposed approximate stream kernel k-means algorithm are set to $m = 5{,}000$, $M = 20{,}000$, and $\tau = 1$. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm, are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.6 shows how the NMI values of the proposed algorithm fall due to the accumulation of the kernel approximation error over time. We observe that the reduction in NMI is slow and stabilizes over time for most of the data sets, showing that the approximation error reduces over time. The error accumulation can be further minimized by clustering the points in the buffer more frequently (as discussed in Section 4.4), although this would increase the running time. The user can trade off between efficiency and accuracy by tuning the parameters of the algorithm.

Figure 4.6 Change in the NMI (in %) of the proposed approximate stream kernel k-means algorithm over time. The parameters $m$, $M$ and $\tau$ were set to $m = 5{,}000$, $M = 20{,}000$ and $\tau = 1$, respectively.
4.5.4.2 Parameter sensitivity

The proposed approximate stream kernel k-means algorithm relies on five parameters: the initial sample size $m$, the maximum buffer size $M$, the initial number of clusters $C$, the cluster decay rate $\lambda$, and the cluster lifetime threshold $\eta$. We study the influence of these parameters on the algorithm's performance and present heuristics to set the parameter values:

Initial sample size $m$: The time taken by the proposed algorithm to cluster each data point $\mathbf{x}_t$ is influenced by the number of points in the buffer $S$ at time $t$, because the size of the eigenvector matrix $V_C$ increases proportionally. The buffer size at time $t$, in turn, depends on the first $m$ data points $\{\mathbf{x}_1,\ldots,\mathbf{x}_m\}$ input to the system. More data points are sampled from the stream and added to $S$ if the initial sample does not contain a sufficient number of representative points. On the CIFAR-10 data set, the number of additional points sampled reduced from 6,087 to 4,434 as the initial sample size $m$ was increased from 1,000 to 5,000. Similar trends were observed for the remaining data sets as well. Figure 4.7 compares the running time of the proposed algorithm with the StreamKM++ and sKKM algorithms as the parameter $m$ is varied. Recall that $m$ represents the coreset size and the chunk size for the StreamKM++ and sKKM algorithms, respectively. As $m$ was increased, the time taken for clustering by the baseline algorithms also increased. As expected, the proposed algorithm took slightly longer than the StreamKM++ and sKKM algorithms for most data sets, especially when $m$ was large. However, the NMI values achieved by the proposed algorithm are much higher than those achieved by the baseline algorithms, as shown in Figure 4.9. Our algorithm's accuracy improves significantly as $m$ increases, while there is minimal improvement in the cluster quality of the StreamKM++ algorithm. This improvement in accuracy compensates for the higher running time of the proposed algorithm. These results indicate that the initial sample, determined by the order of the data, plays a crucial role in the performance of the proposed algorithm. The variance in the NMI tends to reduce as $m$ increases, again indicating that the order of the data is important.

Figure 4.7 Effect of the initial sample size $m$ on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. Parameter $m$ represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters $M$ and $\tau$ are set to $M = 20{,}000$ and $\tau = 1$, respectively.

Figure 4.8 Effect of the initial sample size $m$ on the silhouette coefficient values of the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10 and (b) MNIST. Parameter $m$ represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters $M$ and $\tau$ are set to $M = 20{,}000$ and $\tau = 1$, respectively.

The silhouette coefficient values achieved by the proposed algorithm vary minimally with the increase in the initial sample size, as shown in Figure 4.8.
Maximum buffer size $M$: The maximum buffer size $M$ does not affect the algorithmic efficiency of the proposed algorithm, provided that $M \approx 2m$ and the initial sample is representative of the stream. If $M$ is small, data points need to be removed more often from the buffer to accommodate the newly sampled data points, which results in an increased running time, as shown in Table 4.2. For instance, when $M$ was set to 5,000, about 2,500 points were removed from the buffer, whereas no points needed to be removed when $M = 20{,}000$, resulting in a 2 millisecond reduction of the clustering time per data point. The silhouette coefficient values vary minimally with $M$, as recorded in Table 4.3. The NMI value increases as $M$ increases, because a larger number of representative data points can be stored in the buffer, as shown in Table 4.4.

Figure 4.9 Effect of the initial sample size $m$ on the NMI (in %) of the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. Parameter $m$ represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters $M$ and $\tau$ are set to $M = 20{,}000$ and $\tau = 1$, respectively.

Table 4.2 Effect of the maximum buffer size $M$ on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter settings: $m = 5{,}000$, $\tau = 1$. Standard deviations are shown in parentheses.

M | 5,000 | 10,000 | 15,000 | 20,000
CIFAR-10 | 9.34 (0.76) | 8.50 (3.33) | 9.57 (2.79) | 7.48 (1.24)
MNIST | 11.05 (2.22) | 10.35 (4.04) | 8.99 (0.41) | 9.94 (1.75)
Forest Cover Type | 7.07 (0.27) | 24.17 (6.69) | 40.65 (12.81) | 58.55 (21.57)
Imagenet-34 | 10.57 (2.62) | 18.77 (4.85) | 48.15 (18.18) | 57.91 (22.20)
Poker | 7.38 (3.56) | 21.06 (9.57) | 44.04 (17.76) | 62.38 (32.31)
Network Intrusion | 12.09 (2.57) | 27.15 (7.07) | 43.05 (15.31) | 161.89 (69.43)

Table 4.3 Effect of the maximum buffer size $M$ on the silhouette coefficient (values ×10⁻²) of the proposed approximate stream kernel k-means algorithm. Parameter settings: $m = 5{,}000$, $\tau = 1$.

M | 5,000 | 10,000 | 15,000 | 20,000
CIFAR-10 | 5.53 (0.12) | 5.63 (0.04) | 6.92 (0.29) | 7.75 (0.26)
MNIST | 80.50 (0.84) | 77.72 (0.66) | 82.19 (1.29) | 82.51 (1.75)

Table 4.4 Effect of the maximum buffer size $M$ on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameter settings: $m = 5{,}000$, $\tau = 1$.

M | 5,000 | 10,000 | 15,000 | 20,000
CIFAR-10 | 6.22 (0.27) | 8.07 (2.73) | 15.49 (0.18) | 15.40 (0.39)
MNIST | 20.15 (0.26) | 29.97 (0.87) | 48.31 (1.50) | 48.31 (1.50)
Forest Cover Type | 0.56 (0.07) | 0.72 (0.05) | 12.19 (0.02) | 14.27 (2.13)
Imagenet-34 | 1.58 (1.27) | 1.73 (1.62) | 6.55 (1.19) | 7.04 (1.24)
Poker | 0.64 (3.45) | 22.54 (2.92) | 39.11 (4.19) | 36.09 (4.94)
Network Intrusion | 13.71 (0.01) | 13.86 (0.40) | 13.75 (0.30) | 14.32 (0.10)

Cluster decay rate $\lambda$, lifetime threshold $\eta$ and number of clusters $C$: The final number of clusters at the end of clustering depends on the ordering of the data set, and on the cluster
τ                   1                2                3                4                5
CIFAR-10            7.48 (±1.24)     9.28 (±1.03)     8.33 (±1.53)     8.54 (±1.66)     9.08 (±1.12)
MNIST               9.94 (±1.75)     9.25 (±0.46)     9.31 (±0.59)     9.42 (±0.61)     10.31 (±1.25)
Forest Cover Type   58.55 (±21.57)   42.80 (±17.26)   48.78 (±20.72)   40.09 (±13.81)   41.88 (±15.90)
Imagenet-34         57.91 (±22.20)   60.25 (±24.43)   55.77 (±26.20)   57.24 (±24.57)   54.98 (±31.10)
Poker               62.38 (±32.31)   44.39 (±16.04)   44.11 (±15.62)   42.65 (±17.48)   43.66 (±16.27)
Network Intrusion   161.89 (±0.69)   164.61 (±0.70)   165.18 (±0.71)   162.36 (±0.68)   163.05 (±0.64)

Table 4.6: Effect of the cluster lifetime threshold $\eta = \exp(-\lambda\tau)$ on the silhouette coefficient ($\times 10^{-2}$) of the proposed approximate stream kernel $k$-means algorithm. Parameters: $m = 5{,}000$, $M = 20{,}000$.

τ          1               2               3               4               5
CIFAR-10   7.75 (±0.26)    7.66 (±0.24)    6.40 (±0.19)    6.35 (±0.20)    6.07 (±0.22)
MNIST      82.51 (±1.75)   82.51 (±1.75)   82.51 (±1.75)   82.51 (±1.75)   82.51 (±1.75)

Table 4.7: Effect of the cluster lifetime threshold $\eta = \exp(-\lambda\tau)$ on the NMI (in %) of the proposed approximate stream kernel $k$-means algorithm. Parameters: $m = 5{,}000$, $M = 20{,}000$.

τ                   1               2               3               4               5
CIFAR-10            15.49 (±0.39)   15.55 (±0.23)   15.41 (±0.33)   15.45 (±0.23)   15.50 (±0.25)
MNIST               48.31 (±1.40)   47.77 (±1.49)   49.45 (±1.48)   45.98 (±1.40)   47.74 (±1.49)
Forest Cover Type   14.27 (±2.13)   12.10 (±0.03)   12.11 (±0.03)   12.10 (±0.03)   12.10 (±0.03)
Imagenet-34         7.04 (±1.24)    7.04 (±1.24)    6.95 (±1.14)    6.95 (±1.14)    7.76 (±1.54)
Poker               36.09 (±4.94)   32.07 (±4.41)   32.07 (±4.41)   36.09 (±4.94)   32.07 (±4.41)
Network Intrusion   14.32 (±0.10)   13.65 (±0.06)   13.65 (±0.06)   13.65 (±0.06)   13.66 (±0.06)

Table 4.8: Comparison of the performance of the approximate stream kernel $k$-means algorithm with importance sampling and Bernoulli sampling.

Data set                      CIFAR-10         MNIST            Forest Cover Type   Imagenet-34        Poker           Network Intrusion
Importance sampling:
  Running time (ms)           7.48 (±1.24)     9.94 (±1.75)     58.55 (±21.57)      57.91 (±22.20)     62.38 (±32.31)  161.89 (±0.69)
  Silhouette (×10⁻²)          7.75 (±0.26)     82.51 (±1.75)    -                   -                  -               -
  NMI (%)                     15.49 (±0.39)    48.31 (±1.40)    14.27 (±2.13)       7.04 (±1.24)       36.09 (±4.94)   14.32 (±0.10)
  Number of points sampled    5,434 (±2,093)   6,136 (±34)      16,561 (±3,710)     14,735 (±1,790)    6,265 (±132)    14,886 (±2,627)
Bernoulli sampling:
  Running time (ms)           2091.50 (±47.34) 2210.77 (±58.05) 1257.03 (±39.33)    3002.45 (±77.97)   86.43 (±1.86)   923.16 (±40.41)
  Silhouette (×10⁻²)          0.72 (±0.01)     8.11 (±0.13)     -                   -                  -               -
  NMI (%)                     11.33 (±4.9)     14.35 (±0.05)    3.93 (±0.7)         4.97 (±0.19)       2.90 (±0.02)    6.50 (±0.15)
  Number of points sampled    31,483 (±717)    23,000 (±203)    407,220 (±5,807)    389,177 (±11,325)  50,000 (±100)   1,711,101 (±44,866)

For instance, when the points in the CIFAR-10 data set were input in their true order (i.e., all images from class $i$ are input before all images from class $j$, $i < j$), the resulting clusters differed from those obtained with a random ordering.

We also applied the proposed algorithm to cluster tweets from Twitter, using the kernel function

$$\kappa(x_a, x_b) = \lambda \exp\left(-(ts_a - ts_b)^2\right) + (1 - \lambda)\,\frac{f_a^\top f_b}{\|f_a\|\,\|f_b\|},$$

where $ts_a$ and $f_a$ denote the timestamp and the tf-idf features of a tweet, represented by data point $x_a$, respectively. The first term in the kernel function ensures that two tweets which were generated in the same time period are likely to be assigned to the same cluster, and the second term ensures that two tweets with similar vocabulary are grouped together. We gave equal importance to both the timestamp and the tf-idf features by setting $\lambda = 0.5$.

[Figure 4.11: Sample tweets from the HTML cluster.]
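To make the tweet kernel above concrete, the following minimal Python sketch computes it for a pair of tweets. The function and variable names, and the choice of timestamp units, are illustrative assumptions rather than the thesis's implementation; only the form of the kernel is taken from the equation above.

```python
import numpy as np

def tweet_kernel(ts_a, f_a, ts_b, f_b, lam=0.5):
    """Kernel between two tweets: temporal proximity plus tf-idf cosine similarity.

    ts_a, ts_b: scalar timestamps (illustrative units, e.g. days);
    f_a, f_b:   tf-idf feature vectors as 1-D numpy arrays (assumed nonzero).
    """
    temporal = np.exp(-(ts_a - ts_b) ** 2)                      # first term
    cosine = f_a.dot(f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b))
    return lam * temporal + (1.0 - lam) * cosine                # lam = 0.5 weighs both equally
```

With $\lambda = 0.5$, two tweets posted far apart in time can still fall in the same cluster if their vocabularies are very similar, and vice versa.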
We set the parameters $m = 5{,}000$, $M = 10{,}000$, $C = 20$, $\lambda = 0.5$, $\eta = \exp(-\lambda) = 0.6$ and $B = 10{,}000$. Our algorithm assigned a cluster label to each tweet in about 200 milliseconds. Treating the hashtags as the ground truth labels⁶, we obtained an average clustering accuracy of 61% in terms of NMI. On the other hand, the StreamKM++ algorithm took about 83 milliseconds per tweet, and achieved an NMI value of 40%, and the sKKM algorithm took about 2 seconds per tweet, and achieved an NMI value of 53%. Figures 4.10 and 4.11 show some sample tweets from the ASP.NET and HTML clusters, respectively. We observed that, by giving equal importance to the timestamp of the tweet and the words in the tweet, we obtain clusters containing tweets that have both temporal proximity and vocabulary similarity. Retweets are always assigned to the same cluster as the original tweet. For example, the tweets about sticky headers are assigned to the HTML cluster, as seen in Figure 4.11.

⁶Although hashtags are prone to error, they are the best indicators of the topic of a tweet. They have been used as topic labels in many other studies including [75, 150].

More recent tweets, rather than old tweets, are stored in memory. Figure 4.12(a) shows the trends of the top five clusters over the month. This coincides well with the true trend of the top topics shown in Figure 4.12(b). We found that the order of popularity of the topic clusters is ASP.NET, HTML, SQL, JavaScript, Perl, C++, Postgresql, Python, GO, PHP, Swift, Scala, Java, Ruby, C#, XML, Erlang, Julia, Objective C and VBScript; while the true order of topic popularity is ASP.NET, HTML, Python, JavaScript, Perl, Java, PHP, Ruby, SQL, C++, Swift, C#, Scala, Postgresql, XML, Erlang, Julia, GO, Objective C, and VBScript.

[Figure 4.12: Trending clusters in Twitter. The horizontal axis represents the timeline in days and the vertical axis represents the percentage ratio of the number of tweets in the cluster to the total number of tweets obtained on the day. Figure (a) shows the trends obtained by the proposed approximate stream kernel $k$-means algorithm, and Figure (b) shows the true trends of the topics.]

4.7 Summary

In this chapter, we have proposed an efficient and effective real-time kernel-based stream clustering algorithm, called approximate stream kernel $k$-means. Experimental results show that the proposed algorithm offers a good trade-off between clustering efficiency and clustering quality. Further, unlike some state-of-the-art kernel-based stream clustering algorithms, the proposed algorithm can control the decay and birth of clusters, thereby dynamically controlling the final number of clusters. The key to the efficiency of the proposed algorithm is the sampling of the streaming data based on their importance, defined in terms of the statistical leverage scores. This allows us to maintain the long-term history of the streaming data and also limit the memory required to store the data. We cater to the drift in the data distribution by placing thresholds on the life of a cluster. We demonstrated empirically that the proposed algorithm can cluster fast streams such as the Twitter stream with limited memory, and achieve higher clustering accuracy than the current stream clustering algorithms.
Chapter 5

Kernel-Based Clustering for Large Number of Clusters

5.1 Introduction

Document and image data sets, containing millions of high-dimensional points, usually belong to a large number of clusters. Finding clusters in such data sets is computationally expensive using kernel-based clustering techniques. Our aim is to speed up kernel-based clustering for data sets with a large number of clusters. In this chapter, we present a variant of the online kernel clustering algorithm discussed in Chapter 4, called the sparse kernel k-means algorithm, which can efficiently cluster large data sets into thousands of clusters, with significantly lower processing and memory requirements, and high clustering accuracy [38, 39].

Approximate kernel clustering algorithms such as approximate spectral clustering [67, 157] and approximate kernel $k$-means (from Chapter 2) reduce the running time of kernel clustering by uniformly sampling an $m$-sized subset of the data, and constructing a low-rank approximate kernel matrix using the sampled data. These approaches reduce the running time complexity of kernel clustering to $O(nmd + nmC)$. Note that the running time increases proportionately with the number of clusters (see Table 5.1). As demonstrated in Chapters 2 and 3, these algorithms take very long to cluster the data set when the number of clusters is in the order of thousands. In addition, the number of samples $m$ required to obtain a good approximation depends on the rank of the kernel matrix, which is in turn dependent on the number of clusters in the data [74]. Clustering data sets with a large number of clusters using these algorithms requires sampling $O(n)$ data points to sufficiently represent all the clusters. This renders the approximate kernel clustering algorithms also non-scalable.

Table 5.1: Complexity of popular partitional clustering algorithms: $n$ and $d$ represent the size and dimensionality of the data respectively, and $C$ represents the number of clusters. Parameter $m > C$ represents the size of the sampled subset for the sampling-based approximate clustering algorithms. $n_{sv}$ represents the number of support vectors. DBSCAN and Canopy algorithms are dependent on user-defined intra-cluster and inter-cluster distance thresholds, so their complexity is not directly dependent on $C$.

Clustering algorithm                       Complexity
k-means [87]                               O(nCd)
DBSCAN [61]                                O(n log(n) d)
Canopy [126]                               O(nCd)
Kernel k-means [72]                        O(n²d + n²C)
Spectral clustering [118]                  O(n²d + n³ + nC²)
Support vector clustering [19]             O(n²d n_sv)
Approximate spectral clustering [67]       O(nmd + nmC)
Approximate kernel k-means [40]            O(nmd + nmC)

The proposed sparse kernel $k$-means algorithm reduces the running time and memory complexity of kernel clustering using two key ideas: (i) kernel approximation using incremental importance sampling, and (ii) kernel sparsity. Importance sampling involves selecting data points based on their novelty, measured in terms of statistical leverage scores [34]. Fewer samples ($m = \Omega(C \log C)$) are required to construct a good kernel approximation using importance sampling than uniform sampling. However, finding the statistical leverage scores for the entire data involves computing the eigenvectors of the full $n \times n$ kernel matrix, which is computationally expensive [56]. We design an efficient online method to sample the data based on their importance, thereby reducing the time required for sampling. We also reduce the complexity of kernel computation and clustering by using sparsification.
We compute the $p$-nearest neighbor graph (where $p$ is a user-defined parameter) for the sampled points and use this sparse kernel matrix to obtain the cluster centers. Clustering is performed efficiently by first projecting the data into a subspace spanned by the top eigenvectors of the sparse kernel matrix, and then clustering the projected points using a modified $k$-means algorithm, which uses randomized $kd$-trees [132] to find the nearest cluster center for each data point.

The runtime complexity of the proposed algorithm is linear in $n$ and $d$, and logarithmic in $C$. We show that only a small subset of the data needs to be sampled, thereby reducing the memory requirements. We demonstrate empirically, using several benchmark data sets, that the proposed clustering algorithm is scalable to data sets containing millions of high-dimensional data points, and thousands of clusters.

5.2 Background

Importance sampling: As discussed in Chapter 4, the principle behind importance sampling is to select a subset of the data that is most informative. Let the kernel matrix $K$ be decomposed as $K \simeq V_C \Sigma_C V_C^\top$, where $\Sigma_C = \mathrm{diag}(\lambda_1, \ldots, \lambda_C)$ contains the highest $C$ eigenvalues of $K$ and $V_C = (v_1, \ldots, v_C)$ contains the corresponding eigenvectors. A data point $x_i$ is sampled with probability $p_i$, defined as

$$p_i = \frac{1}{C}\,\left\|V_C^{(i)}\right\|_2^2, \qquad (5.1)$$

where $V_C^{(i)}$ is the $i$-th row of $V_C$. The term $\|V_C^{(i)}\|_2^2$, called the statistical leverage score for data point $x_i$, is an indicator of the importance of the point (a short sketch of this computation is given below). A high score indicates that the corresponding data point has a high influence in the approximation of the kernel matrix. We showed in Lemma 8 that importance sampling reduces the dependency of the number of samples required on the number of clusters significantly, when compared to uniform sampling.

[Figure 5.1: Illustration of kernel sparsity on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles. Figure (a) shows all the data points (represented by "o") and Figure (b) shows the RBF kernel matrix corresponding to this data. Neighboring points have the same cluster label when the kernel is defined correctly for the data set.]

Kernel sparsity: Another key component of the proposed algorithm is kernel sparsity. The proposed algorithm uses the $p$-nearest neighbors ($p > C$) of each point to construct a sparse kernel matrix. The intuition behind this is the fact that each data point is surrounded by points belonging to the same cluster in the high-dimensional feature space, provided the kernel function is appropriately selected. Figure 5.1 illustrates this concept on the two-dimensional concentric circles data set. The RBF kernel matrix corresponding to this data is shown in Figure 5.1(b). Nearby data points in terms of the kernel similarity tend to have the same cluster label. This idea has been previously applied in several supervised local learning approaches [27]. The local learning-based clustering algorithm [188] and the local spectral clustering algorithm [118] also use the nearest neighbor graphs to obtain the cluster labels for the data. However, these methods require the computation of the full $n \times n$ similarity matrices, rendering them non-scalable.

Finding the nearest neighbors of a data point from amongst $s$ points would require the computation of $O(s)$ similarities. Popular approximate nearest neighbor algorithms adopt one of the following two approaches to find the nearest neighbors efficiently [3]:

- Use hashing techniques such as locality sensitive hashing, which use hash functions to place similar objects in the same bin [100, 198].
- Use data structures like $kd$-trees (also denoted as k-d trees) [131] and its variants like R-trees, R*-trees and metric trees [154] to organize the data according to their similarity and enable efficient querying.
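As a minimal sketch of how the sampling probabilities in (5.1) can be computed for an in-memory kernel matrix, consider the following Python fragment. The use of scipy's `eigsh` is our own assumption for illustration; the thesis's implementation uses MATLAB's `svds` for the same purpose.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def leverage_score_probabilities(K, C):
    """Importance-sampling probabilities from Eq. (5.1).

    K: symmetric (sparse or dense) m x m kernel matrix; C: number of clusters.
    Returns a length-m vector of probabilities p_i = ||V_C^(i)||^2 / C.
    """
    # Top-C eigenpairs of K; 'LA' selects the largest algebraic eigenvalues.
    vals, vecs = eigsh(K, k=C, which='LA')      # vecs: m x C, orthonormal columns
    scores = np.sum(vecs ** 2, axis=1)          # statistical leverage score of row i
    # The scores sum to C (orthonormal columns), so the p_i sum to 1.
    return scores / C
```

Because the leverage scores of the top-$C$ eigenvector matrix always sum to $C$, dividing by $C$ yields a proper probability distribution over the points.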
The randomized $kd$-trees [132] technique for approximate nearest neighbor computation involves constructing multiple $kd$-trees and searching them in parallel. While a classical $kd$-tree is built by splitting the data along the dimensions with the highest variance [131], each randomized $kd$-tree splits the data along a dimension chosen randomly from the top $n_d$ dimensions with the highest variance. A priority queue with information about the distance of each branch to the decision boundary is used to index into the multiple trees. It takes $O(s \log s)$ time to build the trees, and $O(\log s)$ time for each query. Therefore, the time taken for nearest neighbor computation is significantly reduced when a large number of queries need to be performed on the same data set. We employ randomized $kd$-trees in the proposed algorithm to first find the nearest neighbors and build the sparse kernel matrix, and then to find the closest center for each data point during clustering.

The proposed algorithm offers the following advantages over the existing techniques to reduce the running time of kernel-based clustering [40, 42, 67, 157, 188]:

(i) It employs importance sampling, so fewer samples are required to approximate the kernel matrix, when compared to the approximation methods in [40, 67, 157], which employ uniform random sampling.

(ii) Existing approximate kernel clustering algorithms [40, 67, 157] need to perform $O(nm)$ kernel similarity computations, where $m$ is the number of samples. The number of kernel similarity computations performed by the proposed algorithm is $O(np)$, where the number of neighbors $p \ll m$. This also reduces the time and memory required for clustering, compared to the other approximate clustering algorithms.

(iii) The clustering quality is better when compared to the existing approximate kernel clustering methods, even with a relatively small number of samples, because the most informative samples are used to perform clustering.

(iv) It does not require the computation of the full kernel matrix, unlike the local clustering methods in [118] and [188].

(v) It is online in nature, i.e., the data is clustered in batches of a user-defined size $B$, so it can cluster very large data sets (including data streams).

5.3 Sparse Kernel k-means

The proposed sparse kernel $k$-means clustering algorithm is described in Algorithm 9. The algorithm starts with the first $m$ data points stored in a buffer $S$ of a fixed maximum size $M$ ($C < m \le M$), and computes the sparse kernel matrix $K_0 = [\kappa_0(x_i, x_j)]$ from their $p$-nearest neighbors¹, where

$$\kappa_0(x_i, x_j) = \begin{cases} \kappa(x_i, x_j) & \text{if } x_i \in \mathcal{N}(x_j) \text{ and } x_j \in \mathcal{N}(x_i), \\ 0 & \text{otherwise} \end{cases} \qquad (5.2)$$

(a code sketch of this construction appears later in this section). We assume that nearby points in the Hilbert space belong to the same cluster. The kernel function should be appropriately defined for this assumption to be valid. Several articles in the literature describe techniques to learn the kernel function from the data [112, 177, 200].

The remaining data is clustered in batches $\{D_1, D_2, \ldots\}$ of size $B$, where $D_t = \{x_{t_1}, \ldots, x_{t_B}\}$. Let $K_0 = V_C \Sigma_C V_C^\top$, where $\Sigma_C = \mathrm{diag}(\lambda_1, \ldots, \lambda_C)$ contains the top $C$ eigenvalues of $K_0$ and

¹The nearest neighbors are found efficiently using randomized $kd$-trees. We use the kernel function $\kappa(\cdot,\cdot)$ to define the inter-point distance function.

Algorithm 9: Sparse Kernel k-means
1: Input:
   $D = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$: the set of $n$ $d$-dimensional data points to be clustered
   $\kappa(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$: the kernel function
   $C$: the number of clusters
   $m$: minimum number of points to be sampled ($m > C$)
   $p$: number of neighbors for calculating the sparse kernel matrix ($p < m$)
   $M$: maximum buffer size
   $B$: batch size
2: Output: cluster labels for the data points
3: Initialize the buffer $S$ with the first $m$ data points.
4: Find the $p$-nearest neighbors of each point in $S$.
5: Compute the sparse kernel matrix $K_0$ using (5.2).
6: Compute the top $C$ eigenvalues $\Sigma_C$ and eigenvectors $V_C$ of $K_0$, whose rank is greater than $C$.
7: Cluster the data points in $S$ by executing approximate $k$-means (Algorithm 10) on $V_C \Sigma_C^{1/2}$ to obtain their cluster labels.
8: for $t = 1, 2, \ldots, n/B$ do
9:   for $i = 1, 2, \ldots, B$ do
10:     Calculate the probability $p_{t_i}$ using (5.1).
11:     Set $S = S \cup \{x_{t_i}\}$ with probability $p_{t_i}$.
12:     If $x_{t_i}$ was added to $S$ in Step 11, update the eigenvalues $\Sigma_C$ and eigenvectors $V_C$ using (5.8), and recluster the points in $S$ by executing the approximate $k$-means algorithm (Algorithm 10) on $V_C \Sigma_C^{1/2}$; otherwise assign $x_{t_i}$ to cluster $k^*$, where $k^* = \arg\min_{k \in [C]} \|c_k(\cdot) - g_t(\cdot)\|_{\mathcal{H}}^2$, $c_k(\cdot)$ is given by (5.6), and $g_t(\cdot)$ is the projection of $\kappa(x_{t_i}, \cdot)$ into the subspace spanned by the eigenvectors $V_C$.
13:     If $\mathrm{card}(S) > M$, find index $q = \arg\min_l \|V_C^{(l)}\|_2^2$ and remove data point $x_q$ from $S$.
14:   end for
15: end for

$V_C = (v_1, \ldots, v_C)$ contains the corresponding eigenvectors. The matrices $V_C$ and $\Sigma_C$ are updated using each point $x_{t_i}$ from $D_t$, and the kernel matrix is updated as

$$K_t = \begin{cases} \begin{bmatrix} K_{t-1} & \varphi \\ \varphi^\top & \kappa(x_{t_i}, x_{t_i}) \end{bmatrix} & \text{with probability } p_{t_i}, \\[2mm] K_{t-1} & \text{with probability } 1 - p_{t_i}, \end{cases} \qquad (5.3)$$

where $\varphi$ is a sparse vector defined by $\varphi = [\kappa(x_{t_i}, x_s)]^\top$, $x_s \in \mathcal{N}(x_{t_i}) \cap S$, and $p_{t_i}$ is the importance sampling probability defined in (5.1). Data point $x_{t_i}$ is added to $S$ with probability $p_{t_i}$. The cluster labels for the points in $S$ can be obtained by solving the kernel $k$-means problem

$$\max_{U \in \mathcal{P}} \mathrm{tr}\left(\widetilde{U} K_t \widetilde{U}^\top\right), \qquad (5.4)$$

where $U = (u_1, \ldots, u_C)^\top$ is the cluster membership matrix, $\widetilde{U} = [\mathrm{diag}(U\mathbf{1})]^{-1/2} U$, the domain $\mathcal{P} = \{U \in \{0,1\}^{C \times s} : U^\top \mathbf{1} = \mathbf{1}\}$, $s = \mathrm{card}(S)$, and $\mathbf{1}$ is a vector of all ones. The cluster labels for the unsampled points can be obtained by assigning them to the closest center. The running time complexity of this step is $O(s^2)$. We further reduce this complexity by constraining the cluster centers to a smaller subspace, spanning the top eigenvectors of the kernel matrix $K_t$, along the lines of spectral clustering². We pose the clustering problem as the following optimization problem:

$$\min_{U \in \mathcal{P}} \max_{\{c_k(\cdot) \in \mathcal{H}_a\}_{k=1}^{C}} \sum_{k=1}^{C} \sum_{i=1}^{s} \frac{U_{k,i}}{s} \left\| c_k(\cdot) - \kappa(x_i, \cdot) \right\|_{\mathcal{H}}^2, \qquad (5.5)$$

where $\mathcal{H}_a = \mathrm{span}(v_1, \ldots, v_C)$. The cluster centers can be expressed as linear combinations of the eigenvectors of the kernel matrix:

$$c_k(\cdot) = \sum_{i=1}^{s} \sum_{j=1}^{C} \frac{U_{k,i}}{n_k} \sqrt{\lambda_j}\, v_{i,j} = \frac{u_k}{n_k} V_C \Sigma_C^{1/2}, \qquad k \in [C], \qquad (5.6)$$

where $n_k$ is the number of points in the $k$-th cluster, and $u_k = (U_{k,1}, U_{k,2}, \ldots, U_{k,s})^\top$. By substituting (5.6) in (5.5), we obtain the following trace maximization problem:

$$\max_{U \in \mathcal{P}} \mathrm{tr}\left(\widetilde{U} V_C \Sigma_C V_C^\top \widetilde{U}^\top\right). \qquad (5.7)$$

²Note that the eigenvalues and eigenvectors were computed while finding the sampling probabilities (5.1), hence the eigenvectors do not need to be re-computed for clustering.

The above problem can be solved by executing $k$-means on the matrix $V_C \Sigma_C^{1/2}$. The complexity of running $k$-means on $V_C \Sigma_C^{1/2}$ would be $O(sC^2)$, which can again be computationally expensive for large $C$.

We alleviate this issue by employing an approximate variant of the $k$-means algorithm (Algorithm 10), similar to the filtering algorithm in [91]. The most computationally expensive step in the $k$-means algorithm is computing the closest center for each data point, which requires $O(sC)$ distance computations. We reduce the number of distance computations by using randomized $kd$-trees to find the closest cluster centers.

The proposed sparse kernel $k$-means algorithm is dependent on three parameters: the initial sample size $m$, the maximum buffer size $M$, and the number of neighbors $p$ used to build the sparse kernel matrix. The parameters $m$ and $M$ should be set such that the initial and final sample sets contain representatives from all the clusters. The parameter $p$ should be set large enough to ensure that the kernel matrix remains positive semi-definite and its rank is greater than the number of clusters $C$. Heuristics to set these parameters are discussed further in Section 5.5.
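As an illustration of the sparse kernel construction in (5.2) and of the role of the neighborhood parameter $p$, the following sketch builds the mutual $p$-nearest-neighbor RBF kernel for an in-memory sample. Using scikit-learn's exact `NearestNeighbors` in place of the FLANN randomized $kd$-trees is an assumption made here for brevity; the thesis's implementation performs approximate neighbor search.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def sparse_rbf_kernel(X, p, gamma=1.0):
    """Mutual p-nearest-neighbor sparse RBF kernel, in the spirit of Eq. (5.2).

    X: (m, d) array of sampled points. Returns an m x m sparse matrix whose
    entry (i, j) is exp(-gamma * ||x_i - x_j||^2) only if i and j are
    mutual p-nearest neighbors, and 0 otherwise.
    """
    m = X.shape[0]
    nn = NearestNeighbors(n_neighbors=p + 1).fit(X)  # +1: each point is its own neighbor
    dist, idx = nn.kneighbors(X)
    rows = np.repeat(np.arange(m), p)
    cols = idx[:, 1:].ravel()                        # drop the self-neighbor column
    vals = np.exp(-gamma * dist[:, 1:].ravel() ** 2) # RBF on the Euclidean distance
    G = csr_matrix((vals, (rows, cols)), shape=(m, m))
    # Element-wise minimum with the transpose keeps (i, j) only when both
    # directions exist (values are positive), i.e., the neighbors are mutual.
    return G.minimum(G.T)
```

The returned matrix is symmetric by construction, which matches the mutual-neighbor condition in (5.2).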
Algorithm 10: Approximate k-means
1: Input:
   $D = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$: the set of $n$ $d$-dimensional data points to be clustered
   $C$: the number of clusters
   MAXITER: maximum number of iterations
2: Output: cluster labels for the data points
3: Randomly initialize the cluster labels $\{l_1, l_2, \ldots, l_n\}$, $l_i \in [C]$.
4: Compute the cluster centers $c_k = \frac{1}{n_k}\sum_{l_i = k} x_i$, $k \in [C]$.
5: Set $t = 0$.
6: repeat
7:   Set $t = t + 1$.
8:   Build randomized $kd$-tree index $I$ for the $C$ centers [132].
9:   for $i = 1, 2, \ldots, n$ do
10:     Find the approximate nearest center $c_{k^*}$ of data point $x_i$ using the index $I$.
11:     Set $l_i = k^*$.
12:   end for
13:   Recompute the centers $\{c_1, c_2, \ldots, c_C\}$.
14: until the labels do not change or $t > $ MAXITER

(A Python sketch of this procedure follows the complexity analysis below.)

5.4 Analysis

5.4.1 Computational Complexity

The most computationally intensive operations in the proposed algorithm are: (i) computing the $m \times m$ kernel matrix $K_0$ (Step 5), and finding its eigenvectors to obtain the leverage scores (Step 6), and (ii) updating the eigenvectors in each iteration, and clustering them using the approximate $k$-means algorithm (Step 12). In order to obtain the eigenvalues and eigenvectors of an $s \times s$ kernel matrix $K_t$ (where $s$ is the number of data points in the buffer $S$), we need to perform eigendecomposition of $K_t$. Naive implementations of eigendecomposition take $O(s^3)$ time. We can reduce the time for computing the eigenvectors by making two modifications to the algorithm:

(i) Use efficient algorithms such as Lanczos, subspace iteration, and trace minimization methods to decompose the $m \times m$ kernel matrix $K_0$ obtained from the first $m$ points [21]. This reduces the running time complexity of this step to $O(mp + m)$. In our implementation, we used the svds function in MATLAB to obtain the top $C$ eigenvalues and eigenvectors for the kernel matrix corresponding to the first $m$ data points.

(ii) Update the eigenvectors $V_C$ incrementally in each iteration of the algorithm using fast update mechanisms [31, 175], to reduce the time taken to process the points in each batch. Using the rank-1 update mechanism proposed in [31], we update the eigenvectors in $O(sp + p^3)$ time, where $s$ is the number of points in the buffer $S$. Given the eigendecomposition $K_t = V_C \Sigma_C V_C^\top$ and vector $\varphi \in \mathbb{R}^m$, this method finds the eigendecomposition of $K_t + \varphi\varphi^\top$ as

$$K_t + \varphi\varphi^\top = \begin{pmatrix} V & \frac{w}{\|w\|} \end{pmatrix} \Sigma' \begin{pmatrix} V & \frac{w}{\|w\|} \end{pmatrix}^\top, \qquad (5.8)$$

where $w = (I - VV^\top)\varphi$ is the component of $K_t$ that is orthogonal to $V$, and $\Sigma'$ contains the dominant eigenvalues of the sparse matrix

$$\begin{bmatrix} \Sigma_C & V^\top\varphi \\ \varphi^\top V & \|w\| \end{bmatrix}.$$

This method also eliminates the need to store the kernel matrix $K_t$ in memory. After the matrix $K_0$ and its eigenvectors are obtained, only the vector $\varphi$ in (5.3) is required to update $V_C$ and $\Sigma_C$.

The approximate $k$-means algorithm first builds multiple randomized $kd$-trees containing the $C$ cluster centers, and an index into these trees, which takes $O(C \log C)$ time. It then finds the approximate nearest neighbors for each data point in $S$ in $O(s \log C)$ time, with an $\epsilon$ approximation error. Therefore, the total time for clustering $s$ points using the approximate $k$-means algorithm is $O(Cl \log C + sl \log C)$, where $l$ is the number of iterations required for convergence. Clustering is performed every time a point is added to the sample set $S$ from the input batch of data points. In order to further reduce the running time, we can employ a lazy reclustering approach, by which we perform the clustering after every $T$ data point additions. Each unsampled data point can be assigned a cluster label by finding the closest center in $O(\log C)$ time.
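A compact Python sketch of Algorithm 10 is given below. Here scipy's exact `cKDTree` stands in for the randomized $kd$-tree index of [132], so the nearest-center queries are exact rather than approximate; the re-seeding of empty clusters is also our own illustrative choice.

```python
import numpy as np
from scipy.spatial import cKDTree

def approximate_kmeans(Y, C, max_iter=100, seed=0):
    """Sketch of Algorithm 10, run on the spectral embedding Y = V_C Sigma_C^{1/2}.

    Y: (n, C) array of embedded points; returns cluster labels in [0, C).
    """
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    labels = rng.integers(0, C, size=n)                 # Step 3: random initialization
    for _ in range(max_iter):
        # Steps 4/13: recompute centers; empty clusters get a random point.
        centers = np.vstack([Y[labels == k].mean(axis=0) if np.any(labels == k)
                             else Y[rng.integers(n)] for k in range(C)])
        tree = cKDTree(centers)                         # Step 8: index the C centers
        _, new_labels = tree.query(Y)                   # Steps 9-11: closest center
        if np.array_equal(new_labels, labels):          # Step 14: labels unchanged
            break
        labels = new_labels
    return labels
```

Building the tree over the $C$ centers and querying it for all points is what replaces the $O(sC)$ brute-force distance computations with roughly $O(s \log C)$ work per iteration.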
In summary, the overall running time complexity of the proposed sparse kernel $k$-means algorithm is $O(npd + mp + m + QCl\log C + QMl\log C + n\log C)$, where $Q$ is the total number of points sampled from the stream. We demonstrate in Section 5.5 that the number of points $Q$ is close to the initial sample size $m$. Therefore, the running time complexity can be simplified as $O(npd + mp + m + mCl\log C + mMl\log C + n\log C) \sim O(npd + n\log C)$, assuming $\max(mp, mCl, mMl) \ll n$. Therefore, the proposed algorithm has running time complexity linear in $n$, linear in $d$, and logarithmic in $C$. It is significantly faster than the kernel $k$-means algorithm and the approximate kernel clustering algorithms, which have $O(n^2 d + n^2 C)$ and $O(nmd + nmC)$ running time complexities, respectively. The amount of memory required is $O(mp + Md + MC)$, for storing the initial kernel matrix $K_0$, the data points in the buffer, and the eigenvectors of the kernel matrix.

5.4.2 Approximation Error

The proposed sparse kernel $k$-means algorithm essentially approximates the eigenvectors of the true $n \times n$ kernel matrix with the singular vectors of a sparse $n \times Q$ matrix, where $Q$ is the total number of points sampled from the data set using importance sampling. In this section, we first bound this approximation error (due to importance sampling and sparsification), and then bound the error incurred due to the approximation (5.5) for clustering.

Theorem 7. Let $K$ be the $n \times n$ kernel matrix and let $\bar{K}$ be the $n \times Q$ kernel matrix between the $n$ points in the data set and the $Q$ sampled points. Let $Z_C = (z_1, \ldots, z_C)$ represent the top $C$ singular vectors of $\bar{K}$, and $\delta \in (0,1)$ be the smallest probability such that $(\lambda_C - \lambda_{C+1}) > 3\Delta$, where

$$\Delta \le \frac{2\rho}{Q}\ln\frac{2}{\delta} + |K|_F\sqrt{\frac{2\ln(2/\delta)}{Qn}} \quad \text{and} \quad \rho^2 = \max_{1\le i\le Q}\sum_{j=1}^{n}\kappa^2(x_i, x_j).$$

Assuming $\rho = O(\sqrt{Q})$ and $\kappa(\cdot,\cdot) \le 1$,

$$\max_{1\le i\le C} |v_i - z_i|^2 \le \frac{9\Delta}{2(\lambda_C - \lambda_{C+1})}, \qquad (5.9)$$

with probability $1 - \delta$.

Proof. We will first establish a relationship between the singular vectors of the sparse kernel matrix that is constructed by the proposed algorithm and the $n \times n$ kernel matrix $K^{sp} = \left[K^{sp}_{i,j}\right]_{n \times n}$ defined as follows:

$$K^{sp}_{i,j} = \begin{cases} \kappa(x_i, x_j) & \text{if } x_i \in \mathcal{N}(x_j) \text{ and } x_j \in \mathcal{N}(x_i), \\ 0 & \text{otherwise,} \end{cases}$$

and then use the fact that $|K^{sp}|_F \le |K|_F$ to obtain the required result. Let $\widehat{Z} = (\widehat{z}_1, \ldots, \widehat{z}_n)$ represent the eigenvectors of $K^{sp}$, $X = (\widehat{z}_1, \ldots, \widehat{z}_C)$, and $Y = (\widehat{z}_{C+1}, \ldots, \widehat{z}_n)$. Let $L_n$ be a linear operator that maps any function $f(\cdot)$ to a function $L_n[f](\cdot) \in \mathcal{H}$ defined by

$$L_n[f](\cdot) = \frac{1}{n}\sum_{i=1}^{n} \kappa(x_i, \cdot) f(x_i). \qquad (5.10)$$

The eigenfunctions [187] of $L_n$, which form the basis of the space $\mathcal{H}$, are given by

$$\widehat{\varphi}_i(\cdot) = \frac{1}{\sqrt{\lambda_i}\, n}\sum_{j=1}^{n} \widehat{z}_{i,j}\, \kappa(x_j, \cdot). \qquad (5.11)$$

Similar to $L_n$, let $L_Q$ represent the linear operator based on the sampled examples, defined by

$$L_Q[f](\cdot) = \frac{1}{Q}\sum_{i=1}^{Q} \kappa(x_i, \cdot) f(x_i). \qquad (5.12)$$

We first prove a simpler result that establishes a relationship between the subspaces $X$ and $V_C$, in the following lemma:

Lemma 10. Let $\delta \in (0,1)$ be the smallest probability such that $(\lambda_C - \lambda_{C+1}) > 3\Delta$, where $\Delta$ is defined as

$$\Delta = \frac{2\rho}{Q}\ln\frac{2}{\delta} + |K^{sp}|_F\sqrt{\frac{2\ln(2/\delta)}{Qn}} \le \frac{2\rho}{Q}\ln\frac{2}{\delta} + |K|_F\sqrt{\frac{2\ln(2/\delta)}{Qn}}.$$

There exists a matrix $P \in \mathbb{R}^{(n-C)\times C}$ satisfying $\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1}}$, such that $V_C = (X + YP)(I + P^\top P)^{-1/2}$.

Proof. The proof is based on the following results (Lemmas 11 and 12) from [166] and [165], respectively:

Lemma 11.
Let $(\lambda_i, v_i)$, $i \in [n]$, be the eigenvalues and eigenvectors of a symmetric matrix $A \in \mathbb{R}^{n\times n}$ ranked in the descending order of eigenvalues. Set $X = (v_1, \ldots, v_C)$ and $Y = (v_{C+1}, \ldots, v_n)$. Given a symmetric perturbation matrix $E$, let

$$(X, Y)^\top E\, (X, Y) = \begin{pmatrix} E_{11} & E_{12} \\ E_{21} & E_{22} \end{pmatrix}.$$

Let $\|\cdot\|$ represent a consistent family of norms and set

$$\gamma = \|E_{21}\|, \qquad \beta = \lambda_C - \lambda_{C+1} - \|E_{11}\| - \|E_{22}\|.$$

If $\beta > 0$ and $\frac{\gamma}{\beta} < \frac{1}{2}$, then there exists a unique matrix $P \in \mathbb{R}^{(n-C)\times C}$ satisfying $\|P\| < \frac{2\gamma}{\beta}$, such that

$$X' = (X + YP)(I + P^\top P)^{-1/2}, \qquad Y' = (Y - XP^\top)(I + PP^\top)^{-1/2}$$

are the eigenvectors of $A + E$.

Lemma 12. Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable on $(Z, \varrho)$ with values in $\mathcal{H}$. Assume $\|\xi\| \le M < \infty$ almost surely. Denote $\sigma^2(\xi) = E(\|\xi\|^2)$. Let $\{z_i\}_{i=1}^{Q}$ be independent random draws of $\varrho$. For any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\left\| \frac{1}{Q}\sum_{i=1}^{Q}\left(\xi_i - E[\xi_i]\right) \right\| \le \frac{2M\ln(2/\delta)}{Q} + \sqrt{\frac{2\sigma^2(\xi)\ln(2/\delta)}{Q}}. \qquad (5.13)$$

Let $\Delta_C = \lambda_C - \lambda_{C+1}$. Define $A = [\langle \kappa(x_i,\cdot), L_n \kappa(x_j,\cdot)\rangle_{\mathcal{H}}]_{n\times n}$, $B = [\langle \kappa(x_i,\cdot), L_Q \kappa(x_j,\cdot)\rangle_{\mathcal{H}}]_{n\times n}$, and $E = B - A$. We have

$$\gamma = \|X^\top E Y\|_F, \qquad \beta = \Delta_C - \|X^\top E X\|_F - \|Y^\top E Y\|_F.$$

Using the relationship $\widehat{\varphi}_i = \sqrt{\frac{1}{\lambda_i n}}\sum_{k=1}^{n} \widehat{z}_{i,k}\, \kappa(x_k,\cdot)$, $i = 1, \ldots, n$, we have

$$[X^\top E Y]_{i,j} = \widehat{z}_i^\top E\, \widehat{z}_j = \sum_{a,b=1}^{n} \widehat{z}_{a,i}\, \widehat{z}_{b,j}\, \langle \kappa(x_a,\cdot), (L_n - L_Q)\kappa(x_b,\cdot)\rangle_{\mathcal{H}} = \sqrt{\lambda_i \lambda_j}\, \langle \widehat{\varphi}_i, (L_n - L_Q)\widehat{\varphi}_j\rangle_{\mathcal{H}} = \langle \widehat{\varphi}_i, L_n^{1/2}(L_n - L_Q)L_n^{1/2}\widehat{\varphi}_j\rangle_{\mathcal{H}}.$$

We have similar results for $X^\top E X$ and $Y^\top E Y$. Thus, we obtain $\gamma$ and $\beta$ as

$$\gamma = \sqrt{\sum_{i=1}^{C}\sum_{j=C+1}^{n} \langle \widehat{\varphi}_i, (L_n^2 - L_n^{1/2} L_Q L_n^{1/2})\widehat{\varphi}_j\rangle_{\mathcal{H}}^2} \;\le\; \|L_n^{1/2}(L_n - L_Q)L_n^{1/2}\|_F,$$

$$\beta = \Delta_C - \sqrt{\sum_{i,j=1}^{C} \langle \widehat{\varphi}_i, (L_n^2 - L_n^{1/2} L_Q L_n^{1/2})\widehat{\varphi}_i\rangle_{\mathcal{H}}^2} - \sqrt{\sum_{i,j=C+1}^{n} \langle \widehat{\varphi}_i, (L_n^2 - L_n^{1/2} L_Q L_n^{1/2})\widehat{\varphi}_i\rangle_{\mathcal{H}}^2} \;\ge\; \Delta_C - \|L_n^{1/2}(L_n - L_Q)L_n^{1/2}\|_F.$$

We substitute these bounds for $\gamma$ and $\beta$ into Lemma 11 to obtain

$$\|P\|_F \le \frac{2\,\|L_n^2 - L_n^{1/2} L_Q L_n^{1/2}\|_F}{\lambda_C - \lambda_{C+1} - \|L_n^2 - L_n^{1/2} L_Q L_n^{1/2}\|_F}.$$

We now bound $\|L_n^2 - L_n^{1/2} L_Q L_n^{1/2}\|_F$ using Lemma 12. Let $\Xi_i[f](\cdot) = \kappa(x_i,\cdot) f(x_i)$ and $\xi_i = L_n^{1/2}\,\Xi_i\, L_n^{1/2}$. We define $M$ and $\sigma^2$ as $M = \max_{1\le i\le n} \|\xi_i\|_F$ and $\sigma^2 = E_i[\|\xi_i\|_F^2]$. We have $M \le \|L_n\|_2\,\|\Xi_i\|_F = \rho$ and

$$\sigma^2 = E\left[\sum_{k=1}^{n} \langle \widehat{\varphi}_k, \xi_i^2\, \widehat{\varphi}_k\rangle_{\mathcal{H}}\right] = E\left[\sum_{k=1}^{n} \langle \widehat{\varphi}_k, L_n^{1/2}\Xi_i L_n \Xi_i L_n^{1/2}\widehat{\varphi}_k\rangle_{\mathcal{H}}\right] \le E\left[\langle \kappa(x_i,\cdot), L_n\kappa(x_i,\cdot)\rangle_{\mathcal{H}} \sum_{k=1}^{n} \langle \widehat{\varphi}_k, L_n^{1/2}\Xi_i L_n^{1/2}\widehat{\varphi}_k\rangle_{\mathcal{H}}\right] \le \frac{\rho^2}{n}\, E\left[\sum_{k=1}^{n} \langle \widehat{\varphi}_k, L_n^{1/2}\Xi_i L_n^{1/2}\widehat{\varphi}_k\rangle_{\mathcal{H}}\right] \le \frac{\rho^2}{n}\,\|L_n\|_F^2 \le \frac{\rho^2\, |K^{sp}|_F^2}{n^2} \le \frac{\rho^2\, |K|_F^2}{n^2}.$$

We complete the proof by substituting the bounds for $M$ and $\sigma^2$ into Lemma 12.

Now we prove Theorem 7 using the result of Lemma 10. We have

$$\max_{1\le i\le C} |v_i - z_i|^2 = \|V_C - X\|_2 \le \|YP(I + P^\top P)^{-1/2}\|_2 + \|(I - (I + P^\top P)^{-1/2})X\|_2 \le \|Y\|_2 \|P\|_2 + \|I - (I + P^\top P)^{-1/2}\|_2 \le \|P\|_F + 1 - \frac{1}{\sqrt{1 + \|P\|_F^2}} \le \|P\|_F + 1 - \sqrt{1 - \|P\|_F^2} \le \|P\|_F + \|P\|_F^2 \le \frac{3}{2}\|P\|_F.$$

We obtain the required result using the fact that

$$\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1} - \Delta} \le \frac{3\Delta}{\lambda_C - \lambda_{C+1}}.$$

We complete the proof by using the fact that $\kappa(\cdot,\cdot) \le 1$ to obtain the relation $\Delta = O\left(\frac{1}{\sqrt{Q}} + \frac{1}{\sqrt{n}}\right)$ when $\rho = O(\sqrt{Q})$.

In the following lemma, we show that the error incurred due to the approximation (5.5) is well-bounded, provided that the tail of the eigenspectrum is fast decaying, which is true for most real data sets [45]:

Lemma 13. Let $E$ and $E_a$ represent the optimal clustering errors in (5.4) and (5.7), respectively. We have

$$|E - E_a| \le \sum_{i=C+1}^{s} \lambda_i.$$

Proof.
Let $\{c^*_k(\cdot)\}_{k=1}^{C}$ and $U^*$ be the optimal solution to (5.4). Let $c^a_k(\cdot)$ represent the projection of $c^*_k$ into the subspace $\mathcal{H}_a$. For any $\kappa(x_i,\cdot)$, let $g_i(\cdot)$ and $h_i(\cdot)$ be the projections of $\kappa(x_i,\cdot)$ into the subspace $\mathcal{H}_a$ and $\mathrm{span}(v_{C+1}, \ldots, v_s)$, respectively. We have

$$E_a = \min_{U\in\mathcal{P}} \max_{c_k(\cdot)\in\mathcal{H}_a} \sum_{k=1}^{C}\sum_{i=1}^{s} \frac{U_{k,i}}{s}\,\|c_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2 \le \sum_{k=1}^{C}\sum_{i=1}^{s} \frac{U^*_{k,i}}{s}\,\|c^a_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2 \le \sum_{k=1}^{C}\sum_{i=1}^{s} \frac{U^*_{k,i}}{s}\left(\|c^a_k(\cdot) - g_i(\cdot)\|_{\mathcal{H}}^2 + \|h_i(\cdot)\|_{\mathcal{H}}^2\right) \le E + \frac{1}{s}\sum_{i=1}^{s}\|h_i(\cdot)\|_{\mathcal{H}}^2 \le E + \sum_{i=C+1}^{s}\lambda_i.$$

5.5 Experimental Results

5.5.1 Data sets

We demonstrate the effectiveness of the proposed sparse kernel $k$-means algorithm using the CIFAR-100, Imagenet-164, Youtube and Tiny data sets.

5.5.2 Baselines and Parameters

We compared the performance of the proposed algorithm with the kernel $k$-means [72] algorithm on the CIFAR-100 data set. It is infeasible to execute the kernel $k$-means algorithm on the other three data sets. We also evaluated its performance against the $k$-means algorithm. We show that although the proposed algorithm has a higher running time than $k$-means, it yields better clustering accuracy. Finally, we compared our algorithm with the approximate kernel $k$-means algorithm from Chapter 2, where the data is sampled with uniform probability, and a low-rank approximate kernel is constructed using the sampled data. We show that importance sampling and kernel sparsity play a significant role in reducing the time and memory requirements.

We used the universal RBF kernel for the proposed algorithm and the kernel-based baseline algorithms (kernel $k$-means and approximate kernel $k$-means) on the CIFAR-100, Tiny and Imagenet-164 data sets. For the Youtube data set, which contains both text and image features, we used a combination of the cosine similarity and the RBF kernel, defined as

$$\kappa(x_a, x_b) = \frac{1}{2}\left(\exp\left(-\|g_a - g_b\|^2\right) + \frac{f_a^\top f_b}{\|f_a\|\,\|f_b\|}\right),$$

where $f_a$ and $g_a$ denote the tf-idf and GIST features for data point $x_a$, respectively. We tuned the kernel width for the RBF kernel using grid search in the range $[0, 1]$ to obtain the best performance.

We varied the initial sample set size from $m = 5{,}000$ to $m = 20{,}000$, and the number of neighbors from $p = 1{,}000$ to $m$ in multiples of 5,000. The maximum sample set size was set to $M = 50{,}000$. The number of clusters $C$ was set equal to the true number of classes in the data set for the CIFAR-100 and Imagenet-164 data sets. The true number of classes is unknown for the Youtube and Tiny data sets, so we set the number of clusters equal to 10,000. The batch size $B$ was set equal to the initial sample size $m$.

We implemented all the algorithms in MATLAB, and executed them 10 times each on a 2.8 GHz processor. The memory used was constrained to 60 GB. We present the results (mean and variance) over the 10 runs. Different permutations of the data set were input to the clustering algorithms in each run. We used the randomized $kd$-trees implementation in the FLANN library [132] to find the approximate nearest neighbors in the proposed algorithm. The distance function used by the library was defined as the inverse of the kernel similarity function. The randomized $kd$-tree parameters were set as follows: the number of dimensions $n_d$ to 5, the number of trees to 8, and the approximation error to $\epsilon = 1\mathrm{e}{-16}$.
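For concreteness, a direct transcription of the combined text-and-image kernel above might look as follows. The explicit `width` parameter is our own addition to reflect the grid-searched RBF kernel width mentioned above, and the variable names are illustrative.

```python
import numpy as np

def combined_kernel(g_a, f_a, g_b, f_b, width=1.0):
    """Average of an RBF kernel on GIST features (g_*) and the cosine
    similarity of tf-idf features (f_*), as in the equation above."""
    rbf = np.exp(-np.linalg.norm(g_a - g_b) ** 2 / width)   # image-feature term
    cosine = f_a.dot(f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b))
    return 0.5 * (rbf + cosine)                             # equal weighting
```

Averaging the two terms keeps the combined similarity bounded when both base kernels are bounded, which is convenient for the sparsified kernel construction used in this chapter.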
Table 5.2: Running time (in seconds) of the proposed sparse kernel $k$-means and the three baseline algorithms on the four data sets. The parameters of the proposed algorithm were set to $m = 20{,}000$, $M = 50{,}000$, and $p = 1{,}000$. The sample size $m$ for the approximate kernel $k$-means algorithm was set equal to 20,000 for the CIFAR-100 data set and 10,000 for the remaining data sets. It is not feasible to execute kernel $k$-means on the Imagenet-164, Youtube and Tiny data sets due to their large size. The approximate running time of kernel $k$-means on these data sets is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Data set       Sparse kernel k-means (proposed)   Approx. kernel k-means   k-means            Kernel k-means
CIFAR-100      49,887 (±93)                       11,394 (±600)            1,507 (±332)       117,513 (±211)
Imagenet-164   74,794 (±870)                      16,023 (±3,577)          240,722 (±5,351)   182,311 (±14,916)
Youtube        217,533 (±1,264)                   57,096 (±2,196)          145,039 (±1,436)   679,061 (±2,284)
Tiny           343,560 (±2,528)                   371,004 (±1,588)         359,291 (±7,045)   704,656 (±8,482)

5.5.3 Results

5.5.3.1 Running time

Table 5.2 compares the running time of our algorithm with the approximate kernel $k$-means, kernel $k$-means and $k$-means algorithms, when the parameters $m$ and $p$ are equal to 20,000 and 1,000, respectively. On the CIFAR-100 data set, the proposed algorithm takes longer than the $k$-means algorithm, as expected, because of the additional time required for kernel computation and eigensystem calculation. It also takes longer than the approximate kernel $k$-means algorithm, as it performs importance sampling by calculating and updating the eigenvectors of the sparse kernel matrix. On the other hand, the approximate kernel $k$-means algorithm selects the subset of the data using uniform random sampling, and computes the cluster centers using the low-rank matrix constructed from this subset. The proposed algorithm, the approximate kernel $k$-means, and the $k$-means algorithms are significantly faster than the kernel $k$-means algorithm. The proposed algorithm spends more time in updating the eigenvectors and finding the leverage scores than clustering the eigenvectors to obtain the cluster labels. Similar performance is observed on the Imagenet-164, Youtube and Tiny data sets. The proposed algorithm is also faster than $k$-means on the Imagenet-164 data set, because $k$-means takes longer to converge. It is infeasible to compute the full kernel matrix for the Imagenet-164, Youtube and Tiny data sets, so we were unable to execute kernel $k$-means on them. For these data sets, we executed kernel $k$-means on a 50,000-sized randomly selected subset of the data, and assigned the remaining points to the closest cluster centers. The proposed algorithm is also faster than this implementation of kernel $k$-means, because it takes a long time to find the distance between the data points and the cluster centers, and assign labels. The proposed algorithm is also more accurate than this kernel $k$-means implementation on the Imagenet-164 data set.

[Figure 5.2: Sample images from three of the 100 clusters in the CIFAR-100 data set obtained using the proposed algorithm.]
5.5.3.2 Cluster quality

Figure 5.2 shows examples of clusters obtained, using the sparse kernel $k$-means algorithm, from the CIFAR-100 data set. We assigned a class label to each cluster, based on the true class of the majority of the objects in the cluster. Table 5.3 records the silhouette coefficient values of the partitions of the CIFAR-100 data set. The sparse kernel $k$-means algorithm achieves values closer to that of the kernel $k$-means algorithm. The approximate kernel $k$-means and $k$-means algorithms are unable to achieve similar silhouette values.

Table 5.3: Silhouette coefficient ($\times 10^{-2}$) of the proposed sparse kernel $k$-means and the three baseline algorithms on the CIFAR-100 data set. The parameters of the proposed algorithm were set to $m = 20{,}000$, $M = 50{,}000$, and $p = 1{,}000$. The sample size $m$ for the approximate kernel $k$-means algorithm was set equal to 20,000.

Sparse kernel k-means (proposed)   Approx. kernel k-means   k-means        Kernel k-means
11.36 (±0.07)                      2.33 (±0.02)             3.02 (±0.01)   30.18 (±0.13)

We analyze the prediction accuracy, in terms of NMI, of the proposed sparse kernel $k$-means using the CIFAR-100 and Imagenet-164 data sets. As the true class labels for the Youtube and Tiny data sets are not available, we were unable to find the NMI for these data sets. Figure 5.3 shows the NMI values with respect to the true class labels, for each of the algorithms on the CIFAR-100 and Imagenet-164 data sets. In Figure 5.3(a), it is observed that the NMI achieved by our algorithm is close to that of the kernel $k$-means algorithm. The proposed algorithm outperforms both $k$-means and approximate kernel $k$-means on both the CIFAR-100 and Imagenet-164 data sets, due to the fact that it samples the most informative points from the data set.

[Figure 5.3: NMI (in %) of the proposed sparse kernel $k$-means and the three baseline algorithms on the (a) CIFAR-100 and (b) Imagenet-164 data sets. Parameter settings as in Table 5.2. It is not feasible to execute kernel $k$-means on the Imagenet-164 data set due to its large size; its approximate NMI value is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.]

5.5.3.3 Parameter sensitivity

Our sparse kernel $k$-means algorithm relies on three parameters: the initial sample set size $m$, the maximum size of the sample set $M$, and the size of the neighborhood $p$. We evaluated the effect of each of these parameters on the performance of the proposed algorithm, using the CIFAR-100 and Imagenet-164 data sets.

Initial sample: The initial sample used to construct the kernel $K_0$ and obtain the initial cluster labels plays a crucial role in the performance of our algorithm, as shown in Table 5.4, Table 5.5 and Figure 5.4. They compare the performance of the proposed algorithm and the approximate kernel $k$-means algorithm with increasing $m$ value. As expected, the running time of both the algorithms increases as the initial sample size increases from $m = 5{,}000$ to $m = 20{,}000$. As $m$ increases, the size of the initial kernel $K_0$, and the time to compute and decompose it into its eigenvalues and eigenvectors, increase proportionately. The initial sample also determines the number of points sampled from the data set, as each input batch is processed. More data points were sampled and added to the buffer $S$ if the initial sample did not contain a sufficient number of representative points.
The time to cluster increases as more points are added to the buffer. The silhouette coefficient values on the CIFAR-100 data set decrease minimally when $m$ increases from 5,000 to 10,000, but remain constant for $m \ge 10{,}000$. On the other hand, there is minimal change in the silhouette values of the approximate kernel $k$-means algorithm for increasing $m$. The NMI values achieved by our algorithm increase considerably as the sample size $m$ increases, indicating that the initial sample is important to the clustering accuracy of the proposed algorithm. Even with just 5,000 data points in the initial sample, our algorithm is able to achieve 13% NMI. On the other hand, the approximate kernel $k$-means algorithm is unable to achieve the same with even 20,000 samples. The performance of the sparse kernel $k$-means algorithm is best when the sample size is set greater than $C \log C$, in accordance with Lemma 8.

Table 5.4: Comparison of the running time (in seconds) of the proposed sparse kernel $k$-means algorithm and the approximate kernel $k$-means algorithm on the CIFAR-100 and the Imagenet-164 data sets. Parameter $m$ represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel $k$-means algorithm. The remaining parameters of the proposed algorithm are set to $M = 50{,}000$ and $p = 1{,}000$. Approximate kernel $k$-means is infeasible for the Imagenet-164 data set when $m > 10{,}000$ due to its large size.

m        CIFAR-100: Sparse / Approx. kernel k-means   Imagenet-164: Sparse / Approx. kernel k-means
5,000    6,192 (±424) / 1,693 (±339)                  24,029 (±4,469) / 15,691 (±3,786)
10,000   18,256 (±21) / 4,134 (±549)                  36,669 (±603) / 16,023 (±3,577)
15,000   34,192 (±2,652) / 7,856 (±929)               53,142 (±3,058) / -
20,000   49,887 (±93) / 11,394 (±600)                 74,794 (±870) / -

Table 5.5: Comparison of the silhouette coefficient ($\times 10^{-2}$) of the proposed sparse kernel $k$-means algorithm and the approximate kernel $k$-means algorithm on the CIFAR-100 data set. Parameter $m$ represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel $k$-means algorithm. The remaining parameters of the proposed algorithm were set to $M = 50{,}000$ and $p = 1{,}000$.

m                                  5,000           10,000          15,000          20,000
Sparse kernel k-means (proposed)   19.42 (±0.12)   11.77 (±0.04)   11.67 (±0.06)   11.36 (±0.07)
Approx. kernel k-means             2.45 (±0.03)    2.37 (±0.02)    2.45 (±0.02)    2.33 (±0.02)

[Figure 5.4: Comparison of the NMI (in %) of the proposed sparse kernel $k$-means algorithm and the approximate kernel $k$-means algorithm on the (a) CIFAR-100 and (b) Imagenet-164 data sets. Parameter settings as in Table 5.4. Approximate kernel $k$-means is infeasible for the Imagenet-164 data set when $m > 10{,}000$ due to its large size.]

Maximum sample size: In our experiments, we set the maximum sample size to 50,000. We found that this parameter is not as critical as the initial sample, provided that it is set large enough to accommodate a sufficiently representative sample. On both the CIFAR-100 and Imagenet-164 data sets, the number of points added to the buffer ranges from 100 to 500, on an average. The number of points added decreases as the initial sample size $m$ increases from 5,000 to 20,000.

Table 5.6: Effect of the size of the neighborhood $p$ on the running time (in seconds), the silhouette coefficient and NMI (in %) of the proposed sparse kernel $k$-means algorithm on the CIFAR-100 and Imagenet-164 data sets. The remaining parameters of the proposed algorithm were set to $m = 20{,}000$ and $M = 50{,}000$.
p        CIFAR-100: Running time / Silhouette (×10⁻²) / NMI      Imagenet-164: Running time / NMI
1,000    49,887 (±93) / 11.36 (±0.07) / 12.23 (±2.3)             74,794 (±870) / 16.15 (±0.004)
5,000    52,073 (±483) / 11.25 (±0.06) / 12.09 (±0.02)           82,880 (±21,360) / 17.58 (±0.10)
10,000   54,205 (±874) / 12.27 (±0.12) / 13.86 (±0.07)           192,725 (±3,874) / 18.01 (±0.07)
15,000   55,062 (±837) / 11.32 (±0.09) / 14.00 (±0.01)           247,911 (±7,789) / 18.23 (±0.004)

For instance, on the CIFAR-100 data set, when $m = 5{,}000$, 453 additional points were added to the buffer. When $m = 20{,}000$, only 69 points were added.

Size of the neighborhood: The number of neighbors $p$ used to construct the sparse kernel similarity is also important to the performance of the proposed sparse kernel $k$-means algorithm. Table 5.6 shows how the running time, the silhouette coefficient values, and the NMI values on the CIFAR-100 and Imagenet-164 data sets are affected as the value of $p$ increased from 1,000 to 15,000, with the initial sample size $m$ fixed at 20,000. The running time doubles when $p$ increases from $p = 1{,}000$ to $p = 15{,}000$, on both the data sets. This is due to the fact that a larger number of similarity computations need to be performed as the value of $p$ increases. However, although there is a small increase in the silhouette coefficient and NMI values, the increase is not significant enough to justify the increase in the running time. We conclude that the neighborhood size is an important parameter in determining the efficiency of the algorithm.

Number of clusters: We show the effect of varying the number of clusters $C$ on the performance of the proposed algorithm in Figures 5.5 and 5.6. The running time of the algorithm increases with the number of clusters. However, unlike many other clustering algorithms, including the approximate kernel $k$-means, RFF, SV and approximate stream kernel $k$-means clustering algorithms presented in the earlier chapters, the running time of the sparse kernel $k$-means algorithm increases almost logarithmically with the number of clusters, on most data sets. The NMI values achieved by our algorithm also increase as the number of clusters increases. We note that the NMI values of the proposed algorithm are better than those achieved by the baseline algorithms, on both the CIFAR-100 and Imagenet-164 data sets, for all values of $C$.

[Figure 5.5: Effect of the number of clusters $C$ on the running time (in seconds) of the proposed sparse kernel $k$-means algorithm, on the (a) CIFAR-100, (b) Imagenet-164, (c) Youtube and (d) Tiny data sets.]

[Figure 5.6: Effect of the number of clusters $C$ on the NMI (in %) of the proposed sparse kernel $k$-means algorithm, on the (a) CIFAR-100 and (b) Imagenet-164 data sets.]

5.5.3.4 Scalability

We varied the number of points, the dimensionality and the number of clusters in the concentric circles data set, and executed our algorithm on these data sets to examine its scalability. We used the RBF kernel to compute the inter-point similarity. The algorithm parameters $m$, $p$ and $M$ were set to $m = 1{,}000$, $p = 100$ and $M = 20{,}000$ respectively, and the data was input in batches of 10.

[Figure 5.7: Running time of the sparse kernel $k$-means clustering algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.]
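A minimal generator for such a synthetic concentric-circles set is sketched below. The choice of radii, the jitter magnitude, and the padding of extra noise dimensions to vary $d$ are our own illustrative assumptions, not the thesis's generator.

```python
import numpy as np

def concentric_circles(n=1000, C=10, d=2, jitter=0.05, seed=0):
    """n points spread evenly over C concentric circles (cf. Figure 5.1),
    optionally embedded in d > 2 dimensions by appending noise coordinates."""
    rng = np.random.default_rng(seed)
    per = n // C
    points, labels = [], []
    for k in range(C):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=per)
        radius = (k + 1) + jitter * rng.standard_normal(per)  # circle k has radius k+1
        points.append(np.column_stack([radius * np.cos(theta),
                                       radius * np.sin(theta)]))
        labels.append(np.full(per, k))
    X = np.vstack(points)
    if d > 2:                                                 # pad to d dimensions
        X = np.hstack([X, jitter * rng.standard_normal((X.shape[0], d - 2))])
    return X, np.concatenate(labels)
```

Because the circles are not linearly separable, this set exercises exactly the regime where an RBF kernel succeeds and linear $k$-means fails, while $n$, $d$, and $C$ can each be scaled independently.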
Figures 5.7(a), 5.7(b) and 5.7(c) show that the proposed algorithm is linearly scalable with respect to the size and dimensionality of the data set, and almost logarithmically scalable with respect to the number of clusters, in accordance with the complexity analysis in Section 5.4.1. In Figure 5.7(a), the size of the data set is varied from $n = 100$ to $n = 10^7$, while the dimensionality and the number of clusters are fixed at $d = 100$ and $C = 10$. The running time of the proposed algorithm increases linearly with $n$. Figure 5.7(b) shows the running time of the proposed algorithm as the dimensionality of the data varies between $d = 10$ and $d = 1{,}000$, while $n = 10^6$ and $C = 10$. Finally, the running time of our algorithm increases logarithmically as the number of clusters increases from $C = 10$ to $C = 1{,}000$, with $n = 10^6$ and $d = 100$, as shown in Figure 5.7(c).

5.6 Summary

In this chapter, we have proposed the sparse kernel $k$-means clustering algorithm, which can efficiently cluster large high-dimensional data sets into a large number of clusters. By sampling the data points based on their novelty, defined in terms of the statistical leverage scores, we only store the most informative points in the data, thereby limiting the memory requirements. We need to compute the kernel similarity of the data points only with respect to these sampled points, thus reducing the running time complexity. We further reduce the running time complexity by introducing sparsity into the kernel, based on the assumption that the kernel function is appropriately defined, and nearby points in the kernel space have similar labels. We demonstrated that the proposed algorithm is scalable and accurate using several large benchmark data sets.

Chapter 6

Summary and Future Work

As the amount of digital data continues to grow at a rapid rate, continued efforts to design and develop scalable and efficient algorithms to organize this data and extract useful information from it are essential. We have focused on the unsupervised learning task of clustering in this thesis. While linear clustering algorithms (e.g., $k$-means) are fast and scalable, they are incapable of finding the underlying clusters in real-world data sets with high accuracy. On the other hand, kernel-based clustering algorithms are accurate, but are not scalable to big data sets. We have proposed a number of kernel-based clustering algorithms which are not only scalable to data sets containing billions of data points, but also achieve cluster quality comparable to that of the existing kernel-based clustering algorithms. The proposed algorithms are primarily based on randomly sampling the data sets and finding the clusters using fast iterative optimization techniques. The main contribution of this thesis is the design of approximate algorithms for the advancement of the scalability of kernel-based clustering, while maintaining the cluster quality, and demonstrating the performance of the proposed algorithms on diverse data sets.

6.1 Contributions

The approximate batch kernel clustering algorithms proposed in Chapters 2 and 3 make the following contributions:

- The approximate kernel $k$-means algorithm in Chapter 2 demonstrates that, by using uniform random sampling, kernel-based clustering can be performed in $O(nmC + nmd)$ time, where $n$ is the size of the given data set, $d$ is its dimensionality, $C$ is the number of clusters, and $m$ is the number of samples from the data set ($m \ll n$). This running time complexity is significantly smaller than the $O(n^2 C + n^2 d)$ complexity of classical kernel-based clustering algorithms, given that $m \ll n$.

- In contrast to the approximate kernel $k$-means algorithm, which decomposes the kernel matrix into its low-rank components, the RFF and SV kernel clustering algorithms, introduced in Chapter 3, factor the kernel function using the Fourier transform, and project the data into a low-dimensional space spanned by the Fourier components.
The RFF and SV clustering algorithms have $O(nm\log(d) + nmC)$ and $O(nm\log(d) + nC^2)$ running time complexities, respectively, where $m \ll n$ is the number of Fourier components. Both algorithms perform well on large high-dimensional data sets. The SV clustering algorithm is faster than the RFF and approximate kernel $k$-means algorithms when the number of clusters $C$ is small (less than 100), with a minimal loss in cluster quality.

- The error incurred by the approximate kernel $k$-means algorithm due to sampling is $O(1/m)$, which implies that the error reduces linearly as the number of data points sampled from the data set increases. Similarly, the error incurred by the RFF and SV clustering algorithms reduces at the rate of $O(1/\sqrt{m})$ and $O(1/m)$, respectively, where $m$ represents the number of Fourier components used for projection.

- The best clustering quality is achieved by these approximate algorithms when the number of samples (or the number of Fourier components) $m$ is significantly greater than $C$, and the eigenvalues of the kernel matrix have a long-tailed distribution.

- The proposed algorithms achieve clustering quality similar to the kernel $k$-means and spectral clustering algorithms on large benchmark text and image data sets, containing up to 10 million data points, with significantly lower running time.

The online kernel clustering algorithms proposed in Chapters 4 and 5 make the following contributions:

- Data streams often contain an unbounded number of data points, so it is impossible to store the entire data set in memory. It is also difficult to uniformly sample streaming data sets because of their arbitrary size. The approximate stream kernel $k$-means algorithm, introduced in Chapter 4, relies on importance sampling, and thereby uses only the most informative data points in the stream to perform clustering. Importance sampling is inherently a complex procedure because it requires the eigendecomposition of the kernel matrix. By devising an efficient online method to perform importance sampling, we have reduced its running time complexity. The approximate stream kernel $k$-means algorithm can cluster large stream data sets in $O(nd + nC)$ time.

- We have demonstrated the performance of the approximate stream kernel $k$-means on the Twitter stream. It can also be applied to find clusters in financial data, climate data, click-streams, etc.

- When the number of clusters $C$ in the data set is large (in the order of tens of thousands), the existing kernel clustering algorithms have long running times as a result of their linear running time complexity with respect to $C$. By using importance sampling to sample the data set, and inducing sparsity into the kernel matrix constructed from the sampled data points, the sparse kernel $k$-means algorithm, introduced in Chapter 5, reduces this time complexity to $O(nd + n\log C)$, with $O(1/\sqrt{m})$ approximation error, where $m$ is the number of points sampled from the data set.

- We have demonstrated the scalability of the sparse kernel $k$-means algorithm on large heterogeneous data sets such as the Tiny image data set and the Youtube data set (text and image), containing millions of data points with up to 10,000 clusters.

- The loss in the clustering quality by the approximate stream kernel $k$-means and the sparse kernel $k$-means algorithms is minimal when compared to the batch kernel $k$-means clustering algorithm.

The crux of the proposed algorithms is to randomly sample the large data sets and thereby reduce the number of similarity computations required to construct the kernel matrix and cluster the data. The sample size and the sampling strategy play a crucial role in the performance of the algorithms.
While the proposed batch clustering algorithms select the samples uniformly from the given data set, the online algorithms employ the more sophisticated importance sampling strategy. The importance sampling technique reduces the total number of samples required because it chooses the data points intelligently, based on the data distribution.

6.2 Future Work

Kernel-based clustering research presented in this dissertation can be further advanced as follows:

Parallelization. In contrast to linear clustering algorithms, kernel-based clustering algorithms need the computation of the kernel matrix, due to which they are more difficult to parallelize than linear clustering algorithms. The approximate kernel-based clustering algorithms presented in this thesis are easier to parallelize than classical kernel-based clustering algorithms. Unlike parallel versions of the classical kernel-based clustering algorithms, which require all the data to be replicated in all the nodes, the approximate kernel-based clustering algorithms require only the sampled data points to be replicated. This reduces the amount of memory required and the communication cost. We have demonstrated how the approximate kernel $k$-means algorithm can be executed on a distributed framework in Chapter 2. The RFF and SV clustering algorithms can be similarly parallelized. However, the remaining approximate kernel-based clustering algorithms proposed in this thesis rely on the eigenvectors of the approximate kernel matrices, and need effective online parallel techniques for eigenvector updates. Parallelization of these algorithms can aid in their deployment to large-scale computing frameworks.

Kernel selection. As demonstrated in Chapter 1, the kernel function used to define the inter-point similarity plays a crucial role in the efficiency and accuracy of kernel clustering. Employing the wrong kernel for clustering can adversely affect the cluster quality, and can result in clustering quality worse than that of linear clustering algorithms. However, choosing the correct kernel, and selecting the kernel parameters, is a challenging task. Although a few algorithms have been proposed to learn the kernel from the data in an unsupervised manner, these algorithms have high running time complexity, resulting in their non-scalability. More scalable techniques have been developed to learn the kernel in the supervised and semi-supervised settings, but obtaining the labels for large data sets is expensive and often impossible. Development of scalable unsupervised kernel learning algorithms is a potential direction for future work.

Overlapping clusters. In applications such as user community detection in social networks, users often belong to more than one community, causing the clusters to overlap with each other. Very few efforts have been made to find such overlapping clusters using kernel-based clustering techniques. Fuzzy kernel clustering techniques only compute the probability that a data point belongs to a cluster, and do not deterministically find the cluster memberships. More concrete scalable techniques need to be developed to find overlapping clusters in data.

BIBLIOGRAPHY

[1] Data Analytics. http://searchdatamanagement.techtarget.com/definition/data-analytics, Jan 2008.
[2] Big Data in 2020. http://www.emc.com/collateral/analyst-reports/idc-digital-universe-2014.pdf, Dec 2012. IDC and EMC Corp Report.
[3] M. R. Abbasifard, B. Ghahremani, and H. Naderi. A survey on nearest neighbor search methods. International Journal of Computer Applications, 95(25):39–52, 2014.
[4] M. E. Abbasnejad, D. Ramachandram, and R. Mandava. A survey of the state of the art in learning the kernels. Knowledge and Information Systems, 31(2):193–221, 2012.
[5] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM, 54(2), 2007.
[6] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler.
StreamKM++: A clustering algorithm for data streams. Journal of Experimental Algorithmics, 17:1–30, 2012.
[7] R. H. Affandi, A. Kulesza, E. Fox, and B. Taskar. Nystrom approximation for large-scale determinantal processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 85–98, 2013.
[8] C. C. Aggarwal. A survey of stream clustering algorithms. In Data Clustering: Algorithms and Applications, pages 231–258. 2013.
[9] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In Proceedings of the International Conference on Very Large Data Bases, pages 852–863, 2004.
[10] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Proceedings of the Conference on Neural Information Processing Systems, pages 10–18, 2009.
[11] C. Alzate and J. A. K. Suykens. Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(2):335–347, 2010.
[12] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[13] K. Bache and M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.
[14] A. Barla, F. Odone, and A. Verri. Histogram intersection kernel for image classification. In Proceedings of the International Conference on Image Processing, volume 3, pages 513–516, 2003.
[15] O. Beaumont, H. Larchevêque, and L. Marchal. Non-linear divisible loads: There is no free lunch. In Proceedings of the International Parallel and Distributed Processing Symposium, pages 1–10, 2012.
[16] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on Twitter. In Proceedings of the International AAAI Conference on Weblogs and Social Media, pages 438–441, 2011.
[17] M. A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences, 106(2):369–374, 2009.
[18] S. Belongie, C. Fowlkes, F. Chung, and J. Malik. Spectral partitioning with indefinite kernels using the Nystrom extension. In Proceedings of the European Conference on Computer Vision, pages 531–542, 2002.
[19] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik. Support vector clustering. The Journal of Machine Learning Research, 2:125–137, 2002.
[20] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[21] M. W. Berry. Large-scale sparse singular value computations. International Journal of Supercomputer Applications, 6(1):13–49, 1992.
[22] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. ACM Transactions on Mathematical Software, 31(2):252–269, 2005.
[23] J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–152, 1999.
[24] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[25] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In Proceedings of the Conference on Neural Information Processing Systems, pages 135–143, 2009.
[26] S. Bochner and K. Chandrasekharan. Fourier Transforms. Princeton University Press, 1949.
[27] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
[28] C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 968–977, 2009.
[29]D.C.Brabham. Crowdsourcing .MITPress,2013. [30]P.S.Bradley,U.M.Fayyad,andC.Reina.Scalingcluste ringalgorithmstolargedatabases. In ProceedingsoftheInternationalConferenceonKnowledgeD iscoveryandDataMining , pages9Œ15,1998. [31]M.Brand.Fastlow-rankmodicationsofthethinsingul arvaluedecomposition. Linear AlgebraanditsApplications ,415(1):20Œ30,2006. [32]F.Cao,M.Ester,W.Qian,andA.Zhou.Density-basedclu steringoveranevolvingdata streamwithnoise.In ProceedingsoftheSIAMInternationalConferenceonDataMi ning , pages328Œ339,2006. [33]R.Cattral,F.Oppacher,andD.Deugo.Evolutionarydat aminingwithautomaticrulegener- alization. RecentAdvancesinComputers,ComputingandCommunication s ,pages296Œ300, 2002. [34]S.ChatterjeeandA.S.Hadi.Inuentialobservations, highleveragepoints,andoutliersin linearregression. StatisticalScience ,1(3):379Œ393,1986. [35]W.Chen,Y.Song,H.Bai,C.Lin,andE.Y.Chang.Parallel spectralclusteringindistributed systems. IEEETransactionsonPatternAnalysisandMachineIntellig ence ,33(3):568Œ586, 2011. [36]Y.ChenandL.Tu.Density-basedclusteringforreal-ti mestreamdata.In Proceedingsof theInternationalConferenceonKnowledgeDiscoveryandDa taMining ,pages133Œ142, 2007. [37]Y.Cheng.Meanshift,modeseeking,andclustering. IEEETransactionsonPatternAnalysis andMachineIntelligence ,17(8):790Œ799,1995. [38]R.Chitta,A.K.Jain,andR.Jin.Sparsekernelclusteri ngofmassivehigh-dimensionaldata setswithlargenumberofclusters.In ProceedingsofthePhDWorkshopattheInternational ConferenceonInformationandKnowledgeManagement ,2015. [39]R.Chitta,A.K.Jain,andR.Jin.Sparsekernelclusteri ngofmassivehigh-dimensional datasetswithlargenumberofclusters.TechnicalReportMS U-CSE-15-10,Departmentof ComputerScience,MichiganStateUniversity,2015. 182 [40]R.Chitta,R.Jin,T.C.Havens,andA.K.Jain.Approxima tekernelk-means:Solutionto largescalekernelclustering.In ProceedingsoftheInternationalConferenceonKnowledge DiscoveryandDatamining ,pages895Œ903,2011. [41]R.Chitta,R.Jin,T.C.Havens,andA.K.Jain.Scalablek ernelclustering:Approximate kernelk-means. arxivpreprintarXiv:1402.3849 ,2014. [42]R.Chitta,R.Jin,andA.K.Jain.Efcientkernelcluste ringusingrandomfourierfeatures. In ProceedingsoftheInternationalConferenceonDataMining ,pages161Œ170,2012. [43]R.Chitta,R.Jin,andA.K.Jain.Streamclustering:Ef cientkernel-basedapproximation usingimportancesampling.In ProceedingsoftheICDMWorkshoponDataScienceand BigDataAnalytics ,2015. [44]J.ChiuandL.Demanet.Sublinearrandomizedalgorithm sforskeletondecompositions. SIAMJournalonMatrixAnalysisandApplications ,34(3):1361Œ1383,2013. [45]A.Clauset,C.R.Shalizi,andMarkE.J.Newman.Power-l awdistributionsinempirical data. SIAMReview ,51(4):661Œ703,2009. [46]C.Cortes,M.Mohri,andA.Talwalkar.Ontheimpactofke rnelapproximationonlearning accuracy. JournalofMachineLearningResearch ,9:113Œ120,2010. [47]T.F.CoxandM.A.A.Cox. MultidimensionalScaling .CRCPress,2000. [48]A.S.Das,M.Datar,A.Garg,andS.Rajaram.Googlenewsp ersonalization:Scalableonline collaborativeltering.In ProceedingsoftheInternationalConferenceonWorldWideW eb , pages271Œ280,2007. [49]J.Deng,W.Dong,R.Socher,L.J.Li,K.Li,andL.Fei-Fei .Imagenet:Alarge-scale hierarchicalimagedatabase.In ProceedingsoftheIEEEConferenceonComputerVision andPatternRecognition ,pages248Œ255,2009. [50]A.DeshpandeandS.Vempala.Adaptivesamplingandfast low-rankmatrixapproxima- tion.In Approximation,Randomization,andCombinatorialOptimiz ation:Algorithmsand Techniques ,pages292Œ303.2006. 
[51]I.S.Dhillon,Y.Guan,andB.Kulis.Auniedviewofkern elk-means,spectralclustering andgraphcuts.TechnicalReportTR-04-25,DepartmentofCo mputerScience,University ofTexasatAustin,2004. [52]C.Ding,X.He,andH.D.Simon.Ontheequivalenceofnonn egativematrixfactorization andspectralclustering.In ProceedingsoftheSIAMDataMiningConference ,pages606Œ 610,2005. [53]J.A.Doornik.AnimprovedZigguratmethodtogeneraten ormalrandomsamples. Univer- sityofOxford ,2005. 183 [54]P.Drineas,R.Kannan,andM.W.Mahoney.FastMonte-Car loalgorithmsformatricesII: Computingalow-rankapproximationtoamatrix. SIAMJournalonComputing ,36(1):158Œ 183,2006. [55]P.Drineas,R.Kannan,andM.W.Mahoney.FastMonte-Car loalgorithmsformatricesIII: Computingacompressedapproximatematrixdecomposition. SIAMJournalonComputing , 36(1):184Œ206,2006. [56]P.Drineas,M.Magdon-Ismail,M.W.Mahoney,andD.P.Wo odruff.Fastapproximation ofmatrixcoherenceandstatisticalleverage. TheJournalofMachineLearningResearch , 13(1):3475Œ3506,2012. [57]P.DrineasandM.W.Mahoney.OntheNystrommethodforap proximatingaGrammatrix forimprovedkernel-basedlearning. TheJournalofMachineLearningResearch ,6:2153Œ 2175,2005. [58]C.EckartandG.Young.Theapproximationofonematrixb yanotheroflowerrank. Psy- chometrika ,1(3):211Œ218,1936. [59]R.Edmonds,E.Guskin,A.Mitchell,andM.Jurkowitz.Th eState oftheNewsMedia2013. http://stateofthemedia.org/ 2013/newspapers-stabilizing-but-still-threatened/ newspapers-by-the-numbers ,May2013.PoynterInstituteandPewResearch CenterReport. [60]A.Ene,S.Im,andB.Moseley.FastclusteringusingMapR educe.In Proceedingsofthe InternationalConferenceonKnowledgeDiscoveryandDatam ining ,pages681Œ689,2011. [61]M.Ester,H.P.Kriegel,J.Sander,andX.Xu.Adensity-b asedalgorithmfordiscovering clustersinlargespatialdatabaseswithnoise.In ProceedingsoftheInternationalConference onKnowledgeDiscoveryandDatamining ,pages226Œ231,1996. [62]F.Farnstrom,J.Lewis,andC.Elkan.Scalabilityforcl usteringalgorithmsrevisited. ACM SIGKDDExplorationsNewsletter ,2(1):51Œ57,2000. [63]D.Feldman,M.Schmidt,andC.Sohler.TurningBigdatai ntotinydata:Constant-size coresetsfork-means,PCAandprojectiveclustering.In ProceedingsoftheACM-SIAM SymposiumonDiscreteAlgorithms ,pages1434Œ1453,2013. [64]C.Fellbaum. WordNet:AnElectronicLexicalDatabase .BradfordBooks,1998. [65]M.Filippone,F.Camastra,F.Masulli,andS.Rovetta.A surveyofkernelandspectral methodsforclustering. PatternRecognition ,41(1):176Œ190,2008. [66]G.D.Forney.Generalizedminimumdistancedecoding. IEEETransactionsonInformation Theory ,12(2):125Œ131,1966. 184 [67]C.Fowlkes,S.Belongie,F.Chung,andJ.Malik.Spectra lgroupingusingtheNystrom method. IEEETransactionsonPatternAnalysisandMachineIntellig ence ,pages214Œ225, 2004. [68]C.FraleyandA.E.Raftery.Howmanyclusters?Whichclu steringmethod?Answersvia model-basedclusteranalysis. TheComputerJournal ,41(8):578Œ588,1998. [69]A.Frieze,R.Kannan,andS.Vempala.FastMonte-Carloa lgorithmsforndinglow-rank approximations.In ProceedingsoftheFoundationsofComputerScience ,pages370Œ378, 1998. [70]A.Frieze,R.Kannan,andS.Vempala.FastMonte-Carloa lgorithmsforndinglow-rank approximations. JournaloftheACM ,51(6):1025Œ1041,2004. [71]J.Ginsberg,M.H.Mohebbi,R.S.Patel,L.Brammer,M.S. Smolinski,andL.Brilliant. Detectinginuenzaepidemicsusingsearchenginequerydat a. Nature ,457(7232):1012Œ 1014,2008. [72]M.Girolami.Mercerkernel-basedclusteringinfeatur espace. IEEETransactionsonNeural Networks ,13(3):780Œ784,2002. [73]A.Gittens,P.Kambadur,andC.Boutsidis.Approximate spectralclusteringviarandomized sketching. arXivpreprintarXiv:1311.2854 ,2013. 
[74]A.GittensandM.W.Mahoney.RevisitingtheNystrommet hodforimprovedlarge-scale machinelearning. arXivpreprintarXiv:1303.1849 ,2013. [75]F.Godin,V.Slavkovikj,W.DeNeve,B.Schrauwen,andR. VandeWalle.Usingtopic modelsfortwitterhashtagrecommendation.In ProceedingsoftheInternationalConference onWorldWideWebCompanion ,pages593Œ596,2013. [76]J.C.Gower.Addingapointtovectordiagramsinmultiva riateanalysis. Biometrika , 55(3):582Œ585,1968. [77]J.C.GowerandG.J.S.Ross.Minimumspanningtreesands inglelinkageclusteranalysis. AppliedStatistics ,pages54Œ64,1969. [78]H.P.Graf,E.Cosatto,L.Bottou,I.Dourdanovic,andV. Vapnik.Parallelsupportvector machines:ThecascadeSVM.In ProceedingsoftheConferenceonNeuralInformation ProcessingSystems ,pages521Œ528,2004. [79]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O'Cal laghan.Clusteringdata streams:Theoryandpractice. IEEETransactionsonKnowledgeandDataEngineering , pages515Œ528,2003. [80]S.Guha,R.Rastogi,andK.Shim.CURE:Anefcientclust eringalgorithmforlarge databases. InformationSystems ,26(1):35Œ58,2001. 185 [81]R.Hamid,Y.Xiao,A.Gittens,andD.DeCoste.Compactra ndomfeaturemaps. arXiv preprintarXiv:1312.4626 ,2013. [82]S.Har-PeledandS.Mazumdar.Oncoresetsfork-meansan dk-medianclustering.In ProceedingsoftheACMSymposiumonTheoryofComputing ,pages291Œ300,2004. [83]T.Hastie,R.Tibshirani,andJ.Friedman. TheElementsofStatisticalLearning ,volume2. Springer,2009. [84]T.C.Havens.Approximationofkernelk-meansforstrea mingdata.In Proceedingsofthe InternationalConferenceonPatternRecognition ,pages509Œ512,2012. [85]T.C.HavensandJ.C.Bezdek.Anefcientformulationof theimprovedvisualassess- mentofclustertendency(iVAT)algorithm. IEEETransactionsonKnowledgeandData Engineering ,24(5):813Œ822,2012. [86]A.Jain,Z.Zhang,andE.Y.Chang.Adaptivenon-linearc lusteringindatastreams.In ProceedingsoftheInternationalConferenceonInformatio nandKnowledgeManagement , pages122Œ131,2006. [87]A.K.Jain.Dataclustering:50yearsbeyondk-means. PatternRecognitionLetters , 31(8):651Œ666,2010. [88]A.K.JainandR.C.Dubes. AlgorithmsforClusteringData .Prentice-Hall,Inc.,1988. [89]A.K.Jain,R.P.W.Duin,andJ.Mao.Statisticalpattern recognition:Areview. IEEE TransactionsonPatternAnalysisandMachineIntelligence ,22(1):4Œ37,2000. [90]A.K.Jain,M.N.Murty,andP.J.Flynn.Dataclustering: Areview. ACMComputing Surveys ,31(3):264Œ323,1999. [91]T.Kanungo,D.M.Mount,N.S.Netanyahu,C.D.Piatko,R. Silverman,andA.Y.Wu.An efcientk-meansclusteringalgorithm:Analysisandimple mentation. IEEETransactions onPatternAnalysisandMachineIntelligence ,24(7):881Œ892,2002. [92]P.KarandH.Karnick.Randomfeaturemapsfordotproduc tkernels.In Proceedingsofthe InternationalConferenceonArticialIntelligenceandSt atistics ,pages583Œ591,2012. [93]G.KarypisandV.Kumar.Asoftwarepackageforpartitio ningunstructuredgraphs,parti- tioningmeshes,andcomputingll-reducingorderingsofsp arsematrices.Technicalreport, DepartmentofComputerScience,UniversityofMinnesota,1 998. [94]L.KaufmanandP.J.Rousseeuw. FindingGroupsinData:AnIntroductiontoCluster Analysis .WileyBlackwell,2005. [95]D.W.Kim,K.Y.Lee,D.Lee,andK.H.Lee.Evaluationofth eperformanceofclustering algorithmsinkernel-inducedfeaturespace. PatternRecognition ,38(4):607Œ611,2005. 186 [96]T.Kohonen. Self-organizingMaps .Springer,2001. [97]S.B.Kotsiantis.Supervisedmachinelearning:Arevie wofclassicationtechniques. Infor- matica ,31(3),2007. [98]P.Kranen,I.Assent,C.Baldauf,andT.Seidl.TheClusT ree:Indexingmicro-clustersfor anytimestreammining. KnowledgeandInformationSystems ,29(2):249Œ272,2011. 
[99]A.KrizhevskyandG.Hinton.Learningmultiplelayerso ffeaturesfromtinyimages.Tech- nicalreport,DepartmentofComputerScience,Universityo fToronto,2009. [100]B.KulisandK.Grauman.Kernelizedlocality-sensiti vehashingforscalableimagesearch. In ProceedingsoftheInternationalConferenceonComputerVi sion ,pages2130Œ2137, 2009. [101]A.Kumar,Y.Sabharwal,andS.Sen.Asimplelineartime (1+ )-approximationalgorithm fork-meansclusteringinanydimensions.In ProceedingsoftheIEEESymposiumonFoun- dationsofComputerScience ,pages454Œ462,2004. [102]S.Kumar,M.Mohri,andA.Talwalkar.Onsampling-base dapproximatespectraldecom- position.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages553Œ 560,2009. [103]S.Kumar,M.Mohri,andA.Talwalkar.Samplingtechniq uesfortheNystrommethod.In ProceedingsofConferenceonArticialIntelligenceandSt atistics ,pages304Œ311,2009. [104]T.O.Kvalseth.Entropyandcorrelation:Somecomment s. IEEETransactionsonSystems, ManandCybernetics ,17(3):517Œ519,1987. [105]D.Laney.3Ddatamanagement:Controllingdatavolume ,velocity,andvariety.Technical report,METAGroup,2001. [106]S.Lazebnik,C.Schmid,andJ.Ponce.Beyondbagsoffea tures:Spatialpyramidmatching forrecognizingnaturalscenecategories.In ProceedingsoftheIEEEComputerSocietyCon- ferenceonComputerVisionandPatternRecognition ,volume2,pages2169Œ2178,2006. [107]Q.Le,T.Sarlos,andA.Smola.Fastfood-Approximatin gkernelexpansionsinloglinear time.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages16Œ21, 2013. [108]Y.LeCun,L.Bottou,Y.Bengio,andP.Haffner.Gradien t-basedlearningappliedtodocu- mentrecognition. ProceedingsoftheIEEE ,86(11):2278Œ2324,1998. [109]J.Lee,S.Kim,G.Lebanon,andY.Singer.Locallow-ran kmatrixapproximation.In ProceedingsofInternationalConferenceonMachineLearni ng ,pages82Œ90,2013. [110]F.Li,C.Ionescu,andC.Sminchisescu.RandomFourier approximationsforskewedmulti- plicativehistogramkernels. PatternRecognition ,pages262Œ271,2010. 187 [111]M.Li,J.T.Kwok,andB.Lu.Makinglarge-scaleNystrom approximationpossible.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages631Œ638,2010. [112]B.Liu,S.X.Xia,andY.Zhou.Unsupervisednon-parame trickernellearningalgorithm. Knowledge-BasedSystems ,44:1Œ9,2013. [113]L.L.Liu,X.B.Wen,andX.X.Gao.SegmentationforSARi magebasedonanewspectral clusteringalgorithm. LifeSystemModelingandIntelligentComputing ,pages635Œ643, 2010. [114]R.LiuandH.Zhang.SamplingcriteriafortheNystromm ethod. http://citeseerx. ist.psu.edu/viewdoc/summary?doi=10.1.1.112.6368 . [115]T.Liu,C.Rosenberg,andH.A.Rowley.Clusteringbill ionsofimageswithlargescale nearestneighborsearch.In ProceedingsoftheIEEEWorkshoponApplicationsofCompute r Vision ,pages28Œ33,2007. [116]S.Lloyd.LeastsquaresquantizationinPCM. IEEETransactionsonInformationTheory , 28(2):129Œ137,1982. [117]D.G.Lowe.Distinctiveimagefeaturesfromscale-inv ariantkeypoints. InternationalJour- nalofComputerVision ,60(2):91Œ110,2004. [118]U.Luxburg.Atutorialonspectralclustering. StatisticsandComputing ,17(4):395Œ416, 2007. [119]U.Luxburg. ClusteringStability .NowPublishersInc.,2010. [120]D.MacDonaldandC.Fyfe.Thekernelself-organisingm ap.In ProceedingsoftheIn- ternationalConferenceonKnowledge-BasedIntelligentEn gineeringSystemsandAllied Technologies ,volume1,pages317Œ320,2002. [121]M.Mahajan,P.Nimbhorkar,andK.Varadarajan.Thepla nark-meansproblemisNP-Hard. In ProceedingsoftheInternationalWorkshoponAlgorithmsan dComputation ,pages274Œ 285,2009. [122]M.W.MahoneyandP.Drineas.CURmatrixdecomposition sforimproveddataanalysis. ProceedingsoftheNationalAcademyofSciences ,106(3):697Œ702,2009. 
[123]O.A.MaillardandR.Munos.Compressedleast-squares regression.In Proceedingsofthe ConferenceonNeuralInformationProcessingSystems ,pages1213Œ1221,2009. [124]S.MalinowskiandR.Morla.AsinglepassTrellis-base dalgorithmforclusteringevolving datastreams. DataWarehousingandKnowledgeDiscovery ,pages315Œ326,2012. [125]C.D.Manning,P.Raghavan,andH.Schütze. IntroductiontoInformationRetrieval .Cam- bridgeUniversityPress,2008. 188 [126]A.McCallum,K.Nigam,andL.H.Ungar.Efcientcluste ringofhigh-dimensionaldata setswithapplicationtoreferencematching.In ProceedingsoftheInternationalConference onKnowledgeDiscoveryandDataMining ,pages169Œ178,2000. [127]G.McLachlanandD.Peel. FiniteMixtureModels .JohnWiley&Sons,2004. [128]M.McPherson,L.Smith-Lovin,andJ.M.Cook.Birdsofa feather:Homophilyinsocial networks. AnnualReviewofSociology ,pages415Œ444,2001. [129]A.K.MenonandC.Elkan.Fastalgorithmsforapproxima tingthesingularvaluedecompo- sition. ACMTransactionsonKnowledgeDiscoveryfromData ,5(2):1Œ36,2011. [130]K.Mizumoto,H.Yanagimoto,andM.Yoshioka.Sentimen tanalysisofstockmarketnews withsemi-supervisedlearning.In ProceedingsoftheIEEE/ACISInternationalConference onComputerandInformationScience ,pages325Œ328,2012. [131]A.W.Moore.Anintroductorytutorialonkd-trees.Tec hnicalreport,DepartmentofCom- puterScience,CarnegieMellonUniversity,1991. [132]M.MujaandD.G.Lowe.Scalablenearestneighboralgor ithmsforhighdimensionaldata. IEEETransactionsonPatternAnalysisandMachineIntellig ence ,36(11):2227Œ2240,2014. [133]R.Nallapati,W.Cohen,andJ.Lafferty.Parallelized variationalEMforLatentDirichlet Allocation:Anexperimentalevaluationofspeedandscalab ility. ICDMWorkshoponHigh PerformanceDataMining ,pages349Œ354,2007. [134]O.Nasraoui,C.Cardona,andC.Rojas.Usingretrieval measurestoassesssimilarityinmin- ingdynamicwebclickstreams.In ProceedingsoftheInternationalConferenceonKnowl- edgeDiscoveryinDataMining ,pages439Œ448,2005. [135]D.Newman,A.Asuncion,P.Smyth,andM.Welling.Distr ibutedinferenceforLatent DirichletAllocation.In ProceedingsoftheConferenceonNeuralInformationProces sing Systems ,pages17Œ24,2007. [136]R.T.NgandJ.Han.CLARANS:Amethodforclusteringobj ectsforspatialdatamining. IEEETransactionsonKnowledgeandDataEngineering ,pages1003Œ1016,2002. [137]N.H.Nguyen,P.Drineas,andT.D.Tran.Matrixsparsi cationviatheKhintchineinequal- ity. http://citeseerx.ist.psu.edu/viewdoc/download?doi=1 0.1.1. 164.4755&rep=rep1&type=pdf ,2009. [138]L.Nguyen-Dinh,C.Waldburger,D.Roggen,andG.Tröst er.Tagginghumanactivitiesin videobycrowdsourcing.In ProceedingsoftheConferenceonInternationalConference on MultimediaRetrieval ,pages263Œ270,2013. [139]H.Ning,W.Xu,Y.Chi,Y.Gong,andT.S.Huang.Incremen talspectralclusteringby efcientlyupdatingtheeigen-system. PatternRecognition ,43(1):113Œ127,2010. 189 [140]L.O'Callaghan,N.Mishra,S.Guha,A.M.Meyerson,and R.Motwani.Streaming-data algorithmsforhigh-qualityclustering.In ProceedingsoftheInternationalConferenceon DataEngineering ,pages685Œ695,2002. [141]A.OlivaandA.Torralba.Modelingtheshapeofthescen e:Aholisticrepresentationofthe spatialenvelope. InternationalJournalofComputerVision ,42(3):145Œ175,2001. [142]M.OuimetandY.Bengio.Greedyspectralembedding.In ProceedingsoftheInternational WorkshoponArticialIntelligenceandStatistics ,pages253Œ260,2005. [143]S.Owen,R.Anil,T.Dunning,andE.Friedman. MahoutinAction .ManningPublications Co.,2011. [144]G.Petkos,S.Papadopoulos,andY.Kompatsiaris.Two- levelmessageclusteringfortopic detectioninTwitter.In ProceedingsoftheSNOWDataChallenge ,pages49Œ56,2014. 
[145]A.K.QinandandP.N.Suganthan.Kernelneuralgasalgo rithmswithapplicationtocluster analysis. PatternRecognition ,4:617Œ620,2004. [146]M.RaginskyandS.Lazebnik.Locality-sensitivebina rycodesfromshift-invariantkernels. In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages1509Œ 1517,2009. [147]A.RahimiandB.Recht.Randomfeaturesforlarge-scal ekernelmachines.In Proceedings oftheConferenceonNeuralInformationProcessingSystems ,pages1177Œ1184,2007. [148]C.Ranger,R.Raghuraman,A.Penmetsa,G.Bradski,and C.Kozyrakis.Evaluatingmapre- duceformulti-coreandmultiprocessorsystems.In ProceedingsoftheIEEESymposiumon HighPerformanceComputerArchitecture ,pages13Œ24,2007. [149]P.P.Rodrigues.Learningfromubiquitousdatastream s:Clusteringdataanddatasources. AICommunications ,25(1):69Œ71,2012. [150]K.D.Rosa,R.Shah,B.Lin,A.Gershman,andR.Frederki ng.Topicalclusteringoftweets. In ProceedingsoftheACMSIGIRWorkshoponSocialWebSearchan dMining ,2011. [151]P.J.Rousseeuw.Silhouettes:Agraphicalaidtothein terpretationandvalidationofcluster analysis. JournalofComputationalandAppliedMathematics ,20:53Œ65,1987. [152]W.Rudin. FourierAnalysisonGroups .Wiley-Interscience,1990. [153]T.SakaiandA.Imiya.Fastspectralclusteringwithra ndomprojectionandsampling. Ma- chineLearningandDataMininginPatternRecognition ,pages372Œ384,2009. [154]H.Samet. FoundationsofMultidimensionalandMetricDataStructure s .MorganKauf- mann,2006. 190 [155]T.Sarlos.Improvedapproximationalgorithmsforlar gematricesviarandomprojections.In ProceedingsoftheIEEESymposiumonFoundationsofCompute rScience ,pages143Œ152, 2006. [156]F.Schleif,A.Gisbrecht,andB.Hammer.Accelerating kernelneuralgas.In Proceedingsof theInternationalConferenceonArticialNeuralNetworks andMachineLearning ,pages 150Œ158.2011. [157]F.Schleif,X.Zhu,A.Gisbrecht,andB.Hammer.Fastap proximatedrelationalandkernel clustering.In ProceedingsoftheInternationalConferenceonPatternRec ognition ,pages 1229Œ1232,2012. [158]B.Schölkopf,R.Herbrich,andA.Smola.Ageneralized representertheorem.In Proceed- ingsofComputationalLearningTheory ,pages416Œ426,2001. [159]B.SchölkopfandA.Smola. Learningwithkernels:Supportvectormachines,regulariz a- tion,optimization,andbeyond(Adaptivecomputationandm achinelearning) .TheMIT Press,2001. [160]B.Schölkopf,A.Smola,andK.R.Muller.Nonlinearcom ponentanalysisasakerneleigen- valueproblem. NeuralComputation ,10(5):1299Œ1314,1996. [161]J.ShiandJ.Malik.Normalizedcutsandimagesegmenta tion. IEEETransactionsonPattern AnalysisandMachineIntelligence ,22(8):888Œ905,2002. [162]M.Shindler,A.Wong,andA.W.Meyerson.Fastandaccur atek-meansforlargedatasets. In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages2375Œ 2383,2011. [163]H.D.SimonandH.Zha.Low-rankmatrixapproximationu singtheLanczosbidiagonaliza- tionprocesswithapplications. SIAMJournalonScienticComputing ,21(6):2257Œ2274, 2000. [164]A.Smola,L.Song,andC.H.Teo.Relativenoveltydetec tion.In Proceedingsofthe InternationalConferenceonArticialIntelligenceandSt atistics ,volume5,pages536Œ543, 2009. [165]S.SteveandD.X.Zhou.Geometryonprobabilityspaces . ConstructiveApproximation , 30:311Œ323,2009. [166]G.W.Stewart. MatrixPerturbationTheory .AcademicPress,1990. [167]S.J.Stolfo,W.Fan,W.Lee,A.Prodromidis,andP.K.Ch an.Cost-basedmodelingfor fraudandintrusiondetection:ResultsfromtheJAMproject .In ProceedingsoftheDARPA InformationSurvivabilityConferenceandExposition ,volume2,pages130Œ144,2000. 191 [168]A.StrehlandJ.Ghosh.Clusterensembles-Aknowledge reuseframeworkforcombining multiplepartitions. 
JournalofMachineLearningResearch ,3:583Œ617,2003. [169]Z.SunandG.Fox.StudyonparallelSVMbasedonMapRedu ce.In Proceedingsofthe InternationalConferenceonParallelandDistributedProc essingTechniquesandApplica- tions ,pages495Œ561,2012. [170]A.TalwalkarandA.Rostamizadeh.Matrixcoherencean dtheNystrommethod.In Pro- ceedingsofConferenceonUncertaintyinArticialIntelli gence ,2010. [171]P.Tan,M.Steinbach,andV.Kumar. IntroductiontoDataMining .PearsonEducation,2007. [172]R.Tibshirani,G.Walther,andT.Hastie.Estimatingt henumberofclustersinadatasetvia thegapstatistic. JournaloftheRoyalStatisticalSociety:SeriesB(Statist icalMethodology) , 63(2):411Œ423,2001. [173]A.Torralba,R.Fergus,andW.T.Freeman.80millionti nyimages:Alargedatasetfor nonparametricobjectandscenerecognition. IEEETransactionsonPatternAnalysisand MachineIntelligence ,30(11):1958Œ1970,2008. [174]J.W.Tukey. ExploratoryDataAnalysis .Reading,MA,1977. [175]J.Tzeng.Split-and-combinesingularvaluedecompos itionforlarge-scalematrix. Journal ofAppliedMathematics ,2013. [176]J.K.Uhlmann.Satisfyinggeneralproximity/similar ityquerieswithmetrictrees. Informa- tionProcessingLetters ,40(4):175Œ179,1991. [177]H.ValizadeganandR.Jin.Generalizedmaximummargin clusteringandunsupervisedker- nellearning.In ProceedingsoftheConferenceonNeuralInformationProces singSystems , pages1417Œ1424,2006. [178]A.VedaldiandB.Fulkerson.VLFeat:Anopenandportab lelibraryofcomputervision algorithms. http://www.vlfeat.org ,2008. [179]A.VedaldiandA.Zisserman.Efcientadditivekernel sviaexplicitfeaturemaps. IEEE TransactionsonPatternAnalysisandMachineIntelligence ,34(3):480Œ492,2012. [180]S.Vega-PonsandJ.Ruiz-Schulcloper.Asurveyofclus teringensemblealgorithms. Inter- nationalJournalofPatternRecognitionandArticialInte lligence ,25(3):337Œ372,2011. [181]R.Vidal.Subspaceclustering. IEEESignalProcessingMagazine ,28(2):52Œ68,2011. [182]J.Wang,S.C.H.Hoi,P.Zhao,J.Zhuang,andZ.Liu.Larg escaleonlinekernelclassica- tion.In ProceedingsoftheInternationalJointConferenceonArti cialIntelligence ,pages 1750Œ1756,2013. 192 [183]L.Wang,C.Leckie,R.Kotagiri,andJ.Bezdek.Approxi matepairwiseclusteringforlarge datasetsviasamplingplusextension. PatternRecognition ,44(2):222Œ235,2011. [184]S.WangandZ.Zhang.AscalableCURmatrixdecompositi onalgorithm:Lowertime complexityandtighterbound.In ProceedingsoftheConferenceonNeuralInformation ProcessingSystems ,pages656Œ664,2012. [185]K.Q.Weinberger,M.Slaney,andR.Zwol.Resolvingtag ambiguity.In Proceedingsof ConferenceonMultimedia ,pages111Œ120,2008. [186]J.J.Whang,X.Sui,andI.S.Dhillon.Scalableandmemo ry-efcientclusteringoflarge- scalesocialnetworks.In ProceedingsoftheInternationalConferenceonDataMining , pages705Œ714,2012. [187]C.WilliamsandM.Seeger.UsingtheNystrommethodtos peedupkernelmachines.In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages682Œ688, 2001. [188]M.WuandB.Schölkopf.Alocallearningapproachforcl ustering.In Proceedingsofthe ConferenceonNeuralInformationProcessingSystems ,pages1529Œ1536,2006. [189]Z.Xiaojin.Semi-supervisedlearningliteraturesur vey.TechnicalReport1530,Department ofComputerScience,UniversityofWisconsin-Madison,200 5. [190]L.Xu,J.Neufeld,B.Larson,andD.Schuurmans.Maximu mmarginclustering.In Advances inNeuralInformationProcessingsystems ,pages1537Œ1544,2004. [191]D.Yan,L.Huang,andM.I.Jordan.Fastapproximatespe ctralclustering.In Proceedings oftheInternationalConferenceonKnowledgeDiscoveryand Datamining ,pages907Œ916, 2009. [192]H.Zha,X.He,C.Ding,M.Gu,andH.D.Simon.Spectralre laxationfork-meansclustering. 
In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages1057Œ 1064,2001. [193]D.Zhang,S.Chen,andK.Tan.Improvingtherobustness ofonlineagglomerativeclustering methodbasedonkernel-inducedistancemeasures. NeuralProcessingLetters ,21(1):45Œ51, 2005. [194]K.ZhangandJ.T.Kwok.ClusteredNystrommethodforla rgescalemanifoldlearningand dimensionreduction. IEEETransactionsonNeuralNetworks ,21(10):1576Œ1587,2010. [195]K.Zhang,I.W.Tsang,andJ.T.Kwok.ImprovedNystroml ow-rankapproximationand erroranalysis.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages 1232Œ1239,2008. 193 [196]R.ZhangandA.I.Rudnicky.Alargescaleclusteringsc hemeforkernelk-means. Pattern Recognition ,4:289Œ292,2002. [197]T.Zhang,R.Ramakrishnan,andM.Livny.BIRCH:Anefc ientdataclusteringmethodfor verylargedatabases. ACMSIGMODRecord ,25(2):103Œ114,1996. [198]Y.M.Zhang,K.Huang,G.Geng,andC.Liu.Fast k -nngraphconstructionwithlocality sensitivehashing. MachineLearningandKnowledgeDiscoveryinDatabases ,pages660Œ 674,2013. [199]W.Zhao,H.Ma,andQ.He.Parallelk-meansclusteringb asedonMapReduce. Cloud Computing ,pages674Œ679,2009. [200]J.Zhuang,J.Wang,S.C.H.Hoi,andX.Lan.Unsupervise dmultiplekernellearning. JournalofMachineLearningResearch ,20:129Œ144,2011. 194