KERNEL-BASED CLUSTERING OF BIG DATA

By

Radha Chitta

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science - Doctor of Philosophy

2015

ABSTRACT

KERNEL-BASED CLUSTERING OF BIG DATA

By

Radha Chitta

There has been a rapid increase in the volume of digital data over the recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10²¹ bytes) of digital data by the year 2020. Analysis of this massive amount of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in the data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data is linearly separable in the input space, and use measures such as the Euclidean distance to define the inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are not linearly separable. Kernel-based clustering algorithms employ non-linear similarity measures to define the inter-point similarities. As a result, they are able to identify clusters of arbitrary shapes and densities. However, kernel-based clustering techniques suffer from two major limitations:

(i) Their running time and memory complexity increase quadratically with the increase in the size of the data set. They cannot scale up to data sets containing billions of data points.

(ii) The performance of kernel-based clustering algorithms is highly sensitive to the choice of the kernel similarity function. Ad hoc approaches, relying on prior domain knowledge, are currently employed to choose the kernel function, and it is difficult to determine the appropriate kernel similarity function for the given data set.

In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques. They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices corresponding to big data sets can usually be well-approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm, which uses a low-rank approximate kernel matrix, constructed from a uniformly sampled small subset of the data, to perform clustering. We show that the proposed algorithm has linear running time complexity and low memory requirements, and also achieves high cluster quality, when provided with a sufficient number of data samples. We also demonstrate that the proposed algorithm can be easily parallelized to handle distributed data sets. We then employ non-linear random feature maps to approximate the kernel similarity function, and design clustering algorithms which enhance the efficiency of kernel-based clustering, as well as label assignment for previously unseen data points.

Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded stream data in real-time. It intelligently samples the data stream and finds the cluster labels using these sampled points. The proposed scheme is more effective than the current kernel-based and linear stream clustering techniques, both in terms of efficiency and cluster quality.

We finally address the issues of high dimensionality and scalability to data sets containing a large number of clusters. Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix. The combination of sampling and sparsity further reduces the running time and memory complexity.
The proposed clustering algorithms can be applied in a number of real-world applications. We demonstrate the efficacy of our algorithms using several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g., Tiny) containing up to 80 million images. The proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter, for hashtag recommendation.

To My Family

ACKNOWLEDGMENTS

"Life is a continuous learning process. Each day presents an opportunity for learning." - Lailah Gifty Akita, Think Great: Be Great

Every day during my PhD studies has been a great opportunity for learning, thanks to my advisors, colleagues, friends, and family. I am very grateful to my thesis advisor Prof. Anil K. Jain, who has been a wonderful mentor. His ability to identify good research problems has always been my inspiration. I am motivated by his energy, discipline, meticulousness and passion for research. He has taught me to plan and prioritize my work, and present it in a convincing manner.

I am also very thankful to Prof. Rong Jin, with whom I had the privilege of working closely. Under his guidance, I have learnt how to formalize a problem, and develop coherent solutions to the problem, using different machine learning tools. I am inspired by his extensive knowledge and hard-working nature.

I would like to thank my PhD committee members, Prof. Pang-Ning Tan, Prof. Shantanu Chakrabartty, and Prof. Selin Aviyente for their valuable comments and suggestions. Prof. Pang-Ning Tan was always available when I needed help, and provided very useful suggestions.

I am grateful to several other researchers who have mentored me at various stages of my research. I have had the privilege of working with Dr. Suvrit Sra and Dr. Francesco Dinuzzo, at the Max Planck Institute for Intelligent Systems, Germany. I would like to thank them for giving me an insight into several emerging problems in machine learning. I thank Dr. Ganesh Ramesh from Edmodo for providing me the opportunity to learn more about natural language processing, and building scalable solutions. Dr. Timothy Havens was very helpful when we were working together during the first year of my PhD.

I would like to thank my labmates and friends: Shalini, Soweon, Serhat, Zheyun, Jinfeng, Mehrdad, Kien, Alessandra, Abhishek, Brendan, Jung-Eun, Sunpreet, Inci, Scott, Lacey, Charles, and Keyur. They made my life at MSU very memorable. I would like to specially thank Serhat for all the helpful discussions, and Soweon for her support and encouragement. I am thankful to Linda Moore, Cathy Davison, Norma Teague, Katie Trinklein, Courtney Kosloski and Debbie Kruch for their administrative support. Many thanks to the CSE and HPCC administrators, specially Kelly Climer, Adam Pitcher, Dr. Dirk Colbry, and Dr. Benjamin Ong.

Last but not the least, I would like to thank my family. I am deeply indebted to my husband Praveen, without whose support and motivation, I would not have been able to pursue and complete my PhD. My parents, my sister and my parents-in-law have been very supportive throughout the past five years. I was inspired by my father Ramamurthy to pursue higher studies, and strive to make him proud. I would like to specially mention my mother Sudha Lakshmi, who has been my role model and inspiration. I can always count on her to encourage me and uplift my spirits.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

Chapter 1  Introduction
1.1 Data Analysis
1.1.1 Data Representation
1.1.2 Learning
1.1.3 Inference
1.2 Clustering
1.2.1 Clustering Algorithms
1.2.2 Challenges in Data Clustering
1.3 Clustering Big Data
1.3.1 Clustering with k-means
1.4 Kernel Based Clustering
1.4.1 Kernel k-means
1.4.2 Challenges
1.4.2.1 Scalability
1.4.2.2 Choice of kernel
1.5 Thesis Contributions
1.6 Datasets and Evaluation Metrics
1.6.1 Datasets
1.6.2 Evaluation Metrics
1.7 Thesis Overview

Chapter 2  Approximate Kernel-based Clustering
2.1 Introduction
2.2 Related Work
2.2.1 Low-rank Matrix Approximation
2.2.1.1 CUR matrix approximation
2.2.1.2 Nyström matrix approximation
2.2.2 Kernel-based Clustering for Large Datasets
2.3 Approximate Kernel k-means
2.3.1 Parameters
2.3.1.1 Sample size
2.3.1.2 Sampling strategies
2.3.2 Analysis
2.3.2.1 Computational complexity
2.3.2.2 Approximation error
2.3.3 Distributed Clustering
2.4 Experimental Results
2.4.1 Datasets
2.4.2 Baselines
2.4.3 Parameters
2.4.4 Results
2.4.4.1 Running time
2.4.4.2 Cluster quality
2.4.4.3 Parameter sensitivity
2.4.4.4 Sampling strategies
2.4.4.5 Scalability analysis
2.4.5 Distributed Approximate Kernel k-means
2.5 Summary

Chapter 3  Kernel-based Clustering Using Random Feature Maps
3.1 Introduction
3.2 Background
3.3 Kernel Clustering using Random Fourier Features
3.3.1 Analysis
3.3.1.1 Computational complexity
3.3.1.2 Approximation error
3.4 Kernel Clustering using Random Fourier Features in Constrained Eigenspace
3.4.1 Analysis
3.4.1.1 Computational complexity
3.4.1.2 Approximation error
3.4.2 Out-of-sample Clustering
3.5 Experimental Results
3.5.1 Datasets
3.5.2 Baselines
3.5.3 Parameters
3.5.4 Results
3.5.4.1 Running time
3.5.4.2 Cluster quality
3.5.4.3 Parameter sensitivity
3.5.4.4 Scalability
3.5.4.5 Out-of-sample clustering
3.6 Summary

Chapter 4  Stream Clustering
4.1 Introduction
4.2 Background
4.3 Approximate Kernel k-means for Streams
4.3.1 Sampling
4.3.2 Clustering
4.3.3 Label Assignment
4.4 Implementation and Complexity
4.5 Experimental Results
4.5.1 Datasets
4.5.2 Baselines
4.5.3 Parameters
4.5.4 Results
4.5.4.1 Clustering efficiency and quality
4.5.4.2 Parameter sensitivity
4.6 Applications: Twitter Stream Clustering
4.7 Summary

Chapter 5  Kernel-Based Clustering for Large Number of Clusters
5.1 Introduction
5.2 Background
5.3 Sparse Kernel k-means
5.4 Analysis
5.4.1 Computational Complexity
5.4.2 Approximation Error
5.5 Experimental Results
5.5.1 Datasets
5.5.2 Baselines and Parameters
5.5.3 Results
5.5.3.1 Running time
5.5.3.2 Cluster quality
5.5.3.3 Parameter sensitivity
5.5.3.4 Scalability
5.6 Summary

Chapter 6  Summary and Future Work
6.1 Contributions
6.2 Future Work

BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Notation.

Table 1.2: Clustering techniques for Big Data.

Table 1.3: Popular kernel functions.
Table 1.4: Comparison of the running times of k-means and kernel k-means on a 100-dimensional synthetic data set containing 10 clusters and exponentially increasing number of data points, on a 2.8 GHz processor with 40 GB memory.

Table 1.5: Description of data sets used for evaluation of the proposed algorithms.

Table 2.1: Comparison of the confusion matrices of the approximate kernel k-means, kernel k-means and k-means algorithms for the two-dimensional semi-circles data set, containing 500 points (250 points in each of the two clusters). The approximate kernel k-means algorithm achieves cluster quality comparable to that of the kernel k-means algorithm.

Table 2.2: Running time (in seconds) of the proposed approximate kernel k-means and the baseline algorithms. The sample size m is set to 2,000, for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. An approximate value of the running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Table 2.3: Effect of the sample size m on the running time (in seconds) of the proposed approximate kernel k-means clustering algorithm.

Table 2.4: Comparison of sampling times (in milliseconds) of the uniform, column-norm and k-means sampling strategies on the CIFAR-10 and MNIST data sets. Parameter m represents the sample size.

Table 2.5: Performance of the distributed approximate kernel k-means algorithm on the Tiny image data set and the concentric circles data set, with parameters m = 1,000 and P = 1024.

Table 3.1: Comparison of the confusion matrices of the RFF, kernel k-means, and k-means algorithms for the two-dimensional semi-circles data set, containing 500 points (250 points in each of the two clusters).

Table 3.2: Running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. The parameter m, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms, is set to m = 2,000. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. An approximate value of the running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Table 3.3: Effect of the number of Fourier components m on the running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. Parameter m represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms.

Table 3.4: Running time (in seconds) and prediction accuracy (in %) for out-of-sample data points. Parameter m represents the sample size for the approximate kernel k-means algorithm and the number of Fourier components for the SV clustering algorithm. The value of m is set to 1,000 for both the algorithms. It is not feasible to execute the WKPCA algorithm on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size.
Table 4.1: Major published approaches to stream clustering.

Table 4.2: Effect of the maximum buffer size M on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, τ = 1.

Table 4.3: Effect of the maximum buffer size M on the Silhouette coefficient of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, τ = 1.

Table 4.4: Effect of the maximum buffer size M on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, τ = 1.

Table 4.5: Effect of the cluster lifetime threshold exp(−τ) on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter settings: m = 5,000, M = 20,000.

Table 4.6: Effect of the cluster lifetime threshold exp(−τ) on the Silhouette coefficient of the proposed approximate stream kernel k-means algorithm. Parameters: m = 5,000, M = 20,000.

Table 4.7: Effect of the cluster lifetime threshold exp(−τ) on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameters: m = 5,000, M = 20,000.

Table 4.8: Comparison of the performance of the approximate stream kernel k-means algorithm with importance sampling and Bernoulli sampling.

Table 5.1: Complexity of popular partitional clustering algorithms: n and d represent the size and dimensionality of the data respectively, and C represents the number of clusters. Parameter m > C represents the size of the sampled subset for the sampling-based approximate clustering algorithms. n_sv ≥ C represents the number of support vectors. DBSCAN and Canopy algorithms are dependent on user-defined intra-cluster and inter-cluster distance thresholds, so their complexity is not directly dependent on C.

Table 5.2: Running time (in seconds) of the proposed sparse kernel k-means and the three baseline algorithms on the four data sets. The parameters of the proposed algorithm were set to m = 20,000, M = 50,000, and p = 1,000. The sample size m for the approximate kernel k-means algorithm was set equal to 20,000 for the CIFAR-100 data set and 10,000 for the remaining data sets. It is not feasible to execute kernel k-means on the Imagenet-164, Youtube and Tiny data sets due to their large size. The approximate running time of kernel k-means on these data sets is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.
Table 5.3: Silhouette coefficient (×10⁻²) of the proposed sparse kernel k-means and the three baseline algorithms on the CIFAR-100 data set. The parameters of the proposed algorithm were set to m = 20,000, M = 50,000, and p = 1,000. The sample size m for the approximate kernel k-means algorithm was set equal to m = 20,000.

Table 5.4: Comparison of the running time (in seconds) of the proposed sparse kernel k-means algorithm and the approximate kernel k-means algorithm on the CIFAR-100 and the Imagenet-164 data sets. Parameter m represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel k-means algorithm. The remaining parameters of the proposed algorithm are set to M = 50,000, and p = 1,000. Approximate kernel k-means is infeasible for the Imagenet-164 data set when m > 10,000 due to its large size.

Table 5.5: Comparison of the silhouette coefficient (×10⁻²) of the proposed sparse kernel k-means algorithm and the approximate kernel k-means algorithm on the CIFAR-100 data set. Parameter m represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel k-means algorithm. The remaining parameters of the proposed algorithm were set to M = 50,000, and p = 1,000.

Table 5.6: Effect of the size of the neighborhood p on the running time (in seconds), the silhouette coefficient and NMI (in %) of the proposed sparse kernel k-means algorithm on the CIFAR-100 and Imagenet-164 data sets. The remaining parameters of the proposed algorithm were set to m = 20,000, and M = 50,000.

LIST OF FIGURES

Figure 1.1: Emerging size of the digital world. Image from [2].

Figure 1.2: Growth of Targeted Display Advertising. Image from [59].

Figure 1.3: A two-dimensional example to demonstrate hierarchical and partitional clustering techniques. Figure (a) shows a set of points in two-dimensional space, containing three clusters. Hierarchical clustering generates a dendrogram for the data. Figure (b) shows a dendrogram generated using the complete-link agglomerative hierarchical clustering algorithm. The horizontal axis represents the data points and the vertical axis represents the distance between the clusters when they first merge. By applying a threshold on the distance at 4 units (shown by the black dotted line), we can obtain the three clusters. Partitional clustering directly finds the C clusters in the data set. Figure (c) shows the three clusters, represented by the blue, green and red points, obtained using the k-means algorithm. The starred points in black represent the cluster centers.

Figure 1.4: A two-dimensional example that demonstrates the limitations of k-means clustering. 500 two-dimensional points containing two semi-circular clusters are shown in Figure (a). Points numbered 1-250 belong to the first cluster and points numbered 251-500 belong to the second cluster. The clusters obtained using k-means (using the Euclidean distance measure) do not reflect the true underlying clusters (shown in Figure (b)), because the clusters are not linearly separable as expected by the k-means algorithm. On the other hand, the kernel k-means algorithm using the RBF kernel (with kernel width σ² = 0.4) reveals the true clusters (shown in Figure (c)). Figures (d) and (e) show the 500 × 500 similarity matrices corresponding to the Euclidean distance and the RBF kernel similarity, respectively. The RBF kernel similarity matrix contains distinct blocks which distinguish between the points from different clusters. The similarity between the points in the same true cluster is higher than the similarity between points in different clusters.
The Euclidean distance matrix, on the other hand, does not contain such distinct blocks, which explains the failure of the k-means algorithm on this data.

Figure 1.5: Similarity of images expressed through gray level histograms. The histogram of the intensity values of the image of a website (Figure (b)) is very different from the histograms of the images of butterflies (Figures (d) and (f)). The histograms of the two butterfly images are similar to each other.

Figure 1.6: Sensitivity of the kernel k-means algorithm to the choice of kernel function. The semi-circles data set (shown in Figure (a)) is clustered using kernel k-means with the RBF kernel. When the kernel width is set to 0.4, the two clusters are correctly detected (shown in Figure (b)), whereas when the kernel width is set to 0.1, the points are clustered incorrectly (shown in Figure (c)). Figure (d) shows the variation in the clustering error of kernel k-means, defined in (1.10), with respect to the kernel width.

Figure 1.7: Scalability of clustering algorithms in terms of n, d and C, and the contribution of the proposed algorithms in improving the scalability of kernel-based clustering. The plot shows the maximum size of the data set that can be clustered with less than 100 GB memory on a 2.8 GHz processor with a reasonable amount of clustering time (less than 10 hours). The linear clustering algorithms are represented in blue, current kernel-based clustering algorithms are shown in green, parallel clustering algorithms are shown in magenta, and the proposed clustering algorithms are represented in red. Existing kernel-based clustering algorithms can cluster only up to the order of 10,000 points with 100 features into 100 clusters. The proposed batch clustering algorithms (approximate kernel k-means, RFF clustering, and SV clustering algorithms) are capable of performing kernel-based clustering on data sets as large as 10 million, with the same resource constraints. The proposed online clustering algorithms (approximate stream kernel k-means and sparse kernel k-means algorithms) can cluster arbitrarily-sized data sets with dimensionality in the order of 1,000 and the number of clusters in the order of 10,000.

Figure 2.1: Illustration of the approximate kernel k-means algorithm on the two-dimensional semi-circles data set containing 500 points (250 points in each of the two clusters). Figure (a) shows all the data points (in red) and the uniformly sampled points (in blue). Figures (b)-(e) show the process of discovery of the two clusters in the data set and their centers in the input space (represented by x) by the approximate kernel k-means algorithm.

Figure 2.2: Example images from three clusters in the Imagenet-34 data set. The clusters represent (a) butterfly, (b) odometer, and (c) website images.

Figure 2.3: Silhouette coefficient values of the partitions obtained using approximate kernel k-means, compared to those of the partitions obtained using the baseline algorithms. The sample size m is set to 2,000, for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm.

Figure 2.4: NMI values (in %) of the partitions obtained using approximate kernel k-means, with respect to the true class labels. The sample size m is set to 2,000, for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.
Figure 2.5: Example images from the clusters found in the CIFAR-10 data set using approximate kernel k-means. The clusters represent the following objects: (a) airplane, (b) automobile, (c) bird, (d) cat, (e) deer, (f) dog, (g) frog, (h) horse, (i) ship, and (j) truck.

Figure 2.6: Effect of the sample size m on the NMI values (in %) of the partitions obtained using approximate kernel k-means, with respect to the true class labels.

Figure 2.7: Effect of the sample size m on the Silhouette coefficient values of the partitions obtained using approximate kernel k-means.

Figure 2.8: Comparison of Silhouette coefficient values of the partitions obtained from approximate kernel k-means using the uniform, column-norm and k-means sampling strategies, on the CIFAR-10 and MNIST data sets. Parameter m represents the sample size.

Figure 2.9: Comparison of NMI values (in %) of the partitions obtained from approximate kernel k-means using the uniform, column-norm and k-means sampling strategies, on the CIFAR-10 and MNIST data sets. Parameter m represents the sample size.

Figure 2.10: Running time of the approximate kernel k-means algorithm for different values of (a) n, (b) d and (c) C.

Figure 3.1: A simple example to illustrate the RFF clustering algorithm. (a) Two-dimensional data set with 500 points from two clusters (250 points in each cluster), (b) Plot of the matrix H obtained by sampling m = 1 Fourier component. (c) Clusters obtained by executing k-means on H.

Figure 3.2: Silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms. The parameter m, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms, is set to m = 2,000.

Figure 3.3: NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, with respect to the true class labels. The parameter m, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms, is set to m = 2,000. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 3.4: Effect of the number of Fourier components m on the silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms. Parameter m represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms.
Figure 3.5: Effect of the number of Fourier components m on the NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, on the six benchmark data sets. Parameter m represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nyström approximation based spectral clustering algorithms.

Figure 3.6: Running time of the RFF clustering algorithm for different values of (a) n, (b) d and (c) C.

Figure 3.7: Running time of the SV clustering algorithm for different values of (a) n, (b) d and (c) C.

Figure 4.1: Schema of the proposed approximate stream kernel k-means algorithm.

Figure 4.2: Illustration of importance sampling on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles (100 points in each cluster), represented by "o" in Figure (a). Figure (b) shows 50 points sampled using importance sampling, and Figures (c) and (d) show 50 and 100 points selected using Bernoulli sampling, respectively. The sampled points are represented using "*". All the 10 clusters are well-represented by just 50 points sampled using importance sampling. On the other hand, 50 points sampled using Bernoulli sampling are not adequate to represent these 10 clusters (Cluster 4 in red has no representatives). At least 100 points are needed to represent all the clusters.

Figure 4.3: Running time (in milliseconds) of the stream clustering algorithms. The parameters for the proposed approximate stream kernel k-means algorithm are set to m = 5,000, M = 20,000, and τ = 1. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.4: Silhouette coefficient values of the partitions obtained using the proposed approximate stream kernel k-means algorithm. The parameters for the proposed algorithm were set to m = 5,000, M = 20,000, and τ = 1. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm were set to 5,000.

Figure 4.5: NMI (in %) of the clustering algorithms with respect to the true class labels.
The parameters for the proposed approximate stream kernel k-means algorithm are set to m = 5,000, M = 20,000, and τ = 1. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.6: Change in the NMI (in %) of the proposed approximate stream kernel k-means algorithm over time. The parameters m, M and τ were set to m = 5,000, M = 20,000 and τ = 1, respectively.

Figure 4.7: Effect of the initial sample size m on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter m represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters M and τ are set to M = 20,000 and τ = 1, respectively.

Figure 4.8: Effect of the initial sample size m on the silhouette coefficient values of the proposed approximate stream kernel k-means algorithm. Parameter m represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters M and τ are set to M = 20,000 and τ = 1, respectively.

Figure 4.9: Effect of the initial sample size m on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameter m represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters M and τ are set to M = 20,000 and τ = 1, respectively.

Figure 4.10: Sample tweets from the ASP.NET cluster.

Figure 4.11: Sample tweets from the HTML cluster.

Figure 4.12: Trending clusters in Twitter. The horizontal axis represents the timeline in days and the vertical axis represents the percentage ratio of the number of tweets in the cluster to the total number of tweets obtained on the day. Figure (a) shows the trends obtained by the proposed approximate stream kernel k-means algorithm, and Figure (b) shows the true trends.

Figure 5.1: Illustration of kernel sparsity on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles. Figure (a) shows all the data points (represented by "o") and Figure (b) shows the RBF kernel matrix corresponding to this data. Neighboring points have the same cluster label when the kernel is defined correctly for the data set.

Figure 5.2: Sample images from three of the 100 clusters in the CIFAR-100 data set obtained using the proposed algorithm.

Figure 5.3: NMI (in %) of the proposed sparse kernel k-means and the three baseline algorithms on the CIFAR-100 and Imagenet-164 data sets. The parameters of the proposed algorithm were set to m = 20,000, M = 50,000, and p = 1,000. The sample size m for the approximate kernel k-means algorithm was set equal to 20,000 for the CIFAR-100 data set and 10,000 for the Imagenet-164 data set. It is not feasible to execute kernel k-means on the Imagenet-164 data set, due to its large size. The approximate NMI value achieved by kernel k-means on the Imagenet-164 data set is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.
Figure 5.4: Comparison of the NMI (in %) of the proposed sparse kernel k-means algorithm and the approximate kernel k-means algorithm on the CIFAR-100 and the Imagenet-164 data sets. Parameter m represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel k-means algorithm. The remaining parameters of the proposed algorithm were set to M = 50,000, and p = 1,000. Approximate kernel k-means is infeasible for the Imagenet-164 data set when m > 10,000 due to its large size.

Figure 5.5: Effect of the number of clusters C on the running time (in seconds) of the proposed sparse kernel k-means algorithm.

Figure 5.6: Effect of the number of clusters C on the NMI (in %) of the proposed sparse kernel k-means algorithm.

Figure 5.7: Running time of the sparse kernel k-means clustering algorithm for different values of (a) n, (b) d and (c) C.

LIST OF ALGORITHMS

Algorithm 1: k-means
Algorithm 2: Kernel k-means
Algorithm 3: Approximate Kernel k-means
Algorithm 4: Distributed Approximate Kernel k-means
Algorithm 5: Meta-Clustering Algorithm
Algorithm 6: RFF Clustering
Algorithm 7: SV Clustering
Algorithm 8: Approximate Stream Kernel k-means
Algorithm 9: Sparse Kernel k-means
Algorithm 10: Approximate k-means

Chapter 1

Introduction

Over the past couple of decades, great advancements have been made in data generation, collection and storage technologies. This has resulted in a digital data explosion. Data is uploaded every day by billions of users to the web in the form of text, image, audio and video, through various media such as blogs, e-mails, social networks, photo and video hosting services. It is estimated that 204 million e-mail messages are exchanged every minute¹; over a billion users on Facebook share 4.75 billion pieces of content every half hour, including 350 million photos and 4 million videos²; and 300 hours of videos are uploaded to YouTube every minute³. In addition, a large amount of data about the web users and their web activity is collected by a host of companies like Google, Microsoft, Facebook and Twitter. This data is now popularly termed as Big Data [105].

Big data is formally defined as "high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization". It is characterized by the 3 V's: Volume, Velocity, and Variety. Volume indicates the scale of the data. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10²¹ bytes) of digital data by the year 2020 (see Figure 1.1) [2].

¹ http://mashable.com/2014/04/23/data-online-every-minute
² http://www.digitaltrends.com/social-media/according-to-facebook-there-are-350-million-photos-uploaded-on-the-social-network-daily-and-thats-just-crazy
³ https://www.youtube.com/yt/press/statistics.html

Figure 1.1: Emerging size of the digital world. Image from [2].
This boils down to about 2.3 zettabytes of data generated every day. Velocity relates to real-time processing of streaming data in applications like computer networks and stock exchanges. The New York Stock Exchange captures about 1 TB of trade information during each trading session. Real-time processing of this data can aid a trader in making important trade decisions. Variety pertains to the heterogeneity of the digital data. Both structured data such as census records and legal records, and unstructured data like text, images and videos from the web form part of big data. Specialized techniques may be needed to handle different formats of the data. Other attributes such as reliability, volatility and usefulness of the data have been added to the definition of big data over the years. Virtually every large business is interested in gathering large amounts of data from its customers and mining it to extract useful information in a timely manner. This information helps the business provide better service to its customers and increase its profitability.

About 23% of this humongous amount of digital data is believed to contain useful information that can be leveraged by companies, government agencies and individual users⁴. For instance, a partial "blueprint" of every user on the web can be created by combining the information from their Facebook/Google profiles, status updates, Twitter tweets, metadata of their photo and video uploads, web page visits, and all sorts of other minute data. This gives an insight into the interests and needs of the users, thereby allowing companies to target a select group of users for their products. Users prefer online advertisements that match their interests over random advertisements. Figure 1.2 shows the tremendous growth that has been achieved in targeted advertising over the years, as a consequence of using data analytics⁵ to understand the behavior of web users [59].

⁴ http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
⁵ Data analytics is the science of examining data with the purpose of inferring useful information, and making decisions and predictions based on the inferences. It encompasses a myriad of methodologies and tools to perform automated analysis of data [1].

Figure 1.2: Growth of Targeted Display Advertising. Image from [59].

Big Data analytics has also led to the development of new applications and services like Microsoft's HealthVault⁶, a platform that enables patients to compile personal health information from multiple sources into a single online repository, and coordinate their health management with other users. Applications such as Google Flu Trends⁷ and Dengue Trends⁸ predicted disease outbreaks well before the official CDC (US Centers for Disease Control and Prevention) and EISS (European Influenza Surveillance Scheme) reports were published, based on aggregated search activity, reducing the number of people affected by the disease [71].

⁶ https://www.healthvault.com/us/en/overview
⁷ http://www.google.org/flutrends
⁸ http://www.google.org/denguetrends

1.1 Data Analysis

Data analysis is generally divided into exploratory and confirmatory data analysis [174]. The purpose of exploratory analysis is to discover patterns and model the data. Exploratory data analysis is usually followed by a phase of confirmatory data analysis which aims at model validation. Several statistical methods have been proposed to perform data analysis. Statistical pattern recognition and machine learning is concerned with predictive analysis, which involves discovering relationships between objects and predicting future events, based on the knowledge obtained. Pattern recognition comprises three phases: data representation, learning and inference.

1.1.1 Data Representation

Data representation involves selecting a set of features to denote the objects in the data set. A d-dimensional vector x = (x_1, ..., x_d)^T denotes each object, where x_p, p ∈ [d], represents a feature. The features may be numerical, categorical or ordinal. For instance, a document may be represented using the words in the document, in which case each x_p denotes a word in the document. An image may be represented using the pixel intensity values; in this case, x_p is the numerical intensity value at the p-th pixel. The representation employed dictates the kind of analysis that can be performed on the data set, and the interpretation of the results of the analysis. Therefore, it is important to select the correct representation. In most applications, prior domain knowledge is useful in selecting the object representation. Recently, deep learning techniques have been employed to automatically learn the representation for objects [20].
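As a small illustration of the representation choices described above, the sketch below (a minimal example in plain Python; the two toy documents and the bag-of-words scheme are invented here for illustration, not taken from the thesis) maps each document to a d-dimensional count vector x = (x_1, ..., x_d)^T over a fixed vocabulary.

```python
from collections import Counter

# Toy corpus; each document is represented as a d-dimensional
# count vector over a fixed vocabulary (bag-of-words).
docs = ["big data needs scalable clustering",
        "kernel clustering finds non-linear clusters"]

# Build the vocabulary: one feature dimension per distinct word.
vocab = sorted({w for doc in docs for w in doc.split()})

def to_vector(doc):
    """Map a document to x = (x_1, ..., x_d), where x_p counts
    how often vocabulary word p occurs in the document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

for doc in docs:
    print(to_vector(doc))
```

An image could be vectorized analogously by flattening its pixel intensities; in either case the chosen features determine which similarities the later analysis can express.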
1.1.1DataRepresentation Datarepresentationinvolvesselectingasetoffeaturesto denotetheobjectsinthedataset.A d -dimensionalvector x =( x 1 ;:::;x d ) > denoteseachobject,where x p ;p 2 [d ]representsa feature.Thefeaturesmaybenumerical,categoricalorordi nal.Forinstance,adocumentmay berepresentedusingthewordsinthedocument;inwhichcase each x p denotesawordinthe document.Animagemayberepresentedusingthepixelintens ityvalues.Inthiscase, x p is thenumericalintensityvalueatthe p th pixel.Therepresentationemployeddictatesthekindof analysisthecanbeperformedonthedataset,andtheinterpr etationoftheresultsofanalysis. Therefore,itisimportanttoselectthecorrectrepresenta tion.Inmostapplications,priordomain knowledgeisusefulinselectingtheobjectrepresentation .Recently,deeplearningtechniqueshave beenemployedtoautomaticallylearntherepresentationfo robjects[20]. 4 1.1.2Learning Afterasuitablerepresentationischosen,thedataisinput toalearningalgorithmwhichtsamodel tothedata. Thesimplestlearningtaskisthatof supervisedlearning ,alsotermedasclassication[97]. Thegoalofsupervisedlearningistoderiveafunctionthatm apsthesetofinputobjectstoasetof targets(classes),using labeled trainingdata.Forinstance,givenasetoftaggedimages,th elearner analyzestheimagesandlearnsafunctionmappingtheimages totheirtags.Supervisedlearning ndsuseinmanyapplicationssuchasobjectrecognition,sp amdetection,intrusiondetection,and machinetranslation. Unfortunately,onlyabout 3% ofthepotentiallyusefuldataonthewebislabeled(e.g.tag sfor objectsinimages),anditisextremelyexpensivetoobtaint helabelsforthemassiveamountofdata, makingsupervisedlearningdifcultinmostbigdataapplic ations[2].Oflate,crowdsourcingtools suchasAmazonMechanicalTurk 9 havebeenusedtoobtainlabelsforthedataitems,frommulti ple usersovertheweb[29].However,labelsobtainedthroughsu chapproachescanbeunreliableand ambiguous.Forexample,inthetaskofimagetaggingthrough crowdsourcing,oneusermaytag theimageofapoodlewiththelabelfidogfl,whereasanotherus ermaylabelitasfianimalfl(i.e. usageofhypernymsversushyponyms).Thesametagfijaguarflc ouldapplytoboththecaraswell astheanimal(polysemy).Spammerscanintentionallygener atewronglabelsleadingtonoisein thedata.Additionaleffortsareneededtohandletheseissu es[138,185]. Semi-supervisedlearning techniquesalleviatetheneedforlabelinglargedatasetsb yutiliz- ingalargepoolofunlabeledobjectsinconjunctionwithare lativelysmallsetoflabeledobjectsto learnaclassier[189].Ithasbeenfoundthattheclassier slearntthroughsemi-supervisedlearn- ingmethodscanbemoreaccuratethanthoselearntusinglabe leddataalone,becausetheunlabeled dataallowsthelearnertoexploretheunderlyingstructure ofthedata.Thoughsemi-supervised learningmethodsmitigatethelabelingproblemassociated withsupervisedlearningmethodsto 9 https://www.mturk.com/mturk 5 someextent,theyarestillsusceptibletosameissuesasthe supervisedlearningtechniques.More- over,itisexpensivetoobtainsupervisioninapplications suchasstockmarketanalysis,wherehigh levelofexpertiseisrequiredtoidentifythestocktrends[ 130]. Unsupervisedlearning tasksinvolvendingthehiddenstructureindata.Unlikesu pervised andsemi-supervisedlearning,thesetasksdonotrequireth edatatobelabeled,therebyavoiding thecostoftaggingthedataandallowingonetoleveragethea bundantdatacorpus.Examplesof unsupervisedlearningtasksincludedensityestimation,d imensionalityreduction,featureselection andextraction,andclustering[83]. 
Clustering,alsoknownasunsupervisedclassication,iso neoftheprimaryapproachesto unsupervisedlearning.Thepurposeofclusteringistodisc overthenaturalgroupingoftheinput objects.Oneofthegoalsofclusteringistosummarizeandco mpressthedata,leadingtoefcient organizationandconvenientaccessofthedata.Itisoftene mployedasaprecursortoclassication. Thedataisrstcompressedusingclustering,andasupervis edlearningmodelisbuiltusingonly thecompresseddata.Forinstance,intheimagetaggingprob lem,ifthelearnerwasonlyprovided withalargenumberofuntaggedimages,theimagescanbegrou pedintoclustersbasedonapre- denedsimilarity.Eachclustercanberepresentedbyasmal lsetofprototypeimages,andthe labelsfortheserepresentativeimagesobtainedthroughcr owdsourcing,whichcanthenbeusedto learnataggingfunctioninasupervisedmanner.Thisproces sischeaperandmorereliablethan obtainingthelabelsforalltheimages.Clusteringndsuse inamultitudeofapplicationssuch aswebsearch,socialnetworkanalysis,imageretrieval,ge neexpressionanalysis,marketanalysis andrecommendationsystems[90]. 1.1.3Inference Inthisphase,thelearntmodelisusedfordecisionmakingan dprediction,asrequiredbytheap- plication.Forexample,intheimagetaggingproblem,themo delcomprisingthemappingfunction canbeusedtopredictthetagscorrespondingtoanimagethel earnerhasnotseenpreviously.In 6 Table1.1Notation. Symbol Description D = f x 1 ;:::; x n g Inputdatasettobeclustered x i i th datapoint ˜ Inputspace H Featurespace/ReproducingKernelHilbertSpace(RKHS) kk H FunctionalnorminRKHS d Dimensionalityoftheinputspace n Numberofpointsinthedataset C Numberofclusters U =( u 1 ;:::; u C ) > Clustermembershipmatrix ( C n ) P = f U 2f 0 ;1 g C n :U > 1 = 1 g Setofvalidclustermembershipmatrices C k k th cluster c k k th clustercenter n k Numberofpointsinthe k th cluster ' Mappingfunctionfrom ˜ to H ( ; ) Kernelfunction K Kernelmatrix ( n n ) socialnetworks,clusteringisemployedtogroupusersbase dontheirgender,occupation,web activity,andotherattributes,toautomaticallynduserc ommunities[128].Basedonthecommu- nitiesidentied,recommendationsfornewconnectionsand contentcanbemadetotheusers. Inthisthesis,wefocusontheclusteringproblem.Notation susedthroughoutthisthesisare summarizedinTable1.1. 1.2Clustering Clustering,oneoftheprimaryapproachestounsupervisedl earning,isthetaskofgroupingaset ofobjectsintoclustersbasedonsomeuser-denedsimilari ty.Givenasetof n objectsrepresented by D = f x 1 ;:::; x n g ,whereeachpoint x i 2 ˜ and ˜ < d ,theobjectiveofclustering,inmost applications,istogroupthepointsinto C clusters,representedby fC 1 ;:::; C C g ,suchthatthe clustersreectthenaturalgroupingoftheobjects.Thede nitionofnaturalgroupingissubjective, 7 (a) 1 5231826 71322 412 215 616 31110 9 81417192520212430272928123456(b) (c) Figure1.3Atwo-dimensionalexampletodemonstratehierar chicalandpartitionalclusteringtech- niques.Figure(a)showsasetofpointsintwo-dimensionals pace,containingthreeclusters.Hier- archicalclusteringgeneratesadendrogramforthedata.Fi gure(b)showsadendrogramgenerated usingthecomplete-linkagglomerativehierarchicalclust eringalgorithm.Thehorizontalaxisrep- resentsthedatapointsandtheverticalaxisrepresentsthe distancebetweentheclusterswhenthey rstmerge.Byapplyingathresholdonthedistanceat 4 units(shownbytheblackdottedline), wecanobtainthethreeclusters.Partitionalclusteringdi rectlyndsthe C clustersinthedataset. Figure(c)showsthethreeclusters,representedbytheblue ,greenandredpoints,obtainedusing the k -meansalgorithm.Thestarredpointsinblackrepresentthe clustercenters. 
anddependentonanumberoffactorsincludingtheobjectsin thedataset,theirrepresentation, andthegoalofclusteranalysis.Themostcommonobjectivei stogroupthepointssuchthatthe similaritybetweenthepointswithinthesameclusterisgre aterthanthesimilaritybetweenthe pointsindifferentclusters.Thestructureoftheclusters obtainedisdeterminedbythedenitionof thesimilarity.Itisusuallydenedintermsofadistancefu nction d :˜ ˜ !< . 1.2.1ClusteringAlgorithms Historically,twotypeofclusteringalgorithmshavebeend eveloped:hierarchicalandparti- tional[88]. Hierarchicalclusteringalgorithms,asthenamesuggests, buildahierarchyofclusters;the rootofthetreecontainsallthe n pointsinthedataset,andtheleavescontaintheindividual points.Agglomerativehierarchicalclusteringalgorithm sstartwith n clusters,eachwithone 8 point,andrecursivelymergetheclusterswhicharemostsim ilartoeachother.Divisive hierarchicalclusteringalgorithms,ontheotherhand,sta rtwiththerootcontainingallthe datapoints,andrecursivelysplitthedataintoclustersin atop-downmanner.Themost well-knownhierarchicalclusteringalgorithmsarethesin gle-link,complete-linkandWard's algorithms[88].Thesingle-linkalgorithmdenesthesimi laritybetweentwoclustersasthe similaritybetweentheirmostsimilarmembers,whereasthe complete-linkalgorithmdenes thesimilaritybetweentwoclustersasthesimilarityofthe irmostdissimilarmembers.The Ward'sclusteringalgorithmrecursivelymergesthecluste rsthatleadstotheleastpossible increaseintheintra-clustervarianceaftermerging.Figu re1.3(b)showsthecomplete-link dendrogramcorrespondingtotheclustersinthetwo-dimens ionaldatasetinFigure1.3(a). Partitionalclusteringalgorithms,directlypartitionth edatainto C clusters,asshowninFig- ure1.3(c).Popularpartitionalclusteringalgorithmsinc ludecentroid-based( k -means, k - medoids)[87,94],model-based(Mixturemodels,LatentDir ichletAllocation)[24],graph- theoretic(MinimumSpanningTrees,Normalized-cut,Spect ralclustering)[77,161],and densityandgrid-based(DBSCAN,OPTICS,CLIQUE)algorithm s[61]. Fromastatisticalviewpoint,clusteringtechniquescanal sobecategorizedasparametricand non-parametric[127].Parametricapproachestoclusterin gassumethatthedataisdrawnfrom adensity p ( x ) whichisamixtureofparametricdensities,andthegoalofcl usteringistoiden- tifythecomponentdensities.Thecentroid-basedandmodel -basedclusteringalgorithmsfallin thiscategory.Non-parametricapproachesarebasedonthep remisethattheclustersrepresentthe modesofthedensity p ( x ) ,andtheaimofclusteringistodetectthehigh-densityregi onsinthe data.Themodalstructureof p ( x ) canbesummarizedina clustertree .Eachlevelinthecluster treerepresentsthefeaturespace L ( ;p )= f x j p ( x ) > g .Clustertreescanbeconstructedusing thesingle-linkclusteringalgorithmtobuildneighborhoo dgraphs,andndingtheconnectedcom- ponentsintheneighborhoodgraphs.Density-basedpartiti onalclusteringalgorithmssuchasDB- 9 SCANandOPTICS,arespecializednon-parametricclusterin gtechniques,whichndthemodes ataxeduser-deneddensitythreshold.Mean-shiftcluste ringalgorithmsestimatethedensity locallyateach x ,andndthemodesusingagradientascentprocedureonthelo caldensity. 1.2.2ChallengesinDataClustering Dataclusteringisadifcultproblem,asreectedbythehun dredsofclusteringalgorithmsthat havebeenpublished,andthenewonesthatcontinuetoappear .Duetotheinherentunsupervised natureofclustering,thereareseveralfactorsthataffect theclusteringprocess. Datarepresentation. 
Thedatacanbeinputtoclusteringalgorithmsintwoforms:( i)the n d patternmatrix containingthe d featurevaluesforeachofthe n objects,and(ii) the n n proximitymatrix ,whoseentriesrepresentthesimilarity/dissimilaritybe tweenthe correspondingobjects.Givenasuitablesimilaritymeasur e,itiseasytoconvertapattern matrixtotheproximitymatrix.Similarly,methodslikesin gularvaluedecompositionand multi-dimensionalscalingcanbeenusedtoapproximatethe patternmatrixcorrespondingto thegivenproximitymatrix[47].Conventionally,hierarch icalclusteringalgorithmsassume inputintheformoftheproximitymatrix,whereaspartition alclusteringalgorithmsaccept thepatternmatrixasinput. Thefeaturesusedtorepresentthedatainthepatternmatrix playanimportantroleinclus- tering.Iftherepresentationisgood,theclusteringalgor ithmwillbeabletondcompact clustersinthedata.Dimensionalityofthedatasetisalsoc rucialtothequalityofclustersob- tained.High-dimensionalrepresentationswithredundant andnoisyfeaturesnotonlyleadto longclusteringtimes,butmayalsodeterioratethecluster structureinthedata.Featureselec- tionandextractiontechniquessuchasforward/backwardse lectionandprincipalcomponent analysisareusedtodeterminethemostdiscriminativefeat ures,andreducethedimensional- ityofthedataset[89].Deeplearningtechniques[20]andke rnellearningtechniques[112] 10 canbeemployedtolearnthedatarepresentationfromthegiv endataset. Numberofclusters. Mostclusteringalgorithmsrequirethespecicationofthe numberof clusters C .Whilecentroid-based,model-basedandgraph-theoretica lgorithmsdirectlyac- ceptthenumberofclustersasinput,densityandgrid-based algorithmsacceptotherparam- eterssuchasthemaximuminter-clusterdistance,whichare indirectlyrelatedtothenumber ofclusters.Automaticallydeterminingthenumberofclust ersisadifcultproblemand, inpractice,domainknowledgeisusedtodeterminethispara meter.Severalheuristicshave beenproposedtoestimatethenumberofclusters.In[172],t henumberofclustersisdeter- minedbyminimizingthefigapflbetweentheclusteringerror 10 foreachvalueof C ,andthe expectedclusteringerrorofareferencedistribution.Cro ss-validationtechniquescanbeused tondthevalueof C atwhichtheerrorcurvecorrespondingtothevalidationdat aexhibits asharpchange[68]. ClusteringAlgorithm. Theobjectiveofclusteringdictatesthealgorithmchosenf orclus- tering,andinturn,thequalityandthestructureoftheclus tersobtained.Centroid-based clusteringalgorithmssuchas k -meansaimatminimizingthesumofthedistancesbetween thepointsandtheirrepresentativecentroids.Thisobject iveissuitableforapplicationswhere theclustersarecompactandhyper-sphericalorhyper-elli psoidal.Densitybasedalgorithms aimatndingthedenseregionsinthedata.Thesingle-linkh ierarchicalclusteringalgorithm ndslongelongatedclusterscalledfichainsfl,asthecriter ionformergingclustersislocal, whereasthecomplete-linkhierarchicalclusteringalgori thmndslargecompactclusters. Eachclusteringalgorithmisassociatedwithadifferentsi milaritymeasure. Similaritymeasures. Thesimilaritymeasureemployedbytheclusteringalgorith miscrucial tothestructureoftheclustersobtained.Thechoiceofthes imilarityfunctiondependsonthe datarepresentationscheme,andtheobjectiveofclusterin g.Apopulardistancefunctionis 10 RefertoSection1.3.1forthedenitionofclusteringerror . 11 thesquaredEuclideandistancedenedby d 2 ( x a ;x b )= jj x a x b jj 2 2 ;(1.1) where x a ;x b 2D .However,theEuclideandistanceisnotsuitableforallapp lications. 
OtherdistancemeasuressuchasMahalanobis,Minkowskiand non-lineardistancemeasures havebeenappliedintheliteraturetoimprovetheclusterin gperformanceinmanyapplica- tions[171](SeeSection1.4). ClusteringTendency,QualityandStability. Mostclusteringalgorithmswillndclustersin thegivendataset,evenifthedatadoesnotcontainanynatur alclusters.Thestudyofclus- teringtendencydealswithexaminingthedatabeforeexecut ingtheclusteringalgorithm,to determineifthedatacontainsanyclusters.Clusteringten dencyisusuallyassessedthrough visualassessmenttechniqueswhichreorderthesimilarity matrixtoexaminewhetherornot thedatacontainsclusters[85].Thesetechniquescanalsob eusedtodeterminethenumber ofclustersinthedataset. Afterobtainingtheclusters,weneedtoevaluatethevalidi tyandqualityoftheclusters. Severalmeasureshavebeenidentiedtoevaluatethecluste rsobtained,andthechoiceofthe qualitycriteriondependsontheapplication.Clustervali ditymeasuresarebroadlyclassied aseitherinternalorexternalmeasures[88].Internalmeas uressuchasthevalueofthe clusteringalgorithm'sobjectivefunctionandtheinter-c lusterdistancesassessthesimilarity betweentheclusterstructureandthedata.Asclusteringis anunsupervisedtask,itislogical toemployinternalmeasurestoevaluatethepartitions.How ever,thesemeasuresaredifcult tointerpretandoftenvaryfromoneclusteringalgorithmto another.Ontheotherhand, externalmeasuressuchaspredictionaccuracyandclusterp urityusepriorinformationlike thetrueclasslabelstoassesstheclusterquality.Externa lmeasuresaremorepopularlyused toevaluateandcomparetheclusteringresultsofdifferent clusteringalgorithms,astheyare 12 easiertointerpretthaninternalvaliditymeasures. Clusterstabilitymeasuresthesensitivityoftheclusters tosmallperturbationsinthedata set[119].Itisdependentonboththedatasetandthealgorit hmusedtoperformclustering. Clusteringalgorithmswhichgeneratestableclustersarep referredastheywillberobustto noiseandoutliersinthedata.Stabilityistypicallymeasu redusingdataresamplingtech- niquessuchasbootstrapping.Multipledatasetsofthesame size,generatedfromthesame probabilitydistribution,areclusteredusingthesamealg orithmandthesimilaritybetween thepartitionsofthesedatasetsisusedasameasureoftheal gorithm'sstability. Scalability. Inadditiontotheclusterquality,thechoiceofthecluster ingalgorithmisalso determinedbythescalabilityofthealgorithm.Thisfactor becomesallthemorecrucial whendesigningsystemsforbigdataanalysis.Twoimportant factorsthatdeterminethescal- abilityofaclusteringalgorithmareitsrunningtimecompl exityanditsmemoryfootprint. Clusteringalgorithmswhichhavelinearorsub-linearrunn ingtimecomplexity,andrequire minimumamountofmemoryaredesirable. 1.3ClusteringBigData Whenthesizeofthedataset n isintheorderofbillionsandthedimensionalityofthedata d is intheorderofthousands,asisthecaseinmanybigdataanaly ticsproblems,thescalabilityof thealgorithmbecomesanimportantfactorwhilechoosingac lusteringalgorithm.Hierarchical clusteringalgorithmsareassociatedwithatleast O ( n 2 d + n 2 log( n )) runningtimeand O ( n 2 ) memorycomplexity,whichrenderstheminfeasibleforlarge datasets.Thesameholdsformany ofthepartitionalclusteringalgorithmssuchasthemodelb asedalgorithmslikeLatentDirichlet Allocation,graph-basedalgorithmssuchasspectralclust eringanddensity-basedalgorithmslike DBSCAN.Theyhaverunningtimecomplexitiesrangingfrom O ( n log( n )) to O ( n 3 ) intermsofthe numberofpointsinthedata,andatleastlineartimecomplex itywithrespecttothedimensionality 13 Table1.2ClusteringtechniquesforBigData. 
Table 1.2 Clustering techniques for Big Data.

  Clustering approach                          | Running time complexity            | Memory complexity
  Linear clustering: k-means                   | O(nCd)                             | O(nd)
  Sampling-based clustering (sample size m << n):
      CLARA [94]                               | O(Cm^2 + C(n - C))                 | O(n^2)
      CURE [80]                                | O(m^2 log(m))                      | O(md)
      Coreset [82]                             | O(n + C polylog(n))                | O(nd)
  Compression:
      BIRCH [197]                              | O(nd)                              | M†
      CLARANS [136]                            | O(n^2)                             | O(n^2)
  Stream clustering:
      Stream [79], ClusTree [98]               | O(nCd)                             | M†
      Scalable k-means [30],
      Single-pass k-means [62]                 | O(nd)                              | M†
      StreamKM++ [6]                           | O(dns)*                            | O(ds log(n/s))*
  Distributed clustering (with P tasks):
      Parallel k-means [60, 199]               | O(nCd)                             | O(PC^2 n^δ), δ > 0
      MapReduce-based spectral clustering [35] | O(n^2 d / P + r^3 + nr + nC^2)**   | O(n^2 / P)
      Nearest-neighbor clustering [115]        | O(n log(n) / P)                    | O(n / P)

  * s = O(dC log(n) log^{d/2}(C log(n)))
  ** r represents the rank of the affinity matrix
  † M is a user-defined parameter representing the amount of memory available

Several clustering algorithms have been modified, and special algorithms have been developed in the literature, to scale up to large data sets. Most of these algorithms involve a preprocessing phase to compress or distribute the data before clustering is performed. Some of the popular methods to efficiently cluster large data sets (listed in Table 1.2) can be classified based on their preprocessing approach, as follows:

Sampling-based methods reduce the computation time by first choosing a subset of the given data set and then using this subset to find the clusters. The key idea behind all sampling-based clustering techniques is to obtain the cluster representatives using only the sampled subset, and then assign the remaining data points to the closest representative. The success of these techniques depends on the premise that the selected subset is an unbiased sample that is representative of the entire data set. This subset is chosen either randomly (CLARA [94], CURE [80]) or through an intelligent sampling scheme such as coreset sampling [82, 183]. Coreset-based clustering first finds a small set of weighted data points called the coreset, which approximates the given data set within a user-defined error margin, and then obtains the cluster centers using this coreset. In [63], it is proved that a coreset of size O(C^2/ε^4) is sufficient to obtain an O(1 + ε) approximation, where ε is the error parameter.

Clustering algorithms such as BIRCH [197] and CLARANS [136] improve the clustering efficiency by encapsulating the data set in special data structures, like trees and graphs, for efficient data access. For instance, BIRCH defines a data structure called the Clustering-Feature Tree (CF-Tree). Each leaf node in this tree summarizes a set of points whose inter-point distances are less than a user-defined threshold, by the sum of the points, the sum of the squares of the data points, and the number of points; each non-leaf node summarizes the same statistics for all its child nodes (a sketch of these summary statistics appears at the end of this section). The points in the data set are added incrementally to the CF-Tree. The leaf entries of the tree are then clustered using an agglomerative hierarchical clustering algorithm to obtain the final data partition. Other approaches summarize the data into kd-trees and R-trees for fast k-nearest neighbor search [115].

Stream clustering algorithms [8] are designed to operate in a single pass over an arbitrary-sized data set. Only the sufficient statistics (such as the mean and variance of the clusters, when the clusters are assumed to be drawn from a Gaussian mixture) of the data seen so far are retained, thereby reducing the memory requirements. One of the first stream clustering algorithms was proposed by Guha et al. [79]. They first summarize the data stream into a larger number of clusters than desired, and then cluster the centroids obtained in the first step.
Stream clustering algorithms such as CluStream [8], ClusTree [98], scalable k-means [30], and single-pass k-means [62] were built using a similar idea, containing an online phase to summarize the incoming data, and an offline phase to cluster the summarized data. The summarization is usually in the form of trees [8, 30], grids [32, 36] or coresets [6, 63]. For instance, the CluStream algorithm summarizes the data set into a CF-Tree, in which each node stores the linear sum and the squared sum of a set of points which are within a user-defined distance from each other. Each node represents a micro-cluster, whose center and radius can be found using the linear and squared sum values. The k-means algorithm is the algorithm of choice for the offline phase to obtain the final clusters.

With the evolution of cloud computing, parallel processing techniques for clustering have gained popularity [48, 60]. These techniques speed up the clustering process by first dividing the task into a number of independent sub-tasks that can be performed simultaneously, and then efficiently merging these solutions into the final solution. For instance, in [60], the MapReduce framework [148] is employed to speed up the k-means and the k-medians clustering algorithms. The data set is split among many processors and a small representative data sample is obtained from each of the processors. These representative data points are then clustered to obtain the cluster centers or medians. In parallel latent Dirichlet allocation, each task finds the latent variables corresponding to a different component of the mixture [133]. The Mahout platform [143] implements a number of parallel clustering algorithms, including parallel k-means, latent Dirichlet allocation, and mean-shift clustering [37, 133, 135, 199]. Billions of images were clustered using an efficient parallel nearest-neighbor clustering in [115].

Data sets of sizes close to a billion have been clustered using the parallelized versions of the k-means, nearest neighbor and spectral clustering algorithms. To the best of our knowledge, based on the published articles, the largest data set that has been clustered consisted of 1.5 billion images, each represented by a 100-dimensional vector containing the Haar wavelet coefficients [115]. They were clustered into 50 million clusters using the distributed nearest neighbor algorithm in 10 hours using 2,000 CPUs. Data sets that are big in both size (n) and dimensionality (d), like social-network graphs and web graphs, were clustered using subspace clustering algorithms and parallel spectral clustering algorithms [35, 181].
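To make the summarization idea used by BIRCH and CluStream concrete, the following minimal sketch shows how a clustering feature, assumed here to consist of exactly the count, linear sum and squared sum described above, supports incremental updates and recovery of the centroid and radius. It is our own illustration, not the actual BIRCH or CluStream implementation.

```python
import numpy as np

class ClusteringFeature:
    """The (N, LS, SS) summary stored in a CF-Tree node: the number of
    points, their linear sum, and the sum of their squared norms."""

    def __init__(self, d):
        self.n = 0
        self.ls = np.zeros(d)  # linear sum of the points
        self.ss = 0.0          # sum of squared norms of the points

    def add(self, x):
        """Absorb one point; CF summaries are also additive across nodes."""
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """Root-mean-square distance of the points to the centroid,
        recovered from the stored statistics alone."""
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))
```

Because two CF summaries can be merged by adding their components, a node never needs to store its individual points, which is what keeps the memory footprint bounded.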
1.3.1 Clustering with k-means

Among the various O(n) running time clustering algorithms in Table 1.2, the most popular algorithm for clustering large scale data sets is the k-means algorithm [87]. It is easy to implement, simple and efficient. It is easy to parallelize, has relatively few parameters when compared to the other algorithms, and yields clustering results similar to many other clustering algorithms [192]. Millions of points can be clustered using k-means within minutes. Extensive research has been performed to solve the k-means problem and obtain strong theoretical guarantees with respect to its convergence and accuracy. For these reasons, we focus on the k-means algorithm in this thesis.

The key idea behind k-means is to minimize the clustering error, defined as the sum of the squared distances between the data points and the center of the cluster to which each point is assigned. This can be posed as the following min-max optimization problem:

min_{U ∈ P} max_{c_k ∈ X} Σ_{k=1}^{C} Σ_{i=1}^{n} U_{k,i} d^2(c_k, x_i),    (1.2)

where U = (u_1, ..., u_C)^T is the cluster membership matrix, c_k ∈ X, k ∈ [C] are the cluster centers, and the domain P = {U ∈ {0,1}^{C×n} : U^T 1 = 1}, where 1 is a vector of all ones. The most commonly used distance measure d(·,·) is the squared Euclidean distance measure, defined in (1.1). The k-means problem with the squared Euclidean distance measure is defined as

min_{U ∈ P} max_{c_k ∈ X} Σ_{k=1}^{C} Σ_{i=1}^{n} U_{k,i} ||c_k - x_i||_2^2.    (1.3)

Algorithm 1 k-means
1: Input:
   D = {x_1, ..., x_n}, x_i ∈ R^d: the set of n d-dimensional data points to be clustered
   C: the number of clusters
2: Output: Cluster membership matrix U ∈ {0,1}^{C×n}
3: Randomly initialize the membership matrix U with zeros and ones, ensuring that U^T 1 = 1.
4: repeat
5:    Compute the cluster centers c_k = (1 / u_k^T 1) Σ_{i=1}^{n} U_{k,i} x_i, k ∈ [C].
6:    for i = 1, ..., n do
7:       Find the closest cluster center k* for x_i, by solving
            k* = argmin_{k ∈ [C]} ||c_k - x_i||_2^2.
8:       Update the i-th column of U by U_{k,i} = 1 for k = k* and zero, otherwise.
9:    end for
10: until convergence is reached

The above problem (1.3) is an NP-complete integer programming problem, due to which it is difficult to solve [121]. A greedy approximate algorithm, proposed by Lloyd, solves (1.3) iteratively [116]. The centers are initialized randomly. In each iteration, every data point is assigned to the cluster whose center is closest to it, and then the cluster centers are recalculated as the means of the points assigned to the cluster, i.e. the k-th center c_k is obtained as

c_k = (1 / n_k) Σ_{i=1}^{n} U_{k,i} x_i,  k ∈ [C],    (1.4)

where n_k = u_k^T 1 is the number of points assigned to the k-th cluster. These two steps are repeated until the cluster labels of the data points do not change in consecutive iterations. This procedure is described in Algorithm 1. It has O(ndCl) running time complexity and O(nd) memory complexity, where l is the number of iterations required for convergence. Several methods have been developed in the literature to initialize the algorithm intelligently and ensure that the solution obtained is a (1 + ε)-approximation of the optimal solution of (1.3) [12, 101].
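For reference, the following is a minimal NumPy rendering of Algorithm 1. It is a sketch rather than the implementation used in our experiments; initializing the centers by sampling data points, as done here, is one common choice and stands in for the random membership initialization of step 3.

```python
import numpy as np

def kmeans(X, C, max_iter=100, seed=0):
    """Lloyd's algorithm (Algorithm 1): alternate between assigning each
    point to its nearest center (step 7) and recomputing every center as
    the mean of its points, eq. (1.4), until the labels stop changing."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=C, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # squared Euclidean distances of all points to all centers (n x C)
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # step 10: convergence
        labels = new_labels
        for k in range(C):
            if np.any(labels == k):  # keep old center if a cluster empties
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers
```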
1.4 Kernel Based Clustering

The issue of scalability can be addressed by using the large scale clustering algorithms described in Section 1.3. However, most of these algorithms, including k-means, are linear clustering algorithms, i.e. they assume that the clusters are linearly separable in the input space (e.g. the data set shown in Figure 1.3(a)) and define the inter-point similarities using measures such as the Euclidean distance. They suffer from the following two main drawbacks:

(i) Data sets that contain clusters that cannot be separated by a hyperplane in the input space cannot be clustered by linear clustering algorithms. For this reason, all the clustering algorithms in Table 1.2, with the exception of spectral clustering, are only able to find compact, well-separated clusters in the data. They are also not robust to noise and outliers in the data. Consider the example shown in Figure 1.4. The data set in Figure 1.4(a) contains 500 points in the form of two semi-circles. We expect a clustering algorithm to group the points in each semi-circle, and detect the two semi-circular clusters. The clusters resulting from k-means with Euclidean distance are shown in Figure 1.4(b). Due to the use of the Euclidean distance, the two-dimensional space is divided into two half-spaces, and the resulting clusters are separated by the black dotted line. Other Euclidean-distance based partitional algorithms also find similar incorrect partitions.

(ii) Non-linear similarity measures can be used to find arbitrarily shaped clusters, and are more suitable for real-world applications. For example, suppose two images are represented by their pixel intensity values. The images may be considered more similar to each other if they comprise similar pixel values, as shown in Figure 1.5. Thus, the difference between the images is reflected better by the dissimilarity of the image histograms than by the Euclidean distance between the pixel values [14, 106].

Figure 1.4 A two-dimensional example that demonstrates the limitations of k-means clustering. 500 two-dimensional points containing two semi-circular clusters are shown in Figure (a). Points numbered 1-250 belong to the first cluster and points numbered 251-500 belong to the second cluster. The clusters obtained using k-means (using the Euclidean distance measure) do not reflect the true underlying clusters (shown in Figure (b)), because the clusters are not linearly separable, as expected by the k-means algorithm. On the other hand, the kernel k-means algorithm using the RBF kernel (with kernel width σ^2 = 0.4) reveals the true clusters (shown in Figure (c)). Figures (d) and (e) show the 500 × 500 similarity matrices corresponding to the Euclidean distance and the RBF kernel similarity, respectively. The RBF kernel similarity matrix contains distinct blocks which distinguish between the points from different clusters. The similarity between the points in the same true cluster is higher than the similarity between points in different clusters. The Euclidean distance matrix, on the other hand, does not contain such distinct blocks, which explains the failure of the k-means algorithm on this data.

Figure 1.5 Similarity of images expressed through gray level histograms. The histogram of the intensity values of the image of a website (Figure (b)) is very different from the histograms of the images of butterflies (Figures (d) and (f)). The histograms of the two butterfly images are similar to each other.

The issue of non-linear separability is tackled using kernel functions (see http://crsouza.blogspot.com/2010/03/kernel-functions-for-machine-learning.html for a catalog of kernel functions). The key behind the success of kernel-based learning algorithms is the fact that any data set becomes linearly separable when projected to an appropriate high dimensional space.
Consider a non-linear function φ : X → H, which maps the points in the input space X to a high dimensional feature space H. The distance between the data points in this feature space can be defined in terms of the dot products of the projected points. For instance, the Euclidean distance between two points x_a and x_b in H is defined as

||φ(x_a) - φ(x_b)||_2^2 = <φ(x_a), φ(x_a)> + <φ(x_b), φ(x_b)> - 2 <φ(x_a), φ(x_b)>.

In practical applications, the dimensionality of H is extremely high, possibly infinite. Hence, the explicit computation of the mapping φ is highly computationally intensive and, in most cases, infeasible. This computation is avoided by replacing the dot product with a non-linear kernel function κ(·,·) : X × X → R. The distance between any two points is now defined in terms of the kernel function κ as

d^2(x_a, x_b) = κ(x_a, x_a) + κ(x_b, x_b) - 2 κ(x_a, x_b).    (1.5)

A kernel function κ is admissible if and only if it satisfies Mercer's condition [159, Theorem 2.10]. Informally stated, Mercer's theorem asserts that there exists a mapping φ and an expansion κ(x_a, x_b) = φ(x_a)^T φ(x_b) if and only if, for any function g(x) such that ∫ g(x)^2 dx is finite, we have

∫∫ κ(x_a, x_b) g(x_a) g(x_b) dx_a dx_b ≥ 0.

Such a kernel is known as a Mercer kernel or Reproducing Kernel, and the feature space H is called the Reproducing Kernel Hilbert Space (RKHS). The matrix K = [κ(x_i, x_j)], x_i, x_j ∈ D, is known as the kernel matrix or Gram matrix. The simplest kernel functions are positive definite kernels, whose corresponding kernel matrix is Hermitian and positive-definite. The Radial Basis Function (RBF) kernel, defined by

κ(x_a, x_b) = exp(-||x_a - x_b||_2^2 / (2σ^2)),  σ > 0,    (1.6)

is a popular positive-definite kernel function. It performs well on a large number of benchmark data sets. The parameter σ^2, known as the kernel width, scales the distance between the points. Table 1.3 lists some of the popular kernel functions. The chi-square kernel, the histogram intersection kernel and their variants are commonly used in image and video-related applications. String kernels are popular in text-mining applications. The remaining kernels in Table 1.3 are generic kernels. Using the linear kernel is the same as using the Euclidean distance measure.

Table 1.3 Popular kernel functions.

  Linear                  κ(x_a, x_b) = x_a^T x_b + c, for constant c
  Polynomial              κ(x_a, x_b) = (x_a^T x_b + c)^d, where d is the degree of the polynomial kernel
  RBF                     κ(x_a, x_b) = exp(-||x_a - x_b||_2^2 / (2σ^2)), where σ > 0 is the kernel width parameter
  Laplacian               κ(x_a, x_b) = exp(-||x_a - x_b|| / σ)
  Chi-square              κ(x_a, x_b) = 1 - Σ_{i=1}^{d} (x_a^i - x_b^i)^2 / (0.5 (x_a^i + x_b^i))
  Histogram intersection  κ(x_a, x_b) = Σ min(hist(x_a), hist(x_b))
  String kernel           Number of common subsequences between string sequences x_a and x_b

Kernel based clustering techniques use (1.5) to define the similarity between objects. Consequently, when provided with the appropriate kernel function, they have the ability to capture the non-linear structure in real world data sets and, thus, usually perform better than the linear clustering algorithms in terms of cluster quality [95]. Various kernel-based clustering algorithms have been developed, including kernel k-means, spectral clustering, support vector clustering, maximum margin clustering, kernel self-organizing maps and kernel neural gas [65].
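The following toy sketch illustrates (1.5) and (1.6): the feature-space distance is evaluated through kernel calls alone, without ever constructing the mapping φ. The names are illustrative only.

```python
import numpy as np

def rbf_kernel(xa, xb, sigma2=0.4):
    """RBF kernel (1.6): kappa(xa, xb) = exp(-||xa - xb||^2 / (2 sigma^2))."""
    diff = xa - xb
    return np.exp(-float(diff @ diff) / (2.0 * sigma2))

def kernel_distance_sq(xa, xb, kappa):
    """Squared feature-space distance via the kernel trick, eq. (1.5):
    d^2 = kappa(xa, xa) + kappa(xb, xb) - 2 kappa(xa, xb)."""
    return kappa(xa, xa) + kappa(xb, xb) - 2.0 * kappa(xa, xb)

xa, xb = np.array([0.0, 1.0]), np.array([1.0, 0.0])
print(kernel_distance_sq(xa, xb, rbf_kernel))  # phi is never materialized
```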
Spectral clustering [118] is based on the idea of spectral graph partitioning. The data points are represented as nodes in a graph, and the affinity between the nodes is defined by the kernel similarity between the points. The graph is partitioned into C components by first computing the graph Laplacian matrix and the eigenvectors corresponding to its smallest C eigenvalues, and then clustering the eigenvectors into C clusters using k-means. The data partition is obtained via the graph partition. Spectral clustering is widely employed for image segmentation and graph partitioning problems.

Support vector clustering [19] involves projecting the data to a high dimensional feature space and searching for a minimum enclosing sphere in this space. This enclosing sphere is projected back into the input space, and the support vectors are used to define the cluster boundaries. All points that lie within a cluster boundary are assigned to the same cluster. The maximum margin clustering technique [190] finds the cluster labeling which, when used to find a maximum margin classifier (e.g. a Support Vector Machine) for the given data, results in a margin that is maximal over all possible cluster labelings. A convex optimization problem is formulated with the cluster labels and the margin of the Support Vector Machine as variables, and with constraints on the number of points per cluster and the difference in the cluster sizes. The labels and the classifier margin are optimized simultaneously to find the optimal cluster labels.

The kernel self-organizing map [120] algorithm extends the self-organizing map [96] algorithm to use kernel distance measures. The key idea behind this algorithm is to construct a low-dimensional (typically two-dimensional) topology-preserving map of the input data set through competitive learning. A self-organizing map, also known as the Kohonen map, consists of a two-layer network: an input layer containing d nodes and an output layer containing at least C nodes. Each output node is randomly initialized with a weight. When a new data point is input to the network, the node whose weight is closest to the input data point, in terms of the kernel distance, is determined. This node is called the Best Matching Unit. The weights of the best matching unit and its neighboring nodes are updated, based on a pre-defined neighborhood function. After a number of passes over the data set, the weights of the nodes converge to form distinctive regions in the output layer, from which the clusters in the data can be read off.

The kernel neural gas algorithm [145], inspired by the self-organizing map algorithm, also creates a map of the input data. The difference between the two methods is that while only the weights of a few neighboring nodes of the best matching unit are updated in a self-organizing map, the weights of all the nodes are updated in the neural gas algorithm. The nodes are ranked based on their proximity to the best matching unit, and their weights are updated on the basis of their rank: the nearest node is updated by a higher factor than the farthest node. This update mechanism leads to neural gas converging faster than self-organizing maps.

Similar to the k-means algorithm, the kernel k-means algorithm [160] is the most popular kernel-based clustering algorithm, due to its simplicity. Several studies have also established the theoretical equivalence of kernel k-means and other kernel-based clustering methods, suggesting that they yield similar results [51, 52].

1.4.1 Kernel k-means

The kernel k-means algorithm can be viewed as a non-linear extension of the k-means algorithm. It replaces the Euclidean distance function (1.1) employed in the k-means algorithm with the non-linear kernel distance function defined in (1.5).

Let K ∈ R^{n×n} be the kernel matrix with K_{i,j} = κ(x_i, x_j), where κ(·,·) is the kernel function.
Let H be the Reproducing Kernel Hilbert Space (RKHS) endowed by the kernel function κ(·,·), and let ||·||_H be the functional norm for H. Similar to the k-means problem, the objective of kernel k-means is to minimize the clustering error. Hence, the kernel k-means problem can be cast as the following optimization problem:

min_{U ∈ P} max_{{c_k(·) ∈ H}} Σ_{k=1}^{C} Σ_{i=1}^{n} U_{k,i} ||c_k(·) - κ(x_i, ·)||_H^2,    (1.7)

where U = (u_1, ..., u_C)^T is the cluster membership matrix, c_k(·) ∈ H, k ∈ [C] are the cluster centers, and the domain P = {U ∈ {0,1}^{C×n} : U^T 1 = 1}, where 1 is a vector of all ones. The above problem is also NP-complete. A simplified version of the problem, which relaxes the constraints on U, is solved to obtain the solution [72, 192]. Let n_k = u_k^T 1 be the number of data points assigned to the k-th cluster, and let

Û = (û_1, ..., û_C)^T = [diag(n_1, ..., n_C)]^{-1} U,
Ũ = (ũ_1, ..., ũ_C)^T = [diag(√n_1, ..., √n_C)]^{-1} U    (1.8)

denote the ℓ1 and ℓ2 normalized membership matrices, respectively.

Algorithm 2 Kernel k-means
1: Input:
   D = {x_1, ..., x_n}, x_i ∈ R^d: the set of n d-dimensional data points to be clustered
   κ(·,·) : X × X → R: the kernel function
   C: the number of clusters
2: Output: Cluster membership matrix U ∈ {0,1}^{C×n}
3: Compute the kernel matrix K = [κ(x_i, x_j)]_{n×n}.
4: Randomly initialize the membership matrix U with zeros and ones, ensuring that U^T 1 = 1.
5: repeat
6:    for i = 1, ..., n do
7:       Find the closest cluster center k* for x_i, by solving
            k* = argmin_{k ∈ [C]} ||c_k(·) - κ(x_i, ·)||_H^2
               = argmin_{k ∈ [C]} (u_k^T K u_k) / (u_k^T 1)^2 - 2 (u_k^T K_i) / (u_k^T 1),
         where u_k is the k-th column of U^T, and K_i is the i-th column of K.
8:       Update the i-th column of U by U_{k,i} = 1 for k = k* and zero otherwise.
9:    end for
10: until convergence is reached

It is easy to verify that, given the C × n cluster membership matrix U, the optimal solution for the cluster centers is

c_k(·) = Σ_{i=1}^{n} Û_{k,i} κ(x_i, ·),  k ∈ [C].    (1.9)

As a result, we can formulate (1.7) as the following optimization problem over U:

min_{U ∈ P} tr(K) - tr(Ũ K Ũ^T),    (1.10)

which can be further reformulated as the following trace maximization problem:

max_{U ∈ P} tr(Ũ K Ũ^T).    (1.11)

Note that the k-means optimization problem in (1.3) can also be written as the following trace maximization problem:

max_{U ∈ P} tr(Ũ X X^T Ũ^T),    (1.12)

where X = (x_1, ..., x_n)^T is the n × d pattern matrix corresponding to the data set D. Therefore, a greedy iterative algorithm similar to the k-means algorithm can be employed to solve (1.11), with the Euclidean distance function replaced by the kernel similarity function.

The kernel k-means algorithm is described in Algorithm 2. Figure 1.4(c) shows the result of applying the kernel k-means algorithm to the synthetic semi-circles data set in Figure 1.4(a), using the RBF kernel function in (1.6), with the kernel width σ^2 set to 0.4. It can be observed that kernel k-means is able to detect the two semi-circles correctly, unlike the k-means algorithm.
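A compact NumPy sketch of Algorithm 2, operating on a precomputed kernel matrix, is given below. It implements the distance expression of step 7 (the constant κ(x_i, x_i) term is dropped, since it does not affect the argmin) and is meant as an illustration rather than the exact implementation used in our experiments.

```python
import numpy as np

def kernel_kmeans(K, C, max_iter=100, seed=0):
    """Kernel k-means (Algorithm 2) on a precomputed n x n kernel matrix.
    Distances to centers are computed purely from K, as in step 7; the
    constant K[i, i] term is dropped since it does not affect the argmin."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, C, size=n)  # random initial memberships
    for _ in range(max_iter):
        U = np.zeros((C, n))
        U[labels, np.arange(n)] = 1.0          # C x n membership matrix
        nk = np.maximum(U.sum(axis=1), 1.0)    # cluster sizes u_k^T 1
        quad = np.einsum('ki,ij,kj->k', U, K, U) / nk ** 2  # u_k^T K u_k / n_k^2
        cross = 2.0 * (U @ K) / nk[:, None]                 # 2 u_k^T K_i / n_k
        new_labels = (quad[:, None] - cross).argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```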
1.4.2 Challenges

Though kernel based clustering algorithms achieve better cluster quality, they suffer from two major limitations.

1.4.2.1 Scalability

A naive implementation of kernel k-means requires the computation of the n × n kernel matrix K (Step 3 in Algorithm 2), which takes O(n^2) time and memory. Clustering millions of objects using kernel k-means requires more than 8,000 GB of memory and a large amount of computing resources. Table 1.4 compares the running times of the k-means and the kernel k-means algorithms on a 100-dimensional synthetic data set containing 10 clusters and an exponentially increasing number of points. The algorithms were executed on a 2.8 GHz processor with 40 GB memory. It can be seen that running kernel k-means is far more expensive than running k-means, especially on large data sets.

Table 1.4 Comparison of the running times of k-means and kernel k-means on a 100-dimensional synthetic data set containing 10 clusters and an exponentially increasing number of data points, on a 2.8 GHz processor with 40 GB memory.

  Data set size                        10^4    10^5     10^6         10^7     10^8
  Running time    k-means              0.03    0.17     2.30         34.90    5508.50
  in seconds      Kernel k-means       3.09    320.10   > 48 hours   -        -

It is also expensive to assign previously unseen data points to clusters using kernel k-means, a task often termed the out-of-sample problem. To find the cluster label for a new data point x, we need to compute the distance between x and all the cluster centers as follows:

d^2(x, c_k) = ||c_k(·) - κ(x, ·)||_H^2
            = (u_k^T K u_k) / (u_k^T 1)^2 - 2 (u_k^T K_x) / (u_k^T 1),  k ∈ [C],    (1.13)

where K_x = (κ(x, x_1), ..., κ(x, x_n))^T. This requires the computation of the O(n)-sized vector K_x, in addition to the kernel matrix K. This is due to the fact that there is no explicit representation for the cluster centers. If there was a d-dimensional representation for the cluster centers c_k (as in the case of k-means), the distance d^2(x, c_k) could be computed in O(d) time (a sketch illustrating this cost appears at the end of this subsection).

Clearly, scalability is a major challenge faced by kernel k-means. Other kernel-based algorithms also have high running time complexity. For instance, spectral clustering involves the computation of the top C eigenvectors of the kernel matrix, which is of O(n^3) complexity.

In the literature, the issue of scalability has largely been addressed through the use of cloud computing and parallel algorithms. The Mahout platform [143] implements the parallel spectral clustering algorithm, which uses the distributed Lanczos eigensolver to obtain the eigenvectors of the Laplacian matrix [35]. Distributed implementations of Support Vector Machines have been developed to perform clustering [78, 169]. However, parallelization of kernel based algorithms is not simple, due to their non-linear nature [15]. For instance, in order to parallelize kernel k-means, one must replicate the data to all the tasks, leading to large resource and communication overheads.

Approximate clustering techniques are useful in alleviating this issue. Sampling methods, such as the Nystrom method [187], have been employed to obtain a low rank approximation of the kernel matrix to address this challenge [67, 113]. Low-dimensional projection combined with sampling has been used to further improve the clustering efficiency and tackle the out-of-sample problem [11, 153].
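As referenced above, the following sketch illustrates the cost described in (1.13): assigning a single unseen point requires n kernel evaluations, because the centers exist only implicitly through the membership matrix. The names are illustrative.

```python
import numpy as np

def assign_new_point(x, X, U, K, kappa):
    """Label an unseen point via (1.13). The centers have no explicit
    representation, so the full n-vector K_x of similarities between x
    and every stored training point must be evaluated first."""
    Kx = np.array([kappa(x, xi) for xi in X])   # O(n) kernel evaluations
    nk = np.maximum(U.sum(axis=1), 1.0)
    quad = np.einsum('ki,ij,kj->k', U, K, U) / nk ** 2
    cross = 2.0 * (U @ Kx) / nk
    return int((quad - cross).argmin())         # kappa(x, x) is constant in k
```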
1.4.2.2 Choice of kernel

The role of the kernel function is to reflect the true structure of the data set. However, if the kernel function is chosen wrongly, the performance of the clustering algorithm degrades. The RBF kernel defined in (1.6) performs well on most benchmark data sets; however, even for the RBF kernel, the kernel width parameter has to be chosen carefully. Figure 1.6 demonstrates the sensitivity of kernel k-means to the kernel width parameter. Kernel k-means is executed on the semi-circles data set shown in Figure 1.4(a), using the RBF kernel with kernel width values 0.4 and 0.1. The clusters obtained are shown in Figures 1.6(b) and 1.6(c), respectively. When σ^2 = 0.4, the true clusters are revealed. On the other hand, when σ^2 = 0.1, the clusters are distorted. Figure 1.6(d) plots the clustering error of kernel k-means, defined in (1.10), against the RBF kernel width. It is clear that the performance depends on the choice of the kernel width. Hence, another challenge associated with kernel based algorithms is the choice of the kernel function and the kernel parameters.

Figure 1.6 Sensitivity of the kernel k-means algorithm to the choice of kernel function. The semi-circles data set (shown in Figure (a)) is clustered using kernel k-means with the RBF kernel. When the kernel width is set to 0.4, the two clusters are correctly detected (shown in Figure (b)), whereas when the kernel width is set to 0.1, the points are clustered incorrectly (shown in Figure (c)). Figure (d) shows the variation in the clustering error of kernel k-means, defined in (1.10), with respect to the kernel width.

Kernel learning techniques aim at learning a positive semi-definite kernel matrix that reflects the true similarity between the points in the data set [4]. In the supervised learning setting, the kernel is optimized to align with the true class structure of the data. This is achieved by either minimizing the error of the classifier for the chosen kernel, or maximizing the similarity between the kernel and the class matrix. As the class labels are not available in the setting of unsupervised learning, other criteria, such as the compactness of the clusters in the feature space and the degree of alignment with the structure of the data, are utilized [112, 177, 200].

1.5 Thesis Contributions

The objective of this thesis is to design clustering algorithms that can accurately identify the clusters in data sets containing billions of points, thousands of features and thousands of clusters. As kernel-based clustering algorithms generally achieve high cluster quality, provided the correct kernel function is chosen, we address the scalability challenge associated with kernel based clustering algorithms. Our main contribution is the development of efficient approximations of the kernel k-means algorithm to enable kernel based clustering of large data sets. We demonstrate analytically and empirically that the proposed approximate algorithms are comparable to kernel k-means in terms of accuracy and, at the same time, comparable to k-means in terms of efficiency, achieving the desired trade-off between scalability and accuracy. We then extend the proposed approximate algorithms to handle distributed and streaming data, pushing the limits on the number of objects that can be clustered accurately with limited computing and memory resources. Figure 1.7 shows the scalability of some of the popular linear and kernel-based clustering algorithms in terms of n, d and C, and the contribution of the proposed clustering algorithms in improving the scalability of kernel-based clustering.
In the following, we describe the specific contributions of each chapter:

Figure 1.7 Scalability of clustering algorithms in terms of n, d and C, and the contribution of the proposed algorithms in improving the scalability of kernel-based clustering. The plot shows the maximum size of the data set that can be clustered with less than 100 GB memory on a 2.8 GHz processor with a reasonable amount of clustering time (less than 10 hours). The linear clustering algorithms are represented in blue, current kernel-based clustering algorithms are shown in green, parallel clustering algorithms are shown in magenta, and the proposed clustering algorithms are represented in red. Existing kernel-based clustering algorithms can cluster only up to the order of 10,000 points with 100 features into 100 clusters. The proposed batch clustering algorithms (approximate kernel k-means, RFF clustering, and SV clustering algorithms) are capable of performing kernel-based clustering on data sets as large as 10 million, with the same resource constraints. The proposed online clustering algorithms (approximate stream kernel k-means and sparse kernel k-means algorithms) can cluster arbitrarily-sized data sets with dimensionality in the order of 1,000 and the number of clusters in the order of 10,000.

Chapters 2 and 3 address the scalability of kernel k-means using kernel approximation techniques. The computational demand of kernel k-means stems from the fact that it computes an n × n kernel matrix K, leading to O(n^2) running time and memory complexity. This can be alleviated by replacing the kernel matrix K with an approximate matrix which can be computed more efficiently. In Chapter 2, we first present a randomized algorithm, called approximate kernel k-means, which replaces K with a low rank approximate kernel matrix. Its complexity is linear in terms of n, while its clustering performance is equivalent to that of kernel k-means. We then extend the proposed approximate algorithm to handle large data sets in a distributed environment. In Chapter 3, we propose two clustering algorithms, RFF clustering and SV clustering, which employ random feature maps [92, 147] to obtain low-dimensional representations for the data points, such that the dot product of any two points in the low-dimensional space approximates the kernel similarity between them. This allows us to execute a linear clustering algorithm on the transformed data points. The SV clustering algorithm has a lower running time than the approximate kernel k-means algorithm. It also allows the explicit computation of the cluster centers, leading to an efficient solution to the out-of-sample clustering problem. We demonstrate that it is possible to cluster billions of data points efficiently and accurately using the algorithms proposed in these two chapters. For instance, we were able to cluster a synthetic data set containing 1 billion 10-dimensional points using the distributed approximate kernel k-means algorithm in 15 minutes (on a computing cluster with 1,024 2.8 GHz processors and shared 40 GB memory), with high cluster quality (80% accuracy in terms of NMI; see Section 1.6.2 for the definition of NMI). It would take many days to cluster this data set using kernel k-means and other kernel-based clustering algorithms, while linear clustering algorithms like k-means cannot achieve comparable accuracy.

Batch clustering algorithms such as k-means and kernel k-means are iterative in nature, and need to access the input data points multiple times.
However, many data sets are too large to load into the memory, so it would not only be prohibitively expensive to perform multiple passes over the data, but also infeasible to compute the kernel matrix. Some applications, such as social network analysis and intrusion detection in networks, involve potentially unbounded sequences of data points called data streams. Only a small subset of the data can be stored, depending on the size of the data buffer; due to this, each data point can be accessed at most once. This data also evolves over time, so the data points that arrived recently have higher relevance than the older data. There have been relatively few efforts to apply kernel based clustering to data streams, due to the cost of computing the kernel. In Chapter 4, we present an efficient algorithm called approximate stream kernel k-means, to perform kernel clustering on stream data. The key idea is to construct the kernel matrix dynamically using importance sampling, and assign labels to the incoming data points in real-time. We use several benchmark data sets to simulate stream data sets, and evaluate the performance of the proposed algorithm on these data sets. We demonstrate that our algorithm is able to cluster stream data sets in real-time, with speeds up to 8 MBps.

Document and image data sets contain millions of high-dimensional points and usually belong to a large number of categories. Finding clusters in such data sets is computationally expensive using kernel-based clustering techniques, because they have quadratic running time complexity in terms of the number of data points, and linear time complexity in terms of the number of dimensions and the number of clusters. Although the approximate kernel clustering algorithms discussed in Chapters 2-4 reduce the running time complexity in terms of the number of data points, their clustering time grows linearly with the number of clusters. In Chapter 5, we present the sparse kernel k-means algorithm, which can efficiently cluster large data sets into thousands of clusters, with significantly lower processing and memory requirements and high clustering accuracy. It assumes that the kernel matrix is sparse when the number of clusters is large, and constructs a sparse kernel matrix for a subset of the data set, sampled incrementally using importance sampling. Cluster labels are obtained by clustering this sparse kernel matrix in a low dimensional space spanned by its top eigenvectors. This algorithm has running time complexity linear in the size and the dimensionality of the data set, and logarithmic in the number of clusters.

1.6 Datasets and Evaluation Metrics

1.6.1 Datasets

To demonstrate the effectiveness of the proposed algorithms, we use several benchmark data sets of different sizes and dimensionalities, from several domains. The description of the data sets is summarized in Table 1.5.

Table 1.5 Description of data sets used for evaluation of the proposed algorithms.

  Dataset                   | Number of data points n | Dimensionality d | Number of clusters C
  CIFAR-10 [99]             | 60,000                  | 384              | 10
  CIFAR-100 [99]            | 60,000                  | 384              | 100
  MNIST [108]               | 70,000                  | 784              | 10
  Forest Cover Type [23]    | 581,012                 | 54               | 7
  Imagenet-34 [49]          | 949,401                 | 900              | 34
  Imagenet-164 [49]         | 1,262,102               | 900              | 164
  Poker [33]                | 1,025,010               | 30               | 10
  Network Intrusion [167]   | 4,897,988               | 50               | 10
  Youtube                   | 10,143,254              | 6,647            | N/A
  Tiny [173]                | 79,302,017              | 384              | N/A
  Twitter                   | 1,000,000,000           | 8,042            | N/A
  Concentric circles        | 100 to 1,000,000,000    | 10 to 1,000      | 10 to 1,000

MNIST [108]: The MNIST data set is a subset of the database of handwritten digits available from NIST. It contains 70,000 images from 10 classes, each class representing one of the digits 0-9. Each image is represented as a 784-dimensional feature vector containing the pixel intensity values.
Forest Cover Type [23]: This data set is composed of cartographic variables obtained from the US Geological Survey (USGS) and the US Forest Service (USFS) data. Each of the 581,012 data points represents the attributes of a 30 × 30 square meter cell of the forest floor. There are a total of 12 attributes, including qualitative measures like soil type and wilderness area, and quantitative measures like slope, elevation, and distance to hydrology. These 12 attributes are represented using 54 features. The data are grouped into 7 classes, each representing a different forest cover type. The true cover type was determined from the USFS Region 2 Resource Information System (RIS) data.

Imagenet [49]: The Imagenet data set contains about 14 million images organized according to the Wordnet hierarchy [64]. Each node in this hierarchy represents a concept known as a "synset". We downloaded 1,262,102 images from 1,000 synsets, and merged the leaf nodes in the synset tree based on their similarity to form a 164-class data set. We call this data set Imagenet-164 and use it to demonstrate the effectiveness of the sparse kernel k-means algorithm in Chapter 5. By filtering out the classes with fewer than 500 images, we formed a balanced data set containing 949,401 images from 34 classes, which we call the Imagenet-34 data set. This data set is used to evaluate the remaining clustering algorithms. We computed the Scale Invariant Feature Transform (SIFT) descriptors [117] of the images using the VLFeat library [178], and clustered a randomly chosen subset of 10 million SIFT features to form a visual vocabulary. Each SIFT descriptor was then quantized into a visual word using the nearest cluster center. We obtained a 900-dimensional vector representation for each image, which was then normalized to lie in the range [0, 1].

Poker [33]: This data set, available in the UCI repository [13], contains 1,025,010 data points. Each data point is an example of a "hand" consisting of five playing cards drawn from a standard deck of 52. Each card is described using two attributes: suit and rank. These attributes are represented using a 30-dimensional categorical feature vector. There are 10 classes in the data set, each depicting a type of poker hand.

Network Intrusion [167]: The Network Intrusion data set contains 4,898,431
Youtube 13 : Youtubeisavideohostingwebsitewhichallowsuserstouplo ad,viewandshare videosovertheweb.Ithasoveronebillionusersuploadingo ver 300 hoursofvideosev- eryminute,onawiderangeoftopics.WeusedtheYoutubeSear chAPI 14 todownload themeta-datacorrespondingto 10 ;143 ;254 videosusing 26 ;000 non-abstractnounsfrom Wordnet[64]assearchqueries.Weusedthevideotitle,desc riptionandthevideothumbnail (whichusuallycontainsthekeyframeinthevideo)toextrac tfeaturesforeachrecord.For eachvideo,weeliminatedstopwordsfromthetitleanddescr iptiontoobtainavocabulary containing 6 ;135 terms,andextractedthecorrespondingtf-idf(termfreque ncy-inversedoc- umentfrequency)features[125].Featurevalue x r;t ,representingtheweightassignedtothe term t inrecord r ,measureshowimportantthetermistotherecordinthedatas et.Itis denedas x r;t = tf ( r;t ) idf ( t; D ) (1.14) = 8 > < > : 1+log f ( r;t ) log n log f ( t ) if f ( r;t ) > 0 0 otherwise ;(1.15) where f ( r;t ) representsthenumberoftimestheterm t occursintherecord r and f ( t ) representsthenumberofrecordscontainingtheterm t .Wethendownloadedthethumbnail ofthevideoandextractedtheglobalGISTfeatures[141]oft heimage.Thenal 6 ;647 - dimensionalfeaturevectorwasobtainedbyconcatenatingt hetf-idfandGISTfeatures.We 13 www.youtube.com 14 https://developers.google.com/youtube/v3 37 usethisdatasettoevaluatetheperformanceofthesparseke rnel k -meansalgorithmproposed inChapter5onlargehighdimensionaldatasets. Tiny,CIFAR-10andCIFAR-100[99,173]: TheTinyImagedatasetcontains 79 ;302 ;017 unique 32 32 colorimages,downloadedfromtheInternet.Theywereobtai nedbyextract- ing 75 ;062 non-abstractEnglishnounsfromtheWordnetdatabase[64]a ndusingthemto searchforimagesin 7 independentimagesearchengines.Theseimagesweredownlo aded anddown-sampledto 32 32 .Werepresentedeachimageusinga 384 -dimensionalGIST descriptor[141].Thoughthesearchqueriescanbeusedtolo oselylabeltheimages,these labelsareunreliable.Toevaluatetheaccuracyofthepropo sedalgorithms,weusedthe CIFAR-10andCIFAR-100datasets,manuallylabeledsubsets oftheTinydataset.The CIFAR-10datasetcontains 60 ;000 imagesfrom 10 classes(bird,truck,deer,dog,cat,frog, car,plane,horseandship).TheCIFAR-100alsocontains 60 ;000 imagesfrom 100 classes. Twitter 15 : Twitterisasocialnetworkwithover100millionactiveuser spostingover 100 ;000 shortmessages(called tweets )perminute.Thetweetscontainpersonalupdates, real-timeinformationaboutevents,newsetc.Eachtweetco ntainsatextmessagelimited to 140 charactersandcanincludeuser-mentions,links,emoticon s,andhashtagsinaddi- tiontoplaintext.Wedownloadedoverabilliontweetsusing theTwitterstreamingsearch APIusing 20 programminglanguages(Python,Perl,C#,Java,Ruby,C++,J avaScript,VB- Script,Scala,ObjectiveC,PHP,SQL,Postgresql,GO,Julia ,Erlang,HTML,XML,Swift, andASP.NET)assearchterms.Welteredoutthenon-English tweets,removedthehash- tags,eliminatedthestopwordsandrepresentedeachtweetw iththetf-idffeatures,dened in(1.15),correspondingto 8 ;042 terms.Weusethisdatasettodemonstratetheefciencyof theapproximatestreamkernel k -meansalgorithminChapter4onfaststreamingdatasets. 
In addition to the above real-world data sets, we use a synthetic data set, which we call the concentric circles data set, to demonstrate the scalability of the proposed algorithms. The data set, containing circular clusters of varying radii, was generated with different numbers of points, ranging from 100 to 1 billion. The data dimensionality ranges from 10 to 1,000, and the number of clusters ranges from 10 to 1,000. Each cluster contains the same number of points. An example data set containing 1,000 two-dimensional points along 10 concentric circles (100 points in each cluster) is shown in Figure 4.2(a).

1.6.2 Evaluation Metrics

The goal of our research is to reduce the resources needed for kernel clustering, with minimal reduction in the cluster quality. In order to evaluate the reduction in running time and memory complexity, we measured the time taken for clustering the data points, and the amount of memory used.

The cluster quality of the proposed algorithms was evaluated using two types of measures: (a) internal measures evaluate the structure and compactness of the clusters, while (b) external measures evaluate how well the cluster labels match with the true class labels. We used the internal Silhouette coefficient [151] and the external Normalized Mutual Information (NMI) [104] measures to evaluate the cluster quality of our algorithms.

The Silhouette coefficient measures the compactness of the clusters. Let d_{k,i} represent the average dissimilarity between data point x_i and all the points assigned to the cluster C_k, i.e.

d_{k,i} = (1 / n_k) Σ_{x_j ∈ C_k, x_j ≠ x_i} d^2(x_i, x_j),

where n_k is the number of points (except for x_i) assigned to cluster C_k. For each data point x_i, define the coefficients a_i and b_i as follows:

a_i = d_{k*,i}  and  b_i = min_{k ≠ k*} d_{k,i},

where k* is the index of the cluster to which x_i is assigned. The coefficient a_i represents the average dissimilarity of x_i with all other points within the same cluster, and the coefficient b_i represents the average dissimilarity between x_i and all the points in the neighboring cluster. The Silhouette coefficient is defined as

Silhouette = (1/n) Σ_{i=1}^{n} (b_i - a_i) / max(a_i, b_i).    (1.16)

The value of the Silhouette coefficient lies in the range [-1, 1], and a value close to 1 is desired. When the coefficient is close to 1, it implies that a_i << b_i for a large number of points, i.e. many of the points are well-matched to the cluster to which they were assigned. On the other hand, when the Silhouette coefficient value is close to -1, a_i >> b_i for a large number of data points, which implies that many of the points are more similar to the neighboring clusters than the cluster to which they have been assigned. A value close to 0 denotes that many data points lie on the boundaries of their natural clusters.

The Normalized Mutual Information with respect to the true class labels of the data points is defined as follows. Let U^a and U^b be the cluster membership matrices corresponding to two partitions a and b of the same data set. Let n_i^a represent the number of data points that have been assigned label i in partition a, and let n_{i,j}^{a,b} represent the number of data points that have been assigned label i in partition a and label j in partition b. We have

NMI(a, b) = [ Σ_{i=1}^{C} Σ_{j=1}^{C} n_{i,j}^{a,b} log( n n_{i,j}^{a,b} / (n_i^a n_j^b) ) ] / sqrt( [ Σ_{i=1}^{C} n_i^a log(n_i^a / n) ] [ Σ_{j=1}^{C} n_j^b log(n_j^b / n) ] ),    (1.17)

where a represents the partition obtained from the clustering algorithm, and b represents the partition based on the true classes. An NMI value of 1 indicates perfect matching with the true class distribution, whereas 0 indicates perfect mismatch. The true class labels are available for most of the data sets (except the Tiny image and Youtube data sets). We used the CIFAR-10 data set, a labeled subset of the Tiny image data set, to evaluate the performance on the Tiny data set.
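A direct transcription of (1.17) into NumPy is shown below; it is a sketch for verifying the definition on small label vectors, not an optimized implementation.

```python
import numpy as np

def nmi(a, b):
    """Normalized Mutual Information of two label vectors, following (1.17)."""
    n = len(a)
    labels_a, labels_b = np.unique(a), np.unique(b)
    # n_{i,j}^{a,b}: contingency counts between the two partitions
    joint = np.array([[np.sum((a == i) & (b == j)) for j in labels_b]
                      for i in labels_a], dtype=float)
    na, nb = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    num = np.sum(joint[mask] * np.log(n * joint[mask] / np.outer(na, nb)[mask]))
    den = np.sqrt(np.sum(na * np.log(na / n)) * np.sum(nb * np.log(nb / n)))
    return num / den

print(nmi(np.array([0, 0, 1, 1]), np.array([1, 1, 0, 0])))  # -> 1.0
```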
1.7 Thesis Overview

Kernel-based clustering algorithms, which perform well on real-world data sets, are not scalable to big data sets, containing billions of high-dimensional points from thousands of clusters. We propose scalable approximate kernel-based clustering algorithms, and demonstrate their efficiency and effectiveness on several diverse large-scale data sets. The remainder of this thesis is organized as follows: Chapters 2 and 3 describe the approximate batch clustering algorithms (approximate kernel k-means, and kernel-based clustering using random Fourier features), based on work published in [40] and [42], respectively. These algorithms can cluster up to 10 million data points with thousands of features, and achieve high cluster quality. Chapter 4, based on the publication [43], describes the approximate stream kernel k-means algorithm, which can cluster streaming data of arbitrary size in real-time. The sparse kernel k-means algorithm, discussed in Chapter 5, can cluster arbitrarily-sized high-dimensional data sets into thousands of clusters. It is applicable to large document and image repositories. This work was published in [38]. We conclude our study and present directions for future work in Chapter 6.

Chapter 2

Approximate Kernel-based Clustering

2.1 Introduction

As discussed in Chapter 1, kernel k-means achieves better clustering performance than k-means, because it explores the non-linear structure in the data using complex non-linear similarity measures. However, it has running time and memory complexity quadratic in the number of data points n, leading to its non-scalability to big data sets.

To address this issue, we propose an approximate kernel clustering algorithm called Approximate Kernel k-means [40], based on random sampling. We sample m points from the data set of n points, and express the cluster centers as linear combinations of vectors in the space spanned by this subset. The weights of the sampled points in the cluster centers, and the cluster labels of the points, are obtained simultaneously using iterative optimization. Only a small n × m portion of the kernel matrix needs to be computed using the proposed algorithm, thereby reducing the running time complexity of clustering from O(n^2) to O(nm). When n is in the order of millions, the sample size m is much smaller than n. Hence, the proposed algorithm is comparable to k-means in terms of efficiency. We show analytically and empirically that the cluster quality achieved by the proposed approximate kernel k-means is comparable to that of kernel k-means.

This chapter is organized as follows: In Section 2.2, we briefly review some of the popular approximate kernel-based clustering schemes developed in the literature. We formally describe the proposed approximate kernel k-means algorithm in Section 2.3. The key parameters which determine the success of the proposed algorithm are the number of samples m and the sampling strategy; we discuss these issues in Section 2.3.1. In Section 2.3.2, we analyze the proposed algorithm's running time and memory complexity. We also show that the difference between the performance of the approximate kernel k-means and the kernel k-means algorithms, in terms of the clustering error (defined as the sum of the squared distances between the data points and the center of the cluster to which each data point is assigned; see Section 1.3.1 for the formal definition), reduces as the number of samples m increases, at the rate of O(1/m). In Section 2.3.3, we present the distributed approximate kernel k-means algorithm [41], which parallelizes the proposed approximate kernel k-means algorithm in order to scale up to data sets containing billions of data points. Finally, in Section 2.4, we demonstrate empirically that the proposed approximate clustering algorithm is an efficient and accurate variant of the kernel k-means algorithm, and can be used to cluster large data sets containing billions of points.
2.2 Related Work

Large matrices, like the kernel matrices corresponding to large data sets, have fast decaying eigenspectrums [187]. Therefore, the computational requirements of operations involving such matrices can be reduced by replacing them with their low-rank approximations. Most of the scalable kernel-based learning algorithms, including the proposed approximate kernel k-means algorithm, take advantage of this fact in their design.

Below, we first briefly review the low-rank matrix approximation literature, and then describe some of the large-scale kernel-based clustering algorithms developed in the literature.

2.2.1 Low-rank Matrix Approximation

Given an n × m matrix A, the objective of low-rank approximation is to find a rank-r matrix A_r that minimizes the error defined by ||A - A_r||_p, where ||·||_p represents either the spectral norm or the Frobenius norm. The optimal solution to this problem is given by

A_r = Σ_{k=1}^{r} σ_k u_k v_k^T,

where {σ_k}_{k=1}^{r} represent the largest r singular values of A, and {u_k}_{k=1}^{r} and {v_k}_{k=1}^{r} are the corresponding left and right singular vectors [58]. The time required to estimate the singular vectors is O(mn min{m, n}), which can be prohibitive when m and n are large.

Several efficient algorithms have been proposed in the literature to approximate the Singular Value Decomposition [129]. One of the earliest algorithms, by Frieze et al., involves independently sampling s rows and columns from A to form an s × s matrix S; A is then projected onto the span of the dominant eigenvectors of S. They showed that when the columns and rows are sampled with probability proportional to the column and row norms respectively, and the sample size s = O(max{r^4/ε^3, r^2/ε^4}), the approximation error can be bounded, with high probability, as

||A - Ã_r||_F^2 ≤ ||A - A_r||_F^2 + ε ||A||_F^2,    (2.1)

where Ã_r is the rank-r approximation obtained and ε > 0 is an error parameter [70]. Achlioptas et al. obtained a similar result by designing a random matrix R, with entries dependent on the values in the matrix A, such that the matrix A + R is sparse and the expectation E[R] = 0. The top r singular vectors of the sparse matrix A + R are used in place of the singular vectors of A to find the low rank approximation of A [5]. The singular vectors of a sparse matrix can be computed efficiently using the Lanczos bidiagonalization method and its variants [163].
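The optimal rank-r approximation above can be computed directly with a full SVD, as the following sketch shows; the point of the approximation literature surveyed in this section is precisely to avoid this O(mn min{m, n}) computation on large matrices.

```python
import numpy as np

def best_rank_r(A, r):
    """Optimal rank-r approximation A_r: keep the r largest singular
    values and the corresponding singular vectors (Section 2.2.1)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

A = np.random.rand(100, 40)
for r in (1, 5, 20):
    print(r, round(np.linalg.norm(A - best_rank_r(A, r), 'fro'), 3))
    # the Frobenius-norm error decreases as the rank r grows
```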
proposedefcientlinear-timealgorithmswhichrandomlys amplethecolumnsandrows,and obtain U throughthesingularvaluedecompositionofasmall s s matrix[55].In[122],they improvedtheapproximationbysampling C and R basedontheimportanceofthecolumnsand rows,measuredintermsoftheirstatisticalleveragescore s 2 .Wang etal. augmentedasparsica- 2 Thestatisticalleveragescoreofthe i th columnofarank- rn m matrix A withsingularvaluedecomposition A = U V > isdenedas ˇ i = 1 r V ( i ) 2 2 ,where V ( i ) representsthe i th rowin V .Itisameasureoftheindependence ofthecolumnanditsinuenceonthematrix. 45 tionproceduretotheCURdecompositionalgorithmtofurthe rimproveitsefciency[184].The CURdecompositionispreferredovertheSVDdecompositioni nmanyapplicationsbecauseitcan beinterpretedmoreeasily. 2.2.1.2Nystrommatrixapproximation TheNystromapproximationcanbeviewedasaspecialization oftheCURdecompositionforsym- metricpositivesemi-denite(SPSD)matrices 3 likekernelsimilaritymatrices[17,57,103,109,111, 170,187,195].Itwasrstusedin[187]toperformclassica tionandregressionusingGaussian processes.Itwasthenadoptedinmanykernel-basedlearnin gtaskssuchasclassication[187], regression[46,187],clustering[35,67,113],manifoldle arning[194]anddimensionalityreduc- tion[7]. Let K representan n n SPSDmatrixand K r representitsbestrank- r approximation.The NystromapproximationstudiedbyWilliams etal. in[187]samples m ˝ n columnsuniformly withoutreplacementfrom K ,toformthe n m matrix K B .Let b K bethe m m intersection betweenthesampledcolumnsandthecorrespondingrows. K isapproximatedby e K = K B b K 1 K > B :(2.3) Drineas etal. developedavariantwhichuses b K r ,thebestrank- r approximationof b K ,inplace of b K in(2.3),andsamplesthecolumnsuniformly with replacement,toobtainapproximation errorboundsoftheform(2.1)[57].Severalnon-uniformsam plingtechniqueshavebeenexplored in[17,74,103,170,195]toobtainimprovederrorbounds. 3 An n n matrix K ispositivesemi-deniteif x > K x 0 ,forallnon-zero x 2< n . 46 2.2.2Kernel-basedClusteringforLargeDatasets Samplingandsparsicationtechniqueshavebeenemployedt odevelopefcientkernel-basedclus- teringalgorithms. Thespectralclusteringalgorithmisagraph-basedcluster ingtechnique[118].The n points inthedatasetarerepresentedasnodesofagraph.Eachedgei nthegraphisweightedbythe similaritybetweenthepointsconnectedbytheedge.Let K denotethe n n similaritymatrix. Thespectralclusteringalgorithmusestherst C eigenvectorsoftheLaplacianmatrixdenedby L = I diag ( K > 1 ) 1 = 2 K diag ( K > 1 ) 1 = 2 ;tondtheclusters.Theobviouscomputationalbottlenecks inthisalgorithmarethecalculationof theLaplacianmatrixandthecomputationofitseigenvector s,whichrequire O ( n 2 ) and O ( n 3 ) time respectively. ThecolumnsamplingmethodbyFrieze etal. andNystromapproximationmethodhavebeen usedintheliteraturetospeedupspectralclustering[18,6 7,102].Thekeyideaistoapproximate theLaplacianmatrix,andusetheapproximateeigenvectors tondtheclusters.Therunningtime complexityoftheseapproximatespectralclusteringalgor ithmsis O ( nm + m 3 ) ,where m isthe numberofcolumnssampledfromthekernelmatrix.As m isusuallymuchlowerthan n ,these algorithmsrunmuchfasterthanspectralclustering.Rando mprojectioncanbecombinedwith samplingtofurtherimprovetheclusteringefciency[73,1 53].Alow-dimensionalprojection ofthesimilaritymatrix,obtainedbymultiplyingitwithar andomGaussianmatrix,isusedto constructthegraphLaplacianmatrix.Theseapproximatesp ectralclusteringalgorithmshavebeen appliedsuccessfullyinimagesegmentationproblems[113] . 
Nystromapproximationwasalsousedtoacceleratethekerne lneuralgasalgorithm[145].The objectiveofkernelneuralgasistond C prototypestorepresentthedata.Eachprototypeis associatedwitharandomlyinitializedweight,whichisupd atedbyafactorproportionaltothe 47 similaritybetweentheprototypeandtheinputdatapoint.S imilartothekernel k -means,the prototypesareexpressedaslinearcombinationsofthedata pointsintheHilbertspace,soeach weightupdateinvolvesthe n n kernelmatrix.TheNystromapproximationofthekernelmatr ix wasusedin[156]toperformtheweightupdates,therebyredu cingitsrunningtimecomplexity. Inadditiontotheaboveapproximatemethods,severalheuri sticandapplication-specical- gorithmshavebeenproposedtoperformefcientkernel-bas edclustering.ZhangandRudnicky reducedthememoryrequirementsofthekernel k -meansalgorithmbychangingtheorderin whichclusteringisperformed.Thekernelmatrixiscompute dblockwise,andtheclusterlabels areobtainedbyexaminingonlyoneblockatatime[196].TheK ASPalgorithmrstclusters theinputdatasetinto m clustersusing k -means,andthenexecutesspectralclusteringonthe m ( C ˝ m ˝ n ) clustercenterstoobtainthe C clusters[191].TheRASPclusteringmethod rstpartitionsthedataspaceusingRandomProjection(RP) trees[191].RPtreesaredatastruc- turesthatpartitionthedataspaceinto m cells,bysplittingrecursivelyalongonerandomlychosen coordinateatatime.Eachcellinthepartitionisrepresent edbyitscenter,andspectralclustering isexecutedonthe m representativecenters.Thesemethodsreducetherunningt imecomplexity ofclusteringto O ( nm + m 3 ) .Chen etal. sparsiedthesimilaritymatrixbyretainingonlythe similarityvaluescorrespondingtothenearest p neighborsforeachnode,andproposedasimple schemetoparallelizethesimilaritycomputationandclust ering[35].Thenearestneighborsare foundusing kd -trees[131]andmetrictrees[176],therebyreducingtheov erallmemoryrequire- mentto O ( np ) ,althoughtherunningtimecomplexityisstill O ( n 2 log( p )) .TheGEM(Graph Extraction+weightedkernel k -Means)algorithmproposedin[186]speedsupkernel k -meansfor socialnetworkgraphsbyeliminatingthenodeswithlowdegr ee.Itmakesuseofthepowerlaw distributionofsocialnetworks,whichindicatesthatasma llsetofhighdegreeverticescovera largeportionofthenetwork. 48 2.3ApproximateKernelk-means Givenadataset D = f x 1 ;:::; x n g ,andakernelfunction ( ; ) ,kernel k -meansnds C clusters, whosecenters c k ( ) arerepresentedaslinearcombinationsofallthepointsint hedataset,in accordancewiththerepresentertheorem[158],i.e. c k ( )= n X i =1 b U k;i ( x i ; ) ;k 2 [C ];(2.4) where b U istheclustermembershipmatrixnormalizedbythenumberof pointsineachcluster,as denedin(2.8).Inotherwords,theclustercenterslieinth esubspacespannedbyallthedata points,i.e. c k ( ) 2H = span ( ( x 1 ; ) ;:::; ( x n ; )) ;k 2 [C ].Asaconsequence,thekernel k -meansalgorithmrequiresthecomputationof O ( n 2 ) kernelsimilarityvalues,leadingtoitsnon- scalability. Wecanavoidcomputingthefullkernelmatrixifwerestrictt hesolutionfortheclustercenters toasmallersubspace H a ˆH . H a shouldbeconstructedsuchthat (i) H a issmallenoughtoallowefcientcomputation,and (ii) H a isrichenoughtoyielddatapartitionssimilartothoseobta inedusing H . Weemployasimplerandomizedapproachforconstructing H a :werandomlysample m datapoints ( m ˝ n ),denotedby b D = f b x 1 ;:::; b x m g ,andconstructthesubspace H a = span ( b x 1 ;:::; b x m ) . 
Giventhesubspace H a ,wemodifythekernel k -meansoptimizationproblem(1.7)as min U 2P max f c k ( ) 2H a g C k =1 C X k =1 n X i =1 U k;i jj c k ( ) ( x i ; ) jj 2 H ;(2.5) where U =( u 1 ;:::; u C ) > istheclustermembershipmatrix, c k ( ) 2H a ;k 2 [C ]arethecluster centers,anddomain P = f U 2f 0 ;1 g C n :U > 1 = 1 g ,where 1 isavectorofallones.Let K B 2< n m representthekernelsimilaritymatrixbetweendatapoints in D andthesampleddata 49 points b D ,and b K 2< m m representthekernelsimilaritybetweenthesampleddatapo ints.The followinglemmaallowsustoreduce(2.5)toanoptimization probleminvolvingonlythecluster membershipmatrix U . Lemma1. Giventheclustermembershipmatrix U ,theoptimalclustercentersin (2.5) aregiven by c k ( )= m X i =1 k;i ( b x i ; ) ;(2.6) where = b UK B b K 1 .Theoptimizationproblemfor U isgivenby min U tr ( K ) tr ( e UK B b K 1 K > B e U > ) ;(2.7) where b U and e U aredenedby b U =( b u 1 ;:::; b u C ) > =[ diag ( n 1 ;:::;n C )] 1 U; e U =( e u 1 ;:::; e u C ) > =[ diag ( p n 1 ;:::; p n C )] 1 U; and n k = u > k 1 ;k 2 [C ]:(2.8) Proof. Let ' i =( ( x i ;b x 1 ) ;:::; ( x i ;b x m )) and i =( i; 1 ;:::; i;m ) bethe i th rowsofmatrices K B and respectively.As c k ( ) 2H a = span ( b x 1 ;:::; b x m ) ,wecanexpress c k ( ) as c k ( )= m X i =1 k;i ( b x i ; ) ;50 andwritetheobjectivefunctionin(2.5)as C X k =1 n X i =1 U k;i jj c k ( ) ( x i ; ) jj 2 H C X k =1 n X i =1 U k;i m X j =1 k;j ( b x j ; ) ( x i ; ) 2 H = tr ( K )+ C X k =1 n k > k b K k 2 u > k K B k :(2.9) Byminimizingtheaboveexpressionwithrespectto k ,wehave k = b K 1 K > B b u k ;k 2 [C ](2.10) andtherefore, = b UK B b K 1 .Wecompletetheproofbysubstitutingtheexpressionfor into (2.9). AsindicatedbyLemma1,weneedtocomputeonly K B forndingtheclustermemberships. b K ispartof K B andthereforedoesnotneedtobecomputedseparately.When m ˝ n ,this computationalcostwouldbesignicantlysmallerthanthat ofcomputingthefullmatrix. Werefertotheproposedalgorithmas ApproximateKernel k -means ,outlinedinAlgorithm3. Figure2.1illustratesthealgorithmonatwo-dimensionals yntheticdatasetcontainingtwosemi- circles.Exceptforafewpointswhicharemisclustered,the resultissimilartothatofkernel k -means.Table2.1comparestheconfusionmatricesofthepar titionsobtainedusingtheapproxi- matekernel k -meansalgorithm,withthoseofthekernel k -meansandthe k -meansalgorithms.A confusionmatrixshowsthemappingbetweenthetrueclassla belsandtheclusterlabels.Each clusterisassignedaclasslabel,correspondingtothetrue labelofthemajorityofthedatapointsin thecluster.Eachentry ( k;c ) intheconfusionmatrixrepresentthenumberofdatapointsf romclass c assignedtocluster k .Thediagonalentriesrepresentthenumberofpointsthatha vebeenassigned tothecorrectcluster.ItisclearfromTable2.1thatthepro posedalgorithmachievesclusterquality 51 comparabletothatofthekernel k -meansalgorithm,andismuchmoreaccuratethanthe k -means algorithm. Algorithm3 ApproximateKernel k -means 1: Input : D = f x 1 ;:::; x n g ;x i 2< d :thesetof nd -dimensionaldatapointstobeclustered ( ; ): < d < d 7!< :thekernelfunction C :thenumberofclusters m :thenumberofrandomlysampleddatapoints( C 1 = 1 . 6: Set t =0 . 7: repeat 8: Set t = t +1 . 9: Computethe ` 1 normalizedmembershipmatrix b U by b U =[ diag ( U 1 )] 1 U . 10: Calculate = b UT . 11: for i =1 ;:::;n do 12: Findtheclosestclustercenter k for x i by k =argmin k 2 [ C ] > k b K k 2 ' > i k ;where k and ' i arethe k th and i th rowsofmatrices and K B ,respectively. 13: Updatethe i th columnof U by U k;i =1 for k = k andzerootherwise. 
2.3.1 Parameters

In addition to the kernel function and the number of clusters, the approximate kernel $k$-means is parameterized by the sample size $m$ and the random sampling technique employed to obtain the subset $\hat{\mathcal{D}}$. These parameters play a crucial role in determining the clustering performance of the proposed algorithm.

[Figure 2.1 (five panels (a)-(e)): Illustration of the approximate kernel $k$-means algorithm on the two-dimensional semi-circles data set containing 500 points (250 points in each of the two clusters). Panel (a) shows all the data points (in red) and the uniformly sampled points (in blue). Panels (b)-(e) show the process of discovery of the two clusters in the data set and their centers in the input space (represented by x) by the approximate kernel $k$-means algorithm.]

Table 2.1 Comparison of the confusion matrices of the approximate kernel $k$-means, kernel $k$-means and $k$-means algorithms for the two-dimensional semi-circles data set, containing 500 points (250 points in each of the two clusters). The approximate kernel $k$-means algorithm achieves cluster quality comparable to that of the kernel $k$-means algorithm.

(a) Approximate kernel $k$-means        (b) Kernel $k$-means        (c) $k$-means
            Class 1  Class 2                     Class 1  Class 2             Class 1  Class 2
Cluster 1   245      4                 Cluster 1  250      0       Cluster 1  132      129
Cluster 2   5        246               Cluster 2  0        250     Cluster 2  118      121

2.3.1.1 Sample size

By comparing the optimization problem of approximate kernel $k$-means in (2.7) with the kernel $k$-means problem in (1.10), we can observe that the approximate kernel $k$-means problem can be viewed as the kernel $k$-means problem in which the kernel matrix $K$ is replaced by its Nyström approximation $K_B \hat{K}^{-1} K_B^\top$. Therefore, the clustering performance of the approximate kernel $k$-means problem will be close to the clustering performance of kernel $k$-means if the approximation error $\| K - K_B \hat{K}^{-1} K_B^\top \|$ is small. The following lemma, adapted from [74], characterizes the number of samples required to obtain a good approximation.

Lemma 2. Let $\{\lambda_k, v_k\}_{k=1}^n$ denote the eigenvalues and eigenvectors of the kernel matrix $K$. Let $V_C = (v_1,\ldots,v_C)$ denote the eigenvectors corresponding to the dominant $C$ eigenvalues of $K$. Define the "coherence" of the dominant $C$-dimensional invariant subspace of $K$ as
\[
\tau = \frac{n}{C} \max_{1 \le i \le n} \big\| V_C^{(i)} \big\|_2^2, \qquad (2.11)
\]
where $V_C^{(i)}$ is the $i$th row of $V_C$. Assume that the eigengap $\lambda_C - \lambda_{C+1}$ is sufficiently large. For any $\delta \in (0,1)$, we have
\[
\big\| K - K_B \hat{K}^{-1} K_B^\top \big\|_2 \le \lambda_{C+1} \Big( 1 + \frac{2n}{m} \Big)
\]
with probability $1 - \delta$, provided $m \ge 8\,\tau C \log(C/\delta)$.

The coherence $\tau$ of a matrix is a measure of the number of informative columns in the matrix. When the coherence is low, few columns are sufficient to obtain an accurate approximation. Lemma 2 indicates that the approximation error reduces at a rate of $O(1/m)$ with increasing $m$. In our experiments, we examined the performance of our algorithm for different sample sizes $m$, ranging from 0.001% to 15% of the data set size $n$, and observed that setting $m$ equal to 0.01% to 0.05% of $n$ leads to a satisfactory performance.
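The $O(1/m)$ behavior suggested by Lemma 2 is easy to observe on toy data. The check below is an illustration only (the proposed algorithm never forms the full $K$); it reuses the hypothetical helpers from the sampling sketch in this section:

```python
# Toy check of the Nystrom approximation error as the sample size m grows;
# the small ridge keeps the inverse stable when K_hat is near-singular.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
K = rbf_kernel(X, X, sigma=1.0)          # feasible only at this toy scale
for m in (20, 50, 100, 200, 400):
    K_B, K_hat, _ = sample_subspace_matrices(X, m, 1.0, rng)
    K_approx = K_B @ np.linalg.solve(K_hat + 1e-8 * np.eye(m), K_B.T)
    print(m, np.linalg.norm(K - K_approx, 2))   # spectral-norm error shrinks with m
```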
2.3.1.2 Sampling strategies

Another important factor that influences the proposed approximate kernel $k$-means algorithm is the sampling distribution employed to construct the kernel approximation. The simplest sampling technique is uniform random sampling, i.e. each point is selected with a probability $1/n$. Several non-uniform sampling and greedy approaches to perform low-rank matrix approximation have been studied in the literature:

(i) Diagonal sampling involves choosing a data point $x_i$ with a probability proportional to the diagonal element $\kappa(x_i, x_i)$ [17,57]. This distribution is the same as the uniform distribution for exponential kernels of the form
\[
\kappa(x_a, x_b) = \exp\big( -\|x_a - x_b\|_p^q \big), \quad p, q > 0,
\]
such as the RBF kernel and the Laplacian kernel, because all the diagonal entries are equal to one another.

(ii) Column-norm sampling involves choosing $x_i$ with a probability proportional to the $\ell_2$ norm of the column vector $K(\cdot, x_i)$ [69].

(iii) In [195], $k$-means is applied to the data set, and the cluster centers obtained are used in place of the sampled data set $\hat{\mathcal{D}}$.

(iv) Adaptive sampling techniques involve selecting data points sequentially, to ensure maximum coverage of the data [50,102,114,142]. For example, a greedy selection procedure which selects the point that is farthest from the currently selected set of points is employed in [142]. Liu et al. propose selecting the data point which would form a subspace with the previously chosen points, such that the total distance of the unsampled data points to this subspace is minimized [114].

(v) Sampling based on the importance of the data point, in terms of the statistical leverage scores and the coherence of the data, is employed in [170].

The non-uniform sampling techniques like column-norm sampling, adaptive sampling and importance sampling have $O(n^2)$ running time complexity. Hence, they are infeasible for large data sets. Sampling using $k$-means can be performed in $O(nm)$ time. Uniform and diagonal sampling have linear time complexity. Kumar et al. compared the diagonal and column sampling techniques with uniform sampling, and showed that uniform sampling without replacement is more effective than the non-uniform sampling techniques [103]. We explore some of these techniques empirically in Section 2.4.

2.3.2 Analysis

In this section, we first analyze the computational complexity of the proposed approximate kernel $k$-means algorithm, and then examine the quality of the data partitions generated by the proposed algorithm.

2.3.2.1 Computational complexity

Assuming a uniform sampling strategy, sampling can be performed in $O(n)$ time. The most expensive operations in the proposed algorithm are the matrix inversion $\hat{K}^{-1}$ and the calculation of the matrix $T = K_B \hat{K}^{-1}$, which have a total computational cost of $O(m^3 + m^2 n)$. The cost of computing $\alpha$ and updating the membership matrix $U$ is $O(mnCl)$, where $l$ is the number of iterations needed for convergence. Hence, the overall running time complexity of the approximate kernel $k$-means algorithm is $O(m^3 + m^2 n + mnCl)$. We can further reduce the computational complexity by avoiding the matrix inversion $\hat{K}^{-1}$ and formulating the calculation of $\alpha = \hat{U} T = \hat{U} K_B \hat{K}^{-1}$ as the following optimization problem:
\[
\min_{\alpha \in \Re^{C\times m}}\; \frac{1}{2}\, tr(\alpha \hat{K} \alpha^\top) - tr(\hat{U} K_B \alpha^\top). \qquad (2.12)
\]
If $\hat{K}$ is well conditioned (i.e. the minimum eigenvalue of $\hat{K}$ is significantly larger than zero), we can solve the optimization problem in (2.12) by a simple gradient descent method with a convergence rate of $O(\log(1/\varepsilon))$, where $\varepsilon$ is the desired accuracy. As the computational cost of each step in the gradient descent method is $O(m^2 C)$, the overall computational cost is only $O(m^2 C l \log(1/\varepsilon)) \ll O(m^3)$ when $Cl \ll m$. This reduces the overall computational cost to $O(m^2 Cl + mnCl + m^2 n)$. As the largest matrix that needs to be stored in memory is $K_B$, the memory requirement is only $O(mn)$. This is a dramatic decrease in the running time and memory requirements for large data sets when compared to the $O(n^2)$ complexity of kernel $k$-means. The running time complexity of approximate kernel $k$-means is also lower than that of the Nyström approximation based spectral clustering algorithm, which needs to compute the eigenvectors of the $m \times m$ matrix $\hat{K}$ in $O(m^3)$ time.
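As an illustration of this inversion-free variant, the sketch below minimizes (2.12) by plain gradient descent; the gradient is $\alpha\hat{K} - \hat{U}K_B$, and deriving the step size from the largest eigenvalue of $\hat{K}$ is a conservative choice made here for concreteness:

```python
# A minimal gradient-descent sketch for (2.12), avoiding the explicit
# inversion of K_hat; names follow the earlier sketches.
import numpy as np

def solve_alpha(U_hat, K_B, K_hat, n_steps=200):
    """Minimize 0.5 tr(alpha K_hat alpha^T) - tr(U_hat K_B alpha^T)."""
    G = U_hat @ K_B                             # constant part of the gradient, C x m
    alpha = np.zeros_like(G)
    eta = 1.0 / np.linalg.eigvalsh(K_hat)[-1]   # step size 1 / lambda_max(K_hat)
    for _ in range(n_steps):
        alpha -= eta * (alpha @ K_hat - G)      # gradient step on (2.12)
    return alpha
```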
2.3.2.2 Approximation error

In this section, we compare the clustering error of approximate kernel $k$-means with that of kernel $k$-means. The only difference between the two algorithms is the fact that approximate kernel $k$-means restricts the cluster centers to a small subspace $\mathcal{H}_a$, constructed using the sampled data points. Our analysis will therefore be focused on bounding the expected error due to this constraint.

Let the binary random variables $\xi = (\xi_1, \xi_2, \ldots, \xi_n)^\top \in \{0,1\}^n$ represent the sampling vector, i.e. $\xi_i = 1$ if $x_i \in \hat{\mathcal{D}}$ and zero otherwise. The following proposition allows us to write the clustering error in terms of the random variable $\xi$ (here $\circ$ denotes the element-wise product):

Proposition 1. Given the cluster membership matrix $U = (u_1,\ldots,u_C)^\top$, the clustering error can be expressed in $\xi$ as
\[
\mathcal{L}(U,\xi) = tr(K) + \sum_{k=1}^{C} \mathcal{L}_k(U,\xi), \qquad (2.13)
\]
where $\mathcal{L}_k(U,\xi)$ is
\[
\mathcal{L}_k(U,\xi) = \min_{\alpha_k \in \Re^n}\; -2\, u_k^\top K (\alpha_k \circ \xi) + n_k\, (\alpha_k \circ \xi)^\top K (\alpha_k \circ \xi). \qquad (2.14)
\]

Note that approximate kernel $k$-means becomes equivalent to kernel $k$-means when $\xi = \mathbf{1}$, where $\mathbf{1}$ is a vector of all ones, implying that all the data points are selected for constructing the subspace $\mathcal{H}_a$. As a result, $\mathcal{L}(U,\mathbf{1})$ is the clustering error of the kernel $k$-means algorithm.

The following lemma relates the expected clustering error of approximate kernel $k$-means with that of kernel $k$-means.

Lemma 3. Given the membership matrix $U$, the expectation of $\mathcal{L}(U,\xi)$ is bounded as follows:
\[
E_\xi[\mathcal{L}(U,\xi)] \le \mathcal{L}(U,\mathbf{1}) + tr\Big( \tilde{U} \Big[ K^{-1} + \frac{m}{n}\, [diag(K)]^{-1} \Big]^{-1} \tilde{U}^\top \Big), \qquad (2.15)
\]
where $\mathcal{L}(U,\mathbf{1}) = tr(K) - tr(\tilde{U} K \tilde{U}^\top)$.

Proof. We first bound $E_\xi[\mathcal{L}_k(U,\xi)]$ as
\[
\frac{1}{n_k} E_\xi[\mathcal{L}_k(U,\xi)]
= E_\xi\Big[ \min_\alpha\; -2\, \hat{u}_k^\top K (\alpha \circ \xi) + (\alpha \circ \xi)^\top K (\alpha \circ \xi) \Big]
\le \min_\alpha\; E_\xi\Big[ -2\, \hat{u}_k^\top K (\alpha \circ \xi) + (\alpha \circ \xi)^\top K (\alpha \circ \xi) \Big]
\]
\[
= \min_\alpha\; -\frac{2m}{n}\, \hat{u}_k^\top K \alpha + \frac{m^2}{n^2}\, \alpha^\top K \alpha + \frac{m}{n}\Big(1 - \frac{m}{n}\Big) \alpha^\top diag(K)\, \alpha
\le \min_\alpha\; -\frac{2m}{n}\, \hat{u}_k^\top K \alpha + \frac{m}{n}\, \alpha^\top \Big[ \frac{m}{n} K + diag(K) \Big] \alpha.
\]
By minimizing the above expression with respect to $\alpha$, we obtain
\[
\alpha = \Big[ \frac{m}{n} K + diag(K) \Big]^{-1} K \hat{u}_k.
\]
Therefore,
\[
\frac{1}{n_k} E_\xi[\mathcal{L}_k(U,\xi)] \le -\frac{m}{n}\, \hat{u}_k^\top K \Big[ \frac{m}{n} K + diag(K) \Big]^{-1} K \hat{u}_k.
\]
$E_\xi[\mathcal{L}_k(U,\xi)]$ can then be bounded as
\[
E_\xi[\mathcal{L}_k(U,\xi)] + n_k\, \hat{u}_k^\top K \hat{u}_k
\le n_k\, \hat{u}_k^\top \Big( K - K \Big[ K + \frac{n}{m}\, diag(K) \Big]^{-1} K \Big) \hat{u}_k
= \tilde{u}_k^\top \Big[ K^{-1} + \frac{m}{n}\, [diag(K)]^{-1} \Big]^{-1} \tilde{u}_k.
\]
We complete the proof by adding up $E_\xi[\mathcal{L}_k(U,\xi)]$ over $k$, and using the fact that
\[
\mathcal{L}_k(U,\mathbf{1}) = \min_\alpha\; -2\, u_k^\top K \alpha + n_k\, \alpha^\top K \alpha = -\tilde{u}_k^\top K \tilde{u}_k.
\]

The above result can be interpreted in terms of the eigenvalues of the kernel matrix.

Corollary 1. Assume $\kappa(x,x) \le 1$ for any $x$. Let $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n \ge 0$ be the eigenvalues of the matrix $K$. Given the membership matrix $U$, we have
\[
E_\xi[\mathcal{L}(U,\xi)] \le \mathcal{L}(U,\mathbf{1}) \left( 1 + \frac{ \sum_{i=1}^{C} \lambda_i / [1 + \lambda_i m/n] }{ tr(K) - \sum_{i=1}^{C} \lambda_i } \right)
\le \mathcal{L}(U,\mathbf{1}) \left( 1 + \frac{ C/m }{ \sum_{i=C+1}^{n} \lambda_i / n } \right).
\]

Proof. As $\kappa(x,x) \le 1$ for any $x$, we have $diag(K) \preceq I$, where $I$ is an identity matrix. As $\tilde{U}$ is an $\ell_2$ normalized matrix, we have
\[
tr\Big( \tilde{U} \Big[ K^{-1} + \frac{m}{n}[diag(K)]^{-1} \Big]^{-1} \tilde{U}^\top \Big)
\le tr\Big( \tilde{U} \Big[ K^{-1} + \frac{m}{n} I \Big]^{-1} \tilde{U}^\top \Big)
\le \sum_{i=1}^{C} \frac{\lambda_i}{1 + m\lambda_i/n} \le \frac{Cn}{m},
\]
and
\[
\mathcal{L}(U,\mathbf{1}) = tr(K - \tilde{U} K \tilde{U}^\top) \ge tr(K) - \sum_{i=1}^{C} \lambda_i.
\]
We complete the proof by combining the above inequalities.

To illustrate the result of Corollary 1, consider a special kernel matrix $K$ that has its first $a$ eigenvalues equal to $n/a$ and the remaining eigenvalues equal to zero, i.e. $\lambda_1 = \ldots = \lambda_a = n/a$ and $\lambda_{a+1} = \ldots = \lambda_n = 0$. We further assume $a > 2C$, i.e. the number of non-zero eigenvalues of $K$ is larger than twice the number of clusters. Then, according to Corollary 1, we have
\[
\frac{ E_\xi[\mathcal{L}(U,\xi)] - \mathcal{L}(U,\mathbf{1}) }{ \mathcal{L}(U,\mathbf{1}) } \le \frac{Ca}{m(a - C)} \le \frac{2C}{m},
\]
indicating that when the number of non-zero eigenvalues of $K$ is significantly larger than the number of clusters, the difference in the clustering errors of kernel $k$-means and our approximation scheme will decrease at the rate of $O(1/m)$. This result concurs with the result of Lemma 2.

2.3.3 Distributed Clustering

As the proposed approximate kernel $k$-means algorithm has $O(nm)$ running time complexity, it is easier to parallelize than the kernel $k$-means algorithm. In Algorithm 4, we propose a scheme to parallelize approximate kernel $k$-means. The key idea is to distribute the kernel computation, and perform approximate clustering using a relatively smaller matrix.
Algorithm 4 Distributed Approximate Kernel $k$-means
1: Input:
   $\mathcal{D} = \{x_1,\ldots,x_n\}$, $x_i \in \Re^d$: the set of $n$ $d$-dimensional data points to be clustered
   $\kappa(\cdot,\cdot): \Re^d \times \Re^d \mapsto \Re$: the kernel function
   $C$: the number of clusters
   $m$: the number of randomly sampled data points ($C < m \ll n$)
   $P$: the number of tasks
2: Output: cluster membership matrix $U$
3: Randomly sample $m$ data points from $\mathcal{D}$, denoted by $\hat{\mathcal{D}} = \{\hat{x}_1,\ldots,\hat{x}_m\}$, and compute the matrix $\hat{K}$.
4: Randomly split the remaining $n - m$ points into $P$ parts $\{\mathcal{D}^1,\ldots,\mathcal{D}^P\}$ of $s$ points each, and assign part $\mathcal{D}^l$ to task $l$.
// Parallel execution for each task $l \in [P]$
5: Compute the kernel matrix $K_B^l$ between the points in $\mathcal{D}^l$ and the sampled points $\hat{\mathcal{D}}$.
6: Compute $T^l = K_B^l \hat{K}^{-1}$.
7: Randomly initialize the membership matrix $U^l$, ensuring $[U^l]^\top \mathbf{1} = \mathbf{1}$.
8: Set $t = 0$.
9: repeat
10:  Set $t = t + 1$.
11:  Calculate $\alpha^l = [diag(U^l \mathbf{1})]^{-1} U^l T^l$.
12:  for $i = 1,\ldots,s$ do
13:    Find the closest cluster center $k^*$ for $x_i \in \mathcal{D}^l$ by
       $k^* = \arg\min_{k\in[C]}\; [\alpha_k^l]^\top \hat{K} \alpha_k^l - 2\, [\varphi_i^l]^\top \alpha_k^l$,
       where $\alpha_k^l$ and $\varphi_i^l$ are the $k$th and $i$th rows of the matrices $\alpha^l$ and $K_B^l$, respectively.
14:    Update the $i$th column of $U^l$ by $U^l_{k,i} = 1$ for $k = k^*$ and zero otherwise.
15:  end for
16: until the membership matrix $U^l$ does not change or $t > MAXITER$
17: for each point $x_i \notin \mathcal{D}^l$ do
18:   Find the closest cluster center $k^*$ by
      $k^* = \arg\min_{k\in[C]}\; [\alpha_k^l]^\top \hat{K} \alpha_k^l - 2\, [\varphi_i^l]^\top \alpha_k^l$,
      where $\alpha_k^l$ is the $k$th row of $\alpha^l$ and $\varphi_i^l = (\kappa(x_i,\hat{x}_1),\ldots,\kappa(x_i,\hat{x}_m))$.
19:   Update the $i$th column of $U^l$ by $U^l_{k,i} = 1$ for $k = k^*$ and zero otherwise.
20: end for
21: end parallel execution
// Master task
22: Randomly select an index $l$ and set $U = U^l$, or combine the matrices $\{U^l\}_{l=1}^P$ using an ensemble clustering algorithm (e.g. the Meta-clustering algorithm described in Algorithm 5).

Algorithm 5 Meta-Clustering Algorithm
1: Input: cluster membership matrices $\{U^l\}_{l=1}^P$, $U^l \in \{0,1\}^{C\times n}$
2: Output: consensus cluster membership matrix $U^*$
3: Concatenate the membership matrices $\{U^l\}_{l=1}^P$ to obtain a $PC \times n$ matrix $\mathcal{U} = (u_1, u_2, \ldots, u_{PC})^\top$.
4: Compute the Jaccard similarity $s_{i,j}$ between the vectors $u_i$ and $u_j$, $i,j \in [PC]$, using
\[
s_{i,j} = \frac{u_i^\top u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^\top u_j}.
\]
5: Construct a complete weighted meta-graph $G = (V,E)$, where the vertex set $V = \{u_1, u_2,\ldots, u_{PC}\}$ and each edge $(u_i, u_j)$ is weighted by $s_{i,j}$.
6: Partition $G$ into $C$ meta-clusters $\{\pi_k\}_{k=1}^C$, where $\pi_k = \{ u_k^{(1)}, u_k^{(2)}, \ldots, u_k^{(s_k)} \}$.
7: Compute the mean vector $\mu_k$ for each meta-cluster using
\[
\mu_k = \frac{1}{s_k} \sum_{i=1}^{s_k} u_k^{(i)}, \quad k \in [C].
\]
8: for $i = 1,\ldots,n$ do
9:   Update the $i$th column of $U^*$ as
\[
U^*_{k,i} = \begin{cases} 1 & \text{if } k = \arg\max_{k' \in [C]} \mu_{k',i} \\ 0 & \text{otherwise} \end{cases}
\]
10: end for

We first sample $m$ points $\hat{\mathcal{D}} = \{\hat{x}_1,\ldots,\hat{x}_m\}$ from the data set, and randomly split the remaining $n - m$ data points into $P$ parts $\mathcal{D}^1,\ldots,\mathcal{D}^P$. Let the matrix $\hat{K} = [\kappa(\hat{x}_i, \hat{x}_j)]$, where $\hat{x}_i, \hat{x}_j \in \hat{\mathcal{D}}$. We then map each partition to a processing node. Each node computes the kernel matrix $K_B^l = [\kappa(x_i, \hat{x}_j)]$, where $x_i \in \mathcal{D}^l$, the set of points assigned to the node, and finds the cluster labels for the $s$ points in $\mathcal{D}^l$ and the corresponding cluster centers, using the matrices $K_B^l$ and $\hat{K}$. Each point $x_i \notin \mathcal{D}^l$ is then assigned to the cluster whose center is closest. This process generates $P$ cluster membership matrices $\{U^l\}_{l=1}^P$, $U^l \in \{0,1\}^{C\times n}$. To obtain the final cluster membership matrix $U$, we can either randomly choose one index $l$ and set $U = U^l$, or combine them using an ensemble clustering algorithm.
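A serial stand-in may clarify the data flow of Algorithm 4: the loop body below is what each of the $P$ tasks would execute in parallel, and the master step simply keeps one partition at random (MCLA would combine all $P$ of them instead). The helpers are the hypothetical sketches from earlier in this section:

```python
# A serial illustration of Algorithm 4; each pass of the loop corresponds
# to one parallel task. Only the m sampled points (via K_hat) are shared.
import numpy as np

def distributed_approx_kernel_kmeans(X, m, P, C, sigma, rng):
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)          # shared sampled subset D_hat
    K_B = rbf_kernel(X, X[idx], sigma)                  # n x m block (computed blockwise in practice)
    K_hat = K_B[idx]
    parts = np.array_split(rng.permutation(np.setdiff1d(np.arange(n), idx)), P)
    memberships = []
    for part in parts:                                  # each pass = one parallel task
        _, alpha_l = approximate_kernel_kmeans(K_B[part], K_hat, C, rng=rng)
        # steps 17-20 of Algorithm 4: assign the points outside D^l as well
        d = np.einsum('km,ml,kl->k', alpha_l, K_hat, alpha_l)[None, :] - 2.0 * K_B @ alpha_l.T
        memberships.append(d.argmin(axis=1))            # a C-way partition of all n points
    # master step: keep one partition at random; MCLA (Algorithm 5) could
    # combine all P partitions instead
    return memberships[rng.integers(len(memberships))]
```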
The objective of ensemble clustering [180] is to combine multiple partitions of the given data set. A popular ensemble clustering algorithm is the Meta-Clustering algorithm (MCLA) [168], described in Algorithm 5. It maximizes the average normalized mutual information between the partitions using hypergraph partitioning. Given $P$ cluster membership matrices $\{U^1,\ldots,U^P\}$, where $U^l = (u_1^l,\ldots,u_C^l)^\top$, the objective of this algorithm is to find a consensus membership matrix $U^*$ that maximizes the Average Normalized Mutual Information, defined as
\[
ANMI = \frac{1}{P} \sum_{l=1}^{P} NMI(U^*, U^l), \qquad (2.16)
\]
where $NMI(U^a, U^b)$, the Normalized Mutual Information (NMI) [104] between two partitions $a$ and $b$, represented by the membership matrices $U^a$ and $U^b$ respectively, is defined by
\[
NMI(U^a, U^b) = \frac{ \displaystyle\sum_{i=1}^{C} \sum_{j=1}^{C} n_{i,j}^{a,b}\, \log\Big( \frac{n\, n_{i,j}^{a,b}}{n_i^a\, n_j^b} \Big) }{ \sqrt{ \Big( \displaystyle\sum_{i=1}^{C} n_i^a \log \frac{n_i^a}{n} \Big) \Big( \displaystyle\sum_{j=1}^{C} n_j^b \log \frac{n_j^b}{n} \Big) } }. \qquad (2.17)
\]
In equation (2.17), $n_i^a$ represents the number of data points that have been assigned label $i$ in partition $a$, and $n_{i,j}^{a,b}$ represents the number of data points that have been assigned label $i$ in partition $a$ and label $j$ in partition $b$. NMI values lie in the range $[0,1]$. An NMI value of 1 indicates perfect matching between the two partitions, whereas 0 indicates perfect mismatch. Maximizing (2.16) is a combinatorial optimization problem, and solving it exhaustively is computationally infeasible. MCLA obtains an approximate consensus solution by representing the set of partitions as a hypergraph. Each vector $u_k^l$, $k \in [C]$, $l \in [P]$, represents a vertex in a regular undirected graph, called the meta-graph. Vertex $u_i$ is connected to vertex $u_j$ by an edge whose weight is proportional to the Jaccard similarity between the two vectors $u_i$ and $u_j$:
\[
s_{i,j} = \frac{u_i^\top u_j}{\|u_i\|^2 + \|u_j\|^2 - u_i^\top u_j}. \qquad (2.18)
\]
This meta-graph is partitioned using a graph partitioning algorithm such as METIS [93] to obtain $C$ balanced meta-clusters $\{\pi_1, \pi_2, \ldots, \pi_C\}$. Each meta-cluster $\pi_k = \{u_k^{(1)}, u_k^{(2)}, \ldots, u_k^{(s_k)}\}$, containing $s_k$ vertices, is represented by the mean vector
\[
\mu_k = \frac{1}{s_k} \sum_{i=1}^{s_k} u_k^{(i)}. \qquad (2.19)
\]
The value $\mu_{k,i}$ represents the association between data point $x_i$ and the $k$th cluster. Each data point $x_i$ is assigned to the meta-cluster with which it is associated the most, breaking ties randomly, i.e.
\[
U^*_{k,i} = \begin{cases} 1 & \text{if } k = \arg\max_{k' \in [C]} \mu_{k',i} \\ 0 & \text{otherwise.} \end{cases} \qquad (2.20)
\]

By parallelizing the approximate kernel $k$-means algorithm, the running time complexity of the kernel calculation and the clustering reduces to $O(nm/P)$ and $O(m^2 C + mnC/P + m^2 n/P)$, respectively. If the ensemble clustering algorithm is employed to combine the partitions in the last step, an additional cost of $O(nC^2 P^2)$ is incurred. The communication overhead is minimal: only the $m$ sampled data points need to be replicated, in contrast to the $n$ data points that would need to be replicated across all the nodes in parallel kernel $k$-means.
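Since NMI is used both inside MCLA and as the evaluation measure in the experiments that follow, a direct transcription of (2.17) may be useful. The sketch assumes integer label vectors (numpy arrays) and is written for clarity rather than speed:

```python
# NMI between two partitions, following (2.17); labels_a and labels_b are
# integer label vectors of length n.
import numpy as np

def nmi(labels_a, labels_b):
    n = len(labels_a)
    counts = np.zeros((labels_a.max() + 1, labels_b.max() + 1))
    for i, j in zip(labels_a, labels_b):
        counts[i, j] += 1                          # n_{i,j}^{a,b}
    n_a, n_b = counts.sum(axis=1), counts.sum(axis=0)
    nz = counts > 0
    num = (counts[nz] * np.log(n * counts[nz] / np.outer(n_a, n_b)[nz])).sum()
    den = np.sqrt((n_a[n_a > 0] * np.log(n_a[n_a > 0] / n)).sum() *
                  (n_b[n_b > 0] * np.log(n_b[n_b > 0] / n)).sum())
    return num / den
```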
2.4 Experimental Results

In this section, we show that the approximate kernel $k$-means algorithm is an efficient and scalable variant of the kernel $k$-means algorithm. It has lower running time and memory requirements, but is on par with kernel $k$-means in terms of the clustering quality.

2.4.1 Datasets

We use the medium-sized CIFAR-10 and MNIST data sets, for which it is feasible but expensive to compute the $n \times n$ kernel matrix, to demonstrate that the proposed algorithm's clustering performance is similar to that of the kernel $k$-means algorithm, in terms of the cluster quality. We then demonstrate the efficiency of the proposed algorithm on large data sets, on a single processor, using the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets. We analyze the scalability of our algorithm using the synthetic concentric circles data set. We finally execute the distributed approximate kernel $k$-means on the Tiny data set and on the concentric circles data set containing a billion points.

2.4.2 Baselines

We first compared the proposed technique with the kernel $k$-means algorithm, to show that similar performance is achieved by our algorithm. We also gauged our algorithm's performance against that of the Nyström spectral clustering algorithm [67], which clusters the top $C$ eigenvectors of a low-rank approximate kernel matrix obtained through the Nyström approximation technique, and against the $k$-means algorithm, to show that our algorithm achieves better cluster quality.

2.4.3 Parameters

To define the inter-point similarity, we used the universal RBF kernel, with the kernel width parameter set equal to $\rho \bar{d}$, where $\bar{d}$ is the average pairwise Euclidean distance between the data points and the parameter $\rho$ is a value in the range $[0,1]$. (The average pairwise similarity was used only as a heuristic to set the RBF kernel width, and is not required by the proposed algorithm. Other techniques may be employed to choose the kernel and the kernel parameters.) The value which achieved the best NMI was employed. We evaluated the efficiency of the proposed algorithm for different sample sizes, ranging from $m = 100$ to $m = 2{,}000$. We selected these sample sizes to ensure that the true clusters in each data set are sufficiently represented in the sample, with high probability. For the purpose of evaluation, the number of clusters $C$ was set equal to the number of true classes in the data set.

All algorithms were implemented in MATLAB and run on a 2.8 GHz processor. (We used the $k$-means implementation in the MATLAB Statistics Toolbox and the Nyström approximation based spectral clustering implementation [35] available at http://alumni.cs.ucsb.edu/wychen/sc.html. The remaining algorithms were implemented in-house.) The memory used was explicitly limited to 40 GB. We executed each algorithm 10 times, and present the results averaged over these runs. Different permutations of the data set were input to the algorithm in each run.
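For concreteness, the kernel-width heuristic can be sketched as follows; subsampling the pairwise distances is an assumption made here to keep the estimate cheap, since the exact average over all pairs would itself cost $O(n^2)$:

```python
# Estimate the RBF kernel width as rho times the average pairwise
# Euclidean distance, computed on a random subsample.
import numpy as np
from scipy.spatial.distance import pdist

def rbf_width(X, rho, rng, n_sub=1000):
    sub = X[rng.choice(X.shape[0], size=min(n_sub, X.shape[0]), replace=False)]
    return rho * pdist(sub).mean()   # rho in [0, 1], tuned for best NMI
```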
2.4.4 Results

2.4.4.1 Running time

Table 2.2 Running time (in seconds) of the proposed approximate kernel $k$-means and the baseline algorithms. The sample size $m$ is set to 2,000 for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel $k$-means on the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets due to their large size. An approximate value of the running time of kernel $k$-means on these data sets is obtained by first executing kernel $k$-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center. Standard deviations are shown in parentheses.

Dataset             Approx. kernel       Nyström approx. based   Kernel                   k-means
                    k-means (proposed)   spectral clustering     k-means
CIFAR-10            37.01 (6.52)         116.13 (1.97)           725.32 (7.39)            159.22 (75.81)
MNIST               57.73 (12.94)        4,186.02 (386.17)       914.59 (235.14)          448.69 (177.24)
Forest Cover Type   157.48 (27.37)       573.55 (327.49)         4,721.03 (504.21)        40.88 (6.40)
Imagenet-34         1,261.02 (37.39)     1,841.47 (123.82)       154,416.48 (32,302.44)   31,076.41 (9,355.41)
Poker               256.26 (44.84)       520.48 (51.29)          9,942.40 (1,476.00)      40.88 (6.40)
Network Intrusion   891.08 (237.17)      1,682.46 (235.70)       34,784.56 (1,493.59)     953.41 (169.38)

[Figure 2.2: Example images from three clusters in the Imagenet-34 data set. The clusters represent (a) butterfly, (b) odometer, and (c) website images.]

The running times of the proposed algorithm for sample size $m = 2{,}000$ and of the baseline algorithms are recorded in Table 2.2. We observed that a speedup of over 90% was achieved by our algorithm when compared to kernel $k$-means on the CIFAR-10 and MNIST data sets. It is infeasible to calculate the $n \times n$ kernel for the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets. To gauge the efficiency of our algorithm against kernel $k$-means on these data sets, we randomly selected a set of 50,000 points from each of them, executed kernel $k$-means on this subset, and assigned cluster labels to the remaining points by finding the cluster whose center is closest. Our algorithm was faster than this version of the kernel $k$-means algorithm as well. Even the $k$-means algorithm was slower than the proposed approximate kernel $k$-means algorithm on most of the data sets, due to their high dimensionality. Our algorithm was also faster than the spectral clustering algorithm based on the Nyström approximation, because spectral clustering requires the eigendecomposition of the similarity matrix. The most time-consuming operation in our algorithm, the computation of the inverse matrix $\hat{K}^{-1}$, heavily influenced the clustering time.

2.4.4.2 Cluster quality

Figures 2.2 and 2.5 show examples of the clusters obtained from the Imagenet-34 and CIFAR-10 data sets, respectively, using the approximate kernel $k$-means algorithm. We assigned a class label to each cluster, based on the true class of the majority of the objects in the cluster.

The silhouette coefficients of the proposed algorithm are compared with those of the baseline algorithms on the CIFAR-10 and MNIST data sets in Figure 2.3. Computing the silhouette coefficient values for the partitions of the remaining data sets is computationally prohibitive. On both the CIFAR-10 and MNIST data sets, the silhouette coefficient values achieved by the proposed algorithm are close to those of the kernel $k$-means algorithm, showing that the two algorithms yield similar partitions. The Nyström approximation based spectral clustering algorithm achieves lower silhouette values, while the $k$-means algorithm achieves values close to 0, showing that the clusters obtained are not compact.

[Figure 2.3 (two panels: (a) CIFAR-10, (b) MNIST): Silhouette coefficient values of the partitions obtained using approximate kernel $k$-means, compared to those of the partitions obtained using the baseline algorithms. The sample size $m$ is set to 2,000 for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm.]
The NMI values achieved by the proposed algorithm against the baseline algorithms are shown in Figure 2.4. Due to the small size of the images in the CIFAR-10 data set, it is difficult to obtain a high clustering accuracy on this data set. Despite this difficulty, our algorithm partitioned the images into clusters similar to those obtained by using kernel $k$-means. The MNIST data set was also clustered into partitions similar to those obtained from kernel $k$-means. Our algorithm's prediction accuracy, in terms of the NMI with respect to the true class labels, is comparable to that of kernel $k$-means. The proposed algorithm's NMI values are marginally better than those of the approximate spectral clustering algorithm, because the spectral clustering algorithm uses only the top $C$ eigenvectors of the kernel matrix to determine the clusters, which may be too restrictive for these data sets. As expected, all the kernel-based algorithms performed better than $k$-means.

[Figure 2.4 (six panels: (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, (f) Network Intrusion): NMI values (in %) of the partitions obtained using approximate kernel $k$-means, with respect to the true class labels. The sample size $m$ is set to 2,000 for both the proposed algorithm and the Nyström approximation based spectral clustering algorithm. It is not feasible to execute kernel $k$-means on the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets due to their large size. The approximate NMI values of kernel $k$-means on these data sets are obtained by first executing kernel $k$-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.]

[Figure 2.5 (ten panels (a)-(j)): Example images from the clusters found in the CIFAR-10 data set using approximate kernel $k$-means. The clusters represent the following objects: (a) airplane, (b) automobile, (c) bird, (d) cat, (e) deer, (f) dog, (g) frog, (h) horse, (i) ship, and (j) truck.]

2.4.4.3 Parameter sensitivity

The proposed approximate kernel $k$-means algorithm is dependent on one crucial parameter: the sample size $m$. We study the effect of varying this parameter on the running time of the algorithm in Table 2.3, and on the cluster quality in Figure 2.6 (NMI values) and Figure 2.7 (silhouette coefficient values). We compare the performance of our algorithm against the Nyström approximation based spectral clustering algorithm, which also depends on the same parameter. In Table 2.3, the execution time is split into the time taken for computing the kernel matrix and the time taken for clustering the data points. The kernel computation time is common to the proposed algorithm and the Nyström approximation based spectral clustering algorithm. More time was spent in clustering than in kernel calculation, due to the simplicity of the RBF kernel. Though our algorithm took longer than the approximate spectral clustering algorithm for small sample sizes ($m \le 1{,}000$), the running time of the spectral clustering algorithm increased cubically with the number of samples. Our algorithm was faster for large sample sizes, when high cluster quality was achieved. The running time of our algorithm also increased as the sample size $m$ increased, but at a lower rate. The silhouette coefficient values of the proposed algorithm increased marginally as the sample size increased, and were higher than those achieved by the Nyström approximation based spectral clustering algorithm. The NMI values achieved by our algorithm were also higher than those achieved by the Nyström approximation based spectral clustering algorithm, especially when the sample size is large and spectral clustering is computationally expensive. Only on the Imagenet-34 data set does our algorithm perform marginally worse than the spectral clustering algorithm. There is a marginal improvement in the NMI of our algorithm as the sample size increases.
Table 2.3 Effect of the sample size $m$ on the running time (in seconds) of the proposed approximate kernel $k$-means clustering algorithm. Each sub-table reports the approximate kernel calculation time (common to both methods), the clustering time of the proposed algorithm, and the clustering time of the Nyström approximation based spectral clustering algorithm; standard deviations are shown in parentheses.

(a) CIFAR-10
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     0.34 (0.04)                  11.95 (4.62)                        0.57 (0.12)
200     0.87 (0.07)                  39.04 (15.04)                       0.99 (0.13)
500     1.36 (0.03)                  11.84 (2.11)                        4.25 (1.86)
1,000   3.63 (0.23)                  45.87 (21.94)                       22.61 (5.03)
2,000   4.60 (0.20)                  32.41 (6.32)                        111.53 (1.77)

(b) MNIST
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     0.65 (0.06)                  25.91 (3.05)                        7.20 (1.00)
200     1.06 (0.18)                  14.54 (7.85)                        49.56 (9.19)
500     1.99 (0.34)                  21.36 (8.35)                        348.86 (107.43)
1,000   3.32 (0.44)                  25.78 (6.78)                        920.34 (219.62)
2,000   5.81 (0.35)                  51.92 (12.59)                       4,180.21 (385.82)

(c) Forest Cover Type
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     1.40 (0.29)                  17.70 (6.06)                        10.35 (1.44)
200     1.64 (0.09)                  22.57 (12.39)                       16.83 (2.38)
500     3.82 (0.03)                  28.56 (11.61)                       50.11 (10.83)
1,000   11.14 (0.68)                 55.01 (18.57)                       137.26 (40.88)
2,000   22.80 (1.27)                 134.68 (26.10)                      550.75 (326.22)

(d) Imagenet-34
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     47.29 (1.12)                 504.41 (119.41)                     78.53 (7.14)
200     68.15 (0.16)                 608.24 (10.78)                      115.16 (4.47)
500     168.83 (0.27)                737.24 (209.26)                     292.69 (7.21)
1,000   181.93 (11.95)               847.06 (22.88)                      404.73 (79.77)
2,000   344.39 (3.77)                916.63 (33.62)                      1,497.08 (120.05)

(e) Poker
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     2.85 (0.36)                  53.02 (10.86)                       10.88 (1.65)
200     7.31 (1.40)                  81.83 (30.72)                       46.78 (4.21)
500     12.74 (2.41)                 104.83 (17.76)                      90.57 (18.57)
1,000   31.29 (2.64)                 171.55 (41.61)                      261.14 (20.51)
2,000   40.75 (3.83)                 215.51 (41.01)                      479.73 (47.46)

(f) Network Intrusion
m       Approx. kernel calculation   Approx. kernel k-means (proposed)   Nyström approx. based spectral clustering
100     7.52 (0.64)                  729.07 (237.67)                     241.84 (65.00)
200     13.82 (4.15)                 683.22 (438.10)                     200.48 (45.24)
500     41.36 (10.75)                339.77 (119.48)                     436.79 (206.47)
1,000   87.24 (10.54)                551.39 (78.01)                      668.91 (49.37)
2,000   115.14 (7.06)                775.94 (230.11)                     1,567.32 (228.64)

[Figure 2.6 (six panels: (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, (f) Network Intrusion; sample size $m$ on the horizontal axis): Effect of the sample size $m$ on the NMI values (in %) of the partitions obtained using approximate kernel $k$-means, with respect to the true class labels.]

[Figure 2.7 (two panels: (a) CIFAR-10, (b) MNIST; sample size $m$ on the horizontal axis): Effect of the sample size $m$ on the silhouette coefficient values of the partitions obtained using approximate kernel $k$-means.]

2.4.4.4 Sampling strategies

In our implementation of the proposed algorithm, we employed uniform random sampling to select the subset of data using which the kernel matrix is constructed. Other sampling strategies, such as column-norm sampling, diagonal sampling and $k$-means based sampling, may be used to select the samples. Table 2.4, Figure 2.8 and Figure 2.9 compare the running time, silhouette coefficient and NMI values, respectively, of the column-norm sampling and $k$-means sampling strategies with uniform random sampling. For column-norm sampling, we assume that the $n \times n$ kernel matrix is pre-computed, and only record the time taken for computing the column norms and the time taken for choosing the first $m$ indices as the sampling time. For $k$-means sampling, we record the time taken to execute $k$-means and find the representative samples. As expected, the sampling time for both the non-uniform sampling techniques was greater than the time required for uniform random sampling. Column-norm sampling is more expensive than $k$-means sampling, after the kernel computation time is taken into account. Both the non-uniform sampling techniques are only as accurate as uniform random sampling for substantially large sample sizes, both in terms of the silhouette coefficient values as well as the NMI values. This shows that the additional time spent on non-uniform sampling does not lead to a significant improvement in the performance, aligning with the results of earlier works such as [103].

Table 2.4 Comparison of the sampling times (in milliseconds) of the uniform, column-norm and $k$-means sampling strategies on the CIFAR-10 and MNIST data sets. The parameter $m$ represents the sample size; standard deviations are shown in parentheses.

        CIFAR-10                                                      MNIST
m       Uniform random   Column norm (x10^3)   k-means (x10^6)        Uniform random   Column norm (x10^3)   k-means (x10^6)
100     9.62 (1.62)      67.62 (2.31)          1.68 (0.43)            9.41 (1.74)      94.22 (3.97)          3.83 (0.542)
200     4.24 (1.12)      68.21 (3.49)          1.90 (0.20)            9.34 (1.16)      88.92 (4.44)          2.62 (0.254)
500     3.99 (0.65)      64.54 (4.26)          2.14 (0.14)            11.10 (3.81)     86.27 (0.94)          7.82 (3.42)
1,000   5.43 (0.87)      67.42 (5.59)          2.44 (0.16)            8.41 (1.38)      86.15 (0.70)          5.88 (1.78)
2,000   4.62 (2.20)      70.43 (7.20)          2.66 (0.03)            9.53 (1.94)      86.66 (0.85)          4.91 (0.207)

[Figure 2.8 (two panels: (a) CIFAR-10, (b) MNIST; sample size $m$ on the horizontal axis): Comparison of the silhouette coefficient values of the partitions obtained from approximate kernel $k$-means using the uniform, column-norm and $k$-means sampling strategies, on the CIFAR-10 and MNIST data sets.]

[Figure 2.9 (two panels: (a) CIFAR-10, (b) MNIST; sample size $m$ on the horizontal axis): Comparison of the NMI values (in %) of the partitions obtained from approximate kernel $k$-means using the uniform, column-norm and $k$-means sampling strategies, on the CIFAR-10 and MNIST data sets.]

2.4.4.5 Scalability analysis

We analyze the scalability of the proposed approximate kernel $k$-means for different values of $n$, $d$ and $C$, using the synthetic concentric circles data set. We employed the RBF kernel function to compute the approximate kernel matrices, and set the number of sampled points to $m = 1{,}000$ when $C < 100$, and to $m = 10C$ when $C \ge 100$. This was done in order to ensure that the condition imposed by Lemma 2 is satisfied.

Figure 2.10(a) shows that the running time of the algorithm varies nearly linearly as the number of points in the data set $n$ varies from 100 to 10 million, with the dimensionality $d = 100$ and the number of clusters $C = 10$. This concurs with our complexity analysis in Section 2.3.2.

We set the number of data points to $n = 10^6$ and the number of clusters to $C = 10$, and studied the effect of the data dimensionality on the performance of the proposed algorithm in Figure 2.10(b). The dimensionality of the data set plays an important role only in the calculation of the kernel. The RBF kernel is simple, and takes only a few hundred seconds to calculate, even for $n = 10^6$. The running time is dominated by the time taken for clustering. As a result, the running time varies minimally when the dimensionality of the data set varies from $d = 10$ to $d = 1{,}000$.
[Figure 2.10 (three panels; horizontal axes on log scale): Running time of the approximate kernel $k$-means algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.]

We fixed $n = 10^6$ and $d = 100$, increased the number of clusters in the data set from $C = 10$ to $C = 1{,}000$, and recorded the running time of our algorithm in Figure 2.10(c). As expected, the running time increases almost linearly with $C$. When $C < 100$, the number of samples $m$ is fixed to 1,000; therefore, the number of clusters has a significant effect only on the clustering time. When $C \ge 100$, the number of samples $m$ also needs to be increased, thereby affecting both the kernel calculation time and the clustering time.

2.4.5 Distributed Approximate Kernel $k$-means

On data sets of sizes greater than 10 million, the execution of approximate kernel $k$-means on a single processor is highly time-consuming. We employed the distributed approximate kernel $k$-means to cluster the Tiny images data set and the synthetic concentric circles data set.

We set the sample size to $m = 1{,}000$ and the number of tasks to $P = 1{,}024$. Each task was run on a 2.8 GHz processor, with a total of 100 GB shared memory. The RBF kernel was used for both data sets. The number of clusters was set to $C = 100$ and $C = 10$ for the Tiny images and concentric circles data sets, respectively.

The clustering performance of the distributed algorithm on the two data sets is presented in Table 2.5. When approximate kernel $k$-means was executed on the Tiny images data set on a single processor, it took about 8.5 hours. The distributed algorithm is able to cluster this data set in under 2 minutes. The concentric circles data set containing 1 billion points was also clustered in less than 15 minutes. The true class labels are not available for the Tiny image data set, so it was not possible to evaluate the cluster quality. On the concentric circles data set, an NMI of about 78% was achieved.

Table 2.5 Performance of the distributed approximate kernel $k$-means algorithm on the Tiny image data set and the concentric circles data set, with parameters $m = 1{,}000$ and $P = 1{,}024$. Running times are in seconds; standard deviations are shown in parentheses.
Dataset                      Tiny             Concentric circles
n                            79,302,017       1,000,000,000
d                            384              10
C                            100              10
Running time: kernel calc.   0.21 (0.07)      1.17 (0.09)
Running time: clustering     94.03 (6.58)     876.75 (163.06)
NMI (%)                      N/A              77.80 (0.10)

2.5 Summary

In this chapter, we presented the approximate kernel $k$-means algorithm, an efficient approximation of the kernel $k$-means clustering algorithm, suitable for big data sets. The key to the efficiency of approximate kernel $k$-means is the fact that it does not require the calculation of the pairwise similarities between all the data points. By restricting the cluster centers to lie in a subspace spanned by a small set of randomly sampled data points, it is able to compute the clusters using only a small portion of the kernel matrix. Consequently, it has lower running time and memory complexity than kernel $k$-means and other kernel-based clustering algorithms. We have shown theoretically that the difference in the clustering errors of the approximate kernel $k$-means and the kernel $k$-means algorithms reduces linearly as the number of sampled points increases. Experimental results also show that the performance of approximate kernel $k$-means is comparable to that of kernel $k$-means and other state-of-the-art approximate kernel clustering algorithms in terms of the cluster quality, while its running time is close to that of linear clustering algorithms such as $k$-means. Though not as easily parallelizable as $k$-means, it requires less data replication and communication than kernel $k$-means. Hence, it can handle distributed data sets more efficiently than kernel $k$-means. The proposed approximate kernel $k$-means achieves our objective of clustering big data sets efficiently and accurately.

Chapter 3

Kernel-based Clustering Using Random Feature Maps

3.1 Introduction

Although the approximate kernel $k$-means algorithm is accurate and scalable, it has the following limitations:

- The approximate kernel $k$-means algorithm samples a subset of $m$ points from the data set, and constructs an $n \times m$ kernel matrix $K_B$ between the $n$ points in the data set and the sampled points. When $n$ is in the order of billions, and the number of clusters is also comparably large, calculating the $O(nm)$ matrix $K_B$ may be infeasible. For instance, if we were to cluster the Tiny image data set containing 80 million images into 75,062 clusters (the true number of classes in the data set), approximate kernel $k$-means would require about $m = 10^5$ samples. This would boil down to calculating about 8 trillion similarity values, which is computationally expensive.

- Approximate kernel $k$-means cannot efficiently handle out-of-sample clustering, i.e. the problem of assigning new data points to clusters after the clustering is complete. In order to find the cluster label for a new point $x$, we need to compute
\[
\|c_k(\cdot) - \kappa(x,\cdot)\|_{\mathcal{H}}^2 = \alpha_k^\top \hat{K} \alpha_k - 2\, \varphi^\top \alpha_k, \quad k \in [C],
\]
where $\varphi = [\kappa(x,\hat{x}_1),\ldots,\kappa(x,\hat{x}_m)]$ and $\alpha_k$ is the $k$th row of the $C \times m$ matrix $\alpha$, containing the weights of the sampled points in each of the $C$ clusters. This operation has $O(m^2 C + mC^2 + md)$ running time complexity, and can be inefficient for large $m$.
To address the above limitations, we propose two algorithms which use random feature maps to obtain an $O(m)$-dimensional embedding of the Hilbert space associated with the kernel $\kappa(\cdot,\cdot)$, where $m \ll n$ [42]. Our first algorithm, called the RFF clustering algorithm, obtains vector representations of the data points to form an $n \times 2m$ pattern matrix. This pattern matrix is clustered using a linear clustering algorithm like $k$-means to obtain the data partitions. This algorithm, like the approximate kernel $k$-means, has $O(nm)$ running time complexity and memory requirements. The second algorithm, which we call the SV clustering algorithm, is designed along the lines of spectral clustering. It approximates the eigenvectors of the $n \times n$ kernel matrix by the dominant $C$ singular vectors of the pattern matrix, and obtains the data partition by clustering these singular vectors in $O(nC^2)$ time. The SV clustering algorithm provides a $C$-dimensional representation of the cluster centers, using which previously unseen data points can be assigned to clusters efficiently.

3.2 Background

The matrix approximation methods discussed in Section 2.2 essentially factorize the kernel matrix to obtain a low-dimensional representation of the data. Another form of kernel approximation, initially proposed for supervised kernel-based learning by Rahimi and Recht in [147], involves factorizing the kernel function instead of the kernel matrix, by mapping the data explicitly into a low-dimensional randomized feature space.

A kernel function $\kappa(\cdot,\cdot)$ is shift-invariant if $\kappa(x,y) = \kappa(x - y)$ for all $x, y \in \Re^d$. Popular examples of shift-invariant kernels are the RBF and Laplacian kernels. Let $p(w)$ denote the Fourier transform of such a kernel function $\kappa(x - y)$, i.e.
\[
\kappa(x - y) = \int_{\Re^d} p(w)\, \exp\big( j\, w^\top (x - y) \big)\, dw.
\]
According to the following theorem from harmonic analysis, $p(w)$ is a valid probability density function, provided the kernel function is continuous, positive-definite and scaled appropriately.

Theorem 2. (Bochner's theorem [152]) A continuous kernel $\kappa(x,y) = \kappa(x - y)$ on $\Re^d$ is positive definite if and only if $\kappa(\cdot)$ is the Fourier transform of a non-negative measure.

For instance, the Fourier transform [26] of the RBF kernel function is the Gaussian probability distribution function. Let $w$ be a $d$-dimensional vector sampled from $p(w)$. The kernel function can be approximated as
\[
\kappa(x,y) = E_w\big[ f(w,x)^\top f(w,y) \big], \qquad (3.1)
\]
where
\[
f(w,x) = \big( \cos(w^\top x),\, \sin(w^\top x) \big)^\top.
\]
We can approximate the expectation in (3.1) with the empirical mean over $m$ Fourier components $\{w_1,\ldots,w_m\}$, sampled from the distribution $p(w)$, and obtain the following representation for the point $x$:
\[
z(x) = \frac{1}{\sqrt{m}} \big( \cos(w_1^\top x), \ldots, \cos(w_m^\top x),\, \sin(w_1^\top x), \ldots, \sin(w_m^\top x) \big). \qquad (3.2)
\]
The features $z(x)$ are called the Random Fourier Features.
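A short numpy sketch of the map in (3.2), specialized to the RBF kernel $\kappa(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$ (whose Fourier transform is a Gaussian with covariance $I/\sigma^2$), may make the construction concrete; all names are illustrative:

```python
# Random Fourier feature map for the RBF kernel; the sampled components
# are kept so that new points can be mapped consistently later.
import numpy as np

def sample_fourier(d, m, sigma, rng):
    """Draw the Fourier components w_1, ..., w_m from p(w)."""
    return rng.standard_normal((d, m)) / sigma

def rff_map(X, W):
    """Map the rows of X (n x d) to the 2m-dimensional features z(x) of (3.2)."""
    proj = X @ W                                          # w_k^T x_i for all i, k
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[1])

# sanity check: z(x)^T z(y) concentrates around the true kernel value
rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
W = sample_fourier(d=5, m=2000, sigma=1.0, rng=rng)
Z = rff_map(np.vstack([x, y]), W)
print(Z[0] @ Z[1], np.exp(-((x - y) ** 2).sum() / 2))     # the two agree closely
```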
Thiskernelapproximationhasbeenemployedinseverallarg e-scalelearningtaskssuchas classication[25,147,182],regression[123],datacompr ession[146]andnoveltydetection[164]. Randomfeaturemapshavebeenextendedtoshift-variantker nelssuchasintersectionker- nels[110,179]andotherpositivedenitekernelsusingMac laurinandTaylorexpansionsofthe kernelfunction[81,92]. 3.3KernelClusteringusingRandomFourierFeatures Randomfeaturemapscanbeusedforclusteringbigdatasetse fciently.Weproposeanalgo- 83 (a) (b) (c) Figure3.1AsimpleexampletoillustratetheRFFclustering algorithm.(a)Two-dimensionaldata setwith 500 pointsfromtwoclusters( 250 pointsineachcluster),(b)Plotofthematrix H obtained bysampling m =1 Fouriercomponent.(c)Clustersobtainedbyexecuting k -meanson H . Table3.1ComparisonoftheconfusionmatricesoftheRFF,ke rnel k -means,and k -meansalgo- rithmsforthetwo-dimensionalsemi-circlesdataset,cont aining 500 points( 250 pointsineachof thetwoclusters). Class1 Class2 Cluster1 220 41 Cluster2 30 209 (a)RFFclustering Class1 Class2 Cluster1 250 0 Cluster2 0 250 (b)Kernel k -means Class1 Class2 Cluster1 132 129 Cluster2 118 121 (c) k -means rithmcalledthe RFFclustering algorithm,whichrstprojectsthedatasetintoalow-dimen sional spaceusingrandomFourierfeaturemaps,andthenexecutes k -meansonthetransformeddata. Let D = f x 1 ;:::; x n g representtheinputdataset,and ( ; ) bethekernelfunction.We assumethat ( ; ) isshift-invariant 1 andsatisesthecondition ( x ;x )= (0)=1 .Let K = [ ( x i ;x j )] n n denotethekernelmatrix.Thematrix H = z ( x 1 ) > ;:::; z ( x n ) > (3.4) denotesthedatamatrixobtainedbymappingeachpoint x 2D usingtherandomfeaturemap z ( ) . 1 Theassumptionofshift-invarianceismadeonlyforsimplic ity.Randomfeaturemapscanbeusedforother positivesemi-denitekernelsaswell,asdemonstratedin[ 81,92]. 84 Using(3.3),wecanapproximatethekernelmatrix K by b K = H > H: (3.5) Wecanreplacethekernelmatrix K inthekernel k -meansoptimizationproblem(1.11)with theapproximatekernelmatrix b K in(3.5),leadingtothefollowingoptimizationproblem: max U 2P tr ( e UH > H e U > ) ;(3.6) where U =( u 1 ;:::; u C ) > istheclustermembershipmatrix, P = f U 2f 0 ;1 g C n :U > 1 = 1 g , e U =[ diag ( U 1 )] 1 = 2 U ,and 1 isavectorofallones.Bycomparingtheaboveproblemtothe k -meansoptimizationproblem(1.12),itbecomesevidenttha ttheproblemin(3.6)canbesolved byexecuting k -meansonthematrix H .Algorithm6describestheRFFclusteringalgorithmfor clusteringusingtherandomFourierfeaturesobtainedfrom theRBFkernel.Weillustratethe algorithminFigure3.1.Figure3.1(a)showsatwo-dimensio naldatasetcontaining 500 points fromtwosemi-circularclusters.Thetwoclustersareident iedperfectlywhenthekernel k -means algorithmisexecutedonthisdataset.Forthepurposeofill ustration,wesampledoneFourier component(i.e. 
m =1 )andgeneratedatwo-dimensionalmatrix H torepresentthedata.Aplotof thisrepresentationisshowninFigure3.1(b).Notethatthe twoclustersaremoreseparatedinthis spacethanintheoriginalfeaturespace.Figure3.1(c)show stheclustersobtainedwhen k -meansis executedon H .Theerror,intermsofthenumberofpointsthataregroupedi ntothewrongcluster, isabout 14% ,asshownintheconfusionmatricesinTable3.1.Aconfusion matrixshowsthe mappingbetweenthetrueclasslabelsandtheclusterlabels .Eachclusterisassignedaclasslabel, correspondingtothetruelabelofthemajorityofthedatapo intsinthecluster.Eachentry ( k;c ) in theconfusionmatrixrepresentthenumberofdatapointsfro mclass c assignedtocluster k .The diagonalentriesrepresentthenumberofpointsthathavebe enassignedtothecorrectcluster.The confusionmatricesshowthattheaccuracyoftheRFFcluster ingalgorithmisclosetothatofthe 85 kernel k -meansalgorithm,andhigherthanthatofthe k -meansalgorithm. 3.3.1Analysis Inthissection,werstanalyzethecomputationalcomplexi tyoftheRFFclusteringalgorithm,and thenexaminethequalityofthedatapartitionsgenerated. 3.3.1.1Computationalcomplexity SamplingfromtheFouriertransformofthekernelfunctioni sarelativelyinexpensiveoperation formostshift-invariantkernels.Forinstance,severalef cienttechniqueshavebeenproposedfor samplingfromaGaussiandistributionintheliterature[53 ].ThecruxoftheproposedRFFclus- teringalgorithmthusliesincomputingthelow-dimensiona lrandomFourierfeatures H .Given md -dimensionalFouriercomponents,themappingtothematrix H canbeperformedin O ( ndm ) time.Le etal. proposedtheFastfoodalgorithmwhichreducestherunningt imecomplexityof thisoperationto O ( nm log( d )) [107].Insteadofdirectlymultiplyingthedatamatrix X withthe randomGaussianmatrix W toobtainthematrix H ,theycombine W withaWalsh-Hadamardma- trix.MultiplicationwithHadamardmatricescanbeperform edinloglineartime,therebyreducing therunningtime.AsGaussianmatricescombinedwithHadama rdmatricesbehavelikeGaussian matrices,thisdoesnotaffectthekernelmatrixapproximat ionsignicantly.Executing k -meanson H takes O ( nmCl ) time,where l isthenumberofiterationsrequiredforconvergence.Thus, the overallrunningtimecomplexityoftheRFFclusteringalgor ithmis O ( nm log( d )+ nmCl ) .Only O ( nm ) memoryisrequiredtostorethematrix H . 3.3.1.2Approximateerror Toexaminethedifferencebetweentheclusteringsolutions ofthekernel k -meansalgorithmand theRFFclusteringalgorithm,wemustrstboundthekernela pproximationerror K b K F .In thefollowingtheorem,weshowthatthiserrordecreasesatt herateof O (1 = p m ) : 86 Theorem3. Forany 2 (0 ;1) ,withprobability 1 ,wehave b K K F 2ln(2 = ) m + r 2ln(2 = ) m = O 1 p m :(3.7) Proof. Weusethefollowingresultfrom[165]toprovethistheorem: Lemma4. Let H beaHilbertspaceand ˘ bearandomvariableon ( Z;ˆ ) withvaluesin H . Assume k ˘ k M< 1 almostsurely.Denote ˙ 2 ( ˘ )= E ( k ˘ k 2 ) .Let f z i g m i =1 beindependent randomdrawersof ˆ .Forany 0 << 1 ,withcondence 1 , 1 m m X i =1 ( ˘ i E [˘ i ]) 2 M ln(2 = ) m + r 2 ˙ 2 ( ˘ )ln(2 = ) m :(3.8) Dene a ( w )= 1 p n (cos( w > x 1 ) ;:::; cos( w > x n )) > and b ( w )= 1 p n (sin( w > x 1 ) ;:::; sin( w > x n )) > :Let ˘ i = a ( w i ) a ( w i ) > + b ( w i ) b ( w i ) > .Wehave E [˘ i ]= E [a ( w i ) a ( w i ) > + b ( w i ) b ( w i ) > ]= K and jj ˘ i jj 2 F = jj a ( w i ) j 2 + j b ( w i ) jj 2 =1 ,whichimplies M = ˙ 2 =1 .Weobtaintheresult(3.7) bysubstitutingthesevaluesin(3.8). 
3.3.1.2 Approximation error

To examine the difference between the clustering solutions of the kernel $k$-means algorithm and the RFF clustering algorithm, we must first bound the kernel approximation error $\|K - \hat{K}\|_F$. In the following theorem, we show that this error decreases at the rate of $O(1/\sqrt{m})$:

Theorem 3. For any $\delta \in (0,1)$, with probability $1 - \delta$, we have
\[
\|\hat{K} - K\|_F \le \frac{2\ln(2/\delta)}{m} + \sqrt{\frac{2\ln(2/\delta)}{m}} = O\Big(\frac{1}{\sqrt{m}}\Big). \qquad (3.7)
\]

Proof. We use the following result from [165] to prove this theorem:

Lemma 4. Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable on $(Z, \rho)$ with values in $\mathcal{H}$. Assume $\|\xi\| \le M < \infty$ almost surely. Denote $\sigma^2(\xi) = E(\|\xi\|^2)$. Let $\{z_i\}_{i=1}^m$ be independent random draws of $\rho$. For any $0 < \delta < 1$, with confidence $1 - \delta$,
\[
\Big\| \frac{1}{m} \sum_{i=1}^m (\xi_i - E[\xi_i]) \Big\| \le \frac{2M \ln(2/\delta)}{m} + \sqrt{\frac{2\sigma^2(\xi)\ln(2/\delta)}{m}}. \qquad (3.8)
\]

Define
\[
a(w) = \frac{1}{\sqrt{n}} \big( \cos(w^\top x_1), \ldots, \cos(w^\top x_n) \big)^\top \quad \text{and} \quad
b(w) = \frac{1}{\sqrt{n}} \big( \sin(w^\top x_1), \ldots, \sin(w^\top x_n) \big)^\top.
\]
Let $\xi_i = a(w_i)\, a(w_i)^\top + b(w_i)\, b(w_i)^\top$. We have $E[\xi_i] = K$ and $\|\xi_i\|_F^2 = \|a(w_i)\|^2 + \|b(w_i)\|^2 = 1$, which implies $M = \sigma^2 = 1$. We obtain the result (3.7) by substituting these values in (3.8).

$\hat{K}$ is a good approximation of $K$, provided that the number of Fourier components $m$ is sufficiently large. We can now obtain an upper bound on the difference between the solutions of the kernel $k$-means optimization problem in (1.11) and the optimization problem in (3.6):

Theorem 4. Let $U^*$ and $U_m^*$ be the optimal solutions of (1.11) and (3.6), respectively. Let $\tilde{U}^* = [D^*]^{-1/2} U^*$ and $\tilde{U}_m^* = [D_m^*]^{-1/2} U_m^*$ denote the normalized versions of $U^*$ and $U_m^*$, where $D^* = diag(U^* \mathbf{1})$ and $D_m^* = diag(U_m^* \mathbf{1})$. For any $\delta \in (0,1)$, with probability $1 - \delta$, we have
\[
tr\big( [\tilde{U}^* - \tilde{U}_m^*]\, K\, [\tilde{U}^* - \tilde{U}_m^*]^\top \big) \le \frac{4\ln(2/\delta)}{m} + \sqrt{\frac{8\ln(2/\delta)}{m}} = O\Big(\frac{1}{\sqrt{m}}\Big).
\]

Proof. We have
\[
tr(\tilde{U}^* K [\tilde{U}^*]^\top) \le tr(\tilde{U}^* \hat{K} [\tilde{U}^*]^\top) + \|K - \hat{K}\|_F
\le tr(\tilde{U}_m^* \hat{K} [\tilde{U}_m^*]^\top) + \|K - \hat{K}\|_F
\le tr(\tilde{U}_m^* K [\tilde{U}_m^*]^\top) + 2\,\|K - \hat{K}\|_F.
\]
Since $tr(\tilde{U}^* K [\tilde{U}^*]^\top) \ge tr(\tilde{U}_m^* K [\tilde{U}_m^*]^\top)$, we have
\[
\big| tr(\tilde{U}^* K [\tilde{U}^*]^\top) - tr(\tilde{U}_m^* K [\tilde{U}_m^*]^\top) \big| \le 2\,\|K - \hat{K}\|_F.
\]
We complete the proof by using the result from Theorem 3 and the strong convexity property of $tr(\tilde{U} K \tilde{U}^\top)$.

3.4 Kernel Clustering using Random Fourier Features in Constrained Eigenspace

Despite its simplicity, RFF clustering may suffer from high computational cost. As seen in Theorem 4, a large number of random Fourier components may be required to achieve a low approximation error. As a consequence, we need to execute $k$-means over a high-dimensional space, leading to high runtime complexity. To address this problem, we propose using an idea similar to that in the approximate kernel $k$-means algorithm, and constrain the cluster centers to lie in the subspace spanned by the top eigenvectors of the kernel matrix. Let $\{(\lambda_i, v_i)\}_{i=1}^n$ denote the eigenvalues and eigenvectors of the kernel matrix $K$, ranked in the descending order of the eigenvalues. Let $\mathcal{H}_a = span(v_1,\ldots,v_C)$ represent the space spanned by the dominant $C$ eigenvectors. The kernel $k$-means problem in (1.7) can be approximated as
\[
\min_{U \in \mathcal{P}}\; \max_{\{c_k(\cdot) \in \mathcal{H}_a\}_{k=1}^C}\; \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U_{k,i}}{n}\, \|c_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2, \qquad (3.9)
\]
where $c_k(\cdot)$ represent the cluster centers, $U = (u_1,\ldots,u_C)^\top$ is the cluster membership matrix, $\mathcal{P} = \{U \in \{0,1\}^{C\times n} : U^\top \mathbf{1} = \mathbf{1}\}$, and $\mathbf{1}$ is a vector of all ones. The above problem (3.9) can be solved by executing $k$-means on the top eigenvectors of $K$, i.e. by solving the following optimization problem:
\[
\max_{U \in \mathcal{P}}\; tr(\tilde{U}\, [V_C V_C^\top]\, \tilde{U}^\top), \qquad (3.10)
\]
where $V_C = (v_1,\ldots,v_C)$ and $\tilde{U} = [diag(U\mathbf{1})]^{-1/2} U$. This method leads to a significant reduction in computational cost when compared to the RFF clustering algorithm, as each data point is represented by a $C$-dimensional vector, and $k$-means needs to be executed over a lower dimensional space.
However, computing the eigenvectors of $K$ requires the computation of the $n \times n$ kernel matrix, which is infeasible when $n$ is large. We circumvent this issue by approximating the eigenvectors of $K$ using the singular vectors of the random Fourier features, and thereby avoid computing the full kernel matrix. More specifically, we compute the top $C$ singular values and the corresponding left singular vectors of $H$, denoted by $\{(\hat{\lambda}_i, \hat{v}_i)\}_{i=1}^C$, and represent the data points in $\mathcal{D}$ by the matrix $\hat{V}_C = (\hat{v}_1,\ldots,\hat{v}_C)$. We then solve the approximate optimization problem
\[
\max_{U \in \mathcal{P}}\; tr(\tilde{U}\, [\hat{V}_C \hat{V}_C^\top]\, \tilde{U}^\top), \qquad (3.11)
\]
by executing $k$-means on the matrix $\hat{V}_C$, to obtain the $C$ clusters. This procedure, named the SV clustering algorithm, is outlined in Algorithm 7. It has the same input and output as the RFF clustering algorithm, but differs in the final two steps. As the dimensionality of the input to the $k$-means clustering step in the SV clustering algorithm is significantly smaller than that in the RFF clustering algorithm, SV clustering is more efficient than RFF clustering, despite the overhead of computing the singular vectors.

Algorithm 7 SV Clustering
1: Input:
   $\mathcal{D} = \{x_1,\ldots,x_n\}$, $x_i \in \Re^d$: the set of $n$ $d$-dimensional data points to be clustered
   $\sigma$: the RBF kernel width parameter
   $C$: the number of clusters
   $m$: the number of Fourier components ($C < m \ll n$)
2: Output: cluster membership matrix $U$
3: Sample $m$ Fourier components $\{w_1,\ldots,w_m\}$ from $p(w)$, the Fourier transform of the RBF kernel.
4: Compute the pattern matrix $H = \big( z(x_1),\ldots,z(x_n) \big)^\top$, where $z(\cdot)$ is defined in (3.2).
5: Compute the left singular vectors of $H$ corresponding to its top $C$ singular values, to obtain the matrix $\hat{V}_C = (\hat{v}_1,\ldots,\hat{v}_C)$.
6: Run the $k$-means algorithm (Algorithm 1) on $\hat{V}_C$, with the number of clusters set to $C$, and obtain the membership matrix $U$.

3.4.1 Analysis

In this section, we discuss the computational complexity of the SV clustering algorithm and bound its approximation error.

3.4.1.1 Computational complexity

As the initial steps in the SV clustering algorithm are the same as in the RFF clustering algorithm, these steps have the same running time complexity. In addition, the algorithm involves performing the singular value decomposition of $H$. If the top singular vectors of $H$ are found using conventional methods, the runtime complexity of the SVD step would be $O(nm^2)$. We reduce this complexity in our implementation by using the approximate SVD technique proposed in [70]. We sample $s$ rows from $H$ to form an $s \times 2m$ matrix $S$. The top eigenvectors of $S^\top S$, denoted by $\tilde{V} = (\tilde{v}_1,\ldots,\tilde{v}_C)$, are close to the top eigenvectors of $H^\top H$, and the singular vectors of $H$ can be recovered from these eigenvectors as $H\tilde{V}$. Using this approximation, the runtime complexity of the SVD step is reduced to $O(sm \min\{s,m\})$. The time taken to execute $k$-means on the singular vectors is $O(nC^2 l)$.

When $\max(m,s,l,C) \ll n$, both the RFF and SV clustering algorithms have linear time complexity. However, the time taken by the $k$-means step in the SV clustering algorithm is $O(nC^2 l)$, as opposed to $O(nmCl)$, the time taken by the $k$-means step in the RFF clustering algorithm. As $C$ is usually much smaller than $m$, the SV algorithm is much more efficient than the RFF clustering algorithm.

The values chosen for $m$ and $s$ introduce a trade-off between the clustering quality and the efficiency. Higher values result in better clustering quality, but a smaller speedup. In our implementation, we found that a reasonably good accuracy can be achieved by setting the value of $m$ to range between 1% and 2% of $n$, and setting $s$ to around 2% of $n$. Lower $m/n$ ratio values work well as $n$ increases.
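The whole SV pipeline with the row-sampling SVD approximation fits in a few lines. As above, this is an illustrative sketch (sample_fourier and rff_map are the hypothetical helpers from Section 3.2), not the MATLAB implementation used in the experiments:

```python
# SV clustering sketch: eigenvectors of S^T S stand in for the right
# singular vectors of H, and H @ V_tilde recovers (up to scaling) the
# top C left singular vectors used as the C-dimensional representation.
import numpy as np
from scipy.cluster.vq import kmeans2

def sv_clustering(X, C, m, s, sigma, rng):
    W = sample_fourier(X.shape[1], m, sigma, rng)
    H = rff_map(X, W)                                       # n x 2m pattern matrix
    S = H[rng.choice(X.shape[0], size=s, replace=False)]    # s x 2m row sample
    _, eigvecs = np.linalg.eigh(S.T @ S)                    # eigenvalues in ascending order
    V_tilde = eigvecs[:, -C:]                               # top C eigenvectors of S^T S
    V_hat = H @ V_tilde                # approximate top C left singular vectors of H
    _, labels = kmeans2(V_hat, C, minit='++')               # k-means in C dimensions
    return labels, W, V_tilde, V_hat
```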
3.4.1.2 Approximation error

The SV clustering algorithm relies on the assumption of the existence of a large eigengap. This theory, which has been adopted by many earlier kernel-based algorithms relying on the spectral embedding of the data [118], essentially implies that most attributes of the data can be well approximated by vectors in the low-dimensional space spanned by the top eigenvectors.

The following theorem proves that when the last $n - C$ eigenvalues $\{\lambda_i\}_{i=C+1}^n$ of $K$ are sufficiently small, the subspace $\mathcal{H}$ can be well approximated by the subspace $\mathcal{H}_a$ spanned by the top $C$ eigenvectors of $K$.

Theorem 5. Let $E^*$ and $E_a^*$ represent the optimal clustering errors in the kernel $k$-means problem (1.7) and the optimization problem (3.9), respectively. We have
\[
| E^* - E_a^* | \le \sum_{i=C+1}^{n} \lambda_i.
\]

Proof. Let $\{c_k^*(\cdot)\}_{k=1}^C$ and $U^*$ be the optimal solutions to (1.7). Let $c_k^a(\cdot)$ represent the projection of $c_k^*$ into the subspace $\mathcal{H}_a$. For any $\kappa(x_i,\cdot)$, let $g_i(\cdot)$ and $h_i(\cdot)$ be the projections of $\kappa(x_i,\cdot)$ into the subspace $\mathcal{H}_a$ and into $span(v_{C+1},\ldots,v_n)$, respectively. We have
\[
E_a^* = \min_{U} \max_{c_k(\cdot) \in \mathcal{H}_a} \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U_{k,i}}{n}\, \|c_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2
\le \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U^*_{k,i}}{n}\, \|c_k^a(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2
\]
\[
= \sum_{k=1}^{C}\sum_{i=1}^{n} \frac{U^*_{k,i}}{n} \Big( \|c_k^a(\cdot) - g_i(\cdot)\|_{\mathcal{H}}^2 + \|h_i(\cdot)\|_{\mathcal{H}}^2 \Big)
\le E^* + \frac{1}{n} \sum_{i=1}^{n} \|h_i(\cdot)\|_{\mathcal{H}}^2
\le E^* + \sum_{i=C+1}^{n} \lambda_i.
\]

We prove a set of preliminary lemmas before presenting our main result in Theorem 6, which bounds the clustering error of the SV clustering algorithm.

Lemma 5. (Result from matrix perturbation theory [166]) Let $(\lambda_i, v_i)$, $i \in [n]$, be the eigenvalues and eigenvectors of a symmetric matrix $A \in \Re^{n\times n}$, ranked in the descending order of the eigenvalues. Set $X = (v_1,\ldots,v_C)$ and $Y = (v_{C+1},\ldots,v_n)$. Given a symmetric perturbation matrix $E$, let
\[
(X, Y)^\top E\, (X, Y) = \begin{pmatrix} E_{11} & E_{12} \\ E_{21} & E_{22} \end{pmatrix}.
\]
Let $\|\cdot\|$ represent a consistent family of norms, and let
\[
\gamma = \|E_{21}\|, \qquad \delta = \lambda_C - \lambda_{C+1} - \|E_{11}\| - \|E_{22}\|.
\]
If $\delta > 0$ and $\gamma/\delta < 1/2$, then there exists a unique matrix $P \in \Re^{(n-C)\times C}$ satisfying $\|P\| < 2\gamma/\delta$, such that
\[
X' = (X + YP)(I + P^\top P)^{-1/2} \quad \text{and} \quad Y' = (Y - XP^\top)(I + PP^\top)^{-1/2}
\]
are the eigenvectors of $A + E$.

Lemma 6. Given $\delta \in (0,1)$, assume $(\lambda_C - \lambda_{C+1}) \ge 3\Delta$, where
\[
\Delta = \frac{2\ln(2/\delta)}{m} + \sqrt{\frac{2\ln(2/\delta)}{m}}. \qquad (3.12)
\]
Then there exists, with probability $1 - \delta$, a matrix $P \in \Re^{(n-C)\times C}$ satisfying
\[
\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1}},
\]
such that $\hat{V}_C = (V_C + \bar{V}_C P)(I + P^\top P)^{-1/2}$, where $\hat{V}_C = (\hat{v}_1,\ldots,\hat{v}_C)$, $V_C = (v_1,\ldots,v_C)$, and $\bar{V}_C = (v_{C+1},\ldots,v_n)$.

Proof. Let $E = \hat{K} - K$. Using Theorem 3 and Lemma 5, we have
\[
\gamma = \| \bar{V}_C^\top E\, V_C \| \le \|E\|, \qquad
\delta = \lambda_C - \lambda_{C+1} - \| V_C^\top E\, V_C \| - \| \bar{V}_C^\top E\, \bar{V}_C \| \ge \lambda_C - \lambda_{C+1} - 2\|E\| > 0.
\]
As $\lambda_C - \lambda_{C+1} \ge 3\Delta$, we also have $\gamma/\delta < 1/2$, allowing us to apply Lemma 5 and obtain the required result.

Lemma 7. Under the assumptions of Lemma 6, with probability $1 - \delta$, we have
\[
\sum_{i=1}^{C} \|\hat{v}_i - v_i\|^2 \le 2\,\|P\|_F^2 \le \frac{18\,\Delta^2}{(\lambda_C - \lambda_{C+1})^2},
\]
where $\Delta$ is defined in (3.12).

Proof. Define $A = P(I + P^\top P)^{-1/2}$ and $B = I - (I + P^\top P)^{-1/2}$. Let $\{\gamma_i\}_{i=1}^C$ be the eigenvalues of $P^\top P$. Using the result of Lemma 6, we have
\[
\sum_{i=1}^{C} \|\hat{v}_i - v_i\|^2 = \| \bar{V}_C A \|_F^2 + \| V_C B \|_F^2 \le \|A\|_F^2 + \|B\|_F^2
\le \|P\|_F^2 + \sum_{i=1}^{C} \frac{\gamma_i}{(1 + \sqrt{\gamma_i})^2} \le 2\,\|P\|_F^2.
\]
We complete the proof by using the fact that
\[
\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1}} \le \frac{3\Delta}{\lambda_C - \lambda_{C+1}}.
\]

In the following theorem, we bound the approximation error of the SV clustering algorithm, and show that it yields a better approximation of kernel clustering than the RFF clustering algorithm, provided there is a sufficiently large gap in the eigenspectrum.

Theorem 6. Let $U^*$ and $U_m^*$ be the optimal solutions of (3.10) and (3.11), and let $\tilde{U}^*$ and $\tilde{U}_m^*$ represent their normalized versions (as defined in Theorem 4), respectively. Given $\delta \in (0,1)$, assume $(\lambda_C - \lambda_{C+1}) \ge 3\Delta$, where $\Delta$ is defined in (3.12). With probability $1 - \delta$, we have
\[
tr\big( [\tilde{U}^* - \tilde{U}_m^*]\, [\tilde{U}^* - \tilde{U}_m^*]^\top \big) \le \frac{18\,\Delta^2}{(\lambda_C - \lambda_{C+1})^2} = O\Big(\frac{1}{m}\Big).
\]

Proof. This theorem is a direct result of Lemmas 6 and 7.

Theorem 6 shows that, like the RFF clustering algorithm, the SV clustering algorithm's approximation error reduces as the number of Fourier components increases, albeit at a faster rate of $O(1/m)$.
3.4.2 Out-of-sample Clustering

The SV clustering algorithm can be used to efficiently assign cluster labels to data points that were not seen previously. The cluster centers in the SV clustering algorithm lie in the subspace $\mathcal{H}_a = span(\hat{v}_1,\ldots,\hat{v}_C)$, and can be expressed as linear combinations of these vectors:
\[
\tilde{c}_k = \frac{1}{n_k} \sum_{i=1}^{n} U_{k,i}\, \hat{V}_C^{(i)},
\]
where $n_k$ is the number of data points in the $k$th cluster and $\hat{V}_C^{(i)}$ denotes the $i$th row of $\hat{V}_C$. Given a data point $x \in \Re^d$, we can obtain its cluster label using the following double projection scheme:
(i) Compute the random Fourier features
\[
z(x) = \frac{1}{\sqrt{m}} \big( \cos(w_1^\top x), \ldots, \cos(w_m^\top x),\, \sin(w_1^\top x), \ldots, \sin(w_m^\top x) \big).
\]
(ii) Project $z(x)$ into the subspace $\mathcal{H}_a$ to obtain $\hat{v}$.
(iii) Assign $x$ to the cluster $k^*$ which minimizes $\| \tilde{c}_k - \hat{v} \|_2^2$.
Using this process, cluster labels can be assigned in $O(md)$ time.
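A sketch of the double projection scheme, reusing the hypothetical quantities returned by the SV clustering sketch above (the Fourier components W, the projection V_tilde, and the per-point representations V_hat); note that the same W must be used for new points as for the original data:

```python
# Out-of-sample assignment via the double projection scheme.
import numpy as np

def sv_centers(V_hat, labels, C):
    """The C x C matrix of cluster centers c_tilde_k."""
    return np.vstack([V_hat[labels == k].mean(axis=0) for k in range(C)])

def assign_new_point(x, W, V_tilde, centers):
    z = rff_map(x[None, :], W)        # step (i): random Fourier features of x
    v = z @ V_tilde                   # step (ii): projection into H_a
    return ((centers - v) ** 2).sum(axis=1).argmin()   # step (iii): nearest center
```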
3.5 Experimental Results

3.5.1 Datasets

We evaluated the performance of the RFF and SV clustering algorithms on the CIFAR-10, MNIST, Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets. The medium-sized CIFAR-10 and MNIST data sets are used to compare the performance of the proposed algorithms with the kernel k-means algorithm. The remaining data sets are used to demonstrate the scalability of the algorithms to large data sets.

3.5.2 Baselines

Using the medium-sized CIFAR-10 and MNIST data sets, we compared the proposed algorithms with the kernel k-means algorithm, to demonstrate that their clustering performance is close to that of kernel k-means in terms of cluster quality. We also compared their performance with the approximate kernel k-means algorithm and the Nystrom approximation based spectral clustering algorithm. We also gauged the performance of our algorithms against that of the k-means algorithm, to show that they achieve better cluster quality.

3.5.3 Parameters

We used the RBF kernel for all the kernel-based algorithms on all the data sets. We set the kernel width equal to $\rho \bar d$, where $\bar d$ is the average pairwise Euclidean distance between the data points, and the parameter $\rho$ is tuned in the range $[0,1]$ to obtain optimal performance. (The average pairwise similarity was used only as a heuristic to set the RBF kernel width, and is not required by the proposed algorithms; other techniques may be employed to choose the kernel and the kernel parameters.) We varied the number of Fourier components $m$ from 100 to 2,000. For the approximate kernel k-means and spectral clustering algorithms, $m$ represents the size of the sample drawn from the data set. The value of $s$, the number of rows sampled from $H$ to compute the approximate singular vectors, was set to 2% of the total number of data points $n$. The number of clusters $C$ was set equal to the true number of classes in the data set.

All algorithms were implemented in MATLAB and run on a 2.8 GHz processor using 40 GB RAM. (We used the k-means implementation in the MATLAB Statistics Toolbox, and the Nystrom approximation based spectral clustering implementation [35] available at http://alumni.cs.ucsb.edu/wychen/sc.html; the remaining algorithms were implemented in-house.) All results are averaged over 10 runs of the algorithms. In each run of the proposed algorithms, we used a different set of randomly sampled Fourier components. For the baseline algorithms which use a subset of the data, we used different randomly sampled subsets in each run.

3.5.4 Results

3.5.4.1 Running time

The running times of the baseline algorithms and the proposed RFF and SV algorithms are recorded in Table 3.2. The number of Fourier components $m$ for the RFF and SV clustering algorithms was set to 2,000, and the sample set size for the approximate kernel k-means and the Nystrom approximation based spectral clustering algorithms was also set to 2,000.

Table 3.2 Running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. The parameter $m$, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, is set to $m = 2{,}000$. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. An approximation of the running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center. Standard deviations are shown in parentheses.

Dataset | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering | Kernel k-means | k-means
CIFAR-10 | 3,418.21 (907.14) | 58.32 (38.68) | 37.01 (6.52) | 116.13 (1.97) | 725.32 (7.39) | 159.22 (75.81)
MNIST | 1,089.26 (483.63) | 39.94 (5.64) | 57.73 (12.94) | 4,186.02 (386.17) | 914.59 (235.14) | 448.69 (177.24)
Forest Cover Type | 2,078.63 (617.22) | 76.99 (17.04) | 157.48 (27.37) | 573.55 (327.49) | 4,721.03 (504.21) | 40.88 (6.40)
Imagenet-34 | 1,333.85 (6.53) | 212.32 (4.75) | 1,261.02 (37.39) | 1,841.47 (123.82) | 154,416 (32,302) | 31,076 (9,355)
Poker | 4,530.44 (276.37) | 41.08 (2.57) | 256.26 (44.84) | 520.48 (51.29) | 9,942 (1,476) | 40.88 (6.40)
Network Intrusion | 24,151 (6,351.34) | 435.53 (189.07) | 891.08 (237.17) | 1,682.46 (235.70) | 34,784 (1,493) | 953.41 (169.38)

We first observe that the RFF clustering algorithm took longer than the SV clustering algorithm on all the data sets. Though both algorithms require the computation of the data matrix $H$, the time taken to perform this computation was insignificant when compared to the k-means clustering time. RFF clustering involves running k-means on a $2m$-dimensional matrix, which takes longer than running k-means on a $C$-dimensional matrix. Although the SV clustering algorithm includes computing the singular vectors of $H$, the overhead of performing SVD is small, rendering it more efficient than the RFF clustering algorithm. On the CIFAR-10 data set, the SV clustering algorithm was at least 15 times faster than the RFF clustering algorithm. On the MNIST data set, the SV clustering algorithm was about 20 times faster than the RFF clustering algorithm. Similar speedups were obtained for the other data sets as well. We will see later that the SV clustering algorithm achieves similar clustering accuracy as the RFF clustering algorithm, so we conclude that the SV clustering algorithm is more suitable for large-scale kernel clustering than the RFF clustering algorithm.

The Nystrom approximation based spectral clustering algorithm finds the clusters by executing k-means on the top eigenvectors of a low-rank approximate kernel matrix derived from a randomly sampled data subset of size $m$. It first obtains the eigenvectors of an $m \times m$ matrix and then extrapolates them to the top eigenvectors of the $n \times n$ kernel matrix. As the SV clustering algorithm only finds the top singular vectors of an $s \times m$ matrix, it is more efficient than the Nystrom approximation based spectral clustering algorithm. The SV clustering algorithm was also faster than approximate kernel k-means on all the data sets.
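The source of this efficiency difference can be seen in a minimal sketch: RFF clustering runs k-means on the $2m$-dimensional matrix $H$, whereas SV clustering runs k-means on the $C$-dimensional embedding given by the top singular vectors. The fragment below (in Python, with scikit-learn's k-means) uses a random stand-in for $H$ and a full SVD for brevity; the chapter instead computes the singular vectors from a sampled $s$-row submatrix of $H$, so this is only an illustration of the dimensionality contrast, not the thesis implementation.

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, m, C = 10000, 1000, 10
H = rng.standard_normal((n, 2 * m)) / np.sqrt(m)   # stand-in for the RFF matrix

t0 = time.perf_counter()
labels_rff = KMeans(n_clusters=C, n_init=1).fit_predict(H)        # 2m dims
t1 = time.perf_counter()

U, s, _ = np.linalg.svd(H, full_matrices=False)
labels_sv = KMeans(n_clusters=C, n_init=1).fit_predict(U[:, :C])  # C dims
t2 = time.perf_counter()

print(f"RFF clustering: {t1 - t0:.1f}s, SV clustering (incl. SVD): {t2 - t1:.1f}s")
```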
As expected, the SV algorithm was faster than the kernel k-means algorithm on the CIFAR-10 and MNIST data sets. As it is prohibitive to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker and Network Intrusion data sets, we randomly selected a subset of 50,000 points from these data sets, executed kernel k-means on this subset to obtain the cluster centers, and assigned the remaining points to the closest center. We recorded the time taken for this procedure as the time taken by kernel k-means on these large data sets. The SV algorithm was faster than this approximate version of kernel k-means as well on all the data sets. When the dimensionality of the data set was greater than the number of clusters in it, the SV clustering algorithm ran faster than the k-means algorithm.

3.5.4.2 Cluster quality

Figure 3.2 records the silhouette coefficient values of the proposed and baseline algorithms on the CIFAR-10 and MNIST data sets.

Figure 3.2 Silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms on (a) CIFAR-10 and (b) MNIST. The parameter $m$, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, is set to $m = 2{,}000$.

The proposed algorithms achieved values comparable to the kernel k-means algorithm and the approximate kernel k-means algorithm, showing that they yield similar partitions. The silhouette coefficient values of the Nystrom approximation based spectral clustering algorithm were marginally lower than those of the remaining kernel-based clustering algorithms. The k-means algorithm yielded non-compact partitions with silhouette values closer to 0.

Figure 3.3 shows the NMI values achieved by the proposed algorithms and the baseline algorithms. We first observe that the accuracy of all the kernel-based algorithms, including the proposed algorithms, was better than that of the k-means algorithm, demonstrating the fact that incorporating a non-linear similarity function improves the clustering performance. On the CIFAR-10 and MNIST data sets, we observed that the performance of both our algorithms was similar to that of kernel k-means. Comparison with kernel k-means is not feasible on the remaining data sets due to their large size. The proposed algorithms outperformed the approximate version of kernel k-means in which a subset of the data was clustered and the remaining points were assigned to the closest center. The proposed algorithms' performance was significantly better than that of the Nystrom approximation based spectral clustering algorithm on all data sets. They performed only marginally worse than the approximate kernel k-means algorithm. The difference in the NMI values of the RFF clustering algorithm and the SV clustering algorithm is minimal for most data sets.

Figure 3.3 NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, with respect to the true class labels, on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. The parameter $m$, which represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, is set to $m = 2{,}000$. It is not feasible to execute kernel k-means on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

3.5.4.3 Parameter sensitivity

The number of Fourier components $m$ plays a crucial role in the performance of the RFF and SV clustering algorithms. The running time of the algorithms is compared with the approximate kernel k-means and the Nystrom spectral clustering algorithms for different values of $m$ in Table 3.3.
In the table, $m$ represents the number of Fourier components in the context of the RFF and SV clustering methods, and it represents the size of the sample drawn from the data set in the context of approximate kernel k-means and Nystrom approximation based spectral clustering. As observed earlier, the SV algorithm is faster than the RFF clustering algorithm. For instance, the SV algorithm is about 15 times faster than the RFF algorithm on the CIFAR-10 data set when $m = 100$. The speedup factor increased as the number of Fourier components $m$ increased. We note that the speedup on the Network Intrusion data set became significant only when $m \ge 500$. The SV clustering algorithm was also faster than approximate kernel k-means for all values of $m$, due to the fact that, unlike in the approximate kernel k-means algorithm, the dimensionality of the input to the k-means step (which dominates the running time) remains constant despite the increase in $m$. The dimensionality of the input kernel in the approximate kernel k-means algorithm increases linearly with $m$.

The silhouette coefficient values achieved by the algorithms on the CIFAR-10 and MNIST data sets, for different values of $m$, are shown in Figure 3.4. We first observe that the silhouette values achieved by the proposed RFF and SV clustering algorithms increased significantly as $m$ increased. The values were initially much lower than those achieved by the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms, but became comparable when $m \ge 1{,}000$.

The NMI values achieved by the algorithms for different values of $m$ are shown in Figure 3.5. We note that, although the SV clustering algorithm performed worse than the RFF clustering algorithm in terms of NMI when $m$ is small, it yielded similar performance as the RFF clustering algorithm when $m$ was substantially large. On the MNIST data set, as the value of $m$ increased from 100 to 2,000, the average NMI achieved by the RFF clustering algorithm increased by about 15%, whereas the SV clustering algorithm achieved an increase of 20%. Similar rates of increase were observed on the other data sets also. This verifies our claim that the approximation error of the SV clustering algorithm decreases at a higher rate with respect to the parameter $m$ than that of the RFF clustering algorithm. While the NMI values of the SV clustering method are higher than those of the Nystrom spectral clustering method for all $m$ values on most data sets, they are only marginally lower than those of the approximate kernel k-means algorithm for small $m$, and become close to the approximate kernel k-means values as $m$ increases.
Figure 3.4 Effect of the number of Fourier components $m$ on the silhouette coefficient values of the partitions obtained using the RFF and SV clustering algorithms on (a) CIFAR-10 and (b) MNIST. Parameter $m$ represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms.

Table 3.3 Effect of the number of Fourier components $m$ on the running time (in seconds) of the RFF and SV clustering algorithms on the six benchmark data sets. Parameter $m$ represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms. Standard deviations are shown in parentheses.

(a) CIFAR-10
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 89.94 (18.96) | 5.39 (1.63) | 12.29 (4.66) | 0.91 (0.16)
200 | 176.47 (47.59) | 6.09 (1.76) | 39.91 (15.11) | 1.86 (0.20)
500 | 449.23 (103.61) | 10.71 (3.32) | 13.20 (2.14) | 5.61 (1.89)
1,000 | 1,176.74 (276.07) | 16.46 (6.54) | 49.50 (22.17) | 26.24 (5.26)
2,000 | 3,418.21 (907.14) | 58.32 (38.68) | 37.01 (6.52) | 116.13 (1.97)

(b) MNIST
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 85.36 (25.64) | 3.85 (2.37) | 26.57 (3.12) | 6.00 (0.89)
200 | 122.31 (48.31) | 4.66 (1.78) | 17.98 (7.99) | 46.70 (8.51)
500 | 272.57 (111.25) | 9.22 (1.22) | 24.72 (8.46) | 342.38 (105.80)
1,000 | 517.48 (44.60) | 17.46 (1.43) | 36.34 (6.92) | 914.18 (215.77)
2,000 | 1,089.26 (483.63) | 39.94 (5.64) | 86.43 (12.71) | 4,163.76 (383.37)

(c) Forest Cover Type
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 154.97 (65.72) | 9.62 (2.57) | 19.10 (6.35) | 11.75 (1.73)
200 | 174.88 (65.36) | 10.77 (1.67) | 24.21 (12.48) | 13.65 (1.59)
500 | 534.01 (216.18) | 22.15 (6.08) | 32.48 (11.64) | 41.92 (7.89)
1,000 | 1,032.58 (221.56) | 35.46 (5.20) | 66.15 (19.25) | 124.83 (38.32)
2,000 | 2,078.63 (617.22) | 76.99 (17.04) | 157.48 (27.37) | 534.77 (323.76)

(d) Imagenet-34
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 24.43 (0.92) | 17.72 (1.09) | 551.70 (120.53) | 125.82 (8.26)
200 | 57.66 (2.15) | 33.82 (0.96) | 676.39 (10.94) | 183.31 (4.63)
500 | 163.74 (5.54) | 84.34 (4.62) | 906.07 (209.53) | 461.52 (7.48)
1,000 | 340.23 (11.30) | 160.89 (5.65) | 1,028.99 (34.83) | 586.66 (91.72)
2,000 | 1,333.85 (6.53) | 212.32 (4.75) | 1,261.02 (37.39) | 1,841.47 (123.82)

(e) Poker
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 144.22 (11.88) | 12.32 (1.70) | 55.57 (11.22) | 10.88 (1.65)
200 | 411.32 (34.34) | 17.35 (2.07) | 89.14 (32.12) | 46.78 (4.21)
500 | 654.98 (132.70) | 22.82 (2.48) | 117.57 (20.17) | 90.57 (18.57)
1,000 | 2,287.53 (159.06) | 27.37 (2.09) | 202.84 (44.25) | 261.14 (20.51)
2,000 | 4,530.44 (276.37) | 41.08 (2.57) | 256.26 (44.84) | 479.73 (47.46)
(f) Network Intrusion
m | RFF clustering (proposed) | SV clustering (proposed) | Approx. kernel k-means | Nystrom approx. based spectral clustering
100 | 2,252.44 (465.94) | 147.93 (62.03) | 736.59 (238.31) | 145.21 (22.76)
200 | 5,371.85 (1,765.02) | 258.86 (41.32) | 697.04 (442.25) | 169.27 (38.15)
500 | 5,296.87 (3,321.66) | 245.37 (158.57) | 586.14 (130.23) | 366.42 (175.57)
1,000 | 24,151.47 (6,351.34) | 435.53 (189.07) | 763.75 (88.55) | 589.57 (54.14)

Figure 3.5 Effect of the number of Fourier components $m$ on the NMI values (in %) of the partitions obtained using the RFF and SV clustering algorithms, on the six benchmark data sets: (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. Parameter $m$ represents the number of Fourier components for the RFF and SV clustering algorithms, and the sample size for the approximate kernel k-means and Nystrom approximation based spectral clustering algorithms.

3.5.4.4 Scalability

We analyze the scalability of the proposed RFF and SV clustering algorithms for different values of $n$, $d$, and $C$ using the synthetic concentric circles data set. We set the number of Fourier features $m$ to 1,000. Figures 3.6(a) and 3.7(a) show that the running times of the RFF and SV clustering algorithms vary nearly linearly as the number of points in the data set varies from $n = 100$ to $n = 10^7$, with dimensionality $d = 100$ and number of clusters $C = 10$. The scalability plots of the RFF and SV clustering algorithms are similar to the scalability plots of the approximate kernel k-means algorithm, because all three algorithms have linear time complexity with respect to $n$.

The dimensionality of the data set affects the time taken for calculation of the Fourier features. The order of increase in the running times of the two algorithms as $d$ varies from $d = 10$ to $d = 1{,}000$, with $n = 10^6$ and $C = 10$, is shown in Figures 3.6(b) and 3.7(b).

As the number of clusters was increased from $C = 10$ to $C = 1{,}000$, with $n = 10^5$ and $d = 100$, the running times of the RFF and SV algorithms increased almost linearly with $C$, as shown in Figures 3.6(c) and 3.7(c). We note that the number of clusters affects the running time of the SV clustering algorithm more than that of the RFF clustering algorithm, because the SV clustering algorithm projects the data into a $C$-dimensional space before clustering.

Figure 3.6 Running time of the RFF clustering algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.

Figure 3.7 Running time of the SV clustering algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.

3.5.4.5 Out-of-sample clustering

To evaluate the performance of our algorithm on out-of-sample data points, we divided each data set into two parts, one containing 80% of the data, and the other containing the remaining 20%. We call the first part the training set and the second part the test set, in accordance with the convention followed in supervised learning problems. We computed the cluster centers using the training set, and assigned each test point to the closest cluster center, using the SV clustering algorithm. The class assignment of a test point was determined by the majority class in the cluster to which it was assigned.
We compared the performance of our algorithm with the weighted kernel principal component analysis (WKPCA) extension for out-of-sample data points, proposed in [11]. This method first finds the eigenvectors $Z = (\mathbf{z}_1,\ldots,\mathbf{z}_C)$ of the matrix $D^{-1}MK$ corresponding to its smallest $C$ eigenvalues, where $D = \mathrm{diag}(K^\top \mathbf{1})$ is the degree matrix and

$$M = I - \frac{1}{\mathbf{1}^\top D^{-1} \mathbf{1}}\, \mathbf{1}\mathbf{1}^\top D^{-1}$$

is a centering matrix, and then encodes the eigenvectors into binary codewords based on their sign. These codewords are clustered to obtain $C$ binary codewords $\{\mathbf{c}_1,\ldots,\mathbf{c}_C\}$. The following procedure is employed to obtain the cluster label for a new point $\mathbf{x}$:

(i) Project $\mathbf{x}$ onto the space spanned by the eigenvectors of the training set as $\varphi^\top Z$, where $\varphi = (\kappa(\mathbf{x},\mathbf{x}_1),\ldots,\kappa(\mathbf{x},\mathbf{x}_n))^\top$.

(ii) Compute the codeword $\mathbf{c}^* = \mathrm{sign}(\varphi^\top Z)$.

(iii) Assign $\mathbf{x}$ to the cluster $k^*$ which minimizes $d_{HM}(\mathbf{c}^*, \mathbf{c}_k)$, where $d_{HM}$ represents the Hamming distance [66] between two codewords, defined as $d_{HM}(\mathbf{x}_a, \mathbf{x}_b) = |\mathbf{x}_a - \mathbf{x}_b|$.

The WKPCA extension requires the eigendecomposition of an $n \times n$ matrix, which takes $O(n^3)$ time. In addition, an $O(n)$ vector needs to be computed to perform label assignment.

We also compare the performance of the proposed algorithm with the approximate kernel k-means algorithm. The test point $\mathbf{x}$ is added to the closest cluster $k$, measured (up to a constant) by

$$\alpha_k^\top \widehat K \alpha_k - 2\, \varphi^\top \alpha_k,$$

where $\varphi = \left[\kappa(\mathbf{x},\widehat{\mathbf{x}}_1),\ldots,\kappa(\mathbf{x},\widehat{\mathbf{x}}_m)\right]$, $\{\widehat{\mathbf{x}}_1,\ldots,\widehat{\mathbf{x}}_m\}$ is the set of sampled data points, $\widehat K$ is the kernel similarity between the sampled points, and $\alpha_k$ is the $k$th row of the cluster center coefficient matrix $\alpha$, given by (2.10).

We report the running time and accuracy on the six data sets in Table 3.4. The running time is divided into training time and testing time. The training time for WKPCA includes the time taken to compute the kernel matrix for the training data and its eigenvectors, and the time taken to convert the eigenvectors to the cluster codewords. The testing time is the time taken for data projection and Hamming distance computation for all the test data points. For the approximate kernel k-means algorithm, the training time includes the time to cluster the training data and obtain the cluster center coefficient matrix $\alpha$. The testing time includes the time taken to compute the similarity between the test data points and the sampled data points, and the time to assign the cluster labels to the test data points. For SV clustering, the training time is defined as the time taken to compute the random Fourier features and the singular vectors for the training data, and the testing time is defined as the time taken to assign labels to the test data.

The WKPCA method took about 40 seconds, on average, to assign labels to the 12,000 test images in the CIFAR-10 data set, whereas our method took less than 5 seconds, for $m = 1{,}000$. On the MNIST data set, the WKPCA method took about 940 seconds to cluster the test set containing 14,000 data points, significantly longer than the proposed algorithm, which took around 60 seconds, for $m = 1{,}000$. It is infeasible to evaluate the performance of WKPCA on the large data sets. We observed that both the proposed algorithm and the WKPCA method achieved similar classification performance on the CIFAR-10 and MNIST data sets. A reasonably good accuracy was achieved on the remaining large data sets also. The proposed algorithm also runs faster than the approximate kernel k-means algorithm, and achieves comparable test accuracy.
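The codeword-based assignment step of WKPCA described above can be sketched as follows. This Python fragment is illustrative only: the eigenvector matrix `Z`, the cluster codewords, and the kernel function are assumed to be precomputed during training, and the Hamming distance is computed by counting disagreeing signs, which is equivalent (up to a constant factor) to $|\mathbf{x}_a - \mathbf{x}_b|$ for $\pm 1$ codes.

```python
import numpy as np

def wkpca_assign(x, X_train, Z, codewords, kernel):
    """Assign x to the cluster whose codeword is closest in Hamming distance.
    Z: (n, C) eigenvectors of D^{-1} M K for the training set;
    codewords: (num_clusters, C) cluster codewords in {-1, +1};
    kernel(a, B): kernel similarities between a point and the rows of B."""
    phi = kernel(x, X_train)               # (n,) similarities to training data
    proj = phi @ Z                         # projection onto the eigenvectors
    c = np.sign(proj)                      # binary codeword for x
    ham = np.sum(codewords != c, axis=1)   # Hamming distance to each cluster
    return int(np.argmin(ham))

# Toy usage with random placeholder training quantities:
rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 4))
Z = rng.standard_normal((50, 3))
codes = np.sign(rng.standard_normal((6, 3)))
rbf = lambda a, B: np.exp(-np.sum((B - a) ** 2, axis=1))
print(wkpca_assign(rng.standard_normal(4), X_train, Z, codes, rbf))
```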
Table 3.4 Running time (in seconds) and prediction accuracy (in %) for out-of-sample data points. Parameter $m$ represents the sample size for the approximate kernel k-means algorithm and the number of Fourier components for the SV clustering algorithm. The value of $m$ is set to 1,000 for both the algorithms. It is not feasible to execute the WKPCA algorithm on the large Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. Standard deviations are shown in parentheses.

Measure: Method | CIFAR-10 | MNIST | Forest Cover Type | Imagenet-34 | Poker | Network Intrusion
Training time: WKPCA | 755.02 (91.35) | 910.90 (84.37) | - | - | - | -
Training time: Approx. kernel k-means | 26.24 (2.36) | 62.11 (3.58) | 39.38 (3.93) | 1,913 (414) | 391.04 (120.1) | 998.36 (812.73)
Training time: SV clustering | 5.96 (0.83) | 10.48 (0.51) | 25.28 (1.61) | 155.89 (4.77) | 49.75 (6.09) | 115.73 (3.50)
Testing time: WKPCA | 39.68 (2.77) | 29.50 (4.69) | - | - | - | -
Testing time: Approx. kernel k-means | 22.47 (2.05) | 55.38 (1.75) | 26.76 (0.97) | 1,543 (412) | 373.45 (119.5) | 213.68 (29.28)
Testing time: SV clustering | 5.33 (2.25) | 2.12 (0.57) | 5.97 (2.33) | 80.24 (0.02) | 14.24 (0.51) | 121.35 (32.50)
Accuracy: WKPCA | 80.70 | 84.84 | - | - | - | -
Accuracy: Approx. kernel k-means | 83.08 (0.01) | 88.76 (0.001) | 59.39 (0.10) | 88.50 (0.002) | 55.40 (0.001) | 57.30 (0.03)
Accuracy: SV clustering | 83.13 (0.04) | 88.33 (0.52) | 58.42 (0.64) | 80.56 (0.01) | 55.41 (0.04) | 59.03 (0.03)

3.6 Summary

The RFF clustering and SV clustering algorithms, proposed in this chapter, use random Fourier features to obtain a good approximation of kernel clustering using an efficient linear clustering algorithm. We have analytically bounded the approximation error of both these methods. We have shown that, when there is a large gap in the eigenspectrum of the kernel matrix, as is the case in most big data sets, the SV clustering algorithm, which clusters the singular vectors of the random Fourier features, is a more effective and scalable approximation of kernel clustering, allowing large data sets with millions of data points to be clustered using kernel-based clustering. It also solves the out-of-sample clustering problem efficiently. The RFF clustering algorithm can be trivially parallelized by replicating the random Gaussian matrix across the computing nodes, calculating the random Fourier features for a subset of the data in each node, and employing the parallel k-means algorithm to cluster the random Fourier feature matrix and obtain the cluster labels. The SV clustering algorithm can be similarly parallelized, by using the distributed Lanczos eigensolver to obtain the eigenvectors of the random Fourier feature matrix.

The approximate kernel k-means algorithm in Chapter 2 and the random Fourier features-based algorithms in this chapter are all based on sampling the data set and using the samples as basis functions for the cluster centers. While approximate kernel k-means employs the data-dependent Nystrom kernel approximation, and obtains the basis functions by factorizing the kernel matrix, the basis functions in the RFF and SV clustering algorithms depend only on the kernel function. Therefore, these algorithms require a large number of Fourier components to achieve cluster quality equivalent to that of the approximate kernel k-means algorithm. Kernel selection is also very crucial in the RFF and SV clustering algorithms. We have focused on using shift-invariant kernel functions in our work, but these algorithms can be extended to polynomial and intersection kernels using the schemes prescribed in [92] and references therein, to obtain the basis functions.
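The parallelization strategy described in this summary can be illustrated on a single machine. The sketch below (in Python, using multiprocessing as a stand-in for a cluster of computing nodes) replicates the random Gaussian matrix across workers and computes the random Fourier features of each data chunk independently; the chunk sizes, worker count, and function names are illustrative choices, not the thesis implementation.

```python
import numpy as np
from multiprocessing import Pool

def rff_chunk(args):
    """Compute random Fourier features for one chunk, given the shared W."""
    X, W = args
    P = X @ W.T
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(W.shape[0])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((8000, 16))
    W = rng.standard_normal((64, 16))      # replicated across all workers
    with Pool(4) as pool:
        H = np.vstack(pool.map(rff_chunk, [(c, W) for c in np.array_split(X, 4)]))
    # H would then be fed to a parallel k-means (RFF clustering) or to a
    # distributed eigensolver followed by k-means (SV clustering).
```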
Chapter 4

Stream Clustering

4.1 Introduction

In addition to the large volume, big data is also characterized by "velocity", the continuous pace at which data flows in from sources such as sensors, machines, networks, and user interaction with web sites. Analysis of this real-time data can help in making valuable decisions. For instance, intrusions can be detected in IP networks by analyzing the network traffic.

Clustering streaming data is challenging due to the following two reasons:

(i) Streaming data sets are often too large to load in memory; they could potentially be unbounded. Only a small subset of the data may be stored, depending on the amount of memory available, so the data can be accessed at most once; and

(ii) the data is non-stationary, i.e., the distribution of the data changes over time. The data that arrived more recently has higher relevance than the older data.

Batch clustering algorithms such as k-means and kernel k-means assume that the data is completely available in memory at the time of clustering. They also assume that the input data is drawn from a mixture of a fixed set of distributions, and the aim of clustering is to identify these component distributions. Therefore, batch clustering algorithms cannot be directly used to cluster streaming data. Stream clustering algorithms model the data dynamically. Cluster labels are assigned to data points as they arrive, in an online manner. Stream clustering algorithms generally consist of two stages: (i) an online phase, where the stream data is summarized into "prototypes" as it arrives, and (ii) an offline phase, where these prototypes are used to obtain the clusters. The set of prototypes is dynamically updated to account for the evolution of the clusters in the streaming data.

Many stream clustering algorithms use measures such as the Euclidean distance to define the pairwise similarity. As demonstrated in the earlier chapters, kernel-based algorithms achieve better clustering quality than linear clustering algorithms. However, kernel-based clustering algorithms are ill-suited to streams because of their high computational complexity. In this chapter, we adapt the kernel k-means algorithm to efficiently handle streaming data. The proposed algorithm samples the data points as they arrive and constructs an approximate kernel matrix using the sampled points. The sampling is performed with probability proportional to the statistical leverage scores [34] of this matrix, a measure of the importance of the data points. The sampled data points are stored in memory and used to determine the cluster labels of the incoming data points. We show that only a small subset of the data needs to be stored in memory, thereby enhancing the efficiency of kernel clustering for data streams.

4.2 Background

Data stream clustering has been studied extensively in the pattern recognition and data mining literature. Most stream clustering algorithms summarize the data stream using special data structures, and obtain the cluster representatives using this summary. They differ by the data structures used to summarize the data; common data structures are trees, coresets, and grids (see Table 4.1).

Table 4.1 Major published approaches to stream clustering.

Approaches for stream clustering | Examples
CF-Trees | Stream [79], StreamLSearch [140], Scalable k-means [30], Single-pass k-means [62]
Microcluster trees | CluStream [8], ClusTree [98], ClusTrel [124], DenStream [32], HPStream [9]
Coresets | StreamKM++ [6]
Grids | D-Stream [36], ODAC [149]
Approximate clustering | Streaming k-means approximation [10], Fast streaming k-means [162]
Kernel-based | Incremental spectral clustering [139], Adaptive non-linear clustering [86], sKKM [84], TechnoStream [134]

The Stream and LSearch algorithms split the incoming data into chunks, cluster the chunks individually to find the cluster prototypes, and then cluster these prototypes to obtain the final clusters [79, 140]. These algorithms cannot be used to perform real-time clustering. The Clustering Feature (CF) Tree was introduced by Zhang et al.
as a part of the BIRCH clustering algorithm [197]. A CF-Tree summarizes the data stream into a hierarchy of nodes. Each node contains a set of CF-vectors comprising the linear sum and the squared sum of a set of points which are close to each other. The CF-Tree has been used in several stream clustering algorithms such as the scalable k-means and single-pass k-means algorithms [30, 62]. The idea of CF-vectors was then extended to "micro-clusters", which include the temporal information about the data [32, 98, 124]. This information is used to detect evolutionary changes in the data stream. For instance, the CluStream algorithm stores the linear and squared sums of the timestamps of the data points in the microcluster, in addition to the linear sum and the squared sum of the data points. These timestamp values are used to assign weights to the data points, thereby giving more importance to the new data than older data while clustering. Similarly, the HPStream algorithm weights the clusters using the temporal information and assigns data to more recent clusters [9].

A coreset is a weighted subset of points that approximates the input data set up to a pre-defined error margin. The StreamKM++ algorithm summarizes the data stream into a set of coresets organized into a hierarchy known as the coreset tree [6]. Each node in the tree contains a subset of points represented by a set of prototypes. The final clusters are obtained by grouping the coreset representatives in the root node of the coreset tree. Grid-based algorithms such as D-Stream and DGClust partition the $d$-dimensional feature space into grid cells [36, 149]. Each cell is represented by a tuple containing the timestamps, a cluster label and the density of the grid. Data points are added to the grids, and the grid summaries are updated incrementally as the data points arrive. Approximate clustering algorithms such as streaming k-means [10, 162] choose a subset of the points from the stream, ensuring that the selected points are as distant from each other as possible, and execute k-means on the data subset.

To the best of our knowledge, based on the published literature, very few attempts have been made to use non-linear similarity measures for clustering data streams. The agglomerative hierarchical clustering algorithm is adapted to use kernel distance measures in [193]. The incremental spectral clustering algorithm [139] extends spectral clustering to stream data by treating each new edge in the graph as a vector appended to the similarity matrix. The graph Laplacian, its eigenvalues and eigenvectors are updated incrementally with the new edges.

The stream kernel k-means algorithm [84] divides the data set into windows of fixed time-steps, and performs clustering using the data points in every two consecutive windows. Information from the current time-step is passed on to the next time-step in the form of meta-vectors containing weights for each of the $C$ clusters. Jain et al. proposed a two-tier system called the adaptive non-linear clustering algorithm to perform stream clustering using non-linear similarity [86]. In the first tier, the incoming data points are partitioned into segments, separated from each other by novel data points. A data point $\mathbf{x}$ is considered novel if the kernel-based distance from $\mathbf{x}$ to the mean of the data points in the current segment is greater than a user-defined threshold. In the second tier, the representative segments are identified and projected into a low-dimensional space spanned by the dominant principal coordinates of the data in the kernel space [76]. The cluster labels for the data points are obtained by clustering the low-dimensional representations of the data. This technique requires the eigendecomposition of a large number of points in the second tier. Unlike these existing methods, the proposed method uses the complete history of the data, and does not require complex operations.

4.3 Approximate Kernel k-means for Streams

Figure 4.1 Schema of the proposed approximate stream kernel k-means algorithm.

In Chapter 2, we presented the approximate kernel k-means algorithm, which constrained the cluster centers to the span of a subset of the data points. We employ a similar strategy to cluster streaming data. The key idea is to sample the data points as they arrive and construct the kernel matrix incrementally using the sampled points. This approximate kernel matrix is used to cluster the sampled points. The cluster labels are assigned to the unsampled data points using their kernel similarity with the sampled points. A high-level overview of the proposed clustering framework is presented in Figure 4.1. Our framework consists of three primary components, working in tandem: (i) importance sampling, (ii) clustering, and (iii) cluster label assignment. The sampling component samples the points from the stream, and constructs the approximate kernel matrix. The clustering and label assignment components update the clusters and the number of clusters dynamically, and assign cluster labels to all the data points in the stream.
InChapter2,wepresentedtheapproximatekernel k -meansalgorithmwhichconstrainedthe clustercenterstothespanofasubsetofthedatapoints.Wee mployasimilarstrategytocluster streamingdata.Thekeyideaistosamplethedatapointsasth eyarriveandconstructthekernel matrixincrementallyusingthesampledpoints.Thisapprox imatekernelmatrixisusedtocluster thesampledpoints.Theclusterlabelsareassignedtotheun sampleddatapointsusingtheirkernel similaritywiththesampledpoints.Ahighleveloverviewof theproposedclusteringframework ispresentedinFigure4.1.Ourframeworkconsistsofthreep rimarycomponents,workingin tandem:(i)importancesampling,(ii)clustering,and(iii )clusterlabelassignment.Thesampling componentsamplesthepointsfromthestream,andconstruct stheapproximatekernelmatrix. Theclusteringandlabelassignmentcomponentsupdatethec lustersandthenumberofclusters dynamically,andassignclusterlabelstoallthedatapoint sinthestream. 117 Wedescribeeachofthesecomponentsinthefollowingsectio ns: 4.3.1Sampling Oneoftheobstaclestousingkernel k -meansforclusteringstreamdataisthatitrequiresthe computationofthe n n kernelmatrix,where n isthenumberofpointsinthedataset.Itis infeasibletocomputethefullkernelmatrixforstreamdata because n ispotentiallyunbounded. Theapproximatekernel-basedclusteringalgorithmspropo sedinChapters2and3alsoneedto storetheentiredatainmemory,beforeconstructingtheapp roximatekernelmatrices.Thestream clusteringalgorithmproposedinthischapteralleviatest hisissuebyincrementallysamplinga subsetofthepointsfromthestream,andusingonlythissubs ettoconstructthekernelmatrix. Wemaintainabuffer S inmemorytostorethesampledpoints;thenumberofpoints s in S is constrainedbytheuser-denedparameters m and M ( m s M ).Let K t 1 representthekernel matrixattime ( t 1) with K 1 = ( x 1 ;x 1 ) .Whenadatapoint x t arrivesattime t ,weupdatethe kernelmatrixas K t = 8 > > > > < > > > > : 2 6 4 K t 1 ' > ' ( x t ;x t ) 3 7 5 withprobability p t ;K t 1 withprobability 1 p t , (4.1) where K t 1 =[ ( x i ;x j )] ;x i ;x j 2 S ,and ' =( ( x t ;x 1 ) ;:::; ( x t ;x s )) > . Thesimplestmethodofdeterminingwhetherornottoaddadat apoint x t to S ,istoper- formindependentBernoullitrials,i.e. x t isstoredin S withprobability p t = 1 2 .However, Bernoullisamplingresultsinalargekernelapproximation error,andrequiresalargenumberof pointstobestoredinmemory 1 .Toalleviatethisissue,weperformimportancesamplingin stead ofBernoullisampling.Thesamplingprobability p t foreachpoint x t isbasedonitsfiimpor- 1 WedemonstratethisusingasyntheticdatasetinFigure4.2, andusingfourlargebenchmarkdatasetsinSec- tion4.5. 118 (a) (b) (c) (d) Figure4.2Illustrationofimportancesamplingonatwo-dim ensionalsyntheticdatasetcontaining 1 ;000 pointsalong 10 concentriccircles( 100 pointsineachcluster),representedbyfioflinFig- ure(a).Figure(b)shows 50 pointssampledusingimportancesampling,andFigures(c)a nd(d) show 50 and 100 pointsselectedusingBernoullisampling,respectively.T hesampledpointsare representedusingfi*fl.Allthe 10 clustersarewell-representedbyjust 50 pointssampledusing importancesampling.Ontheotherhand, 50 pointssampledusingBernoullisamplingarenotad- equatetorepresentthese 10 clusters(Cluster 4 inredhasnorepresentatives).Atleast 100 points areneededtorepresentalltheclusters. 
To alleviate this issue, we perform importance sampling instead of Bernoulli sampling. The sampling probability $p_t$ for each point $\mathbf{x}_t$ is based on its "importance", defined in terms of the statistical leverage scores [56]. Let the kernel matrix $K_t$ at time $t$ be decomposed as $K_t \simeq V_C \Sigma_C V_C^\top$, where $C$ represents the number of active clusters at time $t$ (we refer to the set of clusters that the data points in the buffer $S$ belong to at time $t$ as the set of active clusters), $\Sigma_C = \mathrm{diag}(\lambda_1,\ldots,\lambda_C)$ contains the highest $C$ eigenvalues of $K_t$, and $V_C = (\mathbf{v}_1,\ldots,\mathbf{v}_C)$ contains the corresponding eigenvectors. The probability of adding point $\mathbf{x}_t$ to $S$ is defined by

$$p_t = \frac{1}{C} \left\| V_C^{(t)} \right\|_2^2, \qquad (4.2)$$

where $V_C^{(j)}$ is the $j$th row of $V_C$. Statistical leverage scores measure the correlation between the eigenvectors of the matrix $K_t$ and the standard basis. A high score indicates that the corresponding data point has a large influence in the approximation of the kernel matrix. The subset of data corresponding to the largest statistical leverage values is the most informative, and can represent the distribution of the entire data. By performing importance sampling on the data stream, the samples that have not been adequately represented by the existing samples are added to the buffer.

Statistical leverage scores have been used successfully to obtain low-rank matrix approximations of large matrices, and to perform large-scale regression and other large-scale data analysis operations [28, 34]. The following lemma, adapted from [74], shows that, at time $t$, the approximation error between the true kernel matrix for the $t$ points $\{\mathbf{x}_1, \mathbf{x}_2,\ldots,\mathbf{x}_t\}$ and the low-rank kernel matrix constructed using this sampling scheme is minimized when the number of samples in $S$ at time $t$ is $s = \Omega(C \ln C)$:

Lemma 8. Let $K$ be a $t \times t$ SPSD matrix, and let $V_C = (\mathbf{v}_1,\ldots,\mathbf{v}_C)$ represent the eigenvectors corresponding to the top $C$-dimensional eigenspace of $K$. Let $K_B$ represent the $t \times s$ matrix obtained by sampling the columns of $K$ with the probability defined in (4.2), and let $\widehat K$ be the $s \times s$ submatrix of $K_B$ corresponding to the sampled columns. For a given failure probability $\delta \in (0,1]$ and approximation factor $\epsilon \in (0,1]$, if $s \ge 3200\,\epsilon^{-2}\, C \ln(4C/\delta)$, we have

$$\left\| K - K_B \widehat K^{-1} K_B^\top \right\|_2 \le \left\| K - K_C \right\|_2 + \epsilon^2 \left\| K - K_C \right\|_*,$$

where $K_C$ is the best $C$-rank approximation of $K$, and $\|\cdot\|_2$ and $\|\cdot\|_*$ represent the spectral norm and trace norm, respectively. (Lemma 8 bounds the error between the approximate kernel and the true kernel for a set of $t$ data points. We demonstrate empirically in Section 4.5 that the accumulated error as time $t$ increases is well-bounded.)

By using importance sampling, we obtain a good approximation of the true kernel by sampling just a fraction of the data set. Figures 4.2(a)-(d) illustrate the advantage of importance sampling over Bernoulli sampling on a two-dimensional data set containing 1,000 points from 10 clusters. Each true cluster is a concentric circle of varying radius, with 100 points, as shown in Figure 4.2(a). Figure 4.2(b) also shows 50 points sampled using importance sampling. We observe that all the 10 clusters are adequately represented by the 50 sampled points. Figure 4.2(c) shows that 50 points sampled from the data using Bernoulli sampling do not represent all the clusters, as the probability of sampling data points from all the clusters is low. All the clusters are represented only when 100 points are sampled, as shown in Figure 4.2(d).

Figure 4.2 Illustration of importance sampling on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles (100 points in each cluster), represented by "o" in Figure (a). Figure (b) shows 50 points sampled using importance sampling, and Figures (c) and (d) show 50 and 100 points selected using Bernoulli sampling, respectively. The sampled points are represented using "*". All the 10 clusters are well-represented by just 50 points sampled using importance sampling. On the other hand, 50 points sampled using Bernoulli sampling are not adequate to represent these 10 clusters (Cluster 4, in red, has no representatives). At least 100 points are needed to represent all the clusters.
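The probability in (4.2) is simply the normalized squared row norm of the top-$C$ eigenvector matrix. The sketch below computes these leverage-score probabilities for the buffered points; because the columns of $V_C$ are orthonormal, the squared row norms sum to $C$ and the probabilities sum to one. How the score of a newly arriving point is obtained from the incrementally updated eigensystem is an implementation detail not shown here.

```python
import numpy as np

def leverage_probabilities(V_C):
    """p_i = ||V_C^{(i)}||^2 / C for each row i of V_C, as in (4.2)."""
    return np.sum(V_C ** 2, axis=1) / V_C.shape[1]

# Toy usage on an SPSD matrix standing in for the sampled kernel matrix:
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
K = A @ A.T
vals, vecs = np.linalg.eigh(K)
V_C = vecs[:, -5:]                  # top C = 5 eigenvectors
p = leverage_probabilities(V_C)
print(p.shape, p.sum())             # (100,) 1.0
```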
4.3.2 Clustering

Let $s$ be the number of points in the buffer $S$, and let $C$ be the number of active clusters at time $t$. After the kernel matrix $K_t$ is constructed in accordance with (4.1), the data points in $S$ can be partitioned into $C$ clusters by solving the kernel k-means problem

$$\max_{U \in \mathcal{P}} \ \mathrm{tr}(\widetilde U K_t \widetilde U^\top), \qquad (4.3)$$

where $U = (\mathbf{u}_1,\ldots,\mathbf{u}_C)^\top$ is the cluster membership matrix, $\widetilde U = [\mathrm{diag}(U\mathbf{1})]^{-1/2}\, U$, the domain $\mathcal{P} = \{U \in \{0,1\}^{C \times s} : U^\top \mathbf{1} = \mathbf{1}\}$, and $\mathbf{1}$ is a vector of all ones. The running time complexity of this step would be $O(s^2)$. We further reduce this complexity by constraining the cluster centers to a smaller subspace, spanning the top $C$ eigenvectors of the kernel matrix $K_t$, along the lines of the spectral clustering algorithm. We pose the clustering problem as the following optimization problem:

$$\min_{U \in \mathcal{P}} \ \max_{\{c_k(\cdot) \in \mathcal{H}_a\}_{k=1}^C} \ \sum_{k=1}^C \sum_{i=1}^s \frac{U_{k,i}}{s} \left\| c_k(\cdot) - \kappa(\mathbf{x}_i, \cdot) \right\|_{\mathcal{H}_\kappa}^2, \qquad (4.4)$$

where $\mathcal{H}_a = \mathrm{span}(\mathbf{v}_1,\ldots,\mathbf{v}_C)$. The cluster centers can be expressed as linear combinations of the eigenvectors of the kernel matrix:

$$c_k(\cdot) = \sum_{i=1}^s \sum_{j=1}^C \frac{U_{k,i}}{n_k} \sqrt{\lambda_j}\, v_{ij} = \frac{\mathbf{u}_k}{n_k}\, V_C\, \Sigma_C^{1/2}, \quad k \in [C], \qquad (4.5)$$

where $n_k$ is the number of points in the $k$th cluster, and $\mathbf{u}_k = (U_{k,1}, U_{k,2},\ldots,U_{k,s})^\top$. By substituting (4.5) in (4.4), we obtain the following trace maximization problem:

$$\max_{U \in \mathcal{P}} \ \mathrm{tr}(\widetilde U V_C \Sigma_C V_C^\top \widetilde U^\top). \qquad (4.6)$$

The above problem can be solved efficiently by executing k-means on the matrix $V_C \Sigma_C^{1/2}$. In the following lemma, we show that the error incurred due to the approximation (4.4) is bounded, when the lowest eigenvalues of the kernel matrix have small magnitudes, which is true for most real data sets [45]:

Lemma 9. Let $\mathcal{E}$ and $\mathcal{E}_a$ represent the optimal clustering errors in (4.3) and (4.6), respectively. We have

$$|\mathcal{E} - \mathcal{E}_a| \le \sum_{i=C+1}^s \lambda_i.$$

Proof. Let $\{c_k^*(\cdot)\}_{k=1}^C$ and $U^*$ be the optimal solution to (4.3). Let $c_k^a(\cdot)$ represent the projection of $c_k^*$ into the subspace $\mathcal{H}_a$. For any $\kappa(\mathbf{x}_i,\cdot)$, let $g_i(\cdot)$ and $h_i(\cdot)$ be the projections of $\kappa(\mathbf{x}_i,\cdot)$ into the subspace $\mathcal{H}_a$ and $\mathrm{span}(\mathbf{v}_{C+1},\ldots,\mathbf{v}_s)$, respectively. We have

$$\begin{aligned}
\mathcal{E}_a &= \min_{U \in \mathcal{P}}\ \max_{c_k(\cdot)\in\mathcal{H}_a}\ \sum_{k=1}^C \sum_{i=1}^s \frac{U_{k,i}}{s} \left\|c_k(\cdot) - \kappa(\mathbf{x}_i,\cdot)\right\|_{\mathcal{H}_\kappa}^2
\le \sum_{k=1}^C \sum_{i=1}^s \frac{U^*_{k,i}}{s} \left\|c_k^a(\cdot) - \kappa(\mathbf{x}_i,\cdot)\right\|_{\mathcal{H}_\kappa}^2 \\
&\le \sum_{k=1}^C \sum_{i=1}^s \frac{U^*_{k,i}}{s} \left( \left\|c_k^a(\cdot) - g_i(\cdot)\right\|_{\mathcal{H}_\kappa}^2 + \left\|h_i(\cdot)\right\|_{\mathcal{H}_\kappa}^2 \right)
\le \mathcal{E} + \frac{1}{s} \sum_{k=1}^C \sum_{i=1}^s \left\|h_i(\cdot)\right\|_{\mathcal{H}_\kappa}^2
\le \mathcal{E} + \sum_{i=C+1}^s \lambda_i.
\end{aligned}$$

We note that the eigenvalues and eigenvectors do not need to be re-computed for clustering, as they were already computed while calculating the leverage scores. This eliminates the need for computing and storing the kernel matrix $K_t$, as only its top eigenvalues and the corresponding eigenvectors are required for both sampling and clustering. Starting with $V_C = 1$ and $\Sigma_C = \kappa(\mathbf{x}_1, \mathbf{x}_1)$, we can update the eigensystem incrementally as the data points arrive. Efficient methods to update the eigenvectors and eigenvalues incrementally are discussed in Section 4.4.
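Solving (4.6) amounts to running k-means on the rows of the spectral embedding $V_C \Sigma_C^{1/2}$. A minimal sketch, assuming scikit-learn for the k-means step (the thesis implementation is in MATLAB):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_buffer(V_C, lam_C, C):
    """Approximately solve (4.6): k-means on the rows of V_C * sqrt(lam_C)."""
    Y = V_C * np.sqrt(lam_C)           # (s, C) spectral embedding of the buffer
    return KMeans(n_clusters=C, n_init=10).fit_predict(Y)

# Toy usage on an SPSD matrix standing in for K_t:
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
K = A @ A.T
vals, vecs = np.linalg.eigh(K)
C = 4
labels = cluster_buffer(vecs[:, -C:], vals[-C:], C)
print(np.bincount(labels))
```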
4.3.3 Label Assignment

Data points are assigned cluster labels using the cluster centers obtained from the sampled data points, in a manner similar to the SV clustering algorithm in Chapter 3, and the active clusters are updated using a fading cluster mechanism, similar to that used by the adaptive non-linear clustering algorithm [86]. Each cluster $k$ is associated with a timestamp $t_k$, representing the last time a data point was assigned the $k$th cluster label, and a recency value defined by a monotonic function

$$f_k(t) = \exp(-\lambda (t - t_k)), \qquad (4.7)$$

where $\lambda$ is a user-defined parameter representing the decay rate of a cluster [9]. A data point $\mathbf{x}_t$ is added to cluster $k^*$ if

$$k^* = \arg\min_{k \in [C]} \left\| c_k(\cdot) - g_t(\cdot) \right\|_{\mathcal{H}_\kappa}^2 \quad\text{and}\quad f_{k^*}(t) > \eta, \qquad (4.8)$$

where $c_k(\cdot)$ is the cluster center given by (4.5), $g_t(\cdot)$ is the projection of $\kappa(\mathbf{x}_t,\cdot)$ into the subspace spanned by the eigenvectors $V_C$, and $\eta$ is a user-defined lifetime threshold which determines how long a cluster remains active. If the recency $f_{k^*}(t)$ of the closest cluster $k^*$ is less than $\eta$, then a new cluster is created with the data point $\mathbf{x}_t$. After the cluster assignment is made, the timestamp and the recency value of the assigned cluster are updated. Clusters whose recency is less than $\eta$ (called stale clusters) are deleted, and the data points in the buffer that belong to these stale clusters are removed from the buffer.

Algorithm 8 describes the proposed stream clustering method. The input to the algorithm is the data stream $\mathcal{D}$, the kernel function $\kappa(\cdot,\cdot)$, the initial number of clusters $C$, the buffer size parameters ($m$ and $M$), and the cluster fading mechanism parameters ($\lambda$ and $\eta$). Selection of the kernel function and the initial number of clusters $C$ is based on domain knowledge. Several articles in the literature describe techniques to learn the kernel function from the data [112, 177, 200]. The parameters $m$ and $M$ should be set such that the initial and final sample sets contain sufficient representatives from all the clusters. The parameters $\lambda$ and $\eta$ should be selected based on how fast the categories are expected to change in the stream. Heuristics to set these parameters are discussed further in Section 4.5.

Algorithm 8 Approximate Stream Kernel k-means
1: Input:
   $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots\},\ \mathbf{x}_i \in \Re^d$: the data stream to be clustered;
   $\kappa(\cdot,\cdot): \Re^d \times \Re^d \mapsto \Re$: the kernel function;
   $C$: the initial number of clusters;
   $m$: the initial number of points to be sampled ($m > C$);
   $M$: the maximum number of points allowed in the sample set ($m < M$);
   $\lambda$, $\eta$: the cluster decay rate and lifetime threshold.
2: Initialize: add the first $m$ points of $\mathcal{D}$ to the buffer $S$, and compute $K = [\kappa(\mathbf{x}_i, \mathbf{x}_j)],\ \mathbf{x}_i, \mathbf{x}_j \in S$.
3: Compute the top $C$ eigenvalues $\Sigma_C$ and eigenvectors $V_C$ of $K$.
4: Cluster the rows of $V_C \Sigma_C^{1/2}$ into $C$ clusters by solving (4.6), and set the cluster timestamps $t_k$.
5: for $t = m+1, m+2, \ldots$ do
6:   Compute the sampling probability $p_t$ using (4.2).
7:   Update the kernel matrix $K_t$ according to (4.1), adding $\mathbf{x}_t$ to $S$ with probability $p_t$.
8:   if $\mathbf{x}_t$ was added to $S$ then
9:     Update the eigenvalues $\Sigma_C$ and eigenvectors $V_C$ (see Section 4.4).
10:    Re-cluster the points in $S$ by solving (4.6).
11:  end if
12:  Project $\kappa(\mathbf{x}_t, \cdot)$ into the subspace spanned by $V_C$ to obtain $g_t(\cdot)$.
13:  Find the closest cluster $k^*$ using (4.8).
14:  If $f_{k^*}(t) > \eta$, assign $\mathbf{x}_t$ to $k^*$; otherwise create a new cluster with $\mathbf{x}_t$ and set $C = C + 1$.
15:  Find the clusters whose recency $f_k(t) < \eta,\ k \in [C]$, and remove these stale clusters. Set $C = C - c$, where $c$ is the number of stale clusters.
16:  If $\mathrm{card}(S) \ge M$, find the index $q = \arg\min_l \| V_C^{(l)} \|_2^2$ and remove the data point $\mathbf{x}_q$ from $S$.
17: end for

4.4 Implementation and Complexity

The two major operations in the proposed algorithm are: the computation of the leverage scores, and the clustering of the top $C$ eigenvectors of the approximate kernel matrix using k-means. Both operations require the eigenvalues and eigenvectors of the kernel matrix. Let $s$ be the number of points in the sample set $S$ at time $t$. Eigendecomposition of an $s \times s$ kernel matrix $K_t$ takes $O(s^3)$ time, if performed naively. However, we can update the eigensystem incrementally using the fast rank-one update mechanism proposed in [31]. Given the eigendecomposition $K_t = V \Sigma V^\top$ and a vector $\varphi \in \Re^s$, this method finds the eigendecomposition of $K_t + \varphi\varphi^\top$ as

$$K_t + \varphi\varphi^\top = \begin{bmatrix} V & \dfrac{\mathbf{w}}{\|\mathbf{w}\|} \end{bmatrix} \Sigma' \begin{bmatrix} V & \dfrac{\mathbf{w}}{\|\mathbf{w}\|} \end{bmatrix}^\top, \qquad (4.9)$$

where $\mathbf{w} = (I - VV^\top)\varphi$ is the component of $\varphi$ that is orthogonal to $V$, and $\Sigma'$ contains the dominant eigenvalues of the sparse matrix

$$\begin{bmatrix} \Sigma & V^\top \varphi \\ \varphi^\top V & \|\mathbf{w}\| \end{bmatrix}.$$

This operation, repeated every time a new data point is input to the system, can be performed in $O(sC + C^3)$ time.
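The rank-one update can be sketched as follows: given the current top-$C$ eigensystem $(V, \Sigma)$ and the new similarity vector $\varphi$, the problem reduces to a small $(C+1) \times (C+1)$ eigendecomposition. The Python fragment below follows the standard rank-one update construction for $K \leftarrow K + \varphi\varphi^\top$ at a fixed size; handling the growth of $K_t$ by one row and column (for example, by zero-padding $V$ first) is omitted, and the function name is our own.

```python
import numpy as np

def rank_one_update(V, lam, phi, C):
    """Update the top-C eigensystem of K under K + phi phi^T.
    V: (s, C) eigenvectors; lam: (C,) eigenvalues; phi: (s,)."""
    b = V.T @ phi                          # component of phi inside span(V)
    w = phi - V @ b                        # orthogonal residual
    nw = np.linalg.norm(w)
    if nw < 1e-12:                         # phi already lies in span(V)
        U = V
        M = np.diag(lam) + np.outer(b, b)
    else:
        U = np.hstack([V, (w / nw)[:, None]])
        r = np.append(b, nw)
        M = np.diag(np.append(lam, 0.0)) + np.outer(r, r)
    vals, vecs = np.linalg.eigh(M)         # small (C+1) x (C+1) problem
    idx = np.argsort(vals)[::-1][:C]       # keep the dominant C eigenpairs
    return U @ vecs[:, idx], vals[idx]
```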
Clustering is performed every time a point is added to the sample set $S$, which takes $O(sC^2 l)$ time, where $l$ is the number of iterations required to reach convergence. In order to reduce the running time, we can employ a lazy reclustering approach, by which we perform the clustering after every $T$ data point additions. To further enhance the efficiency of the algorithm, the data points can also be processed in batches of size $B$.

In summary, the time taken by the proposed approximate stream kernel k-means algorithm to cluster a data set of size $n$ is $O(ndM + nCM + nC^3 + M^2C^2 l) \approx O(nd + nC)$, when $\max(C, d, M, l) \ll n$. This contrasts with the $O(n^2)$ running time complexity of typical kernel-based clustering.

4.5 Experimental Results

4.5.1 Datasets

The proposed stream clustering algorithm inputs the data set in batches, and can handle potentially unbounded data sets, hence the size of the data set is not significant. The dimensionality of the data set plays an important role in the kernel similarity computation and the eigensystem update. We demonstrate the effectiveness of the proposed algorithm on the CIFAR-10, MNIST, Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets.

4.5.2 Baselines

We compared the performance of the proposed algorithm with two recent stream clustering algorithms (StreamKM++ and sKKM), which have been shown to perform better than the other stream clustering algorithms. The StreamKM++ algorithm [6] is a linear stream clustering algorithm which, in the same spirit as the proposed algorithm, extracts the core points in the streaming data, and uses these core points to determine the cluster centers. The algorithm maintains a set of buckets, each of size $m$. Data points are added to the first bucket until $m$ points are received. They are then recursively merged with the points in the subsequent buckets to form a coreset of $m$ points, using a coreset tree. The coresets are finally clustered using the k-means++ algorithm [12] to obtain the cluster centers. The performance of this algorithm depends on the coreset size $m$. The streaming kernel k-means (sKKM) algorithm proposed in [84] processes the data in chunks of size $m$. The initial data chunk is clustered using kernel k-means. Weighted kernel k-means is used to cluster the subsequent data chunks; the cluster centers from the preceding data chunk are used to obtain the weights. We show that the proposed approximate stream kernel k-means is more effective than these algorithms. We also compare the performance of the proposed algorithm with (i) the batch k-means algorithm, to show that our algorithm achieves higher accuracy, and (ii) the batch kernel k-means algorithm, to evaluate the loss in the cluster quality. We could execute the kernel k-means algorithm only on the medium-sized CIFAR-10 and MNIST data sets, due to its quadratic time complexity. For the remaining data sets, we executed kernel k-means on a 50,000-sized randomly selected subset of the data, and assigned the remaining points to the closest cluster centers. This gives us an approximation of the time taken to execute kernel k-means on the full data set. We finally evaluate the performance of the proposed approximate stream kernel k-means algorithm when each data point is sampled with probability $1/2$, and show that importance sampling plays a significant role in reducing the memory requirements and enhancing the clustering accuracy.
4.5.3 Parameters

We used the universal RBF kernel for the proposed algorithm and the kernel-based baseline algorithms on all the data sets. We tuned the kernel width using grid search in the range $[0,1]$ to obtain the best performance. For the proposed approximate stream kernel k-means algorithm, we varied the initial sample size from $m = 1{,}000$ to $m = 5{,}000$ in multiples of 1,000, and the maximum buffer size from $M = 5{,}000$ to $M = 20{,}000$ in multiples of 5,000, to constrain the memory used to 4 GB. We employed the lazy reclustering approach with $T$ set to 50, and processed the data in batches of size $B = 10{,}000$. We set the cluster decay factor $\lambda = 0.5$, as suggested in [86], and varied the lifetime threshold $\eta$ as $\eta = \exp(-\lambda\tau)$, where $\tau = \{1, 2, \ldots, 5\}$. The coreset size and chunk size parameters for the StreamKM++ and sKKM algorithms were varied from 1,000 to 5,000. The initial number of clusters $C$ was set equal to the true number of classes in the data set, for all the algorithms.

We obtained the code for the StreamKM++ algorithm from the authors (available at http://www.algorithm-engineering.de/software-projects?view=project&task=show&id=17), and implemented the other algorithms in MATLAB. We executed each algorithm 10 times on a 2.8 GHz processor, with the memory constrained to 4 GB for the stream clustering algorithms, and to 40 GB for the batch clustering algorithms. We present the mean and variance of the time taken for clustering (in milliseconds) and the clustering quality, measured in terms of the silhouette coefficient and NMI [104], over these 10 runs. Different permutations of the data set were input to the clustering algorithms in each run.

4.5.4 Results

4.5.4.1 Clustering efficiency and quality

The clustering time for our algorithm is computed as the average time taken to assign a label to each data point. For the baseline algorithms, we computed this time by dividing the total time taken to cluster the data set by the number of points in the data set. Figures 4.3, 4.4 and 4.5 compare the running time, silhouette coefficient and NMI values, respectively, of the proposed algorithm with the baseline algorithms, when the parameters are $m = 5{,}000$, $M = 20{,}000$ and $\tau = 1$.

Figure 4.3 Running time (in milliseconds) of the stream clustering algorithms on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. The parameters for the proposed approximate stream kernel k-means algorithm are set to $m = 5{,}000$, $M = 20{,}000$, and $\tau = 1$. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm, are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate running time of kernel k-means on these data sets is obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.4 Silhouette coefficient values of the partitions obtained using the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10 and (b) MNIST. The parameters for the proposed algorithm were set to $m = 5{,}000$, $M = 20{,}000$, and $\tau = 1$. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm, were set to 5,000.

As expected, the proposed algorithm was faster than the batch kernel k-means algorithm and its approximation (described in Section 4.5.2) on most of the data sets, but took longer than the k-means algorithm, because our algorithm has to compute the kernel similarity and its top eigenvectors, unlike the k-means algorithm.
The silhouette coefficient values of the proposed algorithm are comparable to those of kernel k-means, showing that they yielded similar partitions. The NMI achieved by our algorithm is higher than that of k-means because of the use of non-linear similarity measures. The proposed algorithm also outperforms the approximate variant of the kernel k-means algorithm described in Section 4.5.2. On the CIFAR-10 data set, the batch kernel k-means achieved an NMI value of 16.9%; the proposed algorithm achieves comparable NMI values (15.5%).

Compared to the StreamKM++ algorithm, the proposed algorithm achieves higher clustering quality, both in terms of the silhouette coefficient and NMI, although it takes slightly longer to assign cluster labels to the points. This is due to the fact that our algorithm needs to update and cluster the eigenvectors of the approximate kernel matrix for each batch of data points. The proposed algorithm offers the advantage that the cluster labels can be obtained in real-time, unlike the StreamKM++ algorithm, which needs to process all the data points before assigning the cluster labels. For instance, the proposed algorithm was able to cluster about 2,700 images from the CIFAR-10 data set per second, which is equivalent to a speed of about 8 MBps. On the remaining three data sets, the clustering speed ranges from 30 KBps to 700 KBps. Our algorithm also outperforms the sKKM clustering algorithm in terms of clustering quality. While the sKKM algorithm is slower than the proposed algorithm on the CIFAR-10 data set, its speed is on par with the proposed algorithm on the remaining data sets. The StreamKM++ algorithm obtains clusters from coresets which summarize all the points in the data set. The sKKM algorithm relies on the information from only two time steps and discards most of the historical information. The proposed approximate stream kernel k-means algorithm finds the middle ground by retaining potentially useful data points using importance sampling, and discarding the rest of the data points. This is reflected in the silhouette and NMI values achieved by the algorithms.

Figure 4.5 NMI (in %) of the clustering algorithms with respect to the true class labels on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. The parameters for the proposed approximate stream kernel k-means algorithm are set to $m = 5{,}000$, $M = 20{,}000$, and $\tau = 1$. The coreset size for the StreamKM++ algorithm, and the chunk size of the sKKM algorithm, are set to 5,000. It is not feasible to execute kernel k-means on the Forest Cover Type, Imagenet-34, Poker, and Network Intrusion data sets due to their large size. The approximate NMI values of kernel k-means on these data sets are obtained by first executing kernel k-means on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Figure 4.6 shows how the NMI values of the proposed algorithm fall due to the accumulation of the kernel approximation error over time. We observe that the reduction in NMI is slow and stabilizes over time for most of the data sets, showing that the approximation error reduces over time. The error accumulation can be further minimized by clustering the points in the buffer more frequently (as discussed in Section 4.4), although this would increase the running time. The user can trade off between efficiency and accuracy by tuning the parameters of the algorithm.

Figure 4.6 Change in the NMI (in %) of the proposed approximate stream kernel k-means algorithm over time. The parameters $m$, $M$ and $\tau$ were set to $m = 5{,}000$, $M = 20{,}000$ and $\tau = 1$, respectively.
4.5.4.2 Parameter sensitivity

The proposed approximate stream kernel k-means algorithm relies on five parameters: the initial sample size $m$, the maximum buffer size $M$, the initial number of clusters $C$, the cluster decay rate $\lambda$, and the cluster lifetime threshold $\eta$. We study the influence of these parameters on the algorithm's performance and present heuristics to set the parameter values:

Initial sample size $m$: The time taken by the proposed algorithm to cluster each data point $\mathbf{x}_t$ is influenced by the number of points in the buffer $S$ at time $t$, because the size of the eigenvector matrix $V_C$ increases proportionally. The buffer size at time $t$, in turn, depends on the first $m$ data points $\{\mathbf{x}_1,\ldots,\mathbf{x}_m\}$ input to the system. More data points are sampled from the stream and added to $S$ if the initial sample does not contain a sufficient number of representative points. On the CIFAR-10 data set, the number of additional points sampled reduced from 6,087 to 4,434 as the initial sample size $m$ was increased from 1,000 to 5,000. Similar trends were observed for the remaining data sets as well. Figure 4.7 compares the running time of the proposed algorithm with the StreamKM++ and sKKM algorithms as the parameter $m$ is varied. Recall that $m$ represents the coreset size and the chunk size for the StreamKM++ and sKKM algorithms, respectively. As $m$ was increased, the time taken for clustering by the baseline algorithms also increased. As expected, the proposed algorithm took slightly longer than the StreamKM++ and sKKM algorithms for most data sets, especially when $m$ was large. However, the NMI values achieved by the proposed algorithm are much higher than those achieved by the baseline algorithms, as shown in Figure 4.9. Our algorithm's accuracy improves significantly as $m$ increases, while there is minimal improvement in the cluster quality of the StreamKM++ algorithm. This improvement in accuracy compensates for the higher running time of the proposed algorithm. These results indicate that the initial sample, determined by the order of the data, plays a crucial role in the performance of the proposed algorithm. The variance in the NMI tends to reduce as $m$ increases, again indicating that the order of the data is important.

Figure 4.7 Effect of the initial sample size $m$ on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. Parameter $m$ represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters $M$ and $\tau$ are set to $M = 20{,}000$ and $\tau = 1$, respectively.

Figure 4.8 Effect of the initial sample size $m$ on the silhouette coefficient values of the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10 and (b) MNIST. Parameter $m$ represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters $M$ and $\tau$ are set to $M = 20{,}000$ and $\tau = 1$, respectively.

The silhouette coefficient values achieved by the proposed algorithm vary minimally with the increase in the initial sample size, as shown in Figure 4.8.
Maximum buffer size $M$: The maximum buffer size $M$ does not affect the algorithmic efficiency of the proposed algorithm, provided that $M \approx 2m$ and the initial sample is representative of the stream. If $M$ is small, data points need to be removed more often from the buffer to accommodate the newly sampled data points, which results in an increased running time, as shown in Table 4.2. For instance, when $M$ was set to 5,000, about 2,500 points were removed from the buffer, whereas no points needed to be removed when $M = 20{,}000$, resulting in a 2 millisecond reduction of the clustering time per data point. The silhouette coefficient values vary minimally with $M$, as recorded in Table 4.3. The NMI value increases as $M$ increases, because a larger number of representative data points can be stored in the buffer, as shown in Table 4.4.

Figure 4.9 Effect of the initial sample size $m$ on the NMI (in %) of the proposed approximate stream kernel k-means algorithm on (a) CIFAR-10, (b) MNIST, (c) Forest Cover Type, (d) Imagenet-34, (e) Poker, and (f) Network Intrusion. Parameter $m$ represents the initial sample set size, the coreset size and the chunk size for the approximate stream kernel k-means, StreamKM++ and sKKM algorithms, respectively. The parameters $M$ and $\tau$ are set to $M = 20{,}000$ and $\tau = 1$, respectively.

Table 4.2 Effect of the maximum buffer size $M$ on the running time (in milliseconds) of the proposed approximate stream kernel k-means algorithm. Parameter settings: $m = 5{,}000$, $\tau = 1$. Standard deviations are shown in parentheses.

M | 5,000 | 10,000 | 15,000 | 20,000
CIFAR-10 | 9.34 (0.76) | 8.50 (3.33) | 9.57 (2.79) | 7.48 (1.24)
MNIST | 11.05 (2.22) | 10.35 (4.04) | 8.99 (0.41) | 9.94 (1.75)
Forest Cover Type | 7.07 (0.27) | 24.17 (6.69) | 40.65 (12.81) | 58.55 (21.57)
Imagenet-34 | 10.57 (2.62) | 18.77 (4.85) | 48.15 (18.18) | 57.91 (22.20)
Poker | 7.38 (3.56) | 21.06 (9.57) | 44.04 (17.76) | 62.38 (32.31)
Network Intrusion | 12.09 (2.57) | 27.15 (7.07) | 43.05 (15.31) | 161.89 (69.43)

Table 4.3 Effect of the maximum buffer size $M$ on the silhouette coefficient (values ×10⁻²) of the proposed approximate stream kernel k-means algorithm. Parameter settings: $m = 5{,}000$, $\tau = 1$.

M | 5,000 | 10,000 | 15,000 | 20,000
CIFAR-10 | 5.53 (0.12) | 5.63 (0.04) | 6.92 (0.29) | 7.75 (0.26)
MNIST | 80.50 (0.84) | 77.72 (0.66) | 82.19 (1.29) | 82.51 (1.75)

Table 4.4 Effect of the maximum buffer size $M$ on the NMI (in %) of the proposed approximate stream kernel k-means algorithm. Parameter settings: $m = 5{,}000$, $\tau = 1$.

M | 5,000 | 10,000 | 15,000 | 20,000
CIFAR-10 | 6.22 (0.27) | 8.07 (2.73) | 15.49 (0.18) | 15.40 (0.39)
MNIST | 20.15 (0.26) | 29.97 (0.87) | 48.31 (1.50) | 48.31 (1.50)
Forest Cover Type | 0.56 (0.07) | 0.72 (0.05) | 12.19 (0.02) | 14.27 (2.13)
Imagenet-34 | 1.58 (1.27) | 1.73 (1.62) | 6.55 (1.19) | 7.04 (1.24)
Poker | 0.64 (3.45) | 22.54 (2.92) | 39.11 (4.19) | 36.09 (4.94)
Network Intrusion | 13.71 (0.01) | 13.86 (0.40) | 13.75 (0.30) | 14.32 (0.10)

Cluster decay rate $\lambda$, lifetime threshold $\eta$ and number of clusters $C$: The final number of clusters at the end of clustering depends on the ordering of the data set, and on the cluster
τ                   1                2                3                4                5
CIFAR-10            7.48 (±1.24)     9.28 (±1.03)     8.33 (±1.53)     8.54 (±1.66)     9.08 (±1.12)
MNIST               9.94 (±1.75)     9.25 (±0.46)     9.31 (±0.59)     9.42 (±0.61)     10.31 (±1.25)
Forest Cover Type   58.55 (±21.57)   42.80 (±17.26)   48.78 (±20.72)   40.09 (±13.81)   41.88 (±15.90)
Imagenet-34         57.91 (±22.20)   60.25 (±24.43)   55.77 (±26.20)   57.24 (±24.57)   54.98 (±31.10)
Poker               62.38 (±32.31)   44.39 (±16.04)   44.11 (±15.62)   42.65 (±17.48)   43.66 (±16.27)
Network Intrusion   161.89 (±0.69)   164.61 (±0.70)   165.18 (±0.71)   162.36 (±0.68)   163.05 (±0.64)

Table 4.6: Effect of the cluster lifetime threshold $\eta = \exp(-\lambda\tau)$ on the silhouette coefficient ($\times 10^{-2}$) of the proposed approximate stream kernel $k$-means algorithm. Parameters: $m = 5{,}000$, $M = 20{,}000$.

τ          1               2               3               4               5
CIFAR-10   7.75 (±0.26)    7.66 (±0.24)    6.40 (±0.19)    6.35 (±0.20)    6.07 (±0.22)
MNIST      82.51 (±1.75)   82.51 (±1.75)   82.51 (±1.75)   82.51 (±1.75)   82.51 (±1.75)

Table 4.7: Effect of the cluster lifetime threshold $\eta = \exp(-\lambda\tau)$ on the NMI (in %) of the proposed approximate stream kernel $k$-means algorithm. Parameters: $m = 5{,}000$, $M = 20{,}000$.

τ                   1               2               3               4               5
CIFAR-10            15.49 (±0.39)   15.55 (±0.23)   15.41 (±0.33)   15.45 (±0.23)   15.50 (±0.25)
MNIST               48.31 (±1.40)   47.77 (±1.49)   49.45 (±1.48)   45.98 (±1.40)   47.74 (±1.49)
Forest Cover Type   14.27 (±2.13)   12.10 (±0.03)   12.11 (±0.03)   12.10 (±0.03)   12.10 (±0.03)
Imagenet-34         7.04 (±1.24)    7.04 (±1.24)    6.95 (±1.14)    6.95 (±1.14)    7.76 (±1.54)
Poker               36.09 (±4.94)   32.07 (±4.41)   32.07 (±4.41)   36.09 (±4.94)   32.07 (±4.41)
Network Intrusion   14.32 (±0.10)   13.65 (±0.06)   13.65 (±0.06)   13.65 (±0.06)   13.66 (±0.06)

Table 4.8: Comparison of the performance of the approximate stream kernel $k$-means algorithm with importance sampling and Bernoulli sampling.

Data set                      CIFAR-10         MNIST            Forest Cover Type   Imagenet-34        Poker           Network Intrusion
Importance sampling:
  Running time (ms)           7.48 (±1.24)     9.94 (±1.75)     58.55 (±21.57)      57.91 (±22.20)     62.38 (±32.31)  161.89 (±0.69)
  Silhouette (×10⁻²)          7.75 (±0.26)     82.51 (±1.75)    -                   -                  -               -
  NMI (%)                     15.49 (±0.39)    48.31 (±1.40)    14.27 (±2.13)       7.04 (±1.24)       36.09 (±4.94)   14.32 (±0.10)
  Number of points sampled    5,434 (±2,093)   6,136 (±34)      16,561 (±3,710)     14,735 (±1,790)    6,265 (±132)    14,886 (±2,627)
Bernoulli sampling:
  Running time (ms)           2091.50 (±47.34) 2210.77 (±58.05) 1257.03 (±39.33)    3002.45 (±77.97)   86.43 (±1.86)   923.16 (±40.41)
  Silhouette (×10⁻²)          0.72 (±0.01)     8.11 (±0.13)     -                   -                  -               -
  NMI (%)                     11.33 (±4.9)     14.35 (±0.05)    3.93 (±0.7)         4.97 (±0.19)       2.90 (±0.02)    6.50 (±0.15)
  Number of points sampled    31,483 (±717)    23,000 (±203)    407,220 (±5,807)    389,177 (±11,325)  50,000 (±100)   1,711,101 (±44,866)

For instance, when the points in the CIFAR-10 data set were input in their true order (i.e., all images from class $i$ are input before all images from class $j$, $i < j$), the resulting clusters differed from those obtained with a random ordering.

We also applied the proposed algorithm to cluster tweets from Twitter, using the kernel function

$$\kappa(x_a, x_b) = \lambda \exp\left(-(ts_a - ts_b)^2\right) + (1 - \lambda)\,\frac{f_a^\top f_b}{\|f_a\|\,\|f_b\|},$$

where $ts_a$ and $f_a$ denote the timestamp and the tf-idf features of a tweet, represented by data point $x_a$, respectively. The first term in the kernel function ensures that two tweets which were generated in the same time period are likely to be assigned to the same cluster, and the second term ensures that two tweets with similar vocabulary are grouped together. We gave equal importance to both the timestamp and the tf-idf features by setting $\lambda = 0.5$.

[Figure 4.11: Sample tweets from the HTML cluster.]
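To make the tweet kernel above concrete, the following minimal Python sketch computes it for a pair of tweets. The function and variable names, and the choice of timestamp units, are illustrative assumptions rather than the thesis's implementation; only the form of the kernel is taken from the equation above.

```python
import numpy as np

def tweet_kernel(ts_a, f_a, ts_b, f_b, lam=0.5):
    """Kernel between two tweets: temporal proximity plus tf-idf cosine similarity.

    ts_a, ts_b: scalar timestamps (illustrative units, e.g. days);
    f_a, f_b:   tf-idf feature vectors as 1-D numpy arrays (assumed nonzero).
    """
    temporal = np.exp(-(ts_a - ts_b) ** 2)                      # first term
    cosine = f_a.dot(f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b))
    return lam * temporal + (1.0 - lam) * cosine                # lam = 0.5 weighs both equally
```

With $\lambda = 0.5$, two tweets posted far apart in time can still fall in the same cluster if their vocabularies are very similar, and vice versa.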
We set the parameters $m = 5{,}000$, $M = 10{,}000$, $C = 20$, $\lambda = 0.5$, $\eta = \exp(-\lambda) = 0.6$ and $B = 10{,}000$. Our algorithm assigned a cluster label to each tweet in about 200 milliseconds. Treating the hashtags as the ground truth labels⁶, we obtained an average clustering accuracy of 61% in terms of NMI. On the other hand, the StreamKM++ algorithm took about 83 milliseconds per tweet, and achieved an NMI value of 40%, and the sKKM algorithm took about 2 seconds per tweet, and achieved an NMI value of 53%. Figures 4.10 and 4.11 show some sample tweets from the ASP.NET and HTML clusters, respectively. We observed that, by giving equal importance to the timestamp of the tweet and the words in the tweet, we obtain clusters containing tweets that have both temporal proximity and vocabulary similarity. Retweets are always assigned to the same cluster as the original tweet. For example, the tweets about sticky headers are assigned to the HTML cluster, as seen in Figure 4.11.

⁶Although hashtags are prone to error, they are the best indicators of the topic of a tweet. They have been used as topic labels in many other studies including [75, 150].

More recent tweets, rather than old tweets, are stored in memory. Figure 4.12(a) shows the trends of the top five clusters over the month. This coincides well with the true trend of the top topics shown in Figure 4.12(b). We found that the order of popularity of the topic clusters is ASP.NET, HTML, SQL, JavaScript, Perl, C++, Postgresql, Python, GO, PHP, Swift, Scala, Java, Ruby, C#, XML, Erlang, Julia, Objective C and VBScript; while the true order of topic popularity is ASP.NET, HTML, Python, JavaScript, Perl, Java, PHP, Ruby, SQL, C++, Swift, C#, Scala, Postgresql, XML, Erlang, Julia, GO, Objective C, and VBScript.

[Figure 4.12: Trending clusters in Twitter. The horizontal axis represents the timeline in days and the vertical axis represents the percentage ratio of the number of tweets in the cluster to the total number of tweets obtained on the day. Figure (a) shows the trends obtained by the proposed approximate stream kernel $k$-means algorithm, and Figure (b) shows the true trends of the topics.]

4.7 Summary

In this chapter, we have proposed an efficient and effective real-time kernel-based stream clustering algorithm, called approximate stream kernel $k$-means. Experimental results show that the proposed algorithm offers a good trade-off between clustering efficiency and clustering quality. Further, unlike some state-of-the-art kernel-based stream clustering algorithms, the proposed algorithm can control the decay and birth of clusters, thereby dynamically controlling the final number of clusters. The key to the efficiency of the proposed algorithm is the sampling of the streaming data based on their importance, defined in terms of the statistical leverage scores. This allows us to maintain the long-term history of the streaming data and also limit the memory required to store the data. We cater to the drift in the data distribution by placing thresholds on the life of a cluster. We demonstrated empirically that the proposed algorithm can cluster fast streams such as the Twitter stream with limited memory, and achieve higher clustering accuracy than the current stream clustering algorithms.
Chapter 5

Kernel-Based Clustering for Large Number of Clusters

5.1 Introduction

Document and image data sets, containing millions of high-dimensional points, usually belong to a large number of clusters. Finding clusters in such data sets is computationally expensive using kernel-based clustering techniques. Our aim is to speed up kernel-based clustering for data sets with a large number of clusters. In this chapter, we present a variant of the online kernel clustering algorithm discussed in Chapter 4, called the sparse kernel k-means algorithm, which can efficiently cluster large data sets into thousands of clusters, with significantly lower processing and memory requirements, and high clustering accuracy [38, 39].

Approximate kernel clustering algorithms such as approximate spectral clustering [67, 157] and approximate kernel $k$-means (from Chapter 2) reduce the running time of kernel clustering by uniformly sampling an $m$-sized subset of the data, and constructing a low-rank approximate kernel matrix using the sampled data. These approaches reduce the running time complexity of kernel clustering to $O(nmd + nmC)$. Note that the running time increases proportionately with the number of clusters (see Table 5.1). As demonstrated in Chapters 2 and 3, these algorithms take very long to cluster the data set when the number of clusters is in the order of thousands. In addition, the number of samples $m$ required to obtain a good approximation depends on the rank of the kernel matrix, which is in turn dependent on the number of clusters in the data [74]. Clustering data sets with a large number of clusters using these algorithms requires sampling $O(n)$ data points to sufficiently represent all the clusters. This renders the approximate kernel clustering algorithms also non-scalable.

Table 5.1: Complexity of popular partitional clustering algorithms: $n$ and $d$ represent the size and dimensionality of the data respectively, and $C$ represents the number of clusters. Parameter $m > C$ represents the size of the sampled subset for the sampling-based approximate clustering algorithms. $n_{sv}$ represents the number of support vectors. DBSCAN and Canopy algorithms are dependent on user-defined intra-cluster and inter-cluster distance thresholds, so their complexity is not directly dependent on $C$.

Clustering algorithm                       Complexity
k-means [87]                               O(nCd)
DBSCAN [61]                                O(n log(n) d)
Canopy [126]                               O(nCd)
Kernel k-means [72]                        O(n²d + n²C)
Spectral clustering [118]                  O(n²d + n³ + nC²)
Support vector clustering [19]             O(n²d n_sv)
Approximate spectral clustering [67]       O(nmd + nmC)
Approximate kernel k-means [40]            O(nmd + nmC)

The proposed sparse kernel $k$-means algorithm reduces the running time and memory complexity of kernel clustering using two key ideas: (i) kernel approximation using incremental importance sampling, and (ii) kernel sparsity. Importance sampling involves selecting data points based on their novelty, measured in terms of statistical leverage scores [34]. Fewer samples ($m = \Omega(C \log C)$) are required to construct a good kernel approximation using importance sampling than uniform sampling. However, finding the statistical leverage scores for the entire data involves computing the eigenvectors of the full $n \times n$ kernel matrix, which is computationally expensive [56]. We design an efficient online method to sample the data based on their importance, thereby reducing the time required for sampling. We also reduce the complexity of kernel computation and clustering by using sparsification.
We compute the $p$-nearest neighbor graph (where $p$ is a user-defined parameter) for the sampled points and use this sparse kernel matrix to obtain the cluster centers. Clustering is performed efficiently by first projecting the data into a subspace spanned by the top eigenvectors of the sparse kernel matrix, and then clustering the projected points using a modified $k$-means algorithm, which uses randomized $kd$-trees [132] to find the nearest cluster center for each data point.

The runtime complexity of the proposed algorithm is linear in $n$ and $d$, and logarithmic in $C$. We show that only a small subset of the data needs to be sampled, thereby reducing the memory requirements. We demonstrate empirically, using several benchmark data sets, that the proposed clustering algorithm is scalable to data sets containing millions of high-dimensional data points, and thousands of clusters.

5.2 Background

Importance sampling: As discussed in Chapter 4, the principle behind importance sampling is to select a subset of the data that is most informative. Let the kernel matrix $K$ be decomposed as $K \simeq V_C \Sigma_C V_C^\top$, where $\Sigma_C = \mathrm{diag}(\lambda_1, \ldots, \lambda_C)$ contains the highest $C$ eigenvalues of $K$ and $V_C = (v_1, \ldots, v_C)$ contains the corresponding eigenvectors. A data point $x_i$ is sampled with probability $p_i$, defined as

$$p_i = \frac{1}{C}\,\left\|V_C^{(i)}\right\|_2^2, \qquad (5.1)$$

where $V_C^{(i)}$ is the $i$-th row of $V_C$. The term $\|V_C^{(i)}\|_2^2$, called the statistical leverage score for data point $x_i$, is an indicator of the importance of the point (a short sketch of this computation is given below). A high score indicates that the corresponding data point has a high influence in the approximation of the kernel matrix. We showed in Lemma 8 that importance sampling reduces the dependency of the number of samples required on the number of clusters significantly, when compared to uniform sampling.

[Figure 5.1: Illustration of kernel sparsity on a two-dimensional synthetic data set containing 1,000 points along 10 concentric circles. Figure (a) shows all the data points (represented by "o") and Figure (b) shows the RBF kernel matrix corresponding to this data. Neighboring points have the same cluster label when the kernel is defined correctly for the data set.]

Kernel sparsity: Another key component of the proposed algorithm is kernel sparsity. The proposed algorithm uses the $p$-nearest neighbors ($p > C$) of each point to construct a sparse kernel matrix. The intuition behind this is the fact that each data point is surrounded by points belonging to the same cluster in the high-dimensional feature space, provided the kernel function is appropriately selected. Figure 5.1 illustrates this concept on the two-dimensional concentric circles data set. The RBF kernel matrix corresponding to this data is shown in Figure 5.1(b). Nearby data points in terms of the kernel similarity tend to have the same cluster label. This idea has been previously applied in several supervised local learning approaches [27]. The local learning-based clustering algorithm [188] and the local spectral clustering algorithm [118] also use the nearest neighbor graphs to obtain the cluster labels for the data. However, these methods require the computation of the full $n \times n$ similarity matrices, rendering them non-scalable.

Finding the nearest neighbors of a data point from amongst $s$ points would require the computation of $O(s)$ similarities. Popular approximate nearest neighbor algorithms adopt one of the following two approaches to find the nearest neighbors efficiently [3]:

- Use hashing techniques such as locality sensitive hashing, which use hash functions to place similar objects in the same bin [100, 198].
- Use data structures like $kd$-trees (also denoted as k-d trees) [131] and its variants like R-trees, R*-trees and metric trees [154] to organize the data according to their similarity and enable efficient querying.
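As a minimal sketch of how the sampling probabilities in (5.1) can be computed for an in-memory kernel matrix, consider the following Python fragment. The use of scipy's `eigsh` is our own assumption for illustration; the thesis's implementation uses MATLAB's `svds` for the same purpose.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def leverage_score_probabilities(K, C):
    """Importance-sampling probabilities from Eq. (5.1).

    K: symmetric (sparse or dense) m x m kernel matrix; C: number of clusters.
    Returns a length-m vector of probabilities p_i = ||V_C^(i)||^2 / C.
    """
    # Top-C eigenpairs of K; 'LA' selects the largest algebraic eigenvalues.
    vals, vecs = eigsh(K, k=C, which='LA')      # vecs: m x C, orthonormal columns
    scores = np.sum(vecs ** 2, axis=1)          # statistical leverage score of row i
    # The scores sum to C (orthonormal columns), so the p_i sum to 1.
    return scores / C
```

Because the leverage scores of the top-$C$ eigenvector matrix always sum to $C$, dividing by $C$ yields a proper probability distribution over the points.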
The randomized $kd$-trees [132] technique for approximate nearest neighbor computation involves constructing multiple $kd$-trees and searching them in parallel. While a classical $kd$-tree is built by splitting the data along the dimensions with the highest variance [131], each randomized $kd$-tree splits the data along a dimension chosen randomly from the top $n_d$ dimensions with the highest variance. A priority queue with information about the distance of each branch to the decision boundary is used to index into the multiple trees. It takes $O(s \log s)$ time to build the trees, and $O(\log s)$ time for each query. Therefore, the time taken for nearest neighbor computation is significantly reduced when a large number of queries need to be performed on the same data set. We employ randomized $kd$-trees in the proposed algorithm to first find the nearest neighbors and build the sparse kernel matrix, and then to find the closest center for each data point during clustering.

The proposed algorithm offers the following advantages over the existing techniques to reduce the running time of kernel-based clustering [40, 42, 67, 157, 188]:

(i) It employs importance sampling, so fewer samples are required to approximate the kernel matrix, when compared to the approximation methods in [40, 67, 157], which employ uniform random sampling.

(ii) Existing approximate kernel clustering algorithms [40, 67, 157] need to perform $O(nm)$ kernel similarity computations, where $m$ is the number of samples. The number of kernel similarity computations performed by the proposed algorithm is $O(np)$, where the number of neighbors $p \ll m$. This also reduces the time and memory required for clustering, compared to the other approximate clustering algorithms.

(iii) The clustering quality is better when compared to the existing approximate kernel clustering methods, even with a relatively small number of samples, because the most informative samples are used to perform clustering.

(iv) It does not require the computation of the full kernel matrix, unlike the local clustering methods in [118] and [188].

(v) It is online in nature, i.e., the data is clustered in batches of a user-defined size $B$, so it can cluster very large data sets (including data streams).

5.3 Sparse Kernel k-means

The proposed sparse kernel $k$-means clustering algorithm is described in Algorithm 9. The algorithm starts with the first $m$ data points stored in a buffer $S$ of a fixed maximum size $M$ ($C < m \le M$), and computes the sparse kernel matrix $K_0 = [\kappa_0(x_i, x_j)]$ from their $p$-nearest neighbors¹, where

$$\kappa_0(x_i, x_j) = \begin{cases} \kappa(x_i, x_j) & \text{if } x_i \in \mathcal{N}(x_j) \text{ and } x_j \in \mathcal{N}(x_i), \\ 0 & \text{otherwise} \end{cases} \qquad (5.2)$$

(a code sketch of this construction appears later in this section). We assume that nearby points in the Hilbert space belong to the same cluster. The kernel function should be appropriately defined for this assumption to be valid. Several articles in the literature describe techniques to learn the kernel function from the data [112, 177, 200].

The remaining data is clustered in batches $\{D_1, D_2, \ldots\}$ of size $B$, where $D_t = \{x_{t_1}, \ldots, x_{t_B}\}$. Let $K_0 = V_C \Sigma_C V_C^\top$, where $\Sigma_C = \mathrm{diag}(\lambda_1, \ldots, \lambda_C)$ contains the top $C$ eigenvalues of $K_0$ and

¹The nearest neighbors are found efficiently using randomized $kd$-trees. We use the kernel function $\kappa(\cdot,\cdot)$ to define the inter-point distance function.

Algorithm 9: Sparse Kernel k-means
1: Input:
   $D = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$: the set of $n$ $d$-dimensional data points to be clustered
   $\kappa(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$: the kernel function
   $C$: the number of clusters
   $m$: minimum number of points to be sampled ($m > C$)
   $p$: number of neighbors for calculating the sparse kernel matrix ($p < m$)
   $M$: maximum buffer size
   $B$: batch size
2: Output: cluster labels for the data points
3: Initialize the buffer $S$ with the first $m$ data points.
4: Find the $p$-nearest neighbors of each point in $S$.
5: Compute the sparse kernel matrix $K_0$ using (5.2).
6: Compute the top $C$ eigenvalues $\Sigma_C$ and eigenvectors $V_C$ of $K_0$, whose rank is greater than $C$.
7: Cluster the data points in $S$ by executing approximate $k$-means (Algorithm 10) on $V_C \Sigma_C^{1/2}$ to obtain their cluster labels.
8: for $t = 1, 2, \ldots, n/B$ do
9:   for $i = 1, 2, \ldots, B$ do
10:     Calculate the probability $p_{t_i}$ using (5.1).
11:     Set $S = S \cup \{x_{t_i}\}$ with probability $p_{t_i}$.
12:     If $x_{t_i}$ was added to $S$ in Step 11, update the eigenvalues $\Sigma_C$ and eigenvectors $V_C$ using (5.8), and recluster the points in $S$ by executing the approximate $k$-means algorithm (Algorithm 10) on $V_C \Sigma_C^{1/2}$; otherwise assign $x_{t_i}$ to cluster $k^*$, where $k^* = \arg\min_{k \in [C]} \|c_k(\cdot) - g_t(\cdot)\|_{\mathcal{H}}^2$, $c_k(\cdot)$ is given by (5.6), and $g_t(\cdot)$ is the projection of $\kappa(x_{t_i}, \cdot)$ into the subspace spanned by the eigenvectors $V_C$.
13:     If $\mathrm{card}(S) > M$, find index $q = \arg\min_l \|V_C^{(l)}\|_2^2$ and remove data point $x_q$ from $S$.
14:   end for
15: end for

$V_C = (v_1, \ldots, v_C)$ contains the corresponding eigenvectors. The matrices $V_C$ and $\Sigma_C$ are updated using each point $x_{t_i}$ from $D_t$, and the kernel matrix is updated as

$$K_t = \begin{cases} \begin{bmatrix} K_{t-1} & \varphi \\ \varphi^\top & \kappa(x_{t_i}, x_{t_i}) \end{bmatrix} & \text{with probability } p_{t_i}, \\[2mm] K_{t-1} & \text{with probability } 1 - p_{t_i}, \end{cases} \qquad (5.3)$$

where $\varphi$ is a sparse vector defined by $\varphi = [\kappa(x_{t_i}, x_s)]^\top$, $x_s \in \mathcal{N}(x_{t_i}) \cap S$, and $p_{t_i}$ is the importance sampling probability defined in (5.1). Data point $x_{t_i}$ is added to $S$ with probability $p_{t_i}$. The cluster labels for the points in $S$ can be obtained by solving the kernel $k$-means problem

$$\max_{U \in \mathcal{P}} \mathrm{tr}\left(\widetilde{U} K_t \widetilde{U}^\top\right), \qquad (5.4)$$

where $U = (u_1, \ldots, u_C)^\top$ is the cluster membership matrix, $\widetilde{U} = [\mathrm{diag}(U\mathbf{1})]^{-1/2} U$, the domain $\mathcal{P} = \{U \in \{0,1\}^{C \times s} : U^\top \mathbf{1} = \mathbf{1}\}$, $s = \mathrm{card}(S)$, and $\mathbf{1}$ is a vector of all ones. The cluster labels for the unsampled points can be obtained by assigning them to the closest center. The running time complexity of this step is $O(s^2)$. We further reduce this complexity by constraining the cluster centers to a smaller subspace, spanning the top eigenvectors of the kernel matrix $K_t$, along the lines of spectral clustering². We pose the clustering problem as the following optimization problem:

$$\min_{U \in \mathcal{P}} \max_{\{c_k(\cdot) \in \mathcal{H}_a\}_{k=1}^{C}} \sum_{k=1}^{C} \sum_{i=1}^{s} \frac{U_{k,i}}{s} \left\| c_k(\cdot) - \kappa(x_i, \cdot) \right\|_{\mathcal{H}}^2, \qquad (5.5)$$

where $\mathcal{H}_a = \mathrm{span}(v_1, \ldots, v_C)$. The cluster centers can be expressed as linear combinations of the eigenvectors of the kernel matrix:

$$c_k(\cdot) = \sum_{i=1}^{s} \sum_{j=1}^{C} \frac{U_{k,i}}{n_k} \sqrt{\lambda_j}\, v_{i,j} = \frac{u_k}{n_k} V_C \Sigma_C^{1/2}, \qquad k \in [C], \qquad (5.6)$$

where $n_k$ is the number of points in the $k$-th cluster, and $u_k = (U_{k,1}, U_{k,2}, \ldots, U_{k,s})^\top$. By substituting (5.6) in (5.5), we obtain the following trace maximization problem:

$$\max_{U \in \mathcal{P}} \mathrm{tr}\left(\widetilde{U} V_C \Sigma_C V_C^\top \widetilde{U}^\top\right). \qquad (5.7)$$

²Note that the eigenvalues and eigenvectors were computed while finding the sampling probabilities (5.1), hence the eigenvectors do not need to be re-computed for clustering.

The above problem can be solved by executing $k$-means on the matrix $V_C \Sigma_C^{1/2}$. The complexity of running $k$-means on $V_C \Sigma_C^{1/2}$ would be $O(sC^2)$, which can again be computationally expensive for large $C$.

We alleviate this issue by employing an approximate variant of the $k$-means algorithm (Algorithm 10), similar to the filtering algorithm in [91]. The most computationally expensive step in the $k$-means algorithm is computing the closest center for each data point, which requires $O(sC)$ distance computations. We reduce the number of distance computations by using randomized $kd$-trees to find the closest cluster centers.

The proposed sparse kernel $k$-means algorithm is dependent on three parameters: the initial sample size $m$, the maximum buffer size $M$, and the number of neighbors $p$ used to build the sparse kernel matrix. The parameters $m$ and $M$ should be set such that the initial and final sample sets contain representatives from all the clusters. The parameter $p$ should be set large enough to ensure that the kernel matrix remains positive semi-definite and its rank is greater than the number of clusters $C$. Heuristics to set these parameters are discussed further in Section 5.5.
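As an illustration of the sparse kernel construction in (5.2) and of the role of the neighborhood parameter $p$, the following sketch builds the mutual $p$-nearest-neighbor RBF kernel for an in-memory sample. Using scikit-learn's exact `NearestNeighbors` in place of the FLANN randomized $kd$-trees is an assumption made here for brevity; the thesis's implementation performs approximate neighbor search.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def sparse_rbf_kernel(X, p, gamma=1.0):
    """Mutual p-nearest-neighbor sparse RBF kernel, in the spirit of Eq. (5.2).

    X: (m, d) array of sampled points. Returns an m x m sparse matrix whose
    entry (i, j) is exp(-gamma * ||x_i - x_j||^2) only if i and j are
    mutual p-nearest neighbors, and 0 otherwise.
    """
    m = X.shape[0]
    nn = NearestNeighbors(n_neighbors=p + 1).fit(X)  # +1: each point is its own neighbor
    dist, idx = nn.kneighbors(X)
    rows = np.repeat(np.arange(m), p)
    cols = idx[:, 1:].ravel()                        # drop the self-neighbor column
    vals = np.exp(-gamma * dist[:, 1:].ravel() ** 2) # RBF on the Euclidean distance
    G = csr_matrix((vals, (rows, cols)), shape=(m, m))
    # Element-wise minimum with the transpose keeps (i, j) only when both
    # directions exist (values are positive), i.e., the neighbors are mutual.
    return G.minimum(G.T)
```

The returned matrix is symmetric by construction, which matches the mutual-neighbor condition in (5.2).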
Algorithm 10: Approximate k-means
1: Input:
   $D = \{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$: the set of $n$ $d$-dimensional data points to be clustered
   $C$: the number of clusters
   MAXITER: maximum number of iterations
2: Output: cluster labels for the data points
3: Randomly initialize the cluster labels $\{l_1, l_2, \ldots, l_n\}$, $l_i \in [C]$.
4: Compute the cluster centers $c_k = \frac{1}{n_k}\sum_{l_i = k} x_i$, $k \in [C]$.
5: Set $t = 0$.
6: repeat
7:   Set $t = t + 1$.
8:   Build randomized $kd$-tree index $I$ for the $C$ centers [132].
9:   for $i = 1, 2, \ldots, n$ do
10:     Find the approximate nearest center $c_{k^*}$ of data point $x_i$ using the index $I$.
11:     Set $l_i = k^*$.
12:   end for
13:   Recompute the centers $\{c_1, c_2, \ldots, c_C\}$.
14: until the labels do not change or $t > $ MAXITER

(A Python sketch of this procedure follows the complexity analysis below.)

5.4 Analysis

5.4.1 Computational Complexity

The most computationally intensive operations in the proposed algorithm are: (i) computing the $m \times m$ kernel matrix $K_0$ (Step 5), and finding its eigenvectors to obtain the leverage scores (Step 6), and (ii) updating the eigenvectors in each iteration, and clustering them using the approximate $k$-means algorithm (Step 12). In order to obtain the eigenvalues and eigenvectors of an $s \times s$ kernel matrix $K_t$ (where $s$ is the number of data points in the buffer $S$), we need to perform eigendecomposition of $K_t$. Naive implementations of eigendecomposition take $O(s^3)$ time. We can reduce the time for computing the eigenvectors by making two modifications to the algorithm:

(i) Use efficient algorithms such as Lanczos, subspace iteration, and trace minimization methods to decompose the $m \times m$ kernel matrix $K_0$ obtained from the first $m$ points [21]. This reduces the running time complexity of this step to $O(mp + m)$. In our implementation, we used the svds function in MATLAB to obtain the top $C$ eigenvalues and eigenvectors for the kernel matrix corresponding to the first $m$ data points.

(ii) Update the eigenvectors $V_C$ incrementally in each iteration of the algorithm using fast update mechanisms [31, 175], to reduce the time taken to process the points in each batch. Using the rank-1 update mechanism proposed in [31], we update the eigenvectors in $O(sp + p^3)$ time, where $s$ is the number of points in the buffer $S$. Given the eigendecomposition $K_t = V_C \Sigma_C V_C^\top$ and vector $\varphi \in \mathbb{R}^m$, this method finds the eigendecomposition of $K_t + \varphi\varphi^\top$ as

$$K_t + \varphi\varphi^\top = \begin{pmatrix} V & \frac{w}{\|w\|} \end{pmatrix} \Sigma' \begin{pmatrix} V & \frac{w}{\|w\|} \end{pmatrix}^\top, \qquad (5.8)$$

where $w = (I - VV^\top)\varphi$ is the component of $K_t$ that is orthogonal to $V$, and $\Sigma'$ contains the dominant eigenvalues of the sparse matrix

$$\begin{bmatrix} \Sigma_C & V^\top\varphi \\ \varphi^\top V & \|w\| \end{bmatrix}.$$

This method also eliminates the need to store the kernel matrix $K_t$ in memory. After the matrix $K_0$ and its eigenvectors are obtained, only the vector $\varphi$ in (5.3) is required to update $V_C$ and $\Sigma_C$.

The approximate $k$-means algorithm first builds multiple randomized $kd$-trees containing the $C$ cluster centers, and an index into these trees, which takes $O(C \log C)$ time. It then finds the approximate nearest neighbors for each data point in $S$ in $O(s \log C)$ time, with an $\epsilon$ approximation error. Therefore, the total time for clustering $s$ points using the approximate $k$-means algorithm is $O(Cl \log C + sl \log C)$, where $l$ is the number of iterations required for convergence. Clustering is performed every time a point is added to the sample set $S$ from the input batch of data points. In order to further reduce the running time, we can employ a lazy reclustering approach, by which we perform the clustering after every $T$ data point additions. Each unsampled data point can be assigned a cluster label by finding the closest center in $O(\log C)$ time.
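A compact Python sketch of Algorithm 10 is given below. Here scipy's exact `cKDTree` stands in for the randomized $kd$-tree index of [132], so the nearest-center queries are exact rather than approximate; the re-seeding of empty clusters is also our own illustrative choice.

```python
import numpy as np
from scipy.spatial import cKDTree

def approximate_kmeans(Y, C, max_iter=100, seed=0):
    """Sketch of Algorithm 10, run on the spectral embedding Y = V_C Sigma_C^{1/2}.

    Y: (n, C) array of embedded points; returns cluster labels in [0, C).
    """
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    labels = rng.integers(0, C, size=n)                 # Step 3: random initialization
    for _ in range(max_iter):
        # Steps 4/13: recompute centers; empty clusters get a random point.
        centers = np.vstack([Y[labels == k].mean(axis=0) if np.any(labels == k)
                             else Y[rng.integers(n)] for k in range(C)])
        tree = cKDTree(centers)                         # Step 8: index the C centers
        _, new_labels = tree.query(Y)                   # Steps 9-11: closest center
        if np.array_equal(new_labels, labels):          # Step 14: labels unchanged
            break
        labels = new_labels
    return labels
```

Building the tree over the $C$ centers and querying it for all points is what replaces the $O(sC)$ brute-force distance computations with roughly $O(s \log C)$ work per iteration.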
In summary, the overall running time complexity of the proposed sparse kernel $k$-means algorithm is $O(npd + mp + m + QCl\log C + QMl\log C + n\log C)$, where $Q$ is the total number of points sampled from the stream. We demonstrate in Section 5.5 that the number of points $Q$ is close to the initial sample size $m$. Therefore, the running time complexity can be simplified as $O(npd + mp + m + mCl\log C + mMl\log C + n\log C) \sim O(npd + n\log C)$, assuming $\max(mp, mCl, mMl) \ll n$. Therefore, the proposed algorithm has running time complexity linear in $n$, linear in $d$, and logarithmic in $C$. It is significantly faster than the kernel $k$-means algorithm and the approximate kernel clustering algorithms, which have $O(n^2 d + n^2 C)$ and $O(nmd + nmC)$ running time complexities, respectively. The amount of memory required is $O(mp + Md + MC)$, for storing the initial kernel matrix $K_0$, the data points in the buffer, and the eigenvectors of the kernel matrix.

5.4.2 Approximation Error

The proposed sparse kernel $k$-means algorithm essentially approximates the eigenvectors of the true $n \times n$ kernel matrix with the singular vectors of a sparse $n \times Q$ matrix, where $Q$ is the total number of points sampled from the data set using importance sampling. In this section, we first bound this approximation error (due to importance sampling and sparsification), and then bound the error incurred due to the approximation (5.5) for clustering.

Theorem 7. Let $K$ be the $n \times n$ kernel matrix and let $\bar{K}$ be the $n \times Q$ kernel matrix between the $n$ points in the data set and the $Q$ sampled points. Let $Z_C = (z_1, \ldots, z_C)$ represent the top $C$ singular vectors of $\bar{K}$, and $\delta \in (0,1)$ be the smallest probability such that $(\lambda_C - \lambda_{C+1}) > 3\Delta$, where

$$\Delta \le \frac{2\rho}{Q}\ln\frac{2}{\delta} + |K|_F\sqrt{\frac{2\ln(2/\delta)}{Qn}} \quad \text{and} \quad \rho^2 = \max_{1\le i\le Q}\sum_{j=1}^{n}\kappa^2(x_i, x_j).$$

Assuming $\rho = O(\sqrt{Q})$ and $\kappa(\cdot,\cdot) \le 1$,

$$\max_{1\le i\le C} |v_i - z_i|^2 \le \frac{9\Delta}{2(\lambda_C - \lambda_{C+1})}, \qquad (5.9)$$

with probability $1 - \delta$.

Proof. We will first establish a relationship between the singular vectors of the sparse kernel matrix that is constructed by the proposed algorithm and the $n \times n$ kernel matrix $K^{sp} = \left[K^{sp}_{i,j}\right]_{n \times n}$ defined as follows:

$$K^{sp}_{i,j} = \begin{cases} \kappa(x_i, x_j) & \text{if } x_i \in \mathcal{N}(x_j) \text{ and } x_j \in \mathcal{N}(x_i), \\ 0 & \text{otherwise,} \end{cases}$$

and then use the fact that $|K^{sp}|_F \le |K|_F$ to obtain the required result. Let $\widehat{Z} = (\widehat{z}_1, \ldots, \widehat{z}_n)$ represent the eigenvectors of $K^{sp}$, $X = (\widehat{z}_1, \ldots, \widehat{z}_C)$, and $Y = (\widehat{z}_{C+1}, \ldots, \widehat{z}_n)$. Let $L_n$ be a linear operator that maps any function $f(\cdot)$ to a function $L_n[f](\cdot) \in \mathcal{H}$ defined by

$$L_n[f](\cdot) = \frac{1}{n}\sum_{i=1}^{n} \kappa(x_i, \cdot) f(x_i). \qquad (5.10)$$

The eigenfunctions [187] of $L_n$, which form the basis of the space $\mathcal{H}$, are given by

$$\widehat{\varphi}_i(\cdot) = \frac{1}{\sqrt{\lambda_i}\, n}\sum_{j=1}^{n} \widehat{z}_{i,j}\, \kappa(x_j, \cdot). \qquad (5.11)$$

Similar to $L_n$, let $L_Q$ represent the linear operator based on the sampled examples, defined by

$$L_Q[f](\cdot) = \frac{1}{Q}\sum_{i=1}^{Q} \kappa(x_i, \cdot) f(x_i). \qquad (5.12)$$

We first prove a simpler result that establishes a relationship between the subspaces $X$ and $V_C$, in the following lemma:

Lemma 10. Let $\delta \in (0,1)$ be the smallest probability such that $(\lambda_C - \lambda_{C+1}) > 3\Delta$, where $\Delta$ is defined as

$$\Delta = \frac{2\rho}{Q}\ln\frac{2}{\delta} + |K^{sp}|_F\sqrt{\frac{2\ln(2/\delta)}{Qn}} \le \frac{2\rho}{Q}\ln\frac{2}{\delta} + |K|_F\sqrt{\frac{2\ln(2/\delta)}{Qn}}.$$

There exists a matrix $P \in \mathbb{R}^{(n-C)\times C}$ satisfying $\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1}}$, such that $V_C = (X + YP)(I + P^\top P)^{-1/2}$.

Proof. The proof is based on the following results (Lemmas 11 and 12) from [166] and [165], respectively:

Lemma 11.
Let $(\lambda_i, v_i)$, $i \in [n]$, be the eigenvalues and eigenvectors of a symmetric matrix $A \in \mathbb{R}^{n\times n}$ ranked in the descending order of eigenvalues. Set $X = (v_1, \ldots, v_C)$ and $Y = (v_{C+1}, \ldots, v_n)$. Given a symmetric perturbation matrix $E$, let

$$(X, Y)^\top E\, (X, Y) = \begin{pmatrix} E_{11} & E_{12} \\ E_{21} & E_{22} \end{pmatrix}.$$

Let $\|\cdot\|$ represent a consistent family of norms and set

$$\gamma = \|E_{21}\|, \qquad \beta = \lambda_C - \lambda_{C+1} - \|E_{11}\| - \|E_{22}\|.$$

If $\beta > 0$ and $\frac{\gamma}{\beta} < \frac{1}{2}$, then there exists a unique matrix $P \in \mathbb{R}^{(n-C)\times C}$ satisfying $\|P\| < \frac{2\gamma}{\beta}$, such that

$$X' = (X + YP)(I + P^\top P)^{-1/2}, \qquad Y' = (Y - XP^\top)(I + PP^\top)^{-1/2}$$

are the eigenvectors of $A + E$.

Lemma 12. Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable on $(Z, \varrho)$ with values in $\mathcal{H}$. Assume $\|\xi\| \le M < \infty$ almost surely. Denote $\sigma^2(\xi) = E(\|\xi\|^2)$. Let $\{z_i\}_{i=1}^{Q}$ be independent random draws of $\varrho$. For any $0 < \delta < 1$, with confidence $1 - \delta$,

$$\left\| \frac{1}{Q}\sum_{i=1}^{Q}\left(\xi_i - E[\xi_i]\right) \right\| \le \frac{2M\ln(2/\delta)}{Q} + \sqrt{\frac{2\sigma^2(\xi)\ln(2/\delta)}{Q}}. \qquad (5.13)$$

Let $\Delta_C = \lambda_C - \lambda_{C+1}$. Define $A = [\langle \kappa(x_i,\cdot), L_n \kappa(x_j,\cdot)\rangle_{\mathcal{H}}]_{n\times n}$, $B = [\langle \kappa(x_i,\cdot), L_Q \kappa(x_j,\cdot)\rangle_{\mathcal{H}}]_{n\times n}$, and $E = B - A$. We have

$$\gamma = \|X^\top E Y\|_F, \qquad \beta = \Delta_C - \|X^\top E X\|_F - \|Y^\top E Y\|_F.$$

Using the relationship $\widehat{\varphi}_i = \sqrt{\frac{1}{\lambda_i n}}\sum_{k=1}^{n} \widehat{z}_{i,k}\, \kappa(x_k,\cdot)$, $i = 1, \ldots, n$, we have

$$[X^\top E Y]_{i,j} = \widehat{z}_i^\top E\, \widehat{z}_j = \sum_{a,b=1}^{n} \widehat{z}_{a,i}\, \widehat{z}_{b,j}\, \langle \kappa(x_a,\cdot), (L_n - L_Q)\kappa(x_b,\cdot)\rangle_{\mathcal{H}} = \sqrt{\lambda_i \lambda_j}\, \langle \widehat{\varphi}_i, (L_n - L_Q)\widehat{\varphi}_j\rangle_{\mathcal{H}} = \langle \widehat{\varphi}_i, L_n^{1/2}(L_n - L_Q)L_n^{1/2}\widehat{\varphi}_j\rangle_{\mathcal{H}}.$$

We have similar results for $X^\top E X$ and $Y^\top E Y$. Thus, we obtain $\gamma$ and $\beta$ as

$$\gamma = \sqrt{\sum_{i=1}^{C}\sum_{j=C+1}^{n} \langle \widehat{\varphi}_i, (L_n^2 - L_n^{1/2} L_Q L_n^{1/2})\widehat{\varphi}_j\rangle_{\mathcal{H}}^2} \;\le\; \|L_n^{1/2}(L_n - L_Q)L_n^{1/2}\|_F,$$

$$\beta = \Delta_C - \sqrt{\sum_{i,j=1}^{C} \langle \widehat{\varphi}_i, (L_n^2 - L_n^{1/2} L_Q L_n^{1/2})\widehat{\varphi}_i\rangle_{\mathcal{H}}^2} - \sqrt{\sum_{i,j=C+1}^{n} \langle \widehat{\varphi}_i, (L_n^2 - L_n^{1/2} L_Q L_n^{1/2})\widehat{\varphi}_i\rangle_{\mathcal{H}}^2} \;\ge\; \Delta_C - \|L_n^{1/2}(L_n - L_Q)L_n^{1/2}\|_F.$$

We substitute these bounds for $\gamma$ and $\beta$ into Lemma 11 to obtain

$$\|P\|_F \le \frac{2\,\|L_n^2 - L_n^{1/2} L_Q L_n^{1/2}\|_F}{\lambda_C - \lambda_{C+1} - \|L_n^2 - L_n^{1/2} L_Q L_n^{1/2}\|_F}.$$

We now bound $\|L_n^2 - L_n^{1/2} L_Q L_n^{1/2}\|_F$ using Lemma 12. Let $\Xi_i[f](\cdot) = \kappa(x_i,\cdot) f(x_i)$ and $\xi_i = L_n^{1/2}\,\Xi_i\, L_n^{1/2}$. We define $M$ and $\sigma^2$ as $M = \max_{1\le i\le n} \|\xi_i\|_F$ and $\sigma^2 = E_i[\|\xi_i\|_F^2]$. We have $M \le \|L_n\|_2\,\|\Xi_i\|_F = \rho$ and

$$\sigma^2 = E\left[\sum_{k=1}^{n} \langle \widehat{\varphi}_k, \xi_i^2\, \widehat{\varphi}_k\rangle_{\mathcal{H}}\right] = E\left[\sum_{k=1}^{n} \langle \widehat{\varphi}_k, L_n^{1/2}\Xi_i L_n \Xi_i L_n^{1/2}\widehat{\varphi}_k\rangle_{\mathcal{H}}\right] \le E\left[\langle \kappa(x_i,\cdot), L_n\kappa(x_i,\cdot)\rangle_{\mathcal{H}} \sum_{k=1}^{n} \langle \widehat{\varphi}_k, L_n^{1/2}\Xi_i L_n^{1/2}\widehat{\varphi}_k\rangle_{\mathcal{H}}\right] \le \frac{\rho^2}{n}\, E\left[\sum_{k=1}^{n} \langle \widehat{\varphi}_k, L_n^{1/2}\Xi_i L_n^{1/2}\widehat{\varphi}_k\rangle_{\mathcal{H}}\right] \le \frac{\rho^2}{n}\,\|L_n\|_F^2 \le \frac{\rho^2\, |K^{sp}|_F^2}{n^2} \le \frac{\rho^2\, |K|_F^2}{n^2}.$$

We complete the proof by substituting the bounds for $M$ and $\sigma^2$ into Lemma 12.

Now we prove Theorem 7 using the result of Lemma 10. We have

$$\max_{1\le i\le C} |v_i - z_i|^2 = \|V_C - X\|_2 \le \|YP(I + P^\top P)^{-1/2}\|_2 + \|(I - (I + P^\top P)^{-1/2})X\|_2 \le \|Y\|_2 \|P\|_2 + \|I - (I + P^\top P)^{-1/2}\|_2 \le \|P\|_F + 1 - \frac{1}{\sqrt{1 + \|P\|_F^2}} \le \|P\|_F + 1 - \sqrt{1 - \|P\|_F^2} \le \|P\|_F + \|P\|_F^2 \le \frac{3}{2}\|P\|_F.$$

We obtain the required result using the fact that

$$\|P\|_F \le \frac{2\Delta}{\lambda_C - \lambda_{C+1} - \Delta} \le \frac{3\Delta}{\lambda_C - \lambda_{C+1}}.$$

We complete the proof by using the fact that $\kappa(\cdot,\cdot) \le 1$ to obtain the relation $\Delta = O\left(\frac{1}{\sqrt{Q}} + \frac{1}{\sqrt{n}}\right)$ when $\rho = O(\sqrt{Q})$.

In the following lemma, we show that the error incurred due to the approximation (5.5) is well-bounded, provided that the tail of the eigenspectrum is fast decaying, which is true for most real data sets [45]:

Lemma 13. Let $E$ and $E_a$ represent the optimal clustering errors in (5.4) and (5.7), respectively. We have

$$|E - E_a| \le \sum_{i=C+1}^{s} \lambda_i.$$

Proof.
Let $\{c^*_k(\cdot)\}_{k=1}^{C}$ and $U^*$ be the optimal solution to (5.4). Let $c^a_k(\cdot)$ represent the projection of $c^*_k$ into the subspace $\mathcal{H}_a$. For any $\kappa(x_i,\cdot)$, let $g_i(\cdot)$ and $h_i(\cdot)$ be the projections of $\kappa(x_i,\cdot)$ into the subspace $\mathcal{H}_a$ and $\mathrm{span}(v_{C+1}, \ldots, v_s)$, respectively. We have

$$E_a = \min_{U\in\mathcal{P}} \max_{c_k(\cdot)\in\mathcal{H}_a} \sum_{k=1}^{C}\sum_{i=1}^{s} \frac{U_{k,i}}{s}\,\|c_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2 \le \sum_{k=1}^{C}\sum_{i=1}^{s} \frac{U^*_{k,i}}{s}\,\|c^a_k(\cdot) - \kappa(x_i,\cdot)\|_{\mathcal{H}}^2 \le \sum_{k=1}^{C}\sum_{i=1}^{s} \frac{U^*_{k,i}}{s}\left(\|c^a_k(\cdot) - g_i(\cdot)\|_{\mathcal{H}}^2 + \|h_i(\cdot)\|_{\mathcal{H}}^2\right) \le E + \frac{1}{s}\sum_{i=1}^{s}\|h_i(\cdot)\|_{\mathcal{H}}^2 \le E + \sum_{i=C+1}^{s}\lambda_i.$$

5.5 Experimental Results

5.5.1 Data sets

We demonstrate the effectiveness of the proposed sparse kernel $k$-means algorithm using the CIFAR-100, Imagenet-164, Youtube and Tiny data sets.

5.5.2 Baselines and Parameters

We compared the performance of the proposed algorithm with the kernel $k$-means [72] algorithm on the CIFAR-100 data set. It is infeasible to execute the kernel $k$-means algorithm on the other three data sets. We also evaluated its performance against the $k$-means algorithm. We show that although the proposed algorithm has a higher running time than $k$-means, it yields better clustering accuracy. Finally, we compared our algorithm with the approximate kernel $k$-means algorithm from Chapter 2, where the data is sampled with uniform probability, and a low-rank approximate kernel is constructed using the sampled data. We show that importance sampling and kernel sparsity play a significant role in reducing the time and memory requirements.

We used the universal RBF kernel for the proposed algorithm and the kernel-based baseline algorithms (kernel $k$-means and approximate kernel $k$-means) on the CIFAR-100, Tiny and Imagenet-164 data sets. For the Youtube data set, which contains both text and image features, we used a combination of the cosine similarity and the RBF kernel, defined as

$$\kappa(x_a, x_b) = \frac{1}{2}\left(\exp\left(-\|g_a - g_b\|^2\right) + \frac{f_a^\top f_b}{\|f_a\|\,\|f_b\|}\right),$$

where $f_a$ and $g_a$ denote the tf-idf and GIST features for data point $x_a$, respectively. We tuned the kernel width for the RBF kernel using grid search in the range $[0, 1]$ to obtain the best performance.

We varied the initial sample set size from $m = 5{,}000$ to $m = 20{,}000$, and the number of neighbors from $p = 1{,}000$ to $m$ in multiples of 5,000. The maximum sample set size was set to $M = 50{,}000$. The number of clusters $C$ was set equal to the true number of classes in the data set for the CIFAR-100 and Imagenet-164 data sets. The true number of classes is unknown for the Youtube and Tiny data sets, so we set the number of clusters equal to 10,000. The batch size $B$ was set equal to the initial sample size $m$.

We implemented all the algorithms in MATLAB, and executed them 10 times each on a 2.8 GHz processor. The memory used was constrained to 60 GB. We present the results (mean and variance) over the 10 runs. Different permutations of the data set were input to the clustering algorithms in each run. We used the randomized $kd$-trees implementation in the FLANN library [132] to find the approximate nearest neighbors in the proposed algorithm. The distance function used by the library was defined as the inverse of the kernel similarity function. The randomized $kd$-tree parameters were set as follows: the number of dimensions $n_d$ to 5, the number of trees to 8, and the approximation error to $\epsilon = 1\mathrm{e}{-16}$.
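For concreteness, a direct transcription of the combined text-and-image kernel above might look as follows. The explicit `width` parameter is our own addition to reflect the grid-searched RBF kernel width mentioned above, and the variable names are illustrative.

```python
import numpy as np

def combined_kernel(g_a, f_a, g_b, f_b, width=1.0):
    """Average of an RBF kernel on GIST features (g_*) and the cosine
    similarity of tf-idf features (f_*), as in the equation above."""
    rbf = np.exp(-np.linalg.norm(g_a - g_b) ** 2 / width)   # image-feature term
    cosine = f_a.dot(f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b))
    return 0.5 * (rbf + cosine)                             # equal weighting
```

Averaging the two terms keeps the combined similarity bounded when both base kernels are bounded, which is convenient for the sparsified kernel construction used in this chapter.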
Table 5.2: Running time (in seconds) of the proposed sparse kernel $k$-means and the three baseline algorithms on the four data sets. The parameters of the proposed algorithm were set to $m = 20{,}000$, $M = 50{,}000$, and $p = 1{,}000$. The sample size $m$ for the approximate kernel $k$-means algorithm was set equal to 20,000 for the CIFAR-100 data set and 10,000 for the remaining data sets. It is not feasible to execute kernel $k$-means on the Imagenet-164, Youtube and Tiny data sets due to their large size. The approximate running time of kernel $k$-means on these data sets is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.

Data set       Sparse kernel k-means (proposed)   Approx. kernel k-means   k-means            Kernel k-means
CIFAR-100      49,887 (±93)                       11,394 (±600)            1,507 (±332)       117,513 (±211)
Imagenet-164   74,794 (±870)                      16,023 (±3,577)          240,722 (±5,351)   182,311 (±14,916)
Youtube        217,533 (±1,264)                   57,096 (±2,196)          145,039 (±1,436)   679,061 (±2,284)
Tiny           343,560 (±2,528)                   371,004 (±1,588)         359,291 (±7,045)   704,656 (±8,482)

5.5.3 Results

5.5.3.1 Running time

Table 5.2 compares the running time of our algorithm with the approximate kernel $k$-means, kernel $k$-means and $k$-means algorithms, when the parameters $m$ and $p$ are equal to 20,000 and 1,000, respectively. On the CIFAR-100 data set, the proposed algorithm takes longer than the $k$-means algorithm, as expected, because of the additional time required for kernel computation and eigensystem calculation. It also takes longer than the approximate kernel $k$-means algorithm, as it performs importance sampling by calculating and updating the eigenvectors of the sparse kernel matrix. On the other hand, the approximate kernel $k$-means algorithm selects the subset of the data using uniform random sampling, and computes the cluster centers using the low-rank matrix constructed from this subset. The proposed algorithm, the approximate kernel $k$-means, and the $k$-means algorithms are significantly faster than the kernel $k$-means algorithm. The proposed algorithm spends more time in updating the eigenvectors and finding the leverage scores than clustering the eigenvectors to obtain the cluster labels. Similar performance is observed on the Imagenet-164, Youtube and Tiny data sets. The proposed algorithm is also faster than $k$-means on the Imagenet-164 data set, because $k$-means takes longer to converge. It is infeasible to compute the full kernel matrix for the Imagenet-164, Youtube and Tiny data sets, so we were unable to execute kernel $k$-means on them. For these data sets, we executed kernel $k$-means on a 50,000-sized randomly selected subset of the data, and assigned the remaining points to the closest cluster centers. The proposed algorithm is also faster than this implementation of kernel $k$-means, because it takes a long time to find the distance between the data points and the cluster centers, and assign labels. The proposed algorithm is also more accurate than this kernel $k$-means implementation on the Imagenet-164 data set.

[Figure 5.2: Sample images from three of the 100 clusters in the CIFAR-100 data set obtained using the proposed algorithm.]
5.5.3.2 Cluster quality

Figure 5.2 shows examples of clusters obtained, using the sparse kernel $k$-means algorithm, from the CIFAR-100 data set. We assigned a class label to each cluster, based on the true class of the majority of the objects in the cluster. Table 5.3 records the silhouette coefficient values of the partitions of the CIFAR-100 data set. The sparse kernel $k$-means algorithm achieves values closer to that of the kernel $k$-means algorithm. The approximate kernel $k$-means and $k$-means algorithms are unable to achieve similar silhouette values.

Table 5.3: Silhouette coefficient ($\times 10^{-2}$) of the proposed sparse kernel $k$-means and the three baseline algorithms on the CIFAR-100 data set. The parameters of the proposed algorithm were set to $m = 20{,}000$, $M = 50{,}000$, and $p = 1{,}000$. The sample size $m$ for the approximate kernel $k$-means algorithm was set equal to 20,000.

Sparse kernel k-means (proposed)   Approx. kernel k-means   k-means        Kernel k-means
11.36 (±0.07)                      2.33 (±0.02)             3.02 (±0.01)   30.18 (±0.13)

We analyze the prediction accuracy, in terms of NMI, of the proposed sparse kernel $k$-means using the CIFAR-100 and Imagenet-164 data sets. As the true class labels for the Youtube and Tiny data sets are not available, we were unable to find the NMI for these data sets. Figure 5.3 shows the NMI values with respect to the true class labels, for each of the algorithms on the CIFAR-100 and Imagenet-164 data sets. In Figure 5.3(a), it is observed that the NMI achieved by our algorithm is close to that of the kernel $k$-means algorithm. The proposed algorithm outperforms both $k$-means and approximate kernel $k$-means on both the CIFAR-100 and Imagenet-164 data sets, due to the fact that it samples the most informative points from the data set.

[Figure 5.3: NMI (in %) of the proposed sparse kernel $k$-means and the three baseline algorithms on the (a) CIFAR-100 and (b) Imagenet-164 data sets. Parameter settings as in Table 5.2. It is not feasible to execute kernel $k$-means on the Imagenet-164 data set due to its large size; its approximate NMI value is obtained by first executing the algorithm on a randomly chosen subset of 50,000 data points to find the cluster centers, and then assigning the remaining points to the closest cluster center.]

5.5.3.3 Parameter sensitivity

Our sparse kernel $k$-means algorithm relies on three parameters: the initial sample set size $m$, the maximum size of the sample set $M$, and the size of the neighborhood $p$. We evaluated the effect of each of these parameters on the performance of the proposed algorithm, using the CIFAR-100 and Imagenet-164 data sets.

Initial sample: The initial sample used to construct the kernel $K_0$ and obtain the initial cluster labels plays a crucial role in the performance of our algorithm, as shown in Table 5.4, Table 5.5 and Figure 5.4. They compare the performance of the proposed algorithm and the approximate kernel $k$-means algorithm with increasing $m$ value. As expected, the running time of both the algorithms increases as the initial sample size increases from $m = 5{,}000$ to $m = 20{,}000$. As $m$ increases, the size of the initial kernel $K_0$, and the time to compute and decompose it into its eigenvalues and eigenvectors, increase proportionately. The initial sample also determines the number of points sampled from the data set, as each input batch is processed. More data points were sampled and added to the buffer $S$ if the initial sample did not contain a sufficient number of representative points.
The time to cluster increases as more points are added to the buffer. The silhouette coefficient values on the CIFAR-100 data set decrease minimally when $m$ increases from 5,000 to 10,000, but remain constant for $m \ge 10{,}000$. On the other hand, there is minimal change in the silhouette values of the approximate kernel $k$-means algorithm for increasing $m$. The NMI values achieved by our algorithm increase considerably as the sample size $m$ increases, indicating that the initial sample is important to the clustering accuracy of the proposed algorithm. Even with just 5,000 data points in the initial sample, our algorithm is able to achieve 13% NMI. On the other hand, the approximate kernel $k$-means algorithm is unable to achieve the same with even 20,000 samples. The performance of the sparse kernel $k$-means algorithm is best when the sample size is set greater than $C \log C$, in accordance with Lemma 8.

Table 5.4: Comparison of the running time (in seconds) of the proposed sparse kernel $k$-means algorithm and the approximate kernel $k$-means algorithm on the CIFAR-100 and the Imagenet-164 data sets. Parameter $m$ represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel $k$-means algorithm. The remaining parameters of the proposed algorithm are set to $M = 50{,}000$ and $p = 1{,}000$. Approximate kernel $k$-means is infeasible for the Imagenet-164 data set when $m > 10{,}000$ due to its large size.

m        CIFAR-100: Sparse / Approx. kernel k-means   Imagenet-164: Sparse / Approx. kernel k-means
5,000    6,192 (±424) / 1,693 (±339)                  24,029 (±4,469) / 15,691 (±3,786)
10,000   18,256 (±21) / 4,134 (±549)                  36,669 (±603) / 16,023 (±3,577)
15,000   34,192 (±2,652) / 7,856 (±929)               53,142 (±3,058) / -
20,000   49,887 (±93) / 11,394 (±600)                 74,794 (±870) / -

Table 5.5: Comparison of the silhouette coefficient ($\times 10^{-2}$) of the proposed sparse kernel $k$-means algorithm and the approximate kernel $k$-means algorithm on the CIFAR-100 data set. Parameter $m$ represents the initial sample set size for the proposed algorithm, and the size of the sampled subset for the approximate kernel $k$-means algorithm. The remaining parameters of the proposed algorithm were set to $M = 50{,}000$ and $p = 1{,}000$.

m                                  5,000           10,000          15,000          20,000
Sparse kernel k-means (proposed)   19.42 (±0.12)   11.77 (±0.04)   11.67 (±0.06)   11.36 (±0.07)
Approx. kernel k-means             2.45 (±0.03)    2.37 (±0.02)    2.45 (±0.02)    2.33 (±0.02)

[Figure 5.4: Comparison of the NMI (in %) of the proposed sparse kernel $k$-means algorithm and the approximate kernel $k$-means algorithm on the (a) CIFAR-100 and (b) Imagenet-164 data sets. Parameter settings as in Table 5.4. Approximate kernel $k$-means is infeasible for the Imagenet-164 data set when $m > 10{,}000$ due to its large size.]

Maximum sample size: In our experiments, we set the maximum sample size to 50,000. We found that this parameter is not as critical as the initial sample, provided that it is set large enough to accommodate a sufficiently representative sample. On both the CIFAR-100 and Imagenet-164 data sets, the number of points added to the buffer ranges from 100 to 500, on an average. The number of points added decreases as the initial sample size $m$ increases from 5,000 to 20,000.

Table 5.6: Effect of the size of the neighborhood $p$ on the running time (in seconds), the silhouette coefficient and NMI (in %) of the proposed sparse kernel $k$-means algorithm on the CIFAR-100 and Imagenet-164 data sets. The remaining parameters of the proposed algorithm were set to $m = 20{,}000$ and $M = 50{,}000$.
p        CIFAR-100: Running time / Silhouette (×10⁻²) / NMI      Imagenet-164: Running time / NMI
1,000    49,887 (±93) / 11.36 (±0.07) / 12.23 (±2.3)             74,794 (±870) / 16.15 (±0.004)
5,000    52,073 (±483) / 11.25 (±0.06) / 12.09 (±0.02)           82,880 (±21,360) / 17.58 (±0.10)
10,000   54,205 (±874) / 12.27 (±0.12) / 13.86 (±0.07)           192,725 (±3,874) / 18.01 (±0.07)
15,000   55,062 (±837) / 11.32 (±0.09) / 14.00 (±0.01)           247,911 (±7,789) / 18.23 (±0.004)

For instance, on the CIFAR-100 data set, when $m = 5{,}000$, 453 additional points were added to the buffer. When $m = 20{,}000$, only 69 points were added.

Size of the neighborhood: The number of neighbors $p$ used to construct the sparse kernel similarity is also important to the performance of the proposed sparse kernel $k$-means algorithm. Table 5.6 shows how the running time, the silhouette coefficient values, and the NMI values on the CIFAR-100 and Imagenet-164 data sets are affected as the value of $p$ increased from 1,000 to 15,000, with the initial sample size $m$ fixed at 20,000. The running time doubles when $p$ increases from $p = 1{,}000$ to $p = 15{,}000$, on both the data sets. This is due to the fact that a larger number of similarity computations need to be performed as the value of $p$ increases. However, although there is a small increase in the silhouette coefficient and NMI values, the increase is not significant enough to justify the increase in the running time. We conclude that the neighborhood size is an important parameter in determining the efficiency of the algorithm.

Number of clusters: We show the effect of varying the number of clusters $C$ on the performance of the proposed algorithm in Figures 5.5 and 5.6. The running time of the algorithm increases with the number of clusters. However, unlike many other clustering algorithms, including the approximate kernel $k$-means, RFF, SV and approximate stream kernel $k$-means clustering algorithms presented in the earlier chapters, the running time of the sparse kernel $k$-means algorithm increases almost logarithmically with the number of clusters, on most data sets. The NMI values achieved by our algorithm also increase as the number of clusters increases. We note that the NMI values of the proposed algorithm are better than those achieved by the baseline algorithms, on both the CIFAR-100 and Imagenet-164 data sets, for all values of $C$.

[Figure 5.5: Effect of the number of clusters $C$ on the running time (in seconds) of the proposed sparse kernel $k$-means algorithm, on the (a) CIFAR-100, (b) Imagenet-164, (c) Youtube and (d) Tiny data sets.]

[Figure 5.6: Effect of the number of clusters $C$ on the NMI (in %) of the proposed sparse kernel $k$-means algorithm, on the (a) CIFAR-100 and (b) Imagenet-164 data sets.]

5.5.3.4 Scalability

We varied the number of points, the dimensionality and the number of clusters in the concentric circles data set, and executed our algorithm on these data sets to examine its scalability. We used the RBF kernel to compute the inter-point similarity. The algorithm parameters $m$, $p$ and $M$ were set to $m = 1{,}000$, $p = 100$ and $M = 20{,}000$ respectively, and the data was input in batches of 10.

[Figure 5.7: Running time of the sparse kernel $k$-means clustering algorithm for different values of (a) $n$, (b) $d$ and (c) $C$.]
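A minimal generator for such a synthetic concentric-circles set is sketched below. The choice of radii, the jitter magnitude, and the padding of extra noise dimensions to vary $d$ are our own illustrative assumptions, not the thesis's generator.

```python
import numpy as np

def concentric_circles(n=1000, C=10, d=2, jitter=0.05, seed=0):
    """n points spread evenly over C concentric circles (cf. Figure 5.1),
    optionally embedded in d > 2 dimensions by appending noise coordinates."""
    rng = np.random.default_rng(seed)
    per = n // C
    points, labels = [], []
    for k in range(C):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=per)
        radius = (k + 1) + jitter * rng.standard_normal(per)  # circle k has radius k+1
        points.append(np.column_stack([radius * np.cos(theta),
                                       radius * np.sin(theta)]))
        labels.append(np.full(per, k))
    X = np.vstack(points)
    if d > 2:                                                 # pad to d dimensions
        X = np.hstack([X, jitter * rng.standard_normal((X.shape[0], d - 2))])
    return X, np.concatenate(labels)
```

Because the circles are not linearly separable, this set exercises exactly the regime where an RBF kernel succeeds and linear $k$-means fails, while $n$, $d$, and $C$ can each be scaled independently.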
Figures 5.7(a), 5.7(b) and 5.7(c) show that the proposed algorithm is linearly scalable with respect to the size and dimensionality of the data set, and almost logarithmically scalable with respect to the number of clusters, in accordance with the complexity analysis in Section 5.4.1. In Figure 5.7(a), the size of the data set is varied from $n = 100$ to $n = 10^7$, while the dimensionality and the number of clusters are fixed at $d = 100$ and $C = 10$. The running time of the proposed algorithm increases linearly with $n$. Figure 5.7(b) shows the running time of the proposed algorithm as the dimensionality of the data varies between $d = 10$ and $d = 1{,}000$, while $n = 10^6$ and $C = 10$. Finally, the running time of our algorithm increases logarithmically as the number of clusters increases from $C = 10$ to $C = 1{,}000$, with $n = 10^6$ and $d = 100$, as shown in Figure 5.7(c).

5.6 Summary

In this chapter, we have proposed the sparse kernel $k$-means clustering algorithm, which can efficiently cluster large high-dimensional data sets into a large number of clusters. By sampling the data points based on their novelty, defined in terms of the statistical leverage scores, we only store the most informative points in the data, thereby limiting the memory requirements. We need to compute the kernel similarity of the data points only with respect to these sampled points, thus reducing the running time complexity. We further reduce the running time complexity by introducing sparsity into the kernel, based on the assumption that the kernel function is appropriately defined, and nearby points in the kernel space have similar labels. We demonstrated that the proposed algorithm is scalable and accurate using several large benchmark data sets.

Chapter 6

Summary and Future Work

As the amount of digital data continues to grow at a rapid rate, continued efforts to design and develop scalable and efficient algorithms to organize this data and extract useful information from it are essential. We have focused on the unsupervised learning task of clustering in this thesis. While linear clustering algorithms (e.g., $k$-means) are fast and scalable, they are incapable of finding the underlying clusters in real-world data sets with high accuracy. On the other hand, kernel-based clustering algorithms are accurate, but are not scalable to big data sets. We have proposed a number of kernel-based clustering algorithms which are not only scalable to data sets containing billions of data points, but also achieve cluster quality comparable to that of the existing kernel-based clustering algorithms. The proposed algorithms are primarily based on randomly sampling the data sets and finding the clusters using fast iterative optimization techniques. The main contribution of this thesis is the design of approximate algorithms for the advancement of the scalability of kernel-based clustering, while maintaining the cluster quality, and demonstrating the performance of the proposed algorithms on diverse data sets.

6.1 Contributions

The approximate batch kernel clustering algorithms proposed in Chapters 2 and 3 make the following contributions:

- The approximate kernel $k$-means algorithm in Chapter 2 demonstrates that, by using uniform random sampling, kernel-based clustering can be performed in $O(nmC + nmd)$ time, where $n$ is the size of the given data set, $d$ is its dimensionality, $C$ is the number of clusters, and $m$ is the number of samples from the data set ($m \ll n$). This running time complexity is significantly smaller than the $O(n^2 C + n^2 d)$ complexity of classical kernel-based clustering algorithms, given that $m \ll n$.

- In contrast to the approximate kernel $k$-means algorithm, which decomposes the kernel matrix into its low-rank components, the RFF and SV kernel clustering algorithms, introduced in Chapter 3, factor the kernel function using the Fourier transform, and project the data into a low-dimensional space spanned by the Fourier components.
The RFF and SV clustering algorithms have $O(nm\log(d) + nmC)$ and $O(nm\log(d) + nC^2)$ running time complexities, respectively, where $m \ll n$ is the number of Fourier components. Both algorithms perform well on large high-dimensional data sets. The SV clustering algorithm is faster than the RFF and approximate kernel $k$-means algorithms when the number of clusters $C$ is small (less than 100), with a minimal loss in cluster quality.

- The error incurred by the approximate kernel $k$-means algorithm due to sampling is $O(1/m)$, which implies that the error reduces linearly as the number of data points sampled from the data set increases. Similarly, the error incurred by the RFF and SV clustering algorithms reduces at the rate of $O(1/\sqrt{m})$ and $O(1/m)$, respectively, where $m$ represents the number of Fourier components used for projection.

- The best clustering quality is achieved by these approximate algorithms when the number of samples (or the number of Fourier components) $m$ is significantly greater than $C$, and the eigenvalues of the kernel matrix have a long-tailed distribution.

- The proposed algorithms achieve clustering quality similar to the kernel $k$-means and spectral clustering algorithms on large benchmark text and image data sets, containing up to 10 million data points, with significantly lower running time.

The online kernel clustering algorithms proposed in Chapters 4 and 5 make the following contributions:

- Data streams often contain an unbounded number of data points, so it is impossible to store the entire data set in memory. It is also difficult to uniformly sample streaming data sets because of their arbitrary size. The approximate stream kernel $k$-means algorithm, introduced in Chapter 4, relies on importance sampling, and thereby uses only the most informative data points in the stream to perform clustering. Importance sampling is inherently a complex procedure because it requires the eigendecomposition of the kernel matrix. By devising an efficient online method to perform importance sampling, we have reduced its running time complexity. The approximate stream kernel $k$-means algorithm can cluster large stream data sets in $O(nd + nC)$ time.

- We have demonstrated the performance of the approximate stream kernel $k$-means on the Twitter stream. It can also be applied to find clusters in financial data, climate data, click-streams, etc.

- When the number of clusters $C$ in the data set is large (in the order of tens of thousands), the existing kernel clustering algorithms have long running times as a result of their linear running time complexity with respect to $C$. By using importance sampling to sample the data set, and inducing sparsity into the kernel matrix constructed from the sampled data points, the sparse kernel $k$-means algorithm, introduced in Chapter 5, reduces this time complexity to $O(nd + n\log C)$, with $O(1/\sqrt{m})$ approximation error, where $m$ is the number of points sampled from the data set.

- We have demonstrated the scalability of the sparse kernel $k$-means algorithm on large heterogeneous data sets such as the Tiny image data set and the Youtube data set (text and image), containing millions of data points with up to 10,000 clusters.

- The loss in the clustering quality by the approximate stream kernel $k$-means and the sparse kernel $k$-means algorithms is minimal when compared to the batch kernel $k$-means clustering algorithm.

The crux of the proposed algorithms is to randomly sample the large data sets and thereby reduce the number of similarity computations required to construct the kernel matrix and cluster the data. The sample size and the sampling strategy play a crucial role in the performance of the algorithms.
While the proposed batch clustering algorithms select the samples uniformly from the given data set, the online algorithms employ the more sophisticated importance sampling strategy. The importance sampling technique reduces the total number of samples required because it chooses the data points intelligently, based on the data distribution.

6.2 Future Work

Kernel-based clustering research presented in this dissertation can be further advanced as follows:

Parallelization. In contrast to linear clustering algorithms, kernel-based clustering algorithms need the computation of the kernel matrix, due to which they are more difficult to parallelize than linear clustering algorithms. The approximate kernel-based clustering algorithms presented in this thesis are easier to parallelize than classical kernel-based clustering algorithms. Unlike parallel versions of the classical kernel-based clustering algorithms, which require all the data to be replicated in all the nodes, the approximate kernel-based clustering algorithms require only the sampled data points to be replicated. This reduces the amount of memory required and the communication cost. We have demonstrated how the approximate kernel $k$-means algorithm can be executed on a distributed framework in Chapter 2. The RFF and SV clustering algorithms can be similarly parallelized. However, the remaining approximate kernel-based clustering algorithms proposed in this thesis rely on the eigenvectors of the approximate kernel matrices, and need effective online parallel techniques for eigenvector updates. Parallelization of these algorithms can aid in their deployment to large-scale computing frameworks.

Kernel selection. As demonstrated in Chapter 1, the kernel function used to define the inter-point similarity plays a crucial role in the efficiency and accuracy of kernel clustering. Employing the wrong kernel for clustering can adversely affect the cluster quality, and can result in clustering quality worse than that of linear clustering algorithms. However, choosing the correct kernel, and selecting the kernel parameters, is a challenging task. Although a few algorithms have been proposed to learn the kernel from the data in an unsupervised manner, these algorithms have high running time complexity, resulting in their non-scalability. More scalable techniques have been developed to learn the kernel in the supervised and semi-supervised settings, but obtaining the labels for large data sets is expensive and often impossible. Development of scalable unsupervised kernel learning algorithms is a potential direction for future work.

Overlapping clusters. In applications such as user community detection in social networks, users often belong to more than one community, causing the clusters to overlap with each other. Very few efforts have been made to find such overlapping clusters using kernel-based clustering techniques. Fuzzy kernel clustering techniques only compute the probability that a data point belongs to a cluster, and do not deterministically find the cluster memberships. More concrete scalable techniques need to be developed to find overlapping clusters in data.

BIBLIOGRAPHY

[1] Data Analytics. http://searchdatamanagement.techtarget.com/definition/data-analytics, Jan 2008.
[2] Big Data in 2020. http://www.emc.com/collateral/analyst-reports/idc-digital-universe-2014.pdf, Dec 2012. IDC and EMC Corp Report.
[3] M. R. Abbasifard, B. Ghahremani, and H. Naderi. A survey on nearest neighbor search methods. International Journal of Computer Applications, 95(25):39–52, 2014.
[4] M. E. Abbasnejad, D. Ramachandram, and R. Mandava. A survey of the state of the art in learning the kernels. Knowledge and Information Systems, 31(2):193–221, 2012.
[5] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM, 54(2), 2007.
[6] M. R. Ackermann, M. Märtens, C. Raupach, K. Swierkot, C. Lammersen, and C. Sohler.
StreamKM++: A clustering algorithm for data streams. Journal of Experimental Algorithmics, 17:1–30, 2012.
[7] R. H. Affandi, A. Kulesza, E. Fox, and B. Taskar. Nystrom approximation for large-scale determinantal processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 85–98, 2013.
[8] C. C. Aggarwal. A survey of stream clustering algorithms. In Data Clustering: Algorithms and Applications, pages 231–258. 2013.
[9] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for projected clustering of high dimensional data streams. In Proceedings of the International Conference on Very Large Data Bases, pages 852–863, 2004.
[10] N. Ailon, R. Jaiswal, and C. Monteleoni. Streaming k-means approximation. In Proceedings of the Conference on Neural Information Processing Systems, pages 10–18, 2009.
[11] C. Alzate and J. A. K. Suykens. Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(2):335–347, 2010.
[12] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[13] K. Bache and M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.
[14] A. Barla, F. Odone, and A. Verri. Histogram intersection kernel for image classification. In Proceedings of the International Conference on Image Processing, volume 3, pages 513–516, 2003.
[15] O. Beaumont, H. Larchevêque, and L. Marchal. Non-linear divisible loads: There is no free lunch. In Proceedings of the International Parallel and Distributed Processing Symposium, pages 1–10, 2012.
[16] H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on Twitter. In Proceedings of the International AAAI Conference on Weblogs and Social Media, pages 438–441, 2011.
[17] M. A. Belabbas and P. J. Wolfe. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences, 106(2):369–374, 2009.
[18] S. Belongie, C. Fowlkes, F. Chung, and J. Malik. Spectral partitioning with indefinite kernels using the Nystrom extension. In Proceedings of the European Conference on Computer Vision, pages 531–542, 2002.
[19] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik. Support vector clustering. The Journal of Machine Learning Research, 2:125–137, 2002.
[20] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[21] M. W. Berry. Large-scale sparse singular value computations. International Journal of Supercomputer Applications, 6(1):13–49, 1992.
[22] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. ACM Transactions on Mathematical Software, 31(2):252–269, 2005.
[23] J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–152, 1999.
[24] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[25] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In Proceedings of the Conference on Neural Information Processing Systems, pages 135–143, 2009.
[26] S. Bochner and K. Chandrasekharan. Fourier Transforms. Princeton University Press, 1949.
[27] L. Bottou and V. Vapnik. Local learning algorithms. Neural Computation, 4(6):888–900, 1992.
[28] C. Boutsidis, M. W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, pages 968–977, 2009.
[29]D.C.Brabham. Crowdsourcing .MITPress,2013. [30]P.S.Bradley,U.M.Fayyad,andC.Reina.Scalingcluste ringalgorithmstolargedatabases. In ProceedingsoftheInternationalConferenceonKnowledgeD iscoveryandDataMining , pages9Œ15,1998. [31]M.Brand.Fastlow-rankmodicationsofthethinsingul arvaluedecomposition. Linear AlgebraanditsApplications ,415(1):20Œ30,2006. [32]F.Cao,M.Ester,W.Qian,andA.Zhou.Density-basedclu steringoveranevolvingdata streamwithnoise.In ProceedingsoftheSIAMInternationalConferenceonDataMi ning , pages328Œ339,2006. [33]R.Cattral,F.Oppacher,andD.Deugo.Evolutionarydat aminingwithautomaticrulegener- alization. RecentAdvancesinComputers,ComputingandCommunication s ,pages296Œ300, 2002. [34]S.ChatterjeeandA.S.Hadi.Inuentialobservations, highleveragepoints,andoutliersin linearregression. StatisticalScience ,1(3):379Œ393,1986. [35]W.Chen,Y.Song,H.Bai,C.Lin,andE.Y.Chang.Parallel spectralclusteringindistributed systems. IEEETransactionsonPatternAnalysisandMachineIntellig ence ,33(3):568Œ586, 2011. [36]Y.ChenandL.Tu.Density-basedclusteringforreal-ti mestreamdata.In Proceedingsof theInternationalConferenceonKnowledgeDiscoveryandDa taMining ,pages133Œ142, 2007. [37]Y.Cheng.Meanshift,modeseeking,andclustering. IEEETransactionsonPatternAnalysis andMachineIntelligence ,17(8):790Œ799,1995. [38]R.Chitta,A.K.Jain,andR.Jin.Sparsekernelclusteri ngofmassivehigh-dimensionaldata setswithlargenumberofclusters.In ProceedingsofthePhDWorkshopattheInternational ConferenceonInformationandKnowledgeManagement ,2015. [39]R.Chitta,A.K.Jain,andR.Jin.Sparsekernelclusteri ngofmassivehigh-dimensional datasetswithlargenumberofclusters.TechnicalReportMS U-CSE-15-10,Departmentof ComputerScience,MichiganStateUniversity,2015. 182 [40]R.Chitta,R.Jin,T.C.Havens,andA.K.Jain.Approxima tekernelk-means:Solutionto largescalekernelclustering.In ProceedingsoftheInternationalConferenceonKnowledge DiscoveryandDatamining ,pages895Œ903,2011. [41]R.Chitta,R.Jin,T.C.Havens,andA.K.Jain.Scalablek ernelclustering:Approximate kernelk-means. arxivpreprintarXiv:1402.3849 ,2014. [42]R.Chitta,R.Jin,andA.K.Jain.Efcientkernelcluste ringusingrandomfourierfeatures. In ProceedingsoftheInternationalConferenceonDataMining ,pages161Œ170,2012. [43]R.Chitta,R.Jin,andA.K.Jain.Streamclustering:Ef cientkernel-basedapproximation usingimportancesampling.In ProceedingsoftheICDMWorkshoponDataScienceand BigDataAnalytics ,2015. [44]J.ChiuandL.Demanet.Sublinearrandomizedalgorithm sforskeletondecompositions. SIAMJournalonMatrixAnalysisandApplications ,34(3):1361Œ1383,2013. [45]A.Clauset,C.R.Shalizi,andMarkE.J.Newman.Power-l awdistributionsinempirical data. SIAMReview ,51(4):661Œ703,2009. [46]C.Cortes,M.Mohri,andA.Talwalkar.Ontheimpactofke rnelapproximationonlearning accuracy. JournalofMachineLearningResearch ,9:113Œ120,2010. [47]T.F.CoxandM.A.A.Cox. MultidimensionalScaling .CRCPress,2000. [48]A.S.Das,M.Datar,A.Garg,andS.Rajaram.Googlenewsp ersonalization:Scalableonline collaborativeltering.In ProceedingsoftheInternationalConferenceonWorldWideW eb , pages271Œ280,2007. [49]J.Deng,W.Dong,R.Socher,L.J.Li,K.Li,andL.Fei-Fei .Imagenet:Alarge-scale hierarchicalimagedatabase.In ProceedingsoftheIEEEConferenceonComputerVision andPatternRecognition ,pages248Œ255,2009. [50]A.DeshpandeandS.Vempala.Adaptivesamplingandfast low-rankmatrixapproxima- tion.In Approximation,Randomization,andCombinatorialOptimiz ation:Algorithmsand Techniques ,pages292Œ303.2006. 
[51]I.S.Dhillon,Y.Guan,andB.Kulis.Auniedviewofkern elk-means,spectralclustering andgraphcuts.TechnicalReportTR-04-25,DepartmentofCo mputerScience,University ofTexasatAustin,2004. [52]C.Ding,X.He,andH.D.Simon.Ontheequivalenceofnonn egativematrixfactorization andspectralclustering.In ProceedingsoftheSIAMDataMiningConference ,pages606Œ 610,2005. [53]J.A.Doornik.AnimprovedZigguratmethodtogeneraten ormalrandomsamples. Univer- sityofOxford ,2005. 183 [54]P.Drineas,R.Kannan,andM.W.Mahoney.FastMonte-Car loalgorithmsformatricesII: Computingalow-rankapproximationtoamatrix. SIAMJournalonComputing ,36(1):158Œ 183,2006. [55]P.Drineas,R.Kannan,andM.W.Mahoney.FastMonte-Car loalgorithmsformatricesIII: Computingacompressedapproximatematrixdecomposition. SIAMJournalonComputing , 36(1):184Œ206,2006. [56]P.Drineas,M.Magdon-Ismail,M.W.Mahoney,andD.P.Wo odruff.Fastapproximation ofmatrixcoherenceandstatisticalleverage. TheJournalofMachineLearningResearch , 13(1):3475Œ3506,2012. [57]P.DrineasandM.W.Mahoney.OntheNystrommethodforap proximatingaGrammatrix forimprovedkernel-basedlearning. TheJournalofMachineLearningResearch ,6:2153Œ 2175,2005. [58]C.EckartandG.Young.Theapproximationofonematrixb yanotheroflowerrank. Psy- chometrika ,1(3):211Œ218,1936. [59]R.Edmonds,E.Guskin,A.Mitchell,andM.Jurkowitz.Th eState oftheNewsMedia2013. http://stateofthemedia.org/ 2013/newspapers-stabilizing-but-still-threatened/ newspapers-by-the-numbers ,May2013.PoynterInstituteandPewResearch CenterReport. [60]A.Ene,S.Im,andB.Moseley.FastclusteringusingMapR educe.In Proceedingsofthe InternationalConferenceonKnowledgeDiscoveryandDatam ining ,pages681Œ689,2011. [61]M.Ester,H.P.Kriegel,J.Sander,andX.Xu.Adensity-b asedalgorithmfordiscovering clustersinlargespatialdatabaseswithnoise.In ProceedingsoftheInternationalConference onKnowledgeDiscoveryandDatamining ,pages226Œ231,1996. [62]F.Farnstrom,J.Lewis,andC.Elkan.Scalabilityforcl usteringalgorithmsrevisited. ACM SIGKDDExplorationsNewsletter ,2(1):51Œ57,2000. [63]D.Feldman,M.Schmidt,andC.Sohler.TurningBigdatai ntotinydata:Constant-size coresetsfork-means,PCAandprojectiveclustering.In ProceedingsoftheACM-SIAM SymposiumonDiscreteAlgorithms ,pages1434Œ1453,2013. [64]C.Fellbaum. WordNet:AnElectronicLexicalDatabase .BradfordBooks,1998. [65]M.Filippone,F.Camastra,F.Masulli,andS.Rovetta.A surveyofkernelandspectral methodsforclustering. PatternRecognition ,41(1):176Œ190,2008. [66]G.D.Forney.Generalizedminimumdistancedecoding. IEEETransactionsonInformation Theory ,12(2):125Œ131,1966. 184 [67]C.Fowlkes,S.Belongie,F.Chung,andJ.Malik.Spectra lgroupingusingtheNystrom method. IEEETransactionsonPatternAnalysisandMachineIntellig ence ,pages214Œ225, 2004. [68]C.FraleyandA.E.Raftery.Howmanyclusters?Whichclu steringmethod?Answersvia model-basedclusteranalysis. TheComputerJournal ,41(8):578Œ588,1998. [69]A.Frieze,R.Kannan,andS.Vempala.FastMonte-Carloa lgorithmsforndinglow-rank approximations.In ProceedingsoftheFoundationsofComputerScience ,pages370Œ378, 1998. [70]A.Frieze,R.Kannan,andS.Vempala.FastMonte-Carloa lgorithmsforndinglow-rank approximations. JournaloftheACM ,51(6):1025Œ1041,2004. [71]J.Ginsberg,M.H.Mohebbi,R.S.Patel,L.Brammer,M.S. Smolinski,andL.Brilliant. Detectinginuenzaepidemicsusingsearchenginequerydat a. Nature ,457(7232):1012Œ 1014,2008. [72]M.Girolami.Mercerkernel-basedclusteringinfeatur espace. IEEETransactionsonNeural Networks ,13(3):780Œ784,2002. [73]A.Gittens,P.Kambadur,andC.Boutsidis.Approximate spectralclusteringviarandomized sketching. arXivpreprintarXiv:1311.2854 ,2013. 
[74]A.GittensandM.W.Mahoney.RevisitingtheNystrommet hodforimprovedlarge-scale machinelearning. arXivpreprintarXiv:1303.1849 ,2013. [75]F.Godin,V.Slavkovikj,W.DeNeve,B.Schrauwen,andR. VandeWalle.Usingtopic modelsfortwitterhashtagrecommendation.In ProceedingsoftheInternationalConference onWorldWideWebCompanion ,pages593Œ596,2013. [76]J.C.Gower.Addingapointtovectordiagramsinmultiva riateanalysis. Biometrika , 55(3):582Œ585,1968. [77]J.C.GowerandG.J.S.Ross.Minimumspanningtreesands inglelinkageclusteranalysis. AppliedStatistics ,pages54Œ64,1969. [78]H.P.Graf,E.Cosatto,L.Bottou,I.Dourdanovic,andV. Vapnik.Parallelsupportvector machines:ThecascadeSVM.In ProceedingsoftheConferenceonNeuralInformation ProcessingSystems ,pages521Œ528,2004. [79]S.Guha,A.Meyerson,N.Mishra,R.Motwani,andL.O'Cal laghan.Clusteringdata streams:Theoryandpractice. IEEETransactionsonKnowledgeandDataEngineering , pages515Œ528,2003. [80]S.Guha,R.Rastogi,andK.Shim.CURE:Anefcientclust eringalgorithmforlarge databases. InformationSystems ,26(1):35Œ58,2001. 185 [81]R.Hamid,Y.Xiao,A.Gittens,andD.DeCoste.Compactra ndomfeaturemaps. arXiv preprintarXiv:1312.4626 ,2013. [82]S.Har-PeledandS.Mazumdar.Oncoresetsfork-meansan dk-medianclustering.In ProceedingsoftheACMSymposiumonTheoryofComputing ,pages291Œ300,2004. [83]T.Hastie,R.Tibshirani,andJ.Friedman. TheElementsofStatisticalLearning ,volume2. Springer,2009. [84]T.C.Havens.Approximationofkernelk-meansforstrea mingdata.In Proceedingsofthe InternationalConferenceonPatternRecognition ,pages509Œ512,2012. [85]T.C.HavensandJ.C.Bezdek.Anefcientformulationof theimprovedvisualassess- mentofclustertendency(iVAT)algorithm. IEEETransactionsonKnowledgeandData Engineering ,24(5):813Œ822,2012. [86]A.Jain,Z.Zhang,andE.Y.Chang.Adaptivenon-linearc lusteringindatastreams.In ProceedingsoftheInternationalConferenceonInformatio nandKnowledgeManagement , pages122Œ131,2006. [87]A.K.Jain.Dataclustering:50yearsbeyondk-means. PatternRecognitionLetters , 31(8):651Œ666,2010. [88]A.K.JainandR.C.Dubes. AlgorithmsforClusteringData .Prentice-Hall,Inc.,1988. [89]A.K.Jain,R.P.W.Duin,andJ.Mao.Statisticalpattern recognition:Areview. IEEE TransactionsonPatternAnalysisandMachineIntelligence ,22(1):4Œ37,2000. [90]A.K.Jain,M.N.Murty,andP.J.Flynn.Dataclustering: Areview. ACMComputing Surveys ,31(3):264Œ323,1999. [91]T.Kanungo,D.M.Mount,N.S.Netanyahu,C.D.Piatko,R. Silverman,andA.Y.Wu.An efcientk-meansclusteringalgorithm:Analysisandimple mentation. IEEETransactions onPatternAnalysisandMachineIntelligence ,24(7):881Œ892,2002. [92]P.KarandH.Karnick.Randomfeaturemapsfordotproduc tkernels.In Proceedingsofthe InternationalConferenceonArticialIntelligenceandSt atistics ,pages583Œ591,2012. [93]G.KarypisandV.Kumar.Asoftwarepackageforpartitio ningunstructuredgraphs,parti- tioningmeshes,andcomputingll-reducingorderingsofsp arsematrices.Technicalreport, DepartmentofComputerScience,UniversityofMinnesota,1 998. [94]L.KaufmanandP.J.Rousseeuw. FindingGroupsinData:AnIntroductiontoCluster Analysis .WileyBlackwell,2005. [95]D.W.Kim,K.Y.Lee,D.Lee,andK.H.Lee.Evaluationofth eperformanceofclustering algorithmsinkernel-inducedfeaturespace. PatternRecognition ,38(4):607Œ611,2005. 186 [96]T.Kohonen. Self-organizingMaps .Springer,2001. [97]S.B.Kotsiantis.Supervisedmachinelearning:Arevie wofclassicationtechniques. Infor- matica ,31(3),2007. [98]P.Kranen,I.Assent,C.Baldauf,andT.Seidl.TheClusT ree:Indexingmicro-clustersfor anytimestreammining. KnowledgeandInformationSystems ,29(2):249Œ272,2011. 
[99]A.KrizhevskyandG.Hinton.Learningmultiplelayerso ffeaturesfromtinyimages.Tech- nicalreport,DepartmentofComputerScience,Universityo fToronto,2009. [100]B.KulisandK.Grauman.Kernelizedlocality-sensiti vehashingforscalableimagesearch. In ProceedingsoftheInternationalConferenceonComputerVi sion ,pages2130Œ2137, 2009. [101]A.Kumar,Y.Sabharwal,andS.Sen.Asimplelineartime (1+ )-approximationalgorithm fork-meansclusteringinanydimensions.In ProceedingsoftheIEEESymposiumonFoun- dationsofComputerScience ,pages454Œ462,2004. [102]S.Kumar,M.Mohri,andA.Talwalkar.Onsampling-base dapproximatespectraldecom- position.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages553Œ 560,2009. [103]S.Kumar,M.Mohri,andA.Talwalkar.Samplingtechniq uesfortheNystrommethod.In ProceedingsofConferenceonArticialIntelligenceandSt atistics ,pages304Œ311,2009. [104]T.O.Kvalseth.Entropyandcorrelation:Somecomment s. IEEETransactionsonSystems, ManandCybernetics ,17(3):517Œ519,1987. [105]D.Laney.3Ddatamanagement:Controllingdatavolume ,velocity,andvariety.Technical report,METAGroup,2001. [106]S.Lazebnik,C.Schmid,andJ.Ponce.Beyondbagsoffea tures:Spatialpyramidmatching forrecognizingnaturalscenecategories.In ProceedingsoftheIEEEComputerSocietyCon- ferenceonComputerVisionandPatternRecognition ,volume2,pages2169Œ2178,2006. [107]Q.Le,T.Sarlos,andA.Smola.Fastfood-Approximatin gkernelexpansionsinloglinear time.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages16Œ21, 2013. [108]Y.LeCun,L.Bottou,Y.Bengio,andP.Haffner.Gradien t-basedlearningappliedtodocu- mentrecognition. ProceedingsoftheIEEE ,86(11):2278Œ2324,1998. [109]J.Lee,S.Kim,G.Lebanon,andY.Singer.Locallow-ran kmatrixapproximation.In ProceedingsofInternationalConferenceonMachineLearni ng ,pages82Œ90,2013. [110]F.Li,C.Ionescu,andC.Sminchisescu.RandomFourier approximationsforskewedmulti- plicativehistogramkernels. PatternRecognition ,pages262Œ271,2010. 187 [111]M.Li,J.T.Kwok,andB.Lu.Makinglarge-scaleNystrom approximationpossible.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages631Œ638,2010. [112]B.Liu,S.X.Xia,andY.Zhou.Unsupervisednon-parame trickernellearningalgorithm. Knowledge-BasedSystems ,44:1Œ9,2013. [113]L.L.Liu,X.B.Wen,andX.X.Gao.SegmentationforSARi magebasedonanewspectral clusteringalgorithm. LifeSystemModelingandIntelligentComputing ,pages635Œ643, 2010. [114]R.LiuandH.Zhang.SamplingcriteriafortheNystromm ethod. http://citeseerx. ist.psu.edu/viewdoc/summary?doi=10.1.1.112.6368 . [115]T.Liu,C.Rosenberg,andH.A.Rowley.Clusteringbill ionsofimageswithlargescale nearestneighborsearch.In ProceedingsoftheIEEEWorkshoponApplicationsofCompute r Vision ,pages28Œ33,2007. [116]S.Lloyd.LeastsquaresquantizationinPCM. IEEETransactionsonInformationTheory , 28(2):129Œ137,1982. [117]D.G.Lowe.Distinctiveimagefeaturesfromscale-inv ariantkeypoints. InternationalJour- nalofComputerVision ,60(2):91Œ110,2004. [118]U.Luxburg.Atutorialonspectralclustering. StatisticsandComputing ,17(4):395Œ416, 2007. [119]U.Luxburg. ClusteringStability .NowPublishersInc.,2010. [120]D.MacDonaldandC.Fyfe.Thekernelself-organisingm ap.In ProceedingsoftheIn- ternationalConferenceonKnowledge-BasedIntelligentEn gineeringSystemsandAllied Technologies ,volume1,pages317Œ320,2002. [121]M.Mahajan,P.Nimbhorkar,andK.Varadarajan.Thepla nark-meansproblemisNP-Hard. In ProceedingsoftheInternationalWorkshoponAlgorithmsan dComputation ,pages274Œ 285,2009. [122]M.W.MahoneyandP.Drineas.CURmatrixdecomposition sforimproveddataanalysis. ProceedingsoftheNationalAcademyofSciences ,106(3):697Œ702,2009. 
[123]O.A.MaillardandR.Munos.Compressedleast-squares regression.In Proceedingsofthe ConferenceonNeuralInformationProcessingSystems ,pages1213Œ1221,2009. [124]S.MalinowskiandR.Morla.AsinglepassTrellis-base dalgorithmforclusteringevolving datastreams. DataWarehousingandKnowledgeDiscovery ,pages315Œ326,2012. [125]C.D.Manning,P.Raghavan,andH.Schütze. IntroductiontoInformationRetrieval .Cam- bridgeUniversityPress,2008. 188 [126]A.McCallum,K.Nigam,andL.H.Ungar.Efcientcluste ringofhigh-dimensionaldata setswithapplicationtoreferencematching.In ProceedingsoftheInternationalConference onKnowledgeDiscoveryandDataMining ,pages169Œ178,2000. [127]G.McLachlanandD.Peel. FiniteMixtureModels .JohnWiley&Sons,2004. [128]M.McPherson,L.Smith-Lovin,andJ.M.Cook.Birdsofa feather:Homophilyinsocial networks. AnnualReviewofSociology ,pages415Œ444,2001. [129]A.K.MenonandC.Elkan.Fastalgorithmsforapproxima tingthesingularvaluedecompo- sition. ACMTransactionsonKnowledgeDiscoveryfromData ,5(2):1Œ36,2011. [130]K.Mizumoto,H.Yanagimoto,andM.Yoshioka.Sentimen tanalysisofstockmarketnews withsemi-supervisedlearning.In ProceedingsoftheIEEE/ACISInternationalConference onComputerandInformationScience ,pages325Œ328,2012. [131]A.W.Moore.Anintroductorytutorialonkd-trees.Tec hnicalreport,DepartmentofCom- puterScience,CarnegieMellonUniversity,1991. [132]M.MujaandD.G.Lowe.Scalablenearestneighboralgor ithmsforhighdimensionaldata. IEEETransactionsonPatternAnalysisandMachineIntellig ence ,36(11):2227Œ2240,2014. [133]R.Nallapati,W.Cohen,andJ.Lafferty.Parallelized variationalEMforLatentDirichlet Allocation:Anexperimentalevaluationofspeedandscalab ility. ICDMWorkshoponHigh PerformanceDataMining ,pages349Œ354,2007. [134]O.Nasraoui,C.Cardona,andC.Rojas.Usingretrieval measurestoassesssimilarityinmin- ingdynamicwebclickstreams.In ProceedingsoftheInternationalConferenceonKnowl- edgeDiscoveryinDataMining ,pages439Œ448,2005. [135]D.Newman,A.Asuncion,P.Smyth,andM.Welling.Distr ibutedinferenceforLatent DirichletAllocation.In ProceedingsoftheConferenceonNeuralInformationProces sing Systems ,pages17Œ24,2007. [136]R.T.NgandJ.Han.CLARANS:Amethodforclusteringobj ectsforspatialdatamining. IEEETransactionsonKnowledgeandDataEngineering ,pages1003Œ1016,2002. [137]N.H.Nguyen,P.Drineas,andT.D.Tran.Matrixsparsi cationviatheKhintchineinequal- ity. http://citeseerx.ist.psu.edu/viewdoc/download?doi=1 0.1.1. 164.4755&rep=rep1&type=pdf ,2009. [138]L.Nguyen-Dinh,C.Waldburger,D.Roggen,andG.Tröst er.Tagginghumanactivitiesin videobycrowdsourcing.In ProceedingsoftheConferenceonInternationalConference on MultimediaRetrieval ,pages263Œ270,2013. [139]H.Ning,W.Xu,Y.Chi,Y.Gong,andT.S.Huang.Incremen talspectralclusteringby efcientlyupdatingtheeigen-system. PatternRecognition ,43(1):113Œ127,2010. 189 [140]L.O'Callaghan,N.Mishra,S.Guha,A.M.Meyerson,and R.Motwani.Streaming-data algorithmsforhigh-qualityclustering.In ProceedingsoftheInternationalConferenceon DataEngineering ,pages685Œ695,2002. [141]A.OlivaandA.Torralba.Modelingtheshapeofthescen e:Aholisticrepresentationofthe spatialenvelope. InternationalJournalofComputerVision ,42(3):145Œ175,2001. [142]M.OuimetandY.Bengio.Greedyspectralembedding.In ProceedingsoftheInternational WorkshoponArticialIntelligenceandStatistics ,pages253Œ260,2005. [143]S.Owen,R.Anil,T.Dunning,andE.Friedman. MahoutinAction .ManningPublications Co.,2011. [144]G.Petkos,S.Papadopoulos,andY.Kompatsiaris.Two- levelmessageclusteringfortopic detectioninTwitter.In ProceedingsoftheSNOWDataChallenge ,pages49Œ56,2014. 
[145]A.K.QinandandP.N.Suganthan.Kernelneuralgasalgo rithmswithapplicationtocluster analysis. PatternRecognition ,4:617Œ620,2004. [146]M.RaginskyandS.Lazebnik.Locality-sensitivebina rycodesfromshift-invariantkernels. In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages1509Œ 1517,2009. [147]A.RahimiandB.Recht.Randomfeaturesforlarge-scal ekernelmachines.In Proceedings oftheConferenceonNeuralInformationProcessingSystems ,pages1177Œ1184,2007. [148]C.Ranger,R.Raghuraman,A.Penmetsa,G.Bradski,and C.Kozyrakis.Evaluatingmapre- duceformulti-coreandmultiprocessorsystems.In ProceedingsoftheIEEESymposiumon HighPerformanceComputerArchitecture ,pages13Œ24,2007. [149]P.P.Rodrigues.Learningfromubiquitousdatastream s:Clusteringdataanddatasources. AICommunications ,25(1):69Œ71,2012. [150]K.D.Rosa,R.Shah,B.Lin,A.Gershman,andR.Frederki ng.Topicalclusteringoftweets. In ProceedingsoftheACMSIGIRWorkshoponSocialWebSearchan dMining ,2011. [151]P.J.Rousseeuw.Silhouettes:Agraphicalaidtothein terpretationandvalidationofcluster analysis. JournalofComputationalandAppliedMathematics ,20:53Œ65,1987. [152]W.Rudin. FourierAnalysisonGroups .Wiley-Interscience,1990. [153]T.SakaiandA.Imiya.Fastspectralclusteringwithra ndomprojectionandsampling. Ma- chineLearningandDataMininginPatternRecognition ,pages372Œ384,2009. [154]H.Samet. FoundationsofMultidimensionalandMetricDataStructure s .MorganKauf- mann,2006. 190 [155]T.Sarlos.Improvedapproximationalgorithmsforlar gematricesviarandomprojections.In ProceedingsoftheIEEESymposiumonFoundationsofCompute rScience ,pages143Œ152, 2006. [156]F.Schleif,A.Gisbrecht,andB.Hammer.Accelerating kernelneuralgas.In Proceedingsof theInternationalConferenceonArticialNeuralNetworks andMachineLearning ,pages 150Œ158.2011. [157]F.Schleif,X.Zhu,A.Gisbrecht,andB.Hammer.Fastap proximatedrelationalandkernel clustering.In ProceedingsoftheInternationalConferenceonPatternRec ognition ,pages 1229Œ1232,2012. [158]B.Schölkopf,R.Herbrich,andA.Smola.Ageneralized representertheorem.In Proceed- ingsofComputationalLearningTheory ,pages416Œ426,2001. [159]B.SchölkopfandA.Smola. Learningwithkernels:Supportvectormachines,regulariz a- tion,optimization,andbeyond(Adaptivecomputationandm achinelearning) .TheMIT Press,2001. [160]B.Schölkopf,A.Smola,andK.R.Muller.Nonlinearcom ponentanalysisasakerneleigen- valueproblem. NeuralComputation ,10(5):1299Œ1314,1996. [161]J.ShiandJ.Malik.Normalizedcutsandimagesegmenta tion. IEEETransactionsonPattern AnalysisandMachineIntelligence ,22(8):888Œ905,2002. [162]M.Shindler,A.Wong,andA.W.Meyerson.Fastandaccur atek-meansforlargedatasets. In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages2375Œ 2383,2011. [163]H.D.SimonandH.Zha.Low-rankmatrixapproximationu singtheLanczosbidiagonaliza- tionprocesswithapplications. SIAMJournalonScienticComputing ,21(6):2257Œ2274, 2000. [164]A.Smola,L.Song,andC.H.Teo.Relativenoveltydetec tion.In Proceedingsofthe InternationalConferenceonArticialIntelligenceandSt atistics ,volume5,pages536Œ543, 2009. [165]S.SteveandD.X.Zhou.Geometryonprobabilityspaces . ConstructiveApproximation , 30:311Œ323,2009. [166]G.W.Stewart. MatrixPerturbationTheory .AcademicPress,1990. [167]S.J.Stolfo,W.Fan,W.Lee,A.Prodromidis,andP.K.Ch an.Cost-basedmodelingfor fraudandintrusiondetection:ResultsfromtheJAMproject .In ProceedingsoftheDARPA InformationSurvivabilityConferenceandExposition ,volume2,pages130Œ144,2000. 191 [168]A.StrehlandJ.Ghosh.Clusterensembles-Aknowledge reuseframeworkforcombining multiplepartitions. 
JournalofMachineLearningResearch ,3:583Œ617,2003. [169]Z.SunandG.Fox.StudyonparallelSVMbasedonMapRedu ce.In Proceedingsofthe InternationalConferenceonParallelandDistributedProc essingTechniquesandApplica- tions ,pages495Œ561,2012. [170]A.TalwalkarandA.Rostamizadeh.Matrixcoherencean dtheNystrommethod.In Pro- ceedingsofConferenceonUncertaintyinArticialIntelli gence ,2010. [171]P.Tan,M.Steinbach,andV.Kumar. IntroductiontoDataMining .PearsonEducation,2007. [172]R.Tibshirani,G.Walther,andT.Hastie.Estimatingt henumberofclustersinadatasetvia thegapstatistic. JournaloftheRoyalStatisticalSociety:SeriesB(Statist icalMethodology) , 63(2):411Œ423,2001. [173]A.Torralba,R.Fergus,andW.T.Freeman.80millionti nyimages:Alargedatasetfor nonparametricobjectandscenerecognition. IEEETransactionsonPatternAnalysisand MachineIntelligence ,30(11):1958Œ1970,2008. [174]J.W.Tukey. ExploratoryDataAnalysis .Reading,MA,1977. [175]J.Tzeng.Split-and-combinesingularvaluedecompos itionforlarge-scalematrix. Journal ofAppliedMathematics ,2013. [176]J.K.Uhlmann.Satisfyinggeneralproximity/similar ityquerieswithmetrictrees. Informa- tionProcessingLetters ,40(4):175Œ179,1991. [177]H.ValizadeganandR.Jin.Generalizedmaximummargin clusteringandunsupervisedker- nellearning.In ProceedingsoftheConferenceonNeuralInformationProces singSystems , pages1417Œ1424,2006. [178]A.VedaldiandB.Fulkerson.VLFeat:Anopenandportab lelibraryofcomputervision algorithms. http://www.vlfeat.org ,2008. [179]A.VedaldiandA.Zisserman.Efcientadditivekernel sviaexplicitfeaturemaps. IEEE TransactionsonPatternAnalysisandMachineIntelligence ,34(3):480Œ492,2012. [180]S.Vega-PonsandJ.Ruiz-Schulcloper.Asurveyofclus teringensemblealgorithms. Inter- nationalJournalofPatternRecognitionandArticialInte lligence ,25(3):337Œ372,2011. [181]R.Vidal.Subspaceclustering. IEEESignalProcessingMagazine ,28(2):52Œ68,2011. [182]J.Wang,S.C.H.Hoi,P.Zhao,J.Zhuang,andZ.Liu.Larg escaleonlinekernelclassica- tion.In ProceedingsoftheInternationalJointConferenceonArti cialIntelligence ,pages 1750Œ1756,2013. 192 [183]L.Wang,C.Leckie,R.Kotagiri,andJ.Bezdek.Approxi matepairwiseclusteringforlarge datasetsviasamplingplusextension. PatternRecognition ,44(2):222Œ235,2011. [184]S.WangandZ.Zhang.AscalableCURmatrixdecompositi onalgorithm:Lowertime complexityandtighterbound.In ProceedingsoftheConferenceonNeuralInformation ProcessingSystems ,pages656Œ664,2012. [185]K.Q.Weinberger,M.Slaney,andR.Zwol.Resolvingtag ambiguity.In Proceedingsof ConferenceonMultimedia ,pages111Œ120,2008. [186]J.J.Whang,X.Sui,andI.S.Dhillon.Scalableandmemo ry-efcientclusteringoflarge- scalesocialnetworks.In ProceedingsoftheInternationalConferenceonDataMining , pages705Œ714,2012. [187]C.WilliamsandM.Seeger.UsingtheNystrommethodtos peedupkernelmachines.In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages682Œ688, 2001. [188]M.WuandB.Schölkopf.Alocallearningapproachforcl ustering.In Proceedingsofthe ConferenceonNeuralInformationProcessingSystems ,pages1529Œ1536,2006. [189]Z.Xiaojin.Semi-supervisedlearningliteraturesur vey.TechnicalReport1530,Department ofComputerScience,UniversityofWisconsin-Madison,200 5. [190]L.Xu,J.Neufeld,B.Larson,andD.Schuurmans.Maximu mmarginclustering.In Advances inNeuralInformationProcessingsystems ,pages1537Œ1544,2004. [191]D.Yan,L.Huang,andM.I.Jordan.Fastapproximatespe ctralclustering.In Proceedings oftheInternationalConferenceonKnowledgeDiscoveryand Datamining ,pages907Œ916, 2009. [192]H.Zha,X.He,C.Ding,M.Gu,andH.D.Simon.Spectralre laxationfork-meansclustering. 
In ProceedingsoftheConferenceonNeuralInformationProces singSystems ,pages1057Œ 1064,2001. [193]D.Zhang,S.Chen,andK.Tan.Improvingtherobustness ofonlineagglomerativeclustering methodbasedonkernel-inducedistancemeasures. NeuralProcessingLetters ,21(1):45Œ51, 2005. [194]K.ZhangandJ.T.Kwok.ClusteredNystrommethodforla rgescalemanifoldlearningand dimensionreduction. IEEETransactionsonNeuralNetworks ,21(10):1576Œ1587,2010. [195]K.Zhang,I.W.Tsang,andJ.T.Kwok.ImprovedNystroml ow-rankapproximationand erroranalysis.In ProceedingsoftheInternationalConferenceonMachineLea rning ,pages 1232Œ1239,2008. 193 [196]R.ZhangandA.I.Rudnicky.Alargescaleclusteringsc hemeforkernelk-means. Pattern Recognition ,4:289Œ292,2002. [197]T.Zhang,R.Ramakrishnan,andM.Livny.BIRCH:Anefc ientdataclusteringmethodfor verylargedatabases. ACMSIGMODRecord ,25(2):103Œ114,1996. [198]Y.M.Zhang,K.Huang,G.Geng,andC.Liu.Fast k -nngraphconstructionwithlocality sensitivehashing. MachineLearningandKnowledgeDiscoveryinDatabases ,pages660Œ 674,2013. [199]W.Zhao,H.Ma,andQ.He.Parallelk-meansclusteringb asedonMapReduce. Cloud Computing ,pages674Œ679,2009. [200]J.Zhuang,J.Wang,S.C.H.Hoi,andX.Lan.Unsupervise dmultiplekernellearning. JournalofMachineLearningResearch ,20:129Œ144,2011. 194