LIBRARY
Michigan State University

This is to certify that the dissertation entitled

INCORPORATING BACKGROUND KNOWLEDGE IN DOCUMENT CLUSTERING

presented by

SAMAH JAMAL FODEH

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major Professor's Signature

where I(d_k, l_j) = 1 if the label of d_k equals l_j. NMI [119] measures the normalized mutual information between the cluster assignment and the ground truth of the documents. NMI is an increasingly popular measure of cluster quality:

NMI(X,Y) = I(X,Y) / ((log K + log C) / 2)    (5)

where X and Y are random variables over the cluster assignments and the ground truth labels, K is the number of clusters, and C is the number of labels. The higher the value of NMI, the better the clusters.

4.2 Baseline Clustering Algorithm

We use spherical kmeans as the baseline clustering algorithm to evaluate the proposed methods in this thesis. Spherical kmeans is a popular method for clustering high-dimensional text data, in which the cosine similarity measure is used to compute the angle between document and centroid vectors. The time complexity of the algorithm is linear in the number of documents and clusters.

cosine(X,Y) = (X · Y) / (||X|| ||Y||)    (6)

Despite its wide use, spherical kmeans suffers from the initial centroids problem: the quality of the partitions obtained is highly dependent on the initial centroids. To address this problem, we run spherical kmeans 50 times and report the purity of the solution that best minimizes the objective function.

4.3 Effect of Preprocessing

As mentioned in Chapter 3, the preprocessing steps performed on the documents include stopword removal, noun identification by WordNet, and stemming. Some of the advantages of our preprocessing method are:

• Simplicity. We reduce the number of steps required for preprocessing.
We skip part-of-speech tagging, instead using only stemming and noun identification via WordNet. The WordNet noun identification step is a simple lookup.

• Efficiency. Our approach effectively reduces the size of the feature space; instead of using all words, we use only the nouns identified by WordNet as input to the clustering algorithm.

In this section we conduct an extensive experimental study to show the adequacy of using nouns in document clustering. We generate a baseline clustering solution using spherical kmeans with nouns as features. Our goal is to show that the noun-based approach often yields a more rigorous baseline than using all words. We compare the baseline performances reported in some of the research papers mentioned in Chapter 2 against our simple baseline. In addition, we include the performance of the new methods proposed in the respective papers. We followed the steps of the authors as closely as we could to create an equivalent baseline. Column 3 in Table 10 shows the purity obtained using the clustering methods proposed in those papers. Column 4 shows the reported purity of their baselines. Comparatively, column 5 lists the purity obtained using our proposed baseline with the nouns identified by WordNet. We have also attempted to regenerate the baseline purity values reported in the papers. For the most part, the differences between their reported baselines and our regenerated baselines are small, except for the dataset used by Huang et al [50] and Hu et al [47]. They reported a purity of 0.603 for their baseline, compared to 0.73 produced by our words baseline. We presume these differences come from implementation or other details. For example, different implementations of the Porter stemmer and different stopword lists could produce a different set of features in terms of size and contents. Furthermore, it is not clear that TFIDF normalization was applied to the Reuters2 dataset for the reported baseline.
When we applied the reported method without TFIDF, we obtained results similar to those reported.

The performance of our proposed baseline was significantly better than the reported baselines, as shown in Table 10. This holds across all methods. For example, for the 20NG-100 dataset, Hu et al [49] reported a purity of 0.411 using all words, compared to 0.52 using our approach. The approach by Huang et al [50] and Hu et al [47], whose baseline used stemmed words, had a purity of 0.603 compared to 0.79 using our baseline. More interestingly, the algorithms proposed by some of the authors gave poorer results than those using our nouns baseline. Note that Hotho et al [35] partitioned the Min15-Max100 dataset using bisecting k-means into 60 clusters instead of 46, which is the number of topics in the documents. However, to be consistent with their experimental settings, we clustered the documents into 60 partitions using bisecting k-means. The result was a purity of 0.604 using our baseline, compared to 0.571 using their baseline of stemmed words.

Reference           Data           Method   Reported    Nouns
                                            Baseline    Baseline
Hotho et al*        Min15-Max100   0.61     0.571       0.604
Sedding et al*      Min15-Max100   0.47     0.58        0.604
Deerwester et al    Min15-Max100   0.596
Hu et al.           Min15-Max100   0.65
Huang et al         Min20-Max200   0.678    0.603       0.791
Deerwester et al    Min20-Max200   0.748
Hu X. et al         20NG-100       0.442    0.411       0.52
Slonim and Tishby   Binary1**      0.70     0.62        0.92
                    Binary2**      0.68     0.57        0.894
                    Binary3**      0.75     0.65        0.92
                    Multi51        0.59     0.38        0.924
                    Multi52        0.58     0.36        0.912
                    Multi53        0.53     0.34        0.943
                    Multi101       0.35     0.24        0.742
                    Multi102       0.35     0.27        0.724
                    Multi103       0.35     0.28        0.734

Table 10: Comparison between the nouns baseline (column 5), the reported baseline (column 4), and the clustering methods reported in the literature (column 3) in terms of purity. (*) This result was created using 60 clusters to make an equivalent comparison. (**) This result was created using 4 clusters.

4.3.1 Why Nouns?
Results from the previous subsection suggest that the noun-based clustering approach often produces comparable and sometimes better clusters than using all words, but with a significantly reduced number of features. This section provides further evidence of the importance of nouns in capturing the underlying structure of the data and its clusters.

4.3.1.1 Nouns and Centroids Analysis

In this experiment, we investigate the role of nouns in forming the different topics in a collection of documents. Our intuition was that nouns are important in a cluster if they contribute significantly to the weights of its corresponding centroid when clustering using all the words. Our strategy was to show the importance of the nouns by examining the centroids of the clusters. Specifically, we select the top 20 words with the highest weights in each cluster centroid and compute the fraction of nouns among those top 20 words. The higher the fraction, the more important the nouns are in defining the main theme of the cluster. We used Min20-Max200 in this experiment. This dataset has 30 classes. Out of the 7170 words, about 64% are nouns. In Table 11, the first and third columns list the clusters, each labeled with the class that has the majority of documents in that cluster. The second and fourth columns show the corresponding fraction of nouns in the top 20 words of the cluster centroid.

Class        %Nouns    Class            %Nouns
alum         0.75      livestock        0.90
bop          0.75      money-supply     0.80
cocoa        0.85      nat-gas          0.70
coffee       0.80      orange           0.75
copper       0.85      pet-chem         0.80
cotton       0.90      reserves         0.70
cpi          0.70      retail           0.80
gas          0.80      rubber           0.80
gnp          0.90      ship             0.90
gold         0.90      strategic-metal  0.85
grain        0.85      sugar            0.95
housing      0.80      tin              0.80
ipi          0.65      veg-oil          1.00
iron-steel   0.75      wpi              0.70
jobs         0.70      zinc             0.95

Table 11: Fraction of nouns among the top 20 words of each cluster centroid

cocoa                      natural gas                 grain
nouns          words       nouns          words        nouns          words
cocoa          cocoa       gas            ga           grain          grain
buffer         buffer      natural        natur        usda           usda
stocks         icco        pipeline       pipelin      soviet         crop
delegate       stock       foot           feet         crop           soviet
rules          deleg       energy         cubic        certificates   certif
tonne          tonn        contract       energi       agriculture    agricultur
manager        rule        texas          file         tonne          mln
council        manag       company        contract     estimate       tonn
consumers      purchas     customer       flow         official       estim
international  bui         oil            compani      import         offici
producer       council     corp           texa         department     ussr
ghana          consum      access         coastal      program        report
purchase       intern      petroleum      transamerican farm          harvest
differential   ghana       bankruptcy     court        ussr           drought
bra            produc      field          corp         harvest        farm
buy            differenti  interest       mln          elevator       elev
organization   organ       filing         transco      china          lyng
compromise     compromis   canadian       transport    bill           program
season         kanon       subsidiary     custom       report         import
meeting        bra         spot           access       land           depart

gold                       copper                      jobs
nouns          words       nouns          words        nouns          words
gold           gold        copper         copper       unemployment   unemploy
mine           mine        cent           cent         pct            pct
ounce          ounc        lb             lb           workforce      workforc
ton            coin        cathode        cathod       number         number
coin           ore         price          effect       unemployed     unemploi
foot           warrant     alloy          immedi       fell           adjust
ore            explor      corp           pound        people         fell
tons           venture     subsidiary     price        benefit        season
miner          compani     octane         mine         record         mln
mining         dlr         magma          ct           rate           total
fire           south       mining         corp         employment     people
company        product     aluminum       newmont      statistics     benefit
reserves       issu        electrolytic   magma        labour         jobless
production     corp        pound          subsidiari   total          statist
venture        bullion     lowering       alloi        figure         rate
deposit        deposit     raising        electrolyt   office         record
exploration    mln         company        compani      department     end
south          price       pump           inspir       labor          labour
property       miner       rod            lower        official       employ
underground    properti    mineral        resin        application    compare

Table 12: Comparison between the noun-based and word-based centroids

The results in Table 11 suggest that nouns occupy a large portion of most of the centroids. For example, coverage above 85% is observed for the cocoa, gold, sugar, and zinc topics.
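The centroid analysis above amounts to taking the top-20 weighted terms of each centroid and computing the fraction that are nouns. A minimal sketch follows; the helper name, vocabulary, weights, and noun set are illustrative toy values, not data from this thesis:

```python
import numpy as np

def noun_fraction_of_centroid(centroid, vocab, nouns, top_k=20):
    """Fraction of nouns among the top_k highest-weighted terms of a centroid.

    centroid : 1-D weight vector aligned with `vocab`
    vocab    : list of terms, one per centroid dimension
    nouns    : set of terms identified as nouns (e.g., via a WordNet lookup)
    """
    top_idx = np.argsort(centroid)[::-1][:top_k]   # indices of largest weights
    top_terms = [vocab[i] for i in top_idx]
    return sum(t in nouns for t in top_terms) / top_k

# Toy illustration (hypothetical values):
vocab = ["cocoa", "buffer", "stocks", "rise", "tonne", "sell"]
centroid = np.array([0.9, 0.7, 0.6, 0.5, 0.4, 0.1])
nouns = {"cocoa", "buffer", "stocks", "tonne"}
print(noun_fraction_of_centroid(centroid, vocab, nouns, top_k=4))  # 0.75
```

Applying this to each of the 30 cluster centroids yields the fractions reported in Table 11.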
It is important to mention that having nouns dominate the centroids does not necessarily assure a better solution. However, it does suggest that one may still be able to recover most of the clusters after removing the non-nouns from each document. Table 12 compares the top 20 features of 6 selected centroids obtained by applying spherical k-means on Min20-Max200 words and Min20-Max200 nouns. Note the high overlap between the top 20 features of the noun-based and word-based centroids.

4.3.1.2 Nouns and LSI Analysis

In previous experiments, we showed that nouns are a major component of document clusters. In this experiment, we examine the importance of nouns in capturing the overall structure of the data. We consider two approaches for evaluating this. The first approach examines the percentage of nouns in the concepts produced by Latent Semantic Indexing (LSI). LSI is a widely used method that finds a low-rank approximation of the term-document matrix by decomposing it into matrices of left and right singular vectors. We examined the fraction of nouns that contribute to the first 10 components (i.e., left singular vectors). Specifically, we sort the words based on their magnitude in a given component and select the top words at the 99th (top 72 words), 95th (358 words), and 90th (717 words) percentiles. Table 13 shows the results. Observe that the nouns contribute at least 75% of the words in the 99th percentile across all 10 components (concepts). This result again emphasizes the important role of nouns in identifying the topics in a dataset.
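The first approach can be sketched as follows: take one left singular vector of the term-document matrix, keep the words whose absolute loadings sit at or above a percentile cutoff, and report the share of nouns among them. The function name and the tiny term-document matrix below are illustrative assumptions, not the thesis data:

```python
import numpy as np

def noun_pct_at_percentile(td_matrix, vocab, nouns, component=0, pct=99):
    """Percentage of nouns among words whose |loading| on one left singular
    vector (LSI component) of the term-document matrix is at or above the
    given percentile."""
    U, _, _ = np.linalg.svd(td_matrix, full_matrices=False)
    loading = np.abs(U[:, component])            # one loading per vocab term
    cutoff = np.percentile(loading, pct)
    top = [vocab[i] for i in np.nonzero(loading >= cutoff)[0]]
    return 100.0 * sum(w in nouns for w in top) / len(top)

# Tiny illustration: 4 terms x 3 documents (counts are made up).
vocab = ["grain", "harvest", "rising", "usda"]
nouns = {"grain", "harvest", "usda"}
X = np.array([[3, 2, 0],
              [2, 3, 0],
              [1, 1, 2],
              [2, 2, 1]], dtype=float)
print(noun_pct_at_percentile(X, vocab, nouns, component=0, pct=50))
```

Repeating this for each of the first 10 components and for the 99th, 95th, and 90th percentiles produces a grid of the kind shown in Table 13.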
Component   99th   95th   90th
1           82%    79%    73%
2           83%    79%    72%
3           78%    72%    71%
4           81%    75%    70%
5           81%    76%    71%
6           81%    72%    68%
7           80%    71%    79%
8           78%    74%    68%
9           75%    75%    67%
10          82%    74%    70%

Table 13: Percentage of nouns at different percentiles in the first 10 components

The second approach is to compute the mutual information of each word w with the document set D according to the following formula:

I(w) = p(w) Σ_{x ∈ D} p(x|w) log( p(x|w) / p(x) )    (7)

This formula was used in [94] to select the top 2000 features of a text corpus. We rank the words according to their mutual information score and compute the percentage of nouns among the top-50, top-100, and top-200 words with the highest mutual information. The results in Table 14 show that although nouns compose roughly 36-39% of the words in the Binary-Multiple-500 data, they account for more than 60% of the words with the highest mutual information.

Dataset    Top-50    Top-100   Top-200   All words
Binary1    64.00%    63.00%    66.50%    37.20%
Binary2    66.00%    71.00%    69.00%    37.30%
Binary3    64.00%    64.00%    65.50%    37.40%
Multi51    62.00%    61.00%    65.00%    36.50%
Multi52    70.00%    64.00%    64.00%    36.00%
Multi53    82.00%    72.00%    72.50%    37.50%
Multi101   76.00%    74.00%    73.50%    37.70%
Multi102   80.00%    78.00%    75.50%    37.80%
Multi103   66.00%    73.00%    70.50%    38.50%

Table 14: Percentage of nouns in top k words with highest mutual information

4.3.1.3 Nouns versus Random Words: Comparison

The goal of this experiment is to show that representing the documents using only nouns suffices to form clusters comparable to those obtained using all words. One way to show this is to cluster the documents once using only nouns and once using a random set of words drawn from the documents that is the same size as the noun feature set. If the quality of the clusters obtained using nouns is close to that using all words, we can conclude that nouns are indeed a strong participant in forming the partitions. The experiment is conducted in two phases.
In the first phase, we cluster the documents using the set of nouns as features. The second phase is similar to the first, except that the features are randomly selected and include nouns as well as non-nouns. To be consistent with the first phase, the size of the random set of words is equal to the size of the set of nouns, in order to avoid any bias related to the size of the feature set. As this is a random sample, we repeat the experiment 50 times, each time using a different random sample. The Min20-Max200 dataset was used in this experiment, and the results are summarized in the histogram shown in Figure 10. The vertical line shows the cluster purity using nouns only. The purity obtained by nouns is 0.78, compared to an average purity of 0.682 using random words. These results suggest that nouns, as a reduced feature set, can produce better clusters than any other random set of words of similar size.

[Figure 10: Purity of clusters of nouns versus random words. Histogram of purities over the 50 random-word runs (x-axis: purity, 0.6 to 0.8; y-axis: frequency), with a vertical line marking the nouns-only purity.]

4.4 Feature Selection using Information Gain

As shown in the flowchart in Figure 7 in Chapter 3, the next step after concept mapping is feature selection. We propose a feature selection approach to further reduce dimensionality and retain only the core semantic features that contribute the most to clustering. We use an information-theoretic approach to select salient semantic features. In Section 4.4.1 we explain how features are selected based on information gain, and in Section 4.4.2 we present our method of selecting core semantic features.

4.4.1 Information Gain

We use an information gain approach to identify nouns that have high information gain after being replaced by their corresponding semantic concepts.
A high information gain means that the average entropy of the semantic concepts after disambiguation is significantly lower than the entropy of the original noun (before disambiguation). In other words, most of the documents that contain each of the disambiguated semantic concepts are from the same class (even though the documents that contain the original noun belong to different classes). Unfortunately, computing information gain requires knowledge of the ground truth, which is unavailable due to the unsupervised nature of the clustering task. To compensate, we approximate the ground truth using clusters obtained from all nouns and all semantic concepts. Since spherical kmeans is sensitive to the initial centroids, clustering is repeated for a number of iterations to obtain more stable clusters to simulate the ground truth. The following steps describe how the information gain of a noun n is computed:

1. Cluster the documents separately using the nouns as features and the semantic concepts as features. Let π_n be the set of noun clusters and π_s be the set of semantic clusters.

2. Compute the entropy of each noun n using the π_n clusters. Let p(i|n) be the fraction of documents containing noun n that belong to cluster i. The entropy of a noun n is:

e_noun(n) = − Σ_{i ∈ π_n} p(i|n) log p(i|n)

3. Compute the average entropy of the concepts associated with this noun as retrieved by our WSD approach. Let C(n) = {c_1, c_2, ..., c_l} be the set of concepts that disambiguate the noun n, and let p(j|n,c_i) be the fraction of documents containing noun n (before disambiguation) and concept c_i (after disambiguation) that were assigned to semantic cluster j. Furthermore, p(c_i|n) corresponds to the fraction of documents containing noun n that were disambiguated to concept c_i. The average entropy of the set of concepts associated with noun n is:

e_concept(n) = Σ_{c_i ∈ C(n)} p(c_i|n) [ − Σ_{j ∈ π_s} p(j|n,c_i) log p(j|n,c_i) ]    (8)

4.
The information gain from disambiguating the noun n is computed as the difference in entropy before and after disambiguation:

Gain(n) = e_noun(n) − e_concept(n)    (9)

Example 3: Consider the example shown in Table 15, in which a noun n is replaced (using WSD) with {c1, c2, c3}. π_n1 and π_n2 are the noun clusters, and π_s1 and π_s2 are the semantic clusters.

        Frequency Before        Frequency After
        Disambiguation          Disambiguation
        π_n1      π_n2          π_s1      π_s2
n       3         3
c1                              2         1
c2                              0         1
c3                              1         1

Table 15: Document distribution across clusters before and after disambiguation

The entropy values before and after disambiguation can be computed as follows (logarithms base 2):

e_noun(n) = −(3/6)log(3/6) − (3/6)log(3/6) = 1

e_concept(n) = (3/6)(−(2/3)log(2/3) − (1/3)log(1/3)) + (1/6)(−1 × log(1)) + (2/6)(−(1/2)log(1/2) − (1/2)log(1/2)) = 0.7925

Gain(n) = e_noun(n) − e_concept(n) = 1 − 0.7925 = 0.2075
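The computation in Example 3 can be reproduced with a short script. This is a sketch assuming base-2 logarithms, with illustrative variable names:

```python
import math

def entropy(counts):
    """Entropy (base 2) of a cluster-count distribution; zero counts are skipped."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Document counts from Table 15.
noun_counts = [3, 3]                               # noun n across pi_n1, pi_n2
concept_counts = {"c1": [2, 1], "c2": [0, 1], "c3": [1, 1]}

n_total = sum(noun_counts)
e_noun = entropy(noun_counts)                      # entropy before disambiguation
e_concept = sum(sum(c) / n_total * entropy(c)      # weighted average, eq. (8)
                for c in concept_counts.values())
gain = e_noun - e_concept                          # eq. (9)
print(round(e_concept, 4), round(gain, 4))  # 0.7925 0.2075
```

Each concept's entropy is weighted by p(c_i|n), the fraction of the noun's documents disambiguated to that concept, matching equation (8).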
This final restriction further reduced the dimensionality but did not empirically change the clustering results. This is likely due to the fact that, according to our experiments, polysemous and synonymous are approximately 60% of the most frequent nouns. 2. The noun should achieve either an information gain that is greater than a predefined threshold t or an entropy equals to 0 after disambiguation, i.e., econcept(n) 20- 63 [ Algorithm: Extracting Core Semantic Features Input: A set of documents, D, a set of nouns N, a set of concepts C, a set of noun clusters .71] 71' n , a set of semantic clusters s . the list of polysemous and synonymous nouns Mpoly,synm , the list of frequent noun Mfreq, and a threshold t. Output: Document clusters Dc using as a feature set the core semantic features F C l. Initialize the set of core semantic features F c = Q 2. for each n E N a. Identify the concept set c associated with n using the above described method. b. Identify the information gain [G of n by clustering with and without the concept(s) c. c. if all of these conditions hold: n E Mpolygynm , n E Mfreq and 1G 2 t then add c to the list of core semantic features i.e., F c (-F CU c 3. Identify the document subset D, which are the documents covered by the newly identified core features F c 4. Use spherical kmeans to cluster D with feature set F c 5. Identify the document set D'=D - D 6. Map the uncovered documents D' to the best of the existing centroids from step 4. The first part of this requirement ensures that disambiguating an ambiguous noun makes a considerable change in the document distribution across the clusters. The second part of this requirement is related to the nouns for which the distribution of the documents across the clusters after disambiguation produce documents to produce the base clusters. 
However, since the set of core features is only a small subset of the entire feature set, only a subset of the documents will be clustered, which we call the "covered documents". That is, a number of documents become "uncovered", containing none of the core semantic features. We address such uncovered documents by mapping them to the clusters with the closest centroids.

4.4.3 Experimental Evaluation

This section presents the empirical validation of our proposed algorithm for extracting core semantic features. The performance of our algorithm is evaluated in terms of cluster purity, amount of feature reduction, and interpretability of the cluster centroids. We used Reuters2 and samples from 20newsgroups, which we discussed in Chapter 3.

4.4.3.1 Clustering using Core Semantic Features

Having described our approach, we report here on the performance of spherical kmeans clustering using core semantic features. We chose spherical kmeans as our underlying clustering algorithm for several reasons. First, it is a popular algorithm used by many ontology-driven clustering methods. Furthermore, since our goal is to evaluate the utility of the core semantic features rather than the effectiveness of the clustering algorithm, we avoid algorithms that perform additional feature transformations during clustering. Spherical k-means fits our criteria because it considers every feature in the input data, unlike other document clustering algorithms such as spectral clustering, which clusters the eigenvectors of a Laplacian matrix constructed from the feature vectors of its input data. We compared the performance of the following four methods:

1. Spherical kmeans clustering using all nouns as features.
2. Spherical kmeans clustering using all extracted concepts (senses) from WordNet as features.
3.
Spherical k-means clustering using a combination of all nouns and their corresponding concepts (including hypernyms of each concept up to five levels distant in the WordNet concept hierarchy). This approach is similar to the one given by (Hotho et al., 2003), except we use binary features rather than the term frequency (to make an equivalent comparison with our binary weighted features).
4. Spherical k-means clustering using the core semantic features (labeled CSF).
The clustering parameters used were the same for all four methods. The parameter k is set to the known number of classes for these data sets. Because spherical k-means is dependent on the initial centroid locations, clustering is repeated 50 times for each method. Empirically, we observed that the results do not change considerably even if the number of iterations increases to 100. The quality of the clustering results is evaluated using purity, which is a supervised measure that computes the fraction of documents correctly assigned to their ground truth clusters. In addition, the amount of reduction in the number of input features is used as another criterion for evaluating the usefulness of the core semantic features. Given a baseline method B, we compute the percentage of reduction in the number of input features as follows:

%Reduction = (#Features(B) − #Features(CSF)) / #Features(B)

where CSF denotes the proposed core semantic features approach. The results of our experiments are summarized in Table 16. The following observations are made when comparing the CSF approach against the other baseline methods:
1. We achieve a feature reduction of at least 90% using the CSF approach on all the data sets (see columns 10-12 of Table 16). On average, the number of core semantic features used for clustering is close to 6% of the number of nouns used for spherical k-means clustering.
The ontology-driven clustering approach developed by Hotho has the highest number of features because it augments the original nouns with concepts from the WordNet hierarchy. On average, the number of features selected using our CSF approach is about 3% of the total number of features required by Hotho's method.
2. Depending on the data sets used, ontology-driven clustering may or may not improve the performance of k-means clustering using all nouns. In data sets where ontology helps (bin7, bin9, bin12, and bin13), the cluster purity obtained using core semantic features (column 9 in Table 16) is higher than that obtained using all nouns (column 3 in Table 16). For datasets where ontology does not help (bin21, bin10, multi101, and Reuters2), the cluster purity of CSF is lower than that using all nouns.

[Table 16: Comparison of the four methods (all nouns, all concepts, nouns + concepts (Hotho), and core semantic features) in terms of cluster purity, feature counts, and percentage of feature reduction; the table body is not recoverable from the scanned source.]
Overall, our proposed approach is comparable (we consider two cluster purities comparable if they differ by less than 0.02) to or better than clustering using all nouns in 12 of the 17 datasets, even though it has significantly fewer features.
3. The cluster purity obtained using Hotho's method tends to be closer to the minimum cluster purity obtained using all nouns or all concepts. We suspect this is due to the presence of both nouns and concepts in the feature vector, which increases the number of features considerably. In two of the data sets where ontology helped (bin9, bin12), the cluster purity is around 0.625, which is lower than the purity of using all nouns.
In 12 of 19 datasets, the cluster purity for our CSF approach (column 9 in Table 16) is at least comparable to the cluster purity of using all concepts, even though it uses fewer than 10% of the semantic concepts. For bin4, bin7, bin9, and multi10, the cluster purity using CSF is higher. These results suggest that the core semantic features provide a good representative subset of the ontological concepts used for clustering.
There are several possible explanations for the poor performance of the CSF approach in 7 of the remaining 19 datasets. First, 3 of the datasets correspond to the case where ontology does not help (bin14 and Reuters2). Second, 2 of these datasets have a wide range of topics (Reuters2, multiple6000, and multi101); this issue is discussed in more detail later in this section. Furthermore, the core semantic features are extracted using the information gain when that gain exceeds the threshold t, as described in our algorithm.
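The two evaluation quantities used in these comparisons, purity and the feature-reduction percentage, can be computed as below. The function names and toy inputs are illustrative, not from the thesis code.

```python
from collections import Counter

def purity(assignments, labels):
    """Fraction of documents assigned to the majority ground-truth
    class of their cluster (higher is better)."""
    clusters = {}
    for c, y in zip(assignments, labels):
        clusters.setdefault(c, []).append(y)
    # Each cluster contributes the count of its most common true class.
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in clusters.values())
    return correct / len(labels)

def pct_reduction(n_baseline, n_csf):
    """%Reduction = (#Features(B) - #Features(CSF)) / #Features(B)."""
    return (n_baseline - n_csf) / n_baseline

print(purity([0, 0, 1, 1], ["a", "a", "b", "a"]))  # 0.75
print(pct_reduction(1000, 60))                     # 0.94
```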
Threshold t is an adjustable value that determines the level of information gain sufficient to change the clustering results when replacing a noun with its associated concept. The core semantic features can be used both to remove semantic noise introduced by the WSD process and to exclude semantic features that have no effect on the document distribution across the clusters (estimated classes). The results reported in Table 16 are based on a fixed threshold t = 0.5. For datasets such as bin13, we achieve a higher purity, comparable to using all concepts, when the threshold is reduced to 0.3.
In spite of the benefits gained from focusing on the core features, this small portion of the total feature set might not cover all the topics in a dataset. That is, the core semantic features might not include any features from some documents, leaving them uncovered by the features being used. This requires mapping the uncovered documents to one of the existing core feature centroids based on the "closeness" of those centroids. This is especially prevalent in the case of multi-class datasets where the topics are more varied. For example, in the multiple6000 dataset the purity using all nouns is 0.486, but the purity decreases to 0.439 when using only the core semantic features to cluster all the documents (uncovered as well). In the Reuters2 dataset, the purity of the all-noun case, 0.65, decreased to 0.305 when using only the core semantic features to cluster all the documents.

4.4.3.2 Cluster Refinement

To improve the performance of our approach when applied to multi-class datasets, we explored different approaches for mapping the uncovered documents to the core feature centroids. We modified the mapping as follows. Instead of mapping the entire set of uncovered documents to their core feature centroids, a random subset of the uncovered documents was mapped to their closest centroids. After mapping, the centroids of the clusters were updated to reflect the newly added documents.
This was done iteratively until all the uncovered documents were mapped. This modification improved the final cluster purity for multiple6000 and Reuters2. For multiple6000, the purity was raised from 0.439 to 0.495. For Reuters2, the cluster purity increased from 0.305 to 0.433. In the multiple6000 case, the cluster purity was comparable to that obtained using all nouns, as was the case for most of the other datasets. However, for the Reuters2 dataset, the purity remains below the purity for all nouns. Investigating further, we noticed that the confusion matrix for Reuters2 had only 10 of 20 topics covered by the core semantic features, thus leaving half of the topics uncovered. However, the purity of the clusters of the covered documents was high, 0.914, as shown in Table 16. Thus clustering performance was good for parts of the document set. These shortcomings might be addressed by using multiple sets of core semantic features that cover different subsets of the document set, essentially the idea of subspace clustering. Doing this iteratively would produce clusters of different parts of the document space that would have to be combined. Consider the following example. We slightly modified our approach by imposing a threshold on the distance of the documents from the closest centroid. A document is mapped to a cluster if it is within n standard deviations distance from its centroid; otherwise, the document is not clustered. This was done to show the tradeoff between cluster purity and coverage using the core semantic features. As shown in Table 17, increasing n caused more documents to be clustered, but with lower cluster purity. This essentially means that the extracted core features cover only some of the topics from the document set. We propose to enhance the algorithm by performing iterative core feature identification.
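The n-standard-deviation mapping rule can be sketched as follows. The text does not spell out which distribution the deviations are taken over, so the choice below (mean and standard deviation of each document's nearest-centroid distance) is an assumption, not the thesis implementation.

```python
import numpy as np

def map_within_n_std(uncovered, centroids, n):
    """Map each uncovered document to its closest centroid, but only if
    its distance is within n standard deviations of the mean
    nearest-centroid distance; otherwise leave it unclustered (-1)."""
    # Pairwise Euclidean distances: shape (num_docs, num_centroids).
    dists = np.linalg.norm(uncovered[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)   # index of closest centroid per document
    best = dists.min(axis=1)         # distance to that centroid
    mu, sigma = best.mean(), best.std()
    return np.where(best <= mu + n * sigma, nearest, -1)

# Toy run: two documents near centroid 0, one far outlier.
docs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
print(map_within_n_std(docs, centroids, 2))  # all three documents mapped
print(map_within_n_std(docs, centroids, 1))  # the outlier stays unclustered (-1)
```

As in Table 17, raising n trades purity for coverage: a larger n admits more distant (and more often mislabeled) documents.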
In short, we apply the described algorithm and include uncovered documents only within some threshold n, a measure of the distance between a centroid and any document. Those documents that remain unclustered are redesignated as a new document set and the algorithm is reapplied. The result is a set of clusters based on different core feature sets. Again, those clusters would have to be combined at the end of the run to provide an overall result. These are suggestions we will explore in our future work. In short, our empirical results showed that the core semantic features not only produced clusters with comparable (or sometimes higher) purity than using all concepts, but also reduced the number of features significantly. Despite their small number, these core semantic features are informative and sufficiently capture the main themes of a text corpus. Thus, they can be used to efficiently cluster new documents using a reduced number of semantic concepts.

Standard Deviation (n)   # Mapped Documents   Purity of the Mapped Documents
1                        161                  0.936
3                        162                  0.93
5                        236                  0.757
7                        598                  0.521
9                        1403                 0.386
11                       2117                 0.321
Table 17: Threshold-based centroid mapping

4.4.3.3 Effect of using Core Semantic Features

Our strategy for selecting the core features using information gain can be used to find the main topics in a dataset. Generally, after clustering is performed, topics can be inferred from the centroids of the formed clusters. If cluster purity indicates that the clusters are sufficiently representative, then we assume that the centroids of the clusters cover the main topics of the documents and each cluster contains documents that share a similar topic. In this experiment, we show the advantages of using the core semantic features for clustering. Due to space limitations, the results are shown for dataset bin10 only, although similar plots can be drawn for other datasets.
First, we compare the distances of the documents to the respective cluster centroids when clustering using the core semantic features. Since there are only two clusters, the distances can be visualized in a 2-d plot as shown in Figure 11.

[Figure 11: Distances of documents from centroids of clusters using the core concepts (axes: distance from centroid of cluster1 vs. distance from centroid of cluster2)]

Each data point is marked as either a circle or an asterisk, depending on its ground truth class. Observe that all the points marked as circles (denoting class1) have a significantly larger distance to the centroid of cluster 2 than to the centroid of cluster 1. On the other hand, if we use all the concepts as features, the distance from a document to the centroid of its opposite class is no longer pushed to its maximum value (see Figure 12).

[Figure 12: Distances of documents from centroids of clusters using all the concepts]

As previously noted, the WSD process could potentially introduce erroneous semantic features into the text corpus. Since the core semantic features are frequent and have high information gain, we expect such features to be more accurate when used to guide the clustering process. To examine the quality of the core semantic features, we compare the cluster centroids obtained using k-means clustering on all nouns against the centroids obtained from core semantic features. Table 18 shows the list of the top 15 features that form the cluster centroids for the bin10 dataset.
Top 15 Nouns                      Top 15 Core Concepts
Noun centroid1   Noun centroid2   Concept centroid1   Concept centroid2
Space            Article          Circuit             orbit
Article          Work             Ampere              mission
NASA             Time             Transformer         satellite
System           Power            Chip                military
Orbit            Make             Police              budget
Time             Email            Resistor            landing
Make             Question         Advice              Mary
Years            Good             chips               Satellite
Earth            Circuit          wiring              Gravity
Program          Line             Connection          astronomy
Work             Problem          amplifier           exploration
Science          ld               player              Astronaut
Shuttle          Back             memory              Comet
research         Current          Port                Mars
Launch           University       Exit                Dynamics
Table 18: Top 15 features of the centroids in the clusters using nouns and the clusters using the core concepts

The dataset contains documents belonging to the "science.space" and "science.electronics" categories. Using all nouns as features (columns 1 and 2 in Table 18), observe that the top features of the cluster centroids contain many noisy terms (e.g., Make and Years for centroid 1 and Article, Time, and Email for centroid 2). On the other hand, using the core semantic concepts as features (columns 3 and 4 in Table 18), most of the top features that form the centroids clearly identify the topic of the class. For example, the second cluster centroid contains concepts such as "orbit", "mission", "satellite", "atmosphere", "exploration", "gravity" and "astronomy", which are all related to the "science.space" class, whereas the centroid for the first cluster includes "circuit", "ampere", "transformer", "chip" and "resistor" as its top ranked features, which is consistent with the "science.electronics" class. A question might be asked about the concept "Mary", which appears in the top 10 concepts for "science.space". The reason is that "Mary" exists as an individual concept in WordNet and "Mary Shafer", a NASA employee, was a frequent contributor to the newsgroup "science.space".
In short, the results of this section suggest that the core semantic features produce cluster centroids that are informative and relate to the main topics of the documents.

4.5 Summary

In this chapter, we presented the evaluation metrics used to validate our methods. We also described our simplified noun-based baseline clustering approach, in which the nouns are identified using WordNet and the clusters are obtained by applying spherical k-means to the documents. Finally, we proposed a feature selection approach based on information gain in which only a subset of the semantic features is selected to be used in clustering. Our experimental results showed that the core semantic features were sufficient to not only maintain or possibly improve clustering but also substantially reduce the dimensionality of the feature set.

Chapter 5

5 Clustering Documents using WordNet: An Ensemble Approach

In this chapter we propose a new method in which we combine statistics and semantics to improve clustering. We first give a detailed description of the method; we describe how we build the noun ensemble and the sense ensemble. Then we explain how we combine the clustering solutions of both ensembles to obtain the final clusters.

5.1 Preliminaries

Given a text corpus of N documents D = {d1, d2, ..., dN}, our goal is to partition D into k clusters by incorporating semantic knowledge. Let W = {n1, n2, ..., nm} be the corresponding set of nouns. We consider the bag-of-nouns representation in which each document di is represented as a frequency vector where each entry carries the frequency of the corresponding noun in the document. Similarly, let V = {s1, s2, ..., sr} be the set of senses. Each document di is represented as a binary vector whose entries are 1 if the corresponding sense exists in the document and 0 otherwise.
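The two document representations can be illustrated with a toy example; the vocabulary, document, and WSD output below are made up for illustration.

```python
from collections import Counter

# Toy vocabularies of nouns and senses (illustrative, not from the thesis data).
nouns = ["orbit", "circuit", "mission"]
senses = ["orbit#n#1", "circuit#n#1", "mission#n#1"]

doc = "orbit mission orbit".split()
doc_senses = {"orbit#n#1", "mission#n#1"}   # assumed output of a WSD step

# Bag-of-nouns frequency vector: entry i = frequency of noun i in the document.
freq_vec = [Counter(doc)[n] for n in nouns]

# Binary sense vector: entry j = 1 if sense j occurs in the document, else 0.
bin_vec = [1 if s in doc_senses else 0 for s in senses]

print(freq_vec, bin_vec)  # [2, 0, 1] [1, 0, 1]
```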
5.2 Proposed Framework

The proposed ensemble clustering framework combines the clustering solutions obtained from the semantic similarity of the documents with those obtained based on frequency similarities.

Algorithm: Combined Ensemble Clustering
Input: A set of documents D, the number of clusters k, and the maximum number of iterations l.
Output: Cluster membership matrix C.
1. A_f ← Preprocessing(D)
2. A_s ← WordSenseDisambiguation(A_f)
3. Initialize Ens_sense = ∅ and Ens_noun = ∅
4. for t = 1 to l do
     C_f ← Noun_Cluster_Membership(A_f, k)
     Ens_noun ← Ens_noun ∪ {C_f}
     C_s ← Sense_Cluster_Membership(A_s, k)
     Ens_sense ← Ens_sense ∪ {C_s}
5. end
6. C ← Combine(Ens_noun, Ens_sense, k)

Our rationale for using ensemble clustering is that, although individual clustering solutions (using either frequency or semantic similarity) may make poor decisions regarding the cluster assignments for some documents, one may be able to improve the clustering results by considering their collective decisions. The proposed framework is also highly flexible because it may accommodate any baseline clustering algorithm as well as any methodology for creating different instances of the ensemble. The ensemble clustering framework has two types of data inputs: (1) a document-noun frequency matrix A_f, which corresponds to the preprocessed document vectors with frequency values, and (2) a document-sense binary matrix A_s, which corresponds to binary document vectors obtained by transforming the nouns contained in each document to their corresponding senses using WordNet. No frequency is included in the A_s matrix. A set of noun-based clustering solutions, Ens_noun, is then generated from A_f using the methodology presented in Section 5.3. Analogously, a set of sense-based clustering solutions, Ens_sense, is obtained by applying the methodology given in Section 5.4. The clustering solutions from Ens_noun and Ens_sense are then aggregated to obtain the final clustering using the approach described in Section 5.5.
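The loop in the algorithm above can be sketched generically; `cluster_membership` and `combine` below are hypothetical stand-ins for the procedures of Sections 5.3-5.5, and the toy run wires in trivial dummies just to show the control flow.

```python
import numpy as np

def combined_ensemble(A_f, A_s, k, l, cluster_membership, combine):
    """Sketch of steps 4-6: build l noun-based and l sense-based
    membership matrices, then aggregate them into final clusters."""
    ens_noun = [cluster_membership(A_f, k) for _ in range(l)]
    ens_sense = [cluster_membership(A_s, k) for _ in range(l)]
    return combine(ens_noun, ens_sense, k)

# Toy run with stand-in components: each "clustering" returns identity
# memberships, and combine() just counts the solutions it received.
n_solutions = combined_ensemble(
    None, None, k=2, l=3,
    cluster_membership=lambda A, k: np.eye(2),
    combine=lambda en, es, k: len(en) + len(es),
)
print(n_solutions)  # 6
```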
5.3 Noun Cluster Membership Matrix

The noun cluster membership matrix is generated by applying ensemble clustering to the document-noun frequency matrix A_f. First, we randomly sample a subset of the nouns, Ws ⊆ W. (This assumes each clustering solution is independent and does better than random cluster assignments.) To alleviate bias in the sample size |Ws|, the number of nouns to be sampled is an integer chosen randomly between m/2 and m − 1, where m is the total number of nouns in the dataset vocabulary. A truncated document-noun frequency matrix A'_f is then created by choosing only the nouns in each document that belong to the subset Ws. Once the truncated matrix is obtained, the weights for each document vector are further normalized using the TFIDF method [90]. Finally, we apply the spherical k-means algorithm to obtain an N × k frequency-based cluster membership matrix C_f, whose (i,j)-th element is equal to 1 if the document di belongs to cluster j and 0 otherwise. We repeat the process to get l clustering solutions (where l ∈ {10, 20, 30}) that will be used to compose the Noun Ensemble (Ens_noun) as we will describe in Section 5.5.

5.4 Sense Cluster Membership Matrix

Our approach for creating the semantic cluster membership matrix is quite similar to the approach described in the previous section. However, instead of using the document-noun frequency matrix, we use the document-sense binary matrix A_s as input to the ensemble clustering algorithm. We randomly sample a subset of the senses, where the sample size is a random integer chosen between |V|/2 and |V| − 1. A truncated document-sense binary matrix A'_s is then created by removing all the senses not included in the sample. After normalization using TFIDF, we apply the k-means clustering algorithm to obtain the sense-based cluster membership matrix C_s, where the element C_s(i,j) = 1 if the document di belongs to cluster j and C_s(i,j) = 0 otherwise.
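The feature subsampling and the 0/1 membership-matrix construction described above might look like this; this is a sketch under the stated sampling bounds, with the spherical k-means step itself abstracted away.

```python
import numpy as np

def sample_features(A, rng):
    """Randomly keep between m/2 and m-1 columns of the
    document-feature matrix A (m = total number of features)."""
    m = A.shape[1]
    size = rng.integers(m // 2, m)              # integer in [m/2, m-1]
    cols = rng.choice(m, size=size, replace=False)
    return A[:, cols]                           # truncated matrix A'

def membership_matrix(labels, k):
    """N x k 0/1 matrix: C[i, j] = 1 iff document i is in cluster j."""
    C = np.zeros((len(labels), k), dtype=int)
    C[np.arange(len(labels)), labels] = 1
    return C

A = np.arange(12).reshape(3, 4)                 # 3 documents, 4 features
print(sample_features(A, np.random.default_rng(0)).shape)
print(membership_matrix([0, 1, 0], k=2))        # rows: [1,0], [0,1], [1,0]
```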
In addition to this approach, we have experimented with other approaches for generating the sense-based cluster membership matrix. One approach is to apply agglomerative hierarchical clustering on the Wu-Palmer similarity matrix between the senses in a document. We then sample the senses based on the cluster cohesion (e.g., sample 70% of the senses in the cohesive clusters and 20% from the non-cohesive clusters). Although this approach intuitively helps to ensure that the sample retains some of the "core" senses of the document, our experimental results do not seem to show any significant improvements in the clustering results despite being more computationally intensive.

5.5 Combined Ensemble Clustering

This section describes our proposed approach for combining the Noun Ensemble (Ens_noun) with the Sense Ensemble (Ens_sense). First, an N × N weighted co-association matrix S_f is computed from the set of frequency-based cluster membership matrices obtained as described in Section 5.3. The co-association matrix represents the number of times a pair of documents is assigned to the same cluster in the ensemble, weighted by the "quality" of the individual clustering solution. Formally, it is iteratively computed from the frequency-based cluster membership matrices C_f as follows:

S_f^(t+1) = S_f^(t) + w_t C_f^(t) (C_f^(t))^T    (1)

where the matrix product C_f^(t) (C_f^(t))^T is a binary 0/1 matrix that indicates whether a pair of documents belongs to the same cluster during the t-th iteration of the ensemble, and the weighting factor w_t measures the quality of the clustering. The matrix product C_f^(t) (C_f^(t))^T is also known as an incidence matrix in the clustering literature and will be denoted by the symbol I_t in the remainder of this section. In principle, if the true clusters are known, w_t is given by the accuracy of the individual clustering solution.
However, since clustering is an unsupervised learning task, its quality needs to be estimated using unsupervised cluster validity measures such as the correlation coefficient. Given a document-noun frequency matrix A_f, we may compute the cosine similarity matrix for all pairs of documents by normalizing the rows in A_f to unit length and multiplying the normalized matrix with its transpose, i.e.,

Σ_cosine = A_f A_f^T    (2)

where A_f here denotes the row-normalized matrix. The weighting factor w_t is then computed by correlating the cosine similarity matrix Σ_cosine with the incidence matrix I_t. The higher the correlation, the greater the level of agreement between the clustering results and the document similarity matrix Σ_cosine. Equation (1) is iteratively updated using all the clustering solutions on the noun frequency matrix. The weighted co-association matrix S_f effectively encodes the likelihood that a pair of documents is in the same cluster based on its noun frequency information. Furthermore, we applied k-means on S_f to obtain the final clustering for the Noun Ensemble. The procedure is repeated to obtain the co-association matrix S_s:

S_s^(t+1) = S_s^(t) + w_t C_s^(t) (C_s^(t))^T    (3)

where the incidence matrix C_s^(t) (C_s^(t))^T depends on the clustering solutions of the document-sense binary matrix A_s, and the weighting factor w_t is the correlation coefficient between the cosine similarity of the documents (computed from the document-sense binary matrix A_s) and the incidence matrix. The overall Sense Ensemble can be obtained by applying the k-means algorithm to the weighted co-association matrix. However, since our intention is to combine Ens_noun with Ens_sense, we linearly aggregate their weighted co-association matrices as follows:

S_combined = α S_s + (1 − α) S_f    (4)

where α is a parameter that governs the tradeoff between the two ensembles. We apply k-means on the combined weighted co-association matrix to obtain the final clusters.
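The three building blocks just described (the weighting factor, the co-association update of Equations (1)/(3), and the combination of Equation (4)) can be sketched as follows; the toy matrices are illustrative only.

```python
import numpy as np

def weight_factor(A, C):
    """w_t: Pearson correlation between the cosine similarity matrix
    (rows of A normalized to unit length) and the incidence matrix
    I_t = C C^T."""
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    sigma = A_norm @ A_norm.T                 # Equation (2)
    incidence = (C @ C.T).astype(float)       # I_t
    return np.corrcoef(sigma.ravel(), incidence.ravel())[0, 1]

def coassociation_update(S, C, w):
    """One step of Equations (1)/(3): S <- S + w * C C^T."""
    return S + w * (C @ C.T)

def combine_matrices(S_s, S_f, alpha):
    """Equation (4): S_combined = alpha * S_s + (1 - alpha) * S_f."""
    return alpha * S_s + (1 - alpha) * S_f

# Toy run: three documents, two clusters. Docs 0 and 1 are similar and
# are clustered together, so this solution earns a high weight.
A = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
C1 = np.array([[1, 0], [1, 0], [0, 1]])       # groups the similar pair
C2 = np.array([[1, 0], [0, 1], [0, 1]])       # a worse grouping
S_f = coassociation_update(np.zeros((3, 3)), C1, weight_factor(A, C1))
S_s = coassociation_update(np.zeros((3, 3)), C2, weight_factor(A, C2))
S_combined = combine_matrices(S_s, S_f, alpha=0.8)
print(S_combined.shape)  # (3, 3)
```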
5.6 Experiments We used ReutersZ, Multiple-6000 and Binary-Multiple-5 00 dataset in our experiments in this chapter. The experiments below show the effect of using all senses or all nouns on clustering versus using our proposed ensemble algorithm. 5.6.1 Document Clustering using Senses After fixing the sense of each noun using the word disambiguation approach described in chapter 3, our next step was to compare the quality of clusters obtained using the senses 83 as features instead of their original nouns. For ReutersZ dataset, we started with 5922 nouns as features. After applying WordNet, the data is transformed into a document- sense matrix with 6559 senses. Likewise in the Multiple-6000 dataset, we initially started with 13801 nouns that correspond to 14589 senses produced by our word sense disambiguation method. Clearly, the number of dimensions has increased after word sense disambiguation since the algorithm may resolve the same noun into multiple senses depending on the context of the documents. Next, we apply k-means clustering on both the document-noun frequency matrix Af and document-sense binary matrix As. Table 19 shows the results of the clustering, which are based on the average entropy for 50 iterations of applying k- means with different initial centroids. These results however do not seem to support the need to replace the nouns with their corresponding senses. In fact, the use of sense information seems to degrade the entropy significantly for ReutersZ data set. Nouns- frequency Semantic Dataset matrix binary matrix Multiple-6000 0.86 0.88 ReutersZ 0.97 1.19 Table 19: Comparison of entropy for K-means One possible explanation for the poor results is related to increasing the number of dimensions when we replace the nouns by their senses. Another explanation is due to replacing some of the nouns with incorrect senses. Nevertheless, we still expect some documents to be correctly placed in the right cluster because of the word sense disambiguation. 
What is needed is a clustering algorithm that:
1. Better utilizes the sense information in combination with the term statistics information.
2. Deals with the increased number of dimensions when replacing the nouns by their senses.
3. Is tolerant to noise when incorrect senses are used.
Because of its flexibility and robustness, we conjecture that ensemble clustering is an appropriate approach for combining term statistics with semantic knowledge for document clustering.

5.6.2 Ensemble Clustering

Ensemble clustering is a technique that was developed to improve the performance of clustering algorithms by aggregating the results from multiple runs to obtain the final clusters. In this work, we combine the clustering results from both the Sense Ensemble (denoted as Ens_sense) and the Noun Ensemble (denoted as Ens_noun) into one final clustering (denoted as Ens_both). Our motivation for using ensemble clustering is its flexibility to accommodate any input data matrix (either the document-noun frequency matrix, the document-sense binary matrix, or both), its ability to deal with high dimensionality via feature subsampling, and its resilience to noise and other variability in the data. Figure 13 shows the results of applying the three ensemble clustering methods to the Reuters2 and Multiple6000 datasets. The number of clustering solutions created in each ensemble varies from 10 to 30.
[Figure 13: Comparison of entropy values of Ens_noun, Ens_sense, and Ens_both clusters for Multiple6000 and Reuters2 datasets]

To generate the final clustering Ens_both, we combine half of the clustering solutions from Ens_sense with another half from Ens_noun. As mentioned in Section 5.5, each clustering solution in the ensemble is weighted according to the quality of its clusters (which is measured in terms of the correlation between the cosine similarity of the documents and the resulting incidence matrix of the clusters). We observed that the weighting factors w_t associated with the solutions in Ens_noun were generally greater than the weighting factors for Ens_sense. For example, for the Reuters2 dataset the average of the weights for Ens_noun was approximately 0.5, compared to 0.3 for Ens_sense. The weighted clusters in each ensemble were aggregated in a co-association matrix (with α = 0.8) that reflects the consensus of the individual runs on allocating the documents across the clusters. For the Multiple6000 dataset, the compound ensemble Ens_both achieved the lowest entropy score. For example, after 20 runs, the entropy score for the compound ensemble Ens_both is 0.559, which is considerably lower than the scores for the Noun Ensemble (0.636) and the Sense Ensemble (0.787). This result suggests that our compound ensemble method is capable of enhancing the clustering results by taking advantage of the variability in the clustering solutions obtained from the term statistics and semantic knowledge. Even though the semantic binary model has higher entropy than the noun frequency model, it still provides useful solutions that can be exploited by our compound ensemble.
[Figure 14: Comparison of purity values of Ens_noun, Ens_sense, and Ens_both clusters for Multiple-6000 and Reuters2 datasets]

Furthermore, all the ensemble clustering results are significantly better than the results for individual runs, even though each run uses only a sample of the original features. For the Reuters2 dataset, no significant improvement was observed for the compound ensemble Ens_both over Ens_noun. This result suggests that the nature of the data set also plays a significant role in determining the effectiveness of incorporating semantic knowledge from WordNet. Figure 14 shows the purity values for the different ensemble clustering methods. Once again, the compound ensemble Ens_both achieved the highest purity score, 0.802, for the Multiple-6000 dataset, whereas the noun frequency ensemble Ens_noun has the highest purity for the Reuters2 dataset.

[Figure 15: Comparison between k-means using cosine and the combined ensemble]

Finally, it is worth noting that our proposed method using the compound ensemble clustering improved the cluster quality compared to LSI for both datasets. The lowest entropy obtained for Reuters2 with LSI was 1.01 using 100 components, compared to 0.947 after 10 iterations using our compound ensemble method. For the Multiple-6000 dataset, LSI achieved an entropy of 1.13 with 100 components, compared to 0.559 when applying the compound ensemble at 20 iterations. Table 20 shows a comparison between the purity of the clusters obtained using spherical k-means, represented by the Baseline, and the corresponding purity produced by the combined ensemble Ens_both on the Binary-Multiple-500 dataset. Empirical results suggest an improvement for all 9 datasets compared to the Baseline.
For example, with multi101 a purity of .83 was obtained, compared to .742 for the Baseline. Likewise, Ens_both exhibited an improvement for multi102 and multi103: relative enhancements of 9% and 14% in the quality of the partitions were observed for the two datasets, respectively.

           |        L=10          |        L=20          |        L=30
Dataset    | sense   noun   both  | sense   noun   both  | sense   noun   both
Binary1    | 0.922   0.924  0.930 | 0.916   0.928  0.928 | 0.916   0.920  0.924
Binary2    | 0.888   0.912  0.906 | 0.898   0.908  0.914 | 0.894   0.908  0.916
Binary3    | 0.922   0.914  0.928 | 0.936   0.922  0.936 | 0.924   0.926  0.928
multi51    | 0.890   0.934  0.934 | 0.912   0.942  0.936 | 0.920   0.938  0.940
multi52    | 0.927   0.939  0.949 | 0.931   0.919  0.941 | 0.925   0.925  0.933
multi53    | 0.959   0.973  0.975 | 0.953   0.971  0.965 | 0.947   0.959  0.963
multi101   | 0.640   0.754  0.742 | 0.718   0.798  0.786 | 0.754   0.792  0.830
multi102   | 0.702   0.746  0.804 | 0.754   0.754  0.774 | 0.764   0.762  0.790
multi103   | 0.770   0.760  0.788 | 0.774   0.828  0.838 | 0.768   0.818  0.836

Table 20: Ensemble results (purity) for Ens_sense, Ens_noun and Ens_both on Binary-Multiple-500, with L = 10, 20 and 30 clustering solutions

Table 20 reports the cluster purity for the three ensembles on all 9 datasets. In Figure 16, we plot the purity results with 30 clustering solutions.

[Figure 16: Ensemble results for nouns, senses and both combined for Binary-Multiple-500]

5.6.3 Value of Sense Information

Our experimental results suggest that incorporating sense information is helpful for clustering the Multiple-6000 dataset but not for the Reuters2 dataset. To validate this result, we compare the cosine similarities obtained from the document-noun frequency matrix A_f and the document-sense binary matrix A_s against the ground truth clusters.
If sense is indeed helpful, we expect the similarity between documents that belong to the same cluster to increase when using the sense information as opposed to using the frequency information. To measure the value of adding sense information, we compare the cosine similarity matrix Σ_cosine generated from the input data against the ground truth clusters. Our conjecture is that sense helps when the input data created from the document-sense matrix produces cosine similarities that are more "consistent" with the desired clusters. More specifically, for each document d_i, we sort the similarity values in the i-th row of Σ_cosine in descending order. If the clustering algorithm is perfect, then all the documents belonging to the same cluster as d_i (according to the ground truth) should be well separated from those that belong to a different cluster. In other words, documents that are supposed to be in the same cluster as d_i should be ranked highest. By comparing the relative ordering of the documents obtained using Σ_cosine, we can get a sense of how consistent the input data is with the desired clusters. We apply the Wilcoxon-Mann-Whitney (WMW) statistic [104] as our measure of consistency between the ordering of cosine similarity values and the ground truth clusters. We illustrate the statistic with an example below.

Example 1: Let D = {d1, d2, ..., d6} be a collection of six documents. Suppose the cosine similarities (based on the document-noun frequency matrix) between d1 and the remaining five documents are (0.5, 0.3, 0.7, 0.02, 0.11). Furthermore, let d2 and d3 be in the same cluster as d1 (we denote them as "+" examples), whereas d4-d6 are in a different cluster (we denote them as "-" examples). By sorting the documents according to their cosine similarity to d1, we obtain the following order: (d4, d2, d3, d6, d5), or equivalently, (-, +, +, -, -).
The ranks of the positive examples are given by the vector x = [2, 3], while the ranks of the negative examples are encoded in the vector y = [1, 4, 5]. The WMW statistic is defined as follows:

    W = ( Σ_{i=1..n+} Σ_{j=1..n-} I(x_i, y_j) ) / (n+ · n-)    (7)

where I(x_i, y_j) = 1 if x_i < y_j and 0 otherwise, and n+ and n- are the numbers of positive and negative examples. In the previous example, W = 4/6. If all the positive examples are ranked higher than the negative examples, then W = 1. Averaging the WMW statistics over all the documents in the input data allows us to measure the consistency between the cosine similarity values obtained from the input data and the desired ground truth clusters. Table 21 shows the average WMW statistics for both the Reuters2 and Multiple-6000 datasets when applied to the document-noun frequency and document-sense binary matrices.

Dataset         Document-Noun Matrix   Document-Sense Matrix
Multiple-6000   0.857                  0.860
Reuters2        0.931                  0.918

Table 21: Average WMW statistics

For the Reuters2 dataset, the WMW statistic becomes worse after transforming the nouns into senses. This implies that the cosine similarities obtained from the senses do not help to improve clustering. On the other hand, the WMW statistic improves slightly for the Multiple-6000 dataset.

5.7 Summary

This chapter presents our contribution in developing a new methodology for combining term statistics with semantic knowledge from WordNet for document clustering. Our analysis shows that a straightforward replacement of the terms with their senses may not necessarily improve the clustering results, which is consistent with some of the previous results reported in [92] and [104]. The clustering algorithm needs to be flexible and robust enough to deal with the higher dimensionality and noise due to improper selection of senses.
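As a concrete check of the WMW measure of Section 5.6.3, the following sketch reproduces Example 1; the document names and similarity values come from that example, and the ranking step mirrors the sorting of a row of Σ_cosine.

```python
def wmw_for_document(sims, positives):
    """WMW statistic (Eq. 7) for one document: sims maps each other
    document to its cosine similarity with this document; positives is
    the set of documents sharing its ground-truth cluster."""
    order = sorted(sims, key=sims.get, reverse=True)      # descending similarity
    ranks = {d: r for r, d in enumerate(order, start=1)}  # rank 1 = most similar
    pos = [ranks[d] for d in positives]
    neg = [ranks[d] for d in sims if d not in positives]
    # Count (positive, negative) pairs where the positive is ranked higher.
    wins = sum(1 for x in pos for y in neg if x < y)
    return wins / (len(pos) * len(neg))

# Example 1: similarities of d2..d6 to d1; d2 and d3 are the "+" examples.
sims = {"d2": 0.5, "d3": 0.3, "d4": 0.7, "d5": 0.02, "d6": 0.11}
w = wmw_for_document(sims, {"d2", "d3"})  # 4/6, matching the text
```

Averaging this value over all documents gives the per-matrix scores reported in Table 21.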
To overcome these challenges, we propose an ensemble clustering method that systematically combines clustering solutions from the Noun Ensemble with those from the Sense Ensemble based on the consistency of their clustering solutions. Our experimental results show that the ensemble method is effective on some but not all datasets.

Chapter 6

6 Clustering Documents using Wikipedia: Concept Mapping Approach

In this chapter we investigate the benefit of incorporating Wikipedia in document clustering. We first analyze the effect of Wikipedia preprocessing and concept mapping on document clustering. We then present a method that exploits Wikipedia's categories and uses them as features. Finally, we present the effect of combining multiple sources of knowledge on document clustering.

6.1 Exploiting Wikipedia

Wikipedia has recently been used to generate a concept-based representation for a collection of documents [28][47][54]. In the context of Wikipedia, the concepts are the articles stored in Wikipedia. Like WordNet, Wikipedia also has a hierarchical relationship between the concepts/articles, though the relations are different. We will show how this different kind of ontology can be used to solve the same problem previously solved with WordNet. Some of the proposed methods that incorporate Wikipedia in document clustering utilize the contents of the articles to find the concepts, whereas others use the hyperlink network between the articles and/or the category network to compute relatedness between the concepts [47][110][50]. Content-based methods rely on finding the textual overlap between the documents and the articles themselves, whereas hyperlink-based approaches utilize the overlap with the anchor texts of the hyperlinks found within Wikipedia articles as well as the hyperlink network between the articles themselves. In the next section, we show the results of applying the concept mapping approach described in Section 3.4 to ten datasets.
We also compare our method with state-of-the-art research that exploits and leverages Wikipedia knowledge for document clustering.

6.2 Evaluation of the Wikipedia-based Concept Mapping Approach

Our method is an extension of the approach of Gabrilovich et al. [28]. In this method, each concept in Wikipedia is given a relatedness score to a given document. The relatedness score is computed by taking the dot product between the document and the concept vectors. We modify this score by the relative frequency of mapping that concept to the document, as explained in Section 3.4. We evaluate our method by comparing it with the noun-based baseline clustering (discussed in Section 4.3), as shown in Table 22. We noticed that for binary datasets, using Wikipedia concepts as features resulted in clusters with purity as good as the Baseline for 85% of the binary datasets. However, for datasets with multiple classes the problem becomes more challenging; Wikipedia concepts were not sufficient to improve the clusters because of irrelevant concepts included during the concept mapping process.

We investigated this problem in more depth by examining the multi101 dataset, which has 10 classes. We measured a scatter ratio that reflects how close the classes in the space are before and after mapping the concepts. More specifically, we computed the ratio of the scatter between classes to the scatter within classes. The between-class scatter is simply the average distance between the mean of each class and the centre of the data, whereas the within-class scatter is the average distance of the documents within a class from their class centre. If the scatter ratio is lower after mapping the concepts, this indicates that the presence of some of the mapped concepts in the feature set added to the ambiguity between the classes, thus distorting the clustering results. For the multi101 dataset, the scatter ratio was 4.07 using only the nouns as features before mapping and 1.9 using Wikipedia.
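The scatter ratio just described can be sketched as follows; this is a minimal interpretation that uses Euclidean distances between class means and the global centre for the between-class term, and between documents and their class centre for the within-class term.

```python
import numpy as np

def scatter_ratio(X, labels):
    """Ratio of between-class scatter (average distance of each class mean
    from the global centre) to within-class scatter (average distance of
    documents from their own class centre)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    centre = X.mean(axis=0)
    classes = np.unique(labels)
    between = np.mean([np.linalg.norm(X[labels == c].mean(axis=0) - centre)
                       for c in classes])
    within = np.mean([np.linalg.norm(X[labels == c] - X[labels == c].mean(axis=0),
                                     axis=1).mean()
                      for c in classes])
    return between / within

# Well-separated classes yield a high ratio; heavier overlap lowers it.
r = scatter_ratio([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])  # 10.0
```

A drop in this ratio after concept mapping, as observed for multi101, signals that the added concepts blurred the class boundaries.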
Evidently, the classes in the dataset became less separated using the concepts; therefore, for this dataset the concepts were not as effective.

Datasets       Baseline  Wikipedia  Tdepth4  T3Up
bin7           0.965     0.965      0.950    0.965
bin9           0.952     0.950      0.950    0.945
bin10          0.940     0.902      0.860    0.883
bin11          0.980     0.970      0.942    0.925
bin12          0.912     0.930      0.900    0.902
binary1        0.920     0.871      0.836    0.846
multi51        0.924     0.844      0.800    0.844
multi101       0.740     0.610      0.553    0.544
multiple6000   0.693     0.330      0.279    0.290
Reuters2       0.697     0.611      0.480    0.490

Table 22: Comparison between Wikipedia concepts, Tdepth4 and T3Up clusters in terms of purity

Although our method did not produce better clusters than those obtained by the rigorous noun-based baseline, it outperformed current methods that incorporate background knowledge for document clustering. Table 23 shows the purity of the clusters obtained when applying the different methods to one dataset. The results are the average of 10 runs of spherical k-means. The last row in Table 23 shows the performance of our method. The other rows show the performance of the following methods: the WordNet-based algorithm proposed by Hotho et al. [35], Hu et al.'s algorithm [47], a system that applies the method of Gabrilovich et al. [28], and Huang et al.'s approach [50] (refer to the related work chapter for a description of these methods). Note that the results in the table are copied from Huang et al. [50]. They experimented with the min20-max200 sample from the Reuters-21578 dataset. To make an equivalent comparison with all approaches in the table, we applied our method to the same dataset.

Reference            Purity   Clustering approach
Hotho et al.         .605     Bisecting k-means
Gabrilovich et al.   .607     Spherical k-means
Hu et al.            .655     Spherical k-means
Huang et al.         .670     Spherical k-means
Our method           .700     Spherical k-means

Table 23: Comparison with other methods using the min20-max200 Reuters dataset

Our proposed method outperformed all other methods and exhibited an improvement of 4.47% over the best performance reported by Huang et al.
The best purity obtained by our algorithm in the experiment was 0.724.

To further investigate the capabilities of all approaches in terms of the quality of the extracted features, Table 24 presents a comparison between subsets of features extracted by these methods. The comparison uses document #15264, which belongs to the "copper" class. The first four rows in the table are taken from [50]. The results include some selected concepts from the feature vector without any particular ordering; the last row in the table shows the top 12 concepts extracted by our method. All approaches did extract features related to copper mining. Hotho's WordNet-based approach was not able to capture named entities existing in the original document, as WordNet does not contain this information. As Hu et al. [47] exploit both the category structure and the hyperlink network to extract the concepts, they have broader topics in their list of features; thus Teck Cominco is expanded with Mining Companies in Canada and Con Mine [47]. Huang et al. [50] extracted their features using the anchor texts in Wikipedia. Their method did pick up features related to copper mining.

- Copper, venture, highland, valley, british, Columbia, affiliate, mining, negotiation, complete, administration, reply, silver, ounces, molybdenum
- Teck, John Towson, Cominco Arena, Allegheny Lacrosse Officials Association, Scottish Highlands, Productivity, Tumbler Ridge, British Columbia, Highland High School, Economy of Manchukuo, silver, Gold (color), Copper (color)
- Teck Cominco, British Columbia, Mining, Molybdenum, Joint Venture, Copper Mining, Joint Venture, Copper, Silver, Gold Ore, Management, Partnership, Product (business), Ounces, Negotiation, Molybdenum, Teck Cominco, Vice President, Consortium, Short ton
- Copper, Highland Park New Jersey, Logan Lake British Columbia, Copper Mining in Arizona, Venture Capital, Scottish Highland Dance, North American Waterfowl Management Plan, Estella mine, Molybdenum, Codelco, El Salvador Mine, List of mines in British Columbia

Table 24: Comparing features generated by the different approaches

Comparing with Gabrilovich's approach, note that our list of concepts is better than theirs. This improvement is the result of modifying the relatedness score and ranking the concepts. For example, the concept Copper (color) was extracted by their method even though it is not related to copper mining; it was included as a result of the textual overlap between the document and the concept. In summary:

- Our method performed better when compared to existing approaches that incorporate external knowledge to boost clustering.
- Our concept mapping approach was as good as the noun-based baseline when applied to binary datasets but considerably worse on multi-class datasets.

6.3 Exploiting Categories

In this section we establish an alternative set of features based on the categories from Wikipedia. The rationale is to introduce new features to the document that have high-level or broader meanings as well as features with specific meanings. General categories are placed in the top levels of the Wikipedia category structure, with the categories becoming more specific as you traverse down the category structure. For example, consider two documents that belong to the class "sports": d1 is about swimming and d2 is concerned with tennis. Ideally, both documents should be clustered together since they share the same topic "sports". Assume that the concept Swimming (sports) in Wikipedia was mapped to d1 and the concept Tennis was mapped to d2. Using these concepts does not help in grouping d1 and d2 because they have no concepts in common.
However, using the category structure, we can increase the similarity between the documents by adding broader topic features from the top levels of the ontology. Figure 17 shows part of the categories of the two concepts Swimming (sports) and Tennis. The direct categories of Swimming (sports) are (swimming and Olympic sports); on the other hand, (tennis and racquet sports) are the direct categories of the concept Tennis. Note that expanding the direct categories to a higher level connects both concepts Swimming (sports) and Tennis with the category (sports by type). Thus, including this new categorical feature would increase the semantic similarity between d1 and d2 and increase the chances of putting them in the same cluster. This example shows how categories can improve clustering.

[Figure 17: Part of the category structure in Wikipedia; rectangles denote concepts while circles denote categories]

Therefore, we utilize the category structure of Wikipedia to extract the categories and use them as features to represent the documents. Hu et al. [49] did exploit the categories and used them as features in their work. However, they used the direct categories associated with a concept and did not utilize the category structure of Wikipedia. Furthermore, their conclusion was not definitive regarding the effectiveness of using categories "only" as features. They concluded that a weighted linear combination of the original words of the documents and the categories helps clustering. However, we argue that this analysis is incomplete and needs more investigation. That is, using the direct categories of a concept (categories at the first level of the category structure) does not necessarily help clustering, as we have shown in the sports example above. Furthermore, the articles are generated and edited by Wikipedia users.
Thus, the correctness of linking an article to a set of categories is subject to the users' opinions and knowledge. Therefore, it is essential to use higher levels in the category structure, as two concepts are more likely to intersect via their corresponding categories at higher levels of the Wikipedia category structure. In our research, we investigate these issues in addition to exploring the benefit of using categories "only" as features. Unlike Hu et al. [49], we focus on using higher-level categories rather than direct categories, for the reasons mentioned above. We experimented with two sets of categories: one set includes categories at a high level in the category structure with broader topics, and the other contains the direct categories and their ancestors up to three levels higher in the network. The benefit of the latter option is that it maintains categories with specific meanings directly related to the contents of the article as well as categories with broader topics that might be related to the class of the original document. The discussion below explains how we obtain these sets and their characteristics.

1. Categories at depth 4 from the root (Tdepth4). This set is obtained by traversing the category network in a depth-first search (DFS) fashion. First, the direct categories of each concept are extracted: T = { t_j | t_j is a direct category of c_k }. Second, for all categories in the set T we extract their corresponding depth-4 categories in the category structure. The reason we choose this particular level is our intention to reduce the dimensionality that we observed using the concepts. Wikipedia maintains approximately 12,000 categories 4 levels down from the category root.

2. Direct categories and their ancestors up to 3 levels above the article's category in the category structure (T3Up). We implemented a depth-limited depth-first search algorithm to obtain this set of categories.
The depth limit is set to 3 and the algorithm starts with the direct categories of a particular concept. The weight of each direct category t_0 (a category at level 0) of a particular concept c_k is equal to the relatedness score of that concept. At level j, the weight of a category t is equal to the sum of the weights of the categories associated with its incoming links at level j-1:

    W_j(t) = Σ_{i=1..h} W_{j-1}(t_i)    (1)

where h is the number of incoming links (children) of t.

Table 22 illustrates the results for the depth-4 categories (Tdepth4) and the 3-levels-up categories (T3Up). When compared to the noun-based baseline, the performance of both sets of categories on several datasets, especially the multi-class ones, was not encouraging. We noticed that some categories belonging to Tdepth4 had a document frequency that, in some cases, was nearly equal to the total number of documents in a dataset. Such categories are non-discriminative and have a negative effect on clustering because their existence incorrectly increases the similarity between semantically unrelated documents. To illustrate this finding, we plotted the cumulative distribution function (CDF) of the document frequency of nouns (the solid curve in the figure) and the document frequency of the Tdepth4 Wikipedia categories, as shown in Figure 18. Recall that Reuters2 has 20 topics in total, in which the maximum number of documents per topic is 200. It appears from the figure that the density of document frequency of the nouns spans the interval [0, 500]. In fact, F(200) = P(X ≤ 200) = 0.986, which indicates that the density is mostly distributed below 200. In other words, a large portion of the nouns have a document frequency less than or equal to 200. This is consistent with the fact that the dataset includes topics with mostly 200 documents each. More specifically, a noun that is specifically related to a certain topic might appear in at most 200 documents (i.e., have a document frequency of 200) or fewer.
The remaining area under the curve has a value of 0.014, which represents the fraction of nouns with document frequency greater than 200. The existence of such features is undesirable because they cause the different topics to intertwine, making partitioning more difficult. Unfortunately, in the Tdepth4 category curve in Figure 18, a large number of categories have high document frequency, since the category distribution spans a wider interval [0, 2500] than the nouns. This is at least partially responsible for the poor performance of the categories compared to the nouns. It was based on this analysis that we decided to use other category sets besides Tdepth4, which seemed to have too many common connections between document categories. We introduced the T3Up categories to overcome the wide coverage problem observed in the Tdepth4 categories. That is, T3Up includes categories with specific topics that are semantically close to the associated concepts as well as categories with broader topics. For 6 out of the 10 datasets, a slight improvement in the clusters was obtained using T3Up compared to Tdepth4, while no improvement was observed compared to the noun-based baseline, as shown in Table 22. We suspect that including all the categories on the path from level 0 up to level 3 in the network increases the chances of including categories irrelevant to the feature set.

[Figure 18: Cumulative distribution function of Wikipedia Tdepth4 categories and nouns for the Reuters2 dataset]

This is especially so because the category structure of Wikipedia is highly interlinked, and links do not necessarily indicate an is-a relationship. To deal with this challenge, more research is required to better extract the categories from Wikipedia to represent the documents.
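The weight propagation used to build T3Up (Eq. 1) can be illustrated with a toy category graph. The graph, the relatedness scores, and the decision to accumulate a category's weight across levels are illustrative assumptions, not the thesis's exact implementation.

```python
def propagate_weights(direct_cats, parents, depth=3):
    """Propagate category weights up a category graph, per Eq. (1):
    a category's weight at level j is the sum of the weights of its
    incoming-link (child) categories at level j-1.
    direct_cats: level-0 category -> weight (the concept's relatedness score).
    parents:     category -> list of parent categories one level up
                 (a toy stand-in for the real Wikipedia category network).
    """
    weights = dict(direct_cats)
    frontier = dict(direct_cats)
    for _ in range(depth):
        nxt = {}
        for cat, w in frontier.items():
            for p in parents.get(cat, []):
                nxt[p] = nxt.get(p, 0.0) + w
        for p, w in nxt.items():
            weights[p] = weights.get(p, 0.0) + w  # accumulate across levels
        frontier = nxt
    return weights

# Toy version of the sports example: both branches meet at "sports by type",
# so that shared ancestor ends up with the combined weight of both concepts.
parents = {"swimming": ["sports by type"],
           "tennis": ["racquet sports"],
           "racquet sports": ["sports by type"]}
w = propagate_weights({"swimming": 0.8, "tennis": 0.6}, parents)
```

Here "sports by type" receives weight from both the swimming and tennis branches, which is exactly the effect that lets T3Up raise the similarity between the two documents.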
The conclusions of this experiment are summarized as follows:

- Despite the extensive study that we made, both sets of categories exhibited worse performance than the concepts.
- Tdepth4 categories gave clusters comparable to the noun baseline in 50% of the datasets.
- There was no significant difference between Tdepth4 and T3Up in terms of performance.

6.4 Combining Multiple Sources of Knowledge

In Chapter 5 we showed that combining the ensemble produced by aggregating clustering solutions based on the senses from WordNet with the ensemble produced based on the nouns improved document clustering. In this section, we explore the benefit of adding another ensemble, based on Wikipedia concepts, to the previously created ensembles. We briefly describe the three components of the new ensemble, which we call Ens_all:

- Noun-based ensemble Ens_noun: this ensemble is generated by combining different clustering solutions produced by applying spherical k-means using nouns as features.
- WordNet-based ensemble Ens_sense: this ensemble is generated by combining different clustering solutions produced by applying spherical k-means using the senses extracted from WordNet. The senses were obtained by leveraging the WordNet hierarchy, where the Wu-Palmer similarity measure was used to identify the most appropriate senses to disambiguate the nouns.
- Wikipedia-based ensemble Ens_concept: this ensemble is generated by combining different clustering solutions produced by applying spherical k-means using the concepts from Wikipedia, as explained in Section 3.4; these concepts are extracted based on the textual overlap between the original documents and the concepts themselves.

To build the combined ensemble, three co-association matrices have to be generated: the noun co-association matrix, the sense co-association matrix and the concept co-association matrix. In the previous chapter, Sections 5.3 and 5.4 explain how we generated the noun cluster membership matrix and the sense cluster membership matrix, respectively.
Then, in Section 5.5, an explanation of forming the co-association matrices for both the nouns and the senses is provided. The co-association matrix using Wikipedia concepts is generated in a similar fashion. At this point, the three co-association matrices are linearly combined according to the following equation:

    Ens_all = (1 - λ) × Ens_noun + (2/3)λ × Ens_sense + (1/3)λ × Ens_concept    (9)

The three ensembles are linearly combined based on the weighting factor λ. We give minimal weight to the Wikipedia ensemble since it has the worst clusters in the hybrid ensemble. In the experiment, the weighting parameter λ was set to 0.5, since we noticed from our experiments that this value produced the best results. The clustering results reported for each ensemble are based on 30 iterations. In each iteration, spherical k-means is applied to the corresponding co-association matrix to obtain the clusters.

ENTROPY
Dataset        Baseline  Ens_sense  Ens_concept  Ens_noun  Ens_both  Ens_all
bin11          0.090     0.077      0.204        0.107     0.103     0.079
bin12          0.273     0.302      0.313        0.254     0.263     0.222
binary1        0.270     0.289      0.394        0.272     0.270     0.262
multi51        0.345     0.290      0.556        0.249     0.228     0.217
Multiple6000   0.939     0.720      1.610        0.594     0.579     0.701
Reuters2       0.970     1.130      1.187        0.929     0.948     0.986

Table 25: Performance of the ensembles in terms of entropy

The clusters are evaluated using entropy, purity and NMI, as shown in Table 25, Table 26 and Table 27, respectively.

PURITY
Dataset        Baseline  Ens_sense  Ens_concept  Ens_noun  Ens_both  Ens_all
bin11          0.980     0.985      0.948        0.978     0.979     0.984
bin12          0.912     0.903      0.889        0.919     0.919     0.936
binary1        0.920     0.915      0.863        0.921     0.923     0.925
multi51        0.924     0.935      0.833        0.947     0.952     0.953
Multiple6000   0.693     0.710      0.493        0.749     0.768     0.721
Reuters2       0.697     0.621      0.613        0.706     0.689     0.668

Table 26: Performance of the ensembles in terms of purity

We mainly focus on comparing the performance of Ens_all with Ens_both because our goal is to evaluate the benefit of adding the Wikipedia knowledge Ens_concept to Ens_both.
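The linear combination of Eq. (9) is straightforward to express over the three co-association matrices; this is a sketch, with placeholder matrix contents.

```python
import numpy as np

def combine_ensembles(ens_noun, ens_sense, ens_concept, lam=0.5):
    """Eq. (9): combine the three co-association matrices; Ens_concept
    receives the smallest share of lam because its individual clustering
    solutions were the weakest."""
    return ((1 - lam) * ens_noun
            + (2 / 3) * lam * ens_sense
            + (1 / 3) * lam * ens_concept)

# With lam = 0.5 the effective weights are 1/2, 1/3 and 1/6, which sum to 1,
# so combining three identical consensus matrices leaves them unchanged.
A = np.ones((3, 3))
out = combine_ensembles(A, A, A)
```

Because the weights sum to one for any λ in [0, 1], Ens_all stays on the same consensus scale as its three components.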
Empirical results showed that combining the Wikipedia ensemble Ens_concept with Ens_both either does not help improve clustering or adds only minimal enhancement to the clusters of Ens_both.

NMI
Dataset        Baseline  Ens_sense  Ens_concept  Ens_noun  Ens_both  Ens_all
bin11          0.859     0.889      0.706        0.845     0.852     0.886
bin12          0.605     0.564      0.548        0.634     0.621     0.680
binary1        0.600     0.584      0.432        0.608     0.611     0.622
multi51        0.785     0.820      0.655        0.845     0.858     0.865
multiple6000   0.509     0.520      0.291        0.610     0.612     0.563
Reuters2       0.336     0.333      0.314        0.335     0.323     0.320

Table 27: Performance of the ensembles in terms of NMI

This minimal improvement is particularly obvious using the entropy measure. Again, we attribute the poor results to the poor performance of Ens_concept. More specifically, the incorrect clustering solutions included in the co-association matrix of Ens_concept distorted the co-association matrix of Ens_both, which consequently degraded the final clusters. Therefore, the individual clustering solutions of Ens_concept need to be improved to participate effectively in the combined ensemble.

[Figure 19: Comparison of the different ensembles with the Baseline in terms of purity]

[Figure 20: Comparison of the different ensembles with the Baseline in terms of entropy]

The conclusion from this experiment is that concept mapping is the key to obtaining good clusters. Hence, it remains a challenging problem that requires more research.
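For reference, the purity and entropy scores used in Tables 25-27 can be computed with the standard cluster-quality formulations sketched below; the log base for entropy is an assumption here, and the thesis's exact definitions appear in its evaluation section.

```python
import math
from collections import Counter, defaultdict

def _group(cluster_ids, class_labels):
    """Group ground-truth labels by the cluster each document fell into."""
    by_cluster = defaultdict(list)
    for c, y in zip(cluster_ids, class_labels):
        by_cluster[c].append(y)
    return by_cluster

def purity(cluster_ids, class_labels):
    """Each cluster votes for its majority class; purity is the fraction
    of documents agreeing with their cluster's majority class."""
    groups = _group(cluster_ids, class_labels)
    n = len(class_labels)
    return sum(max(Counter(v).values()) for v in groups.values()) / n

def entropy(cluster_ids, class_labels):
    """Size-weighted average of each cluster's class-distribution entropy
    (log base 2 assumed); 0 means perfectly pure clusters."""
    groups = _group(cluster_ids, class_labels)
    n = len(class_labels)
    total = 0.0
    for v in groups.values():
        counts = Counter(v)
        h = -sum((m / len(v)) * math.log2(m / len(v)) for m in counts.values())
        total += (len(v) / n) * h
    return total
```

A perfect clustering gives purity 1.0 and entropy 0.0; a cluster split evenly between two classes contributes one full bit of entropy.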
[Figure 21: Comparison of the different ensembles with the Baseline in terms of NMI]

6.5 Summary

In this chapter we have presented an experimental study on using Wikipedia to create a concept-based representation of a collection of textual documents. Our approach is an extension of Gabrilovich's method. Our method achieved an improvement compared to other methods when evaluated on a subset of the Reuters dataset. Additionally, we used the category graph from Wikipedia to extract categories to represent the documents. Despite the extensive experiments, the results were poor, which might be related to the fact that the Wikipedia category structure is not a taxonomy with semantic relations but rather a thematically organized thesaurus. Furthermore, we experimented with combining an ensemble of clusters using Wikipedia concepts as features with the combined ensemble of clusters using the nouns and the senses from WordNet. However, because the Wikipedia ensemble is not as good as the joint ensemble of nouns and senses, the results showed that integrating Wikipedia concepts did not help clustering. Hence, more research is needed to improve the concepts prior to combining Wikipedia with other sources of knowledge.

Chapter 7

7 Conclusion and Future Directions

In this thesis we have investigated the problem of incorporating background knowledge into document clustering. While the idea behind this research sounds promising, its success is highly dependent on several factors:

- Effectiveness of the concept mapping approach. Specifically, a concept mapping approach must be able to address the following issues in order to provide a reasonable set of concepts for clustering:
  - Disambiguating the words in a document by mapping them to their most appropriate concepts based on the context of the document.
  - Generalizing the words in the documents by using higher-level concepts of the given ontology.
This step has a direct effect on increasing the similarity between documents that are semantically related yet have no features in common.

- Effectiveness of the feature selection approach. Feature selection is needed to reduce dimensionality and eliminate erroneous concepts introduced to the feature set after concept mapping.

7.1 Contributions of this Thesis

This thesis investigates the various aspects of ontology-driven clustering research, including the strategy for concept mapping, feature selection, the choice of baseline algorithm, and the development of a robust algorithm for combining document frequency with background knowledge from an ontology. The overall contributions of the thesis are summarized below.

7.1.1 Development of a Simplified Noun-based Approach for Baseline Clustering

A baseline clustering should reflect the best clusters that can be obtained from the data with a relatively cheap clustering algorithm. Unfortunately, the inconsistency in the baselines used for evaluation has led to contradictory results about the use of an ontology in document clustering. We propose a noun-based approach for baseline clustering that is not only simple to construct but also rigorous, outperforming more complicated clustering algorithms. Our analysis showed that the nouns are adequate to produce comparable or significantly better clusters than typical baseline results from published research.

7.1.2 Clustering using Core Features with High Information Gain

We proposed a methodology for clustering using the core semantic features. Our analysis shows that using all the semantic concepts may not necessarily improve the clustering results, which is consistent with some of the previous results in the literature (see Chapter 2 for details). The clustering algorithm needs to be flexible and robust enough to deal with the higher dimensionality and noise due to improper selection of concepts. To overcome these challenges,
we proposed an information-gain-based clustering algorithm in which only a subset of the semantic features is selected to be used in clustering. Our experimental results showed that the core semantic features were sufficient not only to maintain or possibly improve clustering but also to substantially reduce the dimensionality of the feature set.

7.1.3 Hybrid Ensemble Approach for Document Clustering

We propose a hybrid ensemble approach that combines an ensemble of clusters constructed using nouns as features with an ensemble of clusters built using concepts extracted from an ontology. Our empirical results show that the compound ensemble produced better clusters than the noun-based baseline clusters.

7.1.4 Exploiting Wikipedia for Document Clustering and Combining Multiple Sources of Knowledge

We utilized Wikipedia to extract a concept-based representation for a given collection of documents. The method we used to map the concepts to the documents is based on Gabrilovich's approach; however, we enforce a weighting scheme on the extracted concepts to remove noise. We also experimented with using the categories from Wikipedia to represent the documents; nonetheless, our method needs improvement, as the clustering results were not promising.

Furthermore, we investigated the problem of constructing a hybrid ensemble framework using concepts from Wikipedia, senses from WordNet, and the original nouns. Unfortunately, the empirical results have shown that Wikipedia did not add value to the final clusters. We suspect that this is related to the quality of the concepts extracted from Wikipedia. Having a better set of concepts from Wikipedia is expected to lead to better clustering after the combination. This will be investigated in future work.

7.2 Future Directions

The work presented in this thesis investigates possible benefits from incorporating background knowledge into document clustering.
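A recurring building block in the contributions above is combining several base partitions into one consensus clustering. As a generic illustration of the evidence-accumulation idea (a sketch in the spirit of Fred and Jain [24], not the exact ensemble algorithm of this thesis; the function name and toy data are illustrative), the following averages a co-association matrix over the base partitions and cuts the induced distances with average-linkage clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def coassociation_consensus(partitions, n_clusters):
    """Evidence-accumulation consensus over several clusterings of the
    same n documents: co[i, j] is the fraction of base partitions that
    place documents i and j in the same cluster; 1 - co is then treated
    as a distance and cut with average-linkage hierarchical clustering."""
    n = len(partitions[0])
    co = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        co += (labels[:, None] == labels[None, :]).astype(float)
    co /= len(partitions)
    dist = 1.0 - co
    np.fill_diagonal(dist, 0.0)  # each document is at distance 0 from itself
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy example: three base partitions of four documents that mostly agree.
base = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
labels = coassociation_consensus(base, n_clusters=2)
```

In practice the base partitions could come from clusterings over different feature sets, such as nouns, WordNet senses, and Wikipedia concepts.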
However, this problem remains challenging and has raised several questions that open different paths for future research. Next, we review some of the pending issues that we would like to investigate as part of our future work:

- Exploiting Wikipedia. Based on our previous research, the categories extracted from Wikipedia were useful for clustering binary datasets but only occasionally performed well on multi-class datasets. In the short term we plan to improve the mapping from the concepts to their related categories.

- Combining multiple resources of background knowledge. Previously, we combined multiple sources of knowledge via an ensemble approach in an effort to improve document clustering. However, this method showed improvement only for some of the datasets. We plan to improve upon this approach by combining different types of information from the various sources. For example, WordNet has a well-defined hierarchy that is reliable for computing similarities between the words in a dataset. On the other hand, Wikipedia is well known for its wide scope of coverage. However, we noticed that the concepts extracted from Wikipedia need to be filtered to remove noise. We plan to compute a word-word similarity matrix based on WordNet and modify the TFIDF weights of the words within concepts using this new matrix. This will give words that are semantically related within a concept a higher score than loosely related ones; eventually, the latter can be dropped from the concept. Thereafter, we apply our concept mapping approach to extract the concept-based representation for a given dataset.

- Incorporation of semantic knowledge directly from an external source of knowledge into document clustering. Current methods [104][109][110][112][115] often create a new set of semantic features that is used either solely or in combination with the original nouns in document clustering.
The disadvantage of such methods is that clustering is performed independently of semantic feature extraction, which means that the effect of erroneously generated semantic features cannot be reversed at the clustering step. An alternative method to incorporate background knowledge is to design clustering algorithms that directly integrate ontology information into the clustering process.

- Predicting the effectiveness of an ontology-based clustering approach on several datasets. We would like to determine the usefulness of incorporating semantic knowledge from an ontology in document clustering prior to applying WSD or concept mapping on a dataset. We aim to identify special characteristics of a dataset that require incorporating an ontology to improve the quality of clusters. Examples of such characteristics include the percentage of polysemous and synonymous terms in a dataset or the semantic relatedness of its words.

APPENDIX

Top 46 Topics of the Reuters-21578 Dataset

Class ID   Class            # Documents
1          acq              100
2          alum             48
7          bop              47
9          carcass          29
14         cocoa            59
17         coffee           100
18         copper           62
23         cotton           27
27         cpi              75
29         crude            100
33         dlr              37
36         earn             100
43         gas              33
44         gnp              100
45         gold             100
46         grain            100
50         heat             16
52         hog              16
53         housing          16
56         interest         100
58         ipi              49
59         iron-steel       51
61         jobs             50
63         lead             19
69         livestock        57
70         lumber           13
72         meal-feed        21
74         money-fx         100
75         money-supply     100
77         nat-gas          48
82         oilseed          78
83         orange           18
89         pet-chem         29
100        reserves         51
101        retail           19
104        rubber           40
109        ship             100
111        silver           16
119        strategic-metal  19
120        sugar            100
126        tin              32
127        trade            100
130        veg-oil          93
131        wheat            21
133        wpi              24
135        zinc             20

Table 28: List of top 46 topics in Reuters

List of Stopwords

[Table 29: List of Stopwords]
REFERENCES

[1] Banerjee S., Ramanathan K. and Gupta A. Clustering Short Texts using Wikipedia. In Proceedings of SIGIR, pp 787-788, 2007.

[2] Basu S., Banerjee A., and Mooney R. J. Semi-supervised clustering by seeding. In Proceedings of the International Conference on Machine Learning, pp 27-34, 2002.

[3] Beil F., Ester M., Xu X. Frequent term-based text clustering. In Proceedings of the Eighth ACM SIGKDD, pp 436-442, 2002.

[4] BelMufti G., Bertrand P., and ElMoubarki L. Determining the Number of Groups from Measures of Cluster Validity. In Proceedings of the International Symposium on Applied Stochastic Models and Data Analysis, pp 404-414, 2005.

[5] Berkhin P. A survey of clustering data mining techniques. Springer Berlin Heidelberg, pp 25-71, 2006.

[6] Blei D., Ng A., Jordan M. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, pp 993-1022, 2003.

[7] Bradley P., Bennett K., and Demiriz A. Constrained k-means clustering. Microsoft Research Technical Report MSR-TR-2000-65, 2000.

[8] Bruijn J., Pearce D., Polleres A., and Valverde A. A semantical framework for hybrid knowledge bases. Knowledge and Information Systems (KAIS) Journal, 2010.

[9] Cheng Y. and Church G.M. Biclustering of expression data. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pp 93-103, 2000.

[10] Chernov S., Iofciu T., Nejdl W. and Zhou X. Extracting semantic relationships between Wikipedia categories. In Proceedings of the Workshop on Semantic Wikis, 2006.

[11] Cucerzan S. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL, pp 708-716, 2007.

[12] Dai W., Xue G., Yang Q. and Yu Y. Co-clustering based classification for out-of-domain documents. In Proceedings of KDD, pp 210-219, 2007.

[13] Davidov D., Gabrilovich E. and Markovitch S. Parameterized Generation of Labeled Datasets for Text Categorization Based on a Hierarchical Directory. In Proceedings of the 27th Annual International SIGIR Conference, pp 250-257, 2004.

[14] Deerwester S., Dumais S. T., Furnas G. W., Landauer T. K., and Harshman R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, pp 391-407, 1990.

[15] Demiriz A., Bennett K., and Embrechts M. Semi-supervised clustering using genetic algorithms, 1999. [Online]. Available: citeseer.ist.psu.edu/demiriz99semisupervised.html

[16] Dudoit S. and Fridlyand J. Bagging to Improve the Accuracy of a Clustering Procedure. Bioinformatics, vol. 19, no. 9, pp 1090-1099, 2003.

[17] Dhillon I., Mallela S., Modha D. Information-Theoretic Co-Clustering. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp 89-98, 2003.

[18] El-Yaniv R., Souroujon O. Iterative Double Clustering for Unsupervised and Semi-Supervised Learning. In Proceedings of ECML, pp 121-132, 2001.

[19] Fazzinga B., Flesca S., and Tagarelli A. Evaluation of two heuristic approaches to solve the ontology meta-matching problem. Knowledge and Information Systems (KAIS) Journal, 2010.

[20] Fern X.Z. and Brodley C.E. Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach. In Proceedings of the 20th International Conference on Machine Learning, pp 186-193, 2003.

[21] Fischer B. and Buhmann J.M. Bagging for Path-Based Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 11, pp 1411-1415, 2003.

[22] Francisco V., Gervas P., and Peinado F. Ontological reasoning for improving the treatment of emotions in text. Knowledge and Information Systems (KAIS) Journal, 2010.
[23] Fred A. Finding Consistent Clusters in Data Partitions. In Proceedings of the Second International Workshop on Multiple Classifier Systems, 2001.

[24] Fred A. and Jain A.K. Combining Multiple Clusterings Using Evidence Accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp 835-850, 2005.

[25] Fred A. and Jain A.K. Data Clustering Using Evidence Accumulation. In Proceedings of the 16th International Conference on Pattern Recognition, pp 276-280, 2002.

[26] Friedman N., Mosenzon O., Slonim N. and Tishby N. Multivariate Information Bottleneck. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI), 2001.

[27] Fogarolli A. Word Sense Disambiguation Based on Wikipedia Link Structure. In Proceedings of the International Conference on Semantic Computing, pp 77-82, 2009.

[28] Gabrilovich E. and Markovitch S. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.

[29] Greene D., Tsymbal A., Bolshakova N., and Cunningham P. Ensemble Clustering in Medical Diagnostics. Technical Report TCD-CS-2004-12, Dept. of Computer Science, Trinity College Dublin, Ireland, 2004.

[30] Gao J., Tan P. N., and Cheng H. Semi-supervised clustering with partial background information. In Proceedings of the SIAM International Conference on Data Mining, Bethesda, MD, 2006.

[31] Gomes P., Pereira F.C., Paiva P., Seco N., Carreiro P., Ferreira J., and Bento C. Noun Sense Disambiguation with WordNet for Software Design Retrieval. In Proceedings of the Sixteenth Canadian Conference on Artificial Intelligence, pp 537-543, 2003.

[32] Govaert G. Simultaneous clustering of rows and columns. Control and Cybernetics, pp 437-458, 1995.
[33] Hadjitodorov S.T., Kuncheva L.I., and Todorova L.P. Moderate Diversity for Better Cluster Ensembles. Information Fusion, 2005.

[34] Hofmann T. Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.

[35] Hotho A., Staab S., Stumme G. WordNet improves text document clustering. In Proceedings of the SIGIR 2003 Semantic Web Workshop, pp 541-544, 2003.

[36] http://www.daviddlewis.com/resources/testcollections/reuters21578

[37] http://www.dmoz.org/

[38] http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

[39] http://lucene.apache.org/

[40] http://people.csail.mit.edu/jrennie/20Newsgroups

[41] http://www.nlm.nih.gov/mesh

[42] http://www.webconfs.com/stop-words.php

[43] http://www.wikipedia.org/

[44] http://wn-similarity.sourceforge.net/

[45] http://wordnet.princeton.edu

[46] Hu B. WiKi'mantics: interpreting ontologies with WikipediA. Knowledge and Information Systems (KAIS) Journal, 2010.

[47] Hu J., Fang L., Cao Y. Enhancing Text Clustering by Leveraging Wikipedia Semantics. In Proceedings of SIGIR, 2008.

[48] Hu X. and Yoo I. Cluster Ensemble and Its Applications in Gene Expression Analysis. In Proceedings of the Second Asia-Pacific Bioinformatics Conference, 2004.

[49] Hu X., Zhang X., Lu C., Park E. and Zhou X. Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 389-396, 2009.

[50] Huang A., Milne D., Frank E., and Witten I. Clustering documents using a Wikipedia-based concept representation. In Proceedings of Advances in Knowledge Discovery and Data Mining, pp 628-636, 2009.

[51] Huang A., Milne D., Frank E., and Witten I. Clustering documents with active learning using Wikipedia. In Proceedings of the International Conference on Data Mining, pp 839-844, 2008.

[52] Jalali V. and Borujerdi M. R. Information retrieval with concept-based pseudo-relevance feedback in MEDLINE. Knowledge and Information Systems (KAIS) Journal, 2010.

[53] Jain A.K. and Dubes R.C. Algorithms for Clustering Data. Prentice Hall, 1988.

[54] Jiang J.J. and Conrath D.W. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, 1997.

[55] Jing L., Ng M. K., and Huang J. Z. Knowledge-based vector space model for text clustering. Knowledge and Information Systems (KAIS) Journal, 2010.

[56] Jing L., Zhou L., Ng M. K. and Huang J. Z. Ontology-based distance measure for text clustering. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2006.

[57] Knappe R., Bulskov H. and Andreasen T. Perspectives on Ontology-Based Querying. International Journal of Intelligent Systems, 2004.

[58] Kuncheva L.I. and Hadjitodorov S.T. Using Diversity in Cluster Ensembles. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 2004.

[59] Lang K. NewsWeeder: Learning to Filter Netnews. In Proceedings of the International Conference on Machine Learning, pp 331-339, 1995.

[60] Lange T., Roth V., Braun M.L., and Buhmann J.M. Stability-Based Validation of Clustering Solutions. Neural Computation, vol. 16, pp 1299-1323, 2004.

[61] Law M.H. and Jain A.K. Cluster Validity by Bootstrapping Partitions. Technical Report MSU-CSE-03-5, Michigan State University, 2003.

[62] Law M. H. C., Topchy A., and Jain A. K. Clustering with Soft and Group Constraints. In Proceedings of the Joint IAPR Workshop on Syntactical and Structural Pattern Recognition and Statistical Pattern Recognition, pp 662-670, 2004.

[63] Levine E. and Domany E. Resampling Method for Unsupervised Estimation of Cluster Validity. Neural Computation, vol. 13, pp 2573-2593, 2001.

[64] Lewis D. Reuters-21578 text categorization test collection, 1997.

[65] Li Y., Luk R.W.P., Ho E.K.S. and Chung F.L. Improving Weak Ad-Hoc Queries using Wikipedia as External Corpus. In Proceedings of SIGIR, pp 797-798, 2007.

[66] Lin D. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, 1998.

[67] Lu Z. and Leen T. Semi-supervised learning with penalized probabilistic clustering. In Advances in Neural Information Processing Systems 17.

[68] Madeira S.C. and Oliveira A.L. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, pp 24-45, 2004.

[69] Mann H. B., Whitney D. R. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, pp 50-60, 1947.

[70] Maulik U. and Bandyopadhyay S. Performance Evaluation of Some Clustering Algorithms and Validity Indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp 1650-1654, 2002.

[71] Milne D. Computing Semantic Relatedness using Wikipedia Link Structure. In Proceedings of the New Zealand Computer Science Research Student Conference (NZCSRSC), 2007.

[72] Milne D., Medelyan O. and Witten I. H. Mining Domain-Specific Thesauri from Wikipedia: A Case Study. In Proceedings of the International Conference on Web Intelligence (IEEE/WIC/ACM WI'2006), Hong Kong, 2006.

[73] Milne D., Medelyan O. and Witten I. H. An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In the AAAI Workshop on Wikipedia and Artificial Intelligence, 2008.

[74] Milne D. and Witten I. H. Learning to link with Wikipedia. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), New York, pp 509-518, 2008.

[75] Mihalcea R. Using Wikipedia for automatic word sense disambiguation. In Proceedings of NAACL HLT, 2007.

[76] Miller G. WordNet: a lexical database for English. Communications of the ACM, pp 39-41, 1995.

[77] Minaei B., Topchy A., and Punch W. Ensembles of Partitions via Data Resampling. In Proceedings of the International Conference on Information Technology: Coding and Computing, 2004.

[78] Monti S., Tamayo P., Mesirov J., and Golub T. Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data. Machine Learning Journal, vol. 52, pp 91-118, 2003.

[79] Moravec P., Kolovrat M. and Snasel V. LSI vs. WordNet ontology in dimension reduction and information retrieval. In Proceedings of the Dateso 2004 Annual International Workshop on Databases, Texts, Specifications and Objects, 2004.

[80] Natural Language Toolkit. http://www.nltk.org/

[81] Ni X., Quan X., Lu Z., Wenyin L. and Hua B. Short text clustering by finding core terms. Knowledge and Information Systems (KAIS) Journal, 2010.

[82] Phan X., Nguyen L. and Horiguchi S. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In Proceedings of WWW, pp 21-25, 2008.

[83] Pedersen T., Patwardhan S., and Michelizzi J. WordNet::Similarity - Measuring the Relatedness of Concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pp 1024-1025, 2004.

[84] Porter M.F. An algorithm for suffix stripping. Program, 14(3), pp 130-137, 1980.

[85] Rand W.M. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, vol. 66, pp 846-850, 1971.

[86] Recupero D. A new unsupervised method for Document Clustering by using WordNet Lexical and Conceptual Relations. Information Retrieval, pp 563-579, 2007.

[87] Resnik P. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, pp 95-130, 1999.

[88] Rosen-Zvi M., Griffiths T., Steyvers M., Smyth P. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp 487-494, 2004.

[89] Roth V., Lange T., Braun M., and Buhmann J. A Resampling Approach to Cluster Validation. In Proceedings of the Conference on Computational Statistics, pp 123-128, 2002.

[90] Salton G. and Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), pp 513-523, 1988.

[91] Salton G., Wong A., Yang C. S. A vector space model for automatic indexing. Communications of the ACM, pp 613-620, 1975.

[92] Sedding J., Kazakov D. WordNet-based text document clustering. In Proceedings of the 3rd Workshop on Robust Methods in Analysis of Natural Language Processing Data, pp 104-113, 2004.

[93] Slonim N., Friedman N. and Tishby N. Agglomerative Multivariate Information Bottleneck. In Proceedings of Neural Information Processing Systems (NIPS), 2001.

[94] Slonim N. and Tishby N. Document Clustering Using Word Clusters via the Information Bottleneck Method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 208-215, 2000.

[95] Shin K., Han S. and Gelbukh A. Advanced clustering technique for medical data using semantic information, pp 322-331, 2004.

[96] Shvaiko P. and Euzenat J. A survey of schema-based matching approaches. Journal on Data Semantics IV, pp 146-171, 2005.

[97] Steinbach M., Karypis G., Kumar V. A comparison of document clustering techniques. In Proceedings of the KDD Workshop on Text Mining, 2000.

[98] Steyvers M., Smyth P., Rosen-Zvi M., Griffiths T. Probabilistic author-topic models for information discovery. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 306-315, 2004.

[99] Steinbach M., Karypis G. and Kumar V. A comparison of document clustering techniques. In Proceedings of the KDD Workshop on Text Mining, 2000.

[100] Strehl A. and Ghosh J. Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, vol. 3, pp 583-618, 2002.

[101] Strube M. and Ponzetto S.P. WikiRelate! Computing Semantic Relatedness Using Wikipedia. In Proceedings of AAAI, 2006.

[102] Tan P.N., Steinbach M., Kumar V. Introduction to Data Mining. Addison-Wesley Longman Publishing Co., 2005.

[103] TechTC - Technion Repository of Text Categorization Datasets - http://techtc.cs.technion.ac.il/

[104] Termier A., Rousset M.C., Sebag M. Combining statistics and semantics for word and document clustering. In Proceedings of IJCAI, pp 49-54, 2001.

[105] Tibshirani R., Walther G., and Hastie T. Estimating the Number of Clusters in a Data Set via the Gap Statistic. Journal of the Royal Statistical Society B, vol. 63, pp 411-423, 2001.

[106] Tishby N., Pereira F., Bialek W. The Information Bottleneck Method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pp 368-377, 1999.

[107] Topchy A., Jain A.K., Punch W. A mixture model for clustering ensembles. In Proceedings of the SIAM Conference on Data Mining, pp 379-390, 2004.

[108] Topchy A., Jain A.K., and Punch W. Combining Multiple Weak Clusterings. In Proceedings of the IEEE International Conference on Data Mining, pp 331-338, 2003.

[109] Wang P. and Domeniconi C. Building Semantic Kernels for text classification using Wikipedia. In Proceedings of KDD, pp 24-27, 2008.

[110] Wang P., Hu J., Zeng H. J., Chen L., Chen Z. Improving text classification by using encyclopedia knowledge. In Proceedings of ICDM, pp 332-341, 2007.

[111] Wagstaff K., Cardie C., Rogers S., and Schroedl S. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, pp 577-584, 2001.

[112] Wang Y., Hodges J. Document Clustering with Semantic Analysis. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences, pp 54c-54c, 2006.

[113] Weingessel A., Dimitriadou E., and Hornik K. An Ensemble Method for Clustering. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 2003.

[114] Wu Z. and Palmer M. Verb Semantics and Lexical Selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp 133-138, 1994.

[115] Yoo I., Hu X., Song I. Integration of Semantic-based Bipartite Graph Representation and Mutual Refinement Strategy for Biomedical Literature Clustering. In Proceedings of the 12th ACM KDD, pp 791-796, 2006.

[116] Zhang Z. and Ye N. Locality preserving multimodal discriminative learning for supervised feature selection. Knowledge and Information Systems (KAIS) Journal, 2010.

[117] Zhang X., Jing L., Hu X., Ng M., Zhou X. A Comparative Study of Ontology-Based Term Similarity Measures on PubMed Document Clustering. In Proceedings of the 12th International Conference on Database Systems for Advanced Applications (DASFAA), 2007.

[118] Zhang X., Zhou X. and Hu X. Semantic Smoothing for Model-based Document Clustering. In Proceedings of the International Conference on Data Mining (ICDM), pp 18-22, 2006.

[119] Zhao Y. and Karypis G. Criterion functions for document clustering: Experiments and analysis. Technical Report TR 01-40, 2001.

[120] Zhong S. and Ghosh J. Generative model-based document clustering: a comparative study. Knowledge and Information Systems, pp 374-384, 2005.

[121] Zhou X., Zhang X. and Hu X. Semantic Smoothing of Document Models for Agglomerative Clustering. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.

[122] Zhu S., Zeng J., Mamitsuka H. Enhancing MEDLINE clustering by incorporating MeSH semantic similarity. Bioinformatics, 2009.