Table 4-5 Experiment M2, Average Local Structure

              δ (B)                                  MF
  σ (A)       0.02        0.1         0.5            0.3         0.6         Eigen
  0.10                                                                       1.12×10^-7
  0.20        6.49×10^-4  8.39×10^-3  2.54×10^-3     4.03×10^-3  4.70×10^-3  0
  0.30        9.64×10^-4  4.74×10^-3  5.07×10^-3     8.89×10^-3  8.86×10^-3  0

Table 4-6 Experiment M2, Average Global Structure

              δ (B)                                  MF
  σ (A)       0.02        0.1         0.5            0.3         0.6         Eigen
  0.10        8.98×10^-4  1.26×10^-3  2.93×10^-3     4.16×10^-4  6.56×10^-5  9.5×10^-7
  0.20        8.90×10^-4  9.74×10^-4  1.79×10^-3     1.28×10^-5  6.58×10^-5  1.0×10^-5
  0.30        7.66×10^-4  1.97×10^-3  1.77×10^-3     7.69×10^-4  7.64×10^-4  6.5×10^-5

Table 4-7 Experiment M2, Analysis of Variance

  F-test         alocal    adist     F 90%    F 99%
  F_A (2,18)     6.028     0.708     2.62     6.01
  F_B (2,18)     0.829     5.254     2.62     6.01
  F_AB (4,18)    3.257     1.094     2.29     4.58

Table 4-7 indicates that there is a significant effect on local structure due to σ and significant interaction between σ and δ. A significant effect on global structure occurs due to δ. This effect, however, is not visually detectable. Although not pictured, the apparent local and global structure looks about the same in mappings run with different δ. The final stress values for each run are shown in Figure 4-5. The final stress does not vary much with δ. The δ value of 0.5 probably provided a slow enough annealing for this data. Larger values should result in higher final stress and smaller values, in longer run times. Figure 4-5 also shows that there is almost no difference in final stress between replicated runs for one value of σ, so annealing is relatively stable with respect to the random number seed.

Figure 4-6 shows the values of adist for each run of the experiment. There is less correspondence between δ and the final stress values than in experiment M1 (Figure 4-2). All of the adist values are, however, very small, and all of the mappings produced look like fairly good representations of the original data. The difference in alocal values observed between different methods used in this experiment is probably not significant even though the analysis of variance indicates that there are significant effects. As explained in Section 4.1, this is probably due to the small size of the data set and the way alocal is constructed.

The final stress values obtained by gradient descent are slightly higher than those obtained by simulated annealing, although the values of adist are about the same or lower.
Values of alocal are roughly the same as those obtained by simulated annealing except when σ is smallest, in which case gradient descent has smaller values of alocal. The final mappings again look like fair representations of the data, although the gradient descent mapping looks a little better because the shape of the clusters is more regular. This result is curious and indicates a fault in either the local structure statistic or the choice of stress as the criterion function. A similar result occurs in experiment M4.

The variance retained in the eigenvector projection of the tightly clustered data is 98%. As mentioned in experiment M1, we should expect three tight clusters to fit well in two dimensions. The larger clusters have somewhat less of the variance retained, 94% and 89%. The values of alocal and adist for the mappings made by eigenvector projection are zero or almost zero. The mappings, however, are not very good representations of the data, since many of the points in the mappings project to the same point. This happens because the clusters are simplexes. Because each cluster of the mapping is exactly the same, alocal is zero, even though the mapping is not a good representation.

Figure 4-7 shows mappings obtained by all three methods for a σ value of 0.10. Since the original data contains clusters in which the points are all equidistant from one another, the mapped clusters should look roughly like pentagons. The gradient descent and the simulated annealing mappings look like good representations of the data, although one of the simulated annealing mappings (Figure 4-7b) contains two closely spaced points (14 and 15). This condition is not detected by alocal.

Figure 4-5 - Experiment M2. Plots showing final stress for all runs (panels for δ = 0.02, δ = 0.1, δ = 0.5, and gradient descent; bars grouped by σ = 0.10, 0.20, 0.30).
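The final stress plotted in Figure 4-5 is simply the value of the criterion function at the end of a run. For orientation, the sketch below computes Sammon's stress [SAM 69] for a configuration; this is assumed here to be the stress criterion minimized in these experiments, although the exact normalization used in the thesis implementation is described in Chapter 3, and the function and variable names below are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(patterns, mapped):
    """Sammon's stress between the original patterns and their low-dimensional images.

    patterns: (n x L) array in the original space; mapped: (n x l) array of images.
    Assumes all pairwise distances in the original space are nonzero."""
    d_star = pdist(patterns)   # pairwise distances in the original space
    d = pdist(mapped)          # pairwise distances in the mapped space
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)
```

The annealing and gradient descent configurations for a given run can then be compared simply by evaluating this function on each final mapping.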
Figure 4-6 - Experiment M2. Plots showing adist for all runs (panels for δ = 0.02, δ = 0.1, δ = 0.5, and gradient descent; bars grouped by σ = 0.10, 0.20, 0.30; logarithmic vertical scale).

Figure 4-7 - Experiment M2, Assorted Mappings: (a), (b) simulated annealing; (c) gradient descent; (d) eigenvector projection.

The annealing curves for the mappings in this experiment are similar in character to those from experiment M1 (Figure 4-3). The approximate run times are shown in Table 4-8. It is interesting to note that, at a fixed value of δ, the run times tend to increase as σ decreases. Because simulated annealing is so slow and the results are no better than those obtained by gradient descent, it is not practical to use simulated annealing for such small data sets.

Table 4-8 Experiment M2, Run times

  Projection                    Approx. run time (seconds)
  δ = 0.02                      850
  δ = 0.1                       230
  δ = 0.5                       80
  Gradient Descent (20 runs)    25
  Eigenvector Projection        0.5

4.3. Experiment M3 - Large Gaussian Clusters

This experiment looks at the quality of the mappings obtained for several large data sets. Factor A in the analysis of variance was cluster spread σ with levels {0.01, 0.07, 0.21} and factor B was cooling parameter δ with levels {0.1, 0.5, 2.5}. The value of ε was fixed at 10^-9 and Markov chain length was fixed at 150. The data sets all consist of five gaussian clusters of 15 points per cluster in six dimensions. There are three data sets for each level of σ. The values of statistics alocal and adist are shown in Table 4-9 and Table 4-10, averaged over K = 3 replications per cell. Results for gradient descent and eigenvector projection are shown for comparison. Table 4-11 contains the Fisher F values for the analysis of variance.

Table 4-9 Experiment M3, Average Local Structure

              δ (B)                              MF
  σ (A)       0.1       0.5       2.5            0.3       0.6       Eigen
  0.01        0.0080    0.0305    22.6086        0.3876    0.1384    0.0692
  0.07        0.0114    0.0139    0.0107         0.0134    0.0135    0.0597
  0.21        0.0432    0.0301    0.0586         0.0393    0.0395    0.0191

Table 4-10 Experiment M3, Average Global Structure

              δ (B)                              MF
  σ (A)       0.1       0.5       2.5            0.3       0.6       Eigen
  0.01        0.1162    0.6080    0.4957         0.0690    0.0780    0.4038
  0.07        0.0375    0.0401    0.0510         0.0496    0.0496    0.1696
  0.21        0.0690    0.0708    0.0790         0.0531    0.0528    0.1799

Table 4-11 Experiment M3, Analysis of Variance

  F-test         alocal    adist     F 90%    F 99%
  F_A (2,18)     0.999     20.167    2.62     6.01
  F_B (2,18)     1.003     3.824     2.62     6.01
  F_AB (4,18)    1.000     3.571     2.29     4.58

Table 4-11 indicates very significant effects on global structure due to σ and significant effects due to δ and factor interaction, but no significant effects on local structure. The effect due to δ confirms results of experiment M1. The cell-to-cell variation is particularly evident in Tables 4-9 and 4-10 for tightly clustered data. When δ is too large the relative distances between clusters are not reproduced well in the mapping. However, it is not necessarily the case that the larger δ, the worse the results. A mapping made with too large a value of δ may occasionally produce adequate results.

The effect on global structure due to cluster spread can be traced to the stress function.
Intuition dictates that stress functions corresponding to data sets with tight clusters have deeper local minima than those for more loosely clustered data. More annealing moves that increase stress are necessary to relocate an entire cluster to another minimum in the stress function for tightly clustered data than for loosely clustered data.

Figure 4-8 shows the final stress values and Figure 4-9 shows the values of adist for each run of this experiment. The final stress values and adist values seem to correlate for the tightly clustered data as they did in experiment M1. The results of the gradient descent mappings are, for the most part, comparable to the results from annealing, but in a few instances the results from annealing are actually better.

The values of alocal for each run in the experiment are shown in Figure 4-10. Although no significant effects on alocal were observed from the analysis of variance, the values of alocal for the mappings with small δ of the tightly clustered data are lower than for the corresponding gradient descent mappings. Additionally, the final stress obtained for the tightly clustered data was lower than that obtained by gradient descent, and the values of adist for the tightly clustered data were comparable to gradient descent in two of the three runs. The apparently poor performance of simulated annealing for δ of 0.1 and σ of 0.01 compared to gradient descent is caused by one run with a particularly large adist. This can be seen in Figure 4-9.

It is not surprising that simulated annealing performs relatively better for this large experiment than for the small data sets of experiments M1 and M2. The stress functions of the larger data sets are functions of more variables, and therefore have more complex shapes, probably containing more local minima.

The mappings made by eigenvector projection were not as good as those made by minimizing stress. The variances retained for the tightly clustered, loosely clustered, and very loosely clustered data were approximately 75%, 82%, and 67%. The alocal and adist values were also larger in almost every case. Figure 4-11 shows final mappings of the tightly clustered data produced by each of the three methods. The simulated annealing clusters appear to have better local structure than those by gradient descent. Remember that the clusters are very tight and that each cluster contains fifteen points.

Figure 4-8 - Experiment M3. Plots showing final stress for all runs (panels for δ = 0.1, δ = 0.5, δ = 2.5, and gradient descent; bars grouped by σ = 0.01, 0.07, 0.21).

Figure 4-9 - Experiment M3. Plots showing adist for all runs (same layout as Figure 4-8).

Figure 4-10 - Experiment M3. Plots showing alocal for all runs (same layout as Figure 4-8).
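All of the analysis-of-variance tables in this chapter and the next report F ratios from a standard two-factor, fixed-effects design with replication. The sketch below shows that computation in generic form; it is not the thesis code, and the tabulated reference percentiles (the F 90% and F 99% columns) still have to be looked up separately. With a = b = 3 factor levels and K = 3 replications it yields the (2,18), (2,18), and (4,18) degrees of freedom quoted in Tables 4-7, 4-11, and 4-15.

```python
import numpy as np

def two_way_anova_F(y):
    """F ratios F_A, F_B, F_AB for a two-factor fixed-effects ANOVA with replication.

    y has shape (a, b, K): levels of factor A, levels of factor B, replications per cell."""
    a, b, K = y.shape
    grand = y.mean()
    mA = y.mean(axis=(1, 2))              # factor A level means
    mB = y.mean(axis=(0, 2))              # factor B level means
    mAB = y.mean(axis=2)                  # cell means
    ss_A = b * K * np.sum((mA - grand) ** 2)
    ss_B = a * K * np.sum((mB - grand) ** 2)
    ss_AB = K * np.sum((mAB - mA[:, None] - mB[None, :] + grand) ** 2)
    ss_E = np.sum((y - mAB[:, :, None]) ** 2)
    ms_E = ss_E / (a * b * (K - 1))       # error mean square on ab(K-1) degrees of freedom
    F_A = (ss_A / (a - 1)) / ms_E
    F_B = (ss_B / (b - 1)) / ms_E
    F_AB = (ss_AB / ((a - 1) * (b - 1))) / ms_E
    return F_A, F_B, F_AB
```

An observed F ratio is called significant at a given level when it exceeds the corresponding tabulated percentile, as in the right-hand columns of those tables.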
Figure 4-11 - Experiment M3, Assorted Mappings: (a), (b) simulated annealing; (c) gradient descent; (d) eigenvector projection.

The approximate run times for this experiment are shown in Table 4-12. Both the simulated annealing and gradient descent take longer when mapping the tightly clustered data than when mapping other, more loosely clustered data, although the annealing can take up to 50% longer to map tight clusters than to map loose clusters. Simulated annealing is still significantly slower than gradient descent, but takes only five times as long as gradient descent. In experiment M2, simulated annealing was about 35 times slower than gradient descent. There seems to be some evidence that simulated annealing may be practical for large problems.

Table 4-12 Experiment M3, Run times

                                Approx. run time (seconds)
  Projection                    σ = 0.01      σ > 0.01
  δ = 0.1                       11000         7200
  δ = 0.5                       2700          2500
  δ = 2.5                       1500
  Gradient Descent (20 runs)    2200          1500
  Eigenvector Projection        0.5

4.4. Experiment M4 - Large Lattice Clusters

This experiment looks at the quality of the mappings obtained for three large data sets. Factor A in the analysis of variance was cluster spread σ with levels {0.2, 0.4, 0.6} and factor B was cooling parameter δ with levels {0.1, 0.5, 2.5}. The value of ε was fixed at 10^-9 and Markov chain length was fixed at 200. The data sets all consist of five lattice clusters of 27 points per cluster in four dimensions. Replications were made by running the annealing on a single data set while varying the random number seed.

The values of statistics alocal and adist averaged over the K = 3 replications are shown in Table 4-13 and Table 4-14. Results for gradient descent and eigenvector projection are shown for comparison. Table 4-15 contains the Fisher F values for the analysis of variance.

Table 4-13 Experiment M4, Average Local Structure

              δ (B)                              MF
  σ (A)       0.1       0.5       2.5            0.3       0.6       Eigen
  0.20        0.0032    0.0026    1.9488         0.0022    0.0022    0.0077
  0.40        0.0025    0.0041    1.2555         0.0020    0.0021    0.0075
  0.60        0.0118    0.0128    0.5412         0.0063    0.0063    0.0065

Table 4-14 Experiment M4, Average Global Structure

              δ (B)                              MF
  σ (A)       0.1       0.5       2.5            0.3       0.6       Eigen
  0.20        0.0332    0.0332    0.0359         0.0491    0.0493    0.3454
  0.40        0.0377    0.0413    0.0373         0.0493    0.0491    0.7018
  0.60        0.0672    0.1363    0.1143         0.0931    0.0925    2.4519

Table 4-15 Experiment M4, Analysis of Variance

  F-test         alocal    adist     F 90%    F 99%
  F_A (2,18)     0.6707    5.8786    2.62     6.01
  F_B (2,18)     6.4399    0.5593    2.62     6.01
  F_AB (4,18)    0.6982    0.4822    2.29     4.58

Significant effects occur on global structure due to σ at the 90% level and on local structure due to δ at the 99% level. A significant effect on global structure due to cluster spread was also observed in experiment M3, and is most likely related to the shape of the stress function. That is, the stress function of the more tightly clustered data set has deeper local minima than that for the loosely clustered data. The effect on local structure due to δ has to do with the type of sub-optimal mapping obtained by cooling too quickly. In experiments M2 and M3, cooling too quickly produced mappings in which the points were correctly mapped into clusters, but inter-cluster distances were not reproduced correctly. Several of the sub-optimal mappings in this experiment contain clusters with outliers.
The presence of an outlier significantly changes the local structure of a cluster.

The final stress for each run of this experiment is shown in Figure 4-12. Values of adist are shown in Figure 4-13. The minimization is, as usual, most reliable for the smallest δ. There is also correspondence between the final stress and the value of adist. The worst values of alocal were obtained with a δ of 2.5, due to outliers in the mappings. The high values of alocal in Figure 4-14 correspond to the higher values of stress.

Figure 4-15 shows mappings of the tightly clustered data obtained by all three mapping methods. The visual quality of the mappings obtained by simulated annealing is about the same as that of those obtained by gradient descent, but the local structure is not quite as nice. The final stress and adist values are slightly lower for simulated annealing, but the alocal values are slightly lower for gradient descent. The lower stress indicates that simulated annealing is the better optimization, although the local structure is not preserved as well as with gradient descent. As mentioned in Section 4.2, this is either a fault of the local structure statistic or of the choice of the stress criterion function.

The results of the eigenvector projection were significantly worse than either of the other mappings. The retained variances for the increasing cluster spreads are 83%, 79%, and 75%. These data sets of five clusters in four dimensions cannot be very well represented in two dimensions by using only the principal components.

Figure 4-12 - Experiment M4. Plots showing final stress for all runs (panels for δ = 0.1, δ = 0.5, δ = 2.5, and gradient descent; bars grouped by σ = 0.20, 0.40, 0.60).

Figure 4-13 - Experiment M4. Plots showing adist for all runs (same layout as Figure 4-12).

Figure 4-14 - Experiment M4. Plots showing alocal for all runs (same layout as Figure 4-12).

Figure 4-15 - Experiment M4 mappings of the tightly clustered data: (a), (b) simulated annealing; (c) gradient descent; (d) eigenvector projection.

Approximate run times for the various mapping methods are summarized in Table 4-16. The results of the simulated annealing and gradient descent mappings are comparable, but have slightly different character.
Both can be considered to produce reasonably good quality mappings. However, the simulated annealing takes approximately five times longer than the gradient descent approach.

Table 4-16 Experiment M4, Run times

  Projection                    Approx. run time (seconds)
  δ = 0.1                       31000
  δ = 0.5                       10000
  δ = 2.5                       6400
  Gradient Descent (20 runs)    6000
  Eigenvector Projection        0.5

4.5. Mapping Experiment Summary

The experiments of this chapter have confirmed the effect of cooling speed on the outcome of the minimization that has been described in the literature [WHI 84]; slower cooling produces a more reliable result, particularly for tightly clustered data.

Experiments M1 and M2 showed some interesting effects, but served primarily as a guide to the larger experiments, M3 and M4. The most interesting artifact of sample size is the run time. In experiment M1 simulated annealing took about three times longer than gradient descent, but about five times more runs than suggested by Sammon [SAM 69] were used. Section 4.3 shows that the simulated annealing provides a better optimization than gradient descent, but gradient descent is still faster.

Simulated annealing produced mappings in experiments M3 and M4 with slightly better global structure than did gradient descent or eigenvector projections. The global structure is rather closely tied to the actual minimization problem taking place, since the distances between patterns in different clusters comprise the largest component of the stress function. The local structure reproduced in the mappings depends on the type of local structure present in the data. For the highly structured lattice clusters, gradient descent produced mappings with better local structure than did simulated annealing. For the gaussian clusters, simulated annealing produced mappings with better local structure than did gradient descent. This behavior may be an artifact of the stress criterion function or of the local structure statistic.

One of the biggest problems with simulated annealing for the mapping problem is the long run time. Simulated annealing takes five times as long as gradient descent for large data sets, and the mapping is not significantly better, considering the structure statistics obtained and visually inspecting the mappings produced. Eigenvector projection takes almost immeasurably less time, and the mappings produced, while not as good as those obtained by minimization, are adequate in some circumstances. An argument may be made in some applications that the results obtained for large data sets using the simulated annealing mapping algorithm are worth the extra computational cost.

Chapter 5
Clustering Experiments

Two experiments using the simulated annealing clustering algorithm are discussed in this chapter. Of particular interest is the behavior of the algorithm on clustered data of varying spread and on random data. The experimental parameters are discussed, some graphical and numerical results are shown, and then an explanation of the results is offered. Results for the clustering experiments are summarized at the end of the chapter.

5.1. Experiment C1 - Clustered Data

This experiment examines clustered data having four values of cluster spread. All data sets consist of four gaussian clusters in two dimensions with thirteen points per cluster. Factor A in the analysis of variance is the number of clusters requested with levels {2, 3, 4, 5, 6, 7, 8}. Factor B is the cluster spread with levels {0.05, 0.10, 0.15, 0.20}.
The value of δ is fixed at 0.05, ε is fixed at 0.001, starting Markov chain length is fixed at 100, and minimum Markov chain length is fixed at 10.

Table 5-1 shows the values of the modified Rand statistic averaged over K = 10 replications per cell using simulated annealing. The additional column in Table 5-1 labelled 'Slow' shows results with conservative choices of annealing parameters. The column is incomplete because of the long run times required. The 'slow' annealing is made with δ = 0.02, ε = 0.0001, starting Markov chain length = 500, and minimum Markov chain length = 20. Table 5-2 shows the modified Rand when Forgy's algorithm is used on the same data sets. Each cell in the table is the average of ten replications. Table 5-3 shows the F statistics from the analysis of variance.

Table 5-1 Experiment C1, Simulated Annealing

  Number of       Cluster Spread (B)
  Clusters (A)    Fast Annealing                            Slow
                  0.05      0.10      0.15      0.20        0.05
  2               0.372     0.386     0.338     0.338
  3               0.654     0.603     0.560     0.486
  4               0.727     0.786     0.609     0.500       0.956
  5               0.777     0.722     0.575     0.451       0.879
  6               0.679     0.645     0.474     0.412
  7               0.604     0.541     0.461     0.364
  8               0.553     0.500     0.407     0.348

Table 5-2 Experiment C1, Forgy

  Number of       Cluster Spread (B)
  Clusters (A)    0.05      0.10      0.15      0.20
  2               0.386     0.395     0.349     0.346
  3               0.699     0.636     0.571     0.452
  4               0.995     0.888     0.655     0.509
  5               0.898     0.812     0.575     0.457
  6               0.811     0.746     0.530     0.412
  7               0.742     0.676     0.496     0.372
  8               0.672     0.570     0.435     0.356

Table 5-3 Experiment C1, Analysis of Variance

  F-test           Annealing    Forgy     F 99%    F 99.9%
  F_A (6,252)      30.99        65.07     2.80     3.74
  F_B (3,252)      46.98        159.51    3.78     5.42
  F_AB (18,252)    1.57         4.96      2.04     2.51

Table 5-3 indicates highly significant effects due to both the cluster spread and the number of partitions for both clustering algorithms. The Forgy algorithm produced better results in almost every cell in Tables 5-1 and 5-2. If more conservative values of the annealing parameters δ and ε were used, as in the rightmost column of Table 5-1, the results would probably exceed those obtained for the Forgy clustering at the cost of significantly longer run times.

All the values in Tables 5-1 and 5-2 produce significantly better clusterings than random labellings, since the modified Rand values are 15 to 40 standard deviations greater than the mean of zero under randomness [HUB 85]. Within each column, the values of the modified Rand statistic are best for partitions of four clusters, the true value, with one exception. For a fixed number of clusters requested, the Rand values get better as the cluster spread decreases. These results agree with intuition. The exception is in Table 5-1, where the tightly clustered data groups into five clusters better than into four clusters. This appears to be a result of poor annealing parameters. The results obtained from the 'slow' cooling are quite significant. If we were to look at the Rand statistic for the tightly clustered data (σ = 0.05) to try to determine the true number of clusters, we would make a correct decision based on the numbers obtained from the slow cooling but an incorrect decision from the faster cooling.

Figure 5-1 shows an example of how clusterings differ with annealing schedules for the tightly clustered data. Figure 5-1 shows scatter plots of the patterns and exhibits the maximum and minimum square error values from each Markov chain. The numerals indicate the labels assigned by the algorithm. The clustering in Figure 5-1a has a modified Rand value of 0.413. It is one of the worst of the ten replications for partitioning the tightly clustered data into four clusters.
Figure 5-1b contains the plot of the same data set when clustered with more conservative cooling parameters. It corresponds to a Rand value of 1.

Figure 5-1 - Experiment C1, Assorted Plots: (a) fast annealing; (b) slow annealing. Each part shows the clustered patterns (numerals are the assigned labels) and the square error per Markov chain.

These results again demonstrate the importance of choosing good values of the annealing parameters. Unfortunately, the problem with choosing adequate parameter values is that the run times may become so long that a run will not finish between machine downtimes. Approximate run times for the clustering algorithms used in this experiment are summarized in Table 5-4. The run times tend to increase as the cluster spread of the original data gets larger. Clearly, the simulated annealing algorithm is not practical for clustering these small data sets.

Table 5-4 Experiment C1, Run times

  Problem                    Approx. run time (seconds)
  2 clusters (fast)          160
  3 clusters (fast)          320
  4 clusters (fast)          450
  5 clusters (fast)          500
  6 clusters (fast)          550
  7 clusters (fast)          550
  8 clusters (fast)          600
  4 clusters (slow)          8200
  5 clusters (slow)          10500
  Forgy (100 runs)           20

5.2. Experiment C2 - Random Data

This experiment examines the effects of the simulated annealing parameters ε and Markov chain length with random data. Factor A in the analysis of variance is stopping parameter ε with levels {0.0001, 0.001, 0.01}. Factor B is Markov chain length with levels {50, 100, 200}. Parameter δ is fixed at 0.05. Five different random data sets are generated; each contains 50 points in two dimensions. The data are always clustered into seven partitions, which is a reasonable number for 50 points.

Table 5-5 contains the average (modified Hubert) gamma statistic over the K = 5 replications per cell. The final square errors are listed in Table 5-6. Slightly different results are obtained from square error than from the gamma statistic. Table 5-7 contains the F statistics from the analysis of variance.

Table 5-5 Experiment C2, Gamma Statistics

             Markov Chain Length (B)
  ε (A)      50        100       200       Forgy
  0.0001     0.841     0.837     0.749
  0.001      0.738     0.763     0.749     0.853
  0.01       0.526     0.634     0.527

Table 5-6 Experiment C2, Square Error

             Markov Chain Length (B)
  ε (A)      50        100       200       Forgy
  0.0001     0.934     0.909     0.912
  0.001      1.399     1.073     0.912     0.905
  0.01       2.408     1.632     1.353

Table 5-7 Experiment C2, Analysis of Variance

  F-test          Gamma    Square Error    F 99.9%
  F_A (2,36)      22.07    64.16           8.50
  F_B (2,36)      1.64     21.99           8.50
  F_AB (4,36)     0.66     7.27            5.90

Table 5-7 shows highly significant effects on gamma and square error due to ε. There are also effects on square error due to Markov chain length, and significant factor interaction with square error. The effects on square error are not surprising, since square error is the objective function being minimized, but the presence of these effects confirms our knowledge of how Markov chain length and termination of the annealing affect the minimization of the criterion function. The gamma statistics in Table 5-5 are significant since they are all 8 to 14 standard deviations above the mean of zero. The gamma statistic seems to be affected only by ε, with δ = 0.05. It therefore makes little sense to use the larger values of Markov chain length, since larger values require more run time without enhancing results.
The reason Markov chain length has a significant effect on square error and not on gamma may be that labelling a few patterns 'incorrectly' produces larger changes in square error than in gamma. Gamma and square error cannot be compared directly because they do not have the same scale.

Figures 5-2, 5-3, and 5-4 show scatter plots and annealing curves for some of the data clustered in this experiment. Figure 5-2 shows the resultant clustering and annealing curves for one of the worst clusterings obtained with ε = 0.01 and Markov chain length = 50. Figure 5-3a shows a better clustering of the same data set with ε = 0.0001 and Markov chain length = 50, and Figure 5-3b shows the best clustering of the same data set obtained with ε = 0.0001 and Markov chain length = 200. Figure 5-4 shows the above-mentioned plots along with the result from Forgy's algorithm for comparison. Notice that Figures 5-4a and 5-4d are the same except for cluster numbering. Visually, the clusterings obtained with the smallest ε and largest Markov chain length are the best, although some of the others are not bad. The clusterings obtained with large ε and small chain length are the worst ones of all.

Figure 5-2 - Experiment C2, Annealing curves (clustered patterns for ε = 0.01, chain length = 50, together with the percentage of accepted moves and the control parameter plotted against Markov chains).

Figure 5-3 - Experiment C2, Annealing curves: (a) clustered points and square error per chain for ε = 0.0001, chain length = 50; (b) the same for ε = 0.0001, chain length = 200.

Figure 5-4 - Experiment C2, Comparison of Clusterings: (a) simulated annealing with ε = 0.0001 (square error 0.89); (b) simulated annealing with ε = 0.0001 (square error 1.02); (c) simulated annealing with ε = 0.01 (square error 2.08); (d) Forgy (square error 0.89).

The importance of choosing good annealing parameters is again demonstrated in this experiment. If ε is very small, run times may be very large. A reason for this is discussed in the next section.

The simulated annealing run times in this experiment ranged from about 150 seconds to about 5000 seconds. In general, the faster times are associated with large ε and short Markov chain length, and the longer times are associated with small ε and long Markov chain length, but occasionally a run will require an exceptionally long time to finish when ε is small and Markov chain length is large. It is not possible to determine in advance when this will happen. Because the variation in the run times is so large, they cannot be easily represented in a table.

The results from this experiment are not very promising. The simulated annealing clustering algorithm did not perform very well in a reasonable amount of time.

5.3. Clustering Experiment Summary
The simulated annealing clustering algorithm performed at least as well as the Forgy clustering algorithm when conservative parameter values are used. The simulated annealing, however, requires exceptionally long run time, whereas Forgy takes only about 20 seconds to get the best of 100 groupings of these small data sets.

One problem that occurs with clustering by annealing that did not occur with the mapping algorithm has to do with the termination parameter, ε. Decreasing ε will usually decrease final cost, but run time will increase. Parameter ε usually has a predictable effect on the run time with mapping, but not with clustering. Large values of ε produced predictable run times, but small values made the run time marginally longer or very much longer. This is probably due to the discrete nature of the clustering problem. The labelling gets to a point where few moves will be accepted because the points are almost partitioned into a minimum square error configuration and the temperature is low, but the stopping criterion, based on ε, has not yet been reached.

These studies show that even though simulated annealing works, it is so slow that it cannot be considered practical for this particular problem. Existing algorithms are significantly faster.

Chapter 6
Conclusions

Simulated annealing has been applied to two representative optimization problems from exploratory data analysis, nonlinear mapping and square error clustering. Empirical results were presented in Chapters 4 and 5. This chapter draws conclusions about the use of simulated annealing for these two problems and suggests future work.

6.1. Summary and Conclusions

The simulated annealing mapping algorithm has been examined in four experiments, focussing on the effects of cooling speed, sample size, and data shape. It is not surprising that slow cooling is more reliable than fast cooling, but much more computationally expensive. Both large and small data sets can be adequately mapped, but mapping small data sets takes much more time, relatively, than mapping large data sets, when compared to the gradient descent algorithm.

The effects of data shape were the most interesting. The choice of cooling speed parameter was more important for tightly clustered data than for loosely clustered data. When mapping large sets containing gaussian clusters (experiment M3), simulated annealing reproduced local structure better than gradient descent, but when mapping large sets containing lattice clusters (experiment M4), gradient descent reproduced local structure better than simulated annealing. This result may be a problem with simulated annealing when mapping highly structured data or may be an artifact of the local structure statistic.

Using final stress as the performance criterion, simulated annealing performed better than gradient descent in experiments M3 and M4, indicating that simulated annealing is more practical for mapping large data sets than for small ones. The run time of the simulated annealing algorithm increased more slowly with sample size than did that of gradient descent when mapping large data sets.

The annealing parameters ε and Markov chain length of the clustering algorithm were examined using both clustered and random data. In general, annealing can perform as well as Forgy's algorithm, but the run times are very long. Of particular interest is the effect that ε can have on run time.
If ε is chosen too conservatively, some runs never finish because few or no moves are accepted at low temperatures. When this happens in the mapping problem, the algorithm makes the moves smaller, as suggested by [VAN 84]. There is no analogous technique for the clustering problem.

The run times in this thesis suggest that simulated annealing holds more promise in optimizing the stress function than in optimizing the square error function. This is most likely due to the difference in the number of accepted moves between the continuous and the discrete problem. It is important to remember that this is only one of many cooling schedules, and there may be one with termination criteria better suited to the clustering problem. As implemented, the simulated annealing mapping algorithm shows promise for projecting large data sets containing gaussian clusters.

De Soete, et al. [DES 87] have studied a problem similar to the mapping problem and report that although simulated annealing works, run times are too long to make simulated annealing worthwhile. There are, however, many problems reported where simulated annealing has shown promising results [KIR 83] [CAR 85] [VAN 86]. The choice of cooling schedule and parameter values is certainly very important, as is the definition of a move. Efficient implementation is also important, although not a primary concern in this thesis.

The major contribution of this thesis lies in the formal examination of the performance of the simulated annealing mapping and clustering algorithms. The algorithms were judged not by the minimizations, but by independent measures of validity. This methodology has not previously been used in simulated annealing research. In order to examine the mapping algorithm objectively, it was necessary to develop measures of local and global structure. This is a minor contribution. The new results from this formal study indicate that simulated annealing is not appropriate for small data sets. The mapping algorithm, however, performs well on large data sets containing tight gaussian clusters.

6.2. Future Work

The simulated annealing algorithm in this thesis uses one of many possible cooling schedules, and the data sets contain just a few of the many possible parameter variations. Another variation on the idea of simulated annealing has been presented by Bohachevsky, et al. [BOH 86]. Their algorithm minimizes a continuous function, differs in the interpretation of the cooling schedule, and should be considered for the mapping problem.

One of the problems with simulated annealing is the time spent on calculating the cost for moves that are eventually rejected, particularly with small values of the control parameter. Some work has been done to design an algorithm with fewer rejected moves [GRE 84]. This thesis handled the problem of too many rejected moves in the mapping problem by decreasing the size of the move made at lower values of the control parameter. The solution to the clustering problem contains no such time saving measures, but there are several ideas which may be worth investigating.

Most of the patterns have been assigned to correct partitions toward the end of a simulated annealing clustering run, yet the annealing continues for a long time because patterns to be moved are picked at random and very few of the remaining patterns need to be reassigned to other clusters. One way to make the annealing algorithm more efficient might be to move only those patterns that are in the wrong partition.
It is reasonable to expect that the misclassified patterns are those nearest the border between partitions. The actual implementation used in one set of test runs consisted of limiting the possible candidate patterns when the number of rejected moves became too large. Only those patterns whose nearest neighbor currently has a different partition label were considered. Preliminary trials using this technique were not successful for small data sets. Another possibility worth trying is to use some k-nearest neighbor choice of candidates.

Research in developing parallel implementations of the simulated annealing algorithm has been reported [AAR 86B]. The main problem here is that simulated annealing is inherently sequential, although the objective functions used in this thesis lend themselves to parallel operations. A near linear speedup using parallel processors could not be expected, but these new algorithms may allow large problems to terminate that previously could not do so.

Certainly, simulated annealing is not practical for all optimizations, but from the limited number of experiments made here it is fairly safe to say that simulated annealing is worth trying on problems for which there are currently no other good minimization techniques. Cooling schedules other than the one used here also bear investigation and may yield better results. A nice feature of simulated annealing is that the choice of criterion function to optimize is unlimited. The definition of a move may cause some difficulty. Production algorithms may require 'clever' implementations and should use efficient data structures.

Appendix A
Structure Statistics

This appendix defines the statistics used to measure the structure present in clustered data and to judge the performance of projection algorithms.

A.1. Local Structure Statistic

The statistic $alocal_d$ defined below was designed to measure the local structure in a clustered data set in $d$ dimensions. The statistic $alocal$ is defined to be $|alocal_L - alocal_l|$ and determines how well a mapping from $L$ to $l$ dimensions preserves the local structure of clustered data. Values close to zero imply that local structure is preserved well.

$$alocal_d = \frac{1}{nclu\,(nclu-1)} \sum_{i=1}^{nclu} \sum_{\substack{j=1 \\ j \neq i}}^{nclu} \frac{S_i^{(d)}}{S_j^{(d)}}$$

Here, $S_i^{(d)}$ is the within-cluster spread of cluster $i$, $d$ is the dimensionality of the data, and $nclu$ is the number of clusters.

$$S_i^{(d)} = \frac{1}{d} \sum_{k=1}^{d} \left[ \frac{1}{n_i} \sum_{l=1}^{n_i} x_{lki}^2 - \left( \frac{1}{n_i} \sum_{l=1}^{n_i} x_{lki} \right)^2 \right]$$

where $x_{lki}$ is the $k$th feature of the $l$th pattern of the $i$th cluster, and $n_i$ is the number of patterns in cluster $i$. Note that the clusters are specified by the generating process, so the notation is different from that in Section 1.2.

The statistic is composed of the sum of all possible combinations of ratios of within-cluster spread. This number distinguishes between data sets whose clusters have the same spread and those whose clusters vary in their spread. The normalization constant of $alocal_d$ is the reciprocal of the number of terms in the summation. This makes it possible to interpret $alocal_d$ independent of the number of clusters in the data. The closer the value is to one, the more similar are the clusters of the data set.

The statistic $alocal_d$ is now shown to be invariant to scale changes. Scaling all patterns by a factor $K$ gives

$$\frac{1}{nclu\,(nclu-1)} \sum_{i=1}^{nclu} \sum_{\substack{j=1 \\ j \neq i}}^{nclu} \frac{\frac{1}{d}\sum_{k=1}^{d}\left[\frac{1}{n_i}\sum_{l=1}^{n_i}(Kx_{lki})^2 - \left(\frac{1}{n_i}\sum_{l=1}^{n_i}Kx_{lki}\right)^2\right]}{\frac{1}{d}\sum_{k=1}^{d}\left[\frac{1}{n_j}\sum_{l=1}^{n_j}(Kx_{lkj})^2 - \left(\frac{1}{n_j}\sum_{l=1}^{n_j}Kx_{lkj}\right)^2\right]} = alocal_d .$$

Clearly a $K^2$ term can be factored from both numerator and denominator.
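To make the definition concrete, the following sketch evaluates $alocal_d$ for a data set given as a list of clusters (one NumPy array of patterns per cluster). It follows the formulas as reconstructed above; the function names are illustrative and do not come from the thesis.

```python
import numpy as np

def within_cluster_spread(cluster):
    """S^(d): biased per-feature variance, averaged over the d features.

    cluster is an (n_i x d) array holding the patterns of one cluster."""
    return np.mean(cluster.var(axis=0))   # np.var uses 1/n_i, matching the definition

def alocal(clusters):
    """alocal_d: mean of S_i / S_j over all ordered pairs of distinct clusters."""
    spreads = [within_cluster_spread(c) for c in clusters]
    nclu = len(spreads)
    total = sum(spreads[i] / spreads[j]
                for i in range(nclu) for j in range(nclu) if i != j)
    return total / (nclu * (nclu - 1))
```

The mapping statistic $alocal$ is then simply the absolute difference between this value computed on the original $L$-dimensional clusters and on their two-dimensional images.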
The statistic $alocal_d$ is also invariant to translation, since $S^{(d)}$ is invariant to translation. Suppose all patterns are translated by $c_k$ in feature $k$; dropping the cluster subscript,

$$\frac{1}{d}\sum_{k=1}^{d}\left[\frac{1}{n}\sum_{l=1}^{n}(x_{lk}+c_k)^2 - \left(\frac{1}{n}\sum_{l=1}^{n}(x_{lk}+c_k)\right)^2\right]$$
$$= \frac{1}{d}\sum_{k=1}^{d}\left[\frac{1}{n}\sum_{l=1}^{n}x_{lk}^2 + \frac{2c_k}{n}\sum_{l=1}^{n}x_{lk} + c_k^2 - \left(\frac{1}{n}\sum_{l=1}^{n}x_{lk}\right)^2 - \frac{2c_k}{n}\sum_{l=1}^{n}x_{lk} - c_k^2\right]$$
$$= \frac{1}{d}\sum_{k=1}^{d}\left[\frac{1}{n}\sum_{l=1}^{n}x_{lk}^2 - \left(\frac{1}{n}\sum_{l=1}^{n}x_{lk}\right)^2\right] = S^{(d)} .$$

It can also be argued that $alocal_d$ does not vary greatly with dimensionality, since the summation over dimension is present in both numerator and denominator.

A.2. Global Structure Statistic

The statistic $adist$ was designed to measure the global structure in a clustered data set of gaussian clusters in $d$ dimensions. The phrase "global structure" refers to the relative placement of the clusters. The statistic $adist$ is defined to be $|adist_L - adist_l|$ and measures the degree to which global structure is preserved in a mapping from $L$ to $l$ dimensions. As with $alocal$, values close to zero imply that global structure is preserved well.

The statistic $adist_d$ is composed of the sum of all possible combinations of ratios of distances between pairs of clusters. Values close to one imply that the clusters are equally spaced. The statistic $adist_d$ is defined as follows:

$$adist_d = \frac{4}{(nclu+1)\,nclu\,(nclu-1)(nclu-2)} \sum_{i<j} \; \sum_{\substack{k<l \\ (k,l)\neq(i,j)}} \frac{D_{ij}}{D_{kl}}$$

where $D_{ij}$ is the distance between cluster centers $z_i$ and $z_j$ and is given by the following formula:

$$D_{ij} = \left[ \sum_{k=1}^{d} \left( \frac{1}{n_i}\sum_{p=1}^{n_i} x_{pki} - \frac{1}{n_j}\sum_{p=1}^{n_j} x_{pkj} \right)^2 \right]^{1/2} .$$

The statistic $adist_d$ can be shown to be invariant to scale and translation, and argued not to vary greatly with dimensionality, as was $alocal_d$.

Appendix B
Cluster Validity Measures

This appendix contains the formulation of the cluster validity statistics that are used to judge the performance of clustering algorithms.

B.1. External Measure - The Modified Rand Statistic

For the experiment involving clustered data, the true class labels are known, so an external measure is needed. The modified Rand coefficient [HUB 85] [DUB 86] is used. Let $\{L(i)\}$ be the set of $n$ cluster labels assigned to the $n$ patterns by the clustering algorithm and $\{T(i)\}$ be the a priori cluster labels. The Rand coefficient is defined to be

$$\frac{a}{a+b}$$

where

$$a = \left|\{(i,j): i>j,\ [L(i)=L(j),\ T(i)=T(j)] \vee [L(i)\neq L(j),\ T(i)\neq T(j)]\}\right|$$
$$b = \left|\{(i,j): i>j,\ [L(i)=L(j),\ T(i)\neq T(j)] \vee [L(i)\neq L(j),\ T(i)=T(j)]\}\right|$$

and $a + b = \binom{n}{2}$. The statistic $a$ is the number of pairs of patterns which are treated the same by both labellings, that is, the number of pairs which either have the same label in both labellings or have different labels in both partitions; $b$ is the number of pairs of patterns which have the same label in one partition and different labels in the other partition.

The modified Rand statistic [HUB 85] contains a correction factor that accounts for random labellings of the data. The modified Rand is defined as

$$\frac{\text{Rand} - \text{Expected Rand}}{\text{Maximum Rand} - \text{Expected Rand}}$$

where Maximum Rand is taken to be one, and the expected Rand is computed from

$$\text{Expected Rand} = 1 + \frac{2\sum_i \binom{n_i}{2} \sum_j \binom{m_j}{2}}{\binom{n}{2}^2} - \frac{\sum_i \binom{n_i}{2} + \sum_j \binom{m_j}{2}}{\binom{n}{2}}$$

where $n_i$ is the number of patterns in group $i$ of the clustering, and $m_j$ is the number of patterns in group $j$ of the true labelling. This statistic has a value of 1 when $b = 0$ and should be around 0 when the cluster labels are assigned randomly.
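As a check on these definitions, the modified Rand statistic can be computed directly from two labellings by counting pairs; the sketch below does so, using the expected Rand as reconstructed above. The labels are assumed to be nonnegative integers, one per pattern, and the function names are mine, not the thesis's. Applied to the 52-point, four-cluster data sets of experiment C1, it reproduces the special cases quoted in the next paragraph.

```python
import numpy as np
from itertools import combinations
from math import comb

def rand_coefficient(labels, truth):
    """Rand coefficient a/(a+b): fraction of pattern pairs treated the same way by both labellings."""
    agree = sum((labels[i] == labels[j]) == (truth[i] == truth[j])
                for i, j in combinations(range(len(labels)), 2))
    return agree / comb(len(labels), 2)

def expected_rand(labels, truth):
    """Expected Rand coefficient under random labellings with the same group sizes."""
    n = len(labels)
    P = sum(comb(int(c), 2) for c in np.bincount(labels))   # pairs together in the clustering
    Q = sum(comb(int(c), 2) for c in np.bincount(truth))    # pairs together in the true labelling
    N2 = comb(n, 2)
    return 1 + 2 * P * Q / N2 ** 2 - (P + Q) / N2

def modified_rand(labels, truth):
    """(Rand - Expected Rand) / (Maximum Rand - Expected Rand), with Maximum Rand taken as 1."""
    e = expected_rand(labels, truth)
    return (rand_coefficient(labels, truth) - e) / (1 - e)
```

Counting pairs directly is quadratic in the number of patterns, but that is entirely adequate for the 50-odd-point data sets used in these experiments.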
Some special cases of the statistic for the 52-point data sets used in experiment C1 are 0.948 when any one pattern is classified in the wrong cluster, and 0.975 when one pattern is placed in a singleton cluster.

B.2. Internal Measure - Modified Hubert's Gamma

For the experiment involving random data, the true class labels are not known, so an internal measure of cluster validity is needed. The modified Hubert gamma statistic [DUB 86] is based on the Mantel statistic [MAN 67] and Hubert's gamma statistic [HUB 76]. The statistic is the point serial correlation coefficient between the proximity matrix for the $n$ patterns, $\{x_i, 1 \le i \le n\}$, and a "model" matrix. The cluster centers from the clustering are considered to be the "true" locations of the patterns, so the model matrix is the proximity matrix in which proximity between patterns is indicated by the distance between the centers of the clusters containing these patterns.

Let $L$ denote the label function that maps the set of patterns to the set of cluster labels:

$$L_i = k \quad \text{if } y_{ik} = 1 .$$

Let $\{z_i, 1 \le i \le g\}$ be the cluster centers and let $M$ be a shorthand notation for $n(n-1)/2$. Denote the euclidean distance between vectors $a$ and $b$ by

$$\delta(a,b) = \left[(a-b)^T(a-b)\right]^{1/2} .$$

Note that this function $\delta$ is not related to the simulated annealing cooling parameter. The modified Hubert statistic, MH, will be a function of the following components:

$$\Gamma = \frac{1}{M}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta(x_i,x_j)\,\delta(z_{L_i},z_{L_j})$$
$$M_P = \frac{1}{M}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta(x_i,x_j)$$
$$M_C = \frac{1}{M}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta(z_{L_i},z_{L_j})$$
$$\sigma_P^2 = \frac{1}{M}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta^2(x_i,x_j) - M_P^2$$
$$\sigma_C^2 = \frac{1}{M}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta^2(z_{L_i},z_{L_j}) - M_C^2$$

The statistic MH is defined:

$$MH = \frac{\Gamma - M_P M_C}{\sigma_P \sigma_C} .$$

Since this statistic is a correlation, values close to 0 indicate a poor clustering and values close to 1 indicate a good clustering.

Bibliography

[AAR 85] Aarts, E.H.L. and P.J.M. van Laarhoven; "Statistical Cooling: A General Approach to Combinatorial Optimization Problems", Philips Journal of Research, V 40; 1985; pp. 193-226.

[AAR 86A] Aarts, E.H.L. and P.J.M. van Laarhoven; "Simulated Annealing: A Pedestrian Review of the Theory and Some Applications", NATO Advanced Study Institute on Pattern Recognition: Theory and Applications; Spa, Belgium; June 1986.

[AAR 86B] Aarts, E.H.L., F.M.J. de Bont, E.H.A. Habers, and P.J.M. van Laarhoven; "Parallel Implementations of the Statistical Cooling Algorithm", Integration, the VLSI Journal, 1986, pp. 209-238.

[AFI 72] Afifi, A.A. and S.P. Azen; Statistical Analysis; Academic Press, 1972.

[AND 73] Anderberg, Michael R.; Cluster Analysis for Applications, Academic Press, New York, 1973.

[BEZ 81] Bezdek, James C.; Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press, 1981, pp. 65-86.

[BIS 81] Biswas, G., A.K. Jain, and R.C. Dubes; "Evaluation of Projection Algorithms", IEEE Transactions on PAMI, Vol. PAMI-3, No. 6, November 1981; pp. 701-708.

[BOH 86] Bohachevsky, Ihor O., Mark E. Johnson, and Myron L. Stein; "Generalized Simulated Annealing for Function Optimization", Technometrics, Vol. 28, No. 3, August 1986, pp. 209-217.

[BUR 84] Burkard, R.E. and F. Rendl; "A Thermodynamically Motivated Simulation Procedure for Combinatorial Optimization Problems", European Journal of Operational Research, V 17; 1984; pp. 169-174.

[CER 85] Cerny, V.; "Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm", Journal of Optimization Theory and Applications, V 45, No. 1; January 1985; pp. 41-51.

[CAR 85] Carnevalli, P., L.
Coletti, and S. Patarnello; "Image Processing by Simulated Annealing", IBM Journal of Research and Development, V 29, No. 6; November 1985, pp. 569-579.

[CHA 73] Chang, C.L. and R.C.T. Lee; "A Heuristic Relaxation Method for Nonlinear Mapping in Cluster Analysis", IEEE Transactions on Systems, Man, and Cybernetics, March 1973, pp. 197-200.

[DES 87] De Soete, Geert, Lawrence Hubert, and Phipps Arabie; "The Comparative Performance of Simulated Annealing on Two Problems of Combinatorial Data Analysis", April 1987, preprint.

[DUB 76] Dubes, R.C. and A.K. Jain; "Clustering Techniques: The User's Dilemma", Pattern Recognition, Vol. 8, 1976, pp. 247-260.

[DUB 86] Dubes, R.C.; "Experiments in Estimating the Number of Clusters", Technical Report, MSU-ENGR-86-019, Department of Computer Science, Michigan State University, 1986.

[FRI 74] Friedman, J.H. and J.W. Tukey; "A Projection Pursuit Algorithm for Exploratory Data Analysis", IEEE Transactions on Computers, V C-23, No. 9; September 1974, pp. 881-889.

[GEM 84] Geman, S. and D. Geman; "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images", IEEE Transactions on PAMI, V PAMI-6, 1984, pp. 721-741.

[GID 85] Gidas, B.; "Nonstationary Markov Chains and Convergence of the Annealing Algorithm", Journal of Statistical Physics, V 39, Nos. 1/2; 1985; pp. 73-131.

[GOR 77] Gordon, A.D., and J.T. Henderson; "An Algorithm for Euclidean Sum of Squares Classification", Biometrics, V 33, June 1977, pp. 355-362.

[GOR 81] Gordon, A.D.; Classification; Chapman and Hall Ltd., 1981.

[GRE 84] Greene, J.W. and K.J. Supowit; "Simulated Annealing without Rejected Moves", Proceedings of the IEEE International Conference on Computer Design, Port Chester, November 1984, pp. 658-663.

[HAM 64] Hammersley, J.M. and D.C. Handscomb; Monte Carlo Methods; John Wiley and Sons; 1964; Chapter 9, pp. 113-126.

[HUB 76] Hubert, Lawrence and James Schultz; "Quadratic Analysis as a General Data Analysis Strategy", Br. J. math. statist. Psychol., V 29, 1976, pp. 190-241.

[HUB 85] Hubert, L., and P. Arabie; "Comparing Partitions", Journal of Classification, V 2, 1985; pp. 193-218.

[KIR 83] Kirkpatrick, S., C.D. Gelatt Jr., and M.P. Vecchi; "Optimization by Simulated Annealing", Science, V 220, No. 4598; May 13, 1983; pp. 671-680.

[KIR 84] Kirkpatrick, S.; "Optimization by Simulated Annealing: Quantitative Studies", Journal of Statistical Physics, Vol. 34, Nos. 5/6; 1984; pp. 975-986.

[KIR 85] Kirkpatrick, S. and R.H. Swendsen; "Statistical Mechanics and Disordered Systems", Communications of the ACM, V 28, No. 4; April 1985; pp. 363-373.

[KRU 71] Kruskal, Joseph B.; "Comments on 'A Nonlinear Mapping for Data Structure Analysis'", IEEE Transactions on Computers, V C-20, December 1971, p. 1614.

[LUN 86] Lundy, M. and A. Mees; "Convergence of an Annealing Algorithm", Mathematical Programming, V 34; 1986; pp. 111-124.

[MAF 86] Maffioli, F.; "Randomized Algorithms in Combinatorial Optimization: A Survey", Discrete Applied Mathematics, V 14; 1986, pp. 157-170.

[MAN 67] Mantel, Nathan; "The Detection of Disease Clustering and a Generalized Regression Approach", Cancer Research, V 27, Part 1, February 1967, pp. 209-220.

[MET 53] Metropolis, N., A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller; "Equations of State Calculations by Fast Computing Machines", The Journal of Chemical Physics, V 21, No. 6; June 1953; pp. 1087-1092.
[MIT 86] Mitra, Debasis, Fabio Romeo, and Alberto Sangiovanni-Vincentelli; "Convergence and Finite-Time Behavior of Simulated Annealing", Advances in Applied Probability, V 18, 1986, pp. 747-771.

[OTT 84] Otten, R.H.J.M. and L.P.P.P. van Ginneken; "Floorplan Design Using Simulated Annealing", IEEE International Conference on Computer Aided Design (ICCAD-84); November 1984; Santa Clara; pp. 96-98.

[RAN 86] Randelman, R.E. and G.S. Grest; "N-City Traveling Salesman Problem: Optimization by Simulated Annealings", Journal of Statistical Physics, Vol. 45, Nos. 5/6, 1986, pp. 885-890.

[REE 87] Rees, Stephen, and Robin C. Ball; "Criteria for an optimum simulated annealing schedule for problems of the travelling salesman type", Journal of Physics A, V 20, April 1987, pp. 1239-1249.

[ROM 84] Romeo, F., A. Sangiovanni-Vincentelli, and C. Sechen; "Research on Simulated Annealing at Berkeley", Proceedings of the IEEE International Conference on Computer Design, Port Chester, November 1984, pp. 652-657.

[ROM 85] Romeo, F., and A. Sangiovanni-Vincentelli; "Probabilistic Hill Climbing Algorithms: Properties and Applications", Proceedings of the 1985 Chapel Hill Conference on VLSI; pp. 393-417.

[ROS 86] Rossier, Y., M. Troyon, and Th. M. Liebling; "Probabilistic Exchange Algorithms and Euclidean Traveling Salesman Problems", OR Spektrum, V 8, 1986, pp. 151-164.

[SAM 69] Sammon Jr., J.W.; "A Nonlinear Mapping for Data Structure Analysis", IEEE Transactions on Computers, V C-18, No. 5; May 1969; pp. 401-409.

[SMI 83] Smith, W.E., H.H. Barrett, and R.C. Paxman; "Reconstruction of Objects From Coded Images by Simulated Annealing", Optics Letters, V 8, No. 4; April 1983; pp. 199-201.

[VAN 86] van Laarhoven, P.J.M. and E.H.L. Aarts; "Simulated Annealing: A Review of the Theory and Applications", c/o Philips Research Labs.

[VAN 84] Vanderbilt, D., and S.G. Louie; "A Monte Carlo Approach to Optimization over Continuous Variables", Journal of Computational Physics, V 56, 1984, pp. 259-271.

[WHI 84] White, S.R.; "Concepts of Scale in Simulated Annealing", Proceedings of the IEEE International Conference on Computer Design, Port Chester, November 1984, pp. 646-651.