A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM

Thesis for the Degree of Ph.D.
Michigan State University
Vasudeva Ananda Rao
1971

This is to certify that the thesis entitled A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM presented by Vasudeva Ananda Rao has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science. Major professor. Date: December 8, 1971.

ABSTRACT

A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM

By Vasudeva Ananda Rao

A new mathematical model is proposed for the clustering problem encountered in data analysis and pattern recognition. The set of multivariate observations on the objects to be grouped is considered as having been generated by an unknown multivariate continuous probability distribution having one or more distinct modes. This distribution is not treated as a mixture of several source pdfs. The clustering problem is identified as that of (1) the estimation of the number and location of the modes of such a distribution and (2) the selection of a suitable distance measure to group the observations based on a 'similarity measure' defined as the distance of each observation from the modes. A practical method is proposed for estimating the number and location of the modes and for detecting the structure in the data in the form of clusters. This method does not require that the number of clusters desired be specified in advance. For pattern classification in the absence of training patterns from each class, the clusters so detected are treated as the sets of training patterns for "learning" purposes. An algorithm is presented for classification of patterns from an unknown class using a minimum distance-to-mode decision rule. Apart from the Euclidean and Mahalanobis generalized distance measures, two other intuitively appealing distance measures are also discussed. Excellent numerical results have been obtained using test data.

A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM

By Vasudeva Ananda Rao

A THESIS submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Computer Science, 1971.

TO MY MOTHER

ACKNOWLEDGEMENTS

The people to whom one is indebted in acquiring an education are indeed numerous. It is hoped that these people will accept my sincere thanks. I am highly grateful to Dr. B. Weinberg for his guidance and encouragement. He suggested the problem as a potentially fruitful area for research and devoted a generous part of his time to discussions leading to this thesis. For their guidance and interest I wish to thank the other members of my doctoral committee: Dr. R.C. Dubes, Dr. C.V. Page, Dr. M. Rahimi of the Department of Computer Science and Dr. J.S. Frame of the Department of Mathematics. For the financial support which made this work possible I wish to express my gratitude to: the United Nations Educational, Scientific and Cultural Organization, Paris; the Government of India, New Delhi; the Department of Technical Education, Government of Tamilnadu, Madras; and the Department of Computer Science, Michigan State University. Special thanks are due to my mother, wife and children for their patience and understanding.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter I    Motivation
    1.1  Introduction
O O O O C O I O O O O O O C C C O C O O 1 1.2 Pattern Recognition and Cluster Analysis ... 1 1 .3 Literature survey 0 C O O I O O C O O O O I O O O O O C O O O O O O O 4 1.4 Contributions of the Thesis ....... ......... 8 1.5 Organization of the Thesis .. ........ . ...... 10 II Mathematical Preliminaries ....... ..... ..... ..... 11 2.1 Introduction ...... ......................... 11 2.2 A Mathematical Model for the Clustering PrOblem ......00............OOCOOCOOOOOOOOOO 11 2.3 Estimation of Modes and Mathematical Results CO.........0.0.0.000...00.00.0000... 13 2.4 Chapter Summary ............................ 17 III The Mode Seeking Algorithm ............... . ...... 18 3.1 Introduction ............................... 18 3.2 Mode Estimation Procedure for the Multivariate Case ... ..... .......... ...... .. 19 3.3 Mathematical Details of the Algorithm ...... 20 3.4 Algorithm for Mode Estimation - the Univariate case I C O O I O C O O C O O O I O O O O O O O I I O O O O I 26 3.5 Results Of Tests 0 O O O O O O O O O O O O O O O O O O O O O O O O O O 33 3 O 6 Chapter sumry O I C O O O O O O O O O O O O O O O O C O O O O O O O O 39 IV Cluster Seeking and Pattern Classification ...... 41 4 .1 IntrOduction O O C O O O O O O O O I O O O O O C O O O O O O O O O O O O O 41 4.2 A Similarity Measure and an Algorithm for Clustering .0......OOOOOOOOOOOOOOOOOO0...... 42 4.3 Distance Measures for Classification ....... 43 4.4 A Classification Algorithm ................. 46 4.5 Results ......C.......COCOOOOOOOOOOOOOOO...O 46 4.6 Chapter Summary ..... ....... . ...... . ........ 48 iv Chapter Page V Summary and Conclusions ...................... ... 52 5 .1 T118818 sumry O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 52 5 .2 conCIuSions O O Q C I O O O O O O O O O O O O O O O O 0 O O O O O O O C O O 53 5.3 Suggestions for Future Work ................ 54 Bibliography ... ...... ........................... 56 AppendiXA .........O..........OOOOOOOOOOOOOOOO 61 Table LIST OF TABLES Results of mode estimation - simulated data ..... Results of mode estimation - real data ..... Results of clustering and classification algorithm I00............OIOOOOOOOOOOOOOO Results of classification using training samples ......0.0..........OOOOOOOOOOOOO. vi Page 34 38 49 50 Figure LIST OF FIGURES Flow chart for multivariate mode estimation ..... Two dimensional mode estimation example ......... Flow chart for univariate mode estimation ....... Flow chart for the clustering algorithm ........ Flow chart for classification .................. Explanation of the asymmetrical distance measure vii Page 21 23 30 44 47 66 CHAPTER I Motivation 1.1 Introduction: In many diverse disciplines the scientist collects data in the form of p measurements or observations on each of N indi- viduals or objects. He is interested in grouping these individuals into subgroups in such a manner that members of a subgroup are highly similar or associated and relatively unassociated with members not belonging to the subgroup. 
For example: in electrical engineering, one is interested in the detection of signals of unknown characteristics that recur frequently in a background of random noise; in medicine, to group electrocardiograms or EEGs into subgroups [D-1]; in psychology and sociology, to group people into types that may relate to treatment categories or behaviour categories; in information retrieval, to find classes of descriptors for articles and papers [G-1]; in numerical taxonomy, to group species of living organisms into hierarchic trees, etc. Such grouping is a very valuable tool in data analysis.

1.2 Pattern Recognition and Cluster Analysis:

One of the aims of pattern recognition study is to find meaningful descriptions that adequately characterize a set of data. The principal objective of cluster analysis as applied to pattern recognition is to gain more information about the structure of a data set than is possible by more conventional methods such as factor analysis or principal components analysis [N-1].

In the statistical sense the pattern classification problem (two-class case) may be defined as a discriminatory problem as follows: a vector random variable X of observed values x is distributed over some p-dimensional space according to distribution F or G. The problem is to determine which of the two distributions x came from.

A statistical approach to the above problem can be considered to fall into one of three subproblems:

(i) F and G are completely known.
(ii) F and G are known with the exception of some finite set of parameters.
(iii) F and G are unknown except possibly for assumptions about the existence of densities, continuity, symmetry, etc.

Subproblem (i) has been completely solved by the Neyman-Pearson lemma and results in a likelihood ratio (LR) test. This LR test yields optimum results in the sense of minimum probability of misclassification.

In the formulation of the approach to subproblem (ii) one assumes the availability of training samples from each of the two distributions. It is also assumed that the forms of F and G are given but one or more parameters are unknown. The problem then reduces to one of parameter estimation, or one can also employ learning strategies. Once these parameters have been estimated using the training samples, the LR test is applied as though F and G were completely known.

The "parametric" approaches to subproblems (i) and (ii) appear reasonable provided the assumptions made are justifiable in practice, in that the assumed parametric forms are good representations of the data. But when little a priori knowledge exists about the underlying probability distribution associated with each pattern class, these parametric approaches to the classification problem become questionable in the sense that bad results may be obtained. This conclusion has led researchers to require less stringent assumptions about the form of the data structure and this, in turn, resulted in the emergence of a variety of nonparametric pattern recognition procedures as approaches to subproblem (iii).

One such approach is to treat the pdf of the observations as a mixture of several unknown source pdfs and then try to identify these pdfs. Such an approach has been considered, among others, by Teicher [T-1, T-2, T-3], Yakowitz [Y-1, Y-2], Yakowitz and Spragins [Y-3], and Stanat [S-5].
Application and development of this technique under the name of "unsupervised learning" or "learning without a teacher" has been mainly done by Fralick [F-3], Spragins [S-4], Patrick [P-3], Hilborn and Lainiotis [H-1], Patrick and Costello [P-4] and Patrick and Hancock [P-2]. However, it is known that the class of mixture distributions which have a unique solution for the parameters of the individual distributions constituting the mixture is limited, and whether or not a mixture admits a unique solution depends on its identifiability [F-6, Y-1, Y-2, Y-3].

Cluster analysis is another type of approach which has been pursued for a solution to subproblem (iii). It is a non-parametric technique to determine a type of structure describing a set of empirical data. A second way that cluster detection is applicable to pattern recognition is by providing an answer to the question whether or not a given set of features constitutes a good feature space in which to discriminate a given set of pattern classes [Z-1].

1.3 Literature Survey:

Clustering techniques have been used as long ago as 1939 by Tryon [T-4]. At present there are a number of proposed clustering procedures available. Widely used in numerical taxonomy [S-3] are agglomerative and divisive hierarchical clustering schemes. Here a small and fixed set of patterns is given and a matrix is computed whose (i,j)-th entry is the association or similarity between the i-th and j-th patterns. The agglomerative procedures [L-2] generally link together the most similar patterns. Then the similarities between the groups of grouped patterns and the remaining groups (or patterns) are recomputed using the minimum, maximum, or mean similarity between the two groups. The procedure continues in this manner, linking together the most similar patterns or groups. Michener and Sokal [M-4], Ward [W-1], and McQuitty [M-2] were early users of this scheme. In a recent paper Rohlf [R-2] describes sequential agglomerative hierarchical clustering schemes in particular detail and proposes several new methods.

The divisive procedures [L-2] begin with all patterns in the same group and split the group into the two most dissimilar groups. Edwards and Cavalli-Sforza [E-1] suggest dividing the points or patterns into two groups such that the sum of squared distances between the sets is a maximum. They define this as a cluster and suggest an algorithm to find the two sets with the desired property. Because the total sum of squared distances is a constant for a given sample of points, maximizing the between-set sum of squared distances is equivalent to minimizing the within-set sum of squared distances. Their algorithm is to examine all $2^{N-1} - 1$ partitions of the N points and select the one which gives the minimum within-set sum of squared distances. Lance and Williams [L-1] suggest successively splitting the groups in a way which is expected to reduce the variance the most for the split groups. Mattson and Dammann [M-1] suggest successively splitting each group by thresholding the dominant eigenvector of the covariance matrix of that group. Wirth et al. [W-4] suggest thresholding the association or similarity matrix and defining the components of the resulting graph as clusters. Thresholding is done successively from strict thresholds to more liberal thresholds.
Jardine and Sibson [J-1] outline a theoretical framework within which the properties of classificatory systems, which operate on data in the form of a dissimilarity coefficient on a set of objects, may be discussed. Hierarchical clustering schemes are also discussed by Johnson [J-2]. The three superficially different hierarchical clustering schemes of Sokal and Michener [M-4], Edwards and Cavalli-Sforza [E-1], and Williams and Lambert [W-3] have been compared by Gower [G-2] and suggestions made for their improvement.

Most popular among the non-hierarchical clustering schemes have been those iterative schemes beginning with an arbitrary set of all-inclusive and mutually exclusive clusters and successively improving the set of clusters by transferring patterns from one cluster to another until no further improvement is available. Such methods use as an evaluation index what has been called the "C criterion" by Switzer [S-6]. As a means of evaluating any given partition of the sample, the within-cluster distance or scatter [W-2], the between-cluster distance, or the ratio of total scatter to within-cluster scatter is used [F-4]. To circumvent the time-consuming computational difficulty involved in examining all possible partitions of the sample [F-2], Friedman and Rubin [F-4] use what they call a "hill climbing" algorithm. In principle their procedure attempts to examine only those partitions of the data for which the ratio of total scatter to within-group scatter is high. The logic of this technique can also be found in [F-5].

A technique for clustering which is very popular is the ISODATA (Iterative Self-Organizing Data Analysis Technique (A)) of Ball and Hall [B-1]. This technique clusters all of the data into distinct and independent groups. A computed mean or average response pattern is used to represent a group of patterns, and the iterative process creates new average response patterns to improve the accuracy of trial or existing average response patterns. The process also combines average response patterns that are so similar that their being separate fails to provide a significant amount of information about the structure of the response patterns. Each response pattern is put into that group for which the squared distance between it and the average response pattern (group mean) is the least. ISODATA implements an intuitively appealing mathematical idea for clustering patterns. It is not itself a standard statistical procedure, such as analysis of covariance or factor analysis. The order of computations and the setting of the various thresholds in the algorithm were motivated by heuristic reasoning and the performance of the program with standard data sets [D-2].

Jones and Jackson [J-3] suggest an iterative technique where clusters are found one at a time. An initial pattern is picked to be the first pattern in the cluster. Patterns are successively transferred into and out of the cluster in a way which increases the within-cluster similarities and decreases the in-cluster to out-cluster similarities.

Graph-theoretical procedures have also been applied to clustering. Bonner [B-2] starts out by thresholding the association or similarity matrix and defining as "core clusters" the maximal complete subgraphs (cliques) of the resulting graph. Then the smaller core clusters are merged into large core clusters, and largely overlapping core clusters are merged. Gotlieb and Kumar [G-1] discuss graph-theoretical clustering methods useful in information retrieval applications.
Zahn [Z-1] describes graph-theoretical algorithms, based on the minimal spanning tree (MST) of a graph, which are capable of detecting several kinds of cluster structure in arbitrary point sets. The concept of the MST for single linkage cluster analysis (SLCA) is also discussed by Gower and Ross [G-3]. They show that all the information required for the SLCA of a set of points is contained in the MST of the graph of these points. Augustson and Minker [A-2] also analyze some graph-theoretical clustering techniques as applied to information retrieval.

The statistical technique of principal components analysis used for clustering of multivariate data involves treating the N observations in p dimensions as N points in a p-dimensional metric space and projecting these points onto a space of smaller dimension with minimum loss of statistical information; that is, the inherent structure in the data is approximately preserved under the mapping. The primary interest is in mapping onto two or three dimensions, since the resultant data configuration can be easily evaluated by human observation in three or fewer dimensions. Rao [R-1] discusses the application of this technique to clustering. Nunnally [N-3] describes the application of factor analysis procedures in clustering. Dubes [D-2] has an interesting report on cluster analysis and decision making with a correlation matrix, wherein he describes minimum-average-distance clustering and mean-squared clustering with particular reference to Gaussian distributions.

A non-linear mapping technique useful in clustering multivariate data is discussed by Sammon [S-1]. The idea behind Sammon's method is similar to that of Kruskal [K-2], who discusses the technique as applied to non-metric hypotheses. An interesting new mathematical formulation of the clustering problem has been provided by Ruspini [R-3].

1.4 Contributions of the Thesis:

From the literature survey of the last section it is clear that, basically, cluster analysis techniques call for the identification of those regions of the observation space where the patterns are most heavily concentrated, thereby establishing a structure in the data in the form of clusters. In this thesis, it is believed that, to detect such a structure, it is not absolutely necessary (1) to treat the underlying pdf of the observations as a mixture of several component pdfs and then (2) try to identify the individual source pdfs. It is sufficient to consider that the underlying pdf is, in general, governed by a multivariate continuous distribution with one or more distinct modes. Since the mode of a pdf is that outcome which is likely to occur most often, the estimation of the number and location of the modes provides a clue to the structure of the data. Accordingly, the method for clustering proposed in this thesis is based on the premise that such an estimation of the number and location of the modes of a multimodal multivariate pdf underlying a set of empirical data is a key to a solution to the clustering problem. Thereafter, the selection and application of a suitable distance measure determines the identity and membership of the clusters in a straightforward manner. This intuitively appealing idea, not found in the literature, is presented and offered as a practical solution for the clustering problem. The performance of this method on data sets, both real and artificial, is found to strengthen the belief in such an approach.
Specifically, the contributions of this thesis are:

(i) A new mathematical model is proposed for the clustering problem. The set of multivariate observations on the objects to be clustered is assumed to have been generated by a multivariate continuous probability distribution whose density function has one or more distinct modes. This pdf is not treated as a mixture of several source pdfs, as is sometimes done in the classification problem [Y-1] of pattern recognition.

(ii) In his discussion of nonparametric decisions based on distance to modes, Nilsson [N-2] assumes the existence of a method to find good estimates for the modes of a probability density function, given the set of training samples. This thesis provides a practical method, guided by statistical theory, for the estimation of such modes. Once the modes have been estimated from a set of training samples, a starting point for the abstraction phase of the pattern recognition problem is available. The abstraction phase of the problem can then be completed using, for example, a piecewise linear machine implementing a minimum distance classifier.

1.5 Organization of the Thesis:

Chapter II presents the proposed model for clustering and discusses the relevant mathematical preliminaries. Chapter III describes an algorithm for the estimation of the number and location of the modes of a multivariate, multimodal pdf. Chapter IV extends the mode seeking algorithm to clustering and summarizes the results obtained on a few real and simulated sets of data, using both supervised and unsupervised learning techniques. Conclusions and suggestions for future work are included in Chapter V.

CHAPTER II
Mathematical Preliminaries

2.1 Introduction:

In this chapter the theoretical preliminaries required for a solution to the clustering problem are discussed. The problem is stated in the framework of a simple mathematical model. The basic assumptions made are:

(i) that there exists a probability distribution which generates the multivariate observations to be clustered;
(ii) that the probability density function (pdf) of this distribution is of the continuous type, characterized by one or more distinct modes (the word mode, as used here, denotes the location of a local maximum in the pdf; assumption (ii) implies that the pdf need not necessarily be symmetric);
(iii) that the number of observations is large compared with the number of dimensions of each observation; and
(iv) that the observations are independent.

2.2 A Mathematical Model for the Clustering Problem:

Let $x_1, x_2, \ldots, x_N$ denote the p-dimensional observations (features, measurements) of N objects. Assume these observations are values of a random variable $X = (X_1, X_2, \ldots, X_p)$ having a cumulative distribution function (cdf) $F(x)$ and the corresponding probability density function (pdf) $f(x)$. Let $f(x)$ be continuous and characterized by G distinct modes, that is,

$$f(x) = f(x_1, x_2, \ldots, x_p;\ M_1, M_2, \ldots, M_G)$$

where $M_i$, $i = 1, 2, \ldots, G$, denotes the i-th distinct p-dimensional mode vector of $f(x)$. Then the clustering problem is:

(i) Estimate G, the number of modes of $f(x)$;
(ii) Estimate the column vectors $M_i$, $i = 1, 2, \ldots, G$; and
(iii) Define a distance measure $d(x_j, M_i)$, $j = 1, 2, \ldots, N$; $i = 1, 2, \ldots, G$, which partitions the set of observations into G groups $\pi_1, \pi_2, \ldots, \pi_G$ such that $x_j \in \pi_i$ if and only if $d(x_j, M_i) \le d(x_j, M_k)$ for all $k \ne i$.

2.3 Estimation of Modes and Mathematical Results:

Let $x_1, x_2, \ldots, x_N$ be independent observations on X, let z be an arbitrary point of the observation space, and let $\{r_N\}$ be a sequence of positive integers such that

$$\lim_{N \to \infty} r_N = \infty \quad \text{and} \quad \lim_{N \to \infty} \frac{r_N}{N} = 0 .$$

Let $d_z(r_N)$ denote the distance from z to the $r_N$-th closest observation.

Theorem 2.1 (Loftsgaarden and Quesenberry): Under the above hypothesis, the estimate of $f(x)$ at the point z, given by the estimator
$$f_N(z) = \frac{(r_N - 1)/N}{\dfrac{\pi^{p/2}}{\Gamma\!\left(\frac{p}{2} + 1\right)}\,\big[d_z(r_N)\big]^p} \qquad (2.1)$$

is consistent.

Proof: The proof of the above theorem can be found in [L-3]. Loftsgaarden and Quesenberry suggest the use of an integer closest to $\sqrt{N}$ as a value of $r_N$.

The following corollary to the above theorem is obtained in the univariate case (p = 1): a consistent estimate of $f(x)$ at z is given by

$$f_N(z) = \frac{(r_N - 1)/N}{2\, d_z(r_N)} \,. \qquad (2.2)$$

Lemma 2.1 (Owen) [O-1]: Let $f(x)$ be a uniformly continuous probability density function and let $\{f_N(x)\}$ be a sequence of estimates of $f(x)$. Let $M_1$ be the value of x, assumed to be unique, where $\max_{x \in I_1} f(x)$ occurs, where $I_1$ is an interval over which $f(x)$ is continuous. Further, let $M_{1N}$ be the value of x where $\max_{x \in I_1} f_N(x)$ occurs. If $f_N(x)$ is a consistent estimate of $f(x)$, then $M_{1N}$ is a consistent estimate of $M_1$.

Proof of this lemma has been given by Owen.

Extension to Multiple Mode Estimation:

As before, let $x_1, x_2, \ldots, x_N$ be independent observations on a univariate random variable and let $d_z(r_N)$ be defined as before. Let $f(x)$ be uniformly continuous and positive over an interval $I_1$ and let there exist a unique mode $M_1 \in I_1$. Let a new estimate of the mode $M_1$ be formed as follows: $M'_{1N} = x_{(k)}$, where $x_{(k)}$ is the k-th smallest observation from the set of N observations (the k-th order statistic) and

$$k = \arg\Big[\min_i \big\{ d_{x_{(i)}}(r_N) : x_{(i)} \in I_1 \big\}\Big];$$

that is, the estimate $M'_{1N}$ of the mode is that order statistic contained in the interval $I_1$ which yields the minimum value of $d_z(r_N)$, as computed by the procedure of Loftsgaarden and Quesenberry.

Further, let $f_N(x)$, for $x \in I_1$, be constructed in a stepwise fashion according to

$$f_N(x) = \frac{(r_N - 1)/N}{2\, d_{x_{(i)}}(r_N)}\,, \qquad x_{(i)} \le x \le x_{(i+1)} .$$

Lemma 2.2 (Moore and Henrichon): The estimate $f_N(x)$ converges uniformly, in probability, to $f(x)$, $x \in I_1$.

Proof: see [M-5].

From Lemmas 2.1 and 2.2,

Theorem 2.2: $M'_{1N}$ is a consistent estimate of the mode $M_1$.

Proof: By Lemma 2.2 we know that the estimate $f_N(x)$, constructed in the manner indicated, converges in probability to $f(x)$, $x \in I_1$. Applying Lemma 2.1 and the fact that $\max_{x \in I_1} f_N(x)$ occurs at the value of $x_{(i)}$ where $d_{x_{(i)}}(r_N)$ is minimum, the theorem follows. Q.E.D.

For estimation of the modes in the case of a multimodal, univariate pdf, consider the set of L relative maxima (assumed unique) $\{m_i\}_{i=1}^L$ located at modes $\{M_i\}_{i=1}^L$ respectively, such that $M_i \in I_i = (a_{2i-1}, a_{2i})$, $i = 1, 2, \ldots, L$, where $a_1 < a_2 < \cdots < a_{2L}$. Further, let the pdf $f(x)$ be uniformly continuous and positive over the intervals $\{I_i\}_{i=1}^L$, and let the estimates $M_{kN}$ of $M_k$ be determined by

$$M_{kN} = x_{(j)}\,, \quad\text{where}\quad j = \arg\Big[\min_i \big\{ d_{x_{(i)}}(r_N) : x_{(i)} \in I_k \big\}\Big].$$

Then, as an immediate consequence of Theorem 2.2,

Corollary: The set of estimates $\{M_{kN}\}_{k=1}^L$ converges in probability to the set $\{M_i\}_{i=1}^L$.

2.4 Chapter Summary:

In this chapter a new model for cluster analysis was proposed. The problem of cluster detection was identified with that of estimating the number of modes, and the modes themselves, of a multimodal pdf with distinct modes. The remainder of the chapter related to the theoretical aspects of mode estimation.

CHAPTER III
A Mode Seeking Algorithm

3.1 Introduction:

The goal of this chapter is to provide a computational procedure to estimate the number, G, and the p-dimensional vectors $M_j$, $j = 1, 2, \ldots, G$, of the modes of the multimodal, multivariate pdf $f(x)$ of Section 2.2.
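The estimator (2.2) and the order-statistic rule of Theorem 2.2 translate directly into a short computation. The following Python fragment is a minimal sketch of that computation, not the thesis's own CDC 6500 program; the function names and the synthetic test sample are illustrative, and $r_N$ is taken as the integer nearest $\sqrt{N}$, as Loftsgaarden and Quesenberry suggest.

```python
import numpy as np

def lq_density(z, sample, r_n):
    """Univariate Loftsgaarden-Quesenberry estimate (2.2):
    f_N(z) = ((r_N - 1)/N) / (2 d_z(r_N))."""
    d = np.sort(np.abs(sample - z))[r_n - 1]   # d_z(r_N): r_N-th closest observation
    return (r_n - 1) / (len(sample) * 2.0 * d)

def mode_estimate(sample, interval, r_n):
    """Theorem 2.2: the mode estimate on I_1 = (a, b) is the order statistic
    x_(k) in I_1 that minimizes d_{x_(k)}(r_N)."""
    a, b = interval
    candidates = sample[(sample > a) & (sample < b)]
    # index r_n, not r_n - 1: position 0 is the zero self-distance of x_(k)
    dists = [np.sort(np.abs(sample - z))[r_n] for z in candidates]
    return candidates[int(np.argmin(dists))]

rng = np.random.default_rng(0)
x = np.sort(rng.normal(0.0, 1.0, 400))         # unimodal test sample
r_n = int(round(np.sqrt(len(x))))              # r_N = integer nearest sqrt(N)
print(mode_estimate(x, (-1.0, 1.0), r_n))      # should fall near the true mode 0
```

For the multimodal case of the corollary, the same mode_estimate call is simply repeated once per interval $I_k$.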
The mathematical results of the last chapter were directed towards the estimation of the modes of a univariate pdf, because one of the problems encountered in a direct application of the mode seeking techniques to a multivariate problem consists of deciding how to store the boundaries which separate the observation space into regions containing only one mode. To appreciate this problem one has only to note that in the one-dimensional case these boundaries are simply points on the real line. In two dimensions the corresponding boundaries are curves, and it is not easy to store arbitrary curves in a computer. The problem is even more difficult in a higher dimensional space. One method of circumventing this difficulty is to project the multivariate observations onto the principal eigenvectors of the sample covariance matrix [M-1] to get a univariate set of data, estimate the modes in this one-dimensional space, and finally transform back to the original p-dimensional space.

The direction of the eigenvector associated with the largest eigenvalue of the sample covariance matrix of the observations is the direction of maximum dispersion. This is the reason for projecting the observations onto the eigenvectors rather than onto the coordinate axes of the p-dimensional observation space. If separation between the data does indeed exist, it should be more easily detected along the eigenvectors [H-2].

3.2 Mode Estimation Procedure for the Multivariate Case:

The proposed mode estimation procedure can be summarized as follows:

Step (i): Compute the principal eigenvectors associated with the sample covariance matrix determined from a set of independent observations.

Step (ii): Project the multivariate observations onto the first principal eigenvector to obtain a univariate data set.

Step (iii): Apply the procedure of Section 3.4 to estimate the number and locations of the local minima for this univariate set, and the modes along this eigenvector. If only one mode is found, go to step (iv); else, go to step (v).

Step (iv): Using the next principal eigenvector, repeat the procedure from step (ii). If all the eigenvectors have been processed and only one mode is found along each eigenvector, compute the location of the mode, transform back to the original space, and stop.

Step (v): Partition the observation space into regions by hyperplanes perpendicular to the eigenvector and passing through the points of minima. Classify the observations into subsets corresponding to these regions.

Step (vi): For each new region so obtained, repeat the procedure from step (i).

Figure 1 depicts the details of this procedure in flow chart form. An illustration of the procedure applied to a two-dimensional data set is given in Figure 2.

3.3 Mathematical Details of the Algorithm:

Let $x_1^*, x_2^*, \ldots, x_N^*$ denote the N p-dimensional column vectors corresponding to the p feature measurements on each object; that is,

$$x_i^* = \begin{bmatrix} x_{i1}^* \\ \vdots \\ x_{ip}^* \end{bmatrix};\qquad i = 1, 2, \ldots, N . \qquad (3.3.1)$$

The asterisk indicates that these are raw measurements. The j-th feature average, $m_j$, is defined as the sample mean for the j-th feature:

$$m_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}^*\,;\qquad j = 1, 2, \ldots, p \qquad (3.3.2)$$

and the vector of feature averages is

$$m = \begin{bmatrix} m_1 \\ \vdots \\ m_p \end{bmatrix}. \qquad (3.3.3)$$

[Figure 1. Flow chart for multivariate mode estimation. Starting with the whole observation space as region 1: store the multivariate observations; compute the sample covariance matrix of the data in the region and its eigenvectors; obtain a univariate data set by projecting the observations in the region onto each eigenvector in turn; estimate the number and location of modes for the univariate data set (Figure 3). If more than one mode is found, divide the observation space into regions by hyperplanes, find the observations falling in each region, and repeat for each region; otherwise, once all eigenvectors for the region have been processed, store the number and location of the modes.]
[Figure 2. Two dimensional mode estimation example. Panels (a)-(f) sketch the data and the univariate density estimates along eigenvectors 1 and 2 for the processing of regions 1 and 2; regions 3 and 4 are processed in a similar manner. The sketch is intended only for visualizing the steps involved in the algorithm.]

Define the normalized measurements as

$$x_{ij} = x_{ij}^* - m_j \qquad (3.3.4)$$

and the normalized observation vector as

$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}. \qquad (3.3.5)$$

These normalized vectors are arranged in the form of an (N × p) matrix [A]:

$$[A] = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}. \qquad (3.3.6)$$

The (p × p) sample covariance matrix [S] is

$$[S] = \frac{1}{N-1}\,[A]^T [A] \qquad (3.3.7)$$

and is assumed to be positive definite. Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of [S] arranged in order such that $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. The first principal (column) eigenvector, $C_1$, of [S] is the eigenvector corresponding to the eigenvalue $\lambda_1$. The remaining (column) eigenvectors are, respectively, $C_2, C_3, \ldots, C_p$, and the (p × p) matrix of eigenvectors is

$$[C] = [C_1, C_2, \ldots, C_p] .$$

In step (i) these eigenvectors $C_1, C_2, \ldots, C_p$ of [S] are computed, treating the entire observation space as one region. The univariate set of step (ii) is obtained by projecting the N observations onto $C_1$.

Let $V_k$ represent the univariate data set obtained by projecting the observations onto the k-th eigenvector $C_k$. Then

$$V_k = \begin{bmatrix} V_{1k} \\ V_{2k} \\ \vdots \\ V_{Nk} \end{bmatrix} = [A]\,C_k\,;\qquad k = 1, 2, \ldots, p . \qquad (3.3.8)$$

The p data sets so obtained can be arranged as the columns of an (N × p) matrix [V], given by

$$[V] = [A][C] . \qquad (3.3.9)$$

In step (iii) the mode seeking procedure of Section 3.4 is applied to the data set represented by the first column of [V]. If only one mode is detected for this set, the procedure is applied to the set given by the second column of [V], and so on, as mentioned in step (iv).

For the set of observations in each region of the observation space obtained in step (v), a sample covariance matrix [S], a matrix of eigenvectors [C] and a matrix [V] are obtained. Suppose that for the i-th region, only one mode is detected for each of the data sets corresponding to $V_1^{(i)}, V_2^{(i)}, \ldots, V_p^{(i)}$. The components of the mode vector for the i-th region, in the transformed space, are the elements of some row of $[V^{(i)}]$, say the j-th row. The mode vector for the region, in the original observation space, is therefore

$$M^{(i)} = \big[[C^{(i)}]^T\big]^{-1} V_j^{(i)T} + m^{(i)} = [C^{(i)}]\,V_j^{(i)T} + m^{(i)}, \quad\text{since } [C^{(i)}]^T [C^{(i)}] = [I] \qquad (3.3.10)$$

where $[C^{(i)}]$ is the matrix of eigenvectors for the i-th region,

$$V_j^{(i)} = \big[\,V_{j1}^{(i)}\ \ V_{j2}^{(i)}\ \cdots\ V_{jp}^{(i)}\,\big]$$

and

$$m^{(i)} = \begin{bmatrix} m_1^{(i)} \\ \vdots \\ m_p^{(i)} \end{bmatrix}$$

is the vector of feature averages for the $N_i$ observations in the i-th region.
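As a concrete summary of equations (3.3.1) through (3.3.10), the following sketch carries out steps (i) and (ii) of Section 3.2 in Python with NumPy. It is an illustration, not the original program; in particular, the use of eigh and the descending eigenvalue sort are implementation choices assumed here.

```python
import numpy as np

def principal_projections(x_raw):
    """Normalize the raw measurements (3.3.4), form the sample covariance
    matrix [S] = [A]^T [A] / (N - 1) (3.3.7), and project the observations
    onto the eigenvectors of [S], ordered by decreasing eigenvalue (3.3.9)."""
    n, p = x_raw.shape
    m = x_raw.mean(axis=0)                # vector of feature averages (3.3.3)
    a = x_raw - m                         # matrix [A] of normalized vectors
    s = (a.T @ a) / (n - 1)               # [S], symmetric positive definite
    lam, c = np.linalg.eigh(s)            # eigenvalues and column eigenvectors
    order = np.argsort(lam)[::-1]         # lambda_1 > lambda_2 > ... > lambda_p
    c = c[:, order]                       # matrix [C] of column eigenvectors
    return a @ c, c, m                    # [V] = [A][C], together with [C], m

def mode_in_original_space(c, v_row, m):
    """Equation (3.3.10): map a mode found in the projected coordinates back
    to the observation space, using [C][C]^T = [I] for orthonormal [C]."""
    return c @ v_row + m
```

Column k of the returned [V] is the univariate data set $V_k$ to which the procedure of Section 3.4 is applied; mode_in_original_space then recovers $M^{(i)}$ from the row of $[V^{(i)}]$ holding the one-dimensional mode components.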
3.4 Algorithm for Mode Estimation - The Univariate Case:

The theoretical formulation of the mode estimation problem for the univariate multimodal case, as presented in the last chapter, requires that those intervals on the real line where distinct modes are assumed to exist be known first. This is a condition which must be satisfied prior to the application of Theorem 2.2. Once these intervals have been identified, a consistent estimate of the mode of the underlying pdf of the observations falling in each interval can be obtained in a straightforward manner. The problem therefore reduces to that of finding the intervals $\{I_i\}_{i=1}^L$ on the real line.

One method which suggests itself immediately is to actually obtain the pointwise density estimate of the univariate pdf using the estimate

$$f_N(z) = \frac{(r_N - 1)/N}{2\, d_z(r_N)}$$

and to plot these values to identify the intervals containing the local extrema. The futility of this approach can be appreciated if one actually constructs such a plot: the local variations in the pointwise density estimate are too great to be of any use in isolating the intervals under consideration.

To overcome these difficulties, Fu et al. [F-7] suggest the following three "necessarily vague" guidelines:

(i) Some means of smoothing the pointwise approximation of the underlying density estimate is necessary;
(ii) Modes of an underlying density which are further apart from each other should be more readily detected; and
(iii) Modes which have a high probability mass associated with them should be more readily detected.

The approach adopted in this thesis is the construction of an equal bin-count histogram as a primary approximation to the underlying density. This histogram approach performs the integrating effect of guideline (i); that is, local variations in the pointwise density approximation will tend to be smoothed. A discussion of three types of histograms useful for analyzing a set of empirical data can be found in Dubes [D-1]; the equal bin-count histogram approximation was motivated by this discussion. The construction of histograms for estimating the modes has also been suggested by Sebestyen and Edie [S-2]. Such an approach may also be useful for achieving the three objectives referred to above, but their method is not used in this chapter and hence is not discussed further.

In an equal bin-count type of histogram the widths of the various bins indicate the concentration of the probability mass in any interval. Bins with a smaller width imply a higher concentration, just as in the equal bin-width type of histogram a bin with more height indicates that more observations lie in the interval specified by the bin width.

The first step in the algorithm to estimate the modes is the construction of a histogram. The number of bins desired, NB, is specified by the user and should preferably be selected such that the number of observations N (the sample size in the region) is a multiple of NB. Once this number is specified, the bin divisions are chosen to fulfill the condition that all bins have the same, or as close to the same as possible, number of counts. The smaller the number of bins specified, the greater the smoothing effect; that is, specifying a large number of observations per bin suppresses the local variations in the histogram approximation to a greater extent. The locations of the bin divisions carry the information required for further analysis.
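An equal bin-count histogram of this kind can be built directly from the ordered sample. The sketch below is a plausible construction; the thesis does not spell out exactly where its program placed the division points $D_i$, so placing each division midway between adjacent order statistics is an assumption made here.

```python
import numpy as np

def equal_bin_count_histogram(sample, n_bins):
    """Return the bin division points D_0, ..., D_NB of an equal bin-count
    histogram: each bin holds (as nearly as possible) N / NB observations,
    so a narrow bin marks an interval of high probability mass."""
    x = np.sort(sample)
    n = len(x)
    edges = [x[0]]
    for k in range(1, n_bins):
        i = (k * n) // n_bins                  # k-th equal-count split point
        edges.append(0.5 * (x[i - 1] + x[i]))  # divide between adjacent points
    edges.append(x[-1])
    return np.asarray(edges)
```

With N = 400 observations and NB = 20, for instance, each bin holds 20 observations, and the widths np.diff(edges) play the role that bar heights play in an equal bin-width histogram.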
An extrema seeking technique is then applied to this histogram in order to determine appropriate intervals in which to search for local maxima. The algorithm for selecting these intervals depends on a parameter, PARA, to be specified by the user. This parameter decides the value of a threshold which essentially determines how much local variation will be tolerated before a decision to specify an extremum interval is made.

The algorithm first seeks the bin divisions to be used in determining the intervals in which local minima exist between successive modes. The criterion used to store such an interval is that: (i) there exists a bin to the left of the chosen bin such that the difference in their widths is greater than the specified threshold, and (ii) there exists a bin to the right satisfying a similar condition. The threshold is calculated as PARA times the minimum bin width. From simulation studies it is found that values of PARA in the range 1 to 10 give good results. The points at which relative minima exist are then chosen as the midpoints of these bins.

The set of intervals $\{I_i\}_{i=1}^L$ in which modes are assumed to exist is now available. The techniques of the last chapter are then applied to estimate the mode in each interval. A flow chart for the algorithm is given in Figure 3; a sketch of the threshold criterion follows below.

[Figure 3. Flow chart for univariate mode estimation. Initialize: specify the number of bins for the histogram, $r_N$ for the Loftsgaarden and Quesenberry method, the threshold parameter PARA, and the minimum number of observations per region allowed, MCNT. Construct the equal bin-count histogram with bin locations $D_i$; find the minimum bin width and set threshold = PARA × (minimum width); search for bins whose widths satisfy the threshold criterion; take the midpoints of such bins as the local minima positions and store the interval boundaries containing each mode; count the observations in each interval, lumping adjacent regions and refixing the interval boundaries whenever a count falls below MCNT; compute the distance $d_z(r_N)$ from the observations in each interval to their $r_N$-th closest observation; the mode position is given by the observation (order statistic) with minimum $d_z(r_N)$; store the number of modes and the mode positions.]

From the flow chart it is observed that the number of modes is controlled by the following parameters, to be specified by the user:

(i) PARA, the parameter controlling the threshold used in finding the intervals where the local minima of the pdf exist; and
(ii) MCNT, the minimum number of observations the user allows in any region. If any region is found to contain fewer observations than MCNT, adjacent regions are lumped together until this criterion is satisfied.

The estimate of the location of the mode in any interval is controlled by the value of $r_N$ used in the expression for the pointwise density estimate derived by Loftsgaarden and Quesenberry. They suggest the use of an integer closest to $\sqrt{N}$, where N is the number of observations in a region. To allow a more general value, $r_N$ is taken as the integer closest to $\mathrm{PAR}\cdot\sqrt{N}$, where PAR is a third parameter to be specified by the user.
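The threshold criterion just described can be sketched as follows. The scan order and the treatment of several adjacent qualifying bins are not specified in the text, so the simple left-to-right scan below is an assumption.

```python
import numpy as np

def minima_positions(edges, para):
    """Midpoints of bins that are wider, by more than the threshold
    PARA * (minimum bin width), than some bin on each side: these are
    taken as the positions of the relative minima of the pdf."""
    widths = np.diff(edges)
    threshold = para * widths.min()
    minima = []
    for j, w in enumerate(widths):
        wider_than_left = any(w - widths[i] > threshold for i in range(j))
        wider_than_right = any(w - widths[i] > threshold
                               for i in range(j + 1, len(widths)))
        if wider_than_left and wider_than_right:
            minima.append(0.5 * (edges[j] + edges[j + 1]))
    return minima
```

Consecutive minima positions then bound the intervals $I_i$ in which the order-statistic mode estimate of Theorem 2.2 is computed.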
3.5 Results of Tests:

The algorithm presented in this chapter was tested on different sets of simulated as well as real data. The simulated sets of data were obtained by Monte Carlo methods. The results of such test runs are summarized in Tables 1 and 2. The time shown in each case refers to runs on a CDC 6500 digital computer.

Data set #1: Simulated data generated by a mixture of two univariate Weibull pdfs given by

$$f(x) = 0.4\,W(x;\,\alpha_1, \beta_1, \delta_1) + 0.6\,W(x;\,\alpha_2, \beta_2, \delta_2)$$

where

$$W(x;\,\alpha_i, \beta_i, \delta_i) = \begin{cases} \dfrac{\beta_i}{\alpha_i}\,(x - \delta_i)^{\beta_i - 1} \exp\!\Big[-\dfrac{1}{\alpha_i}\,(x - \delta_i)^{\beta_i}\Big], & x \ge \delta_i \\[4pt] 0, & x < \delta_i \end{cases} \qquad i = 1, 2;$$

$$\alpha_1 = \alpha_2 = 1.0\,;\quad \beta_1 = \beta_2 = 2.0\,;\quad \delta_1 = 0.0\,;\quad \delta_2 = 2.0\,.$$

[Table 1. Results of mode estimation - simulated data.]

Data set #2: Simulated data generated by a mixture of two-dimensional Gaussian distributions given by

$$f(x_1, x_2) = 0.3\,N(x_1, x_2;\,\mu_1, \Sigma_1) + 0.7\,N(x_1, x_2;\,\mu_2, \Sigma_2)$$

where

$$\mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix};\quad \mu_2 = \begin{bmatrix} -2.0 \\ -2.0 \end{bmatrix};\quad \Sigma_1 = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix};\quad \Sigma_2 = \begin{bmatrix} 0.25 & 0 \\ 0 & 0.25 \end{bmatrix}.$$

Referring to the results in Table 1, it should be noted that the estimated mode locations for both sets of data are for a mixture of the two distributions. They are not necessarily the same as the mode locations of the component pdfs. The theorem given below shows that such is indeed the case except under special conditions. This theorem and its proof are given only to explain the results obtained for the simulated data sets; it is pointed out that in the model proposed in Chapter II, the pdf is not treated as a mixture of pdfs.

Theorem 3.4.1: Let $f_1(x)$ and $f_2(x)$ be two continuous pdfs of the multivariate random variable $X = (X_1, X_2, \ldots, X_p)$ and let $f(x)$ be a mixture of $f_1$ and $f_2$ given by

$$f(x) = p_1 f_1(x) + p_2 f_2(x)\,;\qquad p_1 + p_2 = 1 .$$

If $x_m$ is a mode of $f(x)$ and $x_0$ is a mode of $f_1(x)$, then the mode $x_m$ is not the same as $x_0$ unless $\nabla f_2$ is zero at $x = x_m$.

Proof: At any mode of $f(x)$,

$$\nabla f(x) = p_1\,\nabla f_1(x) + p_2\,\nabla f_2(x) = 0 .$$

Since $\nabla f_1(x)\big|_{x = x_0} = 0$, locally, in a neighborhood of $x_0$, we have the expansions

$$\nabla f_1(x) = [B_1]\,(x - x_0) + \text{higher order terms},$$
$$\nabla f_2(x) = A_2 + [B_2]\,(x - x_0) + \text{higher order terms},$$

where $A_2 = \nabla f_2(x)\big|_{x = x_0}$, and $[B_1]$ and $[B_2]$ are square matrices depending on $x_0$. Retaining only the first order terms, the mode $x_m$ occurs when

$$p_1 [B_1]\,(x_m - x_0) + p_2 A_2 + p_2 [B_2]\,(x_m - x_0) = 0 ,$$

from which

$$x_m = x_0 - \big(p_1 [B_1] + p_2 [B_2]\big)^{-1} p_2 A_2 = x_0 - \Big[\tfrac{p_1}{p_2}[B_1] + [B_2]\Big]^{-1} A_2 ,$$

assuming that the indicated inverse exists. If $A_2 = \nabla f_2(x) = 0$ at $x = x_0$, then $x_m = x_0$. Q.E.D.

The last condition may be interpreted as meaning that either the two distributions are well separated in the observation space, or a mode of one of the densities occurs where the other density is constant.

The theoretical modes of the mixture pdf for the Weibull data are 0.7070 and 2.7035 respectively. Considering the fact that the estimated modes were for data produced by simulation, there is reasonably good agreement between the ideal and estimated mode locations.
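The quoted mixture-mode values can be checked numerically. The sketch below, an illustration rather than anything from the thesis, evaluates the Weibull mixture of data set #1 on a fine grid and picks out its interior local maxima; it recovers modes near 0.707 and 2.703, in agreement with the theoretical values quoted above (note that the first component alone has its mode at $1/\sqrt{2} \approx 0.7071$, and the small tail of the first component shifts the second mixture mode slightly left of $2 + 1/\sqrt{2}$).

```python
import numpy as np

def weibull_pdf(x, alpha, beta, delta):
    """W(x; alpha, beta, delta) of data set 1, zero for x < delta."""
    u = np.where(x >= delta, x - delta, 0.0)
    f = (beta / alpha) * u ** (beta - 1) * np.exp(-(u ** beta) / alpha)
    return np.where(x >= delta, f, 0.0)

# mixture of data set 1: 0.4 W(x; 1, 2, 0) + 0.6 W(x; 1, 2, 2)
x = np.linspace(0.0, 5.0, 500001)
f = 0.4 * weibull_pdf(x, 1.0, 2.0, 0.0) + 0.6 * weibull_pdf(x, 1.0, 2.0, 2.0)

# interior local maxima of the sampled density
interior = (f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])
print(x[1:-1][interior])       # two maxima, approximately 0.7071 and 2.7035
```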
The two artificial data sets were used only to illustrate the feasibility of the mode seeking algorithm. Applications to real data are considered next.

Data sets #3, 4, 5 and 6: The sets of data mentioned in Table 2 were selected from the 150 four-dimensional patterns used by Fisher in a classical paper [F-1]. Each pattern represents an iris and each feature represents a measurement on the iris. The patterns are divided into three species/classes called Setosa, Versicolor and Virginica, with fifty patterns in each class.

In each of the tests all four measurements were used for mode estimation. The parameter PAR was varied over the different runs to demonstrate its effect on the estimate of the mode locations. The iris data set has been used by many research workers to test the performance of algorithms proposed for clustering.

[Table 2. Results of mode estimation - real data.]

3.6 Chapter Summary:

[...]

CHAPTER IV
Cluster Seeking and Pattern Classification

4.1 Introduction:

[...]

4.2 A Similarity Measure and an Algorithm for Clustering:

[...]

4.3 Distance Measures for Classification:

[...] the quadratic form $(a - b)^T [S]^{-1} (a - b) > 0$; and $M_d$ is a vector representing the center of a constant unit distance ellipse. A detailed derivation of the measures (iii) and (iv) is given in Appendix A.

The 'asymmetric' distance measure is motivated by the belief that if the pdf underlying the observations in any cluster is asymmetric, in the sense that the mean is different from the mode, the distance measure chosen should reflect this skew. Because the dispersion of the observations is, in general, not the same in all directions, it is felt that the measure of distance used should reflect this.

4.4 A Classification Algorithm:

Based on the minimum distance-to-mode decision rule, an algorithm for classifying a pattern vector x from an unknown class is given in Figure 5 in flow chart form.

4.5 Results:

The clustering algorithm and its application to pattern classification using the different distance measures were tried on the sets of data used in Chapter III. Another set of real data tested consisted of measurements of different types of grain. Ehrlich and Weinberg [E-2] discuss how grain shape may be described as precisely as needed by a Fourier series expansion of the radius about the center of mass, utilizing co-ordinates of peripheral points. They also give illustrative examples to show that the shape variables easily discriminate grain differences arising from geographic, stratigraphic and process factors. In pattern recognition parlance this may be considered as their method of feature extraction. Professor Weinberg made available a set of eight feature measurements for each of the grains - Navy Bean, Wheat and Oats. As the sample size of each variety of grain is small compared with the number of features, only four of the eight features were chosen for the test. Specifically, the available sample sizes are:

Navy Bean: 46
Oats: 50
Wheat: 42

[Figure 5. Flow chart for classification. Clustering phase: estimate the modes of the given data set and identify clusters using the Euclidean metric. Learning phase: estimate $M^{(i)}$, the modes of the clusters, and choose a distance measure for classification. Classification phase: compute $d(x, M^{(i)})$, find $k = \arg\min_i \{d(x, M^{(i)})\}$, and decide $x \in$ class k.]

The results of these tests are summarized in Table 3. The mode seeking algorithm was also applied to estimate the modes of the pdf underlying the actual training sets, as distinct from the sets identified by the clusters. For each data set two types of tests were conducted: first using all the samples from each training set, and next using a subset of the set from each class. Both tests gave encouraging results.
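The decision rule of Figure 5 reduces to a few lines of code. The sketch below is an illustration, not the thesis program: it implements the minimum distance-to-mode rule with the Euclidean metric and, optionally, the Mahalanobis $D^2$ measure. The symmetric and asymmetric measures of Section 4.3 depend on the derivation in Appendix A and are not reproduced; the example modes are hypothetical.

```python
import numpy as np

def classify(x, modes, covs=None):
    """Minimum distance-to-mode rule: decide x belongs to class
    k = arg min_i d(x, M^(i)).  With covs supplied, d is the Mahalanobis
    D^2 measure (x - M^(i))^T [S^(i)]^{-1} (x - M^(i)); otherwise it is
    the squared Euclidean distance."""
    dists = []
    for i, m in enumerate(modes):
        diff = x - m
        if covs is None:
            dists.append(diff @ diff)                      # squared Euclidean
        else:
            dists.append(diff @ np.linalg.solve(covs[i], diff))
    return int(np.argmin(dists))

# example with two hypothetical cluster modes in the plane
modes = [np.array([0.0, 0.0]), np.array([-2.0, -2.0])]
print(classify(np.array([-1.8, -2.2]), modes))             # -> 1
```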
The results of the former type of test are tabulated in Table 4. An inspection of the results in Table 3 reveals that, with one exception, the reclassifications based on the initial cluster membership are worse almost in inverse proportion to the degree of intuitive sophistication of each distance measure. The reason for this is not known and has not been investigated further.

4.6 Chapter Summary:

In this chapter a clustering procedure was discussed, using the similarity measure defined by the distance of an observation point from the modes of the distribution which generated the set of observations. If the ultimate aim is clustering of a set of empirical data, the algorithm can be terminated at this stage.

[Table 3. Results of clustering and classification algorithm.]

[Table 4. Results of classification using training samples.]

However, if the aim is to detect a cluster structure for pattern classification analysis, the clusters generated could be used as sets of training patterns, their modes estimated, and a pattern vector x from an unknown class classified using a minimum distance-to-mode classifier. It is also demonstrated that if training subsets from known classes are available a priori, then the algorithm can be used for classification studies. Different distance measures are discussed, even though the two intuitively more sophisticated measures performed poorly on the data sets mentioned.

It may be recalled that in the mode seeking algorithm the observation space is partitioned into regions and the modes are estimated for the observations in each such region. To explain why the observations in these regions are not treated as the final clusters, one has only to note that the modes could also be estimated using other procedures which do not call for such partitioning. The algorithm to seek clusters may also be used with such mode seeking methods.

CHAPTER V
Conclusions and Suggestions for Future Work

5.1 Thesis Summary:

In this thesis a new model for the clustering problem is presented. The formulation of the problem is based on assumptions which are not severely restrictive.
The assumptions are of a nature similar to those generally made in most methods for the statistical analysis of the pattern recognition problem. The model is motivated by the belief that, to find a cluster structure in a set of empirical data, it is not necessary to treat the underlying pdf as a mixture of several source pdfs. The pdf can be treated, in its own right, as one governed by a multivariate continuous distribution with one or more distinct modes. This leads to the identification of the clustering problem with that of estimating the modes.

An algorithm is given to estimate these modes, and the estimated modes are used to detect the cluster membership. As no classified training samples from each pattern class are available a priori, the clusters so formed are taken to be the training sets for training a minimum distance classifier. However, if training patterns from each class are available in advance, it is shown that the mode seeking algorithm can be used to complete the abstraction phase of the pattern classification problem.

Apart from the standard Euclidean and the Mahalanobis $D^2$ measures, two heuristically motivated distance measures are discussed. Even though the reasoning involved in the derivation of these new measures seems intuitively sound, the measures performed poorly, contrary to expectation. However, the Mahalanobis $D^2$ measure gives very good results when used in conjunction with the proposed model and strengthens the belief in the suitability of the model.

5.2 Conclusions:

The clustering procedure proposed is useful in a number of situations. However, it is not hard to imagine examples where this approach may not give the desired results. One such situation is described by two concentric point sets in a plane. Ideally, one would like to obtain from a clustering algorithm applied to this set exactly two clusters. However, the scheme proposed here is likely to detect more than two clusters. It is doubtful whether any of the presently available clustering techniques would produce exactly two concentric clusters without rejecting any of the observations.

The method will certainly perform very well in situations where the ratio of the inter-cluster distance to the intra-cluster distance is high; that is, in situations where the data are well separated into groups in the observation space. Such a performance is, of course, to be expected of any good clustering algorithm.

As implemented, the algorithm is slightly more expensive as regards the computer time involved, compared with ISODATA operating
The manner of implementation adopted here did not give encouraging results. The reasons for instability, that is, why with a distance measure as implemented, the results diverge instead of converging to form 'ideal' clusters (or remain stationary) are worth investiga- tion. Other methods of implementing this notion may also be con- sidered for future work. It is quite possible that better results may be obtained. Iterations performed with all the four distance measures show no further improvement in performance. Even with the classical.Euclid and Mahalanobis distance measures, the iterative process diverges and the reason for this divergence requires further investigation. 55 Another area of future work is the estimation of the modes of a multivariate pdf directly from the Loftsgaarden and Quesenberry estimator of Chapter II, instead of first constructing a uni- variate data set. A stochastic approximation approach to estimate the pdf [K-l] may result in a computationally more economical method to solve the problem of mode estimation. BIBLIOGRAPHY [A-l] [A-Z] [3-1] [3-2] [0-1] [0-1] [12-11 [3-2] [F-l] BIBLIOGRAPHY Aizerman, M.A., Braverman, E.M., and L.I. Rozonoer, "Theoretical principles of potential function method in the problem of teaching automata to recognize classes", Autgmgtion and Remote Control, Vol. 25, No. 6, pp. 821- 837, 1964. Augustson, J.G. and J. Minker, "An analysis of some graph- theoretical cluster techniques", Technical Report NGL-21- 002-008, No. 70-106, University of Maryland, College Park, Maryland, 1970. Ball, G.H. and D.J. Hall, "ISODATA, a novel method of data analysis and pattern classification", Technical Report, Stanford Research Institute, Menlo Park, California, 1965. Bonner, R.E., "On some clustering techniques", I.B.M. Jour. Res. Dev., Vol. 8, pp. 22-32, 1969. Chernoff, H., "Estimation of the mode", Ann. of Inst. Stat. Math., Toyko, Vol. 16, pp. 31-41, 1964. Dubes, R.C., "Data reduction with grouping and Weibull models", Interim Scientific Report NO. 7, Div. of Eng. Res., Michigan State University, East Lansing, Michigan, 1970. , "Information compression, structure analysis, and decision making with a correlation matrix", Interim Scientific Report No. 11, Div. of Eng. Res., Michigan State University, East Lansing, Michigan, 1970. Edwards, A.W.F. and L.T. Cavalli-Sforza, “A method for cluster analysis", Biometrics, Vol. 21, NO. 2, pp. 362- 375, 1965. Ehrlich, R. and B. Weinberg, "An exact method for char- acterization of grain shape", Jour. of Sedimentapy Petrolo , Vol. 40, No. 1, pp. 205-212, 1970. Fisher, RsA., "The use of multiple measurements in taxonomic problems", Appals of Eugenics, 3, Part 2, pp. 179-188, 1936. 56 [F-B] [F-4] [F-S] [F-6] 11-71 [0-1] [0-2] [c-31 [G-4] [H-1] [3.21 [J-IJ [J-Z] 57 Fortier, J. and H. Solomon, "Clustering Procedures", in Multivariate Analysis - 1, Ed. P.R. Krishniah, Academic Press, New York, 1966. Fralick, 8.0., "Learning to recognize patterns without a teacher", IEEE Trans., Infom.lheory, Vol. IT-13, pp. 57-“, 19670 Friedman, H.P. and J. Rubin, "On some invariant criteria for grouping data", Jour. of Amer. Stat. Assoc., Vol. 62, ______, "The Logic of the Statistical Methods", The Borderline Syndrome, Ch. 5, Basic Books, New York, 1968. Fu, K.S., Seguethial Methods in Pattern Recognition and Machine Learning, Academic Press, New York, 1968. Fu, K.S. and E.G. Henrichon, Jr., "On non-parametric methods for pattern recognition", Technical Report No. 
TR-EE 68-19, School of Electrical Engineering, Purdue University, Lafayette, Indiana, 1968.

[G-1] Gotlieb, C.G. and S. Kumar, "Semantic clustering of index terms", Jour. ACM, Vol. 15, pp. 493-513, 1968.

[G-2] Gower, J.C., "A comparison of some methods of cluster analysis", Biometrics, Vol. 23, pp. 623-637, 1967.

[G-3] Gower, J.C. and G.J.S. Ross, "Minimum spanning trees and single linkage cluster analysis", App. Statistics, Vol. 18, No. 1, pp. 54-64, 1969.

[G-4] Grenander, U., "Some direct estimates of the mode", Ann. of Math. Stat., Vol. 36, pp. 131-138, 1965.

[H-1] Hilborn, C.G. and D.G. Lainiotis, "Optimal unsupervised learning multicategory dependent hypotheses pattern recognition", IEEE Trans. Inform. Theory, Vol. IT-14, pp. 468-470, 1968.

[H-2] Hotelling, H., "Analysis of a complex of statistical variables into principal components", Jour. Educ. Psych., Vol. 24, pp. 417-441, 1933.

[J-1] Jardine, N. and R. Sibson, "The construction of hierarchic and non-hierarchic classifications", Comp. Jour., Vol. 11, pp. 177-184, 1968.

[J-2] Johnson, S.C., "Hierarchical clustering schemes", Psychometrika, Vol. 32, pp. 241-254, 1967.

[J-3] Jones, K.S. and D. Jackson, "Current approaches to classification and clump-finding at the Cambridge Language Research Unit", Comp. Jour., Vol. 10, pp. 29-37, 1967.

[K-1] Kashyap, R.L. and C.C. Blaydon, "Estimation of probability density and distribution functions", Technical Report TR-EE67-14, School of Electrical Engineering, Purdue University, Lafayette, Indiana, 1967.

[K-2] Kruskal, J.B., "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis", Psychometrika, Vol. 29, No. 2, pp. 115-129, 1964.

[L-1] Lance, G.N. and W.T. Williams, "Computer programs for monothetic classification (association analysis)", Comp. Jour., Vol. 8, pp. 246-249, 1965.

[L-2] ________, "A general theory of classificatory sorting strategies, Part I: hierarchical systems", Comp. Jour., Vol. 9, pp. 373-380, 1967.

[L-3] Loftsgaarden, D.O. and C.P. Quesenberry, "A non-parametric estimate of a multivariate density function", Ann. of Math. Stat., Vol. 36, pp. 1049-1051, 1965.

[M-1] Mattson, R.L. and J.E. Dammann, "A technique for determining and coding subclasses in pattern recognition problems", I.B.M. Jour. Res. Develop., pp. 294-302, 1965.

[M-2] McQuitty, L.L., "Hierarchical linkage analysis for the isolation of types", Educ. Psychol. Measurement, Vol. 20, No. 1, pp. 55-67, 1960.

[M-3] Mendel, J.M. and K.S. Fu, Adaptive, Learning and Pattern Recognition Systems: Theory and Applications, Academic Press, New York, 1970.

[M-4] Michener, C.D. and R.R. Sokal, "A quantitative approach to a problem in classification", Evolution, Vol. 11, pp. 130-162, 1957.

[M-5] Moore, D.S. and E.G. Henrichon, "Uniform consistency of some estimates of a density function", Purdue University Stat. Dept., No. 168, 1968.

[M-6] Murthy, V.K., "Estimation of probability density", Ann. of Math. Stat., Vol. 36, pp. 1027-1031, 1965.

[N-1] Nagy, G., "State of the art in pattern recognition", Proc. IEEE, Vol. 56, No. 5, pp. 836-862, 1968.

[N-2] Nilsson, N.J., Learning Machines: Foundations of Trainable Pattern Classifying Systems, McGraw-Hill, New York, 1965.

[N-3] Nunnally, J., "The analysis of profile data", Psych. Bulletin, Vol. 59, No. 4, pp. 311-319, 1962.

[O-1] Owen, J., "The consistency of a non-parametric decision procedure", Engr. Note No. 334, Applied Res. Lab., Sylvania Electronic Systems, 1964. (Quoted in [F-7].)
[P-1] Parzen, E., "On estimation of a probability density function and mode", Ann. of Math. Stat., Vol. 33, pp. 1065-1076, 1962.

[P-2] Patrick, E.A. and J.C. Hancock, "Non-supervised sequential classification and recognition of patterns", IEEE Trans. Inform. Theory, Vol. IT-12, pp. 362-372, 1966.

[P-3] Patrick, E.A., "On a class of unsupervised estimation problems", IEEE Trans. Inform. Theory, Vol. IT-14, pp. 407-415, 1968.

[P-4] Patrick, E.A. and J.P. Costello, "Unsupervised estimation and processing of unknown signals", Technical Report TR-69-430, Rome Air Development Center, Rome, New York, 1970.

[R-1] Rao, C.R., "The use and interpretation of principal component analysis in applied research", Sankhya, The Indian Journal of Statistics, Series A, Vol. 26, pp. 329-358, 1965.

[R-2] Rohlf, F.J., "Adaptive hierarchical clustering schemes", Sys. Zoology, Vol. 19, No. 1, pp. 58-82, 1970.

[R-3] Ruspini, E.H., "A new approach to clustering", Inform. Contr., Vol. 15, pp. 22-32, 1969.

[S-1] Sammon, J.W., Jr., "A non-linear mapping for data structure analysis", IEEE Trans. Computers, Vol. C-18, No. 5, pp. 401-409, 1969.

[S-2] Sebestyen, G. and J. Edie, "An algorithm for non-parametric pattern recognition", IEEE Trans. Elec. Computers, Vol. EC-15, No. 6, pp. 908-915, 1966.

[S-3] Sokal, R.R. and P.H.A. Sneath, Principles of Numerical Taxonomy, W.H. Freeman, San Francisco, California, 1963.

[S-4] Spragins, J., "Learning without a teacher", IEEE Trans. Inform. Theory, Vol. IT-12, pp. 223-230, 1966.

[S-5] Stanat, D.F., "Unsupervised learning of mixtures of probability functions", in Pattern Recognition, Ed. L. Kanal, Thompson, Washington, D.C., 1966.

[S-6] Switzer, P., "Statistical techniques in clustering and pattern recognition", Technical Report No. 139, Department of Statistics, Stanford University, Stanford, California, 1968.

[T-1] Teicher, H., "Identifiability of mixtures of product measures", Ann. of Math. Stat., Vol. 38, pp. 1300-1302, 1967.

[T-2] ________, "Identifiability of finite mixtures", Ann. of Math. Stat., Vol. 34, pp. 1265-1269, 1963.

[T-3] ________, "Identifiability of mixtures", Ann. of Math. Stat., Vol. 32, pp. 244-248, 1961.

[T-4] Tryon, R.C., Cluster Analysis, Edwards, Ann Arbor, Michigan, 1939.

[V-1] Venter, J.H., "On estimation of the mode", Ann. of Math. Stat., Vol. 38, pp. 1446-1455, 1967.

[W-1] Ward, J.H., "Hierarchical grouping to optimize an objective function", Jour. Amer. Stat. Assn., Vol. 58, pp. 236-245, 1963.

[W-2] Wilks, S.S., Mathematical Statistics, Ch. 18, Wiley, New York, 1963.

[W-3] Williams, W.T. and J.M. Lambert, "Multivariate methods in plant ecology I. Association analysis in plant communities", J. Ecol., Vol. 47, pp. 83-89, 1959.

[W-4] Wirth, M., Estabrook, G., and D. Rogers, "A group theory model for systematic biology", Systematic Zoology, Vol. 15, No. 1, pp. 59-69, 1966.

[Y-1] Yakowitz, S.J., "Unsupervised learning and the identification of finite mixtures", IEEE Trans. Inform. Theory, Vol. IT-16, pp. 330-338, 1970.

[Y-2] ________, "A consistent estimator for the identification of finite mixtures", Ann. of Math. Stat., Vol. 40, No. 5, pp. 1728-1735, 1969.

[Y-3] Yakowitz, S.J. and J.D. Spragins, "On the identifiability of finite mixtures", Ann. of Math. Stat., Vol. 39, No. 1, pp. 209-214, 1968.

[Z-1] Zahn, C.T., "Graph-theoretical methods for detecting and describing Gestalt clusters", IEEE Trans. Computers, Vol. C-20, No. 1, pp. 68-86, 1971.
APPENDIX A

This appendix discusses the motivation behind the use of the symmetrical and asymmetrical generalized distance measures referred to in Chapter IV and gives the details of the derivations leading to (4.3.3) through (4.3.6).

I. Symmetric generalized distance:

A generalized second-order distance metric, $D^2$, between two points $X$ and $Q$ in a $p$-dimensional space is

$$D^2 = (X - Q)^T [S] (X - Q) \tag{A.1}$$

where $[S]$ is a $(p \times p)$ positive definite matrix. We seek a particular matrix $[S]$ which satisfies the condition that the average of the distances of the points in a cluster from their mode is 1. That is, a matrix $[S^{(i)}]$ is sought such that, for the $i$-th cluster,

$$\frac{1}{N_i} \sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)^T [S^{(i)}] \bigl(X_j^{*(i)} - M^{(i)}\bigr) = 1. \tag{A.2}$$

The existence of such a matrix is shown below. Assume that the inverse of the matrix

$$\sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T$$

exists. Then a matrix satisfying condition (A.2) is given by

$$[S^{(i)}] = \frac{N_i}{p} \left[\, \sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T \right]^{-1}. \tag{A.3}$$

To prove this, we note that each quadratic form, being a scalar, equals its own trace, so that

$$\frac{1}{N_i} \sum_{j=1}^{N_i} D_j^2
= \frac{1}{N_i} \sum_{j=1}^{N_i} \operatorname{trace}\Bigl\{ [S^{(i)}] \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T \Bigr\}
= \frac{1}{N_i} \operatorname{trace}\Bigl\{ [S^{(i)}] \sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T \Bigr\}.$$

Substituting for $[S^{(i)}]$ the expression given by (A.3),

$$\frac{1}{N_i} \sum_{j=1}^{N_i} D_j^2 = \frac{1}{N_i} \cdot \frac{N_i}{p} \operatorname{trace}[I] = \frac{p}{p} = 1.$$

The symmetric generalized distance measure between a pattern vector $X$ and the mode $M^{(i)}$ of the $i$-th cluster is now defined as

$$d_s^2\bigl(X, M^{(i)}\bigr) = \bigl(X - M^{(i)}\bigr)^T [S^{(i)}] \bigl(X - M^{(i)}\bigr). \tag{A.4}$$

It is to be observed that $[S^{(i)}]$ is proportional to the inverse of the second moment matrix of the points in the $i$-th cluster, taken with respect to the mode $M^{(i)}$ of the cluster, and is thus a measure of the dispersion of that cluster. The name "symmetric generalized distance" is used to distinguish it from the asymmetric measure discussed next.

II. Asymmetric generalized distance:

For defining the asymmetric generalized distance measure, the ellipsoidal surfaces described by $(X - Q)^T [S] (X - Q) = 1$ are centered at a new point which reflects the dispersion of the points in a cluster. The measures of the dispersion along each axis, both to the left and to the right of the mode $M^{(i)}$, are computed, and the center of the ellipsoid is obtained by shifting $M^{(i)}$ by the average of these dispersions along each axis. To be more specific, let $x_{jk}^{*(i)}$ be the $k$-th component of the $j$-th pattern vector of the $i$-th cluster, $j = 1, 2, \ldots, N_i$; $k = 1, 2, \ldots, p$; and further let this component be designated

$$x_{jk}^{-(i)} \quad \text{if } x_{jk}^{*(i)} \le M_k^{(i)}, \qquad \text{and} \qquad x_{jk}^{+(i)} \quad \text{if } x_{jk}^{*(i)} > M_k^{(i)}. \tag{A.5}$$

... $> 0$ if the point $M^{(i)}$ is interior to the ellipsoid; and $r = (a - b)^T [W] (a - b) > 0$, assuming $[W]$ to be positive definite.
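As a numerical check on the construction above, the following sketch computes $[S^{(i)}]$ from (A.3) and verifies the normalizing property (A.2). It is an illustration only, written here in Python with numpy; the mode estimate is taken as given, and the function names are choices made for this sketch rather than names from the thesis.

```python
import numpy as np

def symmetric_s_matrix(X, mode):
    # [S] of (A.3): (N_i / p) times the inverse of the cluster's
    # scatter matrix taken about the mode (not the mean).
    n, p = X.shape
    d = X - mode
    return (n / p) * np.linalg.inv(d.T @ d)

def symmetric_d2(x, mode, S):
    # Symmetric generalized squared distance of (A.4).
    d = x - mode
    return float(d @ S @ d)

# Verify (A.2): the average squared distance of the cluster's own
# points from its mode comes out 1 by construction.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (200, 3))
mode = np.zeros(3)                      # mode estimate assumed given
S = symmetric_s_matrix(X, mode)
avg = np.mean([symmetric_d2(xj, mode, S) for xj in X])
print(round(avg, 10))                   # prints 1.0
```

The trace identity used in the proof is what makes this check exact: the average quadratic form equals $(1/N_i)\operatorname{trace}\{[S]\sum_j dd^T\}$, which (A.3) forces to 1 regardless of the data.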
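The shifted center of the asymmetric measure can be sketched in the same spirit. Since the construction is only outlined in prose above, the code below is one plausible reading and nothing more: the one-sided dispersions are taken as mean deviations on each side of the mode, and the shift along each axis as half their difference; both choices, like the function name, are assumptions of this sketch.

```python
import numpy as np

def shifted_center(X, mode):
    # One reading of the asymmetric construction: along each axis k,
    # average the deviations of the points lying to the left and to
    # the right of the mode separately, then move the ellipsoid
    # center from the mode toward the side with the larger dispersion.
    # The averaging and the sign convention are assumptions.
    center = np.asarray(mode, dtype=float).copy()
    for k in range(X.shape[1]):
        left = mode[k] - X[X[:, k] <= mode[k], k]   # left-side deviations
        right = X[X[:, k] > mode[k], k] - mode[k]   # right-side deviations
        d_left = left.mean() if left.size else 0.0
        d_right = right.mean() if right.size else 0.0
        center[k] += 0.5 * (d_right - d_left)
    return center
```

With the center so shifted, the asymmetric distance would be evaluated as in (A.1) with the shifted center in place of $Q$; for a cluster skewed to the right of its mode, the center moves right, so points on the long tail are penalized less than under the symmetric measure.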