asmomm OF A BASES FOR
QGAL {IRENE CLUSTEBING

Msseﬁaﬁm far the Degree 9‘? PM).
Mimiﬁm STATE UMVERSITY
ALI BEHFORGOZ
W75

4.,

Q,

 

O!’ "I
. we: a.
a” - r ‘I‘Z‘uv . ' . 0- .0 _
_; i " ' " '
"V t" 1 “ This is to certify that the
thesis entitled
Development of 3 Basis for
Goal Oriented Clustering
presented by
Ali Behforooz
has been accepted towards fulﬁllment
of the requirements for
Ph.D degree in Computer Science

%2/%

Major professor

 

1"

t 61(0 ABSTRACT

DEVELOPMENT OF A BASIS FOR
GOAL ORIENTED CLUSTERING
By
Ali Behforooz
This thesis is concerned with the first two of the follow—

ing group of problems which were faced during the design of

a Television-Computer Education System: 1) representation of
a nominal-scale variable with unknown nominal value by numeric
data for cluster analysis, 2) development of a method of
clustering for a set of data items defined by such nominal-
scale variables , 3) assignments of an appr0priate sequence
of actions to every generated cluster, 4) measurement of the
effectiveness of the assigned actions , and 5) control or
prediction of changes in the clusters and the members of the
clusters. The basic structure of a clustering system which is
designed so as to provide solutions to these problems is
developed. It is called a Goal Oriented Clustering System,

because every step of this system is dependent on the purpose(s)
of the clustering.

In this thesis a complete solution is provided for the
first problem. The nominal-scale variables are represented
by a vector of dimension M, where M is the number of categories
of the variable. If the nominal value of a nominal variable V
is known then the vector representing this variable is a
deterministic vector. If the nominal value of V is unknown
then the vector representing this variable is a probabilistic

vector.

Also, in this thesis a complete solution is provided

for the clustering problem. To cluster a set of data items

defined by nominal-scale variables with unknown nominal value,
a data item is represented by an n x M matrix, where n is the
number of variables. An M-dimensional distance function is
proposed to calculate the association measure between data
items. Two different clustering methods are developed which are
capable of clustering data items represented by n x M matrices

as well as data items represented by n-dimensional vectors.

The last three problems mentioned at the beginning are

left for future research by others.

DEVELOPMENT OF A BASIS FOR

GOAL ORIENTED CLUSTERING

by

Ali Behforooz

A THESIS
Submitted to

Michigan State University
Department of Computer Science

in partial fulfillment of the

requirements for the degree of

DOCTOR OF PHILOSOPHY

197s

TO

FARIDEH 8 AMIR

ii

. ACKNOWLEDGMENT

I am grateful to Dr. Morteza A. Rahimi,my major profe-
ssor, for his encouragement during the course of this research.
My thanks also go to Dr. Harry G. Hedges, Dr. Richard J. Reid,
Dr. John J. Forsyth, and Dr. Dennis D. Gilliland for their
serving on my guidance committee and their critical review
of this thesis.

My special thanks go to National Educational Television
of Iran for their financial support of this work. Finally,my
thanks go to Dr. Martin O. Holien, Chairman of the Department

of Computer Science at Moorhead State University, Moorhead,

Minnesota for his support.

iii

ISI

I93

XRY
R[A]
[31R
iff
a==$b
a¢=9b
‘V a
3 a

xéfA
xfiA

LIST OF SYMBOLS

Empty set, a set with no element.

If S is a set,lSl is the number of elements of
the set.

The cardinality of the set of natural numbers.
The cardinality of the set of real numbers.

A vector with all components equal to r.

The set of positive integers.

The set of negative integers.

The set of real numbers.

The set of non-negative real numbers.

p or q.

p and q.

x is related to y by relation R.

x is not related to y by relation R.

The set of R-related elements Of A-

The set of all elements related to a by relation R.
If and only if.

a implies b.

a if and only if b.

For all a.

There exists an a.

There does nor exist an a.

x is an element of A.

x is not an element of A.

Such that.

iv

A.CB

\a,b,..

(a,b,..

[dij]
[dek]

LIST OF SYMBOLS (continue)

is a subset of.B.
set with elements a, b,..
vector with components a,

matrix with d as its

ii

three dimensional matrix.

etc.
b,..etc.

jth element of its i

th

TOW .

Chapter

I

II

III

IV

TABLE OF CONTENTS

INTRODUCTION.. ..... . ..................... . ...........
1.1 Five Major Problems.... ..... .............. ...... 1

1.2 A Brief Review of Current Clustering Methods.... 5

1.3 Scope of This Research .................. . ....... 10
1.4 Organization of This Thesis ............... . ..... 11
GOAL ORIENTED CLUSTERING .............. .... ...........

2.1 Introduction.................................... 13
2.2 Goal Oriented Clustering System (GOCS) .......... 14

2.3 Elements of the GOCS .................. .......... 16

NUMERIC REPRESENTATION OF NON-NUMERIC VARIABLES(NNV'S)

FOR CLUSTER ANALYSIS ........... ... ........... ........

3.1 Introduction .................................... 21
3.2 Category Relation ........................... .... 22
3.3 Representation of an NNV. ....................... 26
3.4 Representation of a Data Item ......... .......... 27
3.5 Representation of a Variable...r.....o -------- .. 31

3.6 Some Motivation for the Proposed Representation. 36

ASSOCIATION MEASURES..... ............................
4.1 Introduction ......... . ........... ............... 41
4.2 Association Measure Between Variables.... ...... . 42

4.2.1 Association Measure Between Numeric Variables. 43
4.2.2 Association Measure Between NNV's.. ........... 44

4.3 Association Measure Between Data Items .......... 48

vi

TABLE OF CONTENTS
(continue)

Chapter Page

V CLUSTERING PROCEDURES ................ . ....... . ......
5.1 Introduction ................... ... ............ 54
5.2 Geometric Representation of D and A .......... 55
5.2.1 Hypersphere Representation of Data Items.... 55
5.3 Clustering the set D by Transforming A.to an

N x N dissimilarity matrix.................... 58
5.4 General Centroid Method(GCM)................. 62
5.4.1 General Centroid A1gorithm(GCA)....... ...... 64
5.4.2 Analysis of the GCA ...... ................... 68
5.4.3 Computational Limits and the Problem of Storage.70
5.4.4 Some Special Cases. ......... ....... ......... 71
5.5 Graph Representation of D and A .............. 74

5.6 k-clustering Based on Graph Representation of

D and A............... ........................ 78
5.6.1 k-clustering Algorithm ...................... 86
VI SUMMARY AND CONCLUSION ............. ...... .........
6.1 Summary of this Research ...................... 92
6.2 Conclusion... ............. ... ......... ........ 94

6.3 Suggested Topics for Further Work ............ 95

BIBLIOGRAPHYooooooooo ooooooooo 00000000000000.0000. 97

vii

LIST OF FIGURES

Dynamic Property of Clusters ...................
Visual and Algorithmic Clusters ................
Visual and Algorithmic Clusters ................
Visual and Algorithmic Cldsters .......... -.....
Goal Oriented Clustering System ........... .....
Unit Spheres of M1, M2 and Md,. ..............
Geometric Representation of D and A . . . . . .......
Hypersphere Representation.of the Data Items of
Example 4.2 ........ . ................. . ....... ..
Bipartite,krpartite, Uniform k-partite, Uniform
Weighted k-partite and Average Graph.... ........
Graph Representation of D and A.of Example 4.2..

A Picture of Definition 5.9 for Weighted Graphs.

viii

Page

4
8
8
8
15
SO

56

57

76-77

77

80

CHAPTER I

INTRODUCTION

Most nominal-scale data , especially those which are to
define behavioral characteristics , are such that their nominal
value is unknown. Throughout this entire thesis, nominal-
scale data which have uncertain nominal values are referred
to as Non-Numeric Variables(NNV's). This terminology makes it
possible to classify variables into two categories. The first
category contains all variables which can be represented by a
single number regardless of whether the scale is ratio, interval,
ordinal, or nominal. Variables of this class are called
numeric variables. The second category contains those
nominal-scale variables which have uncertain nominal values

and as stated above are called NNV's.

The problem under consideration is to extend some of the
existing clustering procedures so as to be able to cluster 3
set of N data items each of which is defined by n variables

some of which are NNV's.

The major topics covered in this thesis are 1) numeric
representation of a NNV, 2) association measures between NNV's
and also between data items defined by NNV's and 3) clustering
a dissimilarity N-block matrix of a set of N data items defined
by both NNV's and numeric variables.

1.1 Five Major Problems:

The motivation for this research came from the design of a

Television-Computer Educational (TCE) system. A major goal of

the TCE system is to be able to act as an instructor and,at the

v.1

2

same time, be relatively inexpensive, efficient, and fast
enough to handle a large number of students. During the
design study of the TCE system, five major problems as given
below were faced. The first two problems are investigated
and totally solved in this research and other peOple are
working on the last three problems.

1. The Problem of Representing Data Items: The data items

 

of the TCE system are the students and each student is repre-
sented by a set of variables. Some of these variables are NNV's
and they are important to the student clustering procedure.

This type of variable must be represented numerically for clust—
ering purposes. This problem is included in the scope of this
research and is completed in chapter 3.

2. The Problem of Clustering: The number of students en-

 

rolled for a given course is large, the facilities to serve
these students are limited and the students are from different
backgrounds. Because of these reasons and the fact that the
TCE system is to be efficient, it is necessary to divide the
students into different groups. A clustering algorithm that
performs this task is essential to the operation of the TCE
system. This problem is also included in the scope of this
research and is discussed in Chapters 4 and 5.

3. The Problem of Assigning_Appropriate
Actions to the Generated Clusters:

 

 

The TCE system should be designed such that it gives prOper
assignments to the students. The action sequence or sequence of
assignments, tests, and homework corresponding to each group of
students should be determined on the basis of the "label" of

that group and modified according to certain

3
characteristics of the individual members of the group. In other

words, the action sequence for a group cannot be pre-defined.

Let Ci be the label of the ith group and Ai = Ail’AiZ’°"’Aini

be the action sequence determined for this group. The action

sequence Ai is called a pre-defined action sequence if it is

not a function of the members of the ith group.

4. The Problem of Applicability of A;: Once the students

 

are clustered and all action sequences are determined, it is
time to determine whether or not the action sequence A1 is the
best action sequence for the group Ci“ The action sequence Ai
is accepted as the best action sequence for C1 if it possesses
the following properties:

a. there are no conflicts in Ai;

b. all actions included in A1 are applicable to every
individual member of Ci;

c. the application of Ai to the members of Ci would
provide the expected changes for the group Ci as well
as the individual members of Ci°

An algorithm is needed to test those three properties OfeVerY
generated action sequence.

5. The Problem of Dynamic Clusters: The members of group

 

Ci might move their positions from one point to another under
the application of the action sequence Ai' This change of
position is due to the changes in values of variables defining
the students. This movement does not have to be in the same
direction for all members of Ci' It is true that all members

of Ci are under the effect of the same action sequence, but

this does not mean that the progress or improvement of the

 

 

 

 

 

 

 

 

 

 

 

 

 

 

_ — - — - - - -- —— - -— _ --- —— - - - - O - o - - -- u —
- - .- — cs - - c - - a. - -_ - - O - o - q— -. o- — 0
ﬁ - - - - ‘ - . -.. - - C - - - -. .-

 

      

,‘
0 ﬂ: ‘

Ogitn}

 

 

 

 

 

 

Figure 1-1 .

Dynamic property of Clusters.

 

 

 

4“ D

5
individuals will be the same. Figure 1-1 shows a set of four
clusters at times to,t1,...,tn. The solid clusters are the
positions of the dotted clusters at time t = t1 , i=1,2,...,n.
The movement of the elements of the clusters is called the
dynamic property of the clusters. The dynamic property of a
cluster could be studied under the following two separate cases.

1. The members of a cluster move such that they stay in
the same cluster. Clusters C1(t1) and C4(t1) of Figure 1-1
are examples of this type of movement.

2. The members of a cluster move such that they explode
the cluster; that is, at least one of the members does not
remain a member of the same cluster. Clusters C2(t1) and
C3(t1) of Figure 1-1 are examples of exploded clusters.

In Figure 1-1 C1(t0) ,...,C4(to) are four initial clusters

and C1(t1),...,C4(t1) are four clusters after a period t=t1-t0.
The application of the action sequence Ai to the cluster Ci(to)
causes a change in values 0f the variables defining the members
of this cluster. The change in position of a member of a cluster

is due to the change of variables defining this member. The

enlarged clusters are special case of exploded clusters.
A system called Goal Oriented Clustering is designed in

Chapter Two such that it is possible to solve all five major

problems mentioned previously.

1.2 A Brief Review of Current
ClusterigggMefhods:

 

 

In this section a very brief description of some cluster

analysis approaches is given. For those who are interested

in a complete description , without referring to the original

works, Anderberg [2 ] Chapters Six and Seven is a good

reference. Ling [41] gives a description of some of the
current work along with some criticisms. Most of the current
methods in clustering are included in one of the following
categories:

1. Agglomerative Algorithms: Most of the widely used

 

clustering algorithms belong to this family of algorithms.
There are a large number of papers in hierarchical and non-
hierarchical clustering employing agglomerative algorithms.
For a complete description and references to the work in
this area the reader is referred to Anderberg [2]. Very
brief descriptionsof some of the widely used agglomerative
algorithms are given below.

a. The Single Linkage Algorithm: This algorithm is

 

also known as the "nearest neighbor" or the "minimum" algo-
rithm. It has been considered by Sneath [65], Sokal and Sneath
[66], Johanson [35], Shepard and Willmott [64], Zahn [79] and
others. This algorithm combinesthe two clusters whose dis-
tance is the minimum among all pairs of clusters. The distance
between two clusters Xi and Y~ is given by

J
d(X.,Y,)=Minimum (d(x,y))
1 J xEX
i
yEYj
The merging process continues until a terminating criterion

is satisfied or until all points belong to a single cluster.

b. The Complete Linkage Algorithm: This algorithm has

 

been studied by Sokal and Sneath [66], Johnson [35] and
others. It differs from the minimum algorithm only in the
definition of the distance between two clusters. If Xi and

Yj are two clusters the distance between them is defined by

7

d(Xi,Y.) = Maximum (d(x,y)).
J xtixi
yEIYj

c. The Centroid Algorithm: In this method the distance

 

between two clusters is defined by d(Xi,Yj) = d(x,Y) , where
Xi and Yj are two clusters and X , Y are the mean of these
two clusters. The distance function d is the n-dimensional

Euclidian distance function.

d. Minimizing Within-Group Sum of Square (NGSS) Algorithm:

 

This method was introduced by Fisher [16] and applied by Ward
[77]. It is also known as Ward's method. Ward's method is
employed by Jensen [36], Vinod [75], Dubes [10], Fisher and
Van Ness [15], Fukunaga and Koontz [19] and others. The WGSS
method is applicable only to some configuration similar to
Figure l-2(a), where the clusters are compact and/or
hyperelipsoidal and in either case they are well separated.
In some cases, as in Figure 1-2(b) and (c), this method will
not give an appropriate partition. Assuming the data items
are in two-dimensional space, the optimal clusters of

Figure 1-2 are the ones given in Figure 1-3. But by using
the W688 method the resultant cluster for Figure l-2(a) is
the same as in Figure l-3(a); and the resultant clusters for
Figure l-2(b) and (c) are those given in Figure 1-4(b) and (c),
respectively.

2. Divisive Algorithms: There exist a variety of tech-

 

niques based on dividing the entire set of data items into
subgroups. These methods are focused on finding the groups
which are the most widely separated from each other. Divisive

algorithms can be categorized into three different approaches.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Clusters.

,. ,, ¢«- T ‘ \.:
o .. J g Q
: Q'il‘: {a " ~.‘.‘ 6
‘ {1:‘. ..\. ‘ “‘
.I I‘\ r. 3 H‘ Q‘ ~ - ‘.
o ’ . g . \
00"". l‘ ' -' ' | “. no \‘
... .\.: . ' O 0.. ‘I :n:
:‘ 0 o o I. " "... t o ' V
«n, 'o' ... ‘.‘\ 3'/ ‘ e. ’
(a) . (b) . (c,
Figure 1-2.
(a)- (b) (c)
Figure 1-3.
.1
(C)
Figure 1-4.
Visual and Algorithmic

 

 

 

9

The first approach is under the label "association analysis"
and is discussed by Lance and Williams [43]. The second
approach is proposed by Edwards and Cavalli-Sforza [12].
Scott and Symons [61] and Harding [28] have some work
related to this approach. The third approach is based on
the use of discriminant analysis. The idea is to start with
some initial partition, compute a linear discriminant
function and then iteratively reassign points and recompute
discriminant functions so as to find the most strongly
separated groups. Dubes [10] provides a computer program
for this method.

Since the k-clustering method , introduced by Ling[41],is
one of the two methods which are extended in Chapter Five

of this thesis, some more words about this method are
appropriate. In the k-clustering method a cluster is

a well-defined entity which depends on the dissimilarity
matrix and a parameter k. Given a set of data items D, a
dissimilarity rank matrix: R, and a positive integer k, the
k-clustering algorithm will deliver all subsetsof D that

are k-clusterd according to the definition of k-cluster. In
this method the problem is not to define a chaster but the
problem is to find the clusters. As soon as the dissimilarity
matrix is determined no information about the individual data
items is necessary and the dissimilarity matrix should be

computed only once.

10

1.3 Scopeof This Research:

 

Numeric variables are those which can be represented by
a single number. For example, AGE and INCOME are two numeric
variables. Non-numeric variables are those which cannot be
represented by a single number. For example,BEHAVIOR and
PERFORMANCE are two non-numeric variables. Every clustering
procedure attempts to categorize data items defined by a set
of n variables. Each data item consist of a vector of n
dimension5(if all variables are numeric). Each component of
this numeric vector is the value of one of the n variables
for the given data item.

The first goal of this research was to find a meaningful
and/or comparable numeric representation for NNV's. The ap-
proach used is explained in Chapter Three of this thesis.

The second goal of this research was to find an appro-
priate distance function between data items defined by both
numeric variables and NNH's. The approach used is delineated
in chapter Four of this thesis.

The third goal of this research was to develop procedures
capable of clustering data items defined by both numeric
variables and NNV's. The form of the distance matrix for
such data items is different from that for data items defined
only by numeric variables so a new procedure was needed to
cluster such data items. This problem has been approached in
two different ways as explained in Chapter Five of this thesis.

The fourth goal of this research was to approach the
design of a general system which provides the solutions to

the five major problems given in section 1.1.The goal oriented

11
clustering system is a procedure which is designed in a manner
which may fully solve the above five major problems. The goal
oriented clustering system is defined in Qhapter Two of this

thesis.

1.4 Organization of This Thesis:

 

Chapter one has been an overview and description of a
number of major problems to be either solved in this research
or considered for further research.

Chapter two is a design study of a procedure called the
goal oriented clustering system(GOCS). The description of the
elements of the GOCS is given in this chapter.

Chapter three provides a vector representation of'NNV's.
A relation R,ca11ed a category relation , is defined and a
partition is introduced on the range of every NNV X. Each NNV
X is represented by an M-dimensional row vector and each data
item Di is represented by an n x M matrix.

Chapter four provides a number of appropriate distance
functions to measure the dissimilarity (or similarity) between
data items as well as between variables. An M-dimensional
distance function is defined and some of the properties of
this distance function are studied. The Minkowski metric is
extended to an M-dimensional metric and is used to measure
the dissimilarity between data items.

Chapter five provides a number of different procedures
for clustering data items defined by NNV's. Two different
approaches are provided for clustering a set of data items D.
The first approach is to change the N-block dissimilarity

matrix 2 to an N x N square matrix and then use one of the

12

current clustering methods to cluster the set D. The second
approach is to develop a new clustering procedure to cluster
the data items based on the N-block dissimilarity matrixA.
A method similar to the centroid method is develOped called
the " General Centroid Method." Ling's [41] method is extended
by using the graph representation of D and A . The theory for
this extension is also provided in Chapter 5.

Chapter six is a summary of this research. A number of

important topics for further research are given.

CHAPTER II

GOAL ORIENTED CLUSTERING

2.1 Introduction:

 

This chapter is a study of a new clustering system
designed so as to be able to provide solutions for the five
major problems given in chapter one. This system is fully
described but only the clustering block of the system is

completely developed. This system is calleda Goal Oriented

Clustering System (GOCS). The clustering block of the GOCS

consist of three different .procedures; 1) representation of
data items defined by NNV's , 2) association measures
between data items as well as between variables, and 3) a
procedure to cluster an N-block dissimilarity matrix. These
three procedures are discussed in Chapters 3, 4 and 5,
respectively.

Through discussion with people from different areas of
application, it was discoverd that the five major problems
mentioned in section 1.1 are not unique to the TCE system.
People involved with research in areas such as behavioral
science, psychology, sociology, biometry, biology,and educa—
tion have problems very similar to those «mf the TCE system.
The sociologists are vitally concerned about the overall
behavior of every cluster after each period of time. They
would like to see the relations between clusters as well as
between the members of the cluster before and after application
of the action sequences to the clusters. The psychologists

would like to be able to represent their objects by NNV's

13

14
as well as numeric ones. The sociologists want to know the
direction of any movement of their clusters. They use this
information to predict some of the future events of society.
The biologists want to know the interaction between clusters.
They want to have >full control over the dynamic property

of the clusters.

2.2 Goal Oriented ClusteringZSystem (GOCS):
The GOCS consists of the three separate blocks given
below.

1. Clustering Block: This block clusters the set of data

 

items defined by a set of n variables some of which are NNV's.
The input to the clustering block is a set of N data items

each of which is defined by an n x M matrix. The output of

one
this block is a set of m clusters C1,C2,...,Cm.

2. Action block: This block assigns an action sequence

 

to every generated cluster. These action sequences are based
on the set of actions, the set of needs, the members of each
cluster, and the cluster label. The input to the action block
is a set of m clusters and the output is a set of m pairs

(Ci, Af) , where Ci is a cluster and Ai is the action sequence
corresponding to Ci'

3. Acceptor Block: This block teststhe applicablity of

 

the action sequence A1 to the cluster Ci' Each Ai should be
checked for the three general properties mentioned in section
1.1. The input to this block is a set of pairs (C1,Ai) and

the output is either "accepted" or "rejected."

GOAL ORIENTED CLUSTERING SYSTEM

DATA ITEMS

 

 

 

 

 

CLUSTERING BLOCK

 

 

 

 

 

' c:
d H, ' B;
[ ACTION BLOCK 1
e '3; I 9'! e
‘2 D“ . f1 . e
> sf . =~ . >
SO x/ ' SVT' §z_
l ACCEPTOR BLOCK

 

YES

 

FIND NEW VALUES 0?
ALL VARIABLES

 

  

END OF ANALYSIS?

 
 

Figure 2-1.

 

15

fr

RECHECK ACTIONS AND/OR
VARIABLES TO FIND THE
CONFILICT(S)

 

   
 

 

i

REMOVE ALL
CONFLICTS

 

 

a]

 

INTERPRET
RESULTS

 

16

The process of the GOCS is not limited to finding the most
similar objects. It is to assign a set of actions to a set of
data items in order to achieve a set of pre-defined goals.
Figure 2-1 is a simple diagram of the GOCS. In order to take
care of the dynamic property of a cluster , the GOCS is set up
such that it could run periodically. The GOCS halts either by
user request or when convergence occurs. The GOCS is said to
converge if it generates only one pair (:1, Af); that is , all
data items are clustered into one group.

2.3 Elements of The GOCS:

 

The following is a list of elements needed to be defined
and/or introduced for the GOCS before using it.

1. Data Item: Every member of a finite sample space Q is

 

called a data item. For example, let afbe the set of all
graduate students at Michigan State University. Everyone
enrolled as a graduate student at MSU is a data item. To define
the set of data items , it is sufficient to define % . Every
member of i is a data item. The set D=[D1,D2,...,DN] is called
a set of N data items and is a subset of‘X..

2. Variable : A data item Di could be defined by a set

 

of n variables and represented by a characteristic vector or
by an n x M matrix(if a data item is defined by some NNV's ,
one might represent it by an n x M matrix as given in(:hapter
Three.) The componentsof this characteristic vector (or the
row of the characteristic matrix) are the values of the

n variables for a data item Di' The variables which
represent data items for the goal oriented clustering system

are totally a function of the purposes of the clustering.

17

One might represent a graduate student of MSU by

D-= (Department, 3.8. Major, Previous Institution)

1
for a purpose P1. For another purpose, say Pg. one COUId represent

the same student by

Di= (Department, GPA, Term Credits).
Variables are the means to measure the distinction and the
differences between data items.

3. What to Cluster: The set D = ]D1’ D2,..., ON} and the

 

variables V1,V2,...,Vn are chosen. Does one need to put "similar"
data items in one group (data item clustering) , does one want

to see the "similarity" between variables (variable clustering)
or does one want the simultaneous clusteiing of data items and
variables? Answers to these questions must be decidedi“order

to know what is to be clustered.

4. Variable Transformation: Every clustering procedure

 

deals with a measure of association between variables. If the
variables are not of the same scale of measurement it is necessary
to homogenize the variables to have a unique scale of measurement.

5. Concept of Similarity: One of the well known definitions

 

of clustering procedures says that a clustering procedure must
generate an optimal partition on the set of data items so that
the most "similar" data items are grouped together. The measure
of "dissimilarity" between two elements of two different groups
is to be maximum. Here, one sees the important role of similarity
measure in the clustering procedure. The concept of similarity
must be well defined aniavailable to the clustering procedure

even before choosing the clustering algorithm.

18

6. Method of Clustering: The term "cluster" is not an

 

abstract term with a fixed mathematical definition. The term

"cluster" in classification , pattern recognition, cluster

analysis and other related areas is treated much like the
term "point" in mathematics. Depending on the set of data
items, the purpose of clustering, the set of actions, the
set of needs and some other factors, one could choose the

appropriate method of clustering.

7. Algerithm: Once the data items are defined by homogeni-
zed variables,the measure of similarity (or dissimilarity) is
defined and the criterion of clustering is chosen,an algorithm
is needed to specify the steps of the clustering procedure
as well as a computer program to implement this algorithm.
The choice of algorithm is important to the result of clustering.
To choose the appropriate algorithm sufficient information must
be known about the capacity of the available computer system,
the size of the set of data items , the number of variables,
and the method of clustering.

8. Number of Clusters: In some of the clustering methods,

 

like non-hierarchical clustering, one needs to specify the
number of clusters as a parameter to the system or as aparameter
to the algorithm. In some other type of clustering the number

of clusters does not have to be specified during the operation
of the system. Since most of the existent algorithms arevnOt
defined so as to be able to decide the number of clusters,

it seems reasonable that the analyst for the clustering

system keep accurate and timely records on this parameter.

19

9. The Set of Needs: There is always at least one purpose

 

for using any clustering procedure. When one classifies his
groceries to cold, meat, frozen, canned food and dry food he
might want to satisfy the following needs:
a. To have easier access to his groceries;
b. To put the cold bags in the cooler in the back of his car;
c. To unload the cold foods before the canned foods;
The set of needs resulting in classification of groceries
consists of those listed above. In general, whatever goals one
wishes to reach by means of clustering are called the set of needs.

10. The Set of Actions: To satisfy the needs, a set of

 

appropriate actions must be applied to the clusters in a certain
order. Application of the set of actions to the generated clusters
is intended to improve, in some sense, the behavior of the members
of the cluster as well as the cluster itself. Since there is a
unique sequence of actions corresponding to each cluster, one
could represent a cluster by its action sequence.

11. Acceptor Block: In this new method of clustering the

 

procedure of clustering is to generate a partition;7=]C1,C2,...,Cm]

on the set of data items. Also, the procedure of clustering

is to rassociate an action sequence Ai with the cluster Ci. The

action sequence corresponding to Ci should be based on the setOf

needs as well as on the members of Ci“ The pair (Ci,Ai) identifies
both the group of data items and the actions which the members

of the group should 22. In this method the pair (Ci,Ai) is called
a cluster. Before applying the action sequence Ai to the group

Ci, it is reasonable to test the applicability of Ai to the

individual members of Cias well as to the group Ci' An algorithm

which teststhe applicability of A1 for the group Ci is needed.The

20

block containing this algorithm is called the acceptor block.

12. Analysis of The Result: At the time one starts

 

collecting data items for clustering analysis, his knowledge
about the data items is non-existent or very small. By the time
the procedure of clustering is done , at least the following
information is available.
a.the similarity of data items in terms of a set of given
objects;
b. the reaction of data items to the application of the
action sequences;
c. the effect of applying the action sequences.to the
individuals;
d. the relation between data itemsof one group;
e. the relation between groups;
f. the measure of "improvement;"
g. the convergence direction of the clusters as well as

individual members of clusters.

Determininthhe degree of achievement of the goals, effectiveness

of the procedure, efficiency of the algorithm and effectiveness
of the application of the sequence of actions to the individuals
are some of the factors which could be measured by analyzing
the results of the GOCS.

The GOCS is the solution to the five major problems
given in section 1.1. The GOCS is introduced in this research

and the direction for additional work on the GOCS is given in)

Chapter six.

CHAPTER III

NUMERIC REPRESENTATION OF

NNV FOR CLUSTER ANALISIS

3.1 Introduction:

 

To cluster a set of data items D = {01, Dz, ..., DN} it is
necessary to represent each data item Di by a set of n variables.
Some of these variables are such that one can associate a single
number to every one of them. This type of variable is called
a numeric variable. For example age and income are two numeric
variables. Some variables are such that they cannot be repre-
sented by a single number. That is, the meaning of this type
of variable cannot be represented by a single number as it
could be for age and income. This second type of variable is
of nominal scale , but the reason that it is considered
a different type is that the nominal category of this type of
variable is not certain. For example, let X represent the
behavior of a person ,P, from a population in which M different
types of behavior have been recognized. Usually, there is not
enough information available to define the type of behavior
for person P as one of the M types of existing behavior. That
is why X cannot be represented by a single number. This
type of variable,which has been referred to as a NNV in Chapter
One, has not previously been used in cluster analysis. In this
chapter a numeric representation for NNV's is proposed in a

manner that could be used for cluster analysis. Anderberg (2)

21

 

22

gives a complete listing of problems concerning numeric variables.

3.2 Category Relation:

 

Let X be a NNV and A be the range of X. It is assumed that
the range of every NNV is a finite set. For example, assume X
to be the behavior of an object Q and A = [A1, A2, ..., AM} to
be the range of X. Al, A2, ..., and AM are M different types of
behavior. The definition-of the elements of the set A as well
as the number of these elements are dependent on the given
problem.

Corresponding to each NNV ,X, and its range ,A, define a
set A' which is the set of the original population related to
X. Corresponding to each element Ai of the set A, define a set
A; which is a subset of A' and contains those members of the
original population which have the same value A1 for the NNV X.
For example , if X is a NNV ,say behavior, and A1, A2,...,AM
are M different types of behavior then A; is the set of all
members of the original population which have behavior of
type Ai° The following paragraph describes some of the above
notations.

Suppose the problem is to cluster a set of 1000 data items
(persons) to give them a one-year industrial job training.
Every individual data item is defined by the following six
variables: Vl=age, V2=income, V3: marital status, V4=background,
V5= behavior , and V6: performance. The original population
with respect to V5 from which the given sample is taken is 0f
size 100,000. Let A be the set of all types of behavior which

have been identified in the original population. Based on the

 

23
problem under study (clustering for industrial job training )
one can arbitrarily assume there are 4 different types of
behavior, that is, A={A1,A2,A3,A4} . Let ni be the number of
people with.behavior of type Ai,that is , ]Ai[= ni. For this
example let n1= 25,000 , n2 = 37,000 , n3 = 28,000 and n4=10,000.
This means 25,000 people of this population have behavior of
type A1, 37,000 have behavior of type A2 and so on. It is clear
that if one exchanges the problem given above with the problem
"cluster a set of 1000 persons to give them a one-year job
training in salesmanship" the definition of the different

types of behavior might change as well as their number.

Definition 3.1: Let X be a NNV and A be the range of X.

 

Let P and Q be two members of the set A' (the original population

related to X.) Define a relation R on the set A'as follows:
Let P and Q be two elements of A'. Then P is said to be

related to Q if and only if they have the same category defini-
tion in terms of X. R is called the category relation.

It is easy to verify that R, the category relation defined
by definition 3.1 , is an equivalence relation.

Lemma 3.1: A category relation R induces a partition

 

”.{Al’ A2’ "" Ad‘over the set A' °

Proof: Let the equivalence classes of the equivalence :
relation R be 81,82....,Bk. It is sufficient to show that the
partition T =[Bl,Bz,...,Bk] is the same as the partition

27 a] A',A§,...,Aﬁ} . Let P and Q be any two elements of Bi'

 

24

This implies P and Q have the same category definition;and this
further implies P and Q belong to a unique subset of A',say A5.
Therefore, Bi is a subset of A}. On the other hand let U be any
member of A; other than P and Q,if such an element exists. This
implies P and U have the same category definition which inltﬁrn
implies P and U are related. Since the 31's are disjoint, P and
U have to belong to Bi' Therefore, Bi: Ag; and this is true for
all i = 1,2,...,k and j = 1,2,...,M. In the case in which Bi
has less than two elements it is easy to show that there must
exist a corresponding A; with only one element in it- QED-

Note 1: For a given NNV X with range A and population A'
the partition 5;: [Ai,A§,...,A&} as well as the category
relation R are not unique. That is,27and R might change from
one problem to another. This is true because the definition of
the elements of the set A (categories of X) might change by
the nature of the given problem.

Note 2: Depending on the nature of a given problem one
might need to assign a set of weights to the elements of the
set A,the range of a NNV X. These weights could be any real
number in the interval (0, I]. If no weight is assigned to
an element Ai it is assumed that the weight of Ai is 1. The
weights corresponding to a set A of M elements are given by
a vector UK: (w1,wz,...,wM) or by a diagonal matrix with weights

on its diagonal.

Definition 3.2: Let A =[A1,A2,...,AM] be the range of

 

a NNV X and A' be the finite population related to A. The

density degree of an element Ai , denoted by dd(Ai), is defined

25

V
by A-
dd(Ai) = iii-#-

If A' is an infinite set the density degree of Ai is defined

for i=1,2,...,M.

by definition 3.3.

Definition 3.3: Let A and A' be as defined in definition 3.2

 

except that A' be an infinite set.Let k be the number of ele-
ments of partition 7%=[Ai,Aé,...,AlU which are infinite. Define

dd(Ai) as follows:
0 if A; is a finite element of/7

l/k if A; is an infinite element ofﬂf

 

Definition 3.4: Let f be a function from 8x8 to the
reals, where S is any set. Then f is called a distance function
if:

a. f(X,Y))0 , Vx,res;

b. f(X,Y) = 0 iff X=Y VX,Y£S;

c. f(X,Y) = f(Y,X) Vx,res.

If in addition to the properties given above one has

d. £(x,r) + f(v,z) > £(x,z) Vx,v,z es
then f is called an Euclidian distance function or a metric.

If for a function f and a set S one has the properties
(a) and (c) of definition 3.4 plus the fact that

f(X,X) = o Vxe s
then f is called a psuedo distance. A psuedo distance which
satisfies the triangle inequality is called a psuedo metric.
For example, one can define A on A'xA' as follows:

2(P.Q) = [damn-damp!

where p ‘Ai and Q eAJg,The Ai's and Ai's are as defined in the

beginning of this section. It is easy to verify that A is a

26

psuedo metric.

3.3 Representation of a NNV:

 

Throughout the remainder of this chapter assume X is a NNV

defined on the set A , A' is the population set corresponding
to X and 75" [Ai, Aé’°”’AM] is a partition on the set A'. Also,
assume N” = (w1,w2,...,wM) is a weight vector corresponding to
the partitionzg.

Given a NNV X with range A and population A' and a data
item Q there are three different possibilities as follows:

1. There is no information for the data item Q with
respect to the NNV X. In this case it is assumed Q is uniformly
distributed over A'. That is, P(Q6Ai) = dd(Ai).

2. Q is uniformly distributed over/7 , a subset ofag, with
probability p. In this case

, lAﬂ
(I'P)Tﬁﬁ— if c; = O

P(Q£EA{) =¢

 

pcdd(ci)) k if c; # a .

where r'= iA;l, A;2,...,A;k] , swigl

c; = Ain 3' and dd(Ci) =|€ii /|Bd

A'. , D = A'-B' ,
$1

3. Q is distributed over [7, a subset ofqg, with
k
distribution function (psl,p52,...,p5k) such that

1 psi - p’

1
.In this case [All

I (l-p) if c: = a
In 1

 

P(QtAi) =4

 

. , ._ .
. psj 1f Ci # O and i-sJ ,
where D, B' , Ci and/7 are the same as defined in case 2.
Note that in case: 2 and 3 it is assumed that Q is uniformly

distributed over A'-B'. If one has complete information

about the data item Q then A'-B' is an empty set.

27

Definition 3.5: For a given NNV X and a data item Q define

P, = (Pm 2A1).P(Q “5)...”ch mm).

The vector Px is called the distribution vector of X for the

 

data item Q.
If there is any weight vector corresponding to the range
of a NNV X then one might represent X by the vector
Vx = (w1P(Q 6 Ai) ,w2P(Q£ A5) ,. . . ,wMP(Q£A;‘)) .
The following example is to explain the notations Px and Vx'

Example 3.1: Let X be a NNV with range A={A1,A2,A3} and

 

population of size 10. Let 7],}= [Ai,A§,A‘3] be a partition over
A' with IA'1[= 3 , [Ail = 3 and (Ag: 4 and w”: (l/I,1,1/4)
be a weight vector corresponding to A (org). Let QEA'

and assume it is known that P(Q€Aé) = 7/10. In this example
['7 = A5 , D =[A',A'3] , |Dl= 7 , P(QéAi)=(1-7/10)3/.70 = 9/10,
and P(Q€A:'5)= (1-7/10)4/70 = 12/70. The distribution vector

of X for the data item Q is given by Px = (9/70, 49/70, 12/70L

and Vx is given by Vx=(3/70, 49/70, 3/70).

3.4 Representation of a Data Item:

 

Let D =[01, DZ’ ..., 0N] be a set of N data items and n
be the number of variables defining each data item. If all n
variables are numeric variables then each data item Di can be
represented by a row vector Di = (vil,viz,...,vin). The set of
all data items can be represented by an N x n matrix, when all
variables are numeric.

An alternative is to have s numeric variables and n-5
NNV's. Assume that the first 5 variables are numeric and the

last n-s variables are NNV's. Let 77;; be a partition of order
1

28
ni over the population of the NNV Vi , i=s+l,s+2,..,n. Define
M = Maximum(ni) . Represent each numeric variable by an
M-dimEd2inzal row vector with the first component equal to the
actual value of the variable and the next M-l components equal

to o. For NNV Vi , if ni M, add M-ni zeroes to the end of both

the distribution vector and the weight vector corresponding

1
P;. is called the augmented distribution vector of Vi.
1

to Vi and denote them by V; and P¢,,respectively. The vector
1

I .. , , .
PVi-(pll’piz’ ° ° ° ’plni,&ﬁ9£:~;ﬁ)
1
and
vVi=cwi1pil’wizpiZ"'°’winipinitosoa°°°so)s
for i = s+l,s+2,...,n., where (Pil’PiZ""’pini) is the distri-
bution vector of Vi and (wil,wiz,...,wini) is the weight vector

of Vi' Using vector representation of NNV's , each data item

can be represented by an n x M matrix as follows:

 

 

v11 0 ... 0 F
v31 0 o
i i 1
Di- p5*1’1 ps+1,2 ' Ps+1,M
i 1 Pi
ps+2,1 ps+2,2 S+2,M1
- i
i i i
LPn,1 Pn,2 --- pn,M J .

If the NNV's are represented by V¢_'s (there are weight vectors
1

O . I . i
involved) then one can replace p:+j k with "s+j,kp:+j,k 1n D

D
and denote the.resultant matrix by 0:.

In chapter pour it is shown how one can associate a

i .
dissimilarity measure to D1 and Dj using the matrices Dw and 0a.

"I

29

The letter "D" with no subscript stands for the set of
data items, with subscript "i" stands for the ith data item,
with superscript "1" stands for the matrix representation
of the 1th data item, and with superscript "i" and subscript
"w" stands for the matrix representation of the ith data item
when the weight vectors are effective.

It should be mentioned that the representation of a data
item, defined by NNV's, preposed in this chapter is one of the
possible ways to represent such data items. Of course, there
are some other ways to represent a data item defined by NNV's.
For example, an alternative method for representing a data
item defined by s numeric variables and n;s NNV's is as
follows: Suppose the first 5 variables are numeric and the
last n-s are NNV's. Let the number of categories of the Kth
NNV be nk. The column vector

' i i i T

,Ps;1 ’ns*1,...,pn,1.’...,pn’nn)

Wi=(v:,v:,...v:,P:+1,1...
is an alternative form to represent the ith data item.

In this alternative representation of data items,the
components of the vector Wi all are treated as individual
numeric variables when dealing with clustering algorithms.
That is, W1 is treated as a data item of s+ns+1+ns+2+n+nn
numeric variables. To see how this problem would affect the
result of a clustering analysis one has to collect a set of
real data and check the resulting clusters for being a well
optimized set of clusters. Since the data items of NNV's are
clustered for the first time in this thesis, the representa-

tions and clustering methods proposed in this thesis could

be a base for comparison for the future work in this area.

30

Example 3.2: Let each data item 01 be defined by six

 

variables,where U;,V;,W are three numeric variables and X, Y, Z
are three NNV's. To be specific , let each data item be a person
defined by U=age, V= income, W=marital status, X=behavior,

Y= background, and Z= performance. X,Y,and Z are defined on some
sets A,B, and C with population A',B', and C' , respectively.
The original population for all three NNV,s is the same and

it is assumed to be 10. That is, IA'I =IB'I =IC'I =10 . The

corresponding partitions are:

77A=[A'1,A'2,A'3} ,773=[B'1,35, 33,33,133 andzlc = ‘Ci,C§]
and the subpupulations are: [Ail = [Ail = 3 , [A5] = 4 ,
[Bil = lag] = 3 , [83' = 2 ,lBAl = lag] = 1 , [Ci] = [cg] = s

The weight vectors are:
N =(.1, .2, .3) , "B =(.3, .2, .l, .4, .5) and "C =(.1,.2).
Let assume for person Di it is known that:
P(DiEA£)=l/2, P(Di€B:'5) =3/5 and P(Di eCi) = 2/3.
The distribution vectors of X,Y, and Z for person Di when
computed are as follows:
(3/14, 1/2, 4/14) for X ,(2/3, 1/3) for Z and (6/70, 6/70,
42/70, 8/70, 8/70) for Y. The augmented weight and distribution
vectors are:
wA_=(.l, .2, .3, o, 0) ,
WB=(.3, .2, .l,..4, .5) ,
WC.=(.1, .2, 0, 0, 0),
Px =(3/l4, 1/2, 4/14, 0, 0) ,
Py =(6/70, 6/70, 42/70, 8/70, 8/70) ,
Pz =(2/3, 1/3, 0, 0, 0).

Let the person Di be 34 years old (U=34) with $15900 income

31

(V=15900) and married(W=0). The person 0; could be represented

by the following 6 x 5 matrix. (The weight vectors are considered.)

r 34 0 0 0 0

15900 0 0 0 0

. 0 0 0 0 0
01 =

" 3/L40 2/20 12/140 0 0

18/700 12/700 42/700 32/700 40/700

2/30 2/30 0 O O

L J

Let Dj be another person defined by the same six variables

 

 

and represented by the following information:

0 = 25 , v = 12000, w = 1 , 9x = (1/7, 3/5, 9/35) ,

Py = (0, 2/55, 10/55, 40/55, 3/55) and P2 = (3/5, 2/5).

The matrix representation of Dj using the weight vectors is:

 

 

1’25 o 0 o 0 7
12000 0 o 0 0
1 o 0 0 0
j 1/70 6/50 27/350 0 0
D" 3 0 4/550 10/550 160/550 15/550
3/50 4/50 0 0 0 .
L J

In general,let D='{D1,D2,...,DN} be a set of N data items,
n be the number of variables (consider numeric variables as
NNV‘s with partition of order 1) , and M be the maximum order

of the partitions of the population of the variables. Define
' i
01 , 0w
* Using the alternative way to represent D- one has
0%:(34,15900,0,3/140,2/20,6/70,18/700,127700,42/700,
32/700,40/700,2/3o,2/30)T.

, D , and EH as follows:

32

l

r i i 1
P11 P12 °°: PIM

’ i 1
911 p22--- 92M

1 i i
E3111 p112 ... PnMJ

 

 

where ij is the kth component of the augmented distribution

vector of the jth variable for the ith data item;

1 .
o =[--i1
H ijka
where "jk is the kth component of the weight vector correspon-
ding to the jth variable;
p = [ 01? 02; ... 30”];

and
1: 2~ : N
9-": [ D . Du E o o o :Dw] o

The matrix Di is an augmented n x M matrix containing all

th data item , excluding the weight

information about the i
vectors; glis an N-block matrixcontaining all available infor-
mation about the whole set of data items , excluding the weight
vectors. 0: contains the same information as 01 plus the
weight vectors and 2M contains the same information as 2 plus

the weight vectors. Throughout the reminder of this chapter

consider M, N, n, D, Di, 0:, D, and Dwas defined in this section.

3.5 Representation of a Variable:

Let D =[D1,DZ,...,DN} be a set of N data items , each one
represented by n variables( 5 numeric and n-5 NNV). A NNV X1 is
defined on a set Ai with population set A1 and ”a1 is a partie

tion of order ”1 on the set A1;

771, =1 A11» Aiz’HU an}

 

33

for i=s+l,s+2,...,n. According to section 3.3 each NNV can be

represented by its distribution vector P§i=(p:1,p:2,...,p§ni)

or by its augmented distribution vector

’k _ k k k
Pxi ’ (pil’ p12, 0009 pinisosos°-°:°)s

h

where pgj is the jt component of the distribution vector of

th

the 1 variable for the kth data item. Therefore, P§_ is the
1

th th

distribution vector of the i NNV for the k data item.

An alternative for representing the ith NNV for the kth

data item is

vk = pk rw.
xi xi A1
where 'WA is the weight matrix corresponding to the NNV xi
1
and is given by
r %

WAi= 0 0 W3... 0

 

 

L0 0 O o o o "nil o
The weight matrix "A1 is the same as the weight vector
(wl,w2,...,wni) but in a different notation to make multiplica-

tion possible. The augmented weight matrix of X1 is given by

le o ... 0 0 ... 0'1

0 W2 ... 0 0 ... 0

w' = 0 0 .. wniO 0
Ai 0 0 . 0 0 . 0
o 0 . 0 0 0

 

 

34

For a numeric variable , Xi, an N-dimensional vector

T
(x11, x21, ...,xNi)

will represent the set of N values of the variable Xi for N
data items 01,02,...,and DN; xki is the value of Xi for the

kth data item. If Xi is a NNV then the matrix
r. . .
1 1 1
p11 P12 --- Plni
i . o
921 P22 --- Pin,

W

 

 

i . -
1 1
le pNz ° ' ' pNniJ

will represent the set of N values of X1 corresponding to N

h row of the matrix

different data items Dl,D2,..., and DN' The jt
X1 is the distribution vector of Xi for the jth data item. One
could use the augmented distribution vectors of X1. In this case
i
X is an N x M augmented matrix , where M = Maximum(ni) . An
i:;n-s

alternative form for representing the set of all N values of a
NNV Xi is i i
Xw = X "A1 ,
where X1 is the augmented N x M matrix as defined above and

"A1 is the augmented weight matrix corresponding to Xi.

The whole information about a set of N data items of n
variables (NNV or Numeric) can be stored in a 3-dimensional
array as follows:

5 = [Pix 1 .
where i=1,2,...,-u stands for number of variables, j=l,2,...,N
stands for number of data items , and k=l,2,...,M. Another
form for representing X could be

x = [ x1! xzs...;x"],

35

where each X1 is an augmented N x M matrix. One might use x:
to represent each variable. In this case if X1 and "A- are

w 1
augmented matrices one could define Xw as follow:

2. .n
we use :xwlo

_. l:
Ew-[xw.x
In chapter four it is shown how one can associate a
correlation (dissimilarity) measure to Xi and Xjusing the
. i j
matrices Xu and X".

The following example is to explain the notations and

definitions given in this section.

 

Example 3.3: Three data items Dl,D2 and D3 are defined
by two variables X and Y with ranges A and B and populations
A' and B' , respectively. The partitions of A' and B' are:

7TA =[ A'1,A'2, A3] and 73 = [Bi,B§,B.‘5,B'4],

respectively. In this example one has n=2, N=3 and M=Max(3,4)=4.
The following information is given for each data item:

for the first data item

P(01€A'1)=P(D1EAé)=P(01€A§) = 1/3 and

P(DIEBi) = P(DléBi) = 1/2 , P(DIEB§)=P(01£83) = 0;

for the second data item

9(026A1)

0, and

P(D2 6 Ag)
I>(02 £95)

for the third data item

1/2 , 9(02 £ Ag)

9(02e31) 9(02a3') = 1/3 , ”029.35g = 0;

4

PCDSEAi) = 9(03 £113) = 1/4 , ”0332115) = 1/2 and
903353;) = 9(03235) = 9(1):, 23%) = 11(03 {135) = 9(03535)=1/4.

The weight vectors are:

u, = (.1, .2, .3) and "a = (.1, .2, .3, .4).

The augmented weight matrices are:

"A:

 

(.1 o
0 2
0 0
0 0

c

 

‘1
0 0
0 0
.3 0
0 0
J.

36

 

r.1 0 0

0 2 0
; wB = 0 0

0 0 0

The augmented distribution vectors are:

The augmented matrices X1, X ,

(1/2.

(1/4.

 

 

Y=X2) are:
’1/3 1/3
x1: 1/2 1/2
1/4 1/2
L.
'1/3 1/3
1
x"- 1/2 1/2
Ll/4 1/2
3/2 0
x2: 1/3 1/3
W
1/4 1/4

The two-block matrices X and

p

X
Ill

 

 

3/3 1/3

1/2 1/2

 

[—

g/4 1/2

 

 

 

 

 

 

 

 

 

 

 

 

(1/3, 1/3, 1/3, 0) , P; = (1/2,
1/2, 0, o) , P; = (1/3,
1/2, 1/4, 0) , p; = (1/4,

2 Xi, and X3
1/3 6 5/2 0
0 0 x2 = 1/3 1/3
1/4 g 3/4 1/4
1/3 61’11 0 0 0] 1/30
0 0 0 2 0 0 = 1/20
1/4 q 0 0 .3 0 3/40
9 0 0 0]
0 1/2‘ 71 0 0 0] ﬁ/zo
0 113 o .2 0 o = 1/30
1/4 1/3 0 0 .3 0 g/40
p 0 0 .g
Xyare:
1/3 0'hfr1/2 o 0 1/2
0 0 E 1/3 1/3 0 1/3
1/4 0 ,iL1/4 1/4 1/4 1/4

 

0 l

0

0

'ﬂ .
0, 0, 1/2) ,
1/3, 0, 1/3),
1/4, 1/4, 1/4).

(assuming X=X1 and

q

0 1/2

0 1/3 ,
I/q

 

1/4

2/30 3/30 01
2/20 0 0

2/20 3/40 0 ,

J

 

0 0 4/201

2/30 0 4/30

 

2/40 3/40 4/40,

 

 

37
’1/30 2/30 3/30 0

X = 1/20 2/20 0 0

 

 

L 1/40 2/20 1/40 0

L

The matrices D1 , D , D , D1,
w

1/3 1/3 1/3 0

1/2 0 0 1/2

1/2 1/2 0 0

1/3 1/3 0 1/3

1/4 1/2 1/4 0

1/4 1/4 1/4 1/4

The three-block matrices D and

1/3 1/3 1/3 0 3 1/2 1/2
0 = :

1/2 0 0 1/2 ; 1/3 1/3

1/30 2/30 3/30 0 f 1/20
2w: :

1/20 0 0 4/20 1/30

ﬁ/zo 0

1/30 2/30

 

3/40 2/40

. 1/30
01 =
W
1/20
2 1/20
0w =
1/30
03 _ 1/40
w -
1/40
Dyare:
0 0 3
0 1/3 E
2/20 0 0-

D2 , and D3
w

0
0

3/40

are:

2/30

0

2/20

2/30

2/20

2/40'

4/20‘

4/30

 

 

4/40) .

3/30 0

0 4/20

0 O

0 4/30

3/40 0

3/40 4/40

1/4 1/2 1/4 0

1/4 1/4 1/4 1/

51/40 2/2013/40 0

. a
2/30 0 4/30: 1/40 2/40 3/40 4/40 .

The elements of the kth row of D1 (03) are the same as the

elements of the ith row of Xk (X5). Therefore, to calculate

£.(1W) and D (2,) one has to calculate only Xk's (Xb's).

3.6 Some Motization for the '
Proposed Representation:

 

 

The representations of a NNV (distribution vector) and a

data item (an n x M matrix) are developed basically for cluster

analysis of data items. That is, the representations prOposed

in this chapter are such that one could use them for the

38

purpose of clustering.

All changes in representation of variables start from
introducing a distribution vector to represent the value of
a variable for a given data item. Therefore, for simplicity,
hereafter this new type of representation is referred to as
"Distribution Vector Representation" (DVR). The previous
representation of variables and data items is referred to as
"Single Number Representation" (SNR) .

There are some advantages as well as disadvantages to
the DVR method , concerning the cluster analysis problem.
Some of the advantages are as follows: ..-

1. Let X be a NNV with a range of M elements , that is ,
there are M nominal categories. It is known that the category
definition of X for a given data item Q is not specifically
defined. In the SNR method one either has to completely ignore
this variable or make a rough guess in assigning a nominal
value to X. In the first case,the question is; what
happens if most of the variables defining the data item Q
are NNV's such as in psychology and behavioral science?

In the second case , the question is that of assigning a
probability of l/M to every category of the range of X.
‘Comparing the distribution of X , in case of no available
information , to the rough guess method one can easily

see the difference. In the rough guess method , for N data
items on the average N/M times one would be right. For example.

let X be a NNV with range A =fA1,A2,A3f , population 10,000

and the subpopulations [Ail 100 , ”i" 6000 and ”[115“, 3900.

39
Let a sample of size 450 be drawn from this population.With
the rough guess method one would consider approximately 150
observations out of 450 to be of category one ,in terms of
NNV X. But it is known that even in the whole population there
are only 100 observations.df category one. This simple example
shows the danger associated with the rough guess method.

2. The DVR method makes it possible to compare the values
of a variable X for two data items. Let P and Q be two data
items and let Px(P) = (p1,p2,...,pﬁ) be the DVR of a NNV X for
the data item P and Px(Q) = (q1,q2,...,qM) be the DVR of X for
the data item Q. One can comparg pi to qi for i=1,2,...,M
as follows, If pi = qi it means both P and Q have the same
probability of being in category Ai' If pi~>qi then P has.a

larger probability of being in category Ai than Q does.

3. The DVR method can be applied to those nominal-scale
variables which have certain category definitions. Let X be
a nominal-scale variable and Q be a data item.Assume X has M

categories A1,A ..,AM and it is known that the category of

2"
Q is Ai' Consider r'=]Ai] and p=l in case 2 of section 3.3.
Note that in this case the subpopulationlAi] need not be
known. In this case the distribution of X for the data item Q
would be (0,0,...,1,0,0,...,0,0) . These illustrate some of
L-i-1-J ’—-M-i-————’

the advantages of the DVR method.

One of the disadvantages of the DVR method is the problem
of storage and speed of computation. The amount of computer

storage which is needed to operate a certain clustering

algorithm when the data items are represented by the DVR method

40

is almost M times as much as when the data items are repre--
sented by the SNR method. This problem can be partially solved
by using "The Storage Data Approach " methods; see Anderberg (2)
Chapter Six. In this case the operation of clustering becomes
significantly slow.

For some areas such as psychology and behavioral science,
where a large number of NNV's are required to represent the
dataitemm the amount of time needed to solve the clustering
problem when the DVR method is used is acceptable because of
the advantages previously stated. In some areas where less
NNV's ( or even nominal-scale variables with certain category
definition) are used to define the data items one should
compare the advantages of using the DVR method in spite of
its slowness. Note that slow speed computation will cause more
computer time and greater COSt. Also, note that the use of
the DVR method will cause an increase in the number of compu-
tations as much as M times more than when the SNR method is
used.

Hereafter, when the DVR method is chosen, it is assumed
that a nominal-scale variable with a certain nominal value

is a special case of a NNV as specified above.

CHAPTER IV

ASSOCIATION MEASURES

4.1 Introduction:

 

In clustering a set of N "objects" one usually deals
with some type of numeric measure!” . The measure/)1 is
assumed to be an abstract number with ratio scale. It is
also assumed thathis a mapping from D x 0 into a subset of
real numbers, where D is the set of objects(data items).

Based on the nature of the measured there are different
terminologies available such as similarity measure, closeness
measure, neighborhood distance and dissimilarity measure. It is
assumed thataﬂlis a symmetric measure ,that is,

V x,r e 0 =jn(x,v) = m(v,X).
For N data items there are (E) = N(N-l)/2 different pairs.
Therefore, the functionwdlis defined such that it has at most
N(N-1)/2 possibly different values corresponding to the set D.
Let D1,D2,...,DN be N different elements of a set of data items.
Denote the value of the function.ﬂlcorresponding to the pair
(Di,Dj) by ”(Dimj) = dij‘ The matrix S
F’ ‘!
dll dlz ... dlﬁ

S d21 dzz ... d2"

le sz o o o dNN
L . '.nt" .
is a symmetric matrix. Based on the nature of the measure.AVthe

 

 

matrix 5 might be called a similarity matrix, a dissimilarity
41

42
matrix or a distance matrix. If dij is the distance between Di
and D. or if di' is the dissimilarity between Di and D. then all

1 J J
diagonal elements of the matrix d are zero, that is, d-- = 0 for

11
i=1,2,...,N.

Let S be a similarity matrix and t be a threshold given for
the set of data items D. It is assumed that any two elements of
the set D ,say Di and Do, are similar if and only if dijj;t.

In most cases,as long as the magnitude of dij exceeds the thresh-
old t,the value of dij is not important. In these cases it is
more convenient to transform dij to a binary variable as follows:
r'1 if .11).»; ,

diJ‘ = .
0 1f dij (t .
If the dij's are binary variables‘ﬁi is called a binary matrix.
To store a binary similarity matrix only N(N-l)/2 bits are
needed,where S is an N x N matrix. If d is not a binary matrix
N(N-l)/2 computer words are neededwhere each word may have as
many as 60 bits. Therefore, transforming dij to a binary
variable is a significant way to save machine memory.
4.2 Association Measure Between Variables:

This section is divided into two parts. In the first part
some different types of measures between numeric variables are
discussed. In the second part of this section some measures of
association between NNV's are discussed.

Note that the association measure between variables is
used for the purpose of clustering as well as feature reduc-
tion. In the case of using 'an association measure fer the

purpose of feature reduction , the term "correlation" is more

meaningful for the measure between variables than the term

43

"dissimilarity." For example, in a set of N data items let

X = (x1,x2,...,xN) represent N different values of the variable
AGE and Y = (y1,y2,...,yN) represent N different values of the
variable INCOME. Seeking a correlation measure between X and Y
is more logical than seeking a similarity measure between them.

4.2.1 Association Measure Between
Numeric Variables:

 

 

It is assumed that all numeric variables involved in a
clustering procedure are of interval sca1e( after solving the
problem of mixed variableﬂ. Since this assumption has been made,
it is sufficient to consider only the association measures
between interval variables.

Let X and Y be two n-dimensional vectors given by

X = (xl,x2,...,xn)T and Y{= (y1,y2,...,yn)?

a. The inner product of X and Y is defined by

<x,r>= xTv = YTX .
b. The Euclidian norm of the column vector X is defined by

”XI! = \/(x.X> .

Define XTY

“x’” = 1x: m:
S; (xi-Yicyi-Y)
1:
and r(X,Y) =: 3 7‘—

L_‘

n
\k (xi-le \/2: (vi-If
i=1 i=1

 

 

5 1

n n
where Y = 1/n Ex. and Y = l/n C. yi.
i=1 1 i=1

The measure d(X,Y) is the cosine of the angle between the two
column vectors X and Y; and r(X,Y) is called the moment corre-

lation between X and Y. If X and Y are rank vectors then the

44

moment correlation[81] between them is defined by
n

 

Z (xi-y-lz
f’ i=1 3
(X.Y) = 1-6 . .
“n(n2-l)

The functions d, r and.f’all take values in the range [-1,l].
These three functions are some of the widely used functions for
association measures between variables. Anderberg (2) chapter
three is a good reference for further discussion about asso-

ciation measure between numeric variables.

4.2.2 Association Measure Between NNV's:

 

In chapter three,section 3.5,it was shown that a NNV can

be represented by an N x M augmented matrix

x:=[X}k]=. . ... o

 

 

is the killh

th

where xik = 'ikpik , w. component of the weight

1k

vector corresponding to the i th

variable, and pik is the k
component of the distribution vector of the it variable for
the jth data item. With this representation in mind one might
want to define some sort of association measure between NNV’s
Xi and Xj.
The problem of association measure between NNV's is

approached in two different ways. In the first approach an
average-association measure is suggested and in the second

approach an exact association measure is given by using an

M-dimensional distance function(to be defined.)

45

Definition 4.1: Let S be the set of all o‘x M matrices

 

and V be the set of all non-negative n-dimensional vectors. Let

X = [xij] and Y '[yij] be any two members of S. A function

[7: ()1, X5, ...,‘YMI from S x S into V is called an M-dimensional
distance function if

)0: 1:132:0009M5

a. Vx,vEs Pom!) = (e1,e2,...,eMl , ei

b. Vx,v£s r7(x,Y) = (0,0,...,0) g.» x=v;

c. Vx,r£s r7(x,v) --r7(Y,X);
where ei = 33(Xoi,Yoi) for i=1,2,...,M , Ki is a distance
function and xoi and in are the 1th columns of the matrices X
and Y , respectively.

If all distance functions X1, (2, ...,XM satisfy the
triangle inequality then [71s called an Mfdimensional Euclidian
distance function or an m-dimensional metric. The norm of FTX,Y)
is denoted by
, “f1(x,y)" -\/’§ij: e: .

If one or more of the 31's isldlpsuedo distance then Fis called

a psuedo distance.

Definition 4.1 can be modified to map the set S into the
set of non-negative M-dimensional column vectors. In this case
one has $9 = ( (a, [32,... ,[3an and(B(X,Y)=(b1,b2,...Pn-f,
where bi= ﬂi(xif,y1,) and xi, and yio are the ith EQYés-

of the matrices X and Y , respectively.

Definition 4.2: Let A = [aij] be an N x M matrix. The row

vector V = (v1,v2,...,vu) with

1 M .
Vl’ﬁ .21 “13°

1=l

46
is called the average row vector of the matrix A. The column
vector U a (u1,u2,...,uN)T with
1 N
u: s... a..
. M2:1,

=l
tor of the matrix A.

no...

is called the average column ve

Remark: To be more meaningful about the measures between
NNV's , the association measure is defined only between those
variables which have ranges of the same size(or partitions of
equal order or the same number of categories.) For example,
in Example 3.3 there is no association measure defined between
variables X and Y because the range of X has three elements

and the range of Y has four elements.

Definition 4.3: Let X and Y be two ﬂ-dimensional vectors

X- (x1,x2,...,x”) and Y = (y1,y2,...,yM). Define

M
(mm!) = 1m 12:1 ‘xi-yil .
The value of the function (is called the mean deviation measure
A c
of X and Y. Define (V(X,Y) = Min1mum( aTXp,YP) , where Xp and Yp
are different permutations of X and Y. The value of the function

A . .
a’is called the minimum mean deV1ation measure of X and Y.

It is easy to verify that o’is a metric. The first three
properties of a metric are the immediate results of the defi-
nition of 0’ and the triangle inequality property of 0’ reduces
to M triangle inequalities like the following one:

'xi - yilvo» ‘Yi - zit )‘xi - zi‘ .
The function a’could be considered as a dissimilarity measure
between two NNV's Xi and X- when they are represented by the

J
average rowVvectors Xi and X9. When all components of 4k"

47

and Y are between zero and one, the range of O’is [0,1].

It must be mentioned that,when one uses either the average
row vectors representing NNV's or the [3 function for finding
an association measure between NNV's , one must be sure that
the function defining the association measure must be invariant
under any permutation of the distribution vectors representing
NNV's.

An alternative for finding an association measure
between NNV's could be introduced by the N-dimensional
distance function I? . To use ﬂone needs to specify ﬂ, {9 ,...,
ﬁQ» For example ,let /%= lg: , ...,= ﬁh80’. This implies

9(Xi,xj)=(;’(xio,xl.), ¢:'(x§o,x3o),.. ., §(x§o,x§o)).r
where x§o=(xil,xﬁ2,...,xﬁM).

WhenE is used to compute the association measure between
NNV,s it is usually assumed that ﬂax Ag;...=l&. This assumption
makes it possible to compare the components of the distance
vector [7(X,Y) = (e1,e2,...,eN)T.

Since this research deals with association measures
between data items, the discussion about the association

measures between NNV's is closed at this point. The association

measure between X-

1 and Xj when they are represented by Xi and

Xi should be computed in the same way as when they were

represented by X1 and Xj.

48

4.3 .Association Measure
Between Data Items:

A set of data items,D, is given by its three-dimensional

(N-block) matrix

1 g 2! g N
2M ‘ [Dw '- Dwi'“;Dw]
or g_ a [D1 : DZE"°§DN].

The problem is to find some sort of association measure
between data items. In this section a few M-dimensional
distance functions are discussed. These distance functions as
well as many others can be used as measures between data items
with matrix representation. The algorithm for working with an
M-dimensional distance function is set up so that it is
independent of the type of the distance function. Therefore,
if later on during the practical use of the measures given

in this section, one finds a more effective and/or practical
measure he has only to change the formula for calculating the
distance matrix (to be defined later in this chapter).

Definition 4.4: Let X = (x1,x2,...,xn)T and

 

Y a (yl,y2,...,yn)T be two column vectors of the set of all
column vectors V. The function Mp from V x V into the set of

reals defined by

n [p /p

Mpcxm = 2:

i=1

 

xi ' 71

is called the Minkowski metric for all p > 1.
The limit of Mp when p approaches infinity is called the
Chebyshev metric and is defined by

Limit MP(X.Y) = M”(X.Y) = M?! Ixi'Yil.
P40 1

49
For each value of p a different metric could be obtained,
but any metric for p;»2 is of little practical use for the

area of cluster analysis. For p = l the metric is

Mlcx.v) : :3: lxi-Yil
i=1

For p = 2 the metric is

[n
M2(X,Y) g '2} (xi'yi)2 ;
1

M2 is the familiar Euclidian distance function.

 

The set of all points for which Mpal is called the unit

sphere of the metric M . The unit spheres for"M1 ,- M and M,

P 2
for a two-dimensional and three-dimensional space are given

in Figure 4-1.

Let X1: Xé‘°'°‘ X; - Mp . The association measure between

the elements of the set of data items, D, is given by

k s
Fwy . D") = (e1.e2.-.-.eh)T.

M
where . k l/p
2:: s P
ei " [j_l‘fij ’ fij‘

k

Recall that Dw - [p‘i‘j wij] , i=1,2,...,n , j=l,2,...,M ,

h

k=l,2,...,N , .p:j is the jt component of the augmented

distribution vector of the ith variable for the kth data item,

h
and wij is the jth component of the weight vector of the it
variable. For simplicity, it is assumed that pgj wij = fgj.

The special case of Mp , M2 , is used for the purpose of the
association measure between data items D: and 0:. In this

k
case M2(DE,D5) is a dissimilarity vector between Dw and D:

W

. k s T
and it 15 given by “2(Dw’Dw) = (e1,e2,...,en) , where

50

2

l/
2
, k _ $-

1

91 m...

\2.

 

 

7V

 

8’

Figure 4-1. Unit spheres of Ml,M2 and M0,,

Note : When a set of data items is defined by both NNV's

and numeric variables the magnitude of each of the numeric
variable should be normalized so that it can be compared
to a NNV. One appropriate factor to normalize a numeric

variable can be defined by

fx 8 l/Mx ,
where Mx is the maximum of all N values of the variable X
corresponding to the N different data items D1,D2,...,DN.
Hereafter, it is assumed that every numeric variable which
is involved in definition of a data item is normalized by

some factor fx.

Definition 4.5: A p x q matrix A is called a blockwise

 

symmetric matrix if there existsan r x r block partition

T-[tij] on A such that Tij a Tji for i-l,2,...,r and j-l,2,..,r.

Each T15 is a submatrix of the matrix A.

    

, gllluﬂitaﬂlgnunr 1 no

51

Definition 4.6: Let an M-dimensional distance function
r1-be defined on a set of data items. Let n be the number of
variables and D1 be the matrix representing a data item D1;
{1(05 ,‘0 )- ( d§1,d§2,...,d§“)s(d§1,dﬁ2,...,d§M)- (7(1):,05) ,
where dsj a 83(f.j’ ffj) , ffj is the jth column of the

k

matrix D" , and Xj is the jth distance function of r7 Since

there are N data items, if one calculates [7 for all possible

pairs of data items the result is an N x (N x M) matrix A ,

r 2 ‘
l 1 I 2 2 . c N N N
all dlz ooe dIM:dll dlz coo d1":ooo E dll dlz .0. d1“
1 ' 2 2 3 '

1 1 ' 2 . 2 N N N
dzl dzz coo d2“:d21 d22 one d2 .00. ' d21 dzz 00. d2"
. . '

A . e o co. 0 '0 e me. o .oo o! o o 0.0 e
o e co. 0 ' o e com o .00: . e o ace
0 ‘ z
e o me. o :0 o 000 o .000 . o e .00 o
1 1 1 2 2 2 2 : I N N
Lle sz o o a dNM ; le d N2 0 e o dNM : e o e o d"- dNZ e e o dNM

 

 

O

The matrix A is called an N-block distance matrix.

Any N-block distance matrix is a blockwise symmetric

matrix. To explain definitions 4.5 and 4. 6 one might consider

the _' of Example 3. 3

1/30 2/30 3/30 0 ' ! 1/20 2/20 0 0 3[1/40 4/40 3/40 0 ]
3 0

0
1/20 0 0 4/20 1/30 2/30 0 4/303 1/40 2/40 3/40 4/4

Using X1 - XZB X3 = M2 one has
{70th = (-024 .047 .1 .066 )= Fwﬁmb.
1 3 s 1
”(oww) " (.026 .066 .078 do )= P(0w,0w),

2
”(90:93) = (.026 .017 .105 .033)= (7033.03,).
and . .

52

'0 0 0» 0 :.024 .047 .10 .00§:.026 .066 .078 .10 I

A- .024 .047 .10' .066:0 0 o o :.026 .017 .105 .033‘

L.026 .066 .078 .10 ;.026 .017 .105 .033;0 0 0 0 J.

 

As one can see A is a blockwise symmetric matrix. Each‘block
is an 4-dimensional vector.

If the distance functions (xj's) are all the same then
lthe components of each row of the matrix A within each block

kth h

are comparable. Let d;k be the component of the it block

. h
and 3t row of A. Then dJik is the association measure between

th

the data items Di and D. based on the k elements of the

J
ranges of the NNV's. In case of XI: Y2=,...,= KM =3M2 the d;k‘s
are the dissimilarity measuresbetween Di and Dj‘ The matrix A
is a general case of a dissimilarity matrix. If it happens
that Ms] then A will be the same N x N dissimilarity matrix
as would occur by using the SNR method. The problem of storage

and computation which was discussed in chapter three will be

apparent here by looking at A.

Definition 4.6 could be modified for the n-dimensional
functionf}. In this case the form of $3 is an (n x N)x N

matrix. The following is the tranSposed form ofﬂéa.

‘

' 1 2 2 2 N N N

611 612 ... b1“ 611 612 ... bln ... 611 612 ... 6M
1 1 2 2 N N

'r bél b22 ... bZn b21 béz coo bzn oo. bzl b§2 coo bzn
A = o e on e e o e e e e o e o e o a so. 0

l3

1 l 1 2 2 2 N N N

Lle sz ... an 0N1 sz ... an ... 0N1 6N2 ... b"“J

 

 

53

As one can seed” is also an N-block distance matrix. Each
block ofA’ is a column vector of dimension n. The element ”h
is the kth component of the association vector between data
items Di and Dj. In other words, b§k is the association measure
between Di and Dj in terms of the kth variable.

In Chapter Five the M-dimensional distance function/7 is
used for computing an N-block dissimilarity matrix for a set
of N data items. But one could use the n-dimensional

function 8 or any other appropriate function to compute an

N-block dissimilarity measure for a set of N data items.

Note that [vis not invariant under the permutation of the
categories of the NNV's. If one is to use,f'one should huris-

ticaly fix the order of the categories of the NNV‘s.

To summarize this chapter , let Di and Dj be two data

items represented by two n x M matrices Dﬁ and Da,respectively.

To find the dissimilarity measure between Di and Dj’ two

approaches are given. The first approach is to define(choose)
31, 33,..., Y“ and compute‘fWD$,Da). In this case the dissi-
milarity between Di and Dj is given by an M-dimensional row

vector. The kth

component of this vector is the dissimilarity
between the kth columnsof the matrices D; and D3. The second
approach is to choose 31, 02,...,&9n and compute13(D$,Da).

In this case the dissimilarity between Di and Dj is given by
a column vector. The kth component of this vector is the
dissimilarity measure between the kth rows of the matrices

D: and Di. In this case the dissimilarity vector is of dimen-

sion n.

CHAPTER V

CLUSTERING PROCEDURES

5.1 Introduction:

In developing the clustering block of the GOC system
several changes in the representation of the data items and

variables were desirable. The following is a list of some

of the major changeS, given in Chapters Three and Pour.

a. The representation of a data item changes from an
n-dimensional vector to an n x M matrix;

b. The similarity (dissimilarity) measure between data
items changes from a single number to an M-dimensional vector;

c. The similarity (dissimilarity) matrix changes from
an N x N matrix to an N-blOCk matrix.
Because of these three major changes the data items defined

by NNV's need a new type of clustering procedure. In this

chapter two different clustering procedures are extended.
One can employ these extended procedures for clustering a
set of data items defined by NNV's as well as numeric variables.
Since most of the comparisons in this chapter are between vectors
of the same dimension , the statement given below will help to
clarify the ordering relation between vectors of the same
dimension.

Statement: Between every two vectors V = (v1,v2,...,vr)
and U = (ul,u2,...,ur) at least one of the following relations
exist:

54

55
U iff v.=u, for i=1,2,...,r ;
1 1

H
<
ll

2. V > U iff Viéui for i=1,2,...,r ;

‘0

3. v < 0 iff vi(ui for i=1,2,...,r
4. U and V are not comparable.

5.2 Geometric Representation of D andA:

Since D5‘ is a matrix and.A is a set of Euclidian distances,
a geometric representation of D andllwould help to clarify the
notations used in sections 3.4 and 3.5.

Consider two data items Di and Dj. One has D: = [fir] and
D3 = [fir], k=l,2,...,n , r=1,2,...,M. The distance between Di
and Dj is given by

ru0:,05) = (631,632,..., dJM)= (d11,d12,...,dgM),

where dgr = dj for r=1,2,...,M and

11' i i i j 2]1/2
d. = (d1 -dk r) .
Jr [k=1 kr

Each data item Di could be represented by M points in an

n-dimensional space. Call these M points Pi,P;,. .. ,P;. The M
th th

components of the vector [7(Di, Dj) , the j block of the i
row of A , are the Euclidian distances between points Pi.and
pi for k=l,2,...,M.

Figure 5.1 is the geometric representation of D and A
of Example 3.3. There were three data items of two variables
and M was 4. Therefore, each data item is represented by four
points in a 2-dimensional space. In Figure 5.1 a set of four
X's represents D1 , a set of four +'s represents D2 , and a
set of four O's represents D3. The linesbetween corrsponding
points are the elements of.A , as they are shown by arrows.
5.2.1 Hypersphere Representation of Di°

Let Pi,P;, ..., P; be M points in an n-dimensional space.
Let Di =[f§k] be the matrix representation of Di' One could

56

4
2'
no

3

acm\~ .ouuwa . Ac .oH\Hvuma . ac..on\mvu~a

mm

1

d

.ao~\~ .ovuma . ﬁcV\n.oe\mUuma . ﬁo~\m.o~\muuma

Aon\v .ovuma . ac . ovuma . aom\~.om\mvuma

H

mu
3

a3.

. HOV\H.cv\HUuma
. aOM\~.o~\Hgnma

Ao~\~.cn\mvu«a

 

 

\6

Pi ure S-l.

Geometric Representation of D andeof Example 3.3.

57
0 o i o .
have P; - (fik,f2k,...,f;k) for 1=l,2,...,N and k=l,2,...,M.
As one can see Pi is a point in an n-dimensional space.
. M .
Define [AI = l/M 5:; f¥k , then the vector
3 k=l J
’ ' i i
#1 = (pp/12. ....ﬂn)
is a row mean vector of the matrix D:. One can represent the
data item Di by the smallest hypersphere Si containing all
the points Pi's with center at,ﬂi. Figure 5-2 is the hypersphere
representation of the data items in Example 3.3. Hypersphere

(Si,f}) is called the hypersphere representation of the data

item Di'

‘6

 

 

 

.SV

 

Figure 5-2.

Hypersphere representation of the data items of Example 3.3;

58

5.3 Clustering_the Set D by
Transforming A to an NxN
Dissimilarity Matrix:

The problem of transforming,A to an NxN dissimilarity
matrix is approached in two different ways. The first approach
is by means of a threshold vector,T, and the second approach
is without use of a threshold vector.

A. Usinga Threshold Vector: In this approach it is
assumed that a threshold vector T=(t1,t2,...,tM) is given.
Two data items Di and Dj are said to be similar if all compo-
nents of the vector /7(D:,D£) are less than or equal to the

corresponding components of the given threshold vector T.
That is, P(D$,D;) \(T. As long as it is known that d;k (tk or

d§k>'tk the magnitude of the dgk is not important to the
procedure of clustering. Considering this fact, it is possible
to convert A to a binary matrix. Ab = [bgk] as follows:

. i
F0 1f djk<tk

bgk = 4

 

. i
for i=1,2,...,N , j=l,2,...,N and k=l,2,...,M.
To transform Ab to an NxN binary matrix Db’ one way is

to define Db = [bij] as follows:
M , _
1 if 2:: b}k > M/2

M .
0 if 2:: bik ( M/2 .
R: J

5..-

th
This means that if the number of 1's in the j block of the
th
i row ofAb is greater than the number of 0's in the jth

block of the ith row of'Db then bij = 1 . Otherwise bij = 0.

59

By this method of transforming.Ab to Db , two data items
are called similar if they are similar in more than half of the
categories. An alternative approach to transform Ab to an NxN
binary matrix Db which is more reasonable , but idealistic,
is as follows:

1 if there exist one or more '1" in the
jth block of the ith row of.Ab,
bij = 0 if otherwise.
In this alternative two data items are said to be similar if

they are similar in all the cases.

B. Without Using_a Threshold: In this approach it is

 

assumed there is no threshold vector given. To transform A
to an NxN dissimilarity matrix one could defineJQ=[aij] as

below:
M

i
aii "‘ g1 djk
for i=1,2,...,N and j=l,2,...,N. Definejl'=[aij] , where

aij = l/M aij‘ The matrixVQ' is called the average dissimilarity

matrix.1t is an NxN symmetric matrix. The element a!

1i 15 the

average dissimilarity measure between data items Di and Dj'

For example, the dissimilarity matrix corresponding to the

Example 3.3 is

p

0 0 0 0 :.024 .047 .10 .066:.026 .066 .078 .10
0 “v
A- .024 .047 .10 .066:0 0 0 0 ;.026 .017 .105 .033

L.026 .066 .078 .10 ;.026 .017 .105 .ossgo 0 0 0

3

 

and the average dissimilarity matrix corrsponding to this A is

rO .247 .270'

A1=1/4A.= 1/4 .247 0 .181
L‘27° .181 0

 

 

J

60

One can useﬂ orA' as the dissimilarity matrix for the set
of data items D and cluster the set D based on the matrix
Aor A'. There are a number of methods available for cluster-
ing a set of data items based on an NxN dissimilarity matrix.

(See section 1.2 for references.) Also, the algorithm given

in the next section could be used to cluster any set of data

items which is represented by an N x N dissimilarity matrix.

Note that some clustering algorithms require recalcula-
tion of the dissimilarity matrix in each iteration of the
algorithm. If one uses such an algorithm for clustering
purposes he must note that to recalculate the NxN matrix
.24 it is necessary to recalculateA.

The problem with using the matrix,€Q4') as the dis-
similarity matrix is that there is a chance of generating
invalid clusters in some special cases. This is because of
summing (or averaging) the dissimilarity measures over
the categories of the NNV's. Consider a particular case
when a set of data items consistrof four data items and is
represented by a .4eblock dissimilarity matrix
' - .25 0 .75 0 .25 .26 .27 .30“
0 .75 .25 0 .29 .30 .30 .31
0 0 0 0 .26 .25 .28 .29
L . 0 0 0 OJ

The average matrix corresponding to A is

 

 

F0 .25 .25 .271

()9. = 0 .25 .30

0 .27

 

 

L 01'

61

By looking at,A one can see the dissimilarity relation

between the pair of data items.(D1,Dz) is totaly different
from the one between the pairs (D1,D2) and (D2,D3). But the
average matrix A' says the dissimilarity relations between
the pairs (D1,Dz) , (01,03) and (D2,D3) are exactly the same.
In other words , it is very likely for a clustering algorithm
to put Dl,D2 and D3 into one group as three similar data items
(using A or A' as dissimilarity matrix). Because of this prob—
lem , the following necessary condition which is only based

on some experiments is suggested for the use of an average
dissimilarity matrix (or the matrix of the sum of the dissi-

milarity measures) for clustering a set of data items.

Necessary Condition: Calculate the standard deviation

 

of each lxM submatrix of A. There are N(N-1)/2 standard dev-
iations. Let 033 be the standard deviation of the jth block

in the ith row and [Xi]. the mean of this block.
M . o
g 2:: 1
fi' 1’” k=l djk

0’“, g J70 kZ=1( djk) ,0“
i
d,k- ff,’ ,

Aij a l/M Max
k J 1

 

for i=1,2,...,N and j=l,2,...,N.

Do not use the matrix A or A' for the purpose of clustering
1f there exist one or more 0/1]. such that 0Jij>)ij for

131.2.oo-aN and 5:142»--°'N° If it should happen that
f0 0 o
r some cases one has ggj {Aij for all 1's and 3's then one

m1ght use A or A' as a dissimilarity matrix for clustering

purposes.

' 7"

62

5.4 General Centroid Method (GCM):

In the general centroid method it is assumed that the
dissimilarity between each pair of data items is given by an
M-dimensional row vector. The ith component of this vector is
the dissimilarity between the pair of data items associated
with the ith categories of the NNV's. It is also assumed that
each data item D1 is represented by an n x M matrix D3. The
dissimilarity matrix of these data items is an N-block matrix
Note that if there is no weight vector corresponding to the
range of a particular NNV then the weight vector for this NNV
is assumed to be a vector with all components equal to 1.

In this method it is assumed that either a fixed threshold
vector T = (t1,t2,...,tM) is given for the clustering purposes
or a vector T will be calculated in each iteration of the algo-
rithm.

If there is no fixed threshold vector given then one can
define an appropriate formula to calculate a threshold vector
T when it is necessary to be calculated. The following is a
list of some apprOpriate formulas for computing T:

j=l,29000,Nr be the

1. Let Ar =[d}k] , i=1,2,...,Nr,

Nr-block dissimilarity matrix at the rth iteration of the
algorithm. Define Nr N ,
2 Z: Z: ah , k=l,2,...,M.
tk =“‘—“" i=1 j>i
Nr(Nr-l)

Then T=(t1,t2,...,tM) is a threshold vector computed for the
rth iteration of the algorithm.

2. Define tk for k=l,2,...,M to be the median of the

numbers:

63

1 -1

1 2 2 2 ‘ i N
d3k,eoo,dNrk , dSk,d4k,...’dNrk,...,di+1k,...,dNrk,...,dN;k

1
d2k’

The vector T=(t1,t2,...,tM) is a threshold vector computed for
the rth iteration of the algorithm.

3. For 0 (q {I define tk to be the smallest number such

that q*(Nr(Nr-l))/2 elements of the numbers

dur" l

1 Nrk

2 2 2 ... 4‘
dgk’ d3k,oom'd§rk , dSkpd4k3000'dNrk30003di*1k, . Nth, ’
are less than or equal to tk and (l-q)(Nr(Nr-l))/2 of them are
greater than tk. A special case of this could be for q = I. In
this case t would be defined as t = Maximum (d3 )

k k . Jk
1:1(Nr
j‘>i
The initial groups of the data items consist of N groups,
one data item per group. Assume Gi and Gj are two groups of

data items. These groups are said to be similar if and only if
i .
I7(c,,.c,’,) (T.
If during the operation of the algorithm it was decided to
combine two groups Bi and Gj into one group,Gk, then the matrix

representation of this new group is given by

G: = (63 + 633/2 ,
where GS and G; are the matrix representationsof the groups
Ci and Gj, respectively. At the end of each iteration of the
algorithm , if there are some new groups generated,then the
following items must be updated and/or recalculated:

l. the total number of groups;

2. the matrices representing the new groups;

3. the dissimilarity Nr-block matrix

64

5.4.1 General Centroid Algorithm (GCA):

The algorithm for the GCM based on an N-block dissimilarity
matrix consistsof seven steps. Note that a formula for computing
a threshold vector for the algorithm and a formula for computing
the dissimilarity measure between data items are given as a part
of the input to the algorithm. The steps of the algorithm are as
follows:

Step l-- Compute the dissimilarity N-block matrix ll.
Compute the threshold vector T if it is not given as input.

Step 2-- If there is no threshold given go to step 3.

Otherwise consider those pairs of the total (g) pairs of the
groups of data items which satisfy the condition /7(G$,G;)§'T,
where Gi and Gj are two groups of data items and G: and G; are
their matrix representations,.L§t‘

Q1 3 (Gil’cjl) 2 Q2 = (Gizscjz) 280's QC = (Gicscjc)
be those pairs which satisfy the above condition. Every one of
these pairs is a candidate pair for being combined into one
group. If -c = 0 then go to step 7 , otherwise go to step 3.
Step 3-- Assign a number qk to each pair Qk = (Gik’ij)
as follows:

M .
= 4
i qk E; thr ,

k .

where dj r 15 the rth component of the dissimilarity vector
k

between groups Gik and Gj . It could happen , although very

k
seldom , that qk=qk, for some k # k'. If this should happen

replace qk by q; as follow:

r + 1 'f . . . .
éqk __10’V+1 1 Slk+ SJk> Slkl+SJlU
0
qk 3 l
qk - 7+1 -
10 if s. + s. (sip. Sj

 

\ 1k Jk k‘ )

65

th

where Sr is the sum of the elements in the r row of the matrix

A , and Y is the maximum number of digits to the right of the

decimal point in the qk's after rounding them. If Sik+sjk is
- + S. then one should reconsider the matrices
lkl ka

representing the data items in the pairs Qk and Qk"

equal to 8

Step 4-— Rank the pairs Q1,Q2,...,Qc according to their
weights,qk's. Sort these pairs according to their ranks.

Step S-- Mark the pair with the smallest weight(or rank).

If Qk=(Gik’ij) is a marked pair,then Gi and ij should be

k
combined into one group. If no threshold vector is given go
to step 6 . Otherwise delete all pairs which have either Gik
or ij as one of their elements from the list of sorted pairs.

Repeat step 5 until every remainder pair in the sorted list

is marked.
Step 6-- Go to step seven if either all data items are

assigned into one group or if the number of groups is not
greater than a previously defined number , provided such a
number is given. Otherwise , calculate the matrices for the
new groups , set the new number of groups to N , and go back
to step I.

Step 7-- Print all the information about the generated

groups (clusters) and stop the algorithm.

 

Example 5.1: Let D=iD1,D2,...,D6} be a set of six data
items to be clustered. The initial groups are GI: {D1{, Gz=fD2‘,
GS8 303%: G4: §D4l , GS= {05}, G6: {D65 and the matrices

representing these groups are:

66

 

 

 

 

 

 

 

 

 

 

 

 

11 0 0 0 r2/3 o 0 0 ‘ r5/6 0 0 0
ci=1/3 2/3 0 0 63: 1/4 3/4 0 0 cg= 1/3 1/3 1/6 0
0/3 0 1/3 1/3 Ll/4 1/8 1/4 1/9 3/4 0 1/4 0)

f '4 F '1 { '1

1/5 0 0 0 3/5 0 0 o 1 0 o 0

03= 0 1 o 0 ,0: = 0 1/2 1/4 0 ,66 = 1/2 1/2 0 0

W
L1/2 0 1/2 0 B/z 0 0 1/2 1/2 1/4 0 0)
J J L

The distance functions X , 3%,..., XM all are the same and they
are the Euclidian distancexin an n-dimensional space. Since.A
is a blockwise symmetric matrix,one need only store N(N-1)/2
blocks ofll. In this example N is 6 and M is 4 ,therefore,.A
can be stored in an array of two dimensions. The first
dimension is equal to N(N-1)/2 = 15 and the second dimension
is equal to M=4. Call this array DELTA(15.4)'
Applying step 1 of the GCA: the initial dissimilarity matrix

is computed and stored in DELTA(I,J) as follow:

,3 1 2 3 4 1 J 1 2 3 r 4

 

 

1 .35 .15 .08 .08 11 .48 .17 .26 .50

 

 

2 .45 .33 .19 .33 12 .33 .39, .30 .00

 

 

3 .38 .33 .17 .33 13 .40 .50 .56 .50

 

 

4 .55 .17 .42 .17 14 .94 .56 .50 .00

 

 

r

 

 

 

 

 

5 .24 .30 .33 .33 15 .64 .25 .25 .5&

 

 

6 .53 .44 .17 .25

 

7 .59 .28 .25 .25

 

8 .36 .28 .35 .25

 

9 .49 .28 .25 .25

 

 

10 .76 .66 .30 .00
DELTA(I,J)

 

 

 

 

 

67

Let a fixed threshold vector T = (.50, .45, .40,.30) be given.

Applying step 2 of the GCA: Q1=(Gl,GZ), QZ=(GZ,G6) , Q3=(63,G6),
and Q4=(GZ,GS). Applying step 3 of the GCA: q1=.66,q2=l.27 ,

q3= .93 , and q4= 1.24. Applying step 4 of the GCA Q1 and

02 are marked and Q2 and Q4 are deleted from the list.
the matrix representation of the

Applying step 6 of the GCA

new groups HI and H2 are:

 

 

 

E/6 0 o '0
1 1
= - 2
HH 7/24 17/24 0 0 .(0H + Gw)/2’
[7/24 1/16 7/24 7/24J
p n
11/12 0 o 0 [
2 3 6
NW = 5/12 5/12 1/12 0 = (cw+cw)/2.
5/8 1/8 1/8 0
F .3

 

The number of new groups ,N,fis 4 and the groups are:

G =H1=101,02§ , 82: {03,06§ , 03:5041, c4={05} .

l
The second time through the algorithm , the new matrix A is

given by DELTA(I,J) as below:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3 1 2 3 4 3 1 2 3 4

1 3.
.37 .30 .19 .29 4 .84 .60 .37 .00
;Z3 .30 .21 .29 5 .54 .15 .21 .50
.43 .22 .38 .21 6 .40 .50 .56 .50

 

 

DELTA(I,J).
There is no change in the threshold T. The candidate pairs
are Ql=(Gl,Gz) and Qz=(G1,G4). The marked pair is Q1 and

the matrix representation of this new group is given by:

68

21/24 0 o 0 7
H: . 17/48 27/48 1/24 0 = (61+Gz)/2.
11/24 3/32 5/24 7/48 ,

 

 

The groups are G1 2 H1 ={Dl’DZ'DS'D6g ; stiD4] and G3 I‘DS].

The new number of groups is 3.The third time
through the algorithm c would be zero and the procedure of

clustering has to stop at step 7 of the algorithm. Corresponding
to the given threshold vector T the final clusters are the

groups generated at the second iteration of the algorithm.

5.4.2 Analysis of the General
Centroid Algorithm:

The five important features of an algorithm should be
demOnstrated for the GCA in order to call it an algorithm.
These five features are as follows:

1. Finiteness: The GCA is a finite procedure and it will
stop after (§)-l iterations at NOSt- The maximum number of times
that step 2 can be executed before going to step 7 is N(N-l)/2.
Each time through the algorithm either there is no candidate
pair or there is at least one candidate pair. If there is no
candidate pair the algorithm will halt. The worsi case occurs
when every time through step 2 there is only one candidate
pair. Each time through step 5 at least one group will be
marked. The initial number of groups is N(N-l)/2 and each time
through step 5 the number of groups reduces at least by one.
Therefore, after at most N(N-l)/2 -l iterations the number of
groups is one and the GCA will halt at step 7. Each individual
step of the GCA is finite and there is no infinite loop in

any of the seven steps of the GCA. The only step which is

69

repeated in each iteration is step 5. Let there be c pairs
included in the sorted list of step 4. The maximum number of
times that step 5 might be repeated is c; and this happens
only if all c candidate pairs are disjoint.

2. Definiteness: Step I is the calculation of the dis-

 

similarity matrix‘A and the threshold vector T. Step 2 is the
compariSon~ of at most (g) vectors of dimension M with the
vector T. The result of the comparison is equal, less than ,
greater than , or not comparible , and one should keep record
of those comparisons which result in less than or equal.

Step 3 is the calculation of the sum of the dissimilarity mea-
sures for each candidate pair of groups. Step 4 is the ranking
and sorting of a set of c pairs which have c different weights.
Steps 5, 6, and 7 are very well defined and their definiteness
is obvious.

3. Input: There are obviously some inputs to the GCA
such as , a set of N matrices each one representing a single
data item, a threshold vector T (or a formula for calculating
T) , a formula for calculating the dissimilarity measures
between each pair of data items , and N , M , and n.

4. Output: The output of the GCA is a set of M0 clusters
along with 311 relevant information about each cluster.

5. Effectiveness: All of the operations which are to be

 

performed in the GCA are sufficiently basic that they can be
done exactly and in a practical (not mathematical) finite
length of time by a person familiar with the operations using

pencil and paper. The GCA uses some basic operations

70

such as summation, subtraction, multiplication, comparison,
and square root.

5.4.3 Computational Limit and
the Problem of Storage:

Given N and M, the dissimilarity matrix A needs M (N(N-l))
computer words to be stored. A normal size problem might have
as many as 400 data items and M usually is between 3 and 10. For
M85 and N-200 the amount of memory needed to store.Ais 99,500
words which is beyond the capacity of the average size existing
computers. Because of this problem the Stored Matrix Approach
is used for programming the GCA. The dissimilarity matrix/swill
be stored on disc or on some other storage device and it will
be read as needed. The amount of memory needed for programming
the GCA consists of two two-dimensional arrays DATAl(n,M) and
DATA2(n,M) , a one-dimensional array DISV(M) , and possibly
some other small arrays as they are needed for programming
purposes.

The calculation limit of the GCA is very much problem-
dependent and is especialy dependent on the threshold vector T.
For fixed N, n, and M one might change T so that it makes the
same problem twice as long.The number of basic computations
which are required for the initial iteration of the GCA could
be given as below.

Given the Euclidian distance as the dissimilarity
measure between data items and a fixed threshold vector T ,the
number of basic computations needed in step 1 are:
(N(N-l)/2)[n+(M-l)(n-s)] subtractions (s is the number of

numeric variables) , the same number of multiplications,

71
M*(N(N-l))/2 square roots and a large number of input/output
operations. At step 2 there are M*(N(N-l))/2 comparisons plus
some input/output operations. At step 3 there are M a c summa-
tions and perhaps some additional summations when two weights
become equal. Step 4 is a sorting assignment for a list of c
elements. (there are tables available for computational time
of a binary sort for different number of data.) Computation at
step 5 is a search through a sorted list. This search might be
done as many times as : 2(c—l)+2(c-2)+....+2= 2(c-l)(c-2)/2 =
(c-l)(c-2). Steps 6 and 7.do not require many computations.
As mentioned before , the number of computations at the non-
initial iterations.of the GCA are dependent on’the threshold
vector T. Any estimation of the computation limits of the GCA
at non-initial iterations without considering the effect of T
would be a rough guess.

5.4.4 Some Special Cases:

 

As anspecial case consider M=l,that is, there is no NNV.
In ' other words, consider the case where the dissimilarity
matrix is an NxN matrix. In this case T is a threshold number.
Step 3 of the GCA is not needed for this special case exept!
for tie breaking scheme:.The other steps of the GCA stay the
same for this special case.

As another special case consider the situation where
neither a threshold vector is given nor a formula for finding it.
In this case step 2 of the GCA is redundant, that is, all

N(N-l)/2 pairs are candidate pairs. For this special case,

72

in each iteration of the algorithm there are only two
groups and always two groups to be combined into one group.
Therefore, in each iteration of the algorithm the number of
clusters reduces by one.

A third special case occurs 3 when the dissimilarity'
matrix is an NxN matrix and there is no threshold vector given.
The GCA as modified for the above two special casescan be used
to cluster this third special'case. Let D be a set of data
items and A or A' as defined in section 5.3.B. The GCA can be
used to cluster the set D with the dissimilarity matrix A. As
an example, consider the data items of Example 5.1. The matrix

A corresponding to these data items is as follows: (Only the

elements of the upper triangle are Printed‘)
F0 .66 1.30 1.72 1.30 1.20‘
0 1.39 1.37 1-24 1.27
A a 0 1.73 1.41 .93
O 1.96 2.00

0 1.64

L 0 J '

At step 5 the pair (D1,D2) is marked and at step 6 the number

 

 

of new groups is S. The new matrix A corresponding to these

fi .
[aye groups 61 {01,02} , czafnsl , 03-{04§, c4afnsgand 85={ng§

{o 1.39 1.17 1.24 1.191
0 1.73 1.41 .93
0 1.96 2.00

0 1.64

 

 

0

73

At this time the pair (63,66) is marked. The number of new groups
is 4 and the new matrix A corresponding to these four groups

GI ={D1,DZ§ , G2 '{03’06} , G3 8104fand G4 ={D5} is:

’0 1.15 1.47 1 24,

0 1.72 1.40

0 1.96

0 J.

At this time the pair (61,62) is marked. The number of new groups

 

 

is 3 and the new matrix A for these groups GI: {61,621-
SDI’DZ’DS’D6; , 62 3‘04; and Gs .{Ds} is

0 1.65 1.217
A = 0 1.96
0 .
L .

At this time the number of groups is 2 and the groups are:
81 ={01,02,03,05,06§ and 62 ={04g .

Next time there will be only one group and the algorithm will

 

 

stop at step 7.

The remainder of this chapter is a different approachto
the problem of clustering a set of N data item with an
N-block dissimilarity matrix A.. In the GCM the dissimilarity
matrix.A is to be recalculated every time repeating step 1 of
the GCA. The approach discussed in the following sections

has the advantage of calculating A only once for the whole

clustering procedure.

74

5.5 Graph Representation of D andll:

In this section a new type of graph is introduced and some
of the properties of this graph.are discussed.This new type of
graph is used to represent a set of data items D along with its
dissimilarity N-block matrix A .

Definition 5.1: A graph G=<AUB,P) is called a bipartite

 

graph [31]if ArlB = 0 and every member of P has one element
in A and the other element in B. A and B are called sources of
G and the number of elements (points or nodes) in A is called
the size of the source A. If every node of A is connected to

every node of B then G is called a complete bipartite graph.

 

N
Definition 5.2: A graph G = < U Si , P) is called an
i=1
N-partite graph if Siflsj = O and each member of P has one
1%5
element in Si and the other element in Sj for i#j , i=1,2,..,N,

and j=l,2,...,N.

 

Definition 5.3: An N-partite graph G =1< 3 Si , €> is
called a weighted N-partite graph (or an N-partit: weighted
graph) if there is a weight corresponding to every edge of the
N-partite graph G.

The mean of all weights of the edges of a source is the

weight of that source.

 

Definition 5.4: An N-partite graph G = <13 Si’ P> is
called a uniform N-partite graph if: 1:1
1. the number of nodes in every source is the same,
that is, [51' = M , i=1,2,...,N;
2. each node has exactly N-l branches (edges); and

3. there are exactly M edges between every two sources.

7S

PropositionS.l: The number of edges of a uniform

 

N-partite graph with source size M is M(N(N-l))/2.

Proof: There are (g) = N(N-1)/2 pairs of sources and by
(3) in definition 5.4 there are H edges corresponding to each
pair of sources. QED.

In Figure 5-3’(a) is a bipartite graph and‘(b) is a complete
bipartite graph. A complete bipartite graph is usually denoted
by Kr,s , where r and s are the size of its two sources. In
Figure 5-3,(c) is a S-partite graph, (d) is a weighted 4-partite
graph, (e) is a uniform 4-partite graph and (e') is a uniform
weighted 4-partite graph. A

Definition 5.5: Let G = ('3 G1 , P)>be a uniform weighted

i=1
N-partite graph and G1 =f gil’giZ""’giMx and Gj=fgj1,gjz,..,

 

J5”) be two sources of G. Let "(gik'gjk) = w be the weight

k
of the edge (gik’gjk)’ The weightM
W G-,G- = 1 M . , ,

c 1 J) / 5;} wcglk ng)

is called the average weight between two sources Gi and Gj.

N
Definition 5.6: Let G =< U Gi , P) be a uniform weighted
i=1
N-partite graph. Construct a complete weighted graph ’Ga’ as

 

follows:

1. The nodes of G8 are the sources of G;

2. The weight of each edge of G3 is the average weight
of its corrsponding nodes.
The graph G8 is called an average weighted graph corresponding
to the graph G.

Note that there are N-l average weights corresponding
to each source of a uniform N-partite weighted graph G. But

the total number of average weights of G is N(N-l)/2.

76

 

 

 

 

 

 

 

Figure 5-3.

Bipartite, k-partite, Uniform k—partite, Uniform weighted

k-partite and Average Graph.

 

 

 

 

Figure 5-3 (continued)

 

 

 

 

.0/7

 

“.‘I

026
v/US
033
o
81

 

 

 

 

 

Fi- “1'8 5’4.

Graph representation of D andA.

78
Figure 5-3 (f) is the average graph of Figure 5-3 (e').
Let D =fD1,D2,...,DN§ be a set of N data items. Let Di be
represented by M points Pi,P;,...,P; and A‘be the dissimilarity
N-block matrix of the set D. Considering D and A»as defined

above,the following theorem can be stated.

 

Theorem 5.1: Let G = <3 Di , P), where
p ”i (pimp [1,45 , #13:...»1 , j=l,2,...,N, k=l,2,..,M}.
Then G is a uniform N-partite graph.
Proof: By definition of P there are no edges between points
in Di for i=1,2,...,N. Each point of Di has only one branch or
edge to every one of the sources D1,D2,...,Di_1,Di+1,...,DN.

And the fact that every source has exactly M points (nodes)

completes the proof of the theorem. QED.
The set of data items D could be represented by the graph

G of theorem 5.1. Let dik be the weight of the edge (P:,Pi)

of the graph G.Then G is a uniform N-partite weighted graph
and is the graph representation of both D andTA.. For example,
the graph representation of D and A of Example 3.3 is given

in Figure 5.4 (a). Figure 5-4 (b) is the average graph of
Example 3.3.

5.6 Egclustering Based on Graph
Representation of D and A :

 

 

The term "k-cluster" was introduced by Ling [41]. He
proposed a method of clustering based on an NxN dissimilarity
matrix. The method was called the k-clustering method. In this
section the k-clustering terminology is used for a clustering
method which deals with an N-block dissimilarity matrix4A. This
method is based on graphical representationsof D and4A. Most

of the terminologies used in this section are taken from Ling

79

but are defined for the UNWG's.

Let D be a set of N data items and.A,be the N-block dissi-
milarity matrix corresponding to D. There is a one-to-one corress-
pondence between P of theorem 5.1 and A». Therefore, it is
possible to replace P by A in theorem 5.1. Corresponding to
each d? for i f j there is one edge between two sources Di and

3k
Dj' Therefore, throughout this section

N
c=<uni, A)
i=1

is a uniform N-partite weighted graph representing D and 13.

 

Definition 5.7: Let G1 =(A,P> and G2 = (B,Q)be two graphs:
3. GI = 9) G9 A = 9) ;
b. GIUG2 = (AﬂB, PﬂQ);
c. 610 62 =(AUB, PUQ);
d. Hcc;l e: n = (c,R>,cc:A and RCCxA
Note that PIIQ is the set of all edges which have their
nodes in Ale. Similarily, PUQ is the union of all edges of both
graphs, G1 and 62. In order to extend definition 5.7 to weighted
graphs one can define the set of edges of a weighted graph 61
as follows: (a,b,w) is an edge of a weighted graph G1 only if
(a,b) is an edge of G1 with a weight equal to w. For example,
the weighted graphs of Figure 5-5 could be represented as
follows: 61 = {A,P>', A= {a,b,c,d,e} , P =‘{(a,b,l)(a,c,5),
(a,e,2),(b,d,7),(a,d,9)} and (:2 o<a,Q) , a = {a,c,d,f,g,h} ,
Q = { (a,c,5),(a,d,3),(a,f,6),(f,g,0),(h,g,lO),(c,h,l),(d,h,4)}.

If 61 =(A,P> and G = <B,Q)are two weighted graphs then

2
part (c) of definition 5.7 is well defined only if A[)B = D.

80

 

 

 

 

 

 

Ga 3 51052

 

Figure 5-5.

A picture of Definition 5.7 for weighted graphs.

81

If Ale # D and (a,b) be an edge in both P and Q but with

differentweights ,say w1 and w2 , then the union of these two
w +w
l
edges is defined as: (a,b,w1)U(a,b,w2) = (a,b, __7_Z) . For

example, the nodes and the edges of G3=G1U G2 and G4=GlnG2 of

Figure 5-5 are as follows:

AUB = Sa,b,c,d,e,f,g,h } ;

puq ={(a,b,l),(a,c,5),(a,d,6),(a,e,2),(a,f,6),
(b,d,7),(f,g,0),(c,h,l),(d,h,4),(h,g,10)] ;

Ana = {a.c,d] , mm = {(a,c,5)}

 

Definition 5.3: Let A ={ A1, A2,..., AJand a ={Bl,nz,..,35}
be two setsof sources. Let all sources be of the same size,
say M. The union and intersection of A and B are defined as below.

(Aiu Bj if Ainaj =¢

 

 

A1983. = ¢ Ai if A1 = Bj
Undefined otherwise
A- if A- = B-
1 1 J
AiOBj={¢ .
1‘ S a,
AUB= u nuiasj)
i=1 j=1 J
and
r s ‘
Ana: u U(Ai0Bj) .
i=1 j=l

 

Note that if there is one or more undefihed elementsin AUB then
AUB is not defined. With this definition in mind one can apply
definition 5.7 to the UkWG's as well as to the regular graphs.

Definition 5.9: Let G be a UNWG. A uniform k-partite

 

weighted subgraph (UkWSG) of G ,say H =(X,Q)., is called an
r-chained UkWSG of G (or simply an r-chained subgraph) if for
every two sources of x such as A and B there exist m sources,

nn)l , in X so that the chain. A=A1,A2,...,Am=B satisfies

82
the condition
"(Ai'Ai+1) { E. for i=1,2,...,m-1. (5-2)
The set of all weights of the edges between sources Ai and Aj
is denoted by N(Ai,Aj) = (d§1,d§2,...,d§M).
In definition 5.9 r is any positive number such that

Minimum (dgk) ( r (Maximum (dg'k) .

i.j.k 1.3.k
The chain A=A1,A2,...,Am=B is called the r-chain of the sources

A and B and is denoted by rc(A,B) . For simplicity, denote

(5-2) by
rc(A,B) i 3_. (5-3)

Proposition 5.2: Every graph H =<A,Q) with Q 8 ¢ is an
r-chained UkWG for any positive integer r and for k=l,2,...,lAt.
2322:: It is an immediate result of definition 5.9.
Proposition 5.3: Let 61 be a Uklw r-chained graph and 62
be a Ukzw rechained graph and let M1 be the source size of 61
and M2 be the source size of 62.
a. 63 = 617162 is a Uk3w r-chained graph for k3:=Min(k1,k2)
provided that M1=M2.
b. G4 = GIU 62 is not necessarily a Uk4W r-chained graph.
3322:: If M1 # M2 then G3 is an empty graph and G4 is
undefined. Now let M1=M2. Since 63661 (and 63:62) , (53 is
a subgraph of a UK" r-chained graph. Therefore, G3 is a Uksw
r-chained graph. To show that G4 is not necessarily a Uk4W

r-chained graph , let G1 and G be two graphs such that GﬂlefD

2
and there exists at least one source Al in G1 and one source

A2 in Gz such that "(A1,X))5£ for every source X of G, and

N(A2,Y))>£_ for every s;urce Y of 61' There is no r-chain for:

A1 and A2 , two sources of G4, satisfying (5-3) . In

other words, rc(A1,A2))d£ for all r-chains in G4. QED.

83

Proposition 5.4: Let G1 =‘<A1,Pl\>be a URI" r-chained sub-

 

graph and G2 = (A2,P2) be a Ukzw r-chained gubgraph of G. If
Glf)Gz # D then GIU 62 is a Uk3W r-chained subgraph of G.

Pooof: Let 51(162 = (A3,P3) and A3 # ¢. Given any two
sources X and Y of GlU 62 there are two casec:

l. X and Y are both sources of G1 (or GZ)' In this case
rc(X,Y)g'£ by the fact that G1 (or Gz) is r-chained.

2. X is a source of G1 and Y is a source of 62' In this
case let 2 be a source ofGlnGz. One has rc(X,Z)(_1; and

rc(Y,Z)s:£ by the fact that G1 and 62 are both r-chained
graphs. Let X=Xl,X2,..., Xn=Z “and Z=Y1,Y2,...,Yn=Y u: two

b

chains connecting X to Z and Z to Y, respectively. The chain
X=Xl ,Xz, ..., X.=Z=Y1, Y2,...,Yn=Y
will connect X to Y and one has
{rc(X,Z)s£ and rc(Z,Y){_r_}=g,rc(X,Y) So.
therefore, 51 U G2 is r-chained. QED.

Definition 5.10: A UsWSG of G , say H= <X,Q>,is called

 

(k,r)-bounded ,k)»0, if for every source A in X there exist
a UkWSG F =<i§l T1,?) of H such that W(A,Ti)(_1; ,i=l,2,...,k.
Note that a (k,r)-bounded UkWSG is not necessarily an

r-chained UkWSG.

Definition 5.11: A (k,r)—bounded UkWSG which is r-chained

 

is called a (k,r)-chained UsWSG.

Proposition 5.5: The finite union of (k,r)-bounded

 

UkiWSG's is a (k,r)-bounded USWSG
Proof: It is an immediate result of definition 5.10.

Proposition 5.6: The finite union of (k,r)-chained UkiWSG's

 

is not necessarily a (k,r)-chained UsWSG, but the intersection

is.

84

A UsWSG of G is called a maximal (k,r)-bounded (rithained)
UsWSG if it is (k,r)-bounded (r—chained) and it is not a proper

subgraph of any other (k,r)-bounded (r-chained) UtWSG.

N
Theorem 5.2: Given a positive number r and G =(g21 Di,A) ,

the class of all maximal r-chained UkiWSG's of G is a partition
on G.

Proof: It is sufficient to show that any two maximal
r-chained UkiWSG of G ,say G1 and 62, are disjoint. Let Gln62¥¢-

By proposition 5.4 G3 = GIU G2 is an r-chained Uk MSG of G. On

3
the other hand the fact that Glg;63 (or GZCIGz) is in contradic-
tion with the hypothesis that 61 (or G2) is a maximal r-chained
UleSG of G. Therefore, Gln Gz hasNto be an empty graph. QED.
Definition 5.13: Given G =(gngi,.A>, a UkiWSG of G is
called a (k,r)-c1uster if r is the minimum value for which Gi
is (k,s)-chained for some 5 and G1 is maximal (k,r)-chained.
A subgraph of G which is a (k,r)-cluster for some r is
called a k-cluster.
Proposition 5.7: Not every UsWSG of G , s>.l, is a
(k,r)-cluster for some r.
Pooof: It is an immediate result of definition 5.13.
Proposition 5.8: Let S be a (k,r)-c1uster and S' be a
(k,r')-c1uster of G.
a. r=r' =geither S = S' or SnS' = Q.
B.
Proof: a) Let S i S' and SITS' = X # D. This implies

b. r')r=9either SCS' or 808'

SUé? is (k,r)-chained ,by proposition 5.4, and this implies
S is not maximal (k,r)-chained which is a contradiction.

Therefore , if S # S' then S and 8' must be disjoint.

85

b) Let s¢s' and 505' = x ¢ ¢ . If s=s' then this
implies r=r' which contradicts the fact that r')»r. If SIDS'
then SUS' is a (k,r)-chained and this implies S (or S') is not
the maximal (k,r)-chained UsWSG of G which is a contradiction.
Therefore, If sgts' then S and 8' must be disjoint. QED.

Theorem 5.3: If T is a nonempty maximal (k,r)-bounded

 

UpWSG for a fixed integer k then T is uniquely partitioned
into m (k,r')-clusters for some r'{'r.

Proof: Let X1,X2,...,X be p sources of T. The fact that

P
T is (k,r)-bounded implies, for each Xi,there existssome UkWSG
Ti of T with exactly k sources (excluding Xi) such that

N(Xi,U)£{£ for all U being a source of Ti‘ ,(5-4)
If there are more than one such Ti let r; be the minimal
value of r satisfying (5-4) and T; be the corresponding UkWSG
of T. Let i871 Téi # D for si€(l,2,...,p). Then

65 =(igl Xsi,Q>

is a maximal (k,r')-cluster, where r' = Max(r;l,r' ,...,r' )
and q is the number of all such T'i's. The uniqueness of
Ti's implies the uniqueness of the Cs. QED.

It is clear from the above definitions, propositions and
theorems that the clustering problem is only the problem of
finding the maximal (k,r)-bounded UkiWSG's of G. The set of
k-clusters corresponding to each (k,r)-bounded UkiWSG of G
is a well defined entity and it can be found by theorem 5:3.

Remark: Given an N-block dissimilarity matrix A.=[d;k],

assign a rank matrix R=[r%k] to A as follow: Rank dgk's from

1 to(N(N-l)/2)*M according to their magnitude and set r;k to

1

be the rank of dj The scheme of handling ties by assigning

k'

86

to the tied observations the average of their ranks should

not be used in ranking dgk's, because it could raise some of
the terms to fractional ranks which are not desirable for
computational reasons. One way to break the ties is to assign
the same rank to all tied observations and skip the unused ranks.
An alternative is to add some random noise to the tied obser-
vations to make them unequal. Some of the reasons for replacing

Ah by R in the k-clustering algorithm are as follows:

1. The use of R simplifies the computations required to
find all k-clusters of a set of N data items;

2. The use of R makes it possible to save computer memory
by storing each rank in at most 12 bits instead of storing it
in a complete computer word;

3. The use of R makes the clustering result invariant
under any monotonic transformation of/X.

5.6.1 X-clusteringTAlgorithm:

 

As mentioned before, Ling[4l] uses the term"k-cluster"
for his method. Here in the last two sections some definitions
and theorems were presented in order to make it possible that
one can generalize the idea of k-clustering to cluster a set
of data items which are represented by n x M matrices(not by
n-dimensional vectors.) The algorithm for the k-clustering
method based on graph representationsof D and A consists of
the following steps:

Step l-- Choose a positive integer k, set r=M*(k;1),
and replace A by R.

Step 2-- Find Br , the maximal (k,r)-bounded UsWSG of G.

Step 3-- If Bri D find all m ,mjel, k-clusters

87

corresponding to B1. using theorem 5.3. If B - D go to step 5.

r

Step 4-- If a new cluster is foundzaIVeaall relevant
information about this cluster.

Step 5-- Increase r byld units(or if it is appropriate,
increase it as it is specified in the following remark) and
repeat the algorithm from step 2 until:

a. the number of clusters is less than or equalto some
previously defined number of clusters, if such a number is given.
b. the largest cluster ,D, is reached, that is , every

data item is assigned into one group.

Remark: Let r* be the previous r used to find maximal
(k,r)-bounded UsWSG S and I(r*) be the set of all ranks of
clusters already feund.lnstead of increasing r bij)one can set
the next r to be the smallest integer greater than r* and not
in I(r*). The reason is that no integer in I(r*) can
generate a new k-cluster.

Steps 1 , 4 and S of the above algorithm are clearly
well defined. To show the definiteness of the steps 2 and 3,
and indeed the complete algorithm, the following computational
steps of the algorithm are specified.

Step c(l)-- Read the fixed parameters N,M,n and k.

Compute the initial value of r as r0: M*(k;1) . Compute

all N(N-l)/2 dissimilaritypg-dimensional vectors corresponding
to the N(N-1)/2 pairs of data items.The following is the order
of pairs for which the dissimilarity vectors are computed:
(D1,D2),(D1,D3),...,(D1,DN),(D2,D3),(DZ,D4),...,(D2,DN),...,

(Di,Di+l)9°°-9(D1,DN):°-'a(DN-1’DN)°

88

Store the ranks in an array called RANKS(N(N-l)/2 ,M). The ranks
are stored as they are calculated. Therefore, RANKS(I,J) is
the Jth component of the dissimilarity vector between Dt and
DS where t is the integer solution of the inequality
(t-1)(N-t/2)<I <t(N-(t+1)/2).
and s = I-N(t-1)§ t(t+l)/2.
Step c(2)-- To find the maximal (k,r)-bounded UsWSG Br of
G , first let all sources of G be included in Br’ At each
iteration each source of B1‘ is checked once to decide whether
it should remain“ in Br or not. The decision rule is as followh
If D1 is a source of B1. and there exist k other sources,say
Dj1,Dj2,..'.,D5k , in Br such that rc(Di,Djs)<ﬁoi ,s=l,2,..,k
then Di remains in Br’ Otherwise it is deleted from the list
of sources of Br; At the completion of such a pass , if some
sources of'Br were deleted from the list of the sources of B1.
then at least one more pass is required to ensure that Br is
(k,r)-bounded. This iterative process terminates when no
source is deleted from Br during a complete pass, or when
the number of remainder sources in B1. is less than k+l. If the
number of sources of Br is less than k+l then B1. is an empty
UsWSG. If Brﬁ D then it is the maximal (k,r)-bounded UsWSG of G.
Step c(3)-- If B1. = D set new r to the maximum of r+ld
and the smallest integer not in I(r*) and greater than r* and
go to step c(2). Otherwise partition Br into m , m)l, maximal
r-chained UsWSG‘s (k-clusters) C1,C2,...,Cm. The process of
partitioning Br into m k-clusters is as follow: Consider each
source of B1. to be a node of a graph. There is a path between

two nodes , A and B, of this graph if and only if rc(A,B)§f£.

89
The disjoint subgraphs of this graph are the k-clusters of Br'
If the number of clusters is the same as that for the previous
r* and B1. is identical to Br* then no new cluster is found;
In this case update the value of r and go to step c(2). If
there is only one cluster obtained and it contains all Dils then
go to step c(6). Otherwise go to step c(4).

Step c(4) --Compare the clusters C1,C2,...,C of step c(3)

m
to the previous clusters. As soon as a new cluster is found go
to step c(S).

Step c(S)-- Save the elements , the size , and the r
corresponding to this new cluster and modify I(r).

Step c(6)—- Mark those previously obtained clusters which
are a proper subcluster of the new cluster. Once a cluster is
marked it will not be compared to any other cluster in sub-
sequent iterations of the clustering procedure.

Step c(7)-- If the new cluster contains all data items
then print the results of clustering and stop the procedure.If
thé total number of disjoint clusters at this point is less than
or equal to a previously defined number of clusters, provided
such a number is given, then print the clustering results and
stOp the procedure. Otherwise update the value of r and go to
step c(2).

As an example of the k-clustering algorithm , one can

consider the data items of Example 5.1. The rank matrix for

these data items is stored in array RANKS(I,J) as follows:

90

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

J J J

1:5 1 2 3 4 1 1 2 3 4 1 1 2 3 4
1 38 6 4 4 6 s1 42 7 14 11 44 9: 22 46
2 43 3o 12 36 7 55 23 14 14 12 36 29 26 1

e. .._..___.3L._.e

3 59 3s 9 30 8 39 23 37 14 13 4o 46 53 46
4 52 7 41 5 9 4s 23 14 14 14 6o 53 46 1
5 13 26 3o 30 1o 58 57 26 1 1s 56 14 14 46

 

 

 

 

 

RANKS(I,J).
Given k=2, the first maximal (2,r)-bounded UsWSG of G=<l3 Di,R>
is a maximal (2,44)-bounded U4WSG with sources equal 1:1
D1 , D3 ’D6' This U5WSG is partitioned into only one 2-cluster,
namely C1 =lD1., OS ’06} . The maximal (2,48)-bounded USWSG of
G has Dl,D2,D3,D5,D6 as its sources and is partitioned into
one 2-cluster ,namely C2=fD1,D2,D3,D5,D6§ . Since Cl is a
2, C1 is marked. The last maximal (2,r)-buunded
UsWSG of G is a (2,56)-bounded U6WSG and is G itself. Since

subcluster of C

this last maximal (2,SS)-bounded U6WSG is partitioned into
only one cluster ,namely the set D itself, the algorithm will

halt at step 5 (Or step C(7).)
The GCM and the k-clustering method are two different

approaches. Each one has its own advantages as well as disad-
vantages. But there are two important reasons that one might
prefer the GCM. These reasons are l)forvelarge number of data
items the GCA is significantly faster than the k-clustering
algorithm, and 2) the GCA recomputes the dissimilarity
matrix each time a new cluster is obtained,therefore, more
Optimized clusters are expected from the GCA.

One of the most important advantages of the k-clustering
algorithm is that it can generate different clusters by

changing the parameter k.

91

It should be mentioned that , due to the lack of a set
of real data , the two algorithms given in this chapter have
not been tested by a set of real data. But they have been

programmed and tested by a set of generated data.

I

To summarize this chapter, a set of N different data
items , D = {01, Dz, ..., DN} , each one of them represen-
ted by an n x M matrix along with a dissimilarity matrix,

A , are given. Two different approaches are developed for
clustering the set D. The first approach is to transformld

to an N x N dissimilarity matrix and use one of the existing
clustering algorithms to cluster the set D. The second app:
roach is to use A as it is. For this second approach two
different clustering algorithms are developed. One is the
generalization of the Centroid Algorithm and the other one

is based on graphical representations of D and A . For exten-
ding Ling's[4l] method, some theoretical properties of the

graph representation of D and A are discovered in this chapter.

CHAPTER VI

SUMMARY AND CONCLUSION

6.1 Summary of this Desertation:

 

Chapter one is an overview which includes brief
descriptions of a number of major problems. These major
problems are l) the problem of representation, 2) the problem
of clustering, 3) the problem of assignment, 4) the problem of
applicability of assignments and 5) the problem of dynamic
clusters. Chapter two is the design study of a general system
called goal oriented clustering. This system is designed so as
to provide some solution to the above major problems. The goal
oriented clustering procedure(GOCP) consists of three separate
blocks. A list of elements of the GOCP is given in chapter two
along with a brief description of each element.

Chapter three is concerned with variables. Since numeric
variables have been studied for a long time by most of the
authors working in the area of clustering, only a reference is
given for the problem related to the numeric variables. The
numeric representation of NNV's is the most important step
of GOCP. Two different representations of NNV's are provided
in chapter three. Also, in this chapter some appropriate
measures corresponding to the numeric representation of NNV's
are defined. Each data item is represented by an n x M matrix
and the set of all data items is represented by an N-block
matrix. Finally , some motivation for all these proposed

representations is given at the end of chapter three.

92

93

In chapter four an M-dimensional distance function
[7: ( Yl’ 32,...,‘1M) is defined and a very special form of
this distance function is used for association measure between
variables. Also, in this chapter the Minkowski metric is
discussed and extended to an M-dimensional Minkowski metric.
The extended Minkowski metric is used for measuring the dissi-
milarity between data items. A general N-block distance matrix
is defined and . is used to represent the dissimilarity
between data items.

The following are some of the most important tOpics
discussed in chapter five .

l. Geometric Representation of D and.A: In this section
an attempt is made to interpret the matrices D and4A.by some
geometrical means.

2. General Centroid Method: Two approaches are provided
for clustering a set of data items D with a dissimilarity
matrixA. The first approach is to transform A to an N x N
dissimilaiity matrix and then use one of the existing
clustering methodsto cluster the set D. The other approach is
to work with.A as it is. The General Centroid Method is
proposed for this second approach. Some special casesof GCM
arediscussed and an algorithm for the general case of the GCM

is provided.

3. Graph Representation of D andA: In this section two

new types of graphs are introduced. These new graphs are called

N-partite Weighted and Uniform N-partite Weighted.Graphs. The
definitions and properties of these graphs are discussed and

D and A are represented by a uniform N-partite weighted graph.

94

4. k-clustering Based on Graph Representation of D andA :
Chapter five is closed by providing the necessary mathematical
theory for clustering a set of N data items when they are
represented by a UNWG G=<§gf Di,AC>. A number of theorems
and definitions are provided in this chapter dealing with graph
representationSOf D and A .

6.2 Conclusion:

 

..

One of the major results of this thesis is the development
of a method of representing NNV's as part of a characteristic
vector of a data item. NNV's had never before been considered
as part of characteristic vector of a data item. On the other
hand the contribution of this type of variable to the result
of any cluster analysis is apparent. Through use of the GOCS
one can represent an observation by any type of variable.

One can incloude behavioral performance as well as age in the
list of variables.

A most important result of this thesis is the development
of a complete solution to the clustering problem of the
GOCS.The clustering criteria proposed in this research are
generalized modesof some previous clustering criteria. The
representation of variables has been generalized to M-dimen-.
sional vectors,the representation of data items has been
generalized to an n x M matrix , and the distance measure
between data items has been generalized to an N-block matrix.
The generalization is such that if no NNV is involved then

the representations are the same as they were with the previous

criteria. Two clustering algorithms are developed.The first

95
algorithm is a generalization of the Centroid Algorithm(GCA).
The GCA ,if there is no threshold vector given, will generate
only one new cluster in each iteration. That is, the GCA
combines two of the most similar groups and then compares
this new group with all the other groups to find the next
most similar group. The second algorithm is the generalization
of the k-clustering Algorithm. The definitions and the
theorems deve10ped for generalizing the k-clustering Algorithm

are one of the important parts of this theSis.

The M-dimensional distance function introduced in(3hapter

Four could be used as wellfor factor analysis,multivariate
analysis in case of three dimensional matrices,and pattern
recognition. The clustering criteria introduced in Chapter
Five could also be used for clustering the data items with no
NNV, especially the one based on graph representation of D
ande . The advantage of this method is that the cluster in
this method is a well defined entity and is dependent on the
dissimilarity matrix and the parameter k. The definition of '
uniform k-partite graph could be extended for pure graph'

theoretic reasons.

6.3 Soggosted Topics for Further Work.: The following is a
list of suggested tOpics for further work. in the area of GOC.
1. Implementing a General Action Block: At this time the
clustering routine of the GOCP is completed but the action
routine of the GOCP is not implemented as a general computer
program . The action block should be designed such that it
acceptsas input a set of M0 clusters , a set of needs associ-

ate with the problem, a set of actions , and a set of control

96,
parameters which are dependent on the given problem. The output
of the action block should be a set of action sequences.

2. Implementing a General Acceptor Block: This block
should be designed as a subblock of the action block. The
inputs to this block are,possibly, a set of actions , a set
of needs , a set of action sequences , and a set of parameters
defining the nature of the given problem and specifying the
relations between actions. The output of this routine is
either "accepted" or "rejected."

3. Studying,the Dynamic Property of a Cluster: This subject
was mentioned as one of the five major problemuin chapter one.
As far as the author knows there has been no work related to

this subject and it is perhaps a most interesting subject

to be studied.

4. Developing a Different Approach: Solving the cluste-
ring problem of the GOCS in a different approach and comparing
the result of that approach to what is done in this thesis

could be an interesting topic for further reseach in this area.

(l)

(2)

(3)

(4)

(5)

(6)

(7)

(8)

(9)

(10)

(11)

(12)

(13)

(14)

97

Bibliography

Aitken, M.A. and Hume, N.W., "Correlation in a Singly 'l‘rtunmted
Bivariate Normal Distribution III. Correlation between Ranks and
Variate Values." Biametrika Vol. 53, parts 1/2 pp. 278-281, 1966.

Anderberg, R.M., "Cluster Analysis for Application." National
Technical Information Service, 1972. (Published in 1973)

Babu, C.C. and Chan, W.C., "An Algorithm for Pattern Classification
Using Bigenvectors."

Bailey, 0.3. and Tryan, R.C., "Clustering Analysis." MoGraw-Hill,
1970.

Ball, G.H. and Hall, D.J., "A Clustering Technique for Summarizing
Multivariate Data." Behavioral Science, 12, pp. 153-155, 1967.

Ball, G.H. and Hall, D.J., ”ISODATA, A Novel Method of Data Analysis
and Pattern Classification." Stanford Research Inst., AD 699616,
1965.

Bartels, P.H., Bahr, G.F., Calhoun, D.W. and Wied, G.L., "Cell
Recognition by Neighbourhood Grouping Technique in TICAS." Acta
Cytological Vol. 14, No. 6, pp. 313-324, 1970.

Canover, W.J., "Practical Nonparametric Statistics." John Wiley &
Sons, Inc., 1971.

Diday, 3., "Optimization in Non-Hierarchical Clustering." Pattern
Recognition, v01. 6, No. 1, pp. 17-34, 1974.

Dubes, R.C., "Information Compression, Structure Analysis and
Decision Making with a Correlation Matrix." Michigan State Univer-
sity (1970).

Dunn, J.C. ”A Graph Theoretic Analysis of Pattern Classification via
Tamusa's Fuzzy Relation." IEEE Trans. on System, Man and
Cybernetics, pp. 310-313, May 1974.

deards, A. w. P. and L. L. Cavalli-Sforza, "A Method for Cluster
Analysis,” Biometrics, Vol. 21, No. 2, pp. 362-375, 1965.

Eigen, D.J., Northouse, R.A. and Fromm, F.R., "Cluster Analysis Based
on Dimensional Information with Application to Feature Selection and
Classification." 1883 Trans. on System Man, Cybernetics, pp. 284-294,
May 1974.

Fisher, J.L. and Zudun, J. "On the Method of Theory of Clustering."
Multivariate Behavioral Research, 4, pp. 235-250, 1969.

(15)

(16)

(17)

(18)

(19)

(20)

(21)

(22)

(23)

(24)

(25)

(26)

(27)

(28)

(29)

(30)

98

Fisher, L. and VanNess, J.W., "Admissible Clustering Procedures."
Biometrika Vol. 58, pp. 99-104, 1971.

Fisher, W.D., "0n Grouping for MaximumlﬂomOgeneity." Journal of
the American Statistical Association, No. 53, pp. 789-798, 1958.

Friedman, H.P. and Rubin, J., "On Some Invariant Criteria for
Grouping Data." Journal of American Statistical Association, Vol.
62, pp. 1159-1178, 1967.

Friedman, J.H. and Tukey, J.W., "A Projection Pursuit Algorithm for
Exploratory Data Analysis." IEEE Trans. Comp. C-23, pp. 881-890,
Sept. 1974.

Fukunaga, K. and Koontz, W.L.F., "A Nonparametric Valley-Seeking
Technique for Cluster Analysis." IEEE Com. Tran. C-21, 2, pp. 171-
179, Feb. 1972.

Fukunaga, K. and Olsen, D.R., "Representation of Non-Linear Data
Surface." IEEE Comp. Trans. vo1. 3, C-22, No. 10, Oct. 1973, pp.
915-923.

Fukunaga, K. and Koontz, W.L.G., "A Non-Linear Feature Exhibition
Algorithm Using Distance Transformation." IEEE Camp. Trans. Vol.
C‘Zl, NO. 1' Jan. 1972' PP. 56-63.

Fukunaga, K. and Koontz, W.L.G., "Asymptatic Analysis of a Non-
Parametric Clustering Technique." IEEE Trans. on Computer, V01.
C-Z' NO. 9' Pp. 967-974, sept- 1972.

Fukunaga, K., "Introduction to Statistical Pattern Recognition."
Academic Press, 1972.

Fukunaga, K. and Olsen, D.R., “An Algorithm for Finding Intrinsic
Dimensionality of Data." IEEE Trans. on Computer, Vol. C-20, No. 2,
Feb. 1971.

Fukunaga, K. and Kessell, D.L., "Estimation of Classification Error."
IEEE Comp. Trans. Vol. C-20, No. 12, pp. 1521-1527, Dec. 1971.

Gimilin, D.R. and Ferrell, D.R., "A K-K' Error Correction Procedure
for Nonparametric Imperfectly Supervised Learning." IEEE Trans.
SMS. pp. 304-306, May 1974.

Gower, J.C., "Comparison of Some Methods of Cluster Analysis."
Biometries, No. 23, pp. 623-637, 1967.

Harding, E. F., "The Probabilities of Rooted Tree-Shapes Generated
by Random Bifurcation," Advances in Applied Probability, Vol. 3,

Hajek, J., ”Nonparametric Statistics." Holden-Day, Inc., 1969.

Hellman, M.E., "The Nearest Neighbor Classification Rule with a
Reject Option." IEEE Trans. Vol. SSC-6, pp. 179-185, July 1970.

(31)

(32)

(33)

(34)

(35)

(36)

(37)

(38)

(39)

(40)

(41)

(42)

(43)

(44)

(45)

(46)

99

Harry T. Hsu, "An Algorithm for Finding a Numeral Equivalent Graph
of a Digraph." Journal of ACM, Vol. 22, No. 1, Jan. 1975.

Hasselb'lad, V., "Estimation of Parameters of a Mixture of Normal
Distributions." Technometrics, Vol. 8, pp. 431-444, 1966.

Jance, G. N. and Williams, W. T., "Note on the Classification of
Multi-Level Data." Computer Journal, No. 9, pp. 381-382, 1967a.

Jarvis, R. A. and Patric, E. A., "Clustering Using a Similarity
Measure Based on Shared Nearest Neighbor." IEEE Trans. Vol. C-22,
pp. 1025-1034, Nov., 1973. ‘
Johnson, S. C., "Heirarchial Clustering Schemes." Psychametrika,
Vol. 32, No. 3, Sept. 1967, pp. 241-254.

Jensen, R. E., "A Dynamic Programming Algorithm for C1uster Analy-
sis," Operation Research, Vol. 17, No. 6, pp. 1034-1057, (1969).

Kruskal, J. B., "Multidimensional Scaling by Optimizing Goodness to
Fit to a Non-metric Hypothesis." Psychametrika, 29, pp. 1-27,
(1964a).

Labovitz, S., "In Reference of Assigning Numbers to Ranks."
American Sociological Review, Vol. 36, No. 3, pp. 520-521, 1971.

Labovitz, S., "The Assignment of Numbers to Rank Order Categories."
American Sociological Review, Vol. 35, No. 3, pp. 515-524, 1970.

Labovitz, S., "Some Observations on Measurement and Statistics."
Social Forces, Vol. 46, No. 2, pp. 151-160, 1967.

Ling, R. F., "Clustering Analysis." Ph.D. dissertation, Technical
Report No. 18, Dept. of Statistics, Yale University, 1971.

Lingoes, J. C., ”An IBM 7090 Program for Guttman-Lingoes Smallest
Space Analysis--I." Behavioral Sciences, 10, pp. 183-184, 1966.

Lance, G. N. and Williams, W. T., "Computer Program for Hierarchical
Polythetic Classification (Similarity Analysis)." Computer Journal,
Vol. 9, No. 2, pp. 60-64, 1966.

Lance, G. N. and Williams, w. T., "Computer Program for Monotlletic
Classification (Association Analysis)." Computer Journal, Vol. A,
NO. 3' pp. 246-249, 1965.

Lee, Y. H. and Rosenfeld, A., "A Clustering Heuristic for Line-
Drawing Analysis." IEEE Trans. Comp. Vol. C-21, No. 8, pp. 904-911,
Aug. 1972.

McQuitty, L. L., "Hierarchical Linkage Analysis for the Isolation of
Types." Educational and Psychological Measurement, Vol. 202, No. 1,
pp. 55-67, 1960.

(47)

(48)

(49)

(50)

(51)

(52)

(53)

(54)

(SS)

(56)

(57)

(58)

(59)

(60)

(61)

100

Meisel, W. S., "Computer-Oriented Approaches to Pattern Recognition."
Academic Press, 1972.

Meisel, W. S. and Michalopaulos, D. A., "A Partitioning Algorithm
with Application in Pattern Classification and the Optimization of
Decision Trees." pp. 93-102, IEEE Trans. on Compt. Vol. C-22, No.
1, Jan. 1973.

Minsky, M., "Semantic Information Processing." MIT Press, 1967.

Morrison, D. F., "Multivariate Statistical Methods." McGraw-Hill,
1967.

Mucciardi, A. N. and Gose, E. E., "An Automatic Clustering Algorithm
and Its Properties in High Dimensional Space." IEEE Trans. Syst.
Man. Syb., Vol. SMC-2, pp. 247-254, April 1972.

Nagy, 6., "State of the Art of Pattern Recognition." Proceeding of
the IEEE, Vol. 56, No. 5, pp. 830-862, 1968.

Patric, E. A., "Fundamental of Pattern Recognition." Prentice-
Hall, Inc., 1972.

Parks, J. M., "FORTRAN IV Program for Q-Mode Cluster Analysis and
Distance Function with Printed Dendrogram." The University of
Kansas, 1970.

Rao, V. A., "A Multimodel Distribution Based Clustering Algorithm."
Ph.D. Thesis at Michigan State University, Dept. of Computer Science.

Ruspini, E. E., "A New Approach to Clustering." Infor. and Control,
No. 15, pp. 22-32, (1969).

Ruspini,-E. H., "Numerical Methods for Fuzzy Clustering." Infor.
Science, No. 2, pp. 319-350, 1970.

Sawrey, W. L., Keller, L. and Conger, J. J., "An Objective Method
of Grouping Profiles by Distance Functions and Its Relation to
Factor Analysis." Educational and Psychological Measurement, Vol.
20, No. 4, pp. 651-673, 1960.

Schweitzer, S. and Schweitzer, D. 6., "Comment on the Pearson r in
Random Numbers and Precise Functional Scale Transformation."
American Sociological Review, Vol. 36, No. 3, pp. 517-518, 1971.

Scott, A. J. and Symons, M. J., "Clustering Methods Based on Likeli-
hood Ratio Criteria." Beiometrics, Vol. 27, No. 2, pp. 387-398,
1971a. '

Scott, A. J. and Symons, M. J., "On the Edwards and Cavalli-Sforza
Method of Clustering Analysis." Biometric Vol. 27, No. 1, pp. 217-
219, 1971a.

101

(62) Shepard, R. N., "The Analysis of Proximities, Multi-dimensional
Scaling with an Unknown Distance Function." Psychometrika 27, pp.
125-139 and 219-2146, (1962).

(63) Shepard, R. N., "Representation of Structure in Similarity Data:
Problems and Prospects." Psychometrika Vol. 39, No. 4, pp. 373-
421, Dec. 1974.

(64) Shepard, M. J. and Willmontt, A. J., ”Cluster Analysis on the
Atlas Computer." Computer Journal, 11, pp. 57-62, 1968.

(65) Sneath, P. H. A., "The Application of Computer to Texonomy."
Journal of General Microbiology, Vol. 17, pp. 201-206, 1957.

(66) Sokal, R. R. and Sneath, P. H. A., "Principles of Numerical
.Taxonomy." W. H. Freeman and Company, San Francisco,71953-

(67) Sokal, R. R. and Sneath, P. H. A., "Numeric Taxonomy." W. H.
Freeman, San Francisco, 1973.

(68) Sokal, R. R. and Michener, C. D., "A Statistical Method for Evalu-
ating Systematic Relationships." University of Kansas Science
Bulletin, No. 38, pp. 1409-1438, 1958.

(69) Sokal, R. R., "Classification: Purposes, Principles, Progress,
Prospects." Science, Vol. 185, No. 4, 57, pp. 1115-1123, Sept.
1974.

(70) Stanat, D. F. and Das, S. K., "A Modified Training Procedure for
Linear Threshold Device." IEEE Trans. Com. Vol. C-21, No. 4, pp.
390-397, April 1972.

(71) Stuart, A., "The Correlation Between Variate Values and Ranks in
Samples from a Continuous Distribution." The British Journal of
Statistical Psychology, Vol. VII, Part I, pp. 37-44, 1954.

(72) Tamura, S., Higuchi, S. and Taneka, K., "Pattern Classification
Based on Fuzzy Relation." IEEE Trans. System, Man and Cypernetics,
pp. 61-66, June 1971.

(73) Uhr, L., "Pattern Recognition Learning and Thought." Prentice-Hall,
1973.

(74) Ullmann, J. R., "Pattern Recognition Techniques." Butterworth, 1973.

(75) Vinod, H. D., "Integer Programming and the Theory of Grouping."
Journal of the Ame. Stat. ASS. Vol. 64, pp. 506-519, (1969).

(76) Ward, J. H., Jr., "Heirarchical Grouping to Optimize an Objective
Function." Journal of the American Statistical Assoc., Vol. 58, pp.

236-244, 1963.

(77) Wishart, D., "An Algorithm for Hierarchical Classification."
Biometries, 25, pp. 165-170, 1969a.

(78)

(79)

(30)

(81)

102

Wo1fe, J. H., "Pattern Clustering by Multivariate Mixture Analysis."
Multivariate Behavior Research, Vol. 5, No. 3, pp. 329—350, 1970.

Zahn, C. T., "Graph-Theoretical Method for Detecting and Describing
Gestal Clusters." IEEE Trans. Computer, Vol. C-20, pp. 69-89, Jan.
1971.

Zadeh, L. A., "Fuzzy Sets." Information and Control, Vol. 8, pp.
338-353, 1965.

Berztiss, A. T., "Data Structure Theory and Practice." Academic
Press, 1975.