MSU

RETURNING MATERIALS:

 

 

Place in book drop to
remove this checkout from

 

 

LIBRARIES .

45:27:55.. your record. FINES W1“
be charged if book is
returned after the date
stamped below.

 

 

A PROBABILISTIC ASSOCIATION MEASURE

FOR PATTERN RECOGNITION

by

Xiaobo Li

A DISSERTATION
Submitted to
Michigan State University

in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1984

ABSTRACT
A PROBABILISTIC ASSOCIATION MEASURE
FOR PATTERN RECOGNITION
by

Xiaobo Li

This thesis investigates the properties of a
probabilistic association measure in four areas of pattern
recognition and image processing. It is essential to
measure the association between patterns, between features
and between partitions in both pattern recognition and
exploratory pattern analysis. Direct interpretation and
known distribution are desirable properties for an
association measure. This thesis proposes a statistic with
these characterisitcs under a permutation null hypothesis
which has been used to advantage in the literature. The

four problems attacked in this thesis are described below.

A preliminary feature analysis studies the unusualness
of feature-category relations before feature extraction,
which is just as important as cluster tendency study before
clustering. This is formally defined in this thesis for the
first time in a statistical hypothesis testing framework.
This thesis shows that the permutation statistic is
preferable to the commonly used correlation coefficient with

binary features. Both statistics have comparable power but

the threshold of the permutation statistic has a more direct
interpretation than that of the correlation coefficient.
The use of this statistic is demonstrated on questionnaire

data.

The second and third problems attacked with the
permutation statistic are the measurement of the adequacy of
binary partitions in validating clustering results and
designing tree classifiers. The relations among three
well-known measures of cluster validity are derived and the
permutation statisti is shown to be different but highly
correlated to these measures. Computational advantages make
it preferable to the other statistics. In tree classifier
design, the permutation statistic can be used as a criterion
for choosing the feature and threshold at each node in the
tree. The threshold can be set more easily than with the
mutual information criterion. Several examples on
artificial data and real data sets demonstrate that the
permutation statistic is a reasonable criterion for node

definition and leads to successful tree classifiers.

The fourth problem investigated is the general problem
of image template matching posed as a test of hypothesis.
The alternative hypothesis is that a distorted version of
the object is in the image. The null hypothesis is that the
object is not in the image. ‘We derive an approximate

likelihood ratio statistic for testing these hypotheses and

compare it to three other statistics. The optimal statistic
is most powerful but computationally intensive. A
simplified Neyman Pearson statistic is shown to be more
powerful than other suboptimal measures, yet to remain
sensitive to the true object location in the image. The
permutation statistic acts about the same as other
sub-optimal measures, such as absolute difference and

correlation coefficent.

The last chapter of the thesis reviews algorithms for
computing cumulative hypergeometric distribution functions,
which is essential to the proposed statistic. A pipeline
architecture design implementing a recursive computation
formula is proposed. This design is more efficient than
other exact computation methods by several degrees of
magnitude. It is faster. than the best approximation

algorithm for many reasonable cases.

To my father and my mother

ACKNOWLEDGEMENTS

I thank my thesis advisor, Professor Richard C. Dubes,
for his help, guidance and encouragement throughout the
preparation of this thesis. He has introduced me to careful
scientific research and precise scientific writing and he
has contributed greatly to my interest in Pattern
Recognition and my research abilities. His continuing

direction was essential to this study.

Thanks are also due to Professor Anil K. Jain and
Professor George Stockman for their helpful discussions and
guidance in the research group and my degree committee; to
Professor Raoul LePage for his important suggestions on the
theoretical work of this thesis; to Professor Carl Page for
his advice and excellent teaching of many of my courses.
Special thanks to Professor Lionel M. Ni for his brilliant

help on the hardware implementation part of this thesis.

Acknowledgement are due in addition to past and present
graduate students of the MSU PR/IP Laboratory for their
assistance: Dr. George Cross, Dr. Weichung Lin,
Dr. Steve Smith, Dr. James Coggins, Dr. Gautam Biswas,

Dr. Ardeshir Goshtasby, Almost-Dr. Rick Hoffman, Phil Nagan,

ii

and Juan Esteva. System manager Phil Nagan's maintenance of
the laboratory greatly facilitated this thesis work. A
special acknowledgement to my office mate Rick Hoffman for
his theoretical discussions and creating useful system
software which played a large role in preparation of this

thesis.
I must thank my wife Zhemin for her sharing my burden
and difficulties. Her understanding and support, the

sacrifice of my son and my family made this study possible.

The financial support of NSF grants ECS-800716 and
ECS-83002004 is also gratefully acknowledged.

iii

TABLE OF CONTENTS

LISTOF TABLES OOOOOOOOOOOOOOOOOOOOOOOO ..... 0.0.0.0.... viii

LIST OF FIGURES ........... .............................. ix

CHAPTER 1. INTRODUCTION ................... ...... ........ l
1.1 Summary ........................................... l
1.2 Background .......................... ....... ....... 5
1.3 Probabilistic Proximity Measures

and Permutation Models ...... .................. .... 8
1.4 Proposed Probabilistic Indices

for Binary vector Pairs. ..... 0.00.00.00.00000000011

CHAPTER 2. PRELIMINARY ANALYSIS OF
DICHOTOMOUS FEATURES ...... ....... ........... 14
2.1 Are Any Two Features Unusually
Similar to One Another? .......................... 16
2.2 Is Any Feature Unusually Closely
' Related to Category? ............................. 22
2.3 Approximating the Distributions
of g and SM ..................................... 24
'2.3.1 Null Distribution of § when K=2 ............. 24
2.3.2 Approximating the Null
Distribution of s when K>2 .................. 25

2.4 Power Comparison of S and C ...................... 28

iv

2.5 An Application of Preliminary

Feature Analysis ...

2.5.1 Tests for Significant Relation

between Features I.........OOOOOOOOOOOOOOOOOO

2.5.2 Verification on Original Pattern Matrix .....

2.5.3 Feature Extraction Using Feature 9

as categorical ......‘CCCOCCOOOOOO ..... ......

2.5.4 Feature Extraction Using Feature 8

as categorical ......OOOOOOOOOOOO00.0.0000...

2.6 Summary and Conclusion ..... .......... ....... .....

CHAPTER 3 ADEQUACY OF BINARY PARTITIONS I O O O O O O O C O O I O O O O O

3.1 Four External Measures of Association

for Cluster Validity ......... ....... .............
3.1.1 Definitions ....IOOOOOOOOOOOO ......... 0......
3.1.2 Relation among Measures ... ..... .............

.3.2 Binary Tree Classifier Design ....................

3.2.1 Binary Tree Classifier ......................

3.2.2 Computation of Feature Thresholds ...........

3.2.3 Efficiency in Feature Threshold Computation .

3.2.4 Numerical Examples ..........................

3.3 Summary and Conclusions ..........................

CHAPTER4TEMPLATE MATCHING O..........OOOOOOOOOOOOOOOOOO

4.1 Introduction ......

4.2 Mathematical Model

34

36
39

40

41

42

45

46

51
56

59
62
64
72

74
75

4.2.1 General Definition .................... ..... . 79

4.2.2 Statistics for Testing Ho vs. H1x .......... 83
4.2.3 Statistics for Testing H0 vs. Hl ........... 84
4.3 Comparison Study ................... ............. . 87
4.3.1 Analytical Comparison of D and G ............ 89
4.3.2 Monte Carlo Comparison of D,G and R ........ . 89
4.4 Results ..... ........ ................... . ......... 91
4.4.1 Power Study Results ...... .......... ......... 91
4.4.2 Sensitivity Study Results ................... 96
4.4.3 Formal Comparison ... ....................... 100
4.4.4 Results on Landsat Images .................. 103
4.5 Summary and Conclusions .... ....... . ........... .. 105
CHAPTER 5 COMPUTATIONAL CONS I DERATI ON S ................. l O 7
5.1 Notation . ..... . ....... .... ......... .... ......... 107
5.2 Peizer Approximation ............................ 111
5.3 Hardware Implementation ......... ................ 113
5.3.1 An Overview of the Architecture .... ........ 113

5.3.2 Computing Y(i) ............................. 116
5.4 Comparison .............. ......... ...... ...... ... 127

5.5 Summary and Conclusions . ........................ 129

CHAPTER 6 CONCLUS I ON AND FUTURE RESEARCH 0 O O O O O O O O O O O O O O 1 3 l
6.1 Conclusions ... .................................. 131
6.2 Future Research ............ ...... . ...... ........ 133

6.2.1 Extension to Multi-valued Vectors .......... 133

vi

6.2.2 Two-sided Tests for Preliminary Analysis

6.2.3 Multi-class Tree Classifiers

6.2.4 Multi-stage and Sequential

Approach to Template Matching ............

APPENDIX A.

GIVE HIGHEST S VALUES

APPENDIX B.
APPENDIX C.
APPENDIX D.

APPENDIX E.

IN MULTIVALUED VECTOR CASES

LIST OF REFERENCES

APPROXIMATION OF Sxy

vii

WHEN y#x ....
RESULTS OF t-TESTS FOR SEC.

SUMMARY OF RESULTS FOR SEC.

4

4.

THE THRESHOLDS AT THE BOUNDARIES OF RUNS

.3 ..........

AN EXAMPLE OF THE PERMUTATION STATISTIC

3.1 .........

134
134

135

136
138
141

142

143
145

2.9
2.10
3.1
3.2
3.3
3.4
3.5
3.6

LIST OF TABLES

Frequencies of Combinations of Vl(i) and V2(i) ..... 12

Means and Variances of g and § under H0l for K=2 ... 25

0.05 Thresholds for g and T' ... .................... 26
Comparing Distribution of S and T' ................. 27
Power Estimates When K=2 ........................... 31
Power Estimates When K=5 ........................... 32
Result of Power Comparison ......................... 33
Definitions of First 13 Features ................... 35
Testing H01 and H02 ..... . ............... . ......... 37
Testing H03 and H04 .... .......................... . 38
Critical Levels of Gammas ........... . .......... .... 42

Frequencies of Combinations ........................ 49

Examples of d' Ranges and g values ..... ... ......... 54
Sample Correlation between S',E(S') and Gamma ..... . 55
Ordering of Training Patterns by Feature j ........ . 60
Parameters of Artificial Data Sets .......... . ...... 66

Tree DeSign for Data Sets 00.00.000.000.0.0.00.0... 68

viii

3.7 Tree Design for Data Set 2 ......................... 68

3.8 Tree Design for Data Set 1 ............. ...... ...... 69
3.9 Tree Design for Data Set 3 ......................... 70
3.10 Tree Design for Data Set 4 ......................... 70
3.11 Tree Design for Data set 6 ......................... 71
4.1 Effect of (p,p') on Powers .... ........... . ........ . 92
4.2 Effect of (a,b) on Powers ............. ............. 93

4.3 Effect of Parameter Knowledge on Powers ............ 94

4.4 Effect of Subtemplate Size on Powers ...... ......... 94
4.5 Sensitivities when Subtemplate is Balanced ......... 96
4.6 Effect of (p,p') on Sensitivities . ................. 97
4.7 Effect of (a,b) on Sensitivities .......... ..... .... 97

4.8 Effect of Parameter Knowledge on Sensitivities ..... 98

4.9 Effect of Subtemplate Size on Sensitivities .. ...... 98
4.10 Sensitivities when Subtemplate is Balanced ... ...... 99
4.11 Power Comparison ... ............................... 101
4.12 Sensitivity Comparison .............. ...... . ...... . 102
4.13 Results on Several Landsat Images ........ . ........ 104
5.1 Observables in Vector Pair .................. ...... 108

5.2 vTime and Circuit Complexity of the Part

Computing Y(i) from X(j) ........ ..... .... ....... .. 125
5.3 Computation Time Comparison ........... ............ 128
5.4 Typical Computation Time Ranges ......... .......... 128

ix

LIST OF FIGURES

d and E(S') vs d' when L=28, nl=9 and n2=13 ........ 53

Fourteen Examples of Category-feature Vector Pair .. 64

Tree Designed with S' .............................. 67
An Example Image and Template ...................... 76
Overall Architecture for Computing H(a) ........... 114
Computing X(j) ....... ............................. 115
Compute Y(i) from X(j) when y=4 ................... 117
Compute Y(i) from X(j) when y=7 . .................. 118
Compute Y(i) from X(j) when y=13 . ................. 119
Compute Y(i) from X(j) (general design) .......... . 120
Compute Y(i) from X(j) when y=7 and x=3 ........... 122
Time Diagram of Figure 5.7 ......... ..... . ...... ... 123

CHAPTER 1

INTRODUCTION

1.1 Summary

This thesis proposes a probabilistic association measure
for binary vectors and examines its application in pattern
recognition. It is essential to measure association between
patterns, between features, between partitions of patterns
or between subimages. There exist many association measures
in the pattern recognition literature. Only Goodall's
similarity index is directly based on probability. This
Chapter reviews previous work on probabilistic measures and
the permutation hypothesis and proposes a permutation
statistic S, which is directly based on permutation

unusualness and has known distribution under randomness.

In Chapter 2, we apply S to the preliminary analysis of
binary features for the first time and state its utility.

We define several null hypotheses expressing the relation

between features and category. Rejection of these
hypotheses motivates feature extraction. We compare S to
Pearson correlation for testing these hypotheses. Since S
has known distribution under all null hypotheses, the
selection of thresholds for the tests is direct. The
selection of the threshold for Pearson correlation demands
Monte Carlo simulation. Both statistics have similar power
for detecting Bahadur type alternative hypotheses and both
act similarly in a questionnaire data analysis example.
This suggests S be used in the preliminary analysis of

binary features.

Chapter 3 applies the permutation statistic S'=:S-0.5:
as an adequacy measure for binary partitions in cluster
validity and as a criterion. for tree classifier design.
Three statistics, Rand, Hubert's Gamma and Fowlkes' B, are
compared to S' in the permutation environment. These three
statistics are linear functions of each other and thus have
the same power against any alternative hypotheses in testing
the random permutation null hypothesis. The statistic S' is
highly correlated to those statistics when the numbers of
1's in the vectors are about half the vector length. The
other three statistics have asymptotic normal distributions

under the null hypothesis while 5' has a known distribution.

That makes 5' easier to use for assessing global fit of a

binary partition to category than the other three
statistics. The statistic S' is used for the first time in
designing a tree classifier, and is compared to a mutual
information criterion. Both criteria give similar trees but
the threshold for S' is far easier to determine. Several
artificial data sets and a real data set are used in the
numerical examples. The results demonstrate that S' is a

reasonable measure in tree classifier design.

Chapter 4 investigates the image template matching
problem. We study a null hypothesis stated in the
literature and propose a new alternative hypothesis. Under
this model, we propose a statistic R which is derived
directly from the likelihood ratio statistic to measure
similarity between template and subimages. With an extra
independence assumption, this statistic is theoretically
optimal, but its computation is very time consuming. We
suggest a simplified version of the statistic, G, as a
similarity measure and compare the powers of these two
statistics by a Monte Carlo means to the commonly used
similarity measure, the absolute difference D. The
statistic R is most powerful. The 6 statistic is more
powerful than other sub-optimal measures and has the same
complexity as others. More importantly, G is more sensitive

to the. true location of the object in the image than any

others. The correlation coefficient C and the permutation
statistic S defined in Chapter 1 are also compared to the
likelihood statistics and they perform about the same, worse
than the derived measures. All these statistics are applied
to several Landsat data sets and demonstrate that R is not
as sensitive as statistic G. In summary, G strikes the best

balance between power ans sensitivity.

Since the computation of S is directly based on the
hypergeometric distribution, Chapter 5 reviews some commonly
used algorithms for computing the hypergeometric c.d.f.
Their computation times are compared and a pipeline
architecture design is proposed to implement the recursive
computation algorithm. The pipeline design is much more
efficient than other exact computation methods. It is even
faster than the Peizer approximation using single CPU when
the vector size is small or a, the number of (1,1) pairs, is
small (a < 932). This pipeline design also solved a more
general problem. In implementing a product using pipeline
functional units with x segments, we give examples of some
individual designs for different y values and one general
design for y=2,3,...,15, which are typical pipeline length
for arithmetic operators. A computation time and circuit

complexity analysis are given.

From the above applications, we conclude that the
statistic S is a good similarity measure between features in
preliminary analysis for binary features; S is a good
similarity measure between feature and category in tree
classifier design; and S is also a reasonable adequacy
measure for binary partitions. Since S has a known
distribution under null hypotheses, the threshold and
significance level are easy to determine. In image template
matching, S acts as well as other sub-optimal similarity
measures between sub-images. Finally, a hardware
architecture design is more efficient than other methods for
computing hypergeometric c.d.f.'s, which is essential to the
computation of S. The application of S and other proposed
measures to the above pattern recognition areas and the
methodology for using S are the main contributions of this

thesis.

1.2 Background

Pattern Recognition is concerned with the description
and classification of objects, which are represented by
measurements taken from realizations of physical processes.

Pattern recognition includes three distinct steps -

Representation, Abstraction, and Generalization. The input
data for pattern recognition is usually represented as a
pattern matrix, whose rows are patterns. Each pattern
consists of a series of measurements, called features, and
describes an object. In classical pattern recognition
problems, the category, or pattern class, of each training
pattern is known. A subset of the features or some function
of this subset is extracted for the representation of the
patterns. Based on this subset and category information, a
pattern classifier is designed. Feature extraction and
classifier design are the Abstraction phase of Pattern
Recognition. The Generalization phase of Pattern
Recognition involves evaluation of algorithms and

classifiers.

There are two mathematical approaches to solving pattern
recognition problems: the decision-theoretic (or
discriminant) approach and the syntactic (or structural)
approach. In the decision-theoretic approach, each pattern
is considered as a vector in a feature space and the
recognition is made by partitioning the feature space. In
the syntactic approach, each pattern is considered as a
composition of components, called sub-patterns or
primitives, and the recognition is made by parsing the

pattern according to some syntax rules.

Cluster analysis and image processing are closely
related to pattern recognition. Cluster analysis explores
the data structure with and without category information,
which can be considered as the Representation phase of
Pattern Recognition. Image processing deals with two
dimensional pictorial patterns. Many basic techniques used
for pattern recognition and image processing are very

similar in nature.

This thesis proposes an association measure for binary
vector pairs based on permutation statistics. Its
application in various areas of pattern recognition is
compared to other measures of association and both hardware
and software computational methods are developed. The
particular areas of application studied are preliminary
analysis of binary features, validity of binary partitions,
design of tree classifiers, and image template matching.
Section 1.3 reviews previous work in probabilistic proximity
measures and permutation models. Section 1.4 defines the

proposed statistic.

1.3 Probabilistic Proximity Measures and Permutation Models

A proximity (or association) measure, either a
similarity or a dissimilarity, is a symmetric mapping from
VxV to [0,00), where V is a space of vectors. A similarity
measure increases as two vectors become more alike, while a
dissimilarity measure increases as two vectors become more
unalike. For example, Euclidean distance is a dissimilarity
measure and correlation is a similarity measure.
The history of Statistics is replete with association
measures [84]. Hundreds of proximity measures have been
proposed by researchers in the biological and engineering
literature. Comprehensive reviews [l,25,26,27,28,77] and

comparative experiments have been reported [21,30,41,71].

This thesis is mainly interested in measures using
cumulative probability, not information theory, although
some similarity measures based on information theory are
also considered to be probabilistic measures [77].
Goodall's similarity index is the best example of this
type [22,23,24]. We seek a statistic which can be easily

interpreted and is easy to compute.

A population is needed to establish the significance of
a probabilistic index of association. The null hypothesis
is that all members of the population are equally likely. A
permutation population is formed from all permutations of
the components of the vectors. The vectors could be

patterns, features, subimages or partitions.

Permutations have long been used as a randomization
procedure to generate populations [44]. Statistics useful
in pattern analysis, such as the Mantel statistic
[19,33,35,37,53,54] and the B statistic [17] use
permutations to generate null hypotheses. These two indices
have known asymptotic distributions under this null
hypothesis. Formulas for exact means and variance are

available but require a significant amount of computation.

The permutation null hypothesis for a vector pair used

for much of our work is defined below.

H0: all permutations of one vector in the pair are

equally likely.

10

If this vector pair represents partitions of a pattern
set resulting from clustering, then the testing of this
hypothesis is a cluster validity analysis, which validates
the global fit of a partition. This null hypothesis is
reasonable for external criteria, as when prior structure is
being compared to the result of a cluster analysis [33].
However, it is irrelevant as an internal criterion
[18,42,621 even though it is the only hypothesis for which
analytical results are available. Specific null hypotheses,
alternative hypotheses, and appropriate tests are stated in

each area of investigation.

For some statistics, Monte Carlo simulation is needed to
obtain the critical levels, thresholds or power estimates.
Usually, we set the size of the Monte Carlo simulation. to
100 [11]. In some permutation situations, we set the size

of Monte Carlo simulation to 1000 [33].

11

1.4 Proposed Probabilistic Indices for Binary Vector Pairs

If a statistic measuring association between data items
has known distribution under some H0, its tail probability
(the probability that this statistic is less than or equal
to a threshold) is called a p-value, and is a probabilistic
index of association. This section proposes the p-value of
the number of matches between two binary vectors as such an

index. This index has known distribution under HO.

Consider a pair of binary vectors, V1 and V2, of size L,
whose entries are either 0 or 1. The number of 1's in Vi is
ni. The set of all possible permutations of V2 forms a
sample space. Table 1.1 defines the observables for these
two vectors. For example, a' is the number of positions in

which both vectors are 0 and n1 is the number of 1's in V1.

12

Table 1.1 Frequencies of Combinations

of Two Binary Vectors

V2
0 l

+ ------------- + ------------- +
0 : a' : b' : L-nl

vl + ------------- + ------------- +
1 i c' : d' : n1

4» ------------- + ------------- +

L-n2 n2 L

The random variable D' denotes the number of positions
in which the two vectors are both 1. For example,
If V1=(100110) and V2=(110010) then D'sd's2. The following
statistic measures the degree of association between two
vectors.

S(l,2) a Pr(D' < d') + Pr(D' = d') U
where U is an independent random variable with uniform
distribution over [0,1]. Probabilities are computed

under H0.

Under H0, S(V1,V2) is distributed uniformly over [0,1].
It has a clear interpretation as permutation unusualness,
i.e., the probability that a vector pair has at most that

many (l-l) matches. The explicit formula for the statistic

13

is as follows.

This statistic is used as a similarity measure between
features in preliminary feature analysis (Chapter 2), as a
similarity measure between partitions in measuring
partitional adequacy (Chapter 3), and as a similarity
measure between subimages in image template matching
(Chapter 4). The null hypotheses are rejected for extreme

values of this statistic.

CHAPTER 2

PRELIMINARY ANALYSIS OF DICHOTOMOUS FEATURES

Feature selection in pattern recognition chooses a
subset of features which best reflects a-priori category
information in some manner [78] and which are "independent"
in some sense. Feature selection algorithms select or
extract features one at a time and stop when a predefined
threshold has been exceeded. For example, the percent
variance retained in the eigenvector method of feature
extraction [78] and the recognition rate in the sequential
forward method of feature selection [86] are thresholds
which the user must specify beforehand. Few theoretical
guidelines exist for selecting these thresholds, especially
when the relationships among the variables are completely
arbitrary. This chapter proposes a methodology for
examining a data set before feature extraction to alleviate
the danger of interpreting the results of feature extraction
inappropriately. For example, we ask whether or not binary

(dichotomous) features significantly relate to each other,

14

15

and whether or not some features significantly relate to
category. If no significant relations exist, it would
appear useless or misleading to proceed with feature
extraction and discriminant analysis. The motivation for
this preliminary analysis is much the same as for clustering
tendency [13] where the pattern set is examined for

randomness.

In this chapter, we pose several questions about
dichotomous. features in a statistical hypothesis testing
framework. We define hypotheses and examine the application
of the probabilistic similarity measure S, defined in
Chapter 1, for testing those hypotheses, in comparison with

statistics based on Pearson correlation.

An important application of the proposed methodology is
to questionnaire data, which come as an NxK pattern matrix
containing only 0'3 and 1's. Rows denote patterns and
columns denote features. Some of the features may be

reserved as category information.

In Sec. 2.1, we examine the entire pattern matrix in
terms of statistical hypotheses, and define two statistics
for establishing the existence of relationships between

columns. One is based on the correlation coefficient and

16

the other is based on the similarity S. We discuss their
distributions and tests of hypotheses. Section 2.2 asks if
any feature is related to category. We briefly discuss the
distributions of two statistics. Sec. 2.3 describes
approximations to the distributions of the statistics
defined in Sec. 2.1 and 2.2, based on the similarity 8.
Section 2.4 reports a power comparison of these statistics.
The methodology is demonstrated on a data set derived from
medical questionnaires in Sec. 2.5. Finally, Sec. 2.6

summarizes the results in this chapter.

The main contribution of this chapter is the methodology
for the preliminary analysis of binary features, which is

summarized in Sec. 2.6.
2.1 Are Any Two Features Unusually Similar to One Another?

In this section, we consider the first question in the
preliminary analysis of dichotomous features stated below.
Two null hypotheses and their tests will be defined. The
distributions of test statistics will be discussed in some

detail.

17

Consider an NxK binary pattern matrix in which columns
can be considered as categorical or measured features. This
first question searches for significant similarity between
any pair of features, including those used later as
categories. Two reasonable hypotheses for describing a
state of "no relationship" are defined below. In all cases,
we regard the rows of the pattern matrix as samples of

independent random variables.

H01: The patterns are samples of K independent
Bernoulli random variables with parameters {pi}.
H02: The pattern matrix is chosen randomly from

population P2.

Population P2 is formed by independently permuting each
column in 'the pattern matrix. There are (N!)K matrices in
P2. If Ni denotes the number of 1's in column i, then the
number of distinct matrices in P2, each occuring the same

number of times, is

18

If a matrix can be realized under both hypotheses, its

probability is lower under H01 than under H An ideal

02'
statistic for testing H01 or H02 satisfies the following
criteria:

1. It has reasonable power against certain alternatives.

2. Its value is easy to compute.

3. Its distribution is known analytically, or at least can

be approximated analytically.
4. The analytical form is simple.

5. Its distribution is independent of the parameters in

the problem.

Several measures of correspondence between dichotomous
variables have been suggested in the literature [1,30,77].
In this chapter, we examine the application of the
similarity S to test H01 and H02. We define statistic §
directly based on S below. We also define statistic g based
on the commonly used Pearson correlation coefficient for

purpose of comparison.

In

a max{c(i,j)}

S = max{S(i,j)}

Here c(i,j) is the sample Pearson correlation between column

19

i and j, and S(i,j) is defined in Chapter 1.

Let Ni be the number of 1's in column i and let Nij be

the number of rows in which columns i and j are both 1.

- _ _ 1/2
(uni. NiNj)/ [NiNj(N Ni)(N Nj)]

c(i,j) J

In our case, S(i,j) takes the following form:

5(1.3) = [ > h(N,Ni,Nj,y)]'h(N,Ni,Nj,Nij)U

where the sum is for y from max(0,Ni+Nj-N) to Nij' U is a
(continuous) uniform random variable on the unit interval,
independent of all other random variables, and h is the
density function for a hypergeometric distribution, defined

below.

Note that since h(N,K,L,y) = h(N,L,K,y), including 0-0
matches as well as 1-1 matches does not alter the nature of

the statistic.

Large positive values of c(i,j) and S(i,j) indicate a
positive relation between features i and j. With the

questionnaire type applications in mind, we only consider

20

positive relations because the coding in questionnaire data
is usually fixed and (1,1) matches have stronger meaning
than (0,0) matches.

Under H02, has a hypergeometric distribution and

Nij
c(i,j) is a linear function of Nij' The distribution of
c(i,j) has a rather complicated form so the analytical forms
for the distributions of 9 under either H01 or H02 are also
complicated. The threshold of this distribution is usually
estimated by Monte Carlo means. The distribution of S(i,j)
under H02 is clearly uniform. The fact that S(i,j) is

uniformly distributed over [0,1] under H01 follows from the

following equality.

Write S(l,2) as an explicit function of random
variables N1,N2,N12 and U. The sum is over all allowable

values of (21,22).

Pr(S(Nl,N2.N12.U) g y : 801)

> Pr(S(zl,zz,N12,U) 3 y:H02) Pr(Nl=zl,N2=zz:HOI)

y > Pr(Nl=zl,N2=22:H01)

ll
"<

21

Based on the fact that S(i,j) has a uniform distribution
under H01 and H02, Sec. 2.3 derives a simple approximation
to the distribution of § under H02 for M>2, which eliminates
Monte Carlo work when choosing thresholds for tests of

hypothesis.

A test of H01 or H02 has the following form, where T is
either 9 or S .
Reject HO if T>t
where threshold t is computed as
P(T>t:HO) = a
and a is a specified level, such as 0.05. Alternatively, we
*
compute the critical level a
' *
P(T>t*,HO) = a .
where t* is the observed value of the statistic and reject

the null hypothesis if a* is small, say 0.1 or less. -

The distribution of 9 does not have a simple analytical
form, and depends on the parameters {pi} under H01 and on
the original matrix under H02. The distribution of S(i,j)
has a simple analytical form, regardless of parameters,
under both H01 and H02. An approximation for the

distribution of S will be developed. Therefore, according

22

to Criteria 2 through 5, S is prefered over 9 . The
comparison according to Criterion l is discussed in

Sec. 2.4.
2.2 Is Any Feature Unusually Closely Related to Category?

Here we designate one feature, say feature M, as
categorical and ask whether any one of the remaining
features is unusually similar to feature M. Two null

hypotheses of interest are defined below.

H All N! permutations of feature M are equally

03‘
likely.

304: All matrices in population P4 are equally likely.

Population P4 is created in the same way as population
P2 except that feature M is not permuted. Two test

statistics based on our indices of proximity are:

CM = max{c(i,M),i#M}

S a max{S(i,M),i#M}

M

The forms of tests of these hypotheses are as in
Sec. 2.1. A test of H03 asks whether it is likely that the

observed statistic could have been produced were the

23

category labels inserted at random. Accepting H03 implies
that no feature is unusually "close” to the categorical
feature. The distribution of CM under H03 must be obtained
by Monte Carlo means, but a procedure discussed in

Sec. 2.3.3 can approximate the distribution of SM under H03.

All columns are permuted independently so the

distributions of CM and SM are known under H04.
P(SM>t: H04) = 1 — c(x'l)
P(CM>t: H04) = 1 - i- P[ c(i,M)§t: H04]
Em
where

P[c(i,M)=ti: H04] : h(N,Ni,NM.ti)

- - - 1/2
and ti— [NiNj+t[NiNj(N Ni)(N Nj] )]/N
Under H03, the distribution of 5M can be approximated
analytically, while the distribution of CM requires Monte
Carlo simulation. Under H04, both statistics have known

distributions, but that of S is uniform. This fact

M
suggests that SM is better than CM under Criteria 3 through

5.

24

2.3 Approximating the Distributions of S and SM

We first discuss the distribution of S when K=2. Then
we view S as the maximum of dependent uniform random
variables and try to approximate its distribution. The

distribution of SM is then considered.

2.3.1 Null Distributions of S for K=2

Assume K=2 so that S is simply the similarity measure
between two binary vectors; S is uniform under
H01 regardless of N, p1 and p2. Table 2.1 demonstrates
empirical means and variances for S and S . Each row in
Table 2.1 corresponds to a parameter set (N,M,pl,p2). One

hundred Monte Carlo samples were used for each row.

25

Table 2.1. Means and Variances of S and S under H01 for

20 .10 .10 -.02808 .03129 .51732 .08919
20 .10 .50 -.01565 .04664 .53185 .08632
20 .10 .90 .01468 .04015 .48699 .08465
20 .50 .50 .02982 .05273 .48520 .07934
20 .50 .90 .00577 .03625 .54219 .06516
20 .90 .90 .01331 .03786 .54297 .08038
200 .10 .10 .00578 .04061 .45545 .07565
200 .10 .50 .05577 .08443 .47903 .07233
200 .10 .90 .00137 .04455 .49546 .09711
200 .50 .50 -.07583 .09182 .47418 .08335
200 .50 .90 -.05822 .23322 .50266 .08883
200 .90 .90 .00896 .04410 .52179 .07967

The theoretical mean and variance of S under H01 are

0.5 and 0.0833. The empirical means and variances of S in
Table 2.1 are stable and close to theoretical values,
independent of N, K, p1 and p2. The distribution of S
varies with these parameters. This fact favors S over S

with respect to Criterion 5.
2.3.2 Approximating the Null Distribution of S when K>2

Statistic S is a maximum of k=K(K-l)/2 dependent U(0,l)

random variables under H01. A one-sided test of H01 is

Reject H01 if S > t+.

26

The dependences among {S(i,j)} require that t+ be estimated
from an approximate null distribution for S. Since the
distribution of the maximum of k independent random
variables is well known, we consider the possiblility of

approximating t from this distribution. Let T' have this

4.

distribution, whose c.d.f. is

.k

Pr(T' 3 t' : H01 ) = t if 0<t'<l.

The threshold t' from this distribution at size a is

for S

+ _

t'=(l-a)l/k. Table 2.2 gives the thresholds t
generated by 1000 Monte Carlo runs per entry (pi=0.5 for all
i), and the thresholds t' computed analytically for a few
values of K. The right-most column in Table 2.2 gives the
position of t' among 1000 Monte Carlo values, which is an

estimate of the size for the test using t'.

Table 2.2 0.05 Thresholds for S and T'

N K t+ t' size for t'
+ ----- + ------- + ---------- + ---------- + -------- +
: : 3 : .98045 : .98305 : .045 :
| 50 | + ---------- + ---------- + -------- +
E E 5 : .99650 : .99659 : .049 :
+ ----- + ------- + ---------- + ---------- + -------- +
: : 6 : .99535 : .99659 E .048 :
| 100 I + ---------- + ---------- + -------- +
E E 9 : .99845 : .99858 : .045 :
+ ----- + ------- + ---------- + ---------- + -------- +
: : 9 : .99827 : .99858 : .042 :
| 200 ' + ---------- + ---------- + -------- +
E E 13 : .99924 : .99934 : .044 :
+ ----- + ——————— + ---------- + ---------- + -------- +

27

The values of t' are close to those of t Thus using

+0
t' as an approximate threshold will give similar size for

the test.

In order to more formally compare the distributions of S
and T', Consider the 12 cases listed in Table 2.3. One
hundred matrices were generated for each case and the
distributions of S and T' were compared. The results of the
Kolmogorov-Smirnov test of equality between distributions
[7] are listed below. In all 12 cases, the hypothesis that
T' is the same as S is accepted at level 0.05. This
suggests that the threshold on T' is indeed a good

approximation to the threshold on S for those cases.

Table 2.3 Comparing Distributions of S and T'

N
O
mmmmmmmmmmmm
O O O O O O O
\DU‘IU‘HD—‘Hmmmi-‘HH
O O O O O O O C O O O O

\OKOU‘IUDUIHKDKDUIOUIH
O
O
u
w

.061

28

Statistic SM is a maximum of (K-l) dependent U(0,l)
random variables under H03. When K>2, our results suggest
that the distribution of T' approximates the distribution of
SM’ This point is also demonstrated in the application

described in Sec. 2.5.

2.4 Power Comparison of S and Q

The purpose of using S and S is to detect relations
among features. We now examine the powers of these
statistics under a class of alternatives where the relations
among features are governed by a Bahadur model [4], in which
the feature relations are specified in terms of the

correlation coefficient. In this section, we consider the

following alternative hypothesis.

H11: The given matrix is generated with Bahadur
parameter set (p1,...,pK and w) with w>0;

pi is the Bernoulli parameter for column 1.

Power studies are conducted for the case when pi under
301 is the same as the pi under H11 for any i, so that the

hypotheses differ only in relation among features.

29

The test of H01 relative to H11 based on S is:

Reject H01 If S > ts.

The test relative to H11 based on S is:

Reject H01 if S > tc.

The general procedure for estimating power of a test
based on S is as follows [16]. One hundred matrices are
generated under H01 with given p's and the sixth largest
value is used as the 0.05 level tc' Then, 100 matrices are
generated under H11 with given w # 0. Our study involves
the four factors listed below with levels for each factor
indicated.

(p1,p2) = (.2,.2),(.2,.5),(.2,.8),(.5,.5),(.5,.8),(.8,.8):

pi = 0.5 for 3 g i 3 K;

w = 0.1, 0.2, 0.4; '

N a 20, 60, 200;

K = 2, 5.
All combinations of levels are used except the impossible
case (pl, p2, w )=(.2,.8,.4). The power of the test based
on S is estimated by the number of matrices generated under
H11 with S value greater than to' The power of the test

based on S is estimated similarily, except that the

30

threshold ts is approximated analytically (Sec. 2.3.2).

For each combination of parameters, we have two
integers, the estimated power (P"(g)) of the correlation
statistic and the estimated power (P"(S)) of S. Let P(S)
and P(S) denote the true powers. Then p"(g) has a binomial
distribution B(100, P(§)) and P"(§) has a 8(100, P(S))
distribution. We report P”(Q) and P”(S) in Table 2.4 for
K82 and in Table 2.5 for K=5. The symbol "+" at the upper
right corner indicates that P(§) > P(Q) at confidence level
0.95. The symbol "-" means P(§) < 9(9) at confidence level

0.95.

31

Table 2.4 Power Estimates when K=2

(p1,p2)=(.2,.2) (.2,.5) <.2,.a> (.5,.5) (.5,.a> <.s,.a)

W

I

N

I10 I
I 12I

13

11

l
I
7 I

8
+-------+-----——+-------+-------+-----—-+-------+

I 12
I
I

1

0

20,

I17 I
I 25I

26

E 21

I

I
+-------+----—--+—------+-------+---—---+-------+

I 30
l
I

21

:25
I
I

2

0

20,

41

I57 I
51:

56

65

54

48

5
+-------+-------+-------+-------+-------+-------+

5

46

42

0.4 I

20,

I 14
I
I

22

22

I
15E

15

I' 35 I
I 43 I

49

46

I
45E

38

89

I 86
I
I

95

93

I
94E

:94
I

I

I

93 I
+--—----+—-----—+———----+----—--+-------+------—+

E 30
I
I

36

-E 55 I
36 I I
+-------+------—+-—-----+--—----+-------+—------+

:54
I
I

0

200,

81
100

{100

+: 74
84 E
100

{100

100 I

{100

5
100

9
{100

I
I
91 E

100

96
I

{100

0.2
0.4

200,
200,

+---------+

+---------+

32

Table 2.5 Power Estimates when K=5

( p1, p2)=(.2,.2) (.2,.5) (.2,.8) (.5,.5) (.5,.8) (.8,.8)

+----—--+-------+-------+-------+-----—-+-------+

W

N.
20,

7

I 8
I
I

9

4

9

0.1

IIII + IIII

7

8

+
l
l

4

1
l

9

+
9

3

3
l

0
l

5
l

3
l

.
.
.
_
.
.
.
+
.
.
.
.
.
.
.
+
.
.
.
.
.
.
.
+
_
.
.
_
.
.
_
+
.
.
.
.
.
_
.
...
.
.
.
.
.
.

+E 12

E 4

8

+

5 : 5
6 I

I
7 I

6

22

13

9

12

IIII + IIII + IIII

2

0

'
0
2

4.
I
0

'
0
2

l

0

0
6

III! + III... + IIII

+
.
.
.
_
.
.
.

+
_
.
.
_
.
.
.

+
.
.
.
.
.
.
.

+
.
.
.
.
.
.
.

+
.
.
_
.
.
.
.

+
.
.
.
.
.
.

Illl + I!!!

.
5
4

0
6

2

200,

.
_
.
.
.
.
+
.
.
.
.
.
.

+
.
0..
0..
.
nu .
0 .
1. .
IIII+
.
nu.
nu.
1..
nu .
nu .
1. .
IIII+
.
nu.
nu.
1..
nu .
nu .
1. .
IIII+
.
.
.
.
.
_
.
IIII+
.
nu.
nu.
1..
nu .
nu .
1. .
IIII+
.
nu.
nu.
1..
nu .
nu .
1. .
+

4

O

o

0

0

2

+---------+

+---------+

Each

2.6.

Table 2.4 and 2.5 are summarized into Table

box in Table 2.6 corresponds to a combination of K, N and w.

the

is

upper left corner

the number at the

In each box,

of p' combinations (columns in Table 2.4 or 2.5) for

number

33
which P(§) > P(S) (minus signs) at level 0.95. The number
at the lower right corner is the number of cases for which

P(Q) < P(S) (plus signs).

Table 2.6 Result of Power Comparison

4» --------------------- + --------------------- +
K I 2 I 5 I

+ ------ + ------ + ------- + ------ + ------ + ------- +

N E 20 E 50 E 200 E 20 E 50 E 200 E

w ==83888::=22:====8======3838833883883883883a:
I 0 I 1 I 2 I 0 -I 0 I 0 I

0.1 E 1 E 0 E 1 E 1 E 3 E 0 E
+ ------ + ------ + ------- + ------ + ------ + ------- +

I 0 I 1 I 0 I 0 I 0 I 2 I

0.2 E 0 E 0 E 1 E 2 E 2 E 0 E
+ ------ + ------ + ------- + ------ + ------ + ------- +

I 2 I 0 I 0 I l I 0 I 0 I

0.4 E 0 E 0 E 0 E o E 1 E 0 E
+ ------ + ------ + ------- + ------ + ------ + ------- «I-

Table 2.6 indicates that, among the 13 combinations of
(K,N,w), for some (pl,p2) values P(E)I and P(§) are
significantly different. Seven combinations of (K,N,w)
contain more cases for which P(S) > P(Q) , while in six
combinations the reverse is true. We conclude that S is as
powerful as S under the condition of this experiment, even
though the structure of the data under all was specified in
terms of correlation. Although the powers of SM and CM for
testing H03 and H04 were not compared experimentally, we

expect them to be related in the same way.

34

The analytical approximations to the distributions for S
and SM are independent of parameters. In addition, they
have the same power as C and CM against certain

alternatives. We conclude that S and SM are prefered over S

and CM.

2.5 An Application of Preliminary Feature Analysis

As a detailed application of our methodology, we studied
a set of questionnaire responses completed over the past
several years by female patients of a medical doctor
regarding self breast examination. We randomly chose 145
questionnaires for this study, no two from the same
individual. We then reduced the data to 26 features. Table
2.7 defines the meaning of a "1" value for the first 13
features. A "1" value for features 14 through 26 denote the
occurrence of cancer in a relative, such as a father,

mother, or aunt.

35

Table 2.7 Definitions of First 13 Features

Feature Meaning of ”1" Value
1 Menstrual period has stopped
2 Operative menapause
3 Pregnant at least once
4 At least one miscarriage
5 Have used female hormones
6 Currently use female hormones
7 Positive result on pap smear
8 Perform semi-annual self breast exam
9 Perform monthly self breast exam

10 Have had mammograms

11 Use contraceptive techniques
12 Have had pelvic surgery

13 Have had breast surgery

The ultimate goal was to discover any factors that
explained why self breast examination was performed by some
women but not by others. Features 8 and 9 in Table 2.7 will
be taken as category variables. This is a typical feature
extraction problem in which a subset of features that
predict category well is to be found. Some clustering
algorithms were used to cluster features [40] and the
resulting dendrograms suggested random clusters. A
preliminary feature analysis should indicate whether any

non-random relation exists among features.

Our experiments proceeded as follows.
1. Test H01 using S and S .
2. Test H02 u51ng S .

3. Test H03 and H04 using CM and SM for two categorical

36

features.

4. Verify the results by a Mantel test [53] on original
pattern matrix.

5. Perform feature extraction for any non-random
categories.

6. Verify the results by Mantel tests on selected features.

7. Perform feature extraction for any categorical feature
for which either H03 or H04 is accepted.

8. Verify the results by a Mantel test on selected

features.

The details are explained in the following sections.

2.5.1 Tests for Significant Relations between Features

First, we test H01 and H02 using statistics S and S.
Critical levels for S were estimated from 1000 Monte Carlo
trials. Random matrices were generated under H01 using Ni/N
from the original pattern matrix to estimate pi and S was
computed for each matrix. The critical level for S was
approximated by t' as explained in Sec. 2.3 and also
estimated from 1000 Monte Carlo trials. Table 2.8
summarizes the results. The approximate critical level for

S using t' is the same as the Monte Carlo result in this

case. We reject H01 and H02 and conclude that the

37

questionnaire data merit further study.

Table 2.8 Testing H01 and H02

Hypothesis Stat, Value Cr. Level Method
*
a
H01 S , 0.4528 0.001 1000 Monte Carlo Trials
H02 S , 0.4528 0.002 1000 Monte Carlo Trials
H02 S , 1.0000 0.00016 Approximation or 1000
MC Trials

We feel justified to proceed with testing H03 and H04.
We designate features 8 and 9 as categories separately, and
we ask whether any one of the remaining features is
unusually similar to these categories, based on statistics
CM and SM' The critical level of CM under H03 was estimated
from 1000 Monte Carlo trials, while the critical level of
SM was obtained by both Monte Carlo means and analytical
approximation. The results reported in Table 2.9 indicate
that some feature is close to category feature 9 but no
feature is unusually similar to category feature 8. Testing
H04 produces the same result as testing H03 so only feature

9 is used as a category in further study.

38

Test results using the two statistics

agree with each

other. The approximate thresholds for $8 and $9 produce the

same results as that derived by Monte

results support conclusions drawn ealier.

Table 2.9 Testing H03 and H04

Hypothesis Categorical Statistic

feature value
H03 8 C8 0.1172
H03 9 C9 0.2463
H03 8 S8 0.9156
H03 8 58 0.9156
H03 9 59 0.9984
H03 9 89 0.9984
H04 8 C8 0.1172
H04 8 S8 0.9156
H04 9 C9 0.2463
H04 9 59 0.9984

Cr.

0.835
0.034
0.880
0.824
0.039
0.025
0.379
0.379
0.005
0.005

Carlo

Level

means. These

Method

1000 MC Trials
1000 MC Trials
Approximation

1000 MC Trials
Approximation

1000 MC Trials
Exact

Exact

Exact

Exact

39

2.5.2 Verification on Original Pattern Matrix

We now ask whether the grouping of patterns defined by a
particular categorical feature could have been defined in a
purely random manner. We measure similarity between
patterns i and j by the Jaccard coefficient [1], J(i,j).
The reason for using the Jaccard coefficient is that (1,1)
matches are more important than (0,0) matches in
questionnaire data. If nll is the number of (l-l) matches
in rows i and j of the pattern matrix and n00 is the number
of (0-0) matches, then

J(i.j)=nll/(M-l-noo).

A category matrix is defined by:

B(i,j) 1 if rows i and j have the same value for

category feature M

0 else.

The test statistic, denoted by G is the point

amma'
serial correlation between matrices [J(i,j)] and [B(i,j)].
The baseline is defined by H03. A test of hypothesis based
on Gamma 15 called a Mantel test [53,54] 1n other
applications and tests whether the increases and decreases

observed in the two matrices are unusually similar. The

40

distribution of Gamma was estimated by Monte Carlo means and
the critical level was found to be 0.049 for the
questionnaire data, using feature 9 as category. This
suggests that H03 be rejected, which implies that the set of

features is significantly related to category feature 9.

2.5.3 Feature Extraction Using Feature 9 as Categorical

Previous tests suggest that individual features and the
entire matrix are unusually similar to categorical feature
9. Thus, we feel justified in seeking a subset of features
which "explains" categorical feature 9 and which can lead to

an efficient design for future questionnaires.

We chose a sequential forward stepwise selection method
for study [86]. The best individual feature is chosen
first. The best pair containing the best individual feature
is then found, and the best triple containing the best pair
is identified. This process is continued until a suitable
level of performance is observed. Although ‘only an
exhaustive search of all subsets ensures optimality with
binary features [8,9,14,15], the stepwise procedure is
computationally attractive and is recommended in the
literature [58,86]. The criterion used to compare two

subsets of features is recognition rate under the

41

leave-one-out method. The five features selected, in order
of selection, were features 10, 4, 6, 7, and 16. Feature 16

records whether a brother had cancer.

To verify these results, we repeated the Mantel test
described above using only the five features selected from
the questionnaire data. The critical level of the
Gamma statistic was 0.003, which suggests that the patterns
cluster by category unusually well and that the five

features reflect category better than the set of all

features.
2.5.4 Feature Extraction Using Feature 8 as Categorical

We carried out feature extraction and Mantel test using
feature 8 as category. The critical level of Gamma for the
entire matrix is 0.653, which suggests accepting H03.
Feature 1 had recognition rate 0.97, no lower than any other
feature or any feature subset selected by our stepwise
procedure. The critical level of Gamma using Feature 1
alone is 0.093. These results are summarized in Table 2.10.

42

Table 2.10 Critical Levels of Gammas

use all features use selected features

category + ------------------ + ----------------------- +
8 E 0.653 E 0.093 E

+ ------------------ + ----------------------- +

9 E 0.049 E 0.003 E

+ ------------------ + ----------------------- +

Table 2.10 indicates that feature 8 is not a valid
categorical feature. It might be caused by improper design
of questionnaires. Its high recognition rate is due to the
fact that there are only four 1's in column 8. Blindly
performing feature selection and trusting in the recognition

rate alone is misleading.

2.6 Summary and Conclusions

This chapter discussed the problem of preliminary
analysis of dichotomous features. Four null hypotheses are
stated to describe randomness in feature relationships. The
necessity of this preliminary analysis is examined for the
first time and is demonstrated with questionnaire data where
much irrelevant information is usually involved. Rejecting
these null hypotheses give us confidence in performing
feature extraction. Accepting one of these null hypotheses
prevents us from useless work and misleading results in

feature extraction. This point is clear in the application

43

where preliminary analysis shows that feature 8 is not a
valid categorical feature, although the feature extraction

could give high recognition rate.

In testing H01 to H03, statistics based on the
correlation coefficient require Monte Carlo simulation to
estimate thresholds. The statistics based on similarity S
can be approximated analytically. Under H04, both
statistics, CM and SM' have known distribution, while SM is
distributed uniformly. For detecting Bahadur type relations
between features, S and SM have reasonable power. These
facts suggest that S and SM are better than S and CM in

preliminary feature analysis.

The Mantel test can also prevent us from accepting a
misleading result, but estimating the critical level of
Gamma requires either complicated asymptotic normal
approximations [33,34] or Monte Carlo simulation. The test
of'I-i03 or H04 using SM does not require Monte Carlo work and
gives the same conclusion. This fact suggests that SM is
more suitable than Gamma for exploratory type analysis which
requires speed and simplicity.

44

Tests of H01 and H02 using the two types of statistics
give similar results for questionnaire data; tests of
H03 and H04 also provide similar results. Since the
thresholds of S and SM are easier to compute in the cases
of H02 and H04 than those for S and CM' we may choose to
limit ourselves to testing H02 and H04. In this case, the

following simplified methodology is proposed.

(1) Test H02 on the original pattern matrix using S.

(2) If H02 is accepted, stop and try to gather more data.

(3) If H02 is rejected, choose categorical feature M.

(4) Test H04 on the pattern matrix with categorical
feature M eliminated, using SM'

(5) If H04 is accepted, stop and try other categorical
features or gather more data.

(6) If H04 is rejected, go on with feature extraction.

CHAPTER 3

ADEQUACY OF BINARY PARTITIONS

This chapter studies procedures to assess the
significance of a binary partition of a pattern set. For
example, in cluster analysis one must verify structures
resulting from clustering methods. Verification procedures
assess the global fit of a hierarchy, the global fit of a
partition, and the isolation and compactness of individual
clusters. A sequence of partition fits, i.e., a sequence of
partitional adequacy measures, can provide a basis for
assessing global fit of a hierarchy [17]. We examine the
application of the S statistic defined in Chapter 1 to two
problems. One is the use of S as an external measure for
binary partitions in cluster analysis. The second
application is the design of a binary tree pattern
classifier, where S measures the correspondence of a feature

vector to category.

45

46

Section 3.1 compares S, under the permutation hypothesis
defined in Chapter 1, to three other commonly used measures
in a cluster validity framework. The relations among them
are derived and their powers and computational costs are
compared. Section 3.2 uses S in binary tree classifier
design in comparison with the mutual information criterion

[74]. A brief conclusion is given in Sec. 3.3.

3.1 Four External Measures of Association for Cluster

Validity

A clustering algorithm begins with a measure of
proximity. between all pairs of objects in a set and induces
a clustering, or partition, of the objects in (which
"dissimilar" objects are placed in separate clusters.
Clustering algorithms will impose partitions even on totally
random data, so a clustering must be validated to establish
its "adequacy”, or its degree of unusualness, when compared

to a hypothesis of randomness [13].

Two types of measures of partitional adequacy are used
in cluster analysis: external and internal [33,56]. An

external criterion validates a partition with respect- to a

47

source of information that is independent from that used to
form this partition. Permutation statistics provide clear
external criteria of partitional adequacy. For example, if
distances between patterns were used to establish a
partition, then category labels can be the second source,
and the partitional adequacy measure answers the question:
Do the patterns cluster by category? A prior conjecture,
the result of another clustering algorithm, or some
independent proximities could also serve to define the
second partition. An internal criterion assesses the fit of
a partition using only the proximity data from which the
partition was generated. Several external criteria for

partitional adequacy have been proposed [13,56].

The problem of partitional adequacy has been formulated
under two hypotheses of randomness. The Random Graph
Hypothesis [3] assumes that all possible rank order
proximity matrices are equally likely. The permutation
hypothesis [31] assumes that only those proximity matrices
corresponding to a relabeling of patterns are equally
likely. The permutation hypothesis provides a good baseline
for assessing statistical significance [32,36,37]. In this
section, we adopt the permutation hypothesis as the null
hypothesis and compare S to three related measures, the

Rand, Gamma and B statistics 1n terms of power and

48

computation time. Section 3.1.1 gives the definitions of
the four statistics and Sec. 3.1.2 discusses the relation
among them. Our goal is to examine the effectiveness of the
statistic S'=ES-0.5E in measuring the adequacy of binary

partitions.
3.1.1 Definitions

Binary partitions are the basis for hierarchical
clustering [1] and for binary tree classifier design. If we
consider a hierarchy resulting from some hierarchical
clustering method as a sequence of binary splits, then a
binary partitional adequacy measure can be use to validate
one split at a time and indicate from which point the
splitting is random. In this section, we study four
measures for the adequacy of binary partitions. These
statistics assess the unusualness of a particular pair of
partitions by testing the permutation hypothesis
H0 (Chapter 1). All have unimodal null distributions so a
threshold can be selected under the null distribution for

defining the significance of association.

Consider two independent binary partitions of a set of L
patterns. Each pattern in each partition is coded as 0 or

1. The coding is presented as two binary L-vectors,

49
Vl=[V1(i)] and V2=[V2(i)]. Indicator matrices X and Y are
defined as follows.

X(i,j)=1 if v1(i)=v1(j),

=0 otherwise;

Y(i,j)=l if V2(i)=v2(j),

=0 otherwise.

The four cells in Table 3.1 contain all the information

about the similarity between x and Y; K=L(L-l)/2.

Table 3.1 Frequencies of Combinations

Y
0 1
+ ------------- + ------------- +
0 E a E b E K-Nl
x + ------------- + ------------- +,
1E C E d E N1
+ ------------- + ------------- +
K-N2 N2 K

For example,

a .. E(i,j) : i<j, (x(i,j)=0)ﬂ(Y(i.j)=0)I-

The Rand statistic R [67], also called the simple
matching coefficient, is the most commonly used measure of
association between X and Y. It is also a basis of

comparison for permutation statistics [17].

50

where p = Z/K and q = (K-Nl-N2)/K

The Gamma statistic is the point serial correlation
between the two indicator matrices and is used in Mantel

tests [31,53,54].

d - [(c+d)(b+d)/N]

‘JN1N2(N-Nl)(N-N2)

(Fowlkes and Mallows [17] recently proposed the B

statistic as an external criterion for clustering.

' (c+d)(a+b)

- - 05
where z - (N1(K Nl))

51

The distributions for the three statistics above do not
have known analytical form under H0 and must be estimated by
Monte Carlo means or approximated by normal distributions.
The fourth statistic is

S' = ES(1,2)-0.5E where S(i,j) is defined in Chapter 1.

The statistics R, B and Gamma can also be used as
external criteria for partitions with an arbitrary number of

components.
3.1.2 Relation Among Measures

We are interested in whether the four statistics 5',

R, G and B are essentially the same in practice. First,

amma
we establish algebraic relations among them. We then study
the sample correlation between S' and Gamma and compare

their powers.

Under the permutation model and for fixed values
of N1 and N2, the three StatlStICS, R, Gamma and B are all
linear functions of d. Therefore, they are linear funCtions
of each other. In order to study the relation between S'

and the other three statistics, we need only consider the

52

Since G is a linear

relation between S and Gamma' amma

function of d, it is sufficient to study the relation
between S' and d. Under the permutation model,
d'=Eii: V1(i)V2(i)=l}E is the only independent variable. We

investigate how d and S' vary with d'.

The S statistic is defined in Chapter 1 in terms of
L, n1, n2 and (a',b',c',d'). When L, nl and n2 are fixed,
d' takes values in the fixed range: [nl+n2-L, min(nl,n2)].
In this entire range, it is clear that E(S) under H0 is a
monotone increasing function of d'. Therefore, since S has
uniform distribution under H0, E(S')=EE(S)-0.5E decreases
first, reaches a minimum then increases as d' varies over

this range. The quantity d is the following function of d'.

a' b'\ c' d' 2
d = )+ + + = 2d' + (L-2n1-2n2)d' + f
2 2] 2 2

where f = nl +n2 +n

This function reaches its minimum when d'= g =

(2n1+2n2-L)/4. If nl+n2-L > g, d and Gamma are

monotonically increasing functions of d' over the entire
range of d'. Thus the non-monotone function 5' does not

measure the same thing as G If nl+n2-L < g, such as in

amma‘

53

the case nl=n2=L/2, both d and S' are non-monotone functions
of d'. Figure 3.1 demonstrates how d and E(S') vary as

functions of d' under HO when L=20, nl=9 and n2=13.

d E(S'I
80 ‘ .5 56’)
78‘J
d

60 ‘ .3
50‘3
40 ‘ .I

d,

 

 

Fig.3.l d and E(S') vs d' when L=28, n1=9 and n2=13
Table 3.2 gives some examples of the ranges of d' and g
values. The cases marked with a star mark exhibit the

general behaviour of Figure 3.1.

54

Table 3.2 Examples of d' Ranges and 9 Values

L n1 n2 range of d' g
10 3 3 ( 0, 3 ) 0.5 *
10 3 5 ( 0, 3 ) 1.5 *
10 3 6 ( 0, 3 ) 2.0 *
10 5 5 ( 0, 5 ) 2.5 *
10 5 6 ( l, 5 ) 3.0 *
10 6 6 ( 2, 6 ) 3.5 *
20 5 5 ( 0, 5 ) 0.0
20 5 10 ( 0, 5 ) 2.5 *
20 5 13 ( 0, 5 ) 4.0 *
20 10 10 ( 0, 10 ) 5.0 *
20 10 13 ( 3, 10 ) 6.5 *
20 13 13 ( 6, 13 ) 8.0 *
40 10 10 ( 0, 10 ) 0.0
40 10 20 ( 0, 10 ) 5.0 *
40 10 25 ( 0, 10 ) 7.5 *
40 20 20 ( 0, 20 ) 10.0 *
40 20 25 ( 5, 20 ) 12.5 *
40 25 25 ( 10, 25 ) 15.0 *
80 20 20 ( 0, 20 ) 0.0
80 20 40 ( 0, 20 ) 10.0 *
80 20 50 ( 0, 20 ) 15.0 *
80 40 40 ( 0, 40 ) 20.0 *
80 40 50 ( 10, 40 ) 25. *
80 50 50 ( 20, 50 ) 30.0 *
200 50 50 ( 0, 50 ) 0.0
200 50 100 ( 0, 50 ) 25.0 *
200 50 130 ( 0, 50 ) 40.0 *
200 100 100 ( 0, 100 ) 50.0 *
200 100 130 ( 30, 100 ) 65.0 *
200 130 130 ( 60, 130 ) 80.0 *

The quantity 9 is inside the range of d'
when nl and n2 are around L/2. These cases exhibit the
behavior in Figure 3.1. To further study the relation
between S' and Gamma' we examine the sample correlation for
the special cases when n1=n2=L/2, which is representative of

the cases with star marks in Table 3.2. We generated vector

55

pairs of different lengths. For each permutation of one

vector, we compute S' and Gamma which is converted to its

z-score [33,34]. The correlation coefficients are shown in

Table 3.3, which suggests that S' and Gamma are highly

correlated under H0 when the number of 0's and 1's are

equal.
Table 3.3 Sample Correlation between
S ,E(S ) and z-score of Gamma

+ ------- + -------- + ---------- + --------------- + ---------------
E no.of E vector E Pearson E Kendall's Tau E Kendall's Tau
E perms E size Ecorr(S',z)E corr(S',z) E corr(E(S'),z)
+ ------- + -------- + ---------- + --------------- + ---------------
E 1000 E 30 E 0.873 E 0.900 E 1.000

E 1000 E 60 E 0.901 E 0.962 E 1.000

E 1000 E 90 E 0.913 E 0.969 E 1.000

E 10000 E 120 E 0.912 E 0.980 E 1.000

+ ------- + -------- + ---------- + --------------- + ---------------

The sample space consists of all permutations of one
vector. Under any alternative a non-uniform probability
assignment is made to these permutations. The fact that the
two statistics have correlation coefficient 1 under H0 thus
implies they have identical power under any alternative.

Since R, B and G are linear functions of each other, we

amma
conclude that they have the same power against any

alternative hypothesis. When nl=n2=L/2, S' is highly
correlated to R, Gamma and B. Since the rank order
correlation between S' and G is above 0.9, we expect

amma

+

+---—— +--—

56

that they have compatible power against any alternatives.
Since the threshold for S’ can be determined in a natural
fashion and it is comparable to other statistics, 5' is a

good choice for assessing partitional adequacy.

3.2 Binary Tree Classifier Design

In this section, we apply 3' to the design of a binary
tree classifier. Section 3.2.1 briefly discusses the
concept of tree classifier, some existing techniques, and
where this study fits. Section 3.2.2 describes the
computation of feature thresholds from S. Section 3.2.3
considers the efficiency of computing feature thresholds.

Some numerical examples are shown in Sec. 3.2.4.

3.2.1 Binary Tree Classifiers

Binary tree classifiers have been used in many pattern
recognition problems [50,51,59,72,73,75,85,90]. Each node
of the tree corresponds to a subset of features. A pattern
to be classified is passed through the tree from the root to
a leaf. At each non-terminal node, based on the subset of
feature values corresponding to that node, the pattern is
sent to one of two descendent nodes. Every terminal node,

or leaf, is labeled by category. The unknown pattern

57

eventually reaches a leaf and is assigned the corresponding

category label.

A tree classifier is designed from training patterns.
At each non-terminal node, a subset of features is selected
based on the training patterns available at the node. The
training pattern set is divided into two disjoint subsets
according to these features for use at successor nodes. A
terminal node is labeled by the category of the majority of
training patterns remaining. Since the computational cost
in classification is roughly proportional to the square of
the number of features used [81], we follow the literature
and use one feature per node. Thus, descendent nodes are
chosen by comparing the value of the feature corresponding

to that node to a predefined threshold.

The advantages of a tree classification rule versus a
single stage classification rule are in computational
efficiency, use of features, avoidance of the "curse of
dimensionality" and ease in human interpretation
[2,38,50,59,65]. One disadvantage is that the design of a
binary tree classifier often requires a large amount of
computing time and storage [75], especially when a fully

optimal tree is desired [65].

58

The design of a binary tree classifier with one feature
per node consists of two components [46,49,75]: (1) Defining
the structure of the tree; (2) Choosing the most effective
feature and threshold at each node. Some design criteria
[46,55,90] are low error rate, minimum number of nodes on
the tree, shortest path length, and weighted sum of these
factors [46]. Numerical examples show that local optimality
does not ensure global optimality and that no simple method
exists for specifying the optimal tree structure in a given
problem [47]. Therefore, only conditional optimality can be
achieved. Game tree search techniques and the look-ahead
property have achieved partial global optimality [76]. Most
practical tree designs use heuristic approaches and make no
claim of optimality [50,59,70,73,75,8l,90]. Our study
provides an alternative heuristic approach, without

optimization'or look-ahead.

Systematic procedures have been developed for the first
component [46,55,65]. Given thresholds for features which
partition the feature space, Meisel's dynamic programming
method will generate equivalent partitions of that space
which are optimal in the sense of having min-max path

length, minimum number of nodes and minimum expected path

59

length. Since the number of possible trees under the
constraint of a given partition is still very large, this
algorithm is not feasible for large numbers of features

[55,65].

Many approaches have been proposed to attack both
components of tree design simultaneously [48,67,74]. A
specific criterion is selected to determine the feature
thresholds. At each node, the best threshold of the best
feature splits the training pattern set on this node. This
splitting criterion is used at every non-terminal node until
the tree is constructed. The maximum distance between the
empirical c.d.f.'s of the feature under different categories
[70] and the mutual information between the category and the
thresholded feature [74] have been used as criteria. We
examine a splitting criterion based on the S statistic
between a thresholded feature vector and category vector for

the two-class problem.

3.2.2 Computation of Feature Thresholds

Consider a non-terminal node with N training patterns,
each being described by M features. The features must be at
least on an ordinal scale, although binary features are

allowed. The data are represented as an NxMx2 matrix A,

60

where A(i,j,l) denotes feature j of pattern i, A(i,j,2)
denotes the category label of pattern i for all j, and
A(*,j,l) and A(*,j,2) denote the feature and category
vectors respectively for feature j. Order the entries so
that for each j, A(i,j,l) becomes the ith smallest value and
A(i,j,2) is the corresponding category label. A threshold
value between A(i,j,l) and A(i+l,j,1) creates a binary
vector Ai(*,j,l) with i 0's and (N—i) 1's. A similarity
measure between Ai(*,j,l) and the category vector A(*,j,2)
serves as the splitting criterion. Table 3.4 is an example
of feature and category values for feature j. Threshold 5.0
will result in binary feature vector A2(*,j,l)=001111, while
the threshold 8.0 will give A4(*,j,l)=000011.

Before sorting:

A(i,j,2): 0 0 0 1 l' 1

A(i,j,l): 8.21 7.43 1.20 5.56 4.23 8.02
After sorting:

A(i,j,2): 0 l l 0 1 0

A(i,j,l): 1.20 4.23 5.56 7.43 8.02 8.21

Table 3.4 Ordering of Training Patterns by Feature j
(i=l,2,...,6)

The splitting of a node can be described as the

Pascal-like procedure SPLIT written below.

61

Procedure SPLIT (node,MINCUT);
Begin
Build matrix A for this node and set N,M;
SMAX := 0;
.For j=l,M begin
Sort to obtain A(*,j,l);
Arrange A(*,j,2) accordingly;
For i=l,N-l begin
THRESH(j):=(A(i,j,1)+A(i+l,j,l))/2;
For k=l,i Ai(k,j,1):=0;
For k=i+l,N Ai(k,j,l):=l;
S(i,j):=SIMILAR [Ai(*,j,l),A(*,j,2)];
if S(i,j)>SMAX then begin
SMAX := S(i,j); IMAX :=i; JMAX :=j
end; (* if *)
end; (* for *)
end: (* for *)
if SMAX > MINCUT then begin
For i=l,IMAX the pattern corresponding to
A(i,JMAX,l) is passed on to the left-son;
For i=IMAX+l,N the pattern corresponding to
A(i,JMAX,l) is passed on to the right-son;'
SPLIT(left-son,MINCUT); SPLIT(right-son,MINCUT)
end; (* if *)

end; (* procedure *)

62

The algorithm for designing the whole tree is simply
SPLIT(root,MINCUT), where MINCUT is the user specified
minimum splitting criterion value. Several splitting
criterion functions SIMILAR can be used, such as average
mutual information [74]. We propose

S'=ES[(A(*,j,l),A(*,j,2)]-0.5E
as the SIMILAR function, since it has a direct intepretation

and known distribution under H0 (Chapter 1).

3.2.3 Efficiency in Feature Threshold Computation

Consider the N training patterns assigned to a node and
feature. j, represented by sorted arrays A(*,j,l) and
A(*,j,2). There are at most (N-l) possible thresholds in
the procedure SPLIT. To avoid checking every possible
threshold, we investigate whether the best feature threshold
occurs between runs of 0's or 1's in the A(*,j,2) vector, or
occurs at the boundaries of the largest runs. Sethi [74]
suggested restricting the search for feature thresholds to
the boundaries of runs, i.e., check threshold
[A(i,j,l)+A(i+l,j,l)]/2 only if A(i,j,2) # A(i+l,j,2).
Sethi didn't prove that other thresholds can be ignored when

the mutual information is used as the splitting criterion.

63

Appendix A gives a simple induction proof that only
thresholds at the boundaries of runs need be checked if S'

is the splitting criterion.

We now ask whether threshold checking can be further
restricted to the boundaries of the largest runs of 0's and
1's. To be specific, suppose the run lengths in A(*,j,2)

are kt,t=1,2,...,m and let

*
k =max{kl,k2,...,km}

which is achieved between pattern k1+...+ki_l+l and

kl+...+ki. We have observed that the threshold

[A(kl+...+ki_l,j,l)+A(kl+...+ki,j,l)]/2 or
[A(kl+...+ki,j,l)+A(kl+...+ki+1]/2

results in a binary feature vector with highest 5' among all
possible thresholds. We have not been able to prove this in
general. Figure 3.2 demonstrates this phenomenon with 14
cases. In each case, two vectors are presented. The first
vector is the category vector A(*,j,2). The second is the
feature vector thresholded to have the maximum E(S') value.
In every case, the "best" threshold point is at a boundary
of a largest run. The category vectors in the last two
pairs are identical, i.e., the "best” threshold is not

unique. If we can only check the boundaries of largest

64

runs, we can further reduce the time for finding feature

thresholds.

1110000100 11011100010110 01001111010010011101
0001111111 00000011111111 00001111111111111111

01011110 1101001111 0100001100 010011010111
00011111 0000001111 0000001111 000000000111

0000000000110 01001100000000000000 1000000100
0000000000111 00000011111111111111 0111111111

0111000110110110 01000110 01000010 01000010
0000000111111111 00000111 00111111 00000011

Fig.3.2 Fourteen Examples of Category-Feature Vector Pair

3.2.4 Numerical Examples

Since there are infinitely many different types of
classification problems, a formal comparison of binary tree
design algorithms is not feasible. Following the
literature, we use several numerical examples to demonstrate
the use of S' as a splitting criterion in tree design and
indicate its advantages over the mutual information
criterion in an informal manner. Our experiments are on

four artificial data sets and two real data sets.

65

Data sets 1 through 4 are from a cluster process and are
generated on a computer. Each cluster has about the same
number of patterns. The number of clusters is a Poisson
random variable. Patterns are distributed around cluster
centers randomly chosen in a unit hypercube according to a
Normal distribution with diagonal covariance matrix, all of
whose diagonal elements equal sigma, the spread factor. The
smaller sigma, the more distinct the clusters. We code the
patterns in odd numbered clusters "1" and those in even
numbered clusters "0” to serve as category labels. One half
of the patterns in each cluster are used for training while
the rest are for testing. The parameters actually used are

shown in Table 3.5.

Data sets 5 and 6 are created from the Munson
handprinted Fortran character set, containing 48 samples of
each of four letters, namely "I", "M", "O" and "X”. Each
character is represented by an eight-dimensional pattern
[12]. Data set 5 includes characters "I" and "M". Data set
6 contains all four characters with "I", "0" being one
category and "M","X" being the other. The first half of
each cluster (Data sets 1 through 4) and each alphabet (Data

sets 5 and 6) are used for training.

66

Table 3.5 Parameters of Artificial Data Sets

+----+ ---------- + ---------- + ---------- + ---------- + ------- +
E E no. of E no. of E no. of E no. of E E
EDataE training E testing E features E clusters E Sigma E
E setE patterns E patterns E E E E
+--—-+ ---------- + ---------- + ---------- + ---------- + ------- +
E l E 74 E 76 E 6 E 5 E .09 E
E 2 E 72 E 73 E 15 E 6 E .20 E
E 3 E 71 E 74 E 15 E 6 E .30 E
E 4 E 74 E 76 E 9 E 4 E 1.0 E
E 5 E 48 E 48 E 8 E (2) E - E
E 6 E 96 E 96 E 8 E (4) E - E
+----+ ---------- + ---------- + ---------- + ---------- + ------- +

The mutual information and S' are used as splitting
criteria to construct two binary classification trees for
each data set. The procedure SPLIT is used in all cases.
We specify a minimum splitting criterion MINCUT in every
case. When no threshold for any feature exceeds MINCUT,the
node in question will be a leaf. Using the direct
interpretation of S', we set MINCUT for S' at 0.475 for all
data sets, which corresponds to a significance level of.
0.05. That is, S'>0.475 means that the thresholded feature
vector and the category vector are more closely related than
95% of the pairs formed by permuting one of these vectors.
The mutual information levels are explained with the data

sets.

67

The tree classifiers are summarized in Table 3.6 through
Table 3.11. The trees for data sets 1 and 2 are given and
written in prefix form to give a general feeling for the
tree structure. The labels in "<>" ("a' or "b") are labels
of leaf nodes. Other numbers are the features used. For

example, 2(4<b><a>)<b> indicates the tree in Figure 3.3.

F2 < .382
/ \
/ \
F4 < .457 Cat <b>
/ \
/ \

Cat <b> Cat <a>

Fig.3.3 Tree Designed with S'
(minimum splitting S'=.475)
(artificial data set 1)

68

Table 3.6 Tree Design for Data Set 5

mutual
formation

.9.

I

I

I

I
=========3=======a=s=

I

I

+

splittin
criterio

MINCUT

"DID
IIH
“:1

0.330

training
recog.rate

testing

Table 3.7 Tree Design for Data Set 2

+ ----------- + ---------------- + ----------------- +
E splitting E mutual E E
E criterion E information E S' E
E MINCUT E 0.33 E 0.475 E
+ ----------- + ---------------- + ----------------- +
E tree E 6(<a>)(9(3(<b><a>)<b>)<b>) E
+ ----------- + ---------------------------------- +
E number of E E
E nodes E 7 E
+ ----------- + ---------------------------------- +
E training E E
E recog.rateE 1.000 E
+ ----------- + ---------------------------------- +
E testing E E
E recog.rateE 0.918 E
+ ----------- + ---------------------------------- +

69

Table 3.8 Tree Design for Data Set 1

+ ----------- + ---------------- + ----------------- +
E splitting E mutual E E
E criterion E information E S' E
E MINCUT E 0.33 E 0.475 E
+ ----------- + ---------------- + ----------------- +
E tree E 2(4<b><a>)<b> E 4(<b>)(2<a><b>) E
+ ----------- + ---------------- + ----------------- +
E number of E E E
E nodes E 5 I 5 I
+ ----------- + ---------------- 4 ----------------- +
E training E E E
E recog.rateE 1.000 E 1.000 E
+ ----------- + ---------------- + ----------------- +
E testing E E E
E recog.rateE 0.987 E 0.987 E
+ ----------- + ---------------- + ----------------- +

Both methods gave the same result for Data set 5 (Table
3.6) and Data set 2 (Table 3.7). The difference between the
trees for Data set 1 under the two criteria (Table 3.8)
involves only the order of the features used and slight
changes in feature threshold values. The minimum splitting
criterion value MINCUT for mutual information was chosen
because it gave good results for data set 5 in a preliminary
trial. The distribution of mutual information is not known
under H0 , although it has an asymptotic chi-square
distribution [6]. These data sets suggest that the two

criteria give comparable results.

70

Table 3.9 Tree Design for Data Set 3

splitting mutual

criterion

MINCUT E 0.330 E 0.200 E 0.100 E 0.475
----------- +-------+-----—-+-----——+—------
number of E E E E
nodes E 1 E 3 E 17 E 17
----------- +----—-—+-——----+——-----+—--—---
training E E E E
recog.rateE 0.563 E 0 761 E 1.000 E 1.000
----------- +-------+---—---+-—----—+-------
testing E E E E
recog.rateE 0.568 E O 730 E 0 716 E 0 716
----------- +--——---+--—----+-------+-------

Table 3.10 Tree Design for Data Set 4

+ ----------- + --------------- + ------- +
E splitting E mutual E E
E criterion E information E S' E
E MINCUT E 0.500 E 0.050 E 0.475 E
+ ----------- + ------- + ------- + ------- +
E number of E E E E
E nodes E l E 25 E 13 E
+ ----------- + ------- + ------- + ------- +
E training E E E E
E recog.rateE 0.635 E 1.000 E 0.865 E
+ ----------- + ------- + ------- + ------- +
I testing I I I I
E recog.rateE 0.618 E 0.566 E 0.566 E
+ ----------- + ------- + ------- + ------- +

71

Table 3.11 Tree Design for Data Set 6

+ ----------- + ----------------------- + --------- +
E splitting E mutual E E
E criterion E information E S' E
E MINCUT E 0.350 E 0.200 E 0.020 E 0.475 E
+ ----------- + ------- + ------- + ------- + --------- +
E number of E E E E E
E nodes E 3 E 5 E 9 E 5 E
+ ----------- + ------- + ------- + ------- + --------- +
E training E E E E E
E recog.rateE 0.885 E 0.979 E 1.000 E 0 979 E
+ ----------- + ------- + ------- + ------- + --------- +
I testing I I I I I
E recog.rateE 0.885 E 0.938 E 0.917 E 0 938 E
+ ----------- + ------- + ------- + ------- + --------- +

Data set 3 (Table 3.9) is very loosely clustered. The
recognition rates are lower than those for data sets 1 and
2. The value of MINCUT for mutual information has a
significant effect on the tree obtained. Data set 4 (Table
3.10) is completely random. Different MINCUT values for the
mutual information method result in very different trees.
Data set 6 (Table 3.11) also shows the effect of MINCUT

value with the mutual information criterion.

The above examples demonstrate that S' is a reasonable
splitting criterion in tree classifier design. The trees
designed with S' are as good as those designed with the
mutual information criterion. Since 5' has a known
distribution under H0 that is independent of the number of

patterns at each node, 5' has a direct meaning and the

72

MINCUT value for S' can be based on theory. The user must
select the minimum threshold MINCUT under the mutual
information criterion. These data sets demonstrate that the
selection of MINCUT has a dramatic effect on the tree
structure and on the recognition rates. Little prior
information is available for selecting a mutual information
threshold but a threshold on S' can be defined in a natural
way. Mutual information has an asymptotic chi square
distribution [6], but the degree of freedom depends on the
number of patterns at.each node. If one really wants the
statistical significance to be consistent in the entire tree
design, a p-value for mutual information can be approximated
and different chi square tables can be used when the number
of patterns changes from node to node, assuming the
asymptotic distribution is applicable. This process is much

more tedious than selecting the threshold of S'.

3.3 Summary and Conclusions

This chapter evaluated the similarity S, defined in
Chapter 1, in two applications. Both applications require
that the adequacy of a binary partition be measured. A
version of S named S' is compared to three well known

statistics (R, G B). The three are shown to be linear

amma'
functions of each other. All have ,asymptotic normal

73

distribution under H0. The measure 5' is shown to be
different from the others. When the numbers of 1's in both
vectors are half the vector length, 5' is highly correlated
to. other measures. The statistical significance of the
other three measures demands either Monte Carlo simulation
or rather complex approximation, while the significance

level of S' is obvious.

We also applied 5' to the design of a binary tree
classifier. Numerical examples demonstrated that S' is a
reasonable splitting criterion to be used in determining
features and their thresholds. The known distribution and
known statistical significance of 5' permits one to
establish the threshold of each feature in a simple and
direct fashion. By comparison, the mutual information
threshold must be approximated from asymptotic results and

changes from node to node.

CHAPTER 4

TEMPLATE MATCHING

This chapter provides a probabilistic analysis of some
statistics used in the template matching problem on binary
images. We assume mathematical models for both null and
alternative hypotheses. An approximately optimal statistic
and two other statistics are derived and their powers are
compared. We propose' a suboptimal statistic which has
reasonable power and is more sensitive to the true object
location than existing statistics. Along with the
experiments on artificial images generated under our
mathematical model, this statistic is applied to several

real Landsat'images.

The permutation statistic S defined in Chapter 1 and the
Pearson correlation coefficient are also applied to matching
binary images and are shown to provide result similar to
other sub-optimal statistics. In this chapter, we

concentrate on the statistic derived from the Neyman Pearson

74

75

Criterion, which is optimal under a weak assumption and

which has reasonable power and sensitivity.
4.1 Introduction

Template Matching is a simple classical approach to the
problem of locating an object in a digitized image
[29,68,79,82]. An image is an Nr by Nc matrix of gray
levels. A template is an Nr' by Nc' matrix containing a
picture of an object with Nr' << Nr and Nc' << Nc' The
image may contain that object without rotation or size
distortion, or it may contain no object at all. The object
in the image is the same as the template except for lighting
conditions. An example in Figure 4.1 shows an image and a
template which are pictures of the same scene but are seen
through different light frequency channels. This chapter

considers only binary images and templates.

76

 

Fig.4.1 An Image and a Template

77

The standard solution to template matching is to scan
through the image and compare the template with subimages at
all possible template locations using some measure of
similarity to decide whether or not the image contains an

object and if it does, to estimate its location.

Since there are (Nr-Nr.+l)(NC-NC.+1) locations for the
template, the computation complexity of the standard

solution to the template matching problem is
O((Nr-Nr.+l)(NC-Nc.+1)(Nr.*Nc.)).

Several methods have been proposed to reduce the
computational burden of the standard solution
[5,43,60,61,66,87,88,89]. Bolles' planning method first
applies an interest operator [57] to eliminate locations
with low interest. This method is effective for grey level
images and for images with structure, i.e. when pixels are
dependent. Our study employs a randomness model with
independent pixel values, which is appropriate for the Two

Stage solution described below.

78

The first stage [83] compares a subtemplate to the
subimages at all possible template locations. The second
stage applies the entire template only at the locations with
a sufficient match between the subtemplate and the subimage.
The smaller the subtemplate, the lower the computational
cost, but the higher the possibility of false match or
missing a true match. The problem of choosing the optimal

subtemplate size has been studied [83].

In one binary image model [83], a background pixel takes
value 1 with probability p independent of other pixel
values. The binomial distribution and the normal
approximation describe the distributions of the simple
matching coefficient between subtemplate and subimage. The
independence assumption is more reasonable when the

subtemplate is selected at random and is ”sparse" [83].

In two stage matching, the simple matching coefficient,
or equivalently, the absolute difference of pixel values
between template and subimage, has low computational cost,
but it has not been compared to other measures, such as the
Jaccard coefficient and the correlation coefficient [29].

These measures have been compared in other applications

79

[80].

This chapter examines two stage template matching on a
binary image and with a fixed binary subtemplate. We
suggest a probabilistic model for both null hypothesis and
alternative hypothesis and seek a powerful and
computationally feasible statistic for recognizing the
object. We do not attack the subtemplate size problem or

the methodology of the second stage.

4.2 Mathematical Model

Section 4.2.1 discusses notation for the mathematical
model. Section 4.2.2 defines five statistics for testing
hypotheses. Approximations to the Neyman-Pearson statistic

and two other statistics are defined in Sec. 4.2.3.

4.2.1 General Definition

An object is a mapping from a two dimensional grid of
size (Nr.xNC.) to {0,1}. A template is a perfect copy of
the object. In our experiment, each object pixel is
generated independently with probability p' of being 1. A
subset of n template pixels forms a subtemplate. Let q be

the fraction of 1's in the subtemplate. An image G is a

80

matrix of size (erNC) with binary valued elements. The
image may contain a single distorted version of the object
(template). The template is scanned over the image but the
image is seen only through the subtemplate. Let the
possible locations of the template in the image be
arbitrarily ordered from 1 to x0. Let Lx denote the set of
image pixels covered by a grid of size (Nr.ch.) at location
x. Define Nii(x) as the number of pixels at which both the
image and the subtemplate of a template at x are i for

i=0,l, 0 5 Nii(x) g n.

The mapping G is a matrix valued random variable and two
hypotheses can be stated. We assume independent pixel
values under both hypotheses. The template and the

subtemplate are fixed.

H0 : (No distorted object in the image),

Pr[G(i,j)=1EH0]=p for any (i,j).

Hlx: (The distorted object is at location x),
Pr[G(i,j)=lE(i,j) not in Lx, Hlxl=pI
Let Tx(i,j) be the template pixel value corresponding
to image pixel (i,j) when the template is at location
x. The parameters a and b measure distortion of the
object in the image, such as lighting condition

differences between the image and the template.

81

Pr[G(i,j)=lE(i,j) in Lx' Tx(i,j)=0, Hlx]=a,

Pr[G(i,j)=lE(i,j) in Lx' Tx(i,j)=l, Hlx]=b.

H1 : (The distorted object is at some location),

{G(i,j)} are distributed as in H1x for some x.

The random experiment consists of setting the template
at all locations of the image and viewing the image at first
only through the subtemplate. At each location y, 1 g y 5
x0, we observe N00(y) and Nll(y). Let Ey denote the event
that N00(y)= n0(y) and Nll(y)= nl(y) for fixed numbers n0(y)
and nl(y), where n0(y)+nl(y) g n.

Consider the likelihood ratio Rx":

Pr((W EYEHlx)
Y
Rx": ---------------
Pr(IWZE EH0)
Y

Events Ey(l) and Ey(2) are not independent under H0, H1,
or {Hlx' all x} when the templates at locations y(l) and
y(2) overlap, but we treat {BY} as independent events to

obtain an approximation to Rx"'

82

If x=y, the template and the distorted object coincide
under Hlx° In this case, N00(y) has a B[n(l-q),1-a]
distribution and Nll(y) has a B[nq,b] distribution, where
B[M,p] is a binomial distribution representing the result of
M independent Bernoulli trials with probability of success
p. Under H0, N00(y) has a B[n(l-q),l-p] distribution and
N11(y) has a B[nq,p] distribution. In addition, N00(y) and
Nll(y) are independent under all hypotheses because the

template and the subtemplate are fixed.

Pr(E EH ) n (y) n (y)
xx = ----57-;5— = u 0 v 1 Cxx where
Pr(ExEHO)
(l-a)p b(l-p)
u: -------- v: --------
a(l-p) (l-b)p .

Cxx= (a/pIn‘l‘q’((i-hI/Ii-p>)“q.

Similar equations can be stated for the special cases

a=0, a=l, b=0, and b=1.

83

The term Sxy involves the distribution of the sum of two

binomial random variables, so no closed form exists for sxy'
Poisson approximations create complicated expressions. We
choose to use a binomial approximation, as explained in

Appendix B. Let Rx denote the approximation to Rx' when

S is approximated as in Appendix B.

xy
R = un0(x)vnl(x)C T-T [--_w§2----p_]no(y)-nl(y) }
x xx' ' _ _ xY
y¥x (l wxy)(1 p)
log(Rx) = n0(x)log(u) + nl(x)log(v) + log(Cxx) +
--- wxy p ---
+ \ {[n0(y)-nl(y)]log[ ----------- I} + \ 109(ny)
/ (l-wxy)(l-p) /

4.2.2 Statistics for Testing H0 vs. H1x

To test Hlx' we simply place the template at position x
and compute a statistic based on Noo(x) and N11(x). The

five statistics which we will compare are defined below.

Dx = N00(x)+ Nll(x),
the simple matching coefficient,
Gx = N00(x)log(u) + Nll(x)log(v),

84

an approximation to the optimal statistic.

J = Nll(x)/(n-N00(x)),
Jaccard coefficient, in which only (1,1) matches

are considered, and n is the subtemplate size,
C = Pearson correlation coefficient,

S = the S statistic defined in Chapter 1.

The test using any of the above statistics is to reject
140 when the observed value of the statistic is larger than
some threshold. AMonte Carlo comparison study will be
<3escribed in Sec. 4.3 to see if any of the statistics can

Inatch the performance of 6x“
4.2.3 Statistics for Testing H0 vs. H1

The most powerful statistic for testing H0 vs H1 can be
<obtained by combining the statistics for tests of H0 vs
141x and using the assumption that all positions are equally

Llikely for the distorted object. Define

85

x0
4321805) 1
R. g ------------ = --- \ Rx.
x0 x0 5--
puygl EYEHO) x=1

We approximate each Rx. by RK from Sec. 4.2.1. The result
is a sum of products.
x
1 --0

R = --- \ R
x0 /

=1

The computational complexity of N00(y) and Nll(y) at
each location y is O(M). The complexity of computing Rx is
O(M ky), where ky is the number of positions in which
templates overlap with the template at location y. Since
kY is of order (Nr'Nc')' the complexity of computing R is
O(x0MNr.NC.). Since R uses the information provided in
overlapping locations, it should be more powerful, but more
complicated, than other statistics, such as D and G defined
below, which have complexity O(x0M). Statistics based on

J C and Sx are not considered since they act poorly for

x' x
testing H0 vs Hlx (Sec. 4.3).

D = max( D )
x x

86

G = max( G )
x x

Note that Gx uses only the information contained in
N00(x) and Nll(x) about location x and ignores the
information in overlapping locations. The (0,0) and (1,1)
matches are weighted according- to parameters p,a and b.
Statistic D is the special case of G that weights N00(x) and
Nll(x) equally, and is equivalent to the measure used

elsewhere [69,83].

In order to compare D, G, and R in testing H0 vs H1, we
define the sensitivities V(D), V(G), and V(R). Let x be the

true object location,

V(D) = E{ y : DY 3 DxIE
V(G)=E{y:Gy 3 0x}:
V(R) = E{ y : Ry Z RXIE.

The sensitivity is the proper number of locations which
should be identified for second stage template matching.
The smaller this number, the better this statistic as a
similarity measure in first stage template matching.
Sensitivity is just as important as power for a criterion in

first stage template matching.

87
4.3. Comparison Study

We compare the powers of D J C

x' x’ x' SK and Gx in
testing H0 vs H1x for a fixed x. The image size and
template size are both 8 by 8, the subtemplate size is 19
and pixels in the subtemplate are selected randomly. The
four parameters (p,p',a and b) establish four factors whose

levels are:

{0.2,0.4,0.6,0.8} for p and p',
{(0.l,0.9), (0.2,0.8)] for (a,b).

These parameters are known. In each case, 100 images
generated under H0 are used to establish a threshold and
another group of 100 images under H1x are used to estimate

powers.

Based on the resulting power estimates, we performed a
standard two sample t-test for each pair of comparison
between statistics (Appendix C). The results indicate that

D J Cx and SK have essentially the same power, while

x' x'

Dx is slightly more powerful than the other three with
significance level 0.6. Statistic Gx is more powerful than
all others with at least 0.75 significance level.

Therefore, we limit ourselves to Gx and Dx for testing H0 vs

88

Now we compare D,G and R in testing HO vs H1 using the
following criteria: (1) power, (2) sensitivity, (3)
computational complexity, (4) feasibility of analytical
thresholding. Our experiments study the effects of the

following factors.

(1) Parameter values for p,p',a and b,
(2) State of knowledge: known or estimated parameters,

(3) Subtemplate size.

Statistics D and G are compared analytically, while a
large scale Monte Carlo simulation is used to evaluate D,G
and R. All tests have size 0.05. Empirical thresholds for
tests are estimated from 100 Monte Carlo trials under H0.
Power estimates are based on 100 trials under the
alternative hypothesis. The analytical comparison of D and
G is given in Sec. 4.3.1. The Monte Carlo comparison of D,

G and R is described in Sec. 4.3.2. Results are summarized

in Section 4.4.

89

4.3.1 Analytical Comparison of D and G

The normal distribution is used to approximate the
distributions of D and G under H0 and H1. We approximate

powers for the following levels of parameters:

{0.2,0.5,0.8} for p and p‘,
{(0.l,0.9), (0.2,0.8)} for (a,b).

To obtain numerical values, typical image/template sizes

of 16x16/8x8 and 32x32/12x12 were used.

Based on the resulting power estimates (Appendix D), we
performed a standard two sample t-test which indicates that

G is more powerful than D at level .025.

4.3.2 Monte Carlo Comparison of D, G and R

The image size is 16x16, the template size is 8x8 in
this study. For each combination of the following parameter
sets or variables, nine different random seeds are used to
start the process. For every random seed, 100 trials are
used to estimate the threshold and another 100. trials are
used to estimate power. The parameters vary in the

following ranges:

90

values of p: {.2,.5,.8};
values of p': {.2..5..8};
values of (a,b): {(.l,.9),(.2,.8)}.

There are five different states of parameter knowledge
(coded in variable SPK). For SPK>0, p is estimated from 100
random images, different from the images used in estimating
‘threshold and power.

SPK=0= know all parameters;

SPK=1: estimate p; p',a,b are known;

SPK=2: estimate p,p': assume a=.10,b=.90;

SPK=3: estimate p,p'; assume a=.25,b=.96;

SPK=4: estimate p,p': assume a=.05,b=.88.
Subtemplate sizes: {19,26,33,39];

The parameter p' is used only in statistic R. The
=Ei‘tclbtemplate is selected randomly, except in one experiment
(: ﬁL ndicated in Table 4.5), in which the number of 0's and 1's

‘3‘ 1:? e approximately equal.

91

4.4 Results

The experimental results for comparing D, G and R
described above are collected in this section. In
Sec. 4.4.1, Table 4.1 - Table 4.5 list the average powers of
D, G and R. In Sec. 4.4.2, Table 4.6 - Table 4.10 list the
average sensitivities of D, G and R. In Sec. 4.4.3, Tables
4.11 - 4.12 list the results of two sample t tests comparing

powers and sensitivities of D, G and R.
44.4.1 Power Study Results:

Table 4.1 lists the mean powers PD, PG and PR for the
<=:aase when all parameters are known, the distortion is low,
iEIlﬁd the template size is relatively small. The marginals
5L Indicate the average effect of p and p'. From the overall

‘t=~<:tal averages, we can compare powers between statistics and
<==I::mpare to results of other subtemplate sizes, other
=E=i1LJbtemplate selection methods, other parameter sets for
(:«EEI,b), and other options for estimating parameters. In
Table 4.1 and all following tables, "size" means the

s ubtemplate size .

 

92

Table 4.1 Effect of (p,p‘) on Powers
SPK=0, (a,b)=(0.l,0.9), size=l9

p= .2 .5 .8
+ ------------- + ------------- + ------------- +
E .23 E .80 El.00 E .68
P'=.2 E .31 E .82 E 1.00 E .71
E .30 E .95E 1.00 E
+ ------------- + ------------- + ------------- +
E .89 E .75 E .83 E .82
p':.s E .95 E 75 E 95 E .89
E .99 E .74 E 1.00 E .91
+ ------------- + ------------- + ------------- +
E1.00 E 79 E .25 E .68
P'=.8 E 1.00 E .82 E .37 E .73
E 1.00 E .97 E .33 :
+ ------------- + ------------- + ------------- +
71 .78 .70 .73
.76 .80 .78 .78
.77 .89 .78 .81
+ ------------- +
I l
3 PD p, I
l P I
I ........... B-I

Parameters p and p' indicate the difference in frequency
0 f 1's between the background and the object. The further
a~part p and p' , the: easier. the object is to detect. Our
11‘ esults .confirm this. The mean powers of any statistic for
t he cases (p,p')=(.2,.2) and (.8,.8) are comparable, which
I: e flects the symmetry in the problem. That is, reversing
t he coding should not alter the results. A similar comment
C an be made for other symmetric locations in the table. All
three mean powers for the cases (p,p')=(.2,.2) and (.8,.8)

are low, because the information for discrimination in these

.75

.77

93

cases is low. The expressions for Gx' u and v show that
when p is small then u<<v so N11(Y) is more heavily weighted
than N00(y). That is, the discrimination is based primarily
on N11(Y)' even though Nll(y) is small. This weighting
between N00(y) and Nll(y) made G have higher power than D.
Based on this understanding, we will develop a subtemplate
selection method to increase Nll(x) when both p and p' are
small. The result of this method will be described in Table

44.5.

The first row of Table 4.2 contains the grand averages

.ffrom Table 4.1. The second row contains the grand averages
<=rf a table for (a,b)=(0.2,0.8). Table 4.2 shows that
li.r1creasing distortion between template and object in the
image increases the difficulty in detection. Table 4.3
shows grand averages as functions of the state of parameter
l(ruowledge, while Table 4.4 shows them as functions of

S ubtemplate size .

Table 4.2 Effect of (a,b) on Powers
SPK=0, size=l9.

D G R
+ ---------- + ---------- + ---------- +
(a,b)=(.1,.9) E 0.73 E 0.78 E 0.81 E
+ ---------- + ---------- + ---------- +
(.2,.8) E 0.49 E 0.58 E 0.74 E
+ ---------- + ---------- + ---------- +

94

Table 4.3 Effect of Parameter Knowledge on Powers
(a,b)=(0.2,0.8), size=l9

D G R

+ ---------- + ---------- + ---------- +
SPK=0 E 0.49 E 0.58 E 0.74 E
+ ---------- + ---------- + ---------- +

1 E 0.49 E 0.63 E 0.74 E
+ ---------- + ---------- + ---------- +

2 E 0.49 E 0.58 E 0.68 E
+ ---------- + ---------- + ---------- +

3 E 0.49 E 0.56 E 0.69 E
+ ---------- + ---------- + ---------- +

4 E 0.49 E 0.57 E 0.67 E
+ ---------- + ---------- + ---------- +

Table 4.3 suggests that it is not crucial to know the
distortion parameters exactly. Parameter p must be
Iestimated from images similar to that used, or from prior

lcnowledge about lighting conditions.

Table 4.4 Effect of Subtemplate Size on Powers
SPK=0, (a,b)=(0.l,0.9)

D G R

+ ---------- + ---------- + ---------- +
subt.size= 9 E 0.39 E 0.41 E 0.70 E
+ ---------- + ---------- + ---------- +

13 E 0.56 E 0.60 E 0.74 E
+ ---------- + ---------- + ---------- +

19 E 0.73 E 0.78 E 0.81 E
+ ---------- + ---------- + ---------- +

26 E 0.83 E 0.87 E 0.87 E
+ ---------- + ---------- + ---------- +

33 E 0.88 E 0.94 E 0.93 E
+ ------- ;--+ ---------- + ---------- +

39 E 0.92 E 0.96 E 0.95 E
+ ---------- + ---------- + ---------- +

Table 4.4 shows that increasing subtemplate size does,

95

indeed, improve the performance, which is intuitively
expected. Since the optimal size problem has been solved
[83], we did not study this in much detail, although the
optimality was defined without an explicit alternative

hypothesis.

Recall that when (p,p')=(.2,.2) or (.8,.8), the more
laeavily weighted number of pairs (N00 or N11) is usually
small when the subtemplate is selected at random. We
aapplied another algorithm which selects a subtemplate with
aalmost equal numbers of 0's and 1's! The results are shown
i.n Table 4.5. Comparing Table 4.1 and Table 4.5 shows that

tZIie way of selecting the subtemplate increases power
153:ignificantly when p=p' is very small or very large and does
17I<3t affect it in other cases. Thus, a balanced subtemplate
performs better than a randomly selected one. Actually, we
at ight vary the fraction of 1's in the subtemplate according

‘t=-<:> the parameters p, p', a and b to get optimal results.

96

Table 4.5 Powers when Subtemplate
is Balanced

I

P'=.2 E .93
I
I

p':.s E .95
I
I

I

P'=.8 E 1.00
I
I

+-—-——-+--—-+-—-- +
\J
N

+---— +--- +--- +
\O
\O
O)
\D

41.. 49E ..2 Sensitivity Study Results

Tables 4.6 - 4.10 show the results of the sensitivity

as . .
t:“dl<:3y for the conditions of Table 4.1 - 4.5, respectively.

.96

.90

.97

.94

97

(a,b)=(0.l,0.9), size=19

Table 4.6 Effect of (p,p') on Sensitivities
SPK=0,

0.8
+------------+------------+------------+

0.5

0.2

p:

2
4.
4.
O
2
9
1
O
3
5
6
l
6
2
O
1
4.
2
O
l
2
4.
O
2
6
0
O
2
0
l
O
2
l
0
4.
l
0
O
4.
2
2
O
6
2
O
=
P

a.
.2
0
1
.5
1.
0
1
.2
.2
O
1.
nu
.3
O
.1
.2
1.
O
l
.2
.2
O
1.
o.
1.
O
1.
o.
1.
O
1.
3
.2
O
1.
a.
.2
O
1.
1.
1.
O
1
.2
.2
O
l
.5
=
P

l
7
2
8
2
O
2
7
3
O
3
3
6
O
3
3
6
O
3
0
8
O
6
3
5
O
2
4.
9
O
l
8
0
2
7
9
O
l
7
2
O
l
3
2
O
l

l
2
o
2
6
9
o
1
9
5
o
2.
9
l
o
2
0
0
o
2
9
0
o
3
5
0
o
2
3
7
o
l
0
8
o
l
0
4
o
2
4
1
o
2
9
8
o
2

+------------+

+-----------—+

size=l9

Table 4.7 Effect of (a,b) on Sensitivities
SPK=0,

D

E 1.95 E 2.21 E

2.59
+----------+----------+---—------+

(a,b)=(.1,.9)

5.22 E

.98

4

E 7.44 E

(.2,.8)

98

Table 4.8 Effect of Parameter Knowledge on Sensitivity
(a,b)=(0.2,0.8), size=l9

D c R
+ ---------- + ---------- + ---------- +
spxso E 7.44 E 4.98 E 5.22 E
+ ---------- + ---------- + ---------- +
1 E 7.21 E 5.01 E 5.14 E
+ ---------- + ---------- + ---------- +
2 E 7.27 E 4.47 E 5.15 E
+ ---------- + ---------- + ---------- +
3 E 7.44 E 6.41 E 7.53 E
+ ---------- + ---------- + ---------- +
4 E 7.44 E 4.94 E 5.58 E
+ ---------- + ---------- + ---------- 4-

Table 4.9 Effect of Subtemplate Size on Sensitivity
SPK=0, (a,b)=(0.l,0.9)

D G R

+ ---------- + ---------- + ---------- +
subt.size= 9 E 8.29 E 7.03 E 6.06. E
+ ---------- + ---------- + ---------- +

13 E 4.88 E 3.81 E 3.69 E
+ ---------- + ---------- + ---------- +

19 E 2.59 E 1.96 E 2.21 E
+ ---------- + ---------- + ---------- +

26 E 1.61 E 1.27 E 1.50 E
+ ---------- + ---------- + ---------- +

33 E 1.25 E 1.09 E 1.28 E
+ ---------- + ---------- + ---------- +

39 E 1.10 E 1.02 E 1.19 E
+ ---------- + ---------- + ---------- +

99

Table 4.10 Sensitivities when Subtemplate
is Balanced

I
I

p'=.2 : 1.04
i

p':.s

Tables 4.6 - 4.10 indicate that the effects of

( p,p'),(a,b) and subtemplate size on sensitivity are similar

t:o their effects on power. Estimating parameters does not
a: ffect sensitivity very much. Interestingly enough,
sstatistic G is more sensitive than statistic R, except when
‘tihe subtemplate is very small. An intuitive reason for this
is that the high value of Rx may happen at several locations

OVerlapping with the the true object location. The use of
1“formation at overlapping locations provides higher power

‘3}1an ignoring the overlap, but some sensitivity is lost.

100
4.4.3 Formal Comparison

Table 4.1 - 4.10 use only mean values of power estimates
and sensitivity estimates and give a general idea of the
effects of parameters. To compare the three statistics, we
performed a standard two sample t test in each case of
different subtemplate size. For example, we test the
hypothesis Power(D)>Power(G). A positive t value indicates
the acceptance of the hypothesis at the critical level wt.
A negative t value indicates the acceptance of the reverse
hypothesis Power(D)<Power(G) at the critical level wt.
Tables 4.11 and 4.12 show the values of t and the values of

w The sample size for every block is 81. In Table 4.11,

t.
PWD, PWG and PWR represent powers of D, G and R respectively

in tests of H0 vs H1.

101

Table 4.11 Power Comparison

(a,b)=(0.l,0.9)

(results of two sample t test)
SPK=0,

PWD>PWR PWG>PWR

PWD>PWG

size

+ 'II'I
.
. 0
. 0
- o
.
_ 2
. 4
. 6
- o
. 4
. .
.
+ llll
.
_ 0
. 0
_ o
.
. 5
. 0
. 2
- o
_ 5
. .
_
+ 'IIII.
.
. 5
. 3
- o
.
. 6
. 8
- 3
— o
. 0
. _
.
+
9

+ lu|.l|
_
. 0
. 0
. o
.
. 8
_ 6
. 7
— o
. 2
. .
.
+ Il'l
.
. 0
_ 0
- o
.
. l
_ 0
_ 5
- o
_ 3
. .
_
+ III-Il|
.
. 4
. 2
— o
.
. 8
. 0
. 7
— o
_ 0
. _
.

3

1

-0.672

I
.05 E

.25

+
.
no.
4 .
..
.
nu .
.4 .
1. .
. .
nu .
. .
.
uuuu+
.
4..
1..
..
.
.4 .
o. _
nu .
. .
1. .
. .
.
nunu+
.
,o.
1..
..
.
a. .
o. .
o. .
. .
no .
. .
.
unuu+
6
2

.39

: 0.601
: .27

8

.0
+------------+-—-—--------+—---------—-+

: '1.450
I
I

.03

l

- I

w l
L}.

+----------—-+
t

+-------—--

102

Table 4.12 Sensitivity Comparison
Result of Two Sample t Test
SPK=0, (a,b)=(0.l,0.9)

size V(D)>V(G) V(D)>V(R) V(G)>V(R)

+ ------------ + ------------ + ------------ +
9 : 1.132 : 1.965 : 0.881 :
: .13 : .03 : .19 :
+ ------------ + ------------ + ------------ +
13 : 1 654 : 1.826 : 0.221 :
: .05 : .04 : .41 :
+ ------------ + ------------ + ------------ +
19 g 1.681 : 0.983 : -0.881 :
: .05 : .16 : .19 :
+ ------------ + ------------ + ------------ +
26 : 2.098 : 0 649 : -2.259 :
: .02 : .26 : .01 :
+ ------------ + ------------ + ------------ +
33 : 2 108 : -0.33s : -3.609 :
: .02 : .37 : .00 :
+ ------------ + ------------ + ------------ +
39 : 2.207 : -1.665 : -4.316 :
: .02 : .05 : .00 :
+ ------------ + ------------ + ------------ +

+ ------------ +

i t i

I W I

1------_---1-1

Table 4.11 indicates that R is more powerful than the
other statistics when-the subtemplate is small. This may be
explained by the fact that overlapping locations provide
most of the information for discrimination. Both R and G
are significantly more powerful than D in almost all cases.
When the subtemplate size is 19 or larger, R is only
slightly more powerful than G. A similar situation exists
for sensitivity, except that G is significantly more

sensitive than R and D when subtemplate size is large. The

103

reason for R to perform worse when the subtemplate size is
large might be the fact that some assumptions made in
deriving R are violated when the subtemplate points are more

dependent.

4.4.4 Results on Landsat Images

This study involves landsat images of the same scene
from different light frequency channels. Each image was
converted into a binary image using the average grey level
of that image as threshold. We arrange the study into
several cases. In each case, we take a subimage of size 8x8
from a channel i image to form an object or template. Then
we take another subimage of size 64x64 from channel j (i#j)
which includes the object. We consider that this 64x64
image contains a intensity distorted (not geometrically
distorted) object. We apply our scheme to find the relative
location of the distorted object. We studied five measures
of similarity and three subtemplate sizes. Each study was
repeated three times. The parameter p is estimated by the
fraction of 1's in the 64x64 image from channel j. The
parameter p' is estimated by the fraction of 1's in the 8x8
template. The distortion parameters (a,b) are assumed to be

(0.1,0.9). A balanced subtemplate is formed in each case.

104

Table 4.13 Results on Several Landsat Images
subtemplate
size V(D) V(G) V(R) V(max(Cx)) V(max(Sx))
9 1512 1211 718 2244 1222
9 1719 1456 829 2159 1188
9 154 110 1528 110 110
l9 l8 14 541 18 22
19 165 64 1630 42 36
19 1188 979 711 1117 1117
29 79 64 544 78 102
29 926 756 750 787 722
29 719 468 681 969 719

Among the nine cases in Table 4.13, G is more sensitive

than D in every case; 6 is more sensitive than R in five

cases; more sensitive than max(Cx) in seven cases; more

sensitive than max(Sx) in five cases. These results support

our conclusion about G drawn from artificial data.

In two cases using very small subtemplates, statistic R

performs best of all statistics. However, in four cases

using large subtemplates, R performs worst. The reason may

be that the independent assumption we made in the derivation

of R is violated for the Landsat images. Since R is the

only statistic using the information from overlapping

locations, it performs rather differently from all other

statistics.

105
4.5 Summary and Conclusions

This chapter stated a null hypothesis (no object) and
alternative hypotheses (single object present) for a problem
in binary template matching and examined an approximation to
the Neyman Pearson statistic for testing them. This
statistic is compared to the simple matching coefficient and
other similarity measures including the S statistic defined
in Chapter 1. A new sensitivity index is proposed to assess
the ability of a statistic to locate the true object in the
image. Power, sensitivity and computation cost are the

criteria for comparison.

In testing H0 vs Hlx' G is more powerful than all other
statistics. In testing H0 vs H1, R is most powerful, but
less sensitive than G to the true object location. The
statistic G, with the information in overlapping locations
ignored, is almost as powerful as R, but computationally
much simpler, and more sensitive to the true object location
when the subtemplate is large. The threshold of G for
testing H0 vs H1 can be obtained analytically, which is

another important advantage of G over R.

106

The simple matching coefficient D, is less powerful and
less sensitive than G, but has the same order of
computational complexity as G. The correlation coefficient
has been used for ,grey level image template matching
problems [29]. Correlation and the 5 statistic defined in
Chapter 1 have lower power and lower sensitivity than G and
R, with computational complexity at least as large as G.
Therefore, we suggest using G in the first stage of template

matching.

CHAPTER 5

COMPUTATIONAL CONSIDERATIONS

Since the S statistic is computed directly from the
cumulative hypergeometric distribution function (c.d.f.), we
will discuss algorithms for computing hypergeometric
c.d.f.'s that have appeared in the literature. Notation is
defined in Sec.5.l. The Peizer approximation is described
in Sec.5.2. We give a hardware architecture design for the
recursive algorithm in Sec.5.3 and an overall computational
time comparison in Sec.5.4. A brief conclusion is given in

Sec.5.5.
5.1 Notation
Consider two binary N vectors V1 and V2. We first

define n,r and a as follows, to be consistent with the

literature [52].

107

108

Table 5.1 Observables in Vector Pair

V2
1 0

+ ----- + ----- +
1 : a : b : n

V1 + ----- + ----- +
0 : c i d i m

+ ----- + ----- +
r S N

For example, for the following vector pair,

V1=[0010110°]

v =[10010110]

2

N=8, n=3, r=4 and a=l.

For computational convenience [52], without
generality, we can recode V1, V2 and 0,1, so that
a g d and a < b g c,
or, equivalently,
2a + l g n 5 r i m.

Also, we let k=min(a,b-1,c-l,d) [52].

loss of

The probability density function (p.d.f.) h(i) and the

c.d.f. H(a) of the hypergeometric distribution are defined

below.

109

(‘2) (ii?)

h(i) = h(N,n,r,i) = --------- for 0 g i 5 a,
N
(.J
a
H(a) = \ h(i) for 0 5 a 5 min{n,r}.
/
i=0

The S statistic studied in the previous chapters can be

written in the form of H(.) as follows.

S(V1,V2) = H(a) - [H(a)-H(a-l)]U

where U is an independent random variable distributed
uniformly over [0,1].
various

The remainer of this chapter is concerned with

ways of computing H(a). The computation of the c.d.f. H(a)

has attracted great attention. We compare them in terms of
in

computation time, i.e. the number of time cycles involed

the computation. We first define the following notation.

Ta
Tm

*3
Q:

: number

number
number
number

number

of
of
of
of
of

cycles
cycles
cycles
cycles

cycles

needed
needed
needed
needed

needed

in
in
in
in

in

addition,
multiplication,
division,

square root operation,

logarithm operation,

110

Tt: number of cycles needed in looking up a standard

normal distribution table.

Each number is the time needed for each operation when a
single CPU is used. It is also the number of segments in
each functional pipeline unit; e.g., Ta is the number of
segments in an pipeline adder. The computation time for
each algorithm discussed below assumes a single CPU unless

otherwise specified.

The order of operations is important since the word
length in a computer is limited. The ratio of two long
products should be done by alternating division and
multiplication. Since the direct computation involves
factorials and a great deal of repetition, it is very time
consuming. The computation time is

aTa+ (a+l) [(2r-1)Tm+ 2er]

The following formula provides a recursive way to

compute h(j).

111

m(m-l)...(m-r+l)
N(N-l)...(N-r+l)
h(j+l) = h(j)X(j)
(n-j)(r-j)
where X(j) = ------------------
(j+1)(N—n-r+j+l)
The computation time for the recursive formula is

(r-1)Tm + er + a(2Tm+ 2Td+Ta).

5.2 Peizer Approximation

The approximate computation of H(a) has been
investigated by many researchers and a recent extensive
empirical study [52] of the accuracy of 12 normal and three
binomial approximations showed that a normal approximation
by Peizer is both far superior to other normal appoximations
and simple to compute. Since the binomial tail is almost as
difficult to compute as the hypergeometric tail, the
binomial approximation is not recommended [52]. In this
section, we briefly state the Peizer formula and give a

simple discussion on computational complexity.

The Peizer approximation is
8(a) ~ Fn(z)

where Fn is the c.d.f. of the standard normal distribution,

112

and

.02 .01 .01
where A = a+.5, A' = A+(l/6), A" = A'+----+---+--- ,
B,B',B",C,... are defined in a similar manner from b,c,...,

and L = Alog(AN/nr)+Blog(BN/ns)+Ciog(CN/mr)+Dlog(DN/ms).

The maximum absolute error of this approximation is .001
if k>2, with the maximum relative (percent) error being
0.71% through 19.7%. The guaranteed number of correct
decimal places in this case is at least 3.040 [52].' When k
is small, approximations are not needed since the exact

computation is trivial.

The computation time of Peizer's approximation T(Peizer)
is a constant, independent from the vectors V1 and V2. The

computation time (using a single CPU) is

37Ta + ZZTm + 18Td + 4Tg + Tr + Tt

We noticed that some portions of the computation can be done
concurrently, e.g., Alog(AN/nr) and Blog(BN/ns). If we have
sufficient hardware and perform all possible concurrent

computations, the computation time will be reduced to

113

3Ta+4Tm+Td+Tg+Tr+Tt.

5.3 A Hardware Implementation

In recent years, considerable efforts have been devoted
to developing special computer architectures for pattern
recognition and image processing [20,63]. We desribe a two
level pipeline design for the recursive computation of H(a).
Section 5.3.1 is an overview of the architecture. A
detailed discussion of the pipeline structure is given in

Sec.5.3.2.
5.3.1 An Overview of the Architecture

The overall structure is shown in Figure 5.1. The input
is the observables of vector pair V1 and V2. Since the
computation of h(O) can be implemented by a straightforward
design or by a table look-up, h(O) is considered as input in

our design.

114

Nnra

    

hti+gl

lvle-ol

Hta)

Fig.5.1 Overall Architecture for Computing H(a)

The bottom pipeline adder contains z(=Ta) segments. The
pipeline multiplier with y(=Tm) segments computes values of

h(.) in the following fashion.

h(i+y) = h(i)Y(i) where 2(1) = X(i)x(i+l)...X(i+y-l)

115

The box on the upper left corner in Figure 5.1 produces
X(j) for different j every cycle. The box right below it
computes Y(i) from X(j), which is the main part of this
design. The details of those two boxes are shown in Figures
5.2 through 5.6. The boxes marked "MUX' indicates a
multiplexer and the small boxes with a decrement or
increment input are counters. The input from above to these

counters are initiation lines.

n r N-n-r+1 1 a

 

¢-— clock
enable

.XIJ)

Fig.5.2 Compute X(j)

116
5.3.2 Computing Y(i)

We discuss this part of architecture in a general

environment. The function is
Y(i) = x(i) * x(i+1) * * x(i+y) =

where "*" can be any operator which is commutative

and associative.

In this design, a pipeline functional unit with x
segments is used and represented by a box marked with a
letter ”x". For the hypergeometric case, x=y=Tm. The
design of the pipeline can vary with y. We consider y =
2,3,...,15, which are the usual lengths of pipeline
functional units in computers such as Cray 1. Figures 5.3
through 5.5 show designs for y=4, 7 and 13. The general
design is shown in Figure 5.6. In the cases y=4 and y=13,
the design is the same as in the general case. The case y=7
takes advantages of individual y value and thus is different
from the general design. A box with one input, one output
and a number NUM, represents a NUM-bit shift register. Some
of those boxes are small, with NUM=1,2,4. Some are larger

with NUM=x,x-l,x-2.

117

XtJJ

 

 

 

 

 

 

Fig.5.3 Compute Y(i) from X(j) when y=4

118

 

 

 

[

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Fig.5.4 Compute Y(i) from X(j) when y=7

119

 

 

 

 

 

xi]:
1
L_:L. , x-l

 

 

 

 

 

Fig.5.5 Compute Y(i) from X(j) when y=13

120

 

1

1
[12:]

1

..---- 91---".%
X I X
982 g>2
4 1
o-o-gzoc-o ﬂux

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

IE}:-
I" ‘lf’
x
—_—._J‘
984 T!w 49(4
'x
g>8
9:6 g<8

HUX
Ytil

Fig.5.6 Compute Y(i) from X(j) (general design)

121

Multiplexer control is indicated either as conditions at
input lines, or as a dash line at the side connected to the
control signal. The control signals happen to be the bits

in the binary expression of y.

Y 3 Y3Y2Y1YO

To demonstrate how these pipelines work, we present a
time diagram showing the data contents in different portions
of the pipeline at different time steps. We pick y=7 and
x=3 and give the pipeline design in Figure 5.7. Boxes
(Cl,C2,C3), (Fl,F2,F3) and (11,12,13) are pipeline units.
Boxes B, D1, D2, El, E2 and G are one-bit shift registers.

The time diagram is shown in Figure 5.8.

122

 

 

 

 

 

 

 

 

 

 

 

Fig.5.7 Compute Y(i) from X(j) when y=7 and x=3

C1

C3

D1

02

E1

52

F1

F3

H1

H3

11

13

1234567891011

O1234

O123
OO12
123
OO

1

O
O12
IO

1

Fig.5.8 Time Diagram of Figure 5.7

(The left most column is a circuit element,

5

4

HQ NH #U

8 N8 U

6

5

UN Ulb

0—9 N.-

b

r-Q u.-

7

6
5
6

8 F9 NF UN ON

HQ

#10

HS NO

7

UN Old 019 NIUI

UIU 0| 8 N8 0‘8 NF

9N8 Ut-

9 1O
8 9
7 8
8 9
5 6
5 7
4 5
5 5
3 4
4 5
2 3
3 4
1 2
2 3
O 1
3 4
O O
1 2
7 8
4 5
5 7
2 3
4 5
O O
3 4
O O
1 2

12

11

1O

U9 UIN eu (Jib OIUI ~10! GIN

mm 10

US 019 0&1

13

12

11

1O
11

0|“ Ulb OIUI \IOI ON! 100

hr-

1O

401 (0‘1

93 08

the numbers on the right are the contents

in that element.)

14

13

12

11
12

~19 (DUI ~10 ON] mm

0110

11

UIO q.—

15

14

13

12
13

1O
11

~10 0‘1 mm

...
P10 N OIU (DUI

...

018 ON (0‘!

16

15

14

13
14

11
12

1O
11

(DO 0% 100

Nib

13

1O
12

1O

17

15

15

14
15

12
13

11
12

1O
11

1O

GUI 8‘1

14

2
8

18
17
15

15
15

13
14

12
13

11
12

1O
11

1O

11

15

12
14

1O
12
11

3
9

124

The computation time and the circuit complexity of the
part computing Y(i) from X(j) are shown in Table 5.2. The
time complexity, denoted by Ty, is the number of time cycles
through the pipeline. The circuit complexity, denoted by

M is the number of functional units used in the general

Y,
design in Fig.5.6. Also, we define the following functions.

fl(y) - 1 if y is an odd number,
= 0 otherwise;
f2(y,a) = 1 if y > a,

0 otherwise.

125

Table 5.2 Time and Circuit Complexity of

the Part Computing Y(i) from X(j)

 

 

“I
WOQUILIbUN

rotor-bro
bUNO-‘O

 

HHH-l-l-IH—lhiﬁ-l-I—IH

p
I]

general des1gn for Individual des1gn for
all g=2.3.....15 each 9 value when x>3
3110929] ’ 5 11°929] ’ ‘93’92’HIIQg‘1’
xrlog 912+g-1-f119l

L ooooooooooooooooooooo db ooooooooooooooooooooooooooo
x + 1 x + 1

p OOOOOOOOOOOOOOOOOOOOO C(- OOOOOOOOOOOOOOOOOOOOOOOOOOOO d
2x + 1 2x + 1

p OOOOOOOOOOOOOOOOOOOOO db OOOOOOOOOOOOOOOOOOOOOOOOOOOO 1
2x + 3 2x + 3

p ooooooooooooooooooooo q- ooooooooooooooooooooooooooo .4
3x + 3 3x + 3
3x + 5 3x + 3
3x + 5 3x + 5

...................... qb-OQo-u—ococnoocoouc---—------J
3x + 7 L 3x + 7
4x + 7 4X * 7

)- ooooooooooooooooooooo «I- ---------------------------- d
4x + 9 4x + 7
4x + 9 4X + 7

p ooooooooooooooooooooo cl oooooooooooooooooooooooooooo .4
4x + 11 4x + 7

p ooooooooooooooooooo db ............................ «1
4x + 11 4x + 11
4x + 13 4x + 11 ﬂ
4x + 13 4x + 13

 

 

 

126

Table 5.2 demonstrated that the individual design for a
fixed y value can be more efficient than the general design

for y<15.

We now derive the total computing time of 8(a), for

Fig.5.l, which is denoted by T We first define

total‘
TX( ) to be the time for computing X(0) (Fig.5.2), define

Tmerge to be the time for the bottom pipeline adder to

produce final result after h(a) is fed in the adder. We
have

T = T

total x(.)

+ TY + Tm + a + 1 + Tmerge'

T 3 Tu + T

XI.) d

T9 2 T.rlong.1 + T. - 1 - t1(Tn’

Llonglj
T a Tlrlong31-(2 -r.)+r

merge 2(T'.a)(T.-a)rlogza]

L1 0921...]

T s (2rlongn]+41TI - 2 + Td + a +

total

+ tth-.a)(T--alrlogza] - rllrn’

127
5.4 Comparison

We compare the computation times of all methods in Table
5.3. The word "soft” in parentheses means software serial
computation on a single CPU; the word "hard" means special
hardware which performs concurrent operation with possible
pipeline structures. Since the actual computation in a
computer depends on programming and the individual machine,
the numbers in Tables 5.3 and 5.4 provide only the order of
magnitude. The logarithm operation is performed by table
look-up, all of which take 4 time cycles. The square root
operation assumes a standard algorithm [39]. Based on
Cray-l parameters, every cycle is 12.5 nsec and the number

of cycles for various operations are as follows.

Ta = 6, T111 8 7, Td-29,
T9 8 4, Tr =30, Tt- 4.

The results in Table 5.3 are computed according to [64].
In Table 5.4, we show how the computing time ranges over
values of a for various methods when r=50, 150 and 450. The
times are in number of cycles. All the times are computed

when Cray-l parameters are used [10].

128

Table 5.3 Computation Time Comparison

 

 

 

 

 

 

 

 

 

general form of typical computing
computing time time based on Crag 1
°"°°‘ aT +ta+11112r-11T +2rT 1 72ar+72r-a-7
(soft) a I d
recursive _ _
(soft) . (r ilTl+er+a12Tn+2Td+Tal 36r+78a 7
Ll092TIJ
recursive (2rlong.‘|+4lTn 2 +Td+a+ a + 94 +
(hard)
+£21T..al(Tu-a1r109231-tliT.) +t2(7.a117-a)rlo92a]
Peizer
(sort) 37Ta+22Tn+iBTd+4Tg+TP+Tt 1026
Peizer
(hard) 373* “T.* Tg’ 79*Trth 113

 

Table 5.4 Typical Computation Time Ranges
( for a=0,1,2,...,r)

r=50 r=150 r=450
+ --------- + ------------- + ---------------- + ---------------- +
: direct :3,593 ~ :10,793 ~ :32,393 ~ :
: (soft) : 183,543| 1,630,643: 6,223,335:
+ --------- + ------------- + ---------------- + ---------------- +
:recursive:l,793 ~ : 5,393 ~ :16,l93 ~ :
: (soft) : 5,693: 17,093: 51,293:
+ --------- + ------------- + ---------------- + ---------------- +
:recursive: 94 ~ : 94 ~ : 94 ~ :
: (hard) : 144: 244: 544:
+ --------- + ------------- + ---------------- + ---------------- +
: Peizer : :
: (soft) : 1'025 :
+ --------- + ----------------------------------------------- +
: Peizer : i
: (hard) : 113 :
+ --------- + ----------------------------------------------- +

 

129

'The computing time of the Peizer approximation is
problem independent. Therefore this approximation is
prefered when a > 932 and the exact value of 8(a) is not
necessary. The hardware approach to the recursive exact
method is much faster than other exact methods in all cases.
It is faster than Peizer approximation using single CPU when

a < 932.
5.5 Summary and Conclusions

In this chapter, we reviewed some existing techniques
for computing the c.d.f. of the hypergeometric distribution.
Since the direct computation is very time consuming, an
approximation is desired in many cases. The recursive
algorithm for the exact computation avoids much of the
repetition of the direct method. We proposed a hardware
pipeline architecture which implements the recursive formula
and presented a detailed design. The main part of the
design is actually a possible solution for a more general
problem. This pipeline architecture is computationally
efficient. For a large range of values of parameter a, the
proposed architecture is 100 times to 10,000 times faster
than the direct computation, 50 times to 100 times faster
than the recursive method using single CPU, and two to ten

times faster than the Peizer approximation using a single

130

CPU. This design uses six functional units for pipelines
with no more than 15 segments. The modularity and the

regularity of the system make it suited for VLSI

implementation.

CHAPTER 6

CONCLUSION AND FUTURE RESEARCH

6.1 Conclusions

The statistic S has a known distribution under a
permutation null hypothesis and its critical level and
threshold are easy to determine. This statistic is a good
similarity measure between features in preliminary analysis
for binary features. It has about the same power as the
correlation coefficient. The statistic S is a good
similarity measure between feature and category in binary
tree classifier design. It gives similar trees as the
mutual information criterion. The statistic S is also a
reasonable adequacy measure for binary partitions in cluster
validity. It is different but highly correlated to
Gamma and other commonly used measures. In image template
matching, S acts as well as other sub-optimal similarity

measures between sub-images. An approximatation to the

likelihood statistic is proposed. The hardware architecture

131

132

design is more efficient than other methods for computing

hypergeometric c.d.f.'s.

The original contributions of this thesis are summarized
below:

(1) Adaption of the cumulative hypergeometric
distribution to the definition of similarity measures in
pattern recognition.

(2) Definition of the preliminary feature analysis
problem in pattern recognition and the use of S in this
analysis (Chapter 2).

(3) Study of the relation among adequacy measures for
binary partitions and the successful application of S in
binary tree classifier design (Chapter 3).

(4) Definition of alternative hypothesis in .image
template matching and the design of a modified
Neyman-Pearson statistic (Chapter 4).

(5) Design of a hardware implementation for computing the
hypergeometric c.d.f. which is much faster than

conventional means (Chapter 5).

133

6.2 Future Reseach

Future research includes extension to multivalued
vectors, extension to two sided tests in preliminary feature
analysis, extension to multi-class tree classifier design,

and the multiple stage approach to image template matching.

6.2.1 Extension to Multi-Valued Vectors

This thesis involved binary vectors. The definition of
a permutation statistic for multivalued nominal vectors is a
direct extension, if we can define "match” between nominal
values. When the match is obvious from common sense or from
the physical meaning of the variables, the definition of S
is straightforward. An example is given in Appendix E for a
special case. Since not very many similarity measures exist
for general nominal vectors, 5 could be an alternative. It
will have known distribution as in the binary case. Since
the computational complexity is high even for a moderate
number of values, some algorithm or hardware structure must
be used to apply this multi-valued permutation statistic to

the problems in this thesis.

134

6.2.2 Two-Sided Tests for Preliminary Analysis

In Chapter 2, the tests are one sided since we are only
interested in positive relations between features. This is
true for questionnaire data, but not true for the cases
where the coding (0,1) is irrelevent. Our statistic could
be max(:S(i,j)-0.5:) and still have known distribution. It

will be interesting to compare it with max(:C(i,j):).

6.2.3 Multi-class Tree Classifier Design

Tree classifiers have been used for classification
problems with a large number of pattern classes. At each
node, the recognition problem is to identify which subset to
which the unknown pattern belongs. If the tree itself is a
binary tree, each node still deals with binary partitioning
of a pattern set. A mulitivalued version of S' or some
hierarchy of the binary version of 5' might be used to
assess the association of features to the grouping of class

symbols.

135

6.2.4 Multi-stage and Sequential Approach to Template

Matching

In image template matching a multi-stage approach might
have certain advantages over the two-stage approach. An
analysis of optimal number of stages in a hypothesis testing
framework could be a future topic. It makes no sense to
scan the entire image after the true location of the object
has been found. A sequential decision making scheme under
our hypothesis testing might be more efficient than the
traditional approach. The hardware implementation of the
computation of the proposed statistics may also be a future

research topic.

136

APPENDIX A
THE THRESHOLDS AT THE BOUNDARIES OF RUNS

GIVE HIGHEST S VALUES

Let h(nl,i) denote h(n,nl,n2,i) for fixed n,n2, where
h(.,.,.,.) is defined as in Sec. 2.1; and let
n11
H(nl,nll) = > h(nl,1)

i=0
We need to prove that H(nl,nll) < H(nl+l,nll+l)
for nl+n2-n g nll 5 min(nl,n2). We proceed by induction

on I'll.

Base case (111g 1):

(1) If n 0,

11"
H(1,0)=(n-n2)/(2n)

H(2,l)=(n-n2)(n+n2-l)/[n(n-l)] > (n-n2)/(2n)

(2) If nllzl,
H(l,1)=(n+n2)/(2n)

H(2,2)=1 > (n+n2)/(2n)

Assume H(nl,nll) < H(nl+l,nll+l) for a particular n11.

137

(1) If “11.3 min(nl,n2)

(nl+l)(n- nl- n2+i)
Since H(nl +1, i)= ----------------- H(nl,i)
(n-nl )(n1+l- i)

(nl+2)(n- nl- n2+i)
H(nl +2, i)= ----------------- H(nl +1, i+l)
(n- n1 -l)(n1 +1- 1)

0< c.=-55lii15333l332ii). d..-531i311223123211’
1 (“'“1)("1+1'i) 1 (“‘“1’l)(nl+l-i)
Therefore,

> [H(n1,i)-H(nl+l,i+l)] < > [ciH(nl,i)-diH(nl+l,i+l)] < 0

(2) If n2>nl and nll'snl+l,

(nl+2)(n2- n1 -1)
+2)= --------------- H(nl +1, n H+l)<H(n +1, n l+1)
(nl+2)(n-n1-l)

H(nl +2, ,n1

138

APPENDIX B

APPROXIMATION or sxy WHEN y¥x
This section shows the derivation of Sxy (Sec. 4.2.1).

The following figure demonstrates the situation when a
distorted object is at location x and a template is at

location y where Lx overlaps with Ly.

distorted object :
at location x :

I

I

y* ------------------ +

: template :

+ ------ : at location y :

i i

I I

I I

I I

I I

+ ------------------ +

+ ------------------------------------------ +
I
. 351-321-2113

Pr( N00(y)= no: H0) Pr( Nll(y)=nl: H0)

139

Let tx be the overlap between LY and Lx expressed as a

Y

number between 0 and 1. If x=y, t =1, which results in

XY

5 defined in Sec.4.2.l. If t =0, S = 1. For 0<t <1,

xx' xy xy xy
the number of (0-0) matches can be written as the sum of two

random variables

N00(y) = N00'(y) + N00"(y).

where N00'(y) counts matches in the region of overlap and
N00“(y) counts matches between the template and background.
Under our assumptions, N00'(y) has a B[txyn(l-q),l-p"]
distribution and N00”(y) has a B[(l-txy)n(l-q),l-p]
distribution, where
p” = a(l-p')+bp'

and p' is the probability that a template pixel is l. The
distribution of the sum can be approximated by a Poisson
distribution which involves three table look-ups and
factorial computations. We choose to approximate this

distribution by a B[n(l-q),wxy] distribution, where

w = (1-p)(l-txy)+(l-p")tx

3y y

From Sec.4.2.l, N00(y) has a B[n(l-q),l-a] distribution

under H0.

140

A similar analysis shows that the distribution of Nll(y)
under Hlx can be approximated by a B[nq,1-wxy] distribution

and Nll(y) has a B[nq,p] distribution under H0.

Under these conditions,

W n ( )-n ( )
- [---xx--_9--_] 0 Y 1 Y c

S
“Y ' (l-wxy)(l-p) XY

141

APPENDIX C

RESULTS OF t-TESTS FOR SEC. 4.3

We use PWD, PWJ, PWC, PWS and PWG to denote powers of

D J C Sx and Gx respectively. In the following table,

x' x’ x’
each box reports the result of testing the hypothesis that
Power(i)>Power(j) where i (row) and j (column) can refer to
any of the five statistics above. The first number of each
entry is the actual t value and the second number is

w where the critical level is w Positive t value

t t'
indicates the acceptance of the hypothesis with level wt; a
negative t value indicates the acceptance of the reverse
hypothesis. For example, Power(Dx)>Power(Jx) is accepted at

level 0.36; Power(Cx) < Power(Gx) is accepted at level 0.10.

PWJ PWC PWS PWG
+ ---------------------------------------------------- +
pwn : .367 (.36) .472 (.32) .501 (.31) -0.748 (.23) :
pwa : .092 (.46) .122 (.45) -1.146 (.13) :
ch : .031 (.49) -l.289 (.10) :
Pws : -1.321 (.09) 1
+ ---------------------------------------------------- +

This table indicates that Gx is more powerful than all
others and Dx is slightly more powerful than the other

three.

142

APPENDIX D

SUMMARY OF RESULTS FOR SEC. 4.3.1

The following table compares analytical approximations
of the powers of D and G. In each box, the first integer is
the number of combinations of (p,p",a,b) for which Power(D)
< Power(G); the second number is the number of cases for

which Power(D) > Power(G). There are 18 cases for each box.

image size / template size

16x16 / 8x8 32x32 / 12x12

subtemplate + --------------- + ............... 1
size = 19 : 10 , 4 : 10 , 4 :
25 l 12 . 2 l 12 , 2 :

33 : 15 , 2 : 8 , 4 g

39 l 11 . 5 : 11 , 6 g

+ --------------- + --------------- +

We consider all 144 pairs of power approximations as
random samples and perform a standard two sample t-test.
The resulting t value is 2.16648 and the hypothesis Power(D)

< Power(G) is accepted at significence level (l-O.975).

143

APPENDIX E
AN EXAMPLE OF THE PERMUTATION STATISTIC

IN MULTIVALUED NOMINAL VECTOR CASES

Consider nominal vectors V1 with 3 possible values and
V2 with 4 possible values. We defined "match" between the
values of vector components as follows. Let n be the vector
length. The letter ”x” means the corresponding values in

the two vetors are considered as "match”.

2
l 2 3 4
+ -------------------- +
l : x x
V 2 ' x
l 3 : x x

Let n be the vector length, ni* be the number of

components taking value i in V1, “*j be the number of

components taking value j in V2, and let xij be the number

of components that Vl takes value i and V2 takes value j.

The following table indicates the relation among them.

144

V2
1 2 3 4

+ -------------------- +
1 1 x13 x14 1 “1*
V1 2 : x22 : n2,
3 1 x31 x32 : “3*

+ -------------------- +

“*1 11,,2 n,.,3 11,,4 n

Let the random experiment is the permutation of V2, a
random variable M be the number of matches
between V1 and V2, and let m be the observed value of M in
the original vector pair, then a permutation statistic S can

be defined as

s = Pr(M 5 m) - Pr(M=m)U

where U is as defined in Chapter 1 and

LIST OF REFERENCES

145

LIST OF REFERENCES

M.R.Anderberg, Cluster Analysis for Applications,
Academic Press, New York, 1973.

 

P.Argentiero, R.Chin and P.Beaudet, "An automated
approach to the design of decision tree
classifiers", IEEE Trans. Pattern Analysis and
Machine Intelligence, Vol.PAMI-4, pp.51-57, Jan.
1982.

F.B.Baker and L.J.Hubert, "Measuring the power of
hierarchical cluster analysis,” J. of American
Statistical Association, Vol.70, pp.31-38,
Mar.1975.

R.R.Bahadur, ”A representation of the joint
distribution of response to n dichotomous items”.
Studies lg Item Analysis and Prediction, H.Soloman
(Ed.), Palo Alto, Cali.: Stanford Univ. Press,
pp.lS8-168, 1961.

D.I.Barnea and H.E.Silverman, "A class of
algorithms for fast digital image registration,”
IEEE Trans. Computers, Vol.C-21, No.2, pp.l79-l86,
1972.

P.L.Brockett, P.D.Haaland and A.Levine,
"Information theoretic analysis of questionnaire
data“. IEEE Trans. Information Theory, Vol.IT-27,
pp.438-445, 1981

T.M.Conover, Practical Nonparametric Statistics,

10.

11.

12.

13.

14.

15.

16.

146

John Wiley & Sons, 2nd ed., New York, 1980.

T.M.Cover, "The two best independent measurements
are not the best two”. IEEE Trans. Systems, Man and
Cybernetics, Vol. SMC-4, pp.116-117,1974.

T.M.Cover and J.M.VanCampenhout, ”On the possible
orderings in the measurement selection problem”,
IEEE Trans. System, Man and Cybernetics, Vol.
sue-7, pp.657-661, 1977.

Cray research, Cray-l Computer System Hardware
Reference Manual 2240004, 1977.

P.J.Diggle, ”On parameter estimate and
goodness-of-fit testing for spatial point
patterns,” Biometrics, Vol.35, pp.87-101, March,
1979.

R.Dubes and A.Jain, ”Clustering techniques: the
user's dilemma," Pattern -Recognition, Vol.8,
pp.247-260, 1976.

R.Dubes and A.Jain, ”Validity studies in clustering
methodologies," Pattern Recognition, Vol.11,
pp.235-254, 1979.

J.D.Elashoff, R.M.Elashoff and G.E.Go1dman, "On the
choice of variabless in classification problems
with dichotomous variables", Biometrika, Vol.54,
pp.668-670, 1967.

G.S.Fang, "A note on optimal selection of
independent observations", IEEE Trans. Systems, Man
and Cybernetics, Vol.SMC-6, pp.309-311, 1979.

R.V.Foutz, "A method for constructing exact tests
from statistic that have unknown null
distribution". Journal of Statistical Computation
and Simulation, Vol.10. pp.l87-l93, 1980.

17.

18.

19.

20.

21.

22.

23.

24.

25.

26.

147

E.G.Fowlkes and C.L.Mallows,"A method for comparing
two hierarchical clusterings,” J. of American
Statistical Association, Vol.78, pp.553-569,
Sept.l983.

E.B.Fowlkes and C.L. Mallows, "Rejoinder', J. of
American Statistical Association, pp.584, 1983.

J.M.Friedman and L.C. Rafsky, "Graph-theoretic
measures of multivariate association and
prediction", The Analysis of Statistics, Vol.11,
No.2, pp.377-391, 1983.

K.S.Fu and _T.Ichikawa, S ecial computer
architecture for pattern proceSS1ng. CRC Press,
Boca Raton, Fla., 1982.

G.V.Glass and Hakstian,A.R., ”Measures of
association in Comparative experiments: their
development and interpretation”. American

Educational Research Journal, pp.403-4l4, 1969.

D.W.Goodall, "A probabilistic similarity index”,
Nature, 1964.

D.W.Goodall, ”A new similarity index based on
probability", Biometrics, pp.882-907, 1966.

D.W.Goodall, ”Numerical taxonomy of bacteria - some
published data re-examined", Journal of General
Microbiology, Vol.42, pp.25-37, 1966.

L.A.Goodman and Kruskal,W.H., ”Measures of
association for cross classifications". J. of
American Statistical Association, pp.732-764, 1954.

L.A.Goodman and Kruskal,W.H., "Measures of
association for cross classifications. II: further
discussion and references", J. of American
Statistical Association, Vol.54, pp.123-163, 1959.

27.

28.

29.

30.

31.

32.

33.

34.

35.

148

L.A.Goodman and Kruskal,W.H., "Measures of
association for cross classifications. III:
approximate sample theory", J. of American
Statistical Association, Vol.58, pp.310-364, 1963.

L.A.Goodman and Kruskal,W.H., ”Measures of
association for cross classifications. IV:
Simplification of asymptotic variances", J. of
American Statistical Association, Vol.67,

pp.415-447, 1972.

A.Goshtasby, S.H.Gage, and J.F.Bartholic, "A
two-stage cross correlation approach to template
matching,“ IEEE Trans. Pattern Analysis and Machine
Intelligence, Vol.PAMI-G, No.3, pp.374-378, May
1984.

z.Hubalek, "Coefficients of association and
similarity, based on binary (presence, absence)
data - an evaluation", Biological Review, Vol.57,
pp.669-689, 1982.

L.Hubert and J.Schultz, "Quadratic assignment as a
general data analysis strategy”, British Journal
Math. Statist. Psychol., Vol.29, pp.l90-241, 1976.

L.Hubert and J.R.Levin, ”A general statistical
framework for assessing categorical clustering in
free recall", Psychological Bulletin, Vol.83, No.6,
pp.lO72-1080, 1976.

L.Hubert, "The comparison and fitting of given
classification schemes", Journal of Mathematical
Psychology, Vol.16, No.3, pp.233-253, 1977. '

L.Hubert, "Generalized proximity function
comparisons", British Journal Math. Statist.
Psychol., Vol.31, pp.l79-192, 1978.

L.Hubert, "Generalized concordance", Psychometrika,
Vol.44, No.2, pp.135-l42, 1979.

36.

37.

38.

39.

40.

41.

42.

43.

44.

45.

46.

149

L.Hubert and Subkoviak,M.J., "Confirmatory
Inference and Geometric Models”, Psychological
Bulletin, Vol 86,No.2,361-370, 1979.

L.Hubert and R.G.Golledge, "A heuristic method for
the comparison related structures", Journal of
mathematical psychology, Vol.23, pp.214-226, 1981.

G.E.Hughes,"0n the mean accuracy of statistical
pattern recognizers”, IEEE Trans. Information
Theory, Vol.IT-lS, pp.420-421, Jan. 1969.

K.Hwang, Computer Arithmetic, John Wiley & Sons,
New York, 1979.

A.K.Jain and R.C. Dubes, "Feature definition in
pattern recognition with small sample size”,
Pattern recognition, Vol.10, pp.85-97, 1978.

M.F.Janowitz, "Similarity measures on binary data",
Systematic Zoology, pp.342-359, 1980.

I.T.Jolliffe and B.J.T. Morgan, "Comment", J. of
American Statistical Association, pp.580-581, 1983.

M.D.Kelly, "Edge detection in pictures by computer
using planning," Matching Intelligence 6
(D. Michie,Ed), pp.379-409, Edinburgh Univ. Press,
Edinburgh, 1971.

O.Kempthorne, Design and analysis pf experiments ,
John Wiley & Sons, N.Y., 1952.

J.B.Kruskal, ”Multidimensional scaling and other
methods for discovering structure," Statistical
Methods for Digital Computers, pp.296-339, 1977.

A.Kulkarni and L.Kanal, "An optimization approach
to hierarchical classifier design", 3rd

47.

48.

49.

50.

51.

52.

53.

54.

55.

150

International. Joint Conf. on Pattern Recognition,
pp.459-466, 1976.

A.Kulkarni, "On the mean accuracy of hierarchical
classifiers", IEEE Trans. Computers, Vol.C-27,
No.8, pp.771-776, Aug. 1978.

A.Kulkarni, ”Admissible search strategaies for
paramtric and nonparametric hierarchical
classifiers", 4th International. Joint Conf. on
Pattern Recognition, pp.238-248, 1978.

M.Rurzynski, "The optimal strategy of tree
classifier", Pattern Recognition, Vol.16, No.1,
pp.81-87, 1983.

G.Landeweerd, T.Timmers, E.Gelsema, M.Bins and
M.Ralie, "Binary tree versus single level tree
classification of white blood cells", Pattern
Recognition, Vol.16, No.6, pp.57l-577, 1983.

Y.K.Lin and R.S.Fu, "Automatic classification of
cervical cells using a binary tree classifier”,
Pattern Recognition, Vol.16, No.1, pp.69-80, 1983.

R.F.Ling and J.W.Pratt, "The accuracy of Peizer
approximations to the hypergeometric distribution,
with comparisons to some other approximations," J.
of American Statistical Association, Vol.79,
No.385, pp.49-60, Mar.1984

N.Mantel, "The detection of disease clustering and
a generalized regression aproach," Cancer Research,
Vol.27, pp.209-220, Feb.l967.

N.Mantel and R.S.Va1and, “A technique of
nonparametric multivariate analysis", Biometrics,
Vol.26, pp.547-558, Sept.l970.

W.Meisel and D.Michalopoulos, "A partitioning

56.

57.

58.

59.

60.

61.

62.

63.

64.

151

algorithm with application and the optimization of
decision trees”, IEEE Trans. Computers, Vol.C-22,
No.1, Jan.l973.

G.W.Milligan, "A Monte Carlo study of thirty
internal criterion measure for cluster analysis,"
Psychometrika, Vol.46, pp.187-l99, June 1981.

H.Moravec and D.Gennery, Cart Project Progress
Report, Stanford Artif. Intell. Project, Internal
Memo. Oct. 1976.

A.N.Mucciardi and E.E.Gose, "A comparison of seven
techniques for choosing subsets of pattern
recognition properties", IEEE Trans. Comput.,
Vol.20, pp.1023-1031, 1971.

J.Mui and R.S.Fu, "Automated classification of
nucleated blood cells using a binary tree
classifier", IEEE Trans. Pattern Analysis and
Machine Intelligence, Vol.PAMI-Z, No.5, pp.429-443,
Sept.l980.

C.Munteanu, "Evaluation of the sequential
similarity detection algorithm applied to binary
images," Pattern Recognition, Vol.13, No.2,
pp.167-175, 1981.

R.N.Nagel and A.Rosenfeld, "Ordered search
techniques in template matching," Proc. IEEE,
Vol.60, pp.242-244, 1972.

J.W.V.Ness,"Comment”, J. of American Statistical
Association, pp.576-579, 1983.

L.M.Ni and A.K.Jain, ”A VLSI systolic architecture
for pattern clustering," to appear on IEEE Trans.
Pattern Analysis and Machine Intelligence.

L.M.Ni and K.Hwang, "Vector reduction techniques

65.

66.

67.

68.

69.

70.

71.

72.

73.

152

for arithmetic pipelines," to appear on IEEE Trans.
on Computers.

H.Payne and W.Meisel, "An algorithm for
constructing optimal binary decision trees", IEEE
Trans. Computers, Vol.C-26, No.9, pp.905-916,
Sept.1977.

M.R.Ramapriyan, “A multilevel approach to
sequential detection of pictorial features", IEEE
Trans. Computers, Vol.C-25, No.1, pp.66-78, 1976.

W.M.Rand, "Objective criteria for evaluation of
clustering methods," J. of American Statistical
Association, Vol.66, pp.846-850, 1971.

A.Rosenfeld, Picture processing py computer. New
York, Academic Press, 1969.

A.Rosenfeld and G.J.Vanderbrug, ”Coarse-fine
template matching", IEEE Trans. System, Man and
Cybernetics, Vol.SMC-7, No.2, pp.104-107, 1977.

E.Rounds, ”A combined nonparametric approach to
feature selection and binary decision tree design”,
Pattern Recognition, Vol.12, pp.313-3l7, 1980.

G.E.Sarndal, ”A comparative study of association
measures”, Psychometrika, Vol.39, pp.165-187, 1974.

J.Schuermann and W.Doster, ”A decision theoretic
approach to hierarchical classifier design",
Pattern Recognition, Vol.17, No.3, pp.359-369,
1984.

I.Sethi and B.Chatterjee, "Machine recognition of
constrained hand printed Devanagari", Pattern
Recognition, Vol.9, pp.69-75, 1977.

74.

75.

76.

77.

78.

79.

80.

81.

82.

153

I.Sethi and G.Sarvarayudu, "hierarchical classifier
design using mutual information", IEEE
Trans. Pattern Analysis and Machine Intelligence,
Vol.PAMI-4, No.4, pp.441-445, July 1982.

Q.Y.Shi and R.S.Fu, "A method for the design of
binary tree classifiers", Pattern Recognition,
Vol.16, No.6, pp.593-603, 1983.

J.Slagle and R.Lee, "Application of game tree
searching techniques to sequential pattern
recognition", Communication of ACM, Vol.14, No.2,
Feb.l97l.

 

P.E.A.Sneath and R.R.Sokal, Numerical taxonomy,
W.M.Freeman. SanFrancisco. 1973.

Special Issue on feature extraction and selection
in pattern recognition, IEEE Trans. Comput. Vol.20,
1971.

W.Sta11ings, "Approaches to Chinese character
recognition," Pattern Recognition, Vol.8, pp.87-98,
1976.

M.Svedlew, C.D.McGillem and P.E.Anuta,
"Experimental examination of similarity measures
and preprocessing methods used for image

registration," Symp. Machine Processing of Remotely
Sensed Data, 1976.

P.Swain and H.Hauska, "The decision tree
classifier: design and potential”, IEEE
Trans. Geoscience Electronics, Vol.CE-lS, No.3,
pp.l42-l47, July 1977.

S.L.Tanimoto, "Template matching in pyramids,"
Computer Graphics and Image Processing, Vol.16,
pp.356-369, 1981.

83.

84.

85.

86.

87.

88.

89.

90.

154

D.J.Vanderbrug and A.Rosenfeld, "Two-stage template
matching," IEEE Trans. Computers, Vol.C-26, no.4,
pp.384-393, 1977.

D.L.Wallace, "Comment”, J. of American Statistical
Association, pp.569-567, 1983.

Q.R.Wang and C.Y.Suen, "Analyis and design of a
decision tree based on entropy reduction and its
application to large character set recognition",
IEEE Trans. Pattern Analysis and Machine
Intelligence, Vol.PAMI-6, No.4, pp.406-4l7, 1984.

A.Whitney, "A direct method of nonparametric
measurement selection", IEEE Trans. Computers.
Vol.C-20, pp.1100-1103, 1971.

R.Y.Wong and E.L.Hall, "Scene matching with
invariant moments,” Computer Graphics and Image
Processing, Vol.8, pp.l6-24, 1978.

R.Y.Wong and E.L.Hall, "Sequential hierachical
scene matching," IEEE Trans. Computers, Vol.c-27,
No.4, April 1978.

R.Y.Wong and E.L.Hall, "Performance comparison of
scene matching techniques,” IEEE Trans. Pattern
Analysis and Machine Intelligence, Vol.PAMI-l,
No.3, July 1979. '

R.C.You and R.S.Fu, "An approach to the design of a
linear binary tree classifier", 1976 Machine
Processing of Remotely Sensed Data,
pp.3A-l - 3A-10.