ll

“WI

   

!

ﬁg (MINIMUM!!!HHINHWIUIIII}WWI!

(9007

 

LIBRARY
Michigan State
University

 

 

 

This is to certify that the
dissertation entitled

Automatic Image Annotation

presented by

Feng Kang

has been accepted towards fulﬁllment
of the requirements for the

Ph.D. degree in Computer Science

 

 

 

Major Professor’s Signature

M

Date

MSU is an affinnative—action, equal-opportunity employer

 

.o-u-u-n-o-o-noa--a—o—--------0---—----o-n--n-o-n-n-0-o-|-----._-—.--------u-o-I---—.-—._.---n-u-c--a--

PLACE IN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6/07 p:/CIRC/DateDue.indd-p.1

AUTOMATIC IMAGE ANN OTATION

By

Feng Kang

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

2007

ABSTRACT

AUTOMATIC IMAGE ANNOTATION

By

Feng Kang

Automatic image annotation is to annotate an image with a set of textual words. The
key in this process is to learn a statistical model which correlates the image features with the
annotation words. To construct the statistical model, we start with a set of training images,
each of which has a set of accompanying annotation words. Typically, images are ﬁrst
segmented into multiple homogeneous regions. Image features such as color and texture are
then extracted to represent each image region. The image regions in the whole collection
can also be grouped into clusters and thus each image region could be converted into its
corresponding cluster id, called a blob. In this way, we obtain a discrete representation of
the images. The correlation between annotation words and image features, either discrete or
continuous, is constructed with a statistical model. Finally, given a new test image, the same
set of image features are extracted, and words are predicated according to the relationship
between image features and annotation words established by the learned statistical model.

In this thesis, we explore the automatic image annotation task through a series of sta-
tistical models. One model based on the discrete feature representation is the translation
model, which constructs the correspondence between blobs and annotation words through
a set of translation probabilities. Due to the fact that common words co-occur with many
more blobs than rare words, the original translation model overestimates the common
words and degrades the overall performance. We thus propose two enhanced translation
models to improve the original translation model by incorporating different prior informa-
tion of the desired translation probabilities into the model. One prior ensures that each

word is associated with similar number of blobs, which is measured by the average of

the translation probabilities from different blobs to the word. Another prior considers the
translation model from two directions: forward translation model, which translates from
blobs to words; backward translation model, which translates from words to blobs. The
prior speciﬁes that the translation probabilities from these two kinds of models should be
consistent with each other. Our empirical results demonstrate the improved performance of
the two enhanced translation models over the original translation model.

However, there are still two problems with the translation models. First, they do not
consider the correlation between annotation words when making the prediction. Secondly,
they are based on the discrete representation, which potentially loses information encoded
in the continuous features. However, the correlation information is difﬁcult to explore
since the possible number of correlated words is exponential. We propose a Correlated
Label Propagation (CLP) framework to explore the correlation between annotation labels.
Based on the property of the submodular function, this framework could be solved by a
very efﬁcient greedy algorithm and thus be applicable to a large set of labels. In addition,
the continuous image features could be incorporated into the CLP framework. Our results
show that the CLP framework outperforms the translation models and also can boost the
performance to a higher level after the continuous features are incorporated.

In summary, this dissertation shows that 1) The performance of the original transla-
tion model can be improved by incorporating different priors; 2) Effectively exploring the
correlation information between labels can improve the overall performance; 3) Similarity
measurement is very important in label prOpagation and similarity measurement based on

the continuous image features can achieve better performance.

Copyright by
FENG KANG
2007

ACKNOWLEDGMENTS

There are a lot of people I would like to thank for their help in this writing of thesis.

Firstly, I would like to thank my advisor, Dr. Rong J in. This thesis is impossible without
his guidance and support. Not only does he guide the writing of the thesis, he also corrects
lots of parts in great detail. Also, thanks for his support and help during my Ph.D study so
that I could explore this interesting ﬁeld.

I would also like to thank the other committee members: Dr. Joyce Chai, Dr. Pangning
Tan, and Dr. Selin Aviyente. They provide lots of helpful suggestions and discussions.

I would also like to thank other fellow students, Y1 Liu, Liu Yang, Feilong Chen, Ming
Wu, Tong Wei, Shaoling Qu, Chen Zhang, Matthew Gerber, Zahar Prasov, Haibing Chen,

for their help during the Ph.D study.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction

2

3

1.1
1.2

1.3

Methods in Automatic Image Annotation ..................
Issues in Automatic Image Annotation ....................

1.2.1 Skewed Distribution of Annotation Words ..............
1.2.2 Correlation between Annotation Words ...............
Outline of the Thesis .............................

Images Preprocessing
Feature Representation ............................

2.1

2.2

2.1.1
2.1.2
2.1.3

Color .................................
Texture ................................
Shape .................................

Image Segmentation .............................

2.2.1
2.2.2
2.2.3
2.2.4
2.2.5

Image segmentation as clustering ..................
Image Segmentation as Fitting ....................
Probabilistic Methods for Image Segmentation ...........
Segmentation by Integrating Semantic Information .........
Evaluation ..............................

Statistical Models for Image Annotation
Image Annotation Based on Quantized Image Regions ...........

3.1

3.2

3.1.1
3.1.2
3.1.3
3.1.4

Translation Model for Automatic Image Annotation ........
Cross Media Relevance Model ....................
Latent Semantic Analysis for Image Annotation ...........
Probabilistic Latent Semantic Analysis for Image Annotation . . . .

Image Annotation Based on Continuous Features of Image Regions

3.2.1
3.2.2
3.2.3
3.2.4
3.2.5
3.2.6

Hierarchical Model for Image Annotation ..............
Latent Dirichlet Allocation for Image Annotation ..........
Continuous Relevance Model ....................
Classiﬁcation Method ........................
Bootstrapping and Co-training Approach for Image Annotation
Linguistic Indexing of Pictures by a Statistical Modeling Approach

vi

viii

ix

10
11
12
12
14
15
16
16

18
18
19
21
22
24
25
26
28
30
31
33
33

4 Enhanced Translation Model for Automatic Image Annotation 37

4.1 Regularized Translation Model ........................ 39
4.1.1 Framework for the Regularized Translation Model ......... 39
4.1.2 An EM Algorithm for the Regularized Translation Model ...... 41
4.2 Symmetric Translation Model ........................ 44
4.2.1 Discrepancy between Forward and Backward Translation Models . 44
4.2.2 Symmetric Translation Model through Regularization ....... 48

4.2.3 Derivation of the EM Algorithm for Regularization based S ymmet-
ric Translation Model ......................... 50
4.3 Experiments .................................. 52
4.3.1 Impact of Prior on the Regularized Translation Model ....... 55

4.3.2 Effects of Exploring Correlation between Forward and Backward
Translation Models .......................... 57
5 Image Annotation through Label Correlation 59
5.1 Problem Deﬁnition of Multi-label Learning ................. 61
5.2 Pairwise Label Correlation based on Multi-label Maximum Entropy Model 61
5.3 Correlated Label Propagation ......................... 63
5.3.1 Correlated Label Propagation for Multi-label Learning ....... 63
5.3.2 Efﬁcient Learning Algorithm ..................... 66
5.3.3 Selection of the Label Weights and the Label Kernel Function . . . 68

5.3.4 Choice of the Similarity Function K for the Correlated Label Prop-
agation Framework .......................... 70
5.4 Proof of Function f is a Submodular Function ................ 71
5.5 Experiments .................................. 73

5.5.1 Application of the Multi-label Maximum Entropy Model to Auto-
matic Image Annotation ....................... 74

5.5.2 Application of the Correlated Label Propagation Framework on
Discrete Features ........................... 76

5.5.3 Application of the Correlated Label Propagation Framework on
Continuous Features ......................... 78
5.5.4 Comparison of All the Proposed Models on Discrete Features . . . 79

5.5.5 A Further Exploration of the Conﬁdence Scores in the CLP Frame-
work ................................. 81
6 Conclusion and Future Work 85
6.1 Summary ................................... 85
6.2 Future Work .................................. 86
BIBLIOGRAPHY 88

vii

4.1

5.1
5.2

LIST OF TABLES

An example of forward and backward translation models for two image

blobs(i.e. b1 and b2) and two words(i.e.w1 and 102). ............. 45
Examples of label kernel functions used in experiments ........... 70
Average correlation between the common words and rare words. ...... 76

viii

LIST OF FIGURES

1.1 Procedure for automatic image annotation ..................

4.1 The distribution of term frequency for annotated words in a subset of
COREL Data .................................
4.2 Dirichlet distribution for different values of a ................
4.3 The joint probabilities p(w, b) for word ’clouds’ that are computed using
the forward and backward translation models. ................
4.4 Performance measurement for different a values of Regularized Transla-
tion Model ...................................
4.5 Performance measurement for different A values for Regularized Symmet-
ric Translation Model .............................

5.1 Algorithm for ﬁnding the optimal solution to Equation 5.9 .........
5.2 The curves for different label kernel functions ................
5.3 Performance for individual words based on the Translation Model(TM) and
the Maximum Entropy Model(MEM). The words are ordered by decreasing
frequency ....................................
5.4 The correlation among categories and their corresponding parameters of R
in Multilabel Maximum Entropy Model for the Dataset ............
5.5 Performance comparison of correlated label propagation framework based
on different label kernel functions. ......................
5.6 Performance measurement for the CLP framework with continuous fea-
tures on different annotation cut points ....................
5.7 Performance comparison of different models. TM: Translation Model.
RTM: Regularized Translation Model. RSTM: Regularized Symmetric
Translation Model. CLPEXP: the CLP framework with the Exponential
Function as the Label Kernel Function, RELEVANCEzRelevance Model . .
5.8 Performance comparison of different models for 30 most common words.
TM: Translation Model. RTM: Regularized Translation Model. RSTM:
Regularized Symmetric Translation Model. CLPEXP: the CLP framework
with exponential function as the label kernel function,RELEVANCE: Rel-
evance Model .................................
5.9 Performance comparison of different models for the remaining rare words.
TM: Translation Model. RTM: Regularized Translation Model. RSTM:
Regularized Symmetric Translation Model. CLPEXP: the CLP framework
with exponential function as the label kernel function. RELEVANCE: Rel-
evance Model .................................
5.10 Values for f function and their change ....................

ix

3

54

56

70

74

75

79

80

84

CHAPTER 1

Introduction

Image retrieval started as a text-based image retrieval system in the late 1970’s. By man-
ually annotating images with descriptive texts, existing text-based Database Management
Systems were utilized to perform image retrieval. However, image retrieval based on man-

ual annotations has two major drawbacks:

o It is expensive to hire large numbers of people to annotate large-sized image collec-

tions.

0 Given that human perception of image content is highly subjective, different people
might give different interpretations for the same pictures, depending on their knowl-

edge, experience, mood and other circumstances.

To overcome the problems of image retrieval based on manual annotation, people have
proposed content-based image retrieval, in which images are indexed and represented by
their visual features, such as color, texture, shape etc. Content-based image retrieval relies
on the low-level visual feature representation of images, thus avoiding the imprecise human
annotation. A comprehensive survey on this subject can be found in[42]. While content-
based image retrieval solves the problems related to text-based image retrieval, it introduces

several new problems:

0 While images could be retrieved based on their features such as color and texture, it is

usually more natural and desirable for users to search image databases using textual

queries. This is because textual queries allow users to express their information needs

at the semantic level instead of the level of preliminary image features.

0 Compared to content-based image retrieval, textual queries usually provide more
accurate descriptions of users’ information needs. For example, consider that a user
is looking for images of tigers. Suppose he uses an image query, which consists of
a photo of a tiger on the grass. Based on the match of image features, many of the
retrieved images will be pictures of the grass without any tigers. This is because it is
unclear to the system what the user is searching for, the grass or a tiger. In contrast,
textual queries such as ‘photos of tigers’ are able to convey the information need of

the user more clearly.

To search images based on textual queries, it’s preferable that the annotation words
for a large set of image collections can be automatically generated based on a model built
from a small number of manually annotated training images. To search for images, users
can simply pose textual queries and the relevant images are retrieved based on the matching
between the textual queries and the automatically generated annotations for images. Unlike
traditional text-based image retrieval, the automatic annotation by a computer program
avoids the expense and subjectivity of the manual annotation.

Figure 1.1 illustrates the basic steps for automatic image annotation: an image is ﬁrst
segmented into multiple homogeneous regions; then image features, such as color, texture
and shape, are extracted to represent each image region; ﬁnally, annotation words are pre-
dicted based on the visual features of segmented regions using a statistical model that is

learned from the training data.

1.1 Methods in Automatic Image Annotation

One important issue with automatic image annotation is how to represent image regions.

One type of approach represents every image region with the extracted continuous features,

       
 

Eagle,
sky.
Mountain

    

Ir

 

 

 

 

Segmentation
Machine
Learning
Extraction Model

 

 

 

 

I
l
I
|
1 Feature
I
l
I
I
I

 

__________1

Figure 1.1. Procedure for automatic image annotation

such as color, texture, and shape. Another type of approach is to represent the image regions
with the discrete features from a visual vocabulary. To obtain the visual vocabulary, we ﬁrst
cluster image regions of the whole collection into a small number of groups and represent
each region by its cluster id, i.e. which cluster the region belongs to. These cluster ids are
called blobs and they are the terms in the visual vocabulary. Then we can convert the image
regions in each image to its corresponding cluster id and obtain the discrete representation.
Compared to the representation using extracted features, the quantized representation can
dramatically simplify the computation in automatic image annotation, but at the price of
potentially losing information that is encoded in extracted continuous features.

The key to automatic image annotation is to learn annotation models that can automat-
ically predict annotation words given extracted image features. Most automatic annotation
methods applied machine learning techniques to ﬁrst learn the correlation between im-
age features and textual words from the examples of annotated images and then apply the
learned correlation to predict words for unseen images. Based on what kinds of image rep-
resentations the models are using, we group the annotation methods into two categories:

the models based on discrete features and the models based on continuous features.

The models based on discrete features include: the machine translation model[41],
the co-occurrence model[39], latent space approaches[37, 38],and relevance language
models[24]. The co—occurrence model[39] collects the co-occurrence counts between
words and image features and uses them to predict annotated words for images. Duygulu
et al.[41] improved the co—occurrence model by utilizing machine translation models. An-
other way of capturing co-occurrence information is to introduce latent variables that link
image features with words. Methods in this category include latent semantic analysis
(LSA)[37] and probabilistic latent semantic analysis (PLSA)[38]

The models based on continuous features include: classiﬁcation approaches[9], the
continuous relevance language model[24], the hierarchical aspect model[28], the Gaussian
Mixture Model (GMM), the Latent Dirichlet Allocator (LDA), and the correspondence
LDA[5]. The classiﬁcation approaches for automatic image annotation treat each annotated
word as an independent class and create a different image classiﬁcation model for every
word. Work such as linguistic indexing of pictures[33], image annotation using SVM and
Bayes point machine[9] fall into this category. Continuous relevance language models[32]
is an application of relevance model[3l] to the continuous features. Hierarchical aspect
model[28], Gaussian Mixture Model (GMM), Latent Dirichlet Allocator (LDA), and cor-
respondence LDA[S] are methods that introduce the correspondence between annotation

words through latent variables. We will discuss these methods further in Chapter 3.

1.2 Issues in Automatic Image Annotation

In this section, we discuss some common issues in the image annotation task.

1.2.1 Skewed Distribution of Annotation Words

When we annotate the images, we simply put the annotation words as a description of the
images and do not identify the corresponding image regions. Thus the computer is not

sure which image region corresponds to which speciﬁc word. In addition, some of the

common scenes appear frequently in many images such as ‘water’, ‘sky’ while the major
subjects may differ across images. As a result, the annotation words follow a very skewed
distribution. A few words happen very frequently and most of the words happen rarely.
This is particulary a problem for the translation model, since the translation probability is
based on the co-occurrence statistics between annotation words and blobs. The translation
probabilities for the common words are thus much higher and the rare words have fewer
opportunities to appear in the annotation set. We thus propose two enhanced translation

models to alleviate this problem.

0 Regularized Translation Model. One reason that the common words are overesti-
mated is because the common words are associated with more blobs than rare words.
One strategy to alleviate this problem is to force each word to be associated with
a similar number of blobs. The number of blobs associated with each word can be
measured by the average of the translation probabilities to the word. This measure-
ment is treated as a prior to be incorporated into the original translation model to

obtain better set of translation probabilities.

o Symmetric Regularized Translation Model. Typically, the translation model trans-
lates blobs to words. This model assumes that each of the blobs can be translated
into one of the words. If the common words have higher probabilities associated
with the blobs, rare words will have a smaller probability of appearing in the anno-
tation set. Another view of the translation model is to translate from words to blobs,
which assumes that each of the words could be translated into one of the blobs. In
this way, the rare words could be associated with blobs with higher probability. Our

prior is to enforce the consistency of these two sets of translation probabilities.

1.2.2 Correlation between Annotation Words

We know that many of the annotation words do not exist alone. They are correlated with

each other. For example, the word ‘grass’ may appear more often with the word ‘tiger’ than

with the word ‘ship’. A ‘whale’ is usually related to the word ‘water’ but may never occur
with ‘grass’. How to effectively explore the correlation between annotation words is very
challenging given the large size of annotation labels in our task. While this problem could
be explicitly handled by specifying which set of labels are correlated together, it becomes
a combinatorial problem and is difﬁcult to scale to the large number of annotation words.

We investigate two approaches to this problem.

0 Multi-label Maximum Entropy Model. This model takes the pairwise label con-
straints into consideration by specifying which pair of labels are correlated together
and how strong the correlation is. However, due to computational issues, this model

cannot be applied to our annotation task.

0 Correlated Label Propagation(CLP) Framework. We propose a correlated label
propagation framework to take into account the correlation information. Label prop-
agation is not only based on the similarity between test images and training images.
We also consider correlation between labels. We show that this framework has an
optimal solution based on the properties of the submodular function. In addition, this
solution can be obtained through an efﬁcient greedy algorithm and thus be able to
scale to a large set of labels. Furthermore, this framework can incorporate continu-
ous features, which are missing in models based on the discrete features, such as the

translation model.

1.3 Outline of the Thesis

In Chapter 2, we will provide some background information about image preprocessing,
which mainly focuses on the feature representations and image segmentation. In Chapter
3, we discuss the two categories of statistical models built for automatic image annotation
based on whether they employ discrete or continuous representation of the images. Chap-

ter‘4 focuses on the the translation models. We will discuss in detail its over-estimated

common words problem. We then propose two enhanced translation models to alleviate
this problem and demonstrate the improved performance. Chapter 5 presents the methods
that consider correlations between labels. The Correlated Label Propagation(CLP) frame-
work is proposed and demonstrates the improved performance over discrete features. We
also show that the CLP framework can incorporate continuous features to achieve better

performance than the discrete features. Chapter 6 summarizes this thesis.

CHAPTER 2

Images Preprocessing

2.1 Feature Representation

Visual features of images can be classiﬁed into two categoriesl42]: the general image fea-
tures and the domain-speciﬁc features. The general image features include color, texture,
shape etc. The domain-speciﬁc features are application dependent. For example, to detect
and recognize human faces, special image features are required to describe the characteris-
tics of human faces. In this survey, we focus on the general image features, particularly the

features that are critical to automatic image annotation[4l].

2.1.1 Color

Color is one of the most widely used visual features in image retrieval. Compared to other
image features, color is relatively robust to the background complication and independent
of image size and orientation[42]. It is typically represented in a color space. Each color
space involves two basic components: the primaries and the matching function. The pri-
maries of a color space describe the basic and independent components of colors. The
matching function of a color space determines the weight of the primaries in matching a
source color. Suppose the primaries arePl, P2, P3 and the matching functions are f1, f2, f3.
'Ihen any color could be expressed as: f1P1 + ngg + f3P3. One typical example of color

space is RGB Color Space, in which the single wavelength primaries are used (645.16 nm

for R, 526.32 nm for G and 444.44 nm for B). Another popular color space is CIE L*u*v,
which is a more uniform space than RGB and obtained by a projective transformation of

CIE XYZ [16], which yields the CIE u, v space:

i 10- 4X 9’” (21)
u’ _ X+15Y+3Z’X+15Y+3Z '

 

Color information of an image is usually represented by its statistics. There are two

commonly used statistics for representing color information:
0 Color histogram.

In image retrieval, the color histogram is the most commonly used color feature
representation[42]. It is easily computed and in many cases very effective for image
retrieval. However, one disadvantage of the color histogram is that it is usually a
sparse representation with zero values for most of its entries. As a result, it can be

sensitive to small distortions to images when applied to image retrieval[42].

0 Color layout.

The color histogram is usually viewed as a global feature given that it is computed
based on the color distribution of the entire image. Despite that it has shown a certain
degree of discriminative power in image retrieval, one disadvantage of any global
feature is that it is unable to provide an accurate description of details in images
and thus tends to give many false positives when the size of the image collections
is large. In contrast, color layout methods take into account the local distribution of
color features. One example of color a layout method divides images into multiple
blocks and extracts image features for each block. In[34, 35], the author proposed an
approach, named ‘single blob with neighbors’ or SBN, in which an image is divided
into multiple sub-images and each sub-image is represented by the mean values of
RGB components of the sub-image itself and its four neighboring blobs (i.e., up,
down, left, right). This approach can be further improved by organizing sub-images

into more complicated structures such as the quad-tree.

2.1.2 Texture

According to[42], textures are the homogeneous visual patterns that do not result from the
presence of only a single color or intensity. Some of the commonly used texture represen-

tations include:

o Co—occurrence matrix.

This method represents image texture using the spatial correlation of gray levels of
different pixels [43]. It ﬁrst constructs a co-occurrence matrix based on the orienta-
tion and distance between image pixels and then extracts meaningful statistics from

the co-occurrence matrix as the texture representation.

0 Tamura psychological perspectives

Tamura explores the texture representation from psychological perspectives[l6]. It
identiﬁes six visual properties that are related to textures: coarseness, contrast, di-
rectionality, line-likeness, regularity, and roughness. Tamura representations of tex-
ture differ from the co—occurrence representation in that their texture properties are
visually meaningful whereas many of the texture properties extracted from the co-

occurrence matrix representation are not.

0 Spectral transform

This method applies the wavelet transform or the Fourier transform to obtain texture
representations in the frequency domain. One disadvantage of the Fourier transform
is that the computation of the transform requires the pixel information from the entire
image. As a result, the Fourier transform is unable to capture the local structure of
images, which is important in representing images. To overcome this problem, one
strategy is to use the Gabor ﬁlters[18, 47], in which a Gaussian kernel is put on top
of the Fourier transform so that the transform is performed within a neighborhood

determined by the standard deviation of the Gaussian kernel. Another strategy is to

10

use the wavelet transform. In order to capture the local structures of images, the
wavelet transform uses ‘wavelet’ as its basis functions, which are usually generated

from a mother function by translation and contraction.

2.1.3 Shape

Shape is a very important visual feature in human perception. A good shape representation

should be invariant to translation, rotation and scaling. Most shape representations can be

classiﬁed into two categories: region based methods, and boundary based methods.
Boundary based methods represent the shape of a region by its outer boundary. They

can be further classiﬁed into three sub-categories[50]:

0 Global shape descriptors include area, circularity, eccentricity, and axis orientation.

These shape features can only distinguish shapes with large dissimilarity.

0 Shape signatures utilize the local feature of a shape, including the complex coor-
dinates, the curvature and the angular features. They are usually sensitive to noise
and therefore are not robust. In addition, they require intensive computation during

similarity calculation, due to the hard normalization of rotation invariance.

0 Spectral descriptors apply spectral transformation to the shape signatures to acquire
the shape representation in the frequency domain. The most successful shape rep-
resentation is the Fourier Descriptor[42], which applies the Fourier transform to the

shape signatures.

In the region-based representations, all pixels within a shape region are taken into ac-
count to obtain the shape representation. The most successful representation is the moment

descriptor [42], which is invariant to both translation and scaling.

ll

2.2 Image Segmentation

The goal of image segmentation is to group pixels of similar properties into clusters. Ac-

cording to [16], methods for image segmentation can be classiﬁed into three categories:
0 Image segmentation based on clustering.
0 Image segmentation based on ﬁtting.
0 Image segmentation based on probabilistic methods.

In addition, recently there have been a number of studies on image segmentation at the

semantic level[2], which will also be reviewed below.

2.2.1 Image segmentation as clustering

Image segmentation can be viewed as a clustering problem in that pixels of similar visual
properties are clustered into a small number of groups. For any clustering algorithm, there
are two key components:
0 Distance measurement between any pair of data points or clusters.
Commonly used distance measurements are:
l. The single link, i.e., the shortest distance between any pair of data points in two
clusters.

2. The complete link, i.e., the maximum distance between any two points in two

clusters.

3. The group average link, i.e., the average distance between any pair of data

points in the two cluster.

0 The number of clusters. Usually, this is a very difﬁcult problem to handle.

Dendrogram[l6], a hierarchical representation structure of clusters, could be used

12

to see whether the clusters are good or not and help user make choice of clusters.
Other commonly used approaches include the information criterion (e.g., AIC and

BIC)[19].
The commonly used clustering algorithms for image segmentation are:

0 Image segmentation by the K-means clustering algorithm

The K-means algorithm[16] is one of the most commonly used clustering algorithms.
It minimizes the within-cluster distance, which is calculated as the sum of distances
of each data point in a cluster to its center. The cluster memberships of each data

point are calculated through an iteration of the following two steps[l6]:

1. Given the current estimation of cluster centers, each data point is assigned to

the cluster whose center is closest to the location of the data point.

2. Given the cluster memberships of data points, the new center of a cluster is re

computed as the average of the data points assigned to the cluster.

0 Image segmentation by graph—theoretic clustering

A clustering problem can be viewed as a problem of graph partitioning[ 16], in which
each data point corresponds to a vertex in a graph, and the weight for each edge
that connects two vertices is equivalent to the similarity of the corresponding two
data points. Identifying clusters of similar data points is equal to dividing the corre-
sponding graph into multiple disjoint sets with only edges of small weights removed.
Central to the graph partitioning approaches is the similarity measure for any pair of
data points. Different image features could be used to deﬁne the similarity measure-

ment, such as the similarity in intensity, color, texture and motion.

One of the popular algorithms in this category is the Normalized Cut[44, 45]. It
divides a graph into two disjoint sub-graphs such that the ratio of the graph cut to

the total afﬁnity (i.e., similarity) within each sub-graph is minimized. In particular,

13

to divide a weighted graph V into the two disjoint subsets A and B, based on the

Normalized Cut algorithm, it is formulated into the following optimization problem:

, cut(A, B) cut(A, B)
arg £14113 m + W (2.2)
cut(A, B) measures the similarity between the components A and B, which is de-
ﬁned as the sum of weights of all the edges in V that have one end in component A
and the other end in component B. asso(A, V) and asso(B, V) measure the simi-
larity of data points within component A and B. They are the sum of weights of all
edges that have both ends in A and B, respectively. Given that the above problem is

a combinatorial optimization problem and is NP-hard, usually approximate solutions

are provided[44, 45].

2.2.2 Image Segmentation as Fitting

The goal of ﬁtting is to determine possible structures observed in an image. An image can
be viewed as a set of tokens, which can be pixels or edge points. The ﬁtting-based image
segmentation approaches group these tokens together to form regular shapes such as lines

and circles. There are two major approaches in this category:

0 Hough transform

The Hough transform clusters pixels together based on their underlying
structures[l6]. It ﬁrst identiﬁes and stores all possible structures for each pixel, and
then searches for structures that are commonly shared by many pixels. The Hough
transform has been used to identify lines, curves, surfaces etc. It is advantageous in
that it does not require computing analytical solutions of certain equations. How-
ever, the Hough transform is usually sensitive to noise[l6], which can lead to the

identiﬁcation of phantoms.

0 Curve ﬁtting based on generative models.

14

A ﬁtting approach based on a generative model assumes a probabilistic model that
describes the probabilistic relationship between the observed pixels and the under-
lying curves. Usually, pixels are assumed to be generated from certain curves such
as lines under Gaussian noises. Parameters governing the underlying curves are esti-

mated through certain criteria, such as the least square criterion[ 16].

2.2.3 Probabilistic Methods for Image Segmentation

The Expectation Maximization (EM) algorithm[36] has been demonstrated to be an effec-
tive approach for missing data problems. Image segmentation can be viewed as a missing
data problem, in which each image segment is assumed to be generated from a mixtures
of probabilistic models and the missing information is the description of each probabilistic
model. The EM algorithm computes the segmentation in the alternation of the E-step and

the M-step:
o E-step: the EM algorithm estimates the segment membership for each pixel;

o M-step: the optimal parameters of mixture model are estimated based on the segment

memberships that are estimated in the E-step.

These two steps are iterated until the EM algorithm converges to its local optimum.

In addition to image segmentation, the EM algorithm can also be applied to ﬁt lines.
Similar to applying the EM algorithm to image segmentation, in the E-step, we estimate
the likelihood for each pixel to be in a line. Usually, this likelihood is proportional to its
distance to the line and is cut by a threshold value. In the M-step, a maximum likelihood
approach is used to re-estimate the parameters of the line. To determine when the iterative
procedure should be stopped, we need to test convergence of the algorithm, which often is
based on the change of size of the line and also the sum of distances from the points to the

line [16].

15

Despite of the success of the EM algorithm for image segmentation, a general problem
with the EM algorithm is its local minimum, which means that the choice of starting point
will have a great impact on the quality of ﬁnal solutions. One solution is to ﬁrst apply the
Hough transform to obtain the initial solution for the EM algorithm, or we can start the EM

algorithm with different initial solutions, and search for the best solution[16].

2.2.4 Segmentation by Integrating Semantic Information

According to previous studies [2], image segmentations based on low-level features usually
will not result in desirable object recognition. This is because the same object may exhibit
very different distributions of image features in different images or even in the same image.
For example, a penguin has white and black halves, and it is hard to acquire meaningful
segmentation results just based on the low-level features. In [2], a solution is proposed
by associating each segmentation region with a meaningful word. Then, two neighboring
regions are merged when they have been assigned the same annotation words. By doing

this, we are able to avoid creating too many image segments.

2.2.5 Evaluation

We have discussed different types of image features and image segmentation methods. One
important yet unresolved issue is how to measure the ‘goodness’ of feature extraction and
image segmentation. That is, the user wants to know how good a set of image features
are or how good the segmentation is. Evaluation based on human judgment can only be
done on small scales. For a large image collection, automatic mechanisms are required to
measure the quality of image feature extraction and segmentation.

One automatic approach evaluates the performance of different approaches based on
the annotations of images. Word annotations for images are automatically generated based
on the selected image features and segmentation methods. These automatically generated

annotations are then compared against the ground truth and the resulting annotation accu-

l6

racy is used as a proxy to the performance of image feature extraction and segmentation.
Based on this method, [2] compares the importance of different image features. The empir-
ical study showed that the L*a*b color space is more effective than the RGB color space.
For image segmentation algorithms, it showed that the Normalized Cut algorithm performs
signiﬁcantly better than the Blobworld segmenter. The mean-shift algorithm[ll, 12] for
image segmentation is somewhere between these two algorithms [2].

[14] compared commonly used image features including size, position, color and tex-
ture with a complicated set of features from visual content descriptions in MPEG-7. Sur-
prisingly, the empirical study showed that the commonly used image features achieve better
performance than the complicated ones.

Overall, previous studies show that no single set of image features can perform well for
all different image collections and thus it is likely that the choice of image features depends

on the characteristics of speciﬁc applications and datasets.

l7

CHAPTER 3

Statistical Models for Image Annotation

Image annotation describes the content of images with a set of textual words. This set
of annotation words can be obtained by manual annotation from humans. However, man-
ual annotation is expensive and requires extensive labor work. Also, human annotation is
very subjective. Even for the same picture, different people might use different annotation
words. Thus automatic image annotation is preferred. In this chapter, we discuss statistical
models for automatic image annotation. This type of annotation is divided into two cate-
gories. The ﬁrst one annotates the images based on quantized image regions. The second

one annotates the images based on the raw image features.

3.1 Image Annotation Based on Quantized Image Regions

One type of approaches toward automatic image annotation ﬁrst clusters image regions
into a small number of groups, which are called ‘blobs’ in [41]. The approaches then learn
the correlation between annotated words and blobs. Automatic clustering methods, such
as the K-means algorithm, can be applied to cluster image regions and get their cluster
ids. One advantage of this discrete representation is that it simpliﬁes the image represen-
tation dramatically, thus signiﬁcantly reducing the computational cost of automatic image
annotation.

Before discussing formal models for automatic image annotation based on the discrete

18

representation, we will ﬁrst formalize this problem. Let the collection of annotated images
be denoted by T, and the size of the collection be denoted by |T|. Each annotated image
J,- E J is divided into multiple regions. All the image regions are clustered into a small
number of blobs b231, big, ..., him. Then each image can be represented by a vector of
blobs and annotation words, i.e., J,- = {62117:} = {bi,1,b,;,2, ...,bi,m;wi,1,wi,2, ...,wian}.
Here, m and n are the number of blobs and annotation words, respectively; bid is the
number of j-th blob that appears in the i-th image; rum- is a binary variable, which is 1
when the j-th word appears in the i-th image and zero otherwise. The key to automatic
image annotation is to learn the statistical correlation between the blob representation and
the word representation of images.

To construct this kind of models, the ﬁrst task is to quantize the image regions. One
method is to perform a clustering procedure on the segmented image regions. The resulting
clusters are the quantized image regions and their cluster ids are treated as the discrete
representations of the images. That is, if one image region in the image belongs to one
cluster, the image region is converted to its cluster id. The translation model for automatic
image annotation[4l] and the cross media relevance model[24] use this kind of quantized
representation. Another way to quantize the image regions is to construct discrete features
directly, such as the color histogram. This kind of representation is used in models such as

Latent Semantic Analysis and Probabilistic Latent Semantic Analysis[37, 38].

3.1.1 Translation Model for Automatic Image Annotation

The translation model was originally developed for language translation[7], e.g. translating
from French text to their English equivalent. [41] views the process of annotating images
as a process of translating information from a ‘visual language’ to textual words. The
lexicon of the visual language is the blobs. Compared to other models for automatic image
annotation, such as the relevance models[24], statistical machine translation models for

automatic image annotation have the advantage that words are annotated to each image

19

region instead of to the whole image.
Based on the IBM model 1 of translation [41], given an annotated image J,- =
_)
{bin—17;} = {bi‘1,bi,2, ...,bi,m;wi,1,w,,2, ---”wi,n}, the probability of annotating image
._)
blobs bi71,b.i,2,...,b.i,m with words ll’z',1,wz',2,---,?Ui,m i.e.,p(ﬁT;-, b; ), can be expressed as
follows:
_’ n _) T1. 771
plan.) = I] point») = I] Z tjibn. (3.1)
where tjjk stands for the probability of translating the k-th blob into the j—th word and
is subject to the constraint :31ch = 1, namely each blob has to be translated into one
of the annotated words. The key to a translation model for image annotation is the set

of translation probabilities tJ-JC. These probabilities can be obtained by maximizing the

likelihood of annotated training images T, i.e.:

lTl _, ITI n m
1(T) = pri’ilbil = H H 2 tj,kbi,k (32)
1'21 i=1j=l k=1

The Expectation-Maximization (EM) algorithm[41] is applied to ﬁnd the optimal solu-
tion for Equation3.2. It iteratively updates the translation probabilities using the following

equation:
,, , . old
0. ’ ld
Z k El (’14th

i

(3.3)

where tglg and t???" are the translation probabilities learned in the previous and current

iteration, respectively. Z k is a normalization factor that ensures 2 j t j,k = 1. According to
Equation 3.3, a common word may have large translation probabilities for many different
blobs since it appears in many different annotations and therefore its term frequency wij
is non-zero for a large number of annotated examples. This could lead to overestimated
translation probabilities for common words. In Chapter 4, we will discuss some methods
to alleviate this problem, including a regularized translation model [26] and a symmetric

translation model [25].

20

3.1.2 Cross Media Relevance Model

The relevance language model was originally designed for ad-hoc information retrieval[31]
and cross-language information retrieval[30] to determine the appropriate language model,
namely the probability of observing a word w in the documents relevant to a speciﬁc in-
formation need p(w|R). The key in estimating the query language model is to determine
which documents are likely to be relevant to a given query. Assuming query words are
sampled independently from a multinomial distribution, the probability for a word to be

relevant to an information need could be approximated as:
k

P(ur)¢::|R ZP(M)P(w|M)HP(q,-(A»1) (3.4)
i=1

Here M is one of the document language models that are estimated from a collection of
documents, to is a query word, and R stands for the relevance language model for a given
query.

The relevance language model can be extended to image annotation and image retrieval
tasks, called ‘Cross Media Relevance Model’, or CMRM for short. Each annotated image
in the training set is assumed to be a candidate model M in Equation 3.4. Given an image
I = b1,b2, ..., bn to be annotated, following the relevance language model, probability

p(wk 2 III) can be estimated as:
ITI ITI
Mun. =1|J)O<p(w 2191011 J ~l—TIZPI wk=1|Jz)ﬁ P(blei)
=1 (3.5)
where p(J ) is set to be a uniform distribution. Both p(w;c =1|Jz~) and p(bj|J,-) are as-
sumed to be multinomial distributions and are computed using the Jelinek-Mercer smooth-
ing approach[ 10].

pm Z 1) = (1 w _)_,__-_#f“‘.}i:) +0 ”with (3 6)
#(b I #(h- T °
p<bjIJ.>=(1—m)—L—.JI+#—~L—r3.z m ’

#(wk, J2) is the frequency of word wk in the annotated image J5, and #(wk, T) is the

number of words in the collection. #(bi, .12) is the frequency of blob b,- in the annotated

21

image J;, and #(b;, T) is the number of blob b,- in the collection. OJ and ,BJ are smoothing
parameters and are determined by the held out data set. The essential idea of the relevance
language model for automatic image annotation is to propagate the words for the annotated
images to a test image based on their similarity to the training images.

Another usage of the relevance language model is to apply it to image retrieval for
textual queries. To bridge the gap between textual queries and image features, the above
relevance language model will be ﬁrst applied to generate a ‘visual’ language model for
blobs based on the set of annotated images and the query words. Then, the estimated
‘visual’ language model is used to determine the relevance of images in the collection.
More detailed description can be found in [24].

One obvious difference between the relevance language model and the translation
model is that the translation model assigns annotation words to different image regions
while the relevance language model only acquires annotation words for the entire im-
ages. Thus, unlike the translation model for automatic image annotation, which requires
appropriate alignment between image regions and annotation words, the relevance lan-
guage model avoids explicitly modeling the correlation between image blobs and annota-

tion words, which makes it easy to implement and robust in practice.

3.1.3 Latent Semantic Analysis for Image Annotation

Latent Semantic Analysis(LSA) originates from textual retrieval and document
analysis[l3]. It maps a high dimensional representation of a document, which often is
the term frequency vector of the document, into a low dimensional space, which is also
called the latent semantic space[13]. On one hand, latent semantic analysis can be viewed
as a dimension reduction method, and therefore is effective for alleviating the sparse data
problem that commonly exists in text-related applications. On the other hand, latent se-
mantic analysis is able to represent the document information beyond the lexical level. By

aggregating related words into concepts based on word co-occurrence patterns, LSA is able

22

to capture certain semantic correlations among words such as synonyms.

LSA is based on the Singular Value Decomposition(SVD). Suppose X is term docu-
ment matrix of dimensionality t x d, where t is the number of documents and dis the size
of vocabulary. Xio' is the frequency of word j in the document 2'. X can be decomposed
into three matrices using SVD as

X=%%% on

Here To and Do are orthonormal matrices and So is a diagonal matrix. If we keep the
ﬁrst k largest singular values in So and set the others to be 0, we obtain another matrix
X = TOSODE) that minimizes the quantity IX — X I2, where X is a matrix of rank k. This
operation also removes the corresponding columns of T0 and D0 respectively and get the
matrices T and S. The columns in the matrices T and S form the base vectors for the latent
space of documents and terms respectively. We can then map each document and each term
into this space and compute their similarities.

There is a strong analogy between textual documents and images. In textual documents,
the word sense ambiguity exits in two forms: the Polysemy, where a word corresponds to
multiple different meanings, and Synonym, where multiple words correspond to the same
meaning. Similarly, in image domain, the same distribution of image features can be related
to different objects under different contexts. Meanwhile, the same object can show different
visual properties that lead to different distributions in the space of image features. Because
of this analogy, the latent semantic analysis, which has demonstrated to be a powerful tool
in document analysis, can be applied to automatic image annotation[37]. In [37], color
histograms are used to represent images. Each entry in the histogram of every segment
is viewed as a different ‘visual’ word, totally there are 648 different ‘visual’ terms in the
‘visual’ vocabulary. Then, the LSA is applied to reduce the representation of images to the
latent space, which has fewer dimensions. The annotation of test images is computed as
the average annotations of training images that are weighted by their similarity to the test

image in latent space.

23

3.1.4 Probabilistic Latent Semantic Analysis for Image Annotation

As pointed out in [22], LSA has difﬁculties in capturing the concept of polysemy, which
refers to the case when a word has multiple senses. The author proposed the Probabilistic
Latent Semantic Analysis (PLSA) for co-occurrence data, which introduces unobservable
variables, i.e., latent topics or latent classes, to capture the correlation among words based
on co—occurrence statistics between documents and words [22].

Let p(d,-) denote the prior probability for document (1;, p(zk|dz-) denote the probability
for document d,- to be in latent class 21:, p(wj Izk) denote the probability of observing word

wj in latent class zk. Then, the joint probability p(d,-, wj) can be written as:

P(dt,'wj)= P(dii)p(ijd)
P(ijdz ) — ZkK_1p(wj|2k)p(2kldi)

Using the equation in (3.8), given a document collection d1, d2, ..., dN, where N is the

(3.8)

number of documents, its log-likelihood could be written as:
L = 29,1 23-11111 (dz-,wj) logp(di,wj)
n(dz,w
—-Z.-_1n() [10mm + 2:“: W10 g2._ 1( plwglzm and.»

,where n(d,-, wj) rs the term frequency of word wj in document di. n(d )= 23' n(d;—, wj)

(3.9)

is the length of document d;. The EM algorithm is used to learn the values for parameters
pend.) and ptwjlzn r221.

[38] extends the application of PLSA from textual collection to automatic image an-
notation. Given an image q, the probability to annotate the image with word wj could be

written as:
WWW): ZPI ijZ klp FIZkICI) (3.10)

[38] also considers that each word encodesl more information than image features, and so
words should play more important role in deciding the structure of the latent space than
image features. Based on this intuition, a variation of PLSA model, named PLSA-word,
is proposed, in which words and image feature are trained separately in the latent space.

More speciﬁcally, the learning procedure is divided into two steps:

24

1. Based on Equation (3.9), a PLSA model is trained only on the annotation word sets
to obtain the probability distribution of latent class 2 given an annotated image d,

i.e., P(z|d), and the probability distribution of words to given the latent class 2, i.e.,

P(w[z).

2. Fix P (zld), train another PLSA model to obtain the probability distribution of visual

feature 2) given latent class 2, i.e., P(v|z).
To annotate a test image, the following two steps are performed.

1. Based on the visual features v of the test image and the distribution P(v|z), the

likelihood for the test image to be in latent class 2 is computed as P(z|d).

2. Using the word distribution of each latent class, i.e., P(w|z), the probability of an-

notating the test image with a word to is computed as:

P(IUId) = ZMWIZUPI/Z'kld) (3-11)
k

3.2 Image Annotation Based on Continuous Features of Image Re-
gions

Previously, we reviewed statistical models for image annotation based on the discrete rep-
resentation of image regions. By grouping different image segments into a small number of
clusters, each image segment is mapped to a discrete id. This discretization procedure sim-
pliﬁes the image representation and hence reduces the computational cost. However, one
disadvantage of the discretization procedure is that it loses valuable information encoded
in the image features. There have been many studies on automatic image annotation that

directly uses the raw image features extracted from image regions.

25

3.2.1 Hierarchical Model for Image Annotation

The semantic meaning of different words could differ signiﬁcantly in their generality. For
example, the word ‘animal’ is more general than the word for a speciﬁc animal such as
‘tiger’. To account for different generality of different words, we can organize words into
a hierarchical structure such that general words are put on the top of the structure and the
words with speciﬁc meaning are put close to the leave nodes of the structure. For example,
word ‘animal’ will be put as one of the parent nodes for the word ‘tiger’. In document
analysis, there have been a few studies on the hierarchical models[21, 23], which extracts
the hierarchical relations between documents and abstract organization of keywords.

Similar to textual words, the patterns of image features could be organized into a hierar-
chy with each textual word corresponding to a different visual pattern. For example regions
for ‘sky’ appear more commonly than regions for ‘tiger’. Thus, the goal is to organize both
the textual words and image regions into a hierarchical structure. The general blobs and
words are put at the top levels of the hierarchy that are close to the root, and the blobs and
words with speciﬁc meaning are put on nodes close to the leaves. Annotated images are
put to the leaf nodes of the hierarchy, which are considered as clusters. The hierarchical
structure deﬁnes a path along which each annotated image generates its image blobs and
annotation words[3, 28]. Refer to paper [3, 28] for an excellent ﬁgure to illustrates this
procedure.

Because of the uncertainty in determining the cluster membership for an annotated
image, to compute the probability to generate the observations D (i.e., words, and image
regions) of an annotated image d, we need to sum over the uncertainty in the distribution
of assigning annotated image d to different clusters, i.e.,

p(D|d) = [no 1‘] Zp(ill,c)p(llc, d) (3.12)
C iED l
,where p(l lc, d) is the probability of going through level I given the document d and cluster

c. p(z‘|l, c) is the probability to generate the observation 2' given the level l and cluster c.

26

The EM algorithm could be used to train the parameters in the likelihood function. The
probability of generating observations of words p(z'|l, c) is computed based on counting.
A Gaussian distribution is used to compute the probability of generating observations of
image regions. More information about the application of the EM algorithm to the hierar-
chical structures could be found in [21, 23].

Given the learned hierarchical structure, the probability of annotating an image B with

a word to, i.e. p(w|B), is written as:

p(w|B) = ZCPWICI B)p(0|B) 0< Zcmwlc, B)p(BIC)p(C)
= Zcpfwlc, B) IIbeBp(bIC)p(C) (3.13)
= 2c 22 19(wlc, l)10(l lc, B) HbeB (ZI(P(bIl, 6)le IC))) 29(6)

However, the above model is not a true generative model, since the probability of generating
words and image segments rely on the speciﬁc document. In [1], three different hierarchical
models have been developed. They are:

0 Model I-0.

This model is deﬁned as:
it“: will:
p(=D|d) 2p c) 1'] [2p (wllc)p(lld)] H[2p(blz,c)p(zld)]
beB l

wEW
(3.14)

0 Model H.

In this Model, the probability of generating level p(l |d) in Model LG is replaced with
p(l |c, d) , which also depends on the cluster the document belongs to. The generation

probability of observations is changed to:

m": 7%?)
p(DId)= —Zp(c)H [Zp(wIzc)p woo] H[2p(bIl,c)p(ch,d)I

weW beB I
(3.15)

0 Model I—2.

27

Model I-2 further modiﬁes the probability of generating level 1, and makes it in-

dependent of the documents and only dependent on the cluster. The probability of

observation is thus changed to:

N“! N
w,d Nbbg

p(D|d) = 219(6) H Zp(wll,6)p(lIC) II 219(1)”, C)P(IIC)

C l

wEW (263 l
(3.16)

3.2.2 Latent Dirichlet Allocation for Image Annotation

Multi-type data problem is a general problem in which each object/pattem is represented by
different types of data. Annotated images can be viewed as multi-type data, in which there
are two types of data, the image regions and annotation words. These two types of data
should correlate with each other since they describe the same images. In [5], three models
are introduced to solve the multi-type data problem. In these models, image features are
assumed to follow a multivariate Gaussian distribution with diagonal covariance matrix.

The words are assumed to follow a multinomial distribution.

0 Model 1: Gaussian multinomial mixture model

In this model, each latent variable is an indicator of cluster, both words and image re-
gions are considered as different types of data that are generated from the same latent
variable. Refer to [5] for a ﬁgure to illustrate the generation procedure of this model.
The joint probability for latent class 2, annotated words 172’ = {w1, w2, ..., w M}, and
image regions 7’ = {7:1, r2, ..., rN}, i.e.,p(z, ‘1‘", '23), could be formally represented
as:

N M
Ma 7", ii) = p(zIA) H p(rn|2,#,0) H p(wm|z,ﬁ) (3.17)
m=1

n=1

where p and o govern the Gaussian distributions that determine the generation prob-
abilities of image regions, and ﬂ govems the multinomial distributions that determine

the generation probabilities of annotation words.

28

To predict words for a given image, we ﬁrst compute the probability that the image
belongs to a latent class 2 based on the image region features and then words are
generated from the latent class 2. Finally, the words from different clusters are mixed

together to obtain the annotation. This is formally described as:
p(wlr) = Zp(2Ir)p(wlz) (318)
Z

Model 2: Gaussian Multinorrrial Latent Dirichlet Allocation

In GMM model, the generation of the words and image features is conditioned on the
same latent variable. A more ﬂexible model is Latent Dirichlet Allocation (LDA).
LDA was ﬁrst developed for document clustering[6]. Each document is considered to
consist of several topics and observations in one document are generated from these
different topics. For image collection, we could also view an image as consisting
of several different topics. Each of the topics will generate the corresponding words
and image regions belonging to the topic. The overall observation is a mixture from

these different topics. Refer to [6] for this graphical model.

Let latent variables 2 and '0 model the different topics from which words and image
regions are generated. A variable 0 is used to indicate that the two types of topics
for words and image regions are correlated. Also, it models the concept that a single
document could consist of multiple topics, which are sampled based on the variable

6. The'formal representation of this generation procedure is expressed as:

N M
p('r,w,0,z,v) = p(6Ia) H p(an0)p(TnIzTh/'Lto) H p(vﬂlI6)p(wmIvm1B)
n=l m=1

(3.19)

Model 3. Correspondence LDA.

While the GMM model is too restrictive and LDA is too ﬂexible, a model is proposed
to position between the two kinds of models. Due to the ﬂexible and correlated

relation between image regions and words, correspondence LDA is introduced to

29

model the correspondence between speciﬁc words and image segmentations. That
is, the generation of words is based on both the latent variables to generate the words
as in LDA and the latent variables to generate the image segmentations. Refer to [6]

for a graphical model to illustrate this procedure.

The formal representation of the model is

N M
Maw, 9. z, y) = pIGIa) H p(zn|9)p(rnlzn,u,0) H p(ymIN)p(wmem, aﬂ)

(3.20)

Further comparison of the different models.

We could make a further discussion of the three models based on latent variables:
Gaussian Multinomial Mixture Model (GMM), Probabilistic Latent Semantic Anal-
ysis (PLSA), and Latent Dirichlet Allocation (LDA). From the following discussion,
we could see that these three models differ mainly in how they sample the topics.

The topics are modeled by latent variables.

In GMM, the probability of sampling a topic is ﬁxed by the latent variable. In PLSA,
the sampling of the topic depends on each of the documents. In LDA, the ﬂexibility
of sampling of the latent variable is between GMM and PLSA. The topic probability
depends on the parameters sampled out from a Dirichlet distribution but not on the
document. From this perspective, LDA is more restrictive than PLSA. However, the
capability to change the parameters of topic sampling based on Dirichlet distribution

also provides more ﬂexibility than GMM.

3.2.3 Continuous Relevance Model

Previously, we discussed the application of the discrete relevance model to image annota-

tion. The related words to a test image could be predicted by aggregating annotation words

of labeled instances in the training set weighted by the visual similarity between test image

30

and the labeled image. The continuous relevance model is discussed in[32]. The difference
is to use a different method to generate the image segments and words.
The conditional probability of generating words for given images is approximated by

the joint probability of observing the image regions and words:

"B ”A
(DU/1,103): ZPT(J)Hpv(wbIJ)H/kPR(TaIQa)PG(9aI-])dga (3.21)
JET b=1 a=1 R

p R(rq | ga) is a global probability distribution responsible for mapping generator vectors
9 E Rk to actual image regions 7' E R. In [32], it is assumed that for every image there is
only one corresponding generator. So a particularly simple form for the distribution PR is

assumed to be:

1 .
N- szU’) = 9
P r = 9
R( lg) { 0 otherwise
,where N9 is the number of all regions in R.
Gaussian distribution pG(g|J) = % 3:1 1 exp (9 — G(r,-)Tz_1(g - G(r,-))

ilzknk det(2)
is used to generate the image features from a model J. G (ri) is the feature vector of every

region of image J. In this model, it is assumed that the feature vector 9 in given test image
could be generated from every region rz- in the image J following a Gaussian distribution
with r,- as the mean and a diagonal matrix as the covariance matrix.

The word annotation probability is estimated based on multinomial distribution with
Dirichlet smoothing. The expression for this Bayesian estimation of posterior of word

probability is
”p12 ‘I' N 2),]

u+Zer,
v,J

In [32], the model probability p(J) is assumed to be uniform.

pv(v|J) =

 

(3.22)

3.2.4 Classiﬁcation Method

The annotation procedure can be viewed as a classiﬁcation problem. A classiﬁer is trained

for each word and this classiﬁer is then used to determine whether the word should appear

31

in the annotation set of a given image. These trained classiﬁers are then applied to both
labeled and unlabeled images to get a vector, which includes all the concepts predicted
from classiﬁers for each of the concepts. The values of the components of the vector are
conﬁdence values obtained from the classiﬁer[9]. The application of the multiple binary
classiﬁers for image annotation task could be viewed as an ensemble of binary classiﬁers
with a method of One-Per-Class(OPC). This is a simple ensemble strategy compared to the
pair-wise combination and Error Correct Output Coding(ECOC)[29]. However, the way
OPC used in [9] is not exactly the traditional usage of OPC. In the traditional usage of
OPC, the class label with the maximum score is used to annotate the image. In [9], the
conﬁdence to annotate the image with one concept is computed by adding the scores of
different ensembles for this concept. The framework is called Content-Based Soft Annota-
tion (CBSA). [9]uses Support Vector Machine(SVM) and Bayesian Point Machine as the
classiﬁers.

Support Vector Machine tries to ﬁnd a hyperplane that separates the training data with
maximum margin. The points closest to the hyperplane are called support vectors[8]. The
Bayesian Point Machine approximates the Bayesian average of statistical inference with a
unique classiﬁer called Bayesian Point[20]. Bayesian inference gives a Bayesian optimal
solution for a classiﬁcation task. However, it is often impossible to get a unique classiﬁer
that has the same result as Bayesian inference. An approximation is made by ﬁnding a
hypothesis in a ﬁxed smaller space. The hypothesis constructed is called Bayes Point,
which is believed to well approximate the Bayesian inference. This Bayes Point Machine
could be computed as the center-of—mass in the version space. [9] shows that the classiﬁer
with Bayesian Point Machine performs better than the classiﬁer of Support Vector Machine.
Also, for the words with prediction score higher than 0.5, the results are very likely related

to the semantics of the pictures.

32

3.2.5 Bootstrapping and Co-training Approach for Image Annotation

All of the previous models have certain limitations. They require a large number of anno-
tated images for effective learning, which are often difﬁcult to obtain. Furthermore, most
existing techniques for automatic image annotation require semantically meaningful seg-
mentation of images, which is again difﬁcult to accomplish given the current state-of—art
segmentation techniques. [15] proposes using bootstrapping and co-training methods to
annotate large image collections. The basic idea is to start with a small set of annotated
examples, and successively annotate and learn from a large set of unlabeled examples. To
incorporate more instances from the unlabeled sets for the training purpose, two statisti-
cally independent classiﬁers are used to co-annotate the images and the quality of the ﬁnal
result is determined by the consistency of annotations. This procedure is called co-training.
The consistent labels are considered as good examples and added to the training set. The
inconsistent labels are considered as bad examples and are used for the user’s manual an-
notation. More precisely, the co-training approach [15] is carried out as follows. TWO SVM
classiﬁers are used to determine the words for images. Each SVM uses a different subset
of image features extracted from automatically segmented image regions. One set contains
the color histogram and another includes texture and shape features. These SVMs trained
on different kinds of features for different segmented regions are then applied to annotate
images. If the annotated words could agree with each other, the words will be used for
annotating the image. If there is some conﬂict between different classiﬁers, some disam-
biguation rules are applied to pick the correct one. If the conﬂict cannot be resolved, the
image is prompted for the user to annotate manually. At the same time, the training set is

enlarged and the classiﬁers are updated.

3.2.6 Linguistic Indexing of Pictures by a Statistical Modeling Approach

Most of the previous methods require semantically meaningful segmentation to correlate

the segmented regions with words. However, given the current segmentation techniques,

33

semantically meaningful regions are difﬁth to obtain. A number of papers propose ap-
proaches that do not need to get the meaningful segmentation but still effectively model the
structures of the images[33].

In [33], image features are modeled as 2 dimensional Multi-resolution Hidden Markov
Model (2D MHMM). Images are divided into blocks and represented in multi-resolution.
The number of blocks in both rows and columns are reduced successively by half to get
lower and lower resolutions. At each resolution level, features are generated by a 2D HMM,
which takes into account the spatial relationship between blocks at the speciﬁc resolution
by specifying the diagonally upper blocks as the previous blocks. For every state in the
2D MHMM, the feature vectors are assumed to follow a multivariate Gaussian distribution.
To model the relationship across different resolutions, a one dimensional Markov chain
is used to connect the related parent-children blocks at different resolutions and thus a
2D MHMM is formed to represent the information of the image. The 2D MHMM model
mainly captures the spatial relationship of the feature vectors, for both inter-scale and intra-
scale statistical dependence. The number of states along horizontal and vertical directions
is unknown and thus different combinations are tried. Refer to [33] for a ﬁgure to illustrate
this idea.

A dictionary of concepts is deﬁned as the possible set of annotation words. For each
of the concepts in the dictionary, a set of images is collected and also a short description
of the concept is given. It is easier to collect a set of images related to a concept with a
short description than provide a set of annotations for each of the image. Given this set of
images, a 2D MHMM is trained for each of the concepts.

Given a test image, the similarity score between the image and images in the training
set is computed as the likelihood of generating this test image with the 2D MHMM for
the concept of the category. The top Is: categories with the highest likelihood are picked
out and the word descriptions of the categories are treated as the candidates for annotating

the test image. The paper introduces a method to pick appropriate words by ranking the

34

statistical signiﬁcance of the words in the candidate description. The statistical signiﬁcance
is computed by comparing the occurrence of the word in the predicted k categories with
randomly selected k categories. The top ranked words are selected as the annotation of the

image.

35

CHAPTER 4
Enhanced Translation Model for Automatic

Image Annotation

Feng Kang, Rong J in and Joyce Y. Chai. Regularizing translation models for better
automatic image annotation. In CIKM ’04: Proceedings of the Thirteenth ACM Conference
on Information and Knowledge Management, pages 350-359, 2004.

Feng Kang and Rong J in. Symmetric statistical translation models for automatic image
annotation. In Proceedings of the ﬁfth SIAM International Conference on Data Mining,

pages 616-620, 2005.

36

CHAPTER 4

Enhanced Translation Model for Automatic

Image Annotation

The translation model in section 3.1.1 represents images by a visual vocabulary that is
constructed from the clustering results of image regions. The image annotation is viewed as
a translation from a visual language to a textual language. The key challenge in training the
translation model arises from the fact that the alignment between image regions and their
annotation words is not provided. According to [7], one strategy to avoid the alignment
problem is to utilize the co—occurrence statistics between image regions and annotation
words. If an image blob co-occurs more frequently with word ‘A’ than with other words, it
will be more likely for the image blob to be associated with ‘A’.

However, the term frequency of annotated words follows Zipf’s law, namely that a small
number of words appear very often in image annotations and most words are used only by
a few images. Figure 4.1 plots the percentage of times that each word is used by image an-
notations from a subset of images from the COREL dataset[14]. As a result of the skewed
distribution, a common word can ‘accidentally’ co-occur with a blob that is associated with
a rare word. For example, word ‘grass’ is used for annotations much more frequently than
word ‘tulip’. Meanwhile, for most images where tulip appears, it is always surrounded by
grass. As a result, the type of blobs for tulip co-occurs with word ‘grass’ as frequently as

word ‘tulip’. The problem with the co-occurrence statistics in automatic image annotation

37

 

0.07 I I I I I I l

0.06 - s

.O

O

U"
r

.o
S
l

.o

o

(A)
I
r

 

normalized frequency

9

o

N
r
r

 

 

 

0 t l I x . r r
1 50 100 1 50 200 250 300 350 400
words index

Figure 4.1. The distribution of term frequency for annotated words in a subset of COREL Data

is further complicated by the noise in clustering the massive number of image regions into a
relatively small number of blobs. Because each image region is represented by a set of ﬁxed
features such as colors, textures and shapes, regions for different annotated words can have
similar distributions over the space of image features and therefore are grouped into the
same cluster, or the same image blob. As a result, a blob for a rare word can co-occur more
frequently with a common word than the rare word. Using the previous example, consider
that image regions for tulips are grouped together with image regions for other ﬂowers. If
most ﬂowers are surrounded by the grass, word ‘grass’ will co-occur more frequently with
the blob for ﬂowers than any single ﬂower name. It is the inaccurate co-occurrence statistics
that allow common annotated words to be associated with many irrelevant image blobs and
thus degrade the quality of auto-annotations generated by the machine translation models.
We propose two categories of modiﬁed translation models, namely the regularized transla-
tion model, and the symmetric translation model, that alleviate the problem caused by the

skewed distribution. The basic idea of the enhanced models is to raise the number of blobs

38

that are associated with uncommon words. The regularized translation model accomplishes
this goal through the introduction of a special Dirichlet prior and the symmetric translation
model considers both the translation from blobs to words and the translation from words to
blobs. In this chapter, we will ﬁrst present the two frameworks and then show our empirical
study with both modiﬁed translation models. Chapter 5 addresses their comparison with

other models.

4.1 Regularized Translation Model

In this section, we will ﬁrst present the framework of the regularized translation model for
automatic image annotation, followed by the description of an efﬁcient EM algorithm for

ﬁnding the optimal solution[26].

4.1.1 Framework for the Regularized Translation Model

The basic idea of the regularized translation model is to impose our prior knowledge of the
desired translation model in the selection of translation models. If each blob represents a
different type of objects, we would expect that roughly equal number of blobs are associ-
ated with each word, or at least each word is associated with a certain number of blobs.
In the framework of Bayesian Learning, this prior preference of translation models can be
introduced into the original statistical translation model through an appropriate prior.

In order to form such a prior, the ﬁrst task is to ﬁnd an appropriate measurement that in-
dicates the number of blobs that are associated with each word. We use the normalized sum
of translation probabilities for each word, i.e.,ﬁj = Qﬂélﬁ where m is the total number of
blobs. Since the measurement ﬂj is proportional to the sum of translation probabilities for
the j — th word, it does provide a good indication of how many blobs that are associated
with the word j. Meanwhile, 53' can also be interpreted as the probability for the j — th
word to be associated with any blobs. Particularly, Bj satisﬁes the axioms for probability,

namely

39

Because of the probability interpretation for ﬁj, we can introduce a prior distribution
for Bj that will indirectly inﬂuence the results for translation probabilities tjﬁ.

The second task for forming a prior is to choose an appropriate distribution for Bj. Since
the desired translation model is to have almost equal number of blobs to be associated with

each word, a Dirichlet prior can be used for B,
1 n
" __ a—I
Pm) ~ Dirichletog — m j|_|1 B]. (4.1)

where a > 0 is the hyper-parameter that determines the shape of the Dirichlet distri-
bution and n is the total number of words. 8 is the normalization constant and follows
multinomial beta distribution as:

W) = (r(a))"

F (na)

 

1‘ rs Gamma function deﬁned as I‘( (z): f0°° tz 1e.—tdt Note that Dirichlet distribution
is the conjugate prior of the multinomial distribution.

In our model, we set a; as a constant. One property of Dirichlet distribution is that it
reaches the maximum point when ,Bj is a constant. Furthermore, the larger the a is, the
narrower the distribution will be. Figure 4.2 illustrates the Dirichlet distribution with two
ﬂ component and different values of the parameter a.

By adding the above prior to the original translation model, the posterior probability for

the training images is then modiﬁed into the following form:

0T n m
lreg(T)= ﬁ)HPthb)ocl—I(Ztls) anzl( tj,kbi,)k (4-2)

i=1j

With same notation as the previous translation model, m is the number of blobs, n is

the number of words, and T is the number of images in the training set.

40

 

or value is 2 or value is 4

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0.04 - e e . 0.05 - -
0.03) . 0'04’
0.03)
0.02)
0.02:
001’ ‘ 0.01l
o . - - . 0 . - - -
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
avalueisB avalue is 16
0.08 . a - . 0.1 . - - ,
0_06_ , 0.08»
0.06'
0.04:
0.04:
0'02' 0.02)
O . . . 0 . L
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1

Figure 4.2. Dirichlet distribution for different values of a

The optimal translation probabilities are obtained by maximizing the objective function
in Equation 4.2. Compared to the original translation model in Equation 3.2, the objective
function in Equation 4.2 requires that not only should the optimal translation probabilities
explain well the correspondence between image blobs and annotated words but also be
consistent with the prior preference on translation models, namely that different words are
associated with a similar number of image blobs. Therefore, the resulting translation model
from Equation 4.2 will be more desirable than the model obtained from Equation 3.2. For
late reference, we call this modiﬁed translation model ‘regularized translation model’, or

RTM.

4.1.2 An EM Algorithm for the Regularized Ti'anslation Model

With the regularized translation model in Equation 4.2, the next important question is how

to efﬁciently obtain the optimal translation probabilities tj,‘c that maximize the function in

41

Equation 4.2. The difﬁculty with optimizing Equation 4.2 lies in two aspects:

1. It has a large number of parameters. The number of parameters in translation mod-
els is m - n, i.e., the number of blobs times the number of unique words. For the

experiment conducted in this study, the number of parameters is close to 20,000.

2. It is a constrained optimization problem. The optimal solution to Equation 4.2 should

satisfy the axioms of probability, namely 0 g tj’k S 1‘v’j,k and 23- tj’k = IVk.

In the following, we present an EM algorithm for efﬁciently optimizing the objective func-
tion in Equation 4.2.

First, instead of optimizing the likelihood of training data in Equation 4.2, we can
optimize the log-likelihood of training data, i.e.,

ITI n

(I) = log (lreg(T — —aZlog (LT: 19)1 ZZwiJ-log (Z tj kbz k) (4.3)

i=1j=1
Then, following the idea of the EM algorithm, we update the optimal solution itera-
tively. Particularly, at each iteration, we need to ﬁnd a set of translation probabilities t”ew
better than the old ones tJOll‘c’ that are computed for the previous iteration. To this end, we
can examine the difference in the log-likelihood between two consecutive iterations, i.e.,

:ltn—__e—w +ITI n :21: lttrwwbi:

+i=lj=1

 

 

tolsd tneu ITI n wij tolkb k tniut
J i J
>a§::— (...)+§:z . .()

l 18 1 3:2-1J-12k=1wzi.7tj,zkbik J,k
The new translation probabilities tg’i’“ are obtained by maximizing the above difference.

Taking derivative and setting it to 0, the updating equation becomes:

old old
in?” __ 1 Zka( t] k +2 “11.3th kb3 k k) (4 4)
9 '_ ld '
J 21 told 2k: 1 “1:,th kbz‘

 

 

where Z k is a normalization factor that ensures 23' t3???“ = 1,

42

Comparing the above updating equation to Equation 3.3, we can see that Equation 4.4

has an extra term a—Lk—H in the right hand side of the equation. For each word j, this extra
:1 ’3’]

term gives a larger share of a to t jsk’ the maximum translation probability for the j — th

word, than any other translation probabilities t 331” 7g Is: for the same word. As a result,

the maximum translation probability for each word will beneﬁt most from this extra term.

Furthermore, the updating equation 4.4 is able to adjust the sum of translation probabilities

for different words (i.e.,Z,c thC) to be close. This is because according to Equation 4.4, a

word J with a small sum of translation probabilities, which is the denominator of a—Afﬁ,

Z, t .~
will get more promotion than a word that has a large sum of translation probabilities. Ujslu-
ally, the rare words co-occur with fewer blobs than the common words and the sum of
translation probabilities related to rare words is thus smaller and gets more promotion.

Comparison to the normal usage of Dirichlet priors. Note that the Dirichlet prior in-
troduced in this work is different from the Dirichlet priors used in many other studies [49].

For most previous studies of Bayesian learning, Dirichlet priors simply introduce constant

pseudo counts into the estimation of probabilities. However, here the pseudo count in-
troduced by the Dirichlet prior (i.e., a—Lk—la) is no longer a constant. In fact, it is this
216’:

non-constant pseudo count that leads to a more balanced distribution of the number of
blobs associated with each word.

The global optimum for the EM algorithm. The objective function in Equation 4.3
is strictly convex. It can be easily understood by treating each term (22”:1 t1,8)a in the
prior as a number of pseudo-annotated images that include all blobs in its picture and
are annotated only by l — th word. As a result, the regularized model is almost identical
to Translation Model 1 except that the regularized model uses both the pseudo-annotated
images and the annotated images from the training dataset. Since the translation model
for any number of annotated images is strictly convex [7], the new objective function in
Equation 4.3 will be strictly convex. Therefore, it does not have any local optimum and the

EM algorithm presented in Equation 4.4 will guarantee to ﬁnd the global optimal solution.

43

The choice of a. As already revealed in the previous discussion, constant a has a great
impact on the resulting translation model. A larger value for a will introduce more pseudo-
annotated images and therefore result in a more balanced distribution for the number of
blobs that is associated with each word. In the experiment part, we provide a detailed study

of how the value of a will inﬂuence the quality of auto-annotations.

4.2 Symmetric Translation Model

Most previous studies on translation models for automatic image annotation focus on the
model that translates image regions/blobs into textual words, which is called forward trans-
lation model in our view. Apparently, we can apply the translation model in a reverse way,
namely translating textual words into image blobs. We call this translation model a back-
ward translation model. In this section, we propose a symmetric translation by combining

these two kinds of translation models together[25].

4.2.1 Discrepancy between Forward and Backward Translation Models

Although both forward and backward translation models utilize the same set of co~

occurrence statistics, they make different assumptions for words and image blobs:

' l. The forward translation model assumes that each image blob is translated into a sin—
gle word, while each word can be translated into multiple blobs. Due to the prob-
lem with accidental co-occurrence, common words are usually associated with many
more image blobs than uncommon words. In particular, many rare words are even
associated with no image blobs at all, which makes it impossible for these words to

be used as annotations.

2. The backward translation model is based on the assumption that each word is trans-
lated into a single image blob. Thus, for each annotation word, a large translation

probability is assigned to an image blob if it co—occurs relatively frequently with the

44

Table 4.1. An example of forward and backward translation models for two image blobs(i.e. bl
and b2) and two words(i.e.'w1 and 102).

UV

 

given word compared to other image blobs. As a result, an image blob can be as-
signed with a large translation probability for a word although the absolute number

of co—occurrence between the word and the image blob is rather small.

In order to better illustrate the difference between these two types of translation models,
consider a simple example of co-occurrence statistics that is shown in Table 4.1. On one
hand, for the forward translation model, the translation probabilities for word ‘wl’ are
dominative for both image blobs, and word ‘w2’ is not strongly associated with any of the
two blobs. As a result, it is unlikely for word ‘w2’ to be used in any auto-annotations.
On the other hand, for the backward translation model, we do ﬁnd that image blob ‘bl’ is
strongly associated with word ‘w2’. But, the chance for image blob ‘b2’ to be associated
with word ‘w2’ is also high (i.e., 42%). This large uncertainty can signiﬁcantly (@3112;
the accuracy of auto-annotations. Thus, neither of the two translation models provides a
satisfactory answer.

To utilize the two directions of translation models, we can adjust the translation proba-
bilities in Table 4.1. On one hand, for the forward translation model, to balance the number
of blobs associated with different words, we need to increase probability p(w2|bl), and
decrease probability p(wl|b1). This will result in word ‘w2’ to be associated with blob
‘bl’. On the other hand, for the backward translation model, the uncertainty as to which

blob is associated with word ‘w2’ can be reduced using the information from the forward

45

translation model. This is because, according to the forward translation model, it is more
likely for blob ‘b2’ to be associated with word ‘wl’ than for blob ‘bl’. Thus, to balance
the number of image blobs associated with each word, we need to reduce translation prob-
ability p(bg|w2) and increase p(bl |w2). The reﬁnement of the forward translation model
and the backward translation model will be carried out iteratively until they both converge
to the desired distribution.

As indicated by the above analysis, the key is to explore the correlation between the
forward translation model and the backward translation model under the assumption that
the number of blobs associated with different words should be evenly distributed. Using
the information collected from the backward translation model, we are able to alleviate the
problem with the forward translation model in which rare words are associated with few
image blobs; Using the information collected from the forward translation, we are able to
alleviate the problem with the backward translation model in which large uncertainty exists
for uncommon words as to which image blob the word is associated with. By combining
these two translation models together, we will be able to avoid the difﬁculties of each
individual translation model that are caused by their underlying assumptions.

The mathematical formula for the backward translation model is similar to the forward
translation model. In the backward translation model, the translation probability from an-

notation words to image blobs is written as:

m m bi,j
p(bi|ru)1—Ip(b,JIwO<H (Z uj.sz',k)
k=1

where try-J: stands for the probability of translating the k — th word into the j — th
blob subject to the constraint Zj uj k"— — 1, namely each word has to be translated into a
single image blob. Similarly, EM rs used to ﬁnd the set of translation probabilities and the
updating equation is written as:

()lti

 

. new _ 1 wi’jb’ 7}:qu 4
u‘jJJ — Z—u01d, (.5)
k 2' Zklw LA, k uj k,

46

 

(ID

 

- Forward Translation Model
8 _ — Backward Translation Model

 

joint probability
A 01 O)

(a)
I
I

2

100 200 300 400 500
blob index

 

A

 

 

 

 

Figure 4.3. The joint probabilities p(w, b) for word ’clouds’ that are computed using the forward
and backward translation models.

Z k is a normalization factor that ensures 2 j ug‘iw = 1.

mm this deﬁnition, we could illustrate the discrepancy between forward and backward
models with the results trained using the COREL dataset(See the experiment part for the
details of this dataset). Then, we compute the joint probability p(wj, bk) for every word
and every blob using both forward and backward translation models. More speciﬁcally, the

joint probability based on the forward translation model is computed as:
P(wjablc§ f) = tj,kp(bk)

The joint probability for backward translation model is computed as:
P(wj» bk; 5) = uk,jp(wj)

If the two translation models are consistent, we would expect the two joint distributions

based on the two different translation models to be similar. However, as shown in Figure

47

4.3, the joint probabilities for word ‘clouds’ that are computed from two translation models
are quite different. Furthermore, we also compute a relative KL divergence between the
joint distributions based on the two translation models. It measures the relative difference

between two distributions and is deﬁned as:

p(wZ-,bk:f)

231:1 ZZZ] p(wj1bkif)10gp(wjabki f)

The resulting value is 0.16286, which indicates that the overall difference between the

Rel — KL 2

 

two translation models is large.
In the following section 4.2.2, we will introduce a soft regularization term to enforce

the consistency between the forward and backward translation models.

4.2.2 Symmetric Translation Model through Regularization

As discussed in the above section, there is a certain discrepancy between the translation
probabilities of forward and backward translation model. A regularization-based sym-
metric translation model (RSTM) is proposed to correct this discrepancy by utilizing the
information across the two models. First, we introduce a symmetric KL divergence term
that measures the discrepancy between the forward and backward models:

p01,] '.'b piL b b
KL— ZZPW wj,br f) 10g(::(bk f) +221) u bub )10g( ((wj’b: ———f)))

(4 6)
According to the property of KL divergence, the above expression becomes zero iff
p(wj,bk;f) = p(wj,bk;b) for anyj E [1..n] and k E [1...m] . Then, we add the KL
divergence term into the objective function as the regularization term to ensure the consis-

tency between the forward and backward translation models:

 

ITI ITI n
QRSTM = {Z :1 (M2751, kbik) +2: 21%(Z'uk,jwi,j)} (4-7)
tranglration

48

P<wj15k f) P"(“j1bk 5)
—/\{ZZp(w wjibk f) )::10g(—(— wjbkib) )+Z:Zp( wjvbkib) )10g(——— P(1Uj,bk' ——f—))}

regularization
where A is used to determine the degree of consistency between the two models. Efﬁciently
ﬁnding the optimal solution to (4.7) is more complicated than the EM algorithm for the

standard translation model. We will give the updating equation in the following and leave

the derivation in the next section.

 

 

 

 

 

 

 

 

thew : 2Cjk (4.8)
. 2 . .
BM + \/Bj,k + 4AJ‘kCJ’k
_ P(bi~)
AM — 2’\ told
J, k
33,1... = 111111..) 11111-(13‘3‘1— 111111..) log<urjp<w3>> + 11.
[TI toldb
2 11:
CL}; = Z Z7nt01db + Ap(ur] )uk J-
i=1=1tjkb2 k
where Ak IS the normalization factor that ensures Zj We“) 2 1.
112621! _ 2Fk~j (4 9)
J .
Ekj + \/EEj + 4Dk,ij,j
111113)
Dka’ 2)‘ uold
k,j
E1...- = 111111) 111110133) — 1111111101013 11111» + 133-
m 210M211
k .] 1.?
F 1 = +A‘)b.t.-.
I“ Z X” 210m, 11). .r I( k) N”

i=1 jlzl k,j 2,]
where 63- is the normalization factor that ensures :1: u“?‘"— —— 1.
Note that the translation probability ukyj for the backward translation model is utilized
in the estimation of t M through the computation of 83-, k and Cch- Changes in the back-
ward translation model will be reﬁected in the computation of translation probabilities for

the forward translation model, and vice versa. Thus, through the regularization term, the

49

two translation models are able to exchange information to enhance the estimation of their
translation probabilities.

Choice of probabilities p(w) and p(b). One natural choice for p(w) and p(b) is to use the
empirical values that are estimated from the training corpus. However, one problem is that,
the empirical distribution for term frequency follows a skewed distribution (Zipf’s law). As
a result, if the empirical p(2u) is used directly, the regularization term will mainly focus on
the consistency checking for common words. For rare words, since its empirical p(w) is
very small, its impact in the regularization term is almost ignorable. To put equal emphasis
on both common words and uncommon words, we decide to use a uniform distribution for
both p(w) and p(b), which turns out to have better performance in our empirical studies.
Choice of A. Weight A determines the degree of consistency that we want between the two
translation models. A very large A will enforce the two translation models to have almost
identical prediction, which can signiﬁcantly degrade the performance for automatic image
annotation. A very small A will simply ignore the correlation between the two translation
models and return back to the individual translation models. To gain the best performance,
we divide the training data into the set for training the model and a set for evaluation set to
determine the value of A. After deciding the value for A, we will retrain the regularization—

based symmetric translation model using all training data.

4.2.3 Derivation of the EM Algorithm for Regularization based Symmetric Transla-
tion Model

Suppose that the objective function is l 2 ll + 12, and the parameter set is 6 = {t j.k1 "k, 1}
Deﬁne 11 = log (I (6)) — log (I (6°‘d)), where
ITI ITI

m Tl
log (1(6)) 2 Z 2 log (2: tj,kbi.k) + Z 2 log :l‘kaLj
k2]

i=1 {ﬂu-113:1} i=1 {Mbjkzi} 3'21

50

Applying Jensen’s inequality, the bound for the log likelihood of ll can be written as:

ITI taldbik

[1 > Z: Z Z Zm I told/Mb log (tj1kb‘1k)

1': 1{j|-111,- j=1}k= 1 Z:k’=1J

old
,uk sz'j
+2 2 2:11 Jold _ ‘09(f"k1j“"1i)

1: 1{ka k=1}j= 126:1113111’11'

 

 

Deﬁne [2 as the penalization term of KL-divergence.

kl? (b1) #1- MIN)
12 = 4 2:01PM) 1<)g(t:"j‘_jWP.)+ZZP1JP(WJ)10g(t£1413)

= 4221-1112(51) (10g (4:3, 3,) +109; (t‘jh‘) +log(p(b1-.)) —10g(1”k.jp(wj)))

thc

—A Z ZjikJ-p (211]) (log (135131) + log (1101‘) + log (19015)) — 10g (tjjcl’ (1%»)
j k k 3'

Suppose

12,1 1 — 1061(5) + 10g<p (‘11)) - 10% (l‘ka'p (11.13))

6,”- : log ([15:13!) + log (1) (21.5)) - log (tjﬂ) (bk))

With inequality log (.1 )3 :r — 1, the lower bound of [2 could be written as

12 > —AZ:1jkp(bk))(log(:——fﬁ —1)+nj‘k)
jfk

[1' ..
422111.11) ('wJ') (103C311 1) + 0‘1?)
j k

k,j

So the objective function we will optimize is l with constraints 23' t j}; = 1 for each I:

and 2k Mm" = for each j.
After introducing the Lagrangian term, the function we will optimize is

(I) :2 2: Z 2111 tomb 109(t1'1kb‘1“)

1= 111111113=111~= 1 21% 1

 

51

IT I . 11111

+2 2 221”“:123 109(I‘k,Jwi-J)

1:1{1111 k=1}J=12J’=1“kj'wiJ"

 

—,\ZZtJ-Jkp(bk) (10g (:‘Z—E: —1) + 72.7.13)
j k tJ' ’6

He,
—/\2 2,1116th (1113-) (10g (— —l_0$— I) + 6k,j)
j Ic

“kg

+Ak 1— 2%,]: + ﬁj (1— Z/ij)
J' k

Taking the derivative with respect to t j,k and setting it to 0, we can get the solution for

thc

201.1
2
Bj,k + \/3ij + 4Aj,ij,k

new
tj,

 

 

 

_ 19091)
AJ'J: — 2)‘ told

8J1 = 1 p<b1)log<t;"13>— Ap<b1)log(ru1,J-p(wj)>+11

 

ITI toldb k
2,
0.7113 : :—:th_‘7k tOZd,b +Ap(wj)uk1j
z=1=1tjk’bz,k

With constraint ZJ- tJ-Jk = 1, we could treat Ak as a variable and search the solution for
this equation. After obtaining the value of A1,, we could in turn get the solution for tJ-Jk

Similarly, we could get the updating equation for 11”.

4.3 Experiments

In this section, we will investigate the two enhanced methods on the COREL data set

used in [41]. The data set consists of 5000 annotated images, among which 4500 images

52

are used for training and selection of parameters and the rest 500 images are used for
testing. 374 different words are used for annotating both the training and testing images.
The maximum number of annotation words for one image is 5 and the average number of
annotation words for each image is 3.5. Similar to the previous studies on automatic image
annotation, the quality of automatic image annotation is measured by the performance of
retrieving auto-annotated images regarding to single-word queries. For each single-word
query, precision and recall are computed using the retrieved lists that are based on the true
annotations and the auto-annotations. Let I J- be a test image, tj be its true annotation,
and gj be its auto-annotation. For a given query word w, precision and recall are deﬁned
respectively as:

|{Ijlw E tj /\ w E gj}|
|{Ijlw Eng

|{Ijlw E tj /\ w E gj}|
|{1jlw Eth

 

precision(w) =

 

recall(w) =

The precisi0n(w) measures the accuracy in annotating images with word 11) and the
recall (w) measures the completeness in annotating images with word to. After we ob-
tain the precision and recall values for each word, the F measure of the word is deﬁned
as:

2precisz'0n(w)recall(w)

 

F :
(w) precisi0n(w) + recall(w)

The average of precision, recall, and F measure over different single-word queries are
used to measure the overall quality of automatically generated annotations.
The forth metric, #Ret_Query, is the number of single-word queries for which at least

one relevant image can be retrieved:
#ReLQuery = [{1_L.!|pT€C’i.9i071(’lU) > 0 /\ recall(w) > 0}]

This metric compensates the metrics of average precision, average recall, and average F

measure by providing information about how wide is the range of words that contribute

53

 

100 . . -

 

 

—-°~ all words

 

 

 

    

 

 

 

 

 

 

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

- 30 most common words
2‘ 80 041? x" -«--' uncommon words
a a ' ~. , -
O
‘3. 60 g; ’
m
a: 3%)) J.
“5 40- 5 0.2»: _ . ‘
2 1
20*
O0 400 800 1200 1600 2000 00 400 800 1200 1600 2000
aVaIue aValue
(a) #Ret-Query (b) Average Recall
0.4 . ﬂ r 0.4 f 4
—0— all words + all words
- 30 most common words - 30 most common words
c --+-- uncommon words 9 -+- uncommon words
8 a
a . a. . 8 " vino-.0004}
93 _. ' ° 2
a 0.2‘- _ ° ' 3’ u.
8) 4, W0 0
<3 ‘*"**”““" ‘v 3
m In.
3 3
00 400 800 12100 1600 2000 O0 400 800 12100 16100 2000
aValue aVaIue
(c) Average Precision ((1) Average F Measure

Figure 4.4. Performance measurement for different a- values of Regularized Translation Model

to the average precision ,recall, and F measure. In this section, we divide the training set
into two parts: 4000 images are used to train the model and 500 images are used as the
evaluation set to determine the parameters of the model. When choosing the parameters
for each of the models, we use 5 words as the size of the annotation set and evaluate the
performance of the two individual models under different values of the parameters. After
the value for the parameter is chosen, we use the whole training set to train each of the

models again and the resulting performance is compared with other models in Chapter 5.

54

4.3.1 Impact of Prior on the Regularized 'Ii'anslation Model

As already pointed out, the constant a has a large impact on the performance of regularized
translation model for automatic image annotation. In this experiment, we measure the
change with respect to the four metrics for different values of a. Figure 4.4 plots the curve
for #ReLQue'ry, average precision, average recall, and average F measurement when a
is varied from 0 to 2000. We observe that the performances ﬁrst improve and then the
improvements become saturate when a is larger than 400. This is because, when a is a
small value, the distribution of the number of blobs associated with words is rather skewed
and thus increasing the value for a will have a great impact on balancing the distribution,
which leads to signiﬁcant improvement with respect to the four metrics. However, when
(1 becomes large, the number of blobs associated with different words tends to be evenly
distributed over different words. As a result, increasing the value of a will make little
adjustment to the distribution of blob numbers and therefore little change will be made to
the translation model. From Figure 4.4, both average precision and recall are increased
substantially when a is increased from 0 to 200. This is contradictory to many studies
in information retrieval, in which improvement in recall usually leads to degradation in
precision. To have a better understanding of this phenomenon, we divide the words into
two groups: a group of common words and a group of uncommon words. In Figure 4.4,
in addition to the average precision ,recall and F measure, we also plot the curve for the
average precision, average recall, and average F measure for both common words and
uncommon words using the dotted lines and dashed lines, respectively. According to Figure
4.4, for both common words and uncommon words, the change in precision and recall
follows the normal patterns, namely that an increase in recall is usually accompanied with
decrease in precision. Furthermore, the trends in the change of precision and recall for these
two groups of words are almost opposite to each other: increase in the recall of uncommon
words is usually accompanied with a decrease in the recall of common words. Since the

overall average value for precision and recall is the mean of precision and recall for these

55

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

85* r,,,-._._,
0.4" . .-.'*"”“‘--’-+’ 1

OSBO’ E I
O
..175— 6‘2
Cl.) 0)
E 70. 0’ / .
0 $0.22, 0 o . .... . '1’
E 65- 3,5 1' —o— all words
2 - 30 most common words

60* - +- uncommon words

550 2 4 6 a 10 O0 2 Z 0 a 10

A Value A Value
(a) #ReLQ-ue-ry (b) Average Recall
0.4 . . 0.4 L . .
—0— all words -°— all words
- 30 most common words - 30 most common words
c — +—- uncommon words 9 — ~- - uncommon words
_% 0. . o 0‘ ,. ' o ‘ 0 g
0 a)
9 2
0' U.
a)
a) L—
i 31’
l
00 2 4 0 a 10 00 2 4 6 10
A Value A Value
(c) Average Precision (d) Average F Measure

Figure 4.5. Performance measurement for different A values for Regularized Symmetric Transla-
tion Model

two groups of words, the opposite trends in these two groups somehow compensate each
other and lead to increase in both the average precision and recall. One disadvantage of
using large values for a is that a larger a usually results in a slower convergence for the
EM algorithm. Apparently, the performance of the regularized model saturates after a is
greater than 400, a = 400 has the best tradeoff between computational cost and predication

accuracy. This value is used in our later experiment.

56

4.3.2 Effects of Exploring Correlation between Forward and Backward Translation
Models

The importance of exploring the correlation between the forward and backward translation
model can be illustrated by varying the value for A. The results for the four metrics are
plotted in Figure 4.5. When A = 0, no correlation between the two translation models are
taken into account. By increasing A, more and more correlations are introduced between
the forward and the backward model. We experiment the set of values that are multiplier
of 5000. The multiplier ranges from 1 to 20 is used in our study. As indicated in Figure
4.5(Note: When plotting the ﬁgure, the A values are divided by 104 to be ﬁt in the paper),
the performance improves at ﬁrst and then saturates after A = 50, 000. We also plot the
performance of the 30 most common words and the performance of the rest of words.
We observe that average recall keeps decreasing for common words and keeps increasing
for rare words, which indicates that with the increasing of A, we obtain better and better
performance on rare words. This could be used to determine the value of A to have a
good tradeoff between the performance of rare words and common words. In addition,
the average precision ﬁrst increases and then keeps relatively stable. This result further
indicates that introducing more correlation between the forward and backward translation
model can signiﬁcantly enhance the quality of auto-annotations. In our later experiment,

we choose A = 50, 000.

57

CHAPTER 5

Image Annotation through Label Correlation

Feng Kang, Rong Jin and Rahul Sukthankar. Correlated Label Propagation with Ap-
plication to Multi-label Learning. In C VPR ’06: Proceedings of the 2006 IEEE Computer

Society Conference on Computer Vision and Pattern Recognition, pages 1719-1726, 2006.

58

CHAPTER 5

Image Annotation through Label Correlation

In Chapter 4, we discussed translation models for the automatic image annotation. Two
enhanced translation models are proposed to address the over-estimated common word
problem with the traditional translation model. While the proposed enhanced translation
models improve the original model substantially, there are two major problems that prevent

the performance from being further improved:

0 The discrete representation of images loses information and result in high annotation
errors. To construct the translation models for the automatic image annotation, we
need ﬁrst apply clustering algorithms to group image regions into clusters to be able
to construct the visual vocabulary to represent the images. While this representation
reduces the storage space compared with the continuous features, it loses informa-
tion in the clustering procedure. For example, it might group regions of different
ﬂowers into the same cluster. As a result, the identities of different ﬂowers become
indistinguishable. However, this difference will be retained if we use continuous

features.

0 None of the previous models explores the correlation among different annotation
words. Effectively exploring the semantic correlation among annotation words is
important since the visual features are often insufﬁcient for deciding the appropriate

annotation words. For instance, the words ‘ocean’ and ‘sky’ are both strongly related

59

to the blue color feature. Therefore it may be difﬁcult to distinguish these two words
based on the color features alone. However, if we are conﬁdent that an image should
be annotated with ‘grass’, then it is more likely that a region of blue in the same

image should be annotated as ‘sky’ rather than ‘ocean’.

In this chapter, we ﬁrst present the framework for multi-label learning that explores
the correlation among different class labels and then apply the proposed framework to
automatic image annotation.

A natural way to deﬁne a set of correlated labels is to explicitly specify which set
of labels are correlated among the whole label sets. However, one big concern of this
explicit representation is the scalability issue. Even for 20 different labels, the number of
possible label combinations reaches 220 = 1048576. The combinatorial problem is usually
computationally expensive and thus may only apply to a small number of labels. One
simpliﬁcation of this method is to only consider the pairwise correlation among any two
class labels instead of all possible labels, which is adopted by several researchers[51, 46,
17]. However, due to the large number of labels in our annotation task, even this simpliﬁed
approach may not be able to work efﬁciently, since the complexity is on the order of 0(122),
where n is the number of labels.

In the following, we will ﬁrst present one of the methods based on the explicit speciﬁca-
tion of label correlation, the Multi-label Maximum Entropy Model[51], which extends the
single label Maximum Entropy Model[4] by specifying the pairwise constraints between
any two class labels. The Multi-label Maximum Entropy Model[51] was originally applied
to multi-label text classiﬁcation. In this study, we apply the model to automatic image
annotation.

We then present a framework of multi-label learning that addresses the correlation
among different class labels, called the Correlated Label Propagation(CLP) framework
[27]. Based on properties of the submodular function, the framework can obtain the opti-

mal solution using a very efﬁcient algorithm, although there are still an exponential number

60

of constraints. Furthermore, we show how this framework can be applied to automatic im-
age annotation using either discrete feature representation of images or continuous feature

representation of images.

5.1 Problem Deﬁnition of Multi-Iabel Learning

Let D = {(x1,51), . . . , (xn, 8,,)} denote the set of labeled examples, where n is the num-
ber of training examples. Each x,- = (123,31, . . . ,rw) is an input vector of d dimension.
Each set 8,- contains the class labels that are assigned to the i-th training example. For the
convenience of presentation, we will employ a binary vector to represent a set of class la-
bels. In particular, for a class label set 8,, its vector representation t(S,-) = (tin, . . . Jim)
has its j-th element set to 1 only when j E S and zero otherwise. In total ,we have
m different labels. Given a test point xt, our goal is to determine a conﬁdence vector
2) = { 31.1: . . . ,ztym} such that each component 2m indicates the conﬁdence of assigning

xt to the i-th class.

5.2 Pairwise Label Correlation based on Multi-label Maximum En-

tropy Model

Let P(x|t), Q(:c|t) denote the empirical and the model distributions respectively, where :c
is data and t is the set of labels assigned to the data point as deﬁned before. For the single

label Maximum Entropy Model, the framework to maximize the entropy is formulated as:

(1": max H(:c, tlQ) = min < log q(t|1?) >Q
q (1

subject to:

<t>Q=<t>p

<tJEl>Q=<t171>p,V1_<_l£ d

61

, where H (r, tlQ) is the entropy of data :r and labels t. d is the dimension of the data, and
tr, is the l-th feature with respect to each category in t. < - > p denotes the expectations
with respect to the distribution P. The constraints essentially force the expectation under
the empirical distribution P consistent with the expectation under model distribution Q.

The solution could be proved to take the format:

 

q(t|a:) = exp (t(b + ”LUTIIT))

1
Z(.r)

,where Z (r) = Zr exp (t(b + 'szr)) is the partition function. 10 and b are the parameters
to optimize, which could be computed through numerical optimization methods.
To extend to the multi-label case, the correlations between pair-wise labels are added

to the framework:
< tit)“ >Q=< til’j >p,V1 S 'i Sj S m.

The solution of the Multilabel Maximum Entropy Model is:

 

gal-7") : exp (tTU) + Rt + Vszr))

1
Z (.r)
,where Z (:r) = 21 exp (t(b + wTr)) is the partition function. Note that t is a binary vec-
tor of labels to indicate whether label j appears in the annotation set. Z (r) is to summarize
all possible combinations of the labels. To predict the assignment of possible labels to a

new test instance, the method enumerates all possible label sets to ﬁnd the most probable

one based on:
{2 Intax tT (12+ fit + III) (5.1)

From the above formula for Multi-label Maximum Entropy Model, we observe that the
number of parameters to be estimated is at the order of 0(m at d + m2), where m is the
number of labels and d is the dimension of data. Due to the large number of parameters
to be estimated, it’s difﬁcult to ﬁnd the optimal solution when the number of labels is

large since we need to compute the normalization factor Z (r), which requires to compute

62

0(2’") conditional probabilities to get the gradient. As a matter of fact, when applying
this method to the text classiﬁcation task [51], the author only considers 10 possible class
labels. Considering the size of labels in our annotation task, this method is not feasible to

our image annotation task.

5.3 Correlated Label Propagation

In the following, we ﬁrst describe the proposed framework for multi-label learning, fol-
lowed by the efﬁcient greedy algorithm using the concept of submodular functions and

discussion of implementation issues in the framework.

5.3.1 Correlated Label Propagation for Multi-Iabel Learning

To motivate our proposed framework, we ﬁrst describe the kemel-based kNN approach,
which is one of very popular learning methods.

Suppose the similarity of any two data points is measured by a kernel function K(-, -) :
Rd x Rd —+ R. Consider the case of single-step propagation. The score of assigning the

j -th class to the test example xt, i.e., 3!. j, could be estimated by

TL

314 = 25700139010 6 51'), (52)

1'21
where 1(j E 8) is an indicator function that outputs 1 when the j—th class belongs to set
8 and is zero otherwise. However, there are two problems with the expression in Equa-
tion 5.2:
o Overestimated Conﬁdence Score. Equation 5.2 assumes that a training example x,-
will propagate all of its class labels to the test example xt according to the similarity
K (xt, xi). This is not necessarily true since maybe only some of the class labels of
x,- should be propagated to xt even though x,- is similar to xt.
0 Independent label Propagation. As indicated in Equation 5.2, each class label is

propagated from training examples to the test example independently of the other

63

class labels. In particular, the computation of the conﬁdence score 40' for the j-th

label is independent from the conﬁdence scores assigned to other class labels.
To resolve the problem of overestimated conﬁdence scores, we replace the equality con-
straint in Equation 5.2 with the following inequality constraint:

n
3m g ZK(xt,x,-)I(j e 5,). (5.3)
£21

The above inequality indicates that the conﬁdence score propagated from training examples
to the test example is upper bounded by the sum of the pairwise similarity K (-, -). Note
that no explicit value of the conﬁdence score th is speciﬁed in the above expression.

To incorporate the label correlation information into label propagation, we consider
the propagation of multiple labels. Let the binary vector 8 be the set of labels that are
propagated from the training examples D to the test example xt. We denote by st(8) the
conﬁdence score of assigning any subset of S to xt. Similar to Equation 5.3, we introduce

the following constraint on the conﬁdence score .9) (S), i.e.,
n
s48) s [Irma-Mons. # a5) (5.4)
i=1

where I (S n S,- # ¢) is used to ensure that only the training examples whose class labels
overlap with the set S are included in computing the conﬁdence score. To link 2“, i.e.,
the conﬁdence score of assigning individual classes, to st(8), i.e., the conﬁdence score of

assigning multiple classes, we assume the following inequality,

m

Z zw-uj e S) 5 5,09). (5.5)
j=1 .

The above inequality implies that for a single data point, the conﬁdence of assigning any
subset of class label set S to xt should be no less than the conﬁdence of assigning the class
label separately in S to xt. Combining Equation 5.5 with Equation 5.4, we obtain

2:31.310 E 5) S ZKWLXJHS (152-7; <15)-
j=1

i=1

The above expression can be simpliﬁed if we present it in the vector form of the class

labels, i.e.,
23(5) 3 ZK<x..x.->I(t<8)7‘t<s.>> (5.6)
i=1

Hence, given m different class labels and multi-labeled training examples D, the conﬁ-
dence z of assigning individual classes to the test example xt is subject to the following
constraints:
71
Vt e {0, 0'". zit s ZK<xi.x.->I<1Tt<s.>>

i=1

2 t 0. (5.7)

Furthermore, we can generalize the indicator function I (:c) to a concave function 0(32),
which we term the Label Kernel Function. Then, the constraints in Equation 5.7 are gener-
alized to the following form:
n
Vt 6 {0,1}”", 2ft 3 ZK(xt,x,j)Q(tTt(Sl-))

2:]

z t 0 (5.8)

A detailed discussion of the label kernel function Q(:1:) appears later.

It is insufﬁcient to identify the appropriate conﬁdence scores 2 only with the constraints.
Thus, we assume that among all the conﬁdence scores that satisfy the constraints in Equa-
tion 5.8, the optimal solution 2 is the one that ‘maximally’ satisﬁes the constraints based

on certain weights. This assumption leads to the following optimization problem for z:

m

max (1 .2. .

zeRlTl Z A k
k=l

’n.
s.t. Vt 6 {0,1}’” :th g ZK(x,,xq)Q(tTt(S,-))
i=1

2 t 0 (59)

where {ak > 0}Zf__1 are the weights for the class labels. Notice that the problem in
Equation5.9 is a linear programming problem, and therefore the solution will be on the

extreme points of the region bounded by the constraints.

65

 

Input
0 xt: the test example
0 0120122...Zam>0
Output: optimal label scores (3“, . . . 73ml) for xt
Fork: l,...,m
e Let class label setTk = {1,2, . ..,k.}.
0 f(7'k) = 21:1 K(Xiaxt)9(tT(7lc)t(Si))
' 31,1; = f(Tt-) - fat—1)

 

 

 

Figure 5.1. Algorithm for ﬁnding the optimal solution to Equation 5.9

Despite the simplicity, solving Equation 5.9 efﬁciently is not trivial. This is because:

0 Eﬁ‘iciency: The number of constraints in Equation 5.9 is exponential in the number
of classes m. When m is large (e.g., 100), the number of constraints will be too large

to be handled by any linear programming algorithm.

0 Undetermined Weights: The solution to Equation 5.9 depends on the weights

{00221, whose exact values are difﬁcult to determine.

5.3.2 Efﬁcient Learning Algorithm

In this section, we show that when the label kernel function 9(33) is a concave function, '
there is a simple greedy algorithm for ﬁnding the optimal solution to the problem in Equa-
tion 5.9. Furthermore, the solution only depends on the relative order of weights {ak}g?:1,
and is independent of their exact values. The algorithm for estimating label conﬁdence
scores 2 is summarized in Figure 5.1. This greedy algorithm is based on the following
theorem from discrete optimization [40]:

Given: (1) a ﬁnite set N, (2) a set function f : 2N —+ R with f(¢>) 2 O, and (3) a

66

weight vector w E RIM. Then, the linear programming problem:

max WTX
WERINI
s. t. v.4 g N, Z a:(e) g f(A)

eEA
Ve E N,:L'(e) 2 0

can be solved by the following greedy algorithm if the set function f is submodular:
0 Sort elements ofN as w(e1) 2 w(e2) 2 _>_ w(en)

O LetV0=¢

For 2' =1,. ..,n, let

Vi = Vi—i + 3i, and 33(62') = f(V2;) - f(Vz'—1).

The validity of applying the above theorem to our problem deﬁned in Equation 5.9
relies on the fact that the function f in our algorithm, i.e.,
n
f(u) = Z K(Xt, Xi)9(uTti) (5-10)
i=1
is a submodular function. We present a proof that f is submodular if 0(a) is a concave
function in Section 5.4.
Remark: it is interesting that the kemel-based k Nearest-Neighbor is a special case of

the algorithm in Figure 5.1, given by setting (2(3) = 17.1 This is because

thc = f(7lc)—f(7lc—1)

= Draw.) (Mk) — «mufﬂe-D)
i=1

: gram.) ($100))
i=1

2 ifﬂxbxiﬂw E Si)

i=1

 

1The linear function :1: is both a concave and a convex function.

67

 

 

where ek is the vector whose elements are all zero except that the k-th element is 1. In the
last step of the above derivation, we use the property:

T 1 k E Si
t - =
8‘“ (8’) { 0 otherwise.

5.3.3 Selection of the Label Weights and the Label Kernel Function

This framework contains three key factors: weights {ak}Z‘=1, the label kernel function
9(23), and the similarity function K (xi, 331:)- This section discusses the impact of different
choices for weights {(1021:} and the label kernel function 52(3).

Choice of Weights: The solution returned by the label propagation algorithm is depen-
dent only on the relative order of the weights, {akm‘zl but not the exact values. There are
two straightforward choices for the weights:

1. order the weights 0: to be in the same order as class frequency, namely oz,- 2 aj <—>

Pi 2 193';
2. order the weights to be in the reverse order of class frequency, namely a,- 2 aj <——>

Pi S PJ'-
Above, p,- is the frequency of the i-th class in the training data. A potential problem with the
ﬁrst choice for a is that assigning large weights to the popular classes will mean that those
classes will be selected before the rare classes. Since popular classes are correlated with
many more classes than rare classes, choosing the weights for the popular classes ﬁrst could
cause the test sample to overlap with the labeled samples heavily from the beginning and
thus the label kernel function saturates from the beginning due to the property of concave
function. This is best illustrated using an example. Consider the two classes ‘water’ and
‘whale’, where the former is popular and the latter rare; every time ‘whale’ appears in
the label set of an training data, ‘water’ also appears but not vice versa. If ‘water’ were
chosen before ‘whale’, then no additional overlap information would be introduced when
the weight for ‘whale’ was determined. That is, according to the computation formula,

since the set {water} and the set {water, whale} have the same overlapping with the label

68

set of each of the training example and thus the (2 function has the same value and thus the
conﬁdence for the word ‘whale’ will be zero if we already predicted the word ‘water’. By
contrast, the second choice (selecting weights in reverse order of class frequency) allows
the rare classes to determine their conﬁdence scores before the popular classes. In our
example, this means that new overlapping information is introduced when the weight for
‘water’ is selected after the weight for ‘whale’. For these reasons, our experiments employ
the second choice for the weights.

Choice of the label kernel function: As discussed above, a prerequisite for the label
kernel function 9(a) is that it should be concave. In addition to ensuring that f (u) in
Equation 5.10 is a submodular function, the choice of a concave function is also consistent
with the principle of Decreasing Marginal Returns in Economics. Namely, that more in-
formation is gained from the ﬁrst few observations than from the repeated observation of
the same evidence.

Table 5.1 lists four examples of label kernel functions: the 6 function, the sigmoid
function, the exponential function, and the count function. We also plot the curve for four
different label kernel functions in Figure5.2 (The count function is multiplied with a small
constant to stretch it in order to draw curves on the same graph).

As indicated by the expressions in Table 5.1 and Figure 5.2, these four functions be-
have very differently. The 6 function outputs 1 whenever the input is positive. Thus, no
matter how many class labels are shared between those of a training example and those
of propagation, the amount of label conﬁdence propagated from the training example re-
mains the same. The exponential function is a monotonically-increasing function with a
maximum value of 1 (for :c 2 0). Unlike both the 6 function and the exponential function,
which output zero when the input is zero, the sigmoid function has a non-zero output even
when the input is zero. This allows for any training example to propagate conﬁdence score
to the test example even when its assigned classes do not overlap with the class labels of

propagation. This property plays a role analogous to smoothing in information retrieval

69

 

 

 

 

 

 

 

 

 

' a .3 ..... . ..
0.9 1
0.8 .A" ~
(I) .'
.5 A"
‘5 0.7 _
r: A
a
B 0.6 _.A- "
E A".
Q o 5. _. -
E ”A
n
3 0.4- EA -
‘5 D A”.
a) _ _
a 0‘3 A". --u~ exp func
> . ' + sigmoid func

0.2- . A —+— 8func ‘

.A’ --A~- oountfunc
0.1-5 .
.: _A
"ii 5 10
Number of Ovedapping

Figure 5.2. The curves for different label kernel functions

Table 5.1. Examples of label kernel functions used in experiments

 

 

 

 

 

 

. 0 f = 0
6 function 9(23) = 6(23) = { 1 if: > 0
sigmoid function 9(33) 2 ﬁg
exponential function (2(23) = 1 — 2“”
count function 9(23) = a:

 

 

(e.g., [48]). Finally, as discussed in the previous section, the count function leads to the

standard kemel-based kNN algorithm.

5.3.4 Choice of the Similarity Function K for the Correlated Label Propagation

Framework

One of the key components in the CLP framework is the similarity function K (;), which
could be very domain speciﬁc. When applying to image annotation task, the similarity
could be computed based on the discrete features or the continuous features extracted from

the images. In the following, we will consider two methods to compute this measurement.

70

To compute the similarity based on discrete features, the kernel function K (x, x’) is
computed based on the relevance language model [31], which has been successfully applied
to automatic image annotation [24]. More speciﬁcally, the similarity of a test example xt

to a training example 23,-, denoted by K (xt, xi), is calculated as:
K(Xt,xi) = “0909“) = Hlﬂklxtﬂxt’k (5-11)
k

where

351,): _ 0) 231:1 $331:
Ere 331,th 23' = 1” Zia mud
This similarity could also be constructed based on the continuous features. We take the

p(klxz') = 0 + (1

 

same method as used in the continuous relevance model[32]. The similarity is computed

as:

nzt
K<xt,x.-) = Pr<xtlx.-) = 1‘1 [R 1. m<v~alga>pc<galx0dga (5.12)
a=1
where pR(rq|ga) takes the values of a constant. Gaussian distribution pg(ga|xk) =
1 n I T —l - -
— -= exp 9 — G r.) 2 g — G r- is used to generate the image
nxgrm (a (J (a (3))
features from a model xk. C(rj) is the feature vector of every region of image xk. In this
model, it is assumed that the feature vector 9 in the test image could be generated from

every region rj in the image xk following a Gaussian distribution with rj as the mean and

a diagonal matrix as the covariance matrix}

5 .4 Proof of Function f is a Submodular Function

_Theorem 1 Function f (u) in Equation 5.10 is a submodular function if the label kernel
function 9(1‘) is concave.

Proof

To show that f (u) is submodular, we use the following necessary and sufﬁcient conditions

for submodular functions:

71

ForanysetAngA/andelementeEN\B,
f(AU6)-f(A) 2f(3U6)-f(3)

iff f is a submodular function. Using the binary vector representation of sets, the above

condition can be written as:

f(tM U 6)) - NM» 2 f(t(3 U 6)) - f(t(3))

holds when t(B) : t(A) and t(e)Tt(B) = 0. Using the expression of f (u) in Equa-

tion 5.10, we have

f(t(AU6)) — 1004)) = (5.13)
2 Km. wt) (0(tT(A u ens.» — 0<tT<A>t<si>>)
2'21

and

f(t(3 U 6)) - f(t(3)) = (5-14)

2";be (ct) (0(tT(B u e)t(s,-)) — 0(tT(B)t(s,-)))
Since 6 e N \ B and—A ; B, we have
t(A 0 e) = t(A) + t(e), t(B 0 e) = 1(3) + t(e)
Thus,

(t(«4 U 6) - t(A))Tt(Si) = (W3 U 6) - t(B))Tt(Sz-)

= t(e)Tt(s,-) (5.15)
Furthermore, based on the property A Q A U e, B (_2 B U e, we have

t(A)Tt(s,-) g t(AUe)Tt(s,-), (5.16)

t(B)Tt(s,-) < t(B 0 e)Tt(s,-). (5.17)

72

Now, based on the properties in Equations 5.15 and 5.17, for any concave function (2(2‘),

we have

0<t<A>Tt<s.->) + 0003 u e>Tt<s.->> s

0(t(B)Tt(s,-)) + 0(t(A u e)Tt(s,-)) (5.18)
The above inequality holds because of the following property of the concave function, i.e.,
Q(r6) + Q(y) S 0(1)) + Q(c1)
ifzc Sp,q g yandr+y=p+q. Byletting

a: = t(A)Tt(s,-), y = t(B 0 e)Tt(s,)
q = t(B)Tt(s.-), p = 1M u e>Tt<s.->

we have Equation 5.18.
Finally, substituting the inequality in Equation 5.18 into Equations 5.13 and 5.14, we

obtain

f(tM U 6)) - f(t(A)) Z f(t(3 U 6)) - f(t(3))

5.5 Experiments

In this section, we conduct three sets of experiments. The ﬁrst set focuses on the study of
the pairwise label constraints based on the Multi-label Maximum Entropy Model(MMEM).
Due to its incapability to handle a large set of labels, we only evaluate the algorithm by a
small subset of class labels. The second set of experiments is based on the Correlated Label
Propagation(CLP) framework, which will be evaluated by a large number of class labels.
Finally, we conduct the set of experiments to compare all the methods on automatic image
annotation task.

Chapter 4’s experiments use 5 annotation words to choose parameters. After choosing

the parameters, we will train the model based on the whole set using these parameters. In

73

 

 

 

 

 

 

 

 

 

 

, ' a MMEM
600 ° .. 0.8 . - TM
E500»
:1 = 0.6 a
5400 § = ° . a ;
u" '1 o 4 ° ..
g 300 “ ' ° -
g a n a . o
200 .. 0.2
"’" 1 2 3 4 5 6 7 6 9 10 " 7 2 '3 4 5 6 7 8 0 10
Word Index Word Index
(a) Word Frequency (b) Average Recall
" .. MMEM " - . MMEM
0.6 - TM 0.6 ' ' ' TM
0.5 ° . , 0.5 , . u
= e
2% °-4 - a 0-4 . °
2 o 3 - ° g o 3~ .. '
o. ' . . 1L - . a
0.2- , ° 0.2 . ..
0'1 r - D 0.1 * ;

 

 

 

 

 

 

2 76910

4 5 6
Word Index

(c) Average Precision

n
U

1

2 78010

21 5 6
Word Index

(d) Average F Measure

Figure 5.3. Performance for individual words based on the Translation Model(TM) and the Maxi-
mum Entropy Model(MEM). The words are ordered by decreasing frequency.

this section’s experiments, the performances are evaluated at different annotation cut point,

each point of the top k ranked class labels ([1: ranging from 1 to 10).

5.5.1 Application of the Multi-label Maximum Entropy Model to Automatic Image

Annotation

To further investigate how pairwise correlations ﬁt our exploration of the correlation be-

tween labels, we apply the Multi-label Maximum Entropy Model to the image annotation

task. Due to its incapability to handle large sets of annotation labels, we consider a subset

of 10 different class labels, which is also the number of labels used in the paper where the

model was developed[51]. To have enough support of training examples for this method,

74

 

 

 

 

-20 .. -
o
-40 .. O -
o
o O
m -50 c o o0 O O o o a
O O o
o 0 o o 0
db C9 0
o
...80 1. O o O ..
O O O o
o
O O O
o
-100 - ° 2 .
o
_12 r r r r r 1
-8.4 -0.3 -0.2 -0.1 0 0.1 0.2 0.3
Correlation

Figure 5.4. The correlation among categories and their corresponding parameters of R in Multilabel
Maximum Entropy Model for the Dataset.

we choose the common words in the dataset. 1000 examples are used for training and
500 examples are used for testing. Since there is no ranking information in the results of
Multi-label Maximum Entropy Model and the maximum number of predicted words from
the Multi-label Maximum Entropy Model is 4, we thus choose 4 annotation words for the
translation model to make them comparable. We plot the frequency of each of the annota-
tion words and their corresponding performances in Figure 5.3. From Figure 5.3(a), which
shows the frequency of each of the individual words, we observe that although we choose
a small set of common words as labels, the distribution is still very skewed. Some words
happen more frequently than the rest. We also observe that the Multi-label Maximum
Entropy Model performs better than the Translation Model on the relatively less frequent
words, especially for the average recall, while the Translation Model performs better than
the Multi—label Maximum Entropy Model on the common word set.

To further investigate the role of the correlation in this framework, we investigate the

75

 

common words uncommon words
common words ~74.3507 -93.77ll
uncommon words -93.771I -53.6874

 

 

 

 

 

 

 

Table 5.2. Average correlation between the common words and rare words.

relationship between the Pearson correlation coefﬁcients and the values of R in Equation
5.1 for pairs of labels. The result is listed in Figure 5.4. The Pearson correlation coefﬁcient
between label t,- and tj is deﬁned as,

2013' - t—j)(tz‘k -t—1~)

(TL _ 1)Stjstk

 

Tt.

J’tlc = (519)

, where t}- is the average of label tj across all the images and st]. is the sample standard
deviation for label t j.

We can observe a general trend from these data. When the correlation gets larger, the
coefﬁcient values of R, the values of correlation between two labels used in Equation 5.1,
also get larger. This demonstrates that the values of R obtained from the framework reﬂect
the correlation between the labels.

We further investigate the values of R between individual words. We divide the words
into two groups: the top 5 frequent words as common words and the remaining 5 words as
rare words. We compute the average values of the R matrix among each set of words and
between different sets of words. The result is listed in Table 5 .2.

From the table, we observe that the rare words have the largest R values among them-
selves. This indicates that the Multi-label Maximum Entropy Model tends to predict the
words as rare words due to the fact that they have a larger R values and thus the model has

good performance on rare words but not on the common words.

5.5.2 Application of the Correlated Label Propagation Framework on Discrete Fea-
tures
In this experiment, we compute the similarity between the test image and the training im-

ages based on the discrete representation of images and the formula is in section 5.3.4. The

76

 

 

    

 

    
   
 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

120 0.4 E
1001 ref",
5 . r" g = 0.3
, _Ar‘—J— J, (U
8 80* :Z»—--“' 8
...l 9‘ 0:
6'2 60- goz
"6 f '- ' ’
E 401.,{1’ + Exponential Func S —-— Exponential Func
g ,r’ —~— Sigmoid Func < 0_1. —~— Sigmoid Func
20. —-- 8 Func ' -+- 5Func
-+- Count Func -+- Count Func
“1 3 4 6 6 7 8 10 "1 3 4 5 6 7 8 1o
Annotation Cut Point Annotation Cut Point
(21) #ReLQuery (b) Average Recall
0 25 v 0 25
c 02 x.’ ““m.-. ‘ 93 02 7‘
.9 . ‘ g
.8 ’ a ,'
$0.15 , £0.15 , ”m--. ..............
h ’1’ 4 ‘ __-—w ~—
r'; 1’ LI— ,J
U) o 1 5‘ 8) 0 1- ,0" .
g ' —-— Exponential Func E ' I’, 1’ —-— ExponentIal Func
3 —-— Sigmoid Func 2 / + SIgmord Func
0.05- -+. 5 Func < 0.05 w- 6 Func
-*—- Count Func -+- Count Func
3 4 6 6 7 8 10 " 3 4 6 6 7 6 1o
Annotation Cut Point Annotation Cut Point
(c) Average Precision (d) Average F Measure

Figure 5.5. Performance comparison of correlated label propagation framework based on different
label kernel functions.

parameter ,0 is set to be 0.9 as suggested in paper[24]. The value of the exponent in the
exponential label kernel function is set to 0.6 through the evaluation set.

We ﬁrst compare the performances of different label kernel function 0(a). Figure 5.5
shows the performances of the four kernel label functions on the COREL data set. Note
that the count function corresponds to the kernel—based kNN approach.

From Figure5.5, we observe that, among the four label kernel functions, the 6 function
performs the Worst for almost all ranks. This is due to the property of the 6 function that
gives a constant value regardless of the number of class labels overlapping between the
assigned class labels of training examples and the propagated class labels. Second, the per-

formance of the three kernel label functions, namely the sigmoid function, the exponential

77

function, and the count function, differ signiﬁcantly when the cut-off rank is small. Once
the cut-off rank is large (e.g., 5 for the COREL dataset), the three label kernel functions
deliver similar results. This is because when the number of selected class labels is large,
all of the label kernel functions will be able to identify more or less the similar set of class
labels. As a result, the F measure of all three label kernel functions are close to each other
when the cut-off rank is large. Third, we see that, among the four functions, the exponential

function appears to provide the best or close to the best performance.

5.5.3 Application of the Correlated Label Propagation Framework on Continuous

Features

In this section, we evaluate the performance of the CLP framework on the set of continu-
ous features. That is, we compute the similarity measurement in the CLP framework based
on the Equation 5.12 in section 5.3.4. For the baseline model, we choose the Continuous
Relevance Model(CRM). To compare them with the models based on discrete features, we
also include the method with the best performance on discrete features, the CLP framework
with the exponential function as the label kernel function. The results at different annota-
tion cut points are presented in Figure 5.6. We observe that the CLP framework based on
the exponential function achieves best performance and is better than the continuous rel-
evance model, especially when the number of annotation words is less than or equal to 4.
This might be related to the average number of annotation words for this dataset, which is
3.5. We can also observe that the performance of the CLP framework based on continuous
features are much better than the performance based on discrete features.

This demonstrates that while the more powerful model could improve the performance,
the simple modiﬁcation of the similarity measurement could be another effective way to

improve the performance.

78

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

140 0.7
120 ..--y«4"-*"""““‘ 0.6
E‘ 100 , > = 05 .::_3
3 1'" § - —
«1 30 ,1' tr 0.4
g , 8, -
"6 6° CLPEXP $03 + CLPEXP
s TM ,2 —- TM
E 40‘ RTM 0'2 4' -.._. RTM
20 - RSTM 0.1 -+- RSTM
RELEVANCE —>— RELEVANCE
“1 3 4 5 6 7 6 1o "1 3 4 5 6 7 8 1o
Annotation Cut Point Annotation Cut Paint
(a) #ReLQuery (b) Average Recall
0.4 0.35 _ _._-.
, 0.3 ,,..~’
c I ‘~ :‘nx‘ 2
g 0'3 'i “ ff-.-“ .3 0.25
‘3 “V.\_ g 1.],
L- E 0.2 f’ll
‘L 0.2 L ,
8, w 0 15 ’
E _._ CLPEXP ~—( 1:» - —-— CLPEXP
g _._ TM 2 0 1 —°— TM
<04 -+~ RTM : ' -+- RTM
'+- RSTM o_o5 ‘+- RSTM
_._ RELEVANCE —+— RELEVANCE
‘ T l 10 2 A A 10

 

3 4 5 6 7 6
Annotation Cut Point
(0) Average Precision

 

3 4 5 6 7 8
Annotation Cut Point

(d) Average F Measure

Figure 5.6. Performance measurement for the CLP framework with continuous features on differ-
ent annotation cut points

5.5.4 Comparison of All the Proposed Models on Discrete Features

In this section,we compare the performances of the enhanced translation models, the rele-
vance model, and the CLP framework on automatic image annotation task based on discrete
features. The result is given in Figure 5 .7. Same as before, we use the union of the words
for evaluation, which in total is 141 words, and the performance is evaluated at different
cut points.

From Figure 5.7, we can see that overall the CLP framework with the exponential func-
tion performs best among all the models. When we have 5 annotation words, the average F

measure for the method reaches 0.1975, while the next best, regularized symmetric trans-

79

 

 

 

 

 

 

6 go...
3 o
0 <11
.1 0:
g 30.2
6
E >
g < 0.1
RSTM
RELEVANCE RELEVANCE

6 4 5 6 7 6
Annotation Cut Point
(3) #Ret.Query

 

    

 

 

 

 

 

 

 

 

 

 

3 4 5 6 7 a
Annotation Cut Point
(b) Average Recall

10

 

  

 

     

 

 

 

 

 

 

 

0.22 - . 0.2
0.2 o 12f--:.:__ ‘ T.
C s. _ —"'- _._- *—
3 0_18_ § 0.15 , 1
g o
._ 0.16 E
:3 ' 1. 0.1
$014“ 81 —-— CLPEXP
‘- ' E —0— TM
2 012‘ RTM °’ 0 05
< i . -+- RTM
0.1 RSTM »+- RSTM
RELEVANCE —>— RELEVANCE
no 1 1 1 n
“" 3 4 5 6 7 8 10 “1 2 10
Annotation Cut Point

(c) Average Precision

3 4 5 6 7 6
Annotation Cut Point

((1) Average F Measure

Figure 5.7. Performance comparison of different models. TM: Translation Model. RTM: Regu-
larized Translation Model. RSTM: Regularized Symmetric Translation Model. CLPEXP: the CLP
framework with the Exponential Function as the Label Kernel Function, RELEVANCE:Relevance
Model

lation model is 0.1706, followed by the regularized translation model at 0.1600 and the
original translation model with only about 0.1310.

We further split the annotation words into two groups according to their frequency: 30
most common words in the union set and the remaining 111 words as uncommon words.
We further compare the common word performance and the uncommon word performance.
The result is listed in Figure 5.8and Figure 5.9 respectively.

From these ﬁgures, we can see that for the common words, the uanslation model per-
forms best since it predicts most of the annotation words as common words. The second

is the CLP framework with the exponential function, while the two modiﬁed translation

80

models do not perform well on the common words.

However, for the uncommon words, the trend is reversed. The two modiﬁed translation
models perform very well and the CLP framework with exponential function performs
second best and the translation model performs very bad on the rare words.

While the two modiﬁed translation models reduce the likelihood to predict the common
words by boosting the performance of the rare words, it over-corrected the performance of
the common words. The CLP framework reaches a balance between the performance for

common words and rare words and obtains a good overall performance.

5.5.5 A Further Exploration of the Conﬁdence Scores in the CLP Framework

To further investigate the different impact of the label kernel function for the CLP frame-
work. We plot the values of the f function in the computation procedure for the exponential
function and the count function. The results are shown in Figure 5.10. The :6 axis is the
order of labels in the computation procedure, starting from the least frequent word. We
observe that the f function is a nondecreasing function due to the fact that both the label
kemel functions increase as more labels are introduced in the computation procedure. At
the beginning, the f values of both functions are close to O and the changes of the values are
very small. This is because the f function is determined by two parts, both the similarity
function K and the label kernel function (I. At the beginning, because we start with the
rare words, there is very little overlap with the annotation words in training data. Most of
the values obtained from the label kernel functions are zero and only a small number of
training images are considered to be similar to the test image and contribute to the value
of f function. The two curves increase more as it comes to more and more popular words,
which have more overlapping with the training data. We also plot the change of f values of
the neighboring labels, which is the conﬁdence score for the corresponding label. From the
ﬁgure, we observe that for the CLPEXP model, the change of the f values decreases when

the curve extends to the common words. This is because as it becomes close to common

81

words, the overlapping between the label set for the test image and the training images get
larger. However, the amount of increase is decreased because the increase of the exponen-
tial label kernel function 9 is decreased due to the property of concave function. Although
more similar images are introduced into the computation of the f values, the overall in-
crease is decreased. However, the count function still gives big boost to the change of the
f values, since every new overlapping introduces the whole similarity to the f values and

thus the conﬁdence score.

82

 

 

+ CLPEXP

0.6' + TM

-+- RTM

§ 0.5 -+- RSTM
-+- RELEVANCE

 

 

 

 

  
 
   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

01 1 1. :1 5 a 1 1 0
Annotation Cut Point
(a)Average Recall
+ CLPEXP;
0.4 -°- TM
1: -+- RTM
5% -+- RSTM
'6 - +- RELEVANCE
2 0.3 1
O.
0 1
O)
‘3
$0.2
01111151111
Annotation Cut Point
(b)Average Precision
0.35 . . 4 . . . -
0.3
2
a 0.25»
8
2 0.2»
l I
810-15; —+-CLPEXP
9 ’ .5 _._ TM
6; 0.11 x
< . -+- RTM
0'05‘?’ -+- RSTM
- + - RELEVANCE
01 1 .1 1‘1 6 1 ‘ é
Annotation Cut Point
(c)Average F Measure
Figure 5.8. Performance compari-

son of different models for 30 most
common words. TM: Translation
Model. RTM: Regularized Transla-
tion Model. RSTM: Regularized Sym-
metric Translation Model. CLPEXP:
the CLP framework with exponen-
tial function as the label kernel func-
tion,RELEVANCE: Relevance Model

10

83

 

.0
.rs

 

 

 

 

+ CLPEXP
+ TM ,w-j'
-+' RTM IZf,—’*"

0'3 -+- RSTM :6"
-~—- RELEVANCE

Average Recall

 

  
   

 

 

10

 

 

 

 

 

 

 
 

 

 

 

 

1 2 3 4 6 6 7 6 0
Annotation Cut Point
(a)Average Recall
0.2 . 1N.
C
.9
a
O
9
m
0
g . + CLPEXP
1; , —-— TM
(0.054 -+- RTM
-+- RSTM
-+- RELEVANCE
01 1 1 :1 s e 1 1 9
Annotation Cut Point
(b)Average Precision
0.2 . . . . . . a

10

 

0.15:

 

Average F Measure
0
-L

 

     

 

 

 

 

 

005* -4” RTM
. -+— RSTM
1 -+- RELEVANCE
01 2 3

4 5 6 7 6
Annotation Cut Point
(c)Average F Measure

Figure 5.9. Performance compari-
son of different models for the re-
maining rare words. TM: Translation
Model. RTM: Regularized Translation
Model. RSTM: Regularized Symmet-
ric Translation Model. CLPEXP: the
CLP framework with exponential func-
tion as the label kernel function. REL-
EVANCE: Relevance Model

10

 

 

 

 

 

 

 

 

 

 

   

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CLPEXP Discrete ‘ CLPEXP Continuous
—— CLPCOUNT Discrete 12- —- CLPCOUNT Continuous
3" 10.
a 3 8~
E 2 01
> >
1- 4’
2.
G1 50 100 160 200 250 300 $0 Zoo 01 50 100 150 200 250 300 350 400
0rd ndex Word Index
(a1) f values on discrete features (a2) f values on continuous features
-14
2X10 . . . ~ - . 0.7
: 0.6L
8 i S 0 5
s a s '
LI- : ; l-L 0.4"
"5 1 : g “5 ;
81 i 1 810-3’ :
8 I i 1% : .
’ 1 - 1 ' 1
5 t i . : 5 0'2 : i i i' i:
' 1 1 I I 1
i i g : 0.1~ 1: l 5 ll ;
: . . . 1 : 1J3; 1'. i s i: I
0 1.: 1 I " .11 1311 ii J.I.J...(Ii.L-11L K o H
1 50 100 150 200 250 300 350 400 1 50 100 150 200 250 300 350 400
Word Index Word Index
(b1)CLPEXP on discrete features (b2)CLPEXP on continuous features
6x10'“ , . 2 L
5L g : ,
5 : E g 31.51 : 1
§4' i i : § : I
l 1 E : u- : 5
”53 : : I "6 1 : . :
81 i I i g i : g
62- 5 E 5 1 5 5 5 i
'5 5 5,.5 8 0.5- 5 5 ;‘E
. . . I1 1 :
1 i E :i g : Si 1 E j E; E
01.4 .41.” u‘ :1 \I L:1I.:L':5J.1 c EAL :$ . 1'. Lnia“ it;
1 50 100 150 200 250 300 350 400 1 50 100 150 200 250 300 350 400

Word Index Word Index
(c1) CLPCOUNT on discrete features (c2) CLPCOUNT on continuous features

Figure 5.10. Values for f function and their change

84

CHAPTER 6

Conclusion and Future Work

Automatic image annotation is to generate a set of textual words to describe the content
of images. With annotated textual words, a number of tasks related to images could be
constructed, such as image retrieval, image browsing etc. The key in this procedure is
to construct a machine learning model to correlate the image features with textual words.
Then we can apply this model to annotate new images. This thesis presents several statis-

tical models to conduct automatic image annotation.

6.1 Summary

We ﬁrst make an extensive study on the statistical translation model. We identiﬁed the
problems of the statistical translation model, which is the overly estimated common words.
Then we propose two methods to enhance the statistical translation model. One is based on
limiting the number of blobs associated with each of the word, the regularized translation
model. Another one is based on two kinds of translation probabilities: forward translation
model, which is the traditional translation model; backward translation model, which is
to translate from words to image blobs. By minimizing the difference between these two
kinds of translation probabilities, better results are produced.

However, in all the previous translation models, the labels for images are assumed to

be independent of each other. This assumption might ignore one of the very important

85

information in image annotation task, the correlation between annotation words. In addi-
tion, the translation models use discrete features and thus miss the information encoded in
continuous image features.

Based on this observation, we propose to develop models based on the correlations
between labels. We ﬁrst study the Multi-label Maximum Entropy Model, which explores
the correlation information through pairwise constraints between labels. However, due to
the computation issue, it’s infeasible to apply this model to our task. We then present the
Correlated Label Propagation (CLP) framework, which explores the correlation between
large set of labels. We show that our framework could have an efﬁcient greedy algorithm to
obtain the optimal solution and thus apply to large set of labels. Furthermore, the continu-
ous features could be incorporated into our framework to achieve even better performance.
One thing worth pointing out is that we prove that the kernel based K Nearest Neighbor

approach is a special case of our CLP framework.

6.2 Future Work

0 Better Clustering Results. From the previous study, we know that the quality of the
clustering results is one of the key issues for the models based on the discrete features.
The blob information in our experiments is obtained through K-means algorithm. It
does not consider the annotation words while performing the clustering procedure.
How to incorporate this word information into the clustering procedure could be
one of the methods to obtain better clustering results. Another problem is how to
determine the appropriate number of clusters, the size of the visual vocabulary. In
this study, the number of clusters is predeﬁned as 500. Determine more appropriate

number of clusters could be another way to obtain better clustering results.

0 Better Similarity Measurement. We also observe that the similarity measurement

plays a key role in our Correlated Label Propagation framework. After we apply the

86

similarity measurement computed based on continuous features, the performance is
improved substantially compared with the similarity measurement computed from
discrete features. How to obtain a better similarity measurement is one of the key

issues in further improving the performance.

Converting images into textual words is a very challenging task, especially when one
image could have multiple annotation words. However, it’s also a very interesting topic and
holds the promises to constructing better image search engines and many other applications

such as image browsing.

87

BIBLIOGRAPHY

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan. Matching
words and pictures. Journal of Machine Learning Research, 321107—1135, 2003.

[2] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, and D. Forsyth. The effects of segmen-
tation and feature choice in a translation model of object recognition. In CVPR ’03:
Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pages 675—682, 2003.

[3] K. Barnard and D. Forsyth. Learning the semantics of words and pictures. In ICCV

’01: Proceedings of the International Conference on Computer Vision, pages 408-
415, 2001.

[4] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy approach to
natural language processing. Comput. Linguist, 22(1):39—71, 1996.

[5] D. M. Blei and M. I. Jordan. Modeling annotated data. In SIGIR ’03: Proceedings of
the 26th Annual International ACM SIGIR Conference on Research and Development
in Informaion Retrieval, pages 127—134, 2003.

[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine
Learning Research, 3:993—1022, 2003.

[7] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer. The mathematics
of statistical machine translation: Parameter estimation. Computational Linguistics,
19:263-311, 1993.

[8] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data
Min. Knowl. Discov., 2(2), 1998.

[9] E. Chang, K. Goh, G. Sychay, and G. Wu. Cbsa: Content-based soft annotation for
multimodal image retrieval using bayes point machines. IEEE Transactions on Cir-
cuits and Systems for Video Technology Special Issue on Conceptual and Dynamical
Aspects of Multimedia Content Description, 13(1):26—38, 2003.

88

[10] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language
modeling. In Proceedings of the Thirty-Fourth Annual Meeting of the Association for
Computational Linguistics, 1996.

[11] D. Comaniciu and P. Meer. Mean shift analysis and applications. In ICCV ’99:

Proceedings of the International Conference on Computer Vision-Volume 2, pages
1197—1203, 1999.

[12] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analy-
sis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603—619,
2002.

[13] S. Deerwester, S. T. Dumais, G. W. Fumas, T. K. Landauer, and R. Harshman. In-
dexing by latent semantic analysis. Journal of the American Society of Information
Science, 41(6):391—407, 1990.

[14] P. Duygulu, O. C. Ozcanli, and N. Papemick. Comparison of feature sets using multi-
media translation. In ISCIS XVIII - Eighteenth International Symposium on Computer
and Information Sciences, 2003.

[15] H. Feng and T.-S. Chua. A bootstrapping approach to annotating large image collec-
tion. In MIR ’03: Proceedings of the 5th ACM SIGMM International Workshop on
Multimedia Information Retrieval, pages 55—62, 2003.

[16] D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall,
2003.

[17] N. Ghamrawi and A. McCallum. Collective multi-label classiﬁcation. In CIKM ’05:
Proceedings of the 14th ACM international conference on Information and knowledge
management, pages 195—200, 2005 .

[18] S. E. Grigorescu, N. Petkov, and P. Kruizinga. Comparison of texture features based
on gabor ﬁlters. IEEE Transactions on Image Processing, 11(10):] 160-1 167, 2002.

[19] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
Springer, New York, 2001.

[20] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine
Learning Research, 1:245—279, 2001.

[21] T. Hofmann. Learning and representing topic. a hierarchical mixture model for word
occurrences in document databases. In Proceedings of the Conference for Automated
Learning and Discovery ( C ONALD ), 1998.

[22] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Ma-
chine Learning, 42(1-2):177-196, 2001.

89

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

T. Hofmann and J. Puzicha. Statistical models for co-occurrence data. In A] Memo
I625, CBCL Memo 159, Artiﬁcial Intelligence Laboratory and Center for Biological
and Computational Learning, Cambridge, MA, USA, 1998. Massachusetts Institute
of Technology.

J. Jeon, V. Lavrenko, and R. Manmatha. Automatic image annotation and retrieval
using cross-media relevance models. In SIGIR ’03: Proceedings of the 26th Annual
International ACM SIGIR Conference on Research and Development in Informaion
Retrieval, pages 119—126, 2003.

F. Kang and R. Jin. Symmetric statistical translation models for automatic image an-
notation. In Proceedings of the ﬁﬁh SIAM International Conference on Data Mining,
pages 616—620, 2005.

F. Kang, R. J in, and J. Y. Chai. Regularizing translation models for better automatic
image annotation. In CIKM ’04: Proceedings of the Thirteenth ACM Conference on
Information and Knowledge Management, pages 350—359, 2004.

F. Kang, R. Jin, and R. Sukthankar. Correlated label propagation with application to
multi-label learning. In CVPR ’06: Proceedings of the 2006 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, pages 1719—1726, 2006.

P. D. Kobus Barnard and D. Forsyth. Clustering art. In CVPR ’01: Proceedings of the
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recogni-
tion, pages 434—439, 2001.

E. B. Kong and T. G. Dietterich. Error correcting output coding corrects bias and
variance. In ICML ’95: Proceedings of the 24th International Conference on Machine
learning, pages 313—321, 1995.

V. Lavrenko, M. Choquette, and W. B. Croft. Cross-lingual relevance models. In
SIGIR ’02: Proceedings of the 25th annual international ACM SIGIR conference on
Research and development in information retrieval, pages 175-182, 2002.

V. Lavrenko and W. B. Croft. Relevance based language models. In SIGIR ’01:
Proceedings of the 24th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pages 120—127, 2001.

V. Lavrenko, R. Manmatha, and J. Jeon. A model for learning the semantics of pie-
tures. In Advances in Neural Infomzation Processing Systems I 6, 2003.

J. Li and J. Z. Wang. Automatic linguistic indexing of pictures by a statistical model-
ing approach. IEEE Trans. Pattern Anal. Mach. Intell., 25(9): 1075—1088, 2003.

90

[34] O. Maron and T. Lozano-Prez. A framework for multiple-instance learning. In NIPS
’97: Proceedings of the 1997 conference on Advances in neural information process-
ing systems 10, 1998.

[35] O. Maron and A. L. Ratan. Multiple-instance learning for natural scene classiﬁca-
tion. In ICML ’98: Proceedings of the Fifteenth International Conference on Machine
Leaming, pages 341-349, 1998.

[36] T. Mitchell. Machine Learning. McGraw Hill, 1997.

[37] F. Monay and D. Gatica-Perez. On image auto-annotation with latent apace models.
In MULTIMEDIA ’03: Proceedings of the eleventh ACM International Conference
on Multimedia, pages 275—278, 2003.

[38] F. Monay and D. Gatica-Perez. Plsa-based image auto-annotation: constraining the
latent space. In MULTIMEDIA ’04: Proceedings of the 12th Annual ACM Interna-
tional Conference on Multimedia, pages 348—351, 2004.

[39] Y. Mori, H. Takahashi, and R. Oka. Image-to—word transformation based on divid-
ing and vector quantizing images with words. In Proceedings of the International
Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.

[40] R. G. Parker and R. L. Rardin. Discrete Optimization. Academic Press, 1988.

[41] N. d. F. Pinar Duygulu, Kobus Barnard and D. Forsyth. Object recognition as machine
translation: Learning a lexicon for a ﬁxed image vocabulary. In ECCV’ 02, Proced-
dings of the Seventh European Conference on Computer Vision, pages 97-112, 2002.

[42] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: Current techniques, promising
directions and open issues. Journal of Visual Communication and Image Representa-
tion, 10(4):39—62, 1999.

[43] L. G. Shapiro and G. C. Stockman. Computer Vision. Prentice Hall, 2001.

[44] J. Shi and J. Malik. Normalized cuts and image segmentation. In CVPR ’97: Pro-
ceedings of the 1997 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, pages 731—737, 1997.

[45] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(8):888—905, 2000.

[46] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov networks. In Advances in
Neural Information Processing Systems I6, 2003.

[47] T. P. Weldon, W. E. Higgins, and D. F. Dunn. Efﬁcient gabor ﬁlter design for texture
segmentation. Pattern Recognition, 29(12):2005-2015, Dec. 1996.

91

[48] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied
to information retrieval. ACM Trans. Information Systems, 2(2): 179—214, 2004.

[49] C. Zhai and J. D. Lafferty. TWO-stage language models for information retrieval. In
SIGIR ’02: Proceedings of the 25th Annual International ACM SIGIR Conference on
Research and Development in Informaion Retrieval, pages 49—56, 2002.

[50] D. Zhang and G. Lu. Content based shape retrieval using different shape descriptors:
A comparative study. In 2001 IEEE International Conference on Multimedia and
Expo (ICME2001 )1 Pages 317—320, 2001.

[51] S. Zhu, X. Ji, W. Xu, and Y. Gong. Multi-labelled classiﬁcation using maximum
entropy method. In SIGIR ’05: Proceedings of the 28th annual international ACM

SIGIR conference on Research and development in information retrieval, pages 274—
281, 2005.

92

l

     

1111111111111111111111111111