This is to certify that the dissertation entitled Automatic Image Annotation presented by Feng Kang has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science. Major Professor's Signature. Date. MSU is an affirmative-action, equal-opportunity employer.

AUTOMATIC IMAGE ANNOTATION

By

Feng Kang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

2007

ABSTRACT

AUTOMATIC IMAGE ANNOTATION

By

Feng Kang

Automatic image annotation is the task of annotating an image with a set of textual words. The key step in this process is to learn a statistical model that correlates image features with annotation words. To construct the statistical model, we start with a set of training images, each of which has a set of accompanying annotation words. Typically, images are first segmented into multiple homogeneous regions. Image features such as color and texture are then extracted to represent each image region. The image regions in the whole collection can also be grouped into clusters, so that each image region can be converted into its corresponding cluster id, called a blob. In this way, we obtain a discrete representation of the images. The correlation between annotation words and image features, either discrete or continuous, is constructed with a statistical model. Finally, given a new test image, the same set of image features is extracted, and words are predicted according to the relationship between image features and annotation words established by the learned statistical model.

In this thesis, we explore the automatic image annotation task through a series of statistical models. One model based on the discrete feature representation is the translation model, which constructs the correspondence between blobs and annotation words through a set of translation probabilities. Because common words co-occur with many more blobs than rare words, the original translation model overestimates the common words and degrades the overall performance. We thus propose two enhanced translation models that improve the original translation model by incorporating different prior information about the desired translation probabilities into the model. One prior ensures that each word is associated with a similar number of blobs, measured by the average of the translation probabilities from different blobs to the word. The other prior considers the translation model from two directions: the forward translation model, which translates from blobs to words, and the backward translation model, which translates from words to blobs. The prior specifies that the translation probabilities from these two models should be consistent with each other. Our empirical results demonstrate the improved performance of the two enhanced translation models over the original translation model. However, there are still two problems with the translation models. First, they do not consider the correlation between annotation words when making the prediction.
Secondly, they are based on the discrete representation, which potentially loses information encoded in the continuous features. The correlation information is difficult to explore since the possible number of correlated words is exponential. We propose a Correlated Label Propagation (CLP) framework to explore the correlation between annotation labels. Based on the properties of submodular functions, this framework can be solved by a very efficient greedy algorithm and is thus applicable to a large set of labels. In addition, continuous image features can be incorporated into the CLP framework. Our results show that the CLP framework outperforms the translation models, and that its performance is boosted further once the continuous features are incorporated.

In summary, this dissertation shows that 1) the performance of the original translation model can be improved by incorporating different priors; 2) effectively exploring the correlation information between labels can improve the overall performance; 3) similarity measurement is very important in label propagation, and similarity measurement based on the continuous image features can achieve better performance.

Copyright by FENG KANG 2007

ACKNOWLEDGMENTS

There are a lot of people I would like to thank for their help in the writing of this thesis. Firstly, I would like to thank my advisor, Dr. Rong Jin. This thesis would have been impossible without his guidance and support. Not only did he guide the writing of the thesis, but he also corrected many parts of it in great detail. I also thank him for his support and help during my Ph.D. study, which allowed me to explore this interesting field. I would also like to thank the other committee members: Dr. Joyce Chai, Dr. Pang-Ning Tan, and Dr. Selin Aviyente. They provided lots of helpful suggestions and discussions. I would also like to thank other fellow students, Yi Liu, Liu Yang, Feilong Chen, Ming Wu, Tong Wei, Shaoling Qu, Chen Zhang, Matthew Gerber, Zahar Prasov, and Haibing Chen, for their help during my Ph.D. study.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

1 Introduction
1.1 Methods in Automatic Image Annotation
1.2 Issues in Automatic Image Annotation
1.2.1 Skewed Distribution of Annotation Words
1.2.2 Correlation between Annotation Words
1.3 Outline of the Thesis

2 Images Preprocessing
2.1 Feature Representation
2.1.1 Color
2.1.2 Texture
2.1.3 Shape
2.2 Image Segmentation
2.2.1 Image segmentation as clustering
2.2.2 Image Segmentation as Fitting
2.2.3 Probabilistic Methods for Image Segmentation
2.2.4 Segmentation by Integrating Semantic Information
2.2.5 Evaluation

3 Statistical Models for Image Annotation
3.1 Image Annotation Based on Quantized Image Regions
3.1.1 Translation Model for Automatic Image Annotation
3.1.2 Cross Media Relevance Model
3.1.3 Latent Semantic Analysis for Image Annotation
3.1.4 Probabilistic Latent Semantic Analysis for Image Annotation
3.2 Image Annotation Based on Continuous Features of Image Regions
3.2.1 Hierarchical Model for Image Annotation
3.2.2 Latent Dirichlet Allocation for Image Annotation
3.2.3 Continuous Relevance Model
3.2.4 Classification Method
3.2.5 Bootstrapping and Co-training Approach for Image Annotation
3.2.6 Linguistic Indexing of Pictures by a Statistical Modeling Approach

4 Enhanced Translation Model for Automatic Image Annotation
4.1 Regularized Translation Model
4.1.1 Framework for the Regularized Translation Model
4.1.2 An EM Algorithm for the Regularized Translation Model
4.2 Symmetric Translation Model
4.2.1 Discrepancy between Forward and Backward Translation Models
4.2.2 Symmetric Translation Model through Regularization
4.2.3 Derivation of the EM Algorithm for Regularization based Symmetric Translation Model
4.3 Experiments
4.3.1 Impact of Prior on the Regularized Translation Model
4.3.2 Effects of Exploring Correlation between Forward and Backward Translation Models

5 Image Annotation through Label Correlation
5.1 Problem Definition of Multi-label Learning
5.2 Pairwise Label Correlation based on Multi-label Maximum Entropy Model
5.3 Correlated Label Propagation
5.3.1 Correlated Label Propagation for Multi-label Learning
5.3.2 Efficient Learning Algorithm
5.3.3 Selection of the Label Weights and the Label Kernel Function
5.3.4 Choice of the Similarity Function K for the Correlated Label Propagation Framework
5.4 Proof of Function f is a Submodular Function
5.5 Experiments
5.5.1 Application of the Multi-label Maximum Entropy Model to Automatic Image Annotation
5.5.2 Application of the Correlated Label Propagation Framework on Discrete Features
5.5.3 Application of the Correlated Label Propagation Framework on Continuous Features
5.5.4 Comparison of All the Proposed Models on Discrete Features
5.5.5 A Further Exploration of the Confidence Scores in the CLP Framework

6 Conclusion and Future Work
6.1 Summary
6.2 Future Work

BIBLIOGRAPHY

LIST OF TABLES

4.1 An example of forward and backward translation models for two image blobs (i.e., b1 and b2) and two words (i.e., w1 and w2).
5.1 Examples of label kernel functions used in experiments.
5.2 Average correlation between the common words and rare words.

LIST OF FIGURES

1.1 Procedure for automatic image annotation.
4.1 The distribution of term frequency for annotated words in a subset of the COREL data.
4.2 Dirichlet distribution for different values of α.
4.3 The joint probabilities p(w, b) for the word 'clouds' that are computed using the forward and backward translation models.
4.4 Performance measurement for different α values of the Regularized Translation Model.
4.5 Performance measurement for different λ values for the Regularized Symmetric Translation Model.
5.1 Algorithm for finding the optimal solution to Equation 5.9.
5.2 The curves for different label kernel functions.
5.3 Performance for individual words based on the Translation Model (TM) and the Maximum Entropy Model (MEM). The words are ordered by decreasing frequency.
5.4 The correlation among categories and their corresponding parameters of R in the Multi-label Maximum Entropy Model for the dataset.
5.5 Performance comparison of the correlated label propagation framework based on different label kernel functions.
5.6 Performance measurement for the CLP framework with continuous features on different annotation cut points.
5.7 Performance comparison of different models. TM: Translation Model. RTM: Regularized Translation Model. RSTM: Regularized Symmetric Translation Model. CLPEXP: the CLP framework with the exponential function as the label kernel function. RELEVANCE: Relevance Model.
5.8 Performance comparison of different models for the 30 most common words. TM: Translation Model. RTM: Regularized Translation Model. RSTM: Regularized Symmetric Translation Model. CLPEXP: the CLP framework with the exponential function as the label kernel function. RELEVANCE: Relevance Model.
5.9 Performance comparison of different models for the remaining rare words. TM: Translation Model. RTM: Regularized Translation Model. RSTM: Regularized Symmetric Translation Model. CLPEXP: the CLP framework with the exponential function as the label kernel function. RELEVANCE: Relevance Model.
5.10 Values for the f function and their change.

CHAPTER 1

Introduction

Image retrieval started as text-based image retrieval in the late 1970s. By manually annotating images with descriptive texts, existing text-based Database Management Systems were utilized to perform image retrieval. However, image retrieval based on manual annotations has two major drawbacks:

• It is expensive to hire large numbers of people to annotate large-sized image collections.

• Given that human perception of image content is highly subjective, different people might give different interpretations of the same pictures, depending on their knowledge, experience, mood and other circumstances.

To overcome the problems of image retrieval based on manual annotation, people have proposed content-based image retrieval, in which images are indexed and represented by their visual features, such as color, texture, and shape. Content-based image retrieval relies on the low-level visual feature representation of images, thus avoiding the imprecise human annotation. A comprehensive survey on this subject can be found in [42]. While content-based image retrieval solves the problems related to text-based image retrieval, it introduces several new problems:

• While images could be retrieved based on their features such as color and texture, it is usually more natural and desirable for users to search image databases using textual queries. This is because textual queries allow users to express their information needs at the semantic level instead of the level of preliminary image features.

• Compared to content-based image retrieval, textual queries usually provide more accurate descriptions of users' information needs. For example, consider that a user is looking for images of tigers.
Suppose he uses an image query, which consists of a photo of a tiger on the grass. Based on the match of image features, many of the retrieved images will be pictures of the grass without any tigers. This is because it is unclear to the system what the user is searching for, the grass or a tiger. In contrast, textual queries such as 'photos of tigers' are able to convey the information need of the user more clearly.

To search images based on textual queries, it is preferable that the annotation words for a large image collection be automatically generated based on a model built from a small number of manually annotated training images. To search for images, users can simply pose textual queries, and the relevant images are retrieved based on the matching between the textual queries and the automatically generated annotations for images. Unlike traditional text-based image retrieval, the automatic annotation by a computer program avoids the expense and subjectivity of the manual annotation.

Figure 1.1 illustrates the basic steps for automatic image annotation: an image is first segmented into multiple homogeneous regions; then image features, such as color, texture and shape, are extracted to represent each image region; finally, annotation words are predicted based on the visual features of segmented regions using a statistical model that is learned from the training data.

Figure 1.1. Procedure for automatic image annotation (segmentation, feature extraction, machine learning model, predicted words such as 'eagle', 'sky', 'mountain').

1.1 Methods in Automatic Image Annotation

One important issue with automatic image annotation is how to represent image regions. One type of approach represents every image region with the extracted continuous features, such as color, texture, and shape. Another type of approach is to represent the image regions with discrete features from a visual vocabulary. To obtain the visual vocabulary, we first cluster the image regions of the whole collection into a small number of groups and represent each region by its cluster id, i.e., which cluster the region belongs to. These cluster ids are called blobs and they are the terms in the visual vocabulary. We can then convert the image regions in each image into their corresponding cluster ids and obtain the discrete representation. Compared to the representation using extracted features, the quantized representation can dramatically simplify the computation in automatic image annotation, but at the price of potentially losing information that is encoded in the extracted continuous features.

The key to automatic image annotation is to learn annotation models that can automatically predict annotation words given extracted image features. Most automatic annotation methods apply machine learning techniques to first learn the correlation between image features and textual words from the examples of annotated images and then apply the learned correlation to predict words for unseen images. Based on what kind of image representation the models use, we group the annotation methods into two categories: the models based on discrete features and the models based on continuous features.

The models based on discrete features include: the machine translation model[41], the co-occurrence model[39], latent space approaches[37, 38], and relevance language models[24]. The co-occurrence model[39] collects the co-occurrence counts between words and image features and uses them to predict annotated words for images.
Duygulu et al.[41] improved the co-occurrence model by utilizing machine translation models. Another way of capturing co-occurrence information is to introduce latent variables that link image features with words. Methods in this category include latent semantic analysis (LSA)[37] and probabilistic latent semantic analysis (PLSA)[38].

The models based on continuous features include: classification approaches[9], the continuous relevance language model[24], the hierarchical aspect model[28], the Gaussian Mixture Model (GMM), Latent Dirichlet Allocation (LDA), and correspondence LDA[5]. The classification approaches for automatic image annotation treat each annotated word as an independent class and create a different image classification model for every word. Work such as linguistic indexing of pictures[33] and image annotation using SVM and the Bayes point machine[9] falls into this category. The continuous relevance language model[32] is an application of the relevance model[31] to continuous features. The hierarchical aspect model[28], the Gaussian Mixture Model (GMM), Latent Dirichlet Allocation (LDA), and correspondence LDA[5] are methods that introduce the correspondence between annotation words and image features through latent variables. We will discuss these methods further in Chapter 3.

1.2 Issues in Automatic Image Annotation

In this section, we discuss some common issues in the image annotation task.

1.2.1 Skewed Distribution of Annotation Words

When we annotate the images, we simply provide the annotation words as a description of the images and do not identify the corresponding image regions. Thus the computer is not sure which image region corresponds to which specific word. In addition, some common scenes, such as 'water' and 'sky', appear frequently in many images while the major subjects may differ across images. As a result, the annotation words follow a very skewed distribution: a few words occur very frequently and most of the words occur rarely. This is particularly a problem for the translation model, since the translation probability is based on the co-occurrence statistics between annotation words and blobs. The translation probabilities for the common words are thus much higher, and the rare words have fewer opportunities to appear in the annotation set. We thus propose two enhanced translation models to alleviate this problem.

• Regularized Translation Model. One reason that the common words are overestimated is that the common words are associated with more blobs than rare words. One strategy to alleviate this problem is to force each word to be associated with a similar number of blobs. The number of blobs associated with each word can be measured by the average of the translation probabilities to the word. This measurement is treated as a prior to be incorporated into the original translation model to obtain a better set of translation probabilities.

• Symmetric Regularized Translation Model. Typically, the translation model translates blobs to words. This model assumes that each of the blobs can be translated into one of the words. If the common words have higher probabilities associated with the blobs, rare words will have a smaller probability of appearing in the annotation set. Another view of the translation model is to translate from words to blobs, which assumes that each of the words can be translated into one of the blobs. In this way, the rare words can be associated with blobs with higher probability.
Our prior enforces the consistency of these two sets of translation probabilities.

1.2.2 Correlation between Annotation Words

We know that many of the annotation words do not exist alone; they are correlated with each other. For example, the word 'grass' may appear more often with the word 'tiger' than with the word 'ship'. A 'whale' is usually related to the word 'water' but may never occur with 'grass'. How to effectively explore the correlation between annotation words is very challenging given the large number of annotation labels in our task. While this problem could be explicitly handled by specifying which sets of labels are correlated together, it becomes a combinatorial problem and is difficult to scale to a large number of annotation words. We investigate two approaches to this problem.

• Multi-label Maximum Entropy Model. This model takes pairwise label constraints into consideration by specifying which pairs of labels are correlated and how strong the correlation is. However, due to computational issues, this model cannot be applied to our annotation task.

• Correlated Label Propagation (CLP) Framework. We propose a correlated label propagation framework to take the correlation information into account. Label propagation is not only based on the similarity between test images and training images; we also consider the correlation between labels. We show that this framework has an optimal solution based on the properties of the submodular function. In addition, this solution can be obtained through an efficient greedy algorithm and is thus able to scale to a large set of labels. Furthermore, this framework can incorporate continuous features, which are missing in models based on discrete features, such as the translation model.

1.3 Outline of the Thesis

In Chapter 2, we will provide some background information about image preprocessing, which mainly focuses on feature representations and image segmentation. In Chapter 3, we discuss the two categories of statistical models built for automatic image annotation, based on whether they employ a discrete or continuous representation of the images. Chapter 4 focuses on the translation models. We will discuss in detail their overestimation of common words. We then propose two enhanced translation models to alleviate this problem and demonstrate the improved performance. Chapter 5 presents the methods that consider correlations between labels. The Correlated Label Propagation (CLP) framework is proposed and demonstrates improved performance over discrete features. We also show that the CLP framework can incorporate continuous features to achieve better performance than the discrete features. Chapter 6 summarizes this thesis.

CHAPTER 2

Images Preprocessing

2.1 Feature Representation

Visual features of images can be classified into two categories[42]: general image features and domain-specific features. The general image features include color, texture, and shape. The domain-specific features are application dependent. For example, to detect and recognize human faces, special image features are required to describe the characteristics of human faces. In this survey, we focus on the general image features, particularly the features that are critical to automatic image annotation[41].

2.1.1 Color

Color is one of the most widely used visual features in image retrieval.
Compared to other image features, color is relatively robust to background complication and independent of image size and orientation[42]. It is typically represented in a color space. Each color space involves two basic components: the primaries and the matching function. The primaries of a color space describe the basic and independent components of colors. The matching function of a color space determines the weight of the primaries in matching a source color. Suppose the primaries are P1, P2, P3 and the matching functions are f1, f2, f3. Then any color could be expressed as: f1*P1 + f2*P2 + f3*P3. One typical example of a color space is the RGB Color Space, in which single-wavelength primaries are used (645.16 nm for R, 526.32 nm for G and 444.44 nm for B). Another popular color space is CIE L*u*v, which is a more uniform space than RGB and is obtained by a projective transformation of CIE XYZ [16], which yields the CIE u, v space:

u' = \frac{4X}{X + 15Y + 3Z}, \qquad v' = \frac{9Y}{X + 15Y + 3Z}    (2.1)

Color information of an image is usually represented by its statistics. There are two commonly used statistics for representing color information:

• Color histogram. In image retrieval, the color histogram is the most commonly used color feature representation[42]. It is easily computed and in many cases very effective for image retrieval. However, one disadvantage of the color histogram is that it is usually a sparse representation with zero values for most of its entries. As a result, it can be sensitive to small distortions of images when applied to image retrieval[42].

• Color layout. The color histogram is usually viewed as a global feature given that it is computed based on the color distribution of the entire image. Although it has shown a certain degree of discriminative power in image retrieval, one disadvantage of any global feature is that it is unable to provide an accurate description of details in images and thus tends to give many false positives when the size of the image collection is large. In contrast, color layout methods take into account the local distribution of color features. One example of a color layout method divides images into multiple blocks and extracts image features for each block. In [34, 35], the author proposed an approach, named 'single blob with neighbors' or SBN, in which an image is divided into multiple sub-images and each sub-image is represented by the mean values of the RGB components of the sub-image itself and its four neighboring blocks (i.e., up, down, left, right). This approach can be further improved by organizing sub-images into more complicated structures such as the quad-tree.

2.1.2 Texture

According to [42], textures are the homogeneous visual patterns that do not result from the presence of only a single color or intensity. Some of the commonly used texture representations include:

• Co-occurrence matrix. This method represents image texture using the spatial correlation of gray levels of different pixels [43]. It first constructs a co-occurrence matrix based on the orientation and distance between image pixels and then extracts meaningful statistics from the co-occurrence matrix as the texture representation.

• Tamura psychological perspectives. Tamura explores the texture representation from psychological perspectives[16]. It identifies six visual properties that are related to textures: coarseness, contrast, directionality, line-likeness, regularity, and roughness.
Tamura representations of texture differ from the co-occurrence representation in that their texture properties are visually meaningful, whereas many of the texture properties extracted from the co-occurrence matrix representation are not.

• Spectral transform. This method applies the wavelet transform or the Fourier transform to obtain texture representations in the frequency domain. One disadvantage of the Fourier transform is that the computation of the transform requires the pixel information from the entire image. As a result, the Fourier transform is unable to capture the local structure of images, which is important in representing images. To overcome this problem, one strategy is to use Gabor filters[18, 47], in which a Gaussian kernel is put on top of the Fourier transform so that the transform is performed within a neighborhood determined by the standard deviation of the Gaussian kernel. Another strategy is to use the wavelet transform. In order to capture the local structures of images, the wavelet transform uses 'wavelets' as its basis functions, which are usually generated from a mother function by translation and contraction.

2.1.3 Shape

Shape is a very important visual feature in human perception. A good shape representation should be invariant to translation, rotation and scaling. Most shape representations can be classified into two categories: region-based methods and boundary-based methods.

Boundary-based methods represent the shape of a region by its outer boundary. They can be further classified into three sub-categories[50]:

• Global shape descriptors include area, circularity, eccentricity, and axis orientation. These shape features can only distinguish shapes with large dissimilarity.

• Shape signatures utilize the local features of a shape, including the complex coordinates, the curvature and the angular features. They are usually sensitive to noise and therefore are not robust. In addition, they require intensive computation during similarity calculation, due to the hard normalization of rotation invariance.

• Spectral descriptors apply a spectral transformation to the shape signatures to acquire the shape representation in the frequency domain. The most successful shape representation is the Fourier Descriptor[42], which applies the Fourier transform to the shape signatures.

In the region-based representations, all pixels within a shape region are taken into account to obtain the shape representation. The most successful representation is the moment descriptor [42], which is invariant to both translation and scaling.

2.2 Image Segmentation

The goal of image segmentation is to group pixels of similar properties into clusters. According to [16], methods for image segmentation can be classified into three categories:

• Image segmentation based on clustering.

• Image segmentation based on fitting.

• Image segmentation based on probabilistic methods.

In addition, recently there have been a number of studies on image segmentation at the semantic level[2], which will also be reviewed below.

2.2.1 Image segmentation as clustering

Image segmentation can be viewed as a clustering problem in that pixels of similar visual properties are clustered into a small number of groups. For any clustering algorithm, there are two key components:

• Distance measurement between any pair of data points or clusters. Commonly used distance measurements are:

1. The single link, i.e., the shortest distance between any pair of data points in two clusters.
2. The complete link, i.e., the maximum distance between any two points in two clusters.

3. The group average link, i.e., the average distance between any pair of data points in the two clusters.

• The number of clusters. Usually, this is a very difficult problem to handle. A dendrogram[16], a hierarchical representation structure of clusters, can be used to see whether the clusters are good or not and to help the user choose the clusters. Other commonly used approaches include information criteria (e.g., AIC and BIC)[19].

The commonly used clustering algorithms for image segmentation are:

• Image segmentation by the K-means clustering algorithm. The K-means algorithm[16] is one of the most commonly used clustering algorithms. It minimizes the within-cluster distance, which is calculated as the sum of distances of each data point in a cluster to its center. The cluster memberships of the data points are calculated through an iteration of the following two steps[16]:

1. Given the current estimation of cluster centers, each data point is assigned to the cluster whose center is closest to the location of the data point.

2. Given the cluster memberships of the data points, the new center of a cluster is recomputed as the average of the data points assigned to the cluster.

• Image segmentation by graph-theoretic clustering. A clustering problem can be viewed as a problem of graph partitioning[16], in which each data point corresponds to a vertex in a graph, and the weight for each edge that connects two vertices is equivalent to the similarity of the corresponding two data points. Identifying clusters of similar data points is equal to dividing the corresponding graph into multiple disjoint sets with only edges of small weights removed. Central to the graph partitioning approaches is the similarity measure for any pair of data points. Different image features could be used to define the similarity measurement, such as the similarity in intensity, color, texture and motion.

One of the popular algorithms in this category is the Normalized Cut[44, 45]. It divides a graph into two disjoint sub-graphs such that the ratio of the graph cut to the total affinity (i.e., similarity) within each sub-graph is minimized. In particular,
It first identifies and stores all possible structures for each pixel, and then searches for structures that are commonly shared by many pixels. The Hough transform has been used to identify lines, curves, surfaces etc. It is advantageous in that it does not require computing analytical solutions of certain equations. How- ever, the Hough transform is usually sensitive to noise[l6], which can lead to the identification of phantoms. 0 Curve fitting based on generative models. 14 A fitting approach based on a generative model assumes a probabilistic model that describes the probabilistic relationship between the observed pixels and the under- lying curves. Usually, pixels are assumed to be generated from certain curves such as lines under Gaussian noises. Parameters governing the underlying curves are esti- mated through certain criteria, such as the least square criterion[ 16]. 2.2.3 Probabilistic Methods for Image Segmentation The Expectation Maximization (EM) algorithm[36] has been demonstrated to be an effec- tive approach for missing data problems. Image segmentation can be viewed as a missing data problem, in which each image segment is assumed to be generated from a mixtures of probabilistic models and the missing information is the description of each probabilistic model. The EM algorithm computes the segmentation in the alternation of the E-step and the M-step: o E-step: the EM algorithm estimates the segment membership for each pixel; o M-step: the optimal parameters of mixture model are estimated based on the segment memberships that are estimated in the E-step. These two steps are iterated until the EM algorithm converges to its local optimum. In addition to image segmentation, the EM algorithm can also be applied to fit lines. Similar to applying the EM algorithm to image segmentation, in the E-step, we estimate the likelihood for each pixel to be in a line. Usually, this likelihood is proportional to its distance to the line and is cut by a threshold value. In the M-step, a maximum likelihood approach is used to re-estimate the parameters of the line. To determine when the iterative procedure should be stopped, we need to test convergence of the algorithm, which often is based on the change of size of the line and also the sum of distances from the points to the line [16]. 15 Despite of the success of the EM algorithm for image segmentation, a general problem with the EM algorithm is its local minimum, which means that the choice of starting point will have a great impact on the quality of final solutions. One solution is to first apply the Hough transform to obtain the initial solution for the EM algorithm, or we can start the EM algorithm with different initial solutions, and search for the best solution[16]. 2.2.4 Segmentation by Integrating Semantic Information According to previous studies [2], image segmentations based on low-level features usually will not result in desirable object recognition. This is because the same object may exhibit very different distributions of image features in different images or even in the same image. For example, a penguin has white and black halves, and it is hard to acquire meaningful segmentation results just based on the low-level features. In [2], a solution is proposed by associating each segmentation region with a meaningful word. Then, two neighboring regions are merged when they have been assigned the same annotation words. By doing this, we are able to avoid creating too many image segments. 
2.2.5 Evaluation We have discussed different types of image features and image segmentation methods. One important yet unresolved issue is how to measure the ‘goodness’ of feature extraction and image segmentation. That is, the user wants to know how good a set of image features are or how good the segmentation is. Evaluation based on human judgment can only be done on small scales. For a large image collection, automatic mechanisms are required to measure the quality of image feature extraction and segmentation. One automatic approach evaluates the performance of different approaches based on the annotations of images. Word annotations for images are automatically generated based on the selected image features and segmentation methods. These automatically generated annotations are then compared against the ground truth and the resulting annotation accu- l6 racy is used as a proxy to the performance of image feature extraction and segmentation. Based on this method, [2] compares the importance of different image features. The empir- ical study showed that the L*a*b color space is more effective than the RGB color space. For image segmentation algorithms, it showed that the Normalized Cut algorithm performs significantly better than the Blobworld segmenter. The mean-shift algorithm[ll, 12] for image segmentation is somewhere between these two algorithms [2]. [14] compared commonly used image features including size, position, color and tex- ture with a complicated set of features from visual content descriptions in MPEG-7. Sur- prisingly, the empirical study showed that the commonly used image features achieve better performance than the complicated ones. Overall, previous studies show that no single set of image features can perform well for all different image collections and thus it is likely that the choice of image features depends on the characteristics of specific applications and datasets. l7 CHAPTER 3 Statistical Models for Image Annotation Image annotation describes the content of images with a set of textual words. This set of annotation words can be obtained by manual annotation from humans. However, man- ual annotation is expensive and requires extensive labor work. Also, human annotation is very subjective. Even for the same picture, different people might use different annotation words. Thus automatic image annotation is preferred. In this chapter, we discuss statistical models for automatic image annotation. This type of annotation is divided into two cate- gories. The first one annotates the images based on quantized image regions. The second one annotates the images based on the raw image features. 3.1 Image Annotation Based on Quantized Image Regions One type of approaches toward automatic image annotation first clusters image regions into a small number of groups, which are called ‘blobs’ in [41]. The approaches then learn the correlation between annotated words and blobs. Automatic clustering methods, such as the K-means algorithm, can be applied to cluster image regions and get their cluster ids. One advantage of this discrete representation is that it simplifies the image represen- tation dramatically, thus significantly reducing the computational cost of automatic image annotation. Before discussing formal models for automatic image annotation based on the discrete 18 representation, we will first formalize this problem. Let the collection of annotated images be denoted by T, and the size of the collection be denoted by |T|. 
Each annotated image J,- E J is divided into multiple regions. All the image regions are clustered into a small number of blobs b231, big, ..., him. Then each image can be represented by a vector of blobs and annotation words, i.e., J,- = {62117:} = {bi,1,b,;,2, ...,bi,m;wi,1,wi,2, ...,wian}. Here, m and n are the number of blobs and annotation words, respectively; bid is the number of j-th blob that appears in the i-th image; rum- is a binary variable, which is 1 when the j-th word appears in the i-th image and zero otherwise. The key to automatic image annotation is to learn the statistical correlation between the blob representation and the word representation of images. To construct this kind of models, the first task is to quantize the image regions. One method is to perform a clustering procedure on the segmented image regions. The resulting clusters are the quantized image regions and their cluster ids are treated as the discrete representations of the images. That is, if one image region in the image belongs to one cluster, the image region is converted to its cluster id. The translation model for automatic image annotation[4l] and the cross media relevance model[24] use this kind of quantized representation. Another way to quantize the image regions is to construct discrete features directly, such as the color histogram. This kind of representation is used in models such as Latent Semantic Analysis and Probabilistic Latent Semantic Analysis[37, 38]. 3.1.1 Translation Model for Automatic Image Annotation The translation model was originally developed for language translation[7], e.g. translating from French text to their English equivalent. [41] views the process of annotating images as a process of translating information from a ‘visual language’ to textual words. The lexicon of the visual language is the blobs. Compared to other models for automatic image annotation, such as the relevance models[24], statistical machine translation models for automatic image annotation have the advantage that words are annotated to each image 19 region instead of to the whole image. Based on the IBM model 1 of translation [41], given an annotated image J,- = _) {bin—17;} = {bi‘1,bi,2, ...,bi,m;wi,1,w,,2, ---”wi,n}, the probability of annotating image ._) blobs bi71,b.i,2,...,b.i,m with words ll’z',1,wz',2,---,?Ui,m i.e.,p(fiT;-, b; ), can be expressed as follows: _’ n _) T1. 771 plan.) = I] point») = I] Z tjibn. (3.1) where tjjk stands for the probability of translating the k-th blob into the j—th word and is subject to the constraint :31ch = 1, namely each blob has to be translated into one of the annotated words. The key to a translation model for image annotation is the set of translation probabilities tJ-JC. These probabilities can be obtained by maximizing the likelihood of annotated training images T, i.e.: lTl _, ITI n m 1(T) = pri’ilbil = H H 2 tj,kbi,k (32) 1'21 i=1j=l k=1 The Expectation-Maximization (EM) algorithm[41] is applied to find the optimal solu- tion for Equation3.2. It iteratively updates the translation probabilities using the following equation: ,, , . old 0. ’ ld Z k El (’14th i (3.3) where tglg and t???" are the translation probabilities learned in the previous and current iteration, respectively. Z k is a normalization factor that ensures 2 j t j,k = 1. 
According to Equation 3.3, a common word may have large translation probabilities for many different blobs since it appears in many different annotations and therefore its term frequency wij is non-zero for a large number of annotated examples. This could lead to overestimated translation probabilities for common words. In Chapter 4, we will discuss some methods to alleviate this problem, including a regularized translation model [26] and a symmetric translation model [25]. 20 3.1.2 Cross Media Relevance Model The relevance language model was originally designed for ad-hoc information retrieval[31] and cross-language information retrieval[30] to determine the appropriate language model, namely the probability of observing a word w in the documents relevant to a specific in- formation need p(w|R). The key in estimating the query language model is to determine which documents are likely to be relevant to a given query. Assuming query words are sampled independently from a multinomial distribution, the probability for a word to be relevant to an information need could be approximated as: k P(ur)¢::|R ZP(M)P(w|M)HP(q,-(A»1) (3.4) i=1 Here M is one of the document language models that are estimated from a collection of documents, to is a query word, and R stands for the relevance language model for a given query. The relevance language model can be extended to image annotation and image retrieval tasks, called ‘Cross Media Relevance Model’, or CMRM for short. Each annotated image in the training set is assumed to be a candidate model M in Equation 3.4. Given an image I = b1,b2, ..., bn to be annotated, following the relevance language model, probability p(wk 2 III) can be estimated as: ITI ITI Mun. =1|J)O
=(1—m)—L—.JI+#—~L—r3.z m ’
#(w_k, J) is the frequency of word w_k in the annotated image J, and #(w_k, T) is the number of occurrences of word w_k in the whole collection. #(b_i, J) is the frequency of blob b_i in the annotated image J, and #(b_i, T) is the number of occurrences of blob b_i in the whole collection. \alpha_J and \beta_J are smoothing
parameters and are determined by the held out data set. The essential idea of the relevance
language model for automatic image annotation is to propagate the words for the annotated
images to a test image based on their similarity to the training images.
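A minimal sketch of this propagation idea, written against the smoothed estimates in Equations 3.5 and 3.6, is given below. The smoothing constants, the uniform prior P(J), and the data layout are assumptions for illustration only.

import numpy as np

def cmrm_annotate(test_blobs, train_blob_counts, train_word_counts,
                  alpha=0.1, beta=0.9):
    """Score words for a test image with the Cross Media Relevance Model.

    test_blobs:        length-m vector of blob counts for the test image.
    train_blob_counts: |T| x m matrix, #(b, J) for each training image J.
    train_word_counts: |T| x n matrix, #(w, J) for each training image J.
    Returns a length-n vector proportional to P(w = 1 | test image).
    """
    # |J|: total number of blobs and words in each training image
    img_len = (train_blob_counts.sum(axis=1, keepdims=True)
               + train_word_counts.sum(axis=1, keepdims=True))
    # smoothed per-image distributions (Equation 3.6)
    p_w_J = ((1 - alpha) * train_word_counts / img_len
             + alpha * train_word_counts.sum(axis=0) / train_word_counts.sum())
    p_b_J = ((1 - beta) * train_blob_counts / img_len
             + beta * train_blob_counts.sum(axis=0) / train_blob_counts.sum())
    # log-likelihood of the test blobs under each training image
    log_like = (test_blobs * np.log(p_b_J + 1e-12)).sum(axis=1)
    weights = np.exp(log_like - log_like.max())     # uniform P(J) assumed
    weights /= weights.sum()
    return weights @ p_w_J                          # Equation 3.5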
Another usage of the relevance language model is to apply it to image retrieval for
textual queries. To bridge the gap between textual queries and image features, the above
relevance language model will be first applied to generate a ‘visual’ language model for
blobs based on the set of annotated images and the query words. Then, the estimated
‘visual’ language model is used to determine the relevance of images in the collection.
More detailed description can be found in [24].
One obvious difference between the relevance language model and the translation
model is that the translation model assigns annotation words to different image regions
while the relevance language model only acquires annotation words for the entire im-
ages. Thus, unlike the translation model for automatic image annotation, which requires
appropriate alignment between image regions and annotation words, the relevance lan-
guage model avoids explicitly modeling the correlation between image blobs and annota-
tion words, which makes it easy to implement and robust in practice.
3.1.3 Latent Semantic Analysis for Image Annotation
Latent Semantic Analysis (LSA) originates from textual retrieval and document
analysis[13]. It maps a high dimensional representation of a document, which often is
the term frequency vector of the document, into a low dimensional space, which is also
called the latent semantic space[13]. On one hand, latent semantic analysis can be viewed
as a dimension reduction method, and therefore is effective for alleviating the sparse data
problem that commonly exists in text-related applications. On the other hand, latent se-
mantic analysis is able to represent the document information beyond the lexical level. By
aggregating related words into concepts based on word co-occurrence patterns, LSA is able
to capture certain semantic correlations among words such as synonyms.
LSA is based on the Singular Value Decomposition (SVD). Suppose X is a term-document matrix of dimensionality t x d, where t is the number of documents and d is the size of the vocabulary. X_{ij} is the frequency of word j in document i. X can be decomposed into three matrices using SVD as

X = T_0 S_0 D_0^T    (3.7)

Here T_0 and D_0 are orthonormal matrices and S_0 is a diagonal matrix. If we keep the first k largest singular values in S_0 and set the others to 0, we obtain another matrix X' = T_0 S_0' D_0^T that minimizes the quantity |X - X'|^2 among all matrices X' of rank k. This operation also removes the corresponding columns of T_0 and D_0, respectively, giving the matrices T and D. The columns of the matrices T and D form the basis vectors for the latent space of documents and terms, respectively. We can then map each document and each term
into this space and compute their similarities.
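The rank-k reduction described above can be sketched with a standard SVD routine as follows; the matrix orientation (documents as rows) follows the description above, and k is a free parameter chosen by the user.

import numpy as np

def lsa_project(X, k):
    """Project a document-term matrix X (documents x terms) into a k-dimensional latent space.

    Returns (doc_vectors, term_vectors): the documents (rows of X) and the terms
    (columns of X) mapped into the space spanned by the top-k singular directions.
    """
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 * diag(s0) * D0t
    T = T0[:, :k]                 # keep the k largest singular values
    S = np.diag(s0[:k])
    D = D0t[:k, :].T
    doc_vectors = T @ S           # document coordinates in the latent space
    term_vectors = D @ S          # term coordinates in the latent space
    return doc_vectors, term_vectors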
There is a strong analogy between textual documents and images. In textual documents,
the word sense ambiguity exists in two forms: polysemy, where a word corresponds to
multiple different meanings, and synonymy, where multiple words correspond to the same
meaning. Similarly, in the image domain, the same distribution of image features can be related
to different objects under different contexts. Meanwhile, the same object can show different
visual properties that lead to different distributions in the space of image features. Because
of this analogy, latent semantic analysis, which has been demonstrated to be a powerful tool
in document analysis, can be applied to automatic image annotation[37]. In [37], color
histograms are used to represent images. Each entry in the histogram of every segment
is viewed as a different 'visual' word; in total there are 648 different 'visual' terms in the
‘visual’ vocabulary. Then, the LSA is applied to reduce the representation of images to the
latent space, which has fewer dimensions. The annotation of test images is computed as
the average annotations of training images that are weighted by their similarity to the test
image in latent space.
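Under the scheme of [37], a test image is thus annotated with the average of the training annotations weighted by latent-space similarity. The sketch below assumes cosine similarity and a simple SVD fold-in for the test histogram; both are illustrative choices rather than the exact configuration of [37].

import numpy as np

def lsa_annotate(test_hist, train_hists, train_annotations, k=40):
    """Annotate a test image by similarity-weighted voting in the LSA latent space.

    test_hist:         length-d 'visual word' histogram of the test image.
    train_hists:       |T| x d histograms of the training images.
    train_annotations: |T| x n binary word matrix of the training images.
    """
    T0, s0, D0t = np.linalg.svd(train_hists, full_matrices=False)
    basis = D0t[:k, :].T / s0[:k]                 # fold-in mapping into the latent space
    train_latent = train_hists @ basis
    test_latent = test_hist @ basis
    sim = train_latent @ test_latent
    sim /= (np.linalg.norm(train_latent, axis=1) * np.linalg.norm(test_latent) + 1e-12)
    sim = np.clip(sim, 0, None)                   # keep only positive similarities
    return sim @ train_annotations / (sim.sum() + 1e-12)   # weighted average annotation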
3.1.4 Probabilistic Latent Semantic Analysis for Image Annotation
As pointed out in [22], LSA has difficulties in capturing the concept of polysemy, which
refers to the case when a word has multiple senses. The author proposed the Probabilistic
Latent Semantic Analysis (PLSA) for co-occurrence data, which introduces unobservable
variables, i.e., latent topics or latent classes, to capture the correlation among words based
on co-occurrence statistics between documents and words [22].
Let p(d_i) denote the prior probability for document d_i, p(z_k|d_i) denote the probability for document d_i to be in latent class z_k, and p(w_j|z_k) denote the probability of observing word w_j in latent class z_k. Then, the joint probability p(d_i, w_j) can be written as:

p(d_i, w_j) = p(d_i) \, p(w_j | d_i), \qquad p(w_j | d_i) = \sum_{k=1}^{K} p(w_j | z_k) \, p(z_k | d_i)    (3.8)

Using the equation in (3.8), given a document collection d_1, d_2, ..., d_N, where N is the
number of documents, its log-likelihood could be written as:
L = \sum_{i=1}^{N} \sum_{j} n(d_i, w_j) \log p(d_i, w_j)
  = \sum_{i=1}^{N} n(d_i) \left[ \log p(d_i) + \sum_{j} \frac{n(d_i, w_j)}{n(d_i)} \log \sum_{k=1}^{K} p(w_j|z_k) \, p(z_k|d_i) \right]    (3.9)

where n(d_i, w_j) is the term frequency of word w_j in document d_i, and n(d_i) = \sum_j n(d_i, w_j) is the length of document d_i. The EM algorithm is used to learn the values of the parameters p(z_k|d_i) and p(w_j|z_k) [22].
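A compact sketch of this EM procedure for Equation 3.9 follows. The tempered-EM refinements of [22] are omitted, the initialization is random, and the dense posterior tensor is only practical for small vocabularies; all of these are assumptions made for illustration.

import numpy as np

def plsa(N, K, n_iter=100, eps=1e-12):
    """Fit PLSA to a document-word count matrix N (documents x words).

    Returns p_z_d (documents x K) = p(z_k | d_i) and p_w_z (K x words) = p(w_j | z_k).
    """
    D, W = N.shape
    rng = np.random.default_rng(0)
    p_z_d = rng.random((D, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, W))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: p(z | d, w) for every (d, w) pair, shape (D, K, W)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate p(w | z) and p(z | d) from expected counts
        counts = N[:, None, :] * post
        p_w_z = counts.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = counts.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z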
[38] extends the application of PLSA from textual collections to automatic image annotation. Given an image q, the probability of annotating the image with word w_j could be written as:

p(w_j | q) = \sum_{k=1}^{K} p(w_j | z_k) \, p(z_k | q)    (3.10)
[38] also considers that each word encodes more information than image features, and so
words should play a more important role than image features in deciding the structure of the latent space.
Based on this intuition, a variation of the PLSA model, named PLSA-word,
is proposed, in which words and image features are trained separately in the latent space.
More specifically, the learning procedure is divided into two steps:
1. Based on Equation (3.9), a PLSA model is trained only on the annotation word sets
to obtain the probability distribution of the latent class z given an annotated image d,
i.e., P(z|d), and the probability distribution of words w given the latent class z, i.e.,
P(w|z).

2. Fix P(z|d), and train another PLSA model to obtain the probability distribution of visual
features v given latent class z, i.e., P(v|z).
To annotate a test image, the following two steps are performed.
1. Based on the visual features v of the test image and the distribution P(v|z), the
likelihood for the test image to be in latent class z is computed as P(z|d).

2. Using the word distribution of each latent class, i.e., P(w|z), the probability of annotating the test image with a word w is computed as:

P(w | d) = \sum_{k} P(w | z_k) \, P(z_k | d)    (3.11)
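The sketch below illustrates the two annotation steps just listed: it folds in the visual features of the test image to estimate P(z|d) while P(v|z) stays fixed, and then scores words with Equation 3.11. Treating the visual features as a histogram over a discrete visual vocabulary is an assumption made for this illustration.

import numpy as np

def plsa_word_annotate(v_test, p_v_z, p_w_z, n_iter=50, eps=1e-12):
    """Annotate a test image with the PLSA-word model.

    v_test: length-V histogram of visual features for the test image.
    p_v_z:  K x V matrix, P(v | z), learned in training and kept fixed.
    p_w_z:  K x W matrix, P(w | z), learned from the annotation words.
    Returns a length-W vector of word scores P(w | d) (Equation 3.11).
    """
    K = p_v_z.shape[0]
    p_z_d = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior over latent classes for each visual feature
        joint = p_z_d[:, None] * p_v_z
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)
        # M-step: update only P(z | d); P(v | z) stays fixed (step 1 above)
        p_z_d = (post * v_test[None, :]).sum(axis=1)
        p_z_d /= p_z_d.sum() + eps
    return p_z_d @ p_w_z        # P(w|d) = sum_k P(w|z_k) P(z_k|d)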
3.2 Image Annotation Based on Continuous Features of Image Re-
gions
Previously, we reviewed statistical models for image annotation based on the discrete rep-
resentation of image regions. By grouping different image segments into a small number of
clusters, each image segment is mapped to a discrete id. This discretization procedure sim-
plifies the image representation and hence reduces the computational cost. However, one
disadvantage of the discretization procedure is that it loses valuable information encoded
in the image features. There have been many studies on automatic image annotation that
directly use the raw image features extracted from image regions.
3.2.1 Hierarchical Model for Image Annotation
The semantic meaning of different words could differ significantly in their generality. For
example, the word ‘animal’ is more general than the word for a specific animal such as
‘tiger’. To account for different generality of different words, we can organize words into
a hierarchical structure such that general words are put on the top of the structure and the
words with specific meanings are put close to the leaf nodes of the structure. For example,
the word 'animal' will be put as one of the parent nodes of the word 'tiger'. In document
analysis, there have been a few studies on hierarchical models[21, 23], which extract
the hierarchical relations between documents and abstract organization of keywords.
Similar to textual words, the patterns of image features could be organized into a hierar-
chy with each textual word corresponding to a different visual pattern. For example, regions
for ‘sky’ appear more commonly than regions for ‘tiger’. Thus, the goal is to organize both
the textual words and image regions into a hierarchical structure. The general blobs and
words are put at the top levels of the hierarchy that are close to the root, and the blobs and
words with specific meaning are put on nodes close to the leaves. Annotated images are
put to the leaf nodes of the hierarchy, which are considered as clusters. The hierarchical
structure defines a path along which each annotated image generates its image blobs and
annotation words[3, 28]. Refer to paper [3, 28] for an excellent figure to illustrates this
procedure.
Because of the uncertainty in determining the cluster membership for an annotated image, to compute the probability of generating the observations D (i.e., the words and image regions) of an annotated image d, we need to sum over the uncertainty in the distribution of assigning annotated image d to different clusters, i.e.,

p(D|d) = Σ_c p(c) Π_{i∈D} Σ_l p(i|l, c) p(l|c, d)    (3.12)

where p(l|c, d) is the probability of going through level l given the document d and cluster c, and p(i|l, c) is the probability of generating the observation i given the level l and cluster c.
The EM algorithm could be used to train the parameters in the likelihood function. The
probability of generating word observations, p(i|l, c), is computed based on counting.
A Gaussian distribution is used to compute the probability of generating observations of
image regions. More information about the application of the EM algorithm to the hierar-
chical structures could be found in [21, 23].
Given the learned hierarchical structure, the probability of annotating an image B with a word w, i.e., p(w|B), is written as:

p(w|B) = Σ_c p(w|c, B) p(c|B) ∝ Σ_c p(w|c, B) p(B|c) p(c)
       = Σ_c p(w|c, B) Π_{b∈B} p(b|c) p(c)    (3.13)
       = Σ_c Σ_l p(w|c, l) p(l|c, B) Π_{b∈B} (Σ_l p(b|l, c) p(l|c)) p(c)
However, the above model is not a true generative model, since the probabilities of generating words and image segments rely on the specific document. In [1], three different hierarchical models have been developed:
• Model I-0. This model is defined as:

p(D|d) = Σ_c p(c) Π_{w∈W} [Σ_l p(w|l, c) p(l|d)]^{N_{w,d}} Π_{b∈B} [Σ_l p(b|l, c) p(l|d)]^{N_{b,d}}    (3.14)

where N_{w,d} and N_{b,d} denote the number of times word w and blob b occur in document d.

• Model I-1. In this model, the probability p(l|d) of generating a level in Model I-0 is replaced with p(l|c, d), which also depends on the cluster to which the document belongs. The generation probability of the observations is changed to:

p(D|d) = Σ_c p(c) Π_{w∈W} [Σ_l p(w|l, c) p(l|c, d)]^{N_{w,d}} Π_{b∈B} [Σ_l p(b|l, c) p(l|c, d)]^{N_{b,d}}    (3.15)

• Model I-2. Model I-2 further modifies the probability of generating level l and makes it independent of the document, depending only on the cluster. The probability of the observations, sketched in code below, is thus changed to:

p(D|d) = Σ_c p(c) Π_{w∈W} [Σ_l p(w|l, c) p(l|c)]^{N_{w,d}} Π_{b∈B} [Σ_l p(b|l, c) p(l|c)]^{N_{b,d}}    (3.16)
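As a small illustration of Equation (3.16), the sketch below evaluates the Model I-2 likelihood of one annotated document from its word and blob counts. All parameter arrays are hypothetical estimates of the quantities named in the equation.

```python
import numpy as np

def model_I2_likelihood(word_counts, blob_counts, p_c, p_w_lc, p_b_lc, p_l_c):
    """Equation (3.16): p(D|d) for one document.

    word_counts : (W,) counts N_{w,d};  blob_counts : (B,) counts N_{b,d}
    p_c : (C,) cluster priors p(c)
    p_w_lc : (W, L, C) word emissions p(w|l, c);  p_b_lc : (B, L, C) blob emissions
    p_l_c : (L, C) level probabilities p(l|c)
    """
    likelihood = 0.0
    for c, pc in enumerate(p_c):
        # marginalize over levels, then raise each word/blob term to its count
        word_term = np.prod((p_w_lc[:, :, c] @ p_l_c[:, c]) ** word_counts)
        blob_term = np.prod((p_b_lc[:, :, c] @ p_l_c[:, c]) ** blob_counts)
        likelihood += pc * word_term * blob_term
    return likelihood
```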
3.2.2 Latent Dirichlet Allocation for Image Annotation
The multi-type data problem is a general problem in which each object/pattern is represented by
different types of data. Annotated images can be viewed as multi-type data, in which there
are two types of data, the image regions and annotation words. These two types of data
should correlate with each other since they describe the same images. In [5], three models
are introduced to solve the multi-type data problem. In these models, image features are
assumed to follow a multivariate Gaussian distribution with diagonal covariance matrix.
The words are assumed to follow a multinomial distribution.
• Model 1: Gaussian multinomial mixture model
In this model, the latent variable is an indicator of the cluster; both words and image regions are considered as different types of data generated from the same latent variable. Refer to [5] for a figure illustrating the generation procedure of this model. The joint probability for latent class z, annotated words w = {w_1, w_2, ..., w_M}, and image regions r = {r_1, r_2, ..., r_N}, i.e., p(z, r, w), can be formally represented as:

p(z, r, w) = p(z|λ) Π_{n=1}^{N} p(r_n|z, μ, σ) Π_{m=1}^{M} p(w_m|z, β)    (3.17)

where μ and σ govern the Gaussian distributions that determine the generation probabilities of image regions, and β governs the multinomial distributions that determine the generation probabilities of annotation words.
To predict words for a given image, we first compute the probability that the image belongs to a latent class z based on the image region features, and then words are generated from the latent class z. Finally, the words from different clusters are mixed together to obtain the annotation (see the sketch after this list). This is formally described as:

p(w|r) = Σ_z p(z|r) p(w|z)    (3.18)
• Model 2: Gaussian Multinomial Latent Dirichlet Allocation
In the GMM model, the generation of the words and image features is conditioned on the same latent variable. A more flexible model is Latent Dirichlet Allocation (LDA). LDA was first developed for document clustering [6]. Each document is considered to consist of several topics, and the observations in one document are generated from these different topics. For an image collection, we could also view an image as consisting of several different topics. Each of the topics generates the corresponding words and image regions belonging to the topic. The overall observation is a mixture from these different topics. Refer to [6] for this graphical model.
Let latent variables z and v model the different topics from which words and image regions are generated. A variable θ is used to indicate that the two types of topics for words and image regions are correlated. It also models the notion that a single document could consist of multiple topics, which are sampled based on the variable θ. The formal representation of this generation procedure is expressed as:

p(r, w, θ, z, v) = p(θ|α) Π_{n=1}^{N} p(z_n|θ) p(r_n|z_n, μ, σ) Π_{m=1}^{M} p(v_m|θ) p(w_m|v_m, β)    (3.19)
• Model 3: Correspondence LDA
While the GMM model is too restrictive and LDA is too flexible, a model is proposed that is positioned between these two kinds of models. Because of the flexible and correlated relation between image regions and words, correspondence LDA is introduced to model the correspondence between specific words and image segments. That is, the generation of words is based both on the latent variables that generate the words, as in LDA, and on the latent variables that generate the image segments. Refer to [6] for a graphical model illustrating this procedure. The formal representation of the model is

p(r, w, θ, z, y) = p(θ|α) Π_{n=1}^{N} p(z_n|θ) p(r_n|z_n, μ, σ) Π_{m=1}^{M} p(y_m|N) p(w_m|y_m, z, β)    (3.20)
• Further comparison of the different models
We can further compare the three latent-variable models: the Gaussian Multinomial Mixture model (GMM), Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA). From the following discussion, we can see that these three models differ mainly in how they sample the topics.
The topics are modeled by latent variables.
In GMM, the probability of sampling a topic is fixed by the latent variable. In PLSA,
the sampling of the topic depends on each of the documents. In LDA, the flexibility
of sampling the latent variable lies between GMM and PLSA. The topic probability depends on parameters sampled from a Dirichlet distribution but not on the
document. From this perspective, LDA is more restrictive than PLSA. However, the
capability to change the parameters of topic sampling based on Dirichlet distribution
also provides more flexibility than GMM.
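Returning to Model 1, the prediction step of Equation (3.18) can be sketched as follows. The sketch assumes diagonal Gaussians for the region features, as stated above; the parameter names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def gmm_predict_words(regions, p_z, means, stds, p_w_given_z):
    """Equation (3.18): p(w|r) = sum_z p(z|r) p(w|z) for one test image.

    regions : (N, D) region features;  p_z : (K,) class priors
    means, stds : (K, D) diagonal Gaussian parameters per latent class
    p_w_given_z : (K, W) multinomial word distributions
    """
    # log p(r|z): sum the per-dimension log densities over all regions
    log_lik = np.array([norm.logpdf(regions, means[k], stds[k]).sum()
                        for k in range(len(p_z))])
    log_post = np.log(p_z) + log_lik
    p_z_given_r = np.exp(log_post - log_post.max())
    p_z_given_r /= p_z_given_r.sum()          # p(z|r)
    return p_z_given_r @ p_w_given_z          # p(w|r)
```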
3.2.3 Continuous Relevance Model
Previously, we discussed the application of the discrete relevance model to image annota-
tion. The related words for a test image could be predicted by aggregating the annotation words of labeled instances in the training set, weighted by the visual similarity between the test image and the labeled image. The continuous relevance model is discussed in [32]. The difference is that it uses a different method to generate the image segments and words.
The conditional probability of generating words for a given image is approximated by the joint probability of observing the image regions and words:

P(w, r) = Σ_{J∈T} P_T(J) Π_{b=1}^{n_B} P_V(w_b|J) Π_{a=1}^{n_A} ∫_{R^k} P_R(r_a|g_a) P_G(g_a|J) dg_a    (3.21)

P_R(r_a|g_a) is a global probability distribution responsible for mapping generator vectors g ∈ R^k to actual image regions r ∈ R. In [32], it is assumed that every image region has only one corresponding generator, so a particularly simple form for the distribution P_R is assumed:

P_R(r|g) = 1/N_g if G(r) = g, and 0 otherwise,

where N_g is the number of all regions in R. A Gaussian distribution

P_G(g|J) = (1/n) Σ_{i=1}^{n} exp(−(g − G(r_i))^T Σ^{−1} (g − G(r_i))) / √(2^k π^k det(Σ))

is used to generate the image features from a model J, where G(r_i) is the feature vector of region r_i of image J. In this model, it is assumed that the feature vector g of a given test image could be generated from every region r_i in image J following a Gaussian distribution with G(r_i) as the mean and a diagonal matrix as the covariance matrix.

The word annotation probability is estimated with a multinomial distribution with Dirichlet smoothing. The Bayesian estimate of the posterior word probability is

P_V(v|J) = (μ p_v + N_{v,J}) / (μ + Σ_{v'} N_{v',J})    (3.22)

where μ is the smoothing parameter, p_v is the background probability of word v, and N_{v,J} is the number of times v occurs in the annotation of J. In [32], the model probability P_T(J) is assumed to be uniform.
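The sketch below scores the training images of a (hypothetical) set T against one test image, following Equations (3.21) and (3.22). Under the one-generator-per-region assumption, the integral in Equation (3.21) collapses to evaluating P_G at the region feature vectors; a standard diagonal Gaussian is used here, and all variable names are illustrative.

```python
import numpy as np

def crm_log_scores(test_regions, query_words, train_regions, train_counts,
                   p_v, mu_smooth, sigma):
    """Per-training-image log contributions to P(w, r); P_T(J) is uniform.

    test_regions : (n_A, D) region features of the test image
    query_words  : index array of the candidate annotation words
    train_regions: list of (n_J, D) arrays, one per training image J
    train_counts : list of (V,) word-count vectors, one per training image J
    p_v          : (V,) background word probabilities; mu_smooth: Dirichlet smoothing
    sigma        : (D,) diagonal covariance of the region Gaussian
    """
    scores = []
    for feats, counts in zip(train_regions, train_counts):
        # P_G(g|J): average of Gaussians centred on the regions of J
        diff = test_regions[:, None, :] - feats[None, :, :]
        log_g = (-0.5 * np.sum(diff ** 2 / sigma, axis=2)
                 - 0.5 * np.sum(np.log(2 * np.pi * sigma)))
        region_term = np.log(np.exp(log_g).mean(axis=1)).sum()
        # P_V(v|J): multinomial with Dirichlet smoothing, Equation (3.22)
        word_probs = (mu_smooth * p_v[query_words] + counts[query_words]) \
                     / (mu_smooth + counts.sum())
        scores.append(region_term + np.log(word_probs).sum())
    return np.array(scores)
```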
3.2.4 Classification Method
The annotation procedure can be viewed as a classification problem. A classifier is trained for each word, and this classifier is then used to determine whether the word should appear in the annotation set of a given image. These trained classifiers are then applied to both labeled and unlabeled images to get a vector whose components are the confidence values obtained from the classifier for each of the concepts [9]. The application of multiple binary classifiers to the image annotation task can be viewed as an ensemble of binary classifiers with a One-Per-Class (OPC) strategy. This is a simple ensemble strategy compared to pair-wise combination and Error Correcting Output Coding (ECOC) [29]. However, the way OPC is used in [9] is not exactly its traditional usage. In the traditional usage of OPC, the class label with the maximum score is used to annotate the image. In [9], the confidence of annotating the image with one concept is computed by adding the scores of the different ensembles for that concept. The framework is called Content-Based Soft Annotation (CBSA). [9] uses the Support Vector Machine (SVM) and the Bayes Point Machine as the classifiers.
The Support Vector Machine tries to find a hyperplane that separates the training data with maximum margin. The points closest to the hyperplane are called support vectors [8]. The Bayes Point Machine approximates the Bayesian average of statistical inference with a unique classifier called the Bayes Point [20]. Bayesian inference gives a Bayes-optimal solution for a classification task. However, it is often impossible to get a unique classifier that has the same result as Bayesian inference. An approximation is made by finding a hypothesis in a fixed smaller space. The hypothesis constructed is called the Bayes Point, which is believed to approximate Bayesian inference well. This Bayes Point can be computed as the center-of-mass of the version space. [9] shows that the classifier based on the Bayes Point Machine performs better than the SVM classifier. Also, for the words with a prediction score higher than 0.5, the results are very likely related to the semantics of the pictures.
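A minimal sketch of the OPC strategy follows: one binary classifier is trained per word, and its probabilistic output is used as the soft-annotation confidence. Feature extraction is assumed to happen elsewhere, the SVM stands in for either classifier used in [9], and all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def train_opc_classifiers(features, annotations, vocabulary):
    """One binary SVM per annotation word (one-per-class ensemble)."""
    classifiers = {}
    for word in vocabulary:
        y = np.array([word in ann for ann in annotations], dtype=int)
        classifiers[word] = SVC(probability=True).fit(features, y)
    return classifiers

def soft_annotate(classifiers, image_features):
    """Confidence vector over all concepts for a single image."""
    x = image_features.reshape(1, -1)
    return {word: clf.predict_proba(x)[0, 1] for word, clf in classifiers.items()}
```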
3.2.5 Bootstrapping and Co-training Approach for Image Annotation
All of the previous models have certain limitations. They require a large number of anno-
tated images for effective learning, which are often difficult to obtain. Furthermore, most
existing techniques for automatic image annotation require semantically meaningful seg-
mentation of images, which is again difficult to accomplish given the current state-of-the-art
segmentation techniques. [15] proposes using bootstrapping and co-training methods to
annotate large image collections. The basic idea is to start with a small set of annotated
examples, and successively annotate and learn from a large set of unlabeled examples. To
incorporate more instances from the unlabeled sets for the training purpose, two statisti-
cally independent classifiers are used to co-annotate the images and the quality of the final
result is determined by the consistency of annotations. This procedure is called co-training.
The consistent labels are considered as good examples and added to the training set. The
inconsistent labels are considered as bad examples and are used for the user’s manual an-
notation. More precisely, the co-training approach [15] is carried out as follows. Two SVM
classifiers are used to determine the words for images. Each SVM uses a different subset
of image features extracted from automatically segmented image regions. One set contains
the color histogram and another includes texture and shape features. These SVMs trained
on different kinds of features for different segmented regions are then applied to annotate
images. If the annotated words could agree with each other, the words will be used for
annotating the image. If there is some conflict between different classifiers, some disam-
biguation rules are applied to pick the correct one. If the conflict cannot be resolved, the
image is prompted for the user to annotate manually. At the same time, the training set is
enlarged and the classifiers are updated.
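The loop below sketches this co-training procedure. The two view-specific training routines, the disambiguation rules, and the image representation are placeholders; the point is only the control flow of adding consistent labels to the training set and deferring unresolved conflicts to the user.

```python
def co_train(labeled, unlabeled, train_color_view, train_texture_shape_view,
             resolve_conflict, rounds=5):
    """labeled: list of (image, word) pairs; unlabeled: list of images."""
    for _ in range(rounds):
        clf_color = train_color_view(labeled)            # SVM on color-histogram features
        clf_shape = train_texture_shape_view(labeled)    # SVM on texture and shape features
        remaining = []
        for image in unlabeled:
            w1, w2 = clf_color(image), clf_shape(image)
            if w1 == w2:                                 # consistent: accept the annotation
                labeled.append((image, w1))
            else:
                w = resolve_conflict(w1, w2)             # apply disambiguation rules
                if w is not None:
                    labeled.append((image, w))
                else:
                    remaining.append(image)              # defer to manual annotation
        unlabeled = remaining
        if not unlabeled:
            break
    return labeled, unlabeled   # remaining images are prompted to the user
```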
3.2.6 Linguistic Indexing of Pictures by a Statistical Modeling Approach
Most of the previous methods require semantically meaningful segmentation to correlate
the segmented regions with words. However, given the current segmentation techniques,
semantically meaningful regions are difficult to obtain. A number of papers propose ap-
proaches that do not need to get the meaningful segmentation but still effectively model the
structures of the images[33].
In [33], image features are modeled by a two-dimensional multi-resolution hidden Markov model (2D MHMM). Images are divided into blocks and represented at multiple resolutions. The number of blocks in both rows and columns is reduced successively by half to get
lower and lower resolutions. At each resolution level, features are generated by a 2D HMM,
which takes into account the spatial relationship between blocks at the specific resolution
by specifying the diagonally upper blocks as the previous blocks. For every state in the
2D MHMM, the feature vectors are assumed to follow a multivariate Gaussian distribution.
To model the relationship across different resolutions, a one dimensional Markov chain
is used to connect the related parent-children blocks at different resolutions and thus a
2D MHMM is formed to represent the information of the image. The 2D MHMM model
mainly captures the spatial relationship of the feature vectors, for both inter-scale and intra-
scale statistical dependence. The number of states along horizontal and vertical directions
is unknown and thus different combinations are tried. Refer to [33] for a figure to illustrate
this idea.
A dictionary of concepts is defined as the possible set of annotation words. For each
of the concepts in the dictionary, a set of images is collected and also a short description
of the concept is given. It is easier to collect a set of images related to a concept with a
short description than provide a set of annotations for each of the image. Given this set of
images, a 2D MHMM is trained for each of the concepts.
Given a test image, the similarity score between the image and images in the training
set is computed as the likelihood of generating this test image with the 2D MHMM for
the concept of the category. The top k categories with the highest likelihood are picked
out and the word descriptions of the categories are treated as the candidates for annotating
the test image. The paper introduces a method to pick appropriate words by ranking the
statistical significance of the words in the candidate description. The statistical significance
is computed by comparing the occurrence of the word in the predicted k categories with
randomly selected k categories. The top ranked words are selected as the annotation of the
image.
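The word-selection step can be sketched as follows: the occurrence of each candidate word in the k predicted categories is compared against its occurrence in randomly drawn sets of k categories. The 2D MHMM likelihoods are assumed to be computed elsewhere, and the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_candidate_words(top_k_categories, category_descriptions, n_trials=1000):
    """category_descriptions: dict mapping category -> set of description words."""
    all_categories = list(category_descriptions)
    k = len(top_k_categories)
    candidates = set().union(*(category_descriptions[c] for c in top_k_categories))
    scores = {}
    for word in candidates:
        observed = sum(word in category_descriptions[c] for c in top_k_categories)
        random_counts = [sum(word in category_descriptions[c]
                             for c in rng.choice(all_categories, size=k, replace=False))
                         for _ in range(n_trials)]
        # a word is significant if it occurs far more often in the predicted
        # categories than in randomly selected ones
        scores[word] = observed - float(np.mean(random_counts))
    return sorted(scores, key=scores.get, reverse=True)
```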
CHAPTER 4
Enhanced Translation Model for Automatic
Image Annotation
Feng Kang, Rong Jin and Joyce Y. Chai. Regularizing translation models for better
automatic image annotation. In CIKM ’04: Proceedings of the Thirteenth ACM Conference
on Information and Knowledge Management, pages 350-359, 2004.
Feng Kang and Rong Jin. Symmetric statistical translation models for automatic image
annotation. In Proceedings of the fifth SIAM International Conference on Data Mining,
pages 616-620, 2005.
CHAPTER 4
Enhanced Translation Model for Automatic
Image Annotation
The translation model in section 3.1.1 represents images by a visual vocabulary that is
constructed from the clustering results of image regions. The image annotation is viewed as
a translation from a visual language to a textual language. The key challenge in training the
translation model arises from the fact that the alignment between image regions and their
annotation words is not provided. According to [7], one strategy to avoid the alignment
problem is to utilize the co-occurrence statistics between image regions and annotation
words. If an image blob co-occurs more frequently with word ‘A’ than with other words, it
will be more likely for the image blob to be associated with ‘A’.
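As a concrete sketch of this co-occurrence strategy, the snippet below accumulates blob-word co-occurrence counts over a training set and normalizes them per blob; the data layout and names are illustrative.

```python
import numpy as np

def cooccurrence_translation(training_images, n_blobs, n_words):
    """training_images: iterable of (blob_ids, word_ids) pairs, one per image."""
    counts = np.zeros((n_blobs, n_words))
    for blob_ids, word_ids in training_images:
        for b in blob_ids:
            for w in word_ids:
                counts[b, w] += 1.0
    # p(word | blob): each blob is associated most strongly with the words
    # it co-occurs with most often
    return counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
```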
However, the term frequency of annotated words follows Zipf’s law, namely that a small
number of words appear very often in image annotations and most words are used only by
a few images. Figure 4.1 plots the percentage of times that each word is used by image an-
notations from a subset of images from the COREL dataset [14]. As a result of the skewed
distribution, a common word can ‘accidentally’ co-occur with a blob that is associated with
a rare word. For example, word ‘grass’ is used for annotations much more frequently than
word ‘tulip’. Meanwhile, in most images where a tulip appears, it is surrounded by grass. As a result, the type of blob for tulips co-occurs with the word ‘grass’ as frequently as with the word ‘tulip’. The problem with the co-occurrence statistics in automatic image annotation
Figure 4.1. The distribution of term frequency for annotated words in a subset of COREL Data (x-axis: word index; y-axis: normalized frequency).
is further complicated by the noise in clustering the massive number of image regions into a
relatively small number of blobs. Because each image region is represented by a set of fixed
features such as colors, textures and shapes, regions for different annotated words can have
similar distributions over the space of image features and therefore are grouped into the
same cluster, or the same image blob. As a result, a blob for a rare word can co-occur more
frequently with a common word than the rare word. Using the previous example, consider
that image regions for tulips are grouped together with image regions for other flowers. If
most flowers are surrounded by the grass, word ‘grass’ will co-occur more frequently with
the blob for flowers than any single flower name. It is the inaccurate co-occurrence statistics
that allow common annotated words to be associated with many irrelevant image blobs and
thus degrade the quality of auto-annotations generated by the machine translation models.
We propose two categories of modified translation models, namely the regularized transla-
tion model, and the symmetric translation model, that alleviate the problem caused by the
skewed distribution. The basic idea of the enhanced models is to raise the number of blobs
that are associated with uncommon words. The regularized translation model accomplishes
this goal through the introduction of a special Dirichlet prior and the symmetric translation
model considers both the translation from blobs to words and the translation from words to
blobs. In this chapter, we will first present the two frameworks and then show our empirical
study with both modified translation models. Chapter 5 addresses their comparison with
other models.
4.1 Regularized Translation Model
In this section, we will first present the framework of the regularized translation model for
automatic image annotation, followed by the description of an efficient EM algorithm for
finding the optimal solution [26].
4.1.1 Framework for the Regularized Translation Model
The basic idea of the regularized translation model is to impose our prior knowledge of the
desired translation model in the selection of translation models. If each blob represents a different type of object, we would expect that a roughly equal number of blobs is associated with each word, or at least that each word is associated with a certain number of blobs.
In the framework of Bayesian Learning, this prior preference of translation models can be
introduced into the original statistical translation model through an appropriate prior.
In order to form such a prior, the first task is to find an appropriate measurement that indicates the number of blobs associated with each word. We use the normalized sum of translation probabilities for each word, i.e., β_j = (Σ_{k=1}^{m} t_{j,k})/m, where m is the total number of blobs. Since the measurement β_j is proportional to the sum of translation probabilities for the j-th word, it provides a good indication of how many blobs are associated with the word j. Meanwhile, β_j can also be interpreted as the probability for the j-th word to be associated with any blob. In particular, β_j satisfies the axioms of probability, namely

0 ≤ β_j ≤ 1 and Σ_{j=1}^{n} β_j = 1.
Because of the probability interpretation of β_j, we can introduce a prior distribution for β_j that will indirectly influence the estimated translation probabilities t_{j,k}.

The second task in forming a prior is to choose an appropriate distribution for β_j. Since the desired translation model should have an almost equal number of blobs associated with each word, a Dirichlet prior can be used for β:

p(β) ~ Dirichlet(α) = (1/B(α)) Π_{j=1}^{n} β_j^{α−1}    (4.1)

where α > 0 is the hyper-parameter that determines the shape of the Dirichlet distribution and n is the total number of words. B(α) is the normalization constant, given by the multinomial Beta function:

B(α) = (Γ(α))^n / Γ(nα)

Γ is the Gamma function, defined as Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt. Note that the Dirichlet distribution is the conjugate prior of the multinomial distribution.
In our model, we set α to a constant. One property of the Dirichlet distribution is that it reaches its maximum when all β_j are equal. Furthermore, the larger α is, the narrower the distribution will be. Figure 4.2 illustrates the Dirichlet distribution with two β components and different values of the parameter α.
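The narrowing effect shown in Figure 4.2 can be reproduced with a few lines for the two-component case β = (b, 1 − b); the script below is only illustrative.

```python
import numpy as np
from math import gamma

def dirichlet2_pdf(b, alpha):
    # symmetric two-component Dirichlet density at the point (b, 1 - b)
    return gamma(2 * alpha) / gamma(alpha) ** 2 * (b * (1 - b)) ** (alpha - 1)

b = np.linspace(0.01, 0.99, 99)
for alpha in (2, 4, 8, 16):
    density = dirichlet2_pdf(b, alpha)
    # the peak at b = 0.5 grows (the distribution narrows) as alpha increases
    print(f"alpha={alpha:2d}  density at b=0.5: {density.max():.2f}")
```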
By adding the above prior to the original translation model, the posterior probability for the training images is modified into the following form:

l_reg(T) = p(β) Π_{i=1}^{|T|} p(w_i|b_i) ∝ Π_{j=1}^{n} (Σ_{k=1}^{m} t_{j,k})^α Π_{i=1}^{|T|} Π_{j=1}^{n} (Σ_{k=1}^{m} t_{j,k} b_{i,k})^{w_{i,j}}    (4.2)

With the same notation as the previous translation model, m is the number of blobs, n is the number of words, and |T| is the number of images in the training set.
Figure 4.2. Dirichlet distribution for different values of α (α = 2, 4, 8, 16).
The optimal translation probabilities are obtained by maximizing the objective function
in Equation 4.2. Compared to the original translation model in Equation 3.2, the objective
function in Equation 4.2 requires that not only should the optimal translation probabilities
explain well the correspondence between image blobs and annotated words but also be
consistent with the prior preference on translation models, namely that different words are
associated with a similar number of image blobs. Therefore, the resulting translation model
from Equation 4.2 will be more desirable than the model obtained from Equation 3.2. For later reference, we call this modified translation model the ‘regularized translation model’, or RTM.
4.1.2 An EM Algorithm for the Regularized Translation Model
With the regularized translation model in Equation 4.2, the next important question is how
to efficiently obtain the optimal translation probabilities t_{j,k} that maximize the function in
Equation 4.2. The difficulty with optimizing Equation 4.2 lies in two aspects:
1. It has a large number of parameters. The number of parameters in translation mod-
els is m - n, i.e., the number of blobs times the number of unique words. For the
experiment conducted in this study, the number of parameters is close to 20,000.
2. It is a constrained optimization problem. The optimal solution to Equation 4.2 should
satisfy the axioms of probability, namely 0 ≤ t_{j,k} ≤ 1 for all j, k and Σ_j t_{j,k} = 1 for all k.
In the following, we present an EM algorithm for efficiently optimizing the objective func-
tion in Equation 4.2.
First, instead of optimizing the likelihood of the training data in Equation 4.2, we can optimize its log-likelihood, i.e.,

Φ = log(l_reg(T)) = α Σ_{j=1}^{n} log(Σ_{k=1}^{m} t_{j,k}) + Σ_{i=1}^{|T|} Σ_{j=1}^{n} w_{i,j} log(Σ_{k=1}^{m} t_{j,k} b_{i,k})    (4.3)
Then, following the idea of the EM algorithm, we update the solution iteratively. In particular, at each iteration we need to find a set of translation probabilities t^{new}_{j,k} that is better than the old one t^{old}_{j,k} computed at the previous iteration. To this end, we can examine the difference in the log-likelihood between two consecutive iterations, i.e.,

Φ(t^{new}) − Φ(t^{old}) = α Σ_{j=1}^{n} log[(Σ_{k=1}^{m} t^{new}_{j,k}) / (Σ_{k=1}^{m} t^{old}_{j,k})] + Σ_{i=1}^{|T|} Σ_{j=1}^{n} w_{i,j} log[(Σ_{k=1}^{m} t^{new}_{j,k} b_{i,k}) / (Σ_{k=1}^{m} t^{old}_{j,k} b_{i,k})]
≥ α Σ_{j=1}^{n} Σ_{k=1}^{m} [t^{old}_{j,k} / Σ_{k'=1}^{m} t^{old}_{j,k'}] log(t^{new}_{j,k} / t^{old}_{j,k}) + Σ_{i=1}^{|T|} Σ_{j=1}^{n} Σ_{k=1}^{m} [w_{i,j} t^{old}_{j,k} b_{i,k} / Σ_{k'=1}^{m} t^{old}_{j,k'} b_{i,k'}] log(t^{new}_{j,k} / t^{old}_{j,k})

The new translation probabilities t^{new}_{j,k} are obtained by maximizing the above lower bound. Taking the derivative and setting it to 0, the updating equation becomes:

t^{new}_{j,k} = (1/Z_k) (α t^{old}_{j,k} / Σ_{k'=1}^{m} t^{old}_{j,k'} + Σ_{i=1}^{|T|} w_{i,j} t^{old}_{j,k} b_{i,k} / Σ_{k'=1}^{m} t^{old}_{j,k'} b_{i,k'})    (4.4)

where Z_k is a normalization factor that ensures Σ_j t^{new}_{j,k} = 1.
Comparing the above updating equation to Equation 3.3, we can see that Equation 4.4 has the extra term α t^{old}_{j,k} / Σ_{k'} t^{old}_{j,k'} on the right-hand side. For each word j, this extra term gives a larger share of α to the maximum translation probability for the j-th word than to any other translation probability for the same word. As a result, the maximum translation probability for each word benefits most from this extra term. Furthermore, the updating equation 4.4 is able to adjust the sums of translation probabilities for different words (i.e., Σ_k t_{j,k}) to be close to each other. This is because, according to Equation 4.4, a word j with a small sum of translation probabilities, which is the denominator of α t^{old}_{j,k} / Σ_{k'} t^{old}_{j,k'}, will get more promotion than a word with a large sum of translation probabilities. Usually, rare words co-occur with fewer blobs than common words, so the sum of translation probabilities related to rare words is smaller and gets more promotion.
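One iteration of the update in Equation 4.4 can be sketched as follows, with W and B holding the word and blob counts of the training images; the array names are illustrative.

```python
import numpy as np

def rtm_em_step(t, W, B, alpha):
    """One EM iteration of the regularized translation model (Equation 4.4).

    t : (n_words, n_blobs) translation probabilities, each column summing to one
    W : (n_images, n_words) word counts w_{i,j}; B : (n_images, n_blobs) blob counts b_{i,k}
    """
    counts = np.zeros_like(t)
    for i in range(W.shape[0]):
        joint = t * B[i][None, :]                      # t_{j,k} b_{i,k}
        denom = np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        counts += W[i][:, None] * joint / denom        # expected counts from the data
    # non-constant pseudo count contributed by the Dirichlet prior
    counts += alpha * t / np.maximum(t.sum(axis=1, keepdims=True), 1e-12)
    # normalize so that sum_j t_{j,k} = 1 for every blob k
    return counts / counts.sum(axis=0, keepdims=True)
```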
Comparison to the normal usage of Dirichlet priors. Note that the Dirichlet prior introduced in this work is different from the Dirichlet priors used in many other studies [49]. In most previous studies of Bayesian learning, Dirichlet priors simply introduce constant pseudo counts into the estimation of probabilities. Here, however, the pseudo count introduced by the Dirichlet prior (i.e., α t^{old}_{j,k} / Σ_{k'} t^{old}_{j,k'}) is no longer a constant. In fact, it is this non-constant pseudo count that leads to a more balanced distribution of the number of blobs associated with each word.
The global optimum for the EM algorithm. The objective function in Equation 4.3 is strictly convex. This can be easily understood by treating each term (Σ_{k=1}^{m} t_{l,k})^α in the prior as a number of pseudo-annotated images that include all blobs in their pictures and are annotated only by the l-th word. As a result, the regularized model is almost identical to Translation Model 1 except that the regularized model uses both the pseudo-annotated images and the annotated images from the training dataset. Since the translation model for any number of annotated images is strictly convex [7], the new objective function in Equation 4.3 will be strictly convex. Therefore, it does not have any local optimum, and the EM algorithm presented in Equation 4.4 is guaranteed to find the globally optimal solution.
The choice of α. As revealed in the previous discussion, the constant α has a great impact on the resulting translation model. A larger value of α will introduce more pseudo-annotated images and therefore result in a more balanced distribution of the number of blobs associated with each word. In the experiment section, we provide a detailed study of how the value of α influences the quality of auto-annotations.
4.2 Symmetric Translation Model
Most previous studies on translation models for automatic image annotation focus on the
model that translates image regions/blobs into textual words, which is called forward trans-
lation model in our view. Apparently, we can apply the translation model in a reverse way,
namely translating textual words into image blobs. We call this translation model a back-
ward translation model. In this section, we propose a symmetric translation model by combining these two kinds of translation models together [25].
4.2.1 Discrepancy between Forward and Backward Translation Models
Although both forward and backward translation models utilize the same set of co-occurrence statistics, they make different assumptions about words and image blobs:
1. The forward translation model assumes that each image blob is translated into a single word, while each word can be translated into multiple blobs. Due to the prob-
lem with accidental co-occurrence, common words are usually associated with many
more image blobs than uncommon words. In particular, many rare words are even
associated with no image blobs at all, which makes it impossible for these words to
be used as annotations.
2. The backward translation model is based on the assumption that each word is trans-
lated into a single image blob. Thus, for each annotation word, a large translation
probability is assigned to an image blob if it co-occurs relatively frequently with the
Table 4.1. An example of forward and backward translation models for two image blobs (i.e., b1 and b2) and two words (i.e., w1 and w2).
given word compared to other image blobs. As a result, an image blob can be as-
signed with a large translation probability for a word although the absolute number
of co—occurrence between the word and the image blob is rather small.
In order to better illustrate the difference between these two types of translation models,
consider a simple example of co-occurrence statistics that is shown in Table 4.1. On one
hand, for the forward translation model, the translation probabilities for word ‘w1’ are dominant for both image blobs, and word ‘w2’ is not strongly associated with either of the two blobs. As a result, it is unlikely for word ‘w2’ to be used in any auto-annotations. On the other hand, for the backward translation model, we do find that image blob ‘b1’ is strongly associated with word ‘w2’. However, the chance for image blob ‘b2’ to be associated with word ‘w2’ is also high (i.e., 42%). This large uncertainty can significantly degrade the accuracy of auto-annotations. Thus, neither of the two translation models provides a satisfactory answer.
To utilize the two directions of the translation models, we can adjust the translation probabilities in Table 4.1. On one hand, for the forward translation model, to balance the number of blobs associated with different words, we need to increase the probability p(w2|b1) and decrease the probability p(w1|b1). This will cause word ‘w2’ to be associated with blob ‘b1’. On the other hand, for the backward translation model, the uncertainty as to which blob is associated with word ‘w2’ can be reduced using the information from the forward
translation model. This is because, according to the forward translation model, it is more
likely for blob ‘b2’ to be associated with word ‘wl’ than for blob ‘bl’. Thus, to balance
the number of image blobs associated with each word, we need to reduce translation prob-
ability p(b2|w2) and increase p(b1|w2). The refinement of the forward translation model
and the backward translation model will be carried out iteratively until they both converge
to the desired distribution.
As indicated by the above analysis, the key is to explore the correlation between the
forward translation model and the backward translation model under the assumption that
the number of blobs associated with different words should be evenly distributed. Using
the information collected from the backward translation model, we are able to alleviate the
problem with the forward translation model in which rare words are associated with few
image blobs; Using the information collected from the forward translation, we are able to
alleviate the problem with the backward translation model in which large uncertainty exists
for uncommon words as to which image blob the word is associated with. By combining
these two translation models together, we will be able to avoid the difficulties of each
individual translation model that are caused by their underlying assumptions.
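The consistency requirement can be made explicit with a small sketch: the joint probability implied by the forward model, t_{j,k} p(b_k), is compared with the one implied by the backward model, u_{k,j} p(w_j). The array names and the use of an absolute-difference gap are illustrative choices, not the exact criterion optimized below.

```python
import numpy as np

def consistency_gap(t, u, p_b, p_w):
    """Total disagreement between the two implied joint distributions.

    t : (n_words, n_blobs) forward probabilities p(w_j|b_k)
    u : (n_blobs, n_words) backward probabilities p(b_k|w_j)
    p_b : (n_blobs,) blob prior; p_w : (n_words,) word prior
    """
    forward_joint = t * p_b[None, :]          # t_{j,k} p(b_k)
    backward_joint = (u * p_w[None, :]).T     # u_{k,j} p(w_j), aligned to (words, blobs)
    return float(np.abs(forward_joint - backward_joint).sum())
```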
The mathematical formula for the backward translation model is similar to that of the forward translation model. In the backward translation model, the probability of translating the annotation words of an image into its image blobs is written as:

p(b_i|w_i) = Π_{k=1}^{m} (Σ_{j=1}^{n} u_{k,j} w_{i,j})^{b_{i,k}}

where u_{k,j} denotes the translation probability from the j-th word to the k-th blob. The overall objective l combines the log-likelihoods of the forward and backward translation models with regularization terms, weighted by λ, that penalize the inconsistency between the joint probabilities t_{j,k} p(b_k) and u_{k,j} p(w_j) implied by the two models. So the objective function we will optimize is l with the constraints Σ_j t_{j,k} = 1 for each k and Σ_k u_{k,j} = 1 for each j.
After introducing the Lagrangian terms, the function we optimize is

Φ = Σ_{i=1}^{|T|} Σ_{j=1}^{n} Σ_{k=1}^{m} [w_{i,j} t^{old}_{j,k} b_{i,k} / Σ_{k'} t^{old}_{j,k'} b_{i,k'}] log(t_{j,k} b_{i,k})
  + Σ_{i=1}^{|T|} Σ_{k=1}^{m} Σ_{j=1}^{n} [b_{i,k} u^{old}_{k,j} w_{i,j} / Σ_{j'} u^{old}_{k,j'} w_{i,j'}] log(u_{k,j} w_{i,j})
  − λ Σ_j Σ_k t_{j,k} p(b_k) (log(t_{j,k} p(b_k) / (u_{k,j} p(w_j))) − 1 + η_{j,k})
  − λ Σ_j Σ_k u_{k,j} p(w_j) (log(u_{k,j} p(w_j) / (t_{j,k} p(b_k))) − 1 + θ_{k,j})
  + Σ_k λ_k (1 − Σ_j t_{j,k}) + Σ_j β_j (1 − Σ_k u_{k,j})
Taking the derivative with respect to t_{j,k} and setting it to 0, we obtain the solution for t_{j,k}:

t^{new}_{j,k} = 2 C_{j,k} / (B_{j,k} + √(B_{j,k}^2 + 4 A_{j,k} C_{j,k}))

where A_{j,k} = 2λ p(b_k) / t^{old}_{j,k} and B_{j,k} = 1 − p