MULTIPLE KERNEL AND MULTI-LABEL LEARNING FOR IMAGE CATEGORIZATION

By

Serhat Selçuk Bucak

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Doctor of Philosophy

2014

ABSTRACT

MULTIPLE KERNEL AND MULTI-LABEL LEARNING FOR IMAGE CATEGORIZATION

By Serhat Selçuk Bucak

One crucial step in recovering useful information from large image collections is image categorization. The goal of image categorization is to find the relevant labels for a given image from a closed set of labels. Despite the huge interest and significant contributions by the research community, there remains much room for improvement in the image categorization task. In this dissertation, we develop efficient multiple kernel learning and multi-label learning algorithms with high prediction performance for image categorization.

There are many image representation methods available in the literature. However, it is not possible to pick one as the best method for image categorization, since different representations work better in different scenarios. Multiple kernel learning (MKL), a natural extension of kernel methods for information fusion, is often used by researchers to improve image representation by integrating it into the learning step for selecting and combining different image features. MKL is mostly considered a binary classification tool, and it is difficult to scale up MKL when the number of labels is large. We address this computational challenge by developing a stochastic approximation based framework for MKL that aims to learn a single kernel combination that benefits all classes.

Another contribution of this dissertation is to develop efficient multi-label learning algorithms. Multi-label learning is arguably the most suitable formulation for the image categorization task. Many researchers have employed decomposition methods, particularly the one-vs-all framework, with SVM (support vector machines) as a base classifier for addressing the image categorization problem. However, the decomposition methods have several shortcomings, such as their inability to exploit label correlations. Further, they suffer from imbalanced data distributions when the number of labels is large. Our contribution is to address multi-label learning via a ranking approach, termed multi-label ranking. Given a test image, multi-label ranking algorithms aim to order all the image classes such that the relevant classes are ranked higher than the irrelevant ones. The advantages of the proposed multi-label ranking approach, termed MLR-L1 (multi-label ranking with L1 norm), over other multi-label learning methods are its computational efficiency and high prediction performance.

Image categorization is a supervised learning task, thus requiring a large set of training images annotated by humans. Unfortunately, labeling is an expensive process, and it is often the case that the annotators provide a limited set of labels, meaning that they only give a small subset of the relevant tags for an image. One of the contributions of this dissertation is defining the problem of multi-label learning with incomplete class assignments and presenting a robust multi-label ranking algorithm, termed MLR-GL (multi-label ranking with group lasso norm), that addresses the challenge of learning from incompletely labeled data.
Finally, we present a multiple kernel multi-label ranking algorithm to simultaneously address two essential factors for improving the performance of image categorization: heterogeneous information fusion, and exploiting label correlations in multi-label data. We propose a multiple kernel multi-label ranking method that learns a shared sparse kernel combination that benefits all image classes. In this way, we not only improve the training and prediction efficiency, but also improve the accuracy, particularly for classes with a small number of samples, by enabling information sharing between classes. We integrate the proposed MLR-L1 algorithm with an efficient semi-infinite linear programming (SILP) based MKL solver and develop a computationally efficient wrapper algorithm, termed MK-MLR (multiple kernel multi-label ranking).

To Dani

ACKNOWLEDGMENTS

I would like to express the deepest appreciation to my thesis advisor, Professor Anil K. Jain, for his continuous support, generosity, patience, enthusiasm, and wisdom. Being his student and a part of the PRIP Lab is something that will always make me feel proud and privileged. It has been a great opportunity for me to work with such an intelligent, hard-working, and renowned researcher as Professor Jain, and I have tried to gain as much as possible from his immense knowledge of pattern recognition and life.

I am thankful to Professor Rong Jin for working closely with me during my PhD. I was very fortunate to work with such a smart, disciplined, and knowledgeable researcher, and collaborating with him taught me the importance of passion and hard work in research.

I am grateful to have Professor Selin Aviyente and Professor Pang-Ning Tan on my thesis committee. Their valuable comments and suggestions helped me to improve my thesis. I would also like to thank Professor Todd Fenton and Professor Roger Haut for supporting me in the last year of my PhD under the National Institute of Justice grant and giving me the opportunity to work with them on the pediatric fracture printing project. I would also like to thank Professor George Stockman for the valuable advice he gave me throughout my PhD.

I thankfully acknowledge the funding sources that made my Ph.D. work possible. My research was supported by grants from the Office of Naval Research, ONR N00014-09-1-0663. I was funded by the National Institute of Justice grant, NIJ Award No. 2011-DN-BX-K540, in my last year.

Professor Bilge Gunsel is a very important person for me. I started working with her in my senior year and continued to study under her supervision for my MSc degree at Istanbul Technical University. Her generosity, support, and passion for research helped me to have a very rewarding and pleasant time at ITU. Working with her was one of the main factors that encouraged me to pursue a PhD.

I was fortunate to have great collaborations outside MSU. It was a very valuable learning experience for me to work at IBM with Vikas Sindhwani and Jianying Hu. I also had a very fruitful internship experience at Samsung working with Ankur Saxena, Abhishek Nagar, Felix Fernandes, and Kong-Posh Bhat. I also had the pleasure of working on a research paper with Professor Akgul from ITU.

I would like to thank the fellow PRIP students and friends: Soweon, Brendan, Pavan, Abhishek, Radha, Jung-Eun, Kien, Alessandra, Tim, Sunpreet, Scott, Lacey, Charles, Unsang, and Mayur. They made my life at MSU easier and more fun.
I also consider myself fortunate and honored to have worked on research papers with Pavan, Brendan, and Abhishek. Ali Mutlu, Mehrdad Mahdavi, and Jen Vollner are other fellow PhD students whom I want to thank. Sezai Turkes is another person I need to thank, not only for the school he created, which provided an excellent education and seven fun years for me, but also for his generosity and vision, which were always a source of motivation.

Last but not least, I want to thank my families in the US and in Turkey. My parents-in-law Shari and Tom made my life in Michigan much easier with their kindness and generosity. I am grateful to have three great siblings, Efkan, Serhan, and Tuba, who gave me support and encouragement whenever I needed it. My mother and father have provided me with constant support and endless patience during my long years of study, and it is not possible to thank them enough. Finally, I would like to thank my dear wife Danielle for making my life much more beautiful.

TABLE OF CONTENTS

Chapter 1  Introduction
  1.1  Multiple Kernel Learning for Image Categorization
  1.2  Multi-label Learning for Image Categorization
  1.3  Challenges
    1.3.1  Challenges in MKL for Image Categorization
    1.3.2  Challenges in Multi-label Learning for Image Categorization
  1.4  Contributions
  1.5  Notation

Chapter 2  Multiple Kernel Learning for Image Categorization: A Review
  2.1  Introduction
  2.2  Overview
    2.2.1  Overview of Multiple Kernel Learning (MKL)
    2.2.2  Relationship to the Other Approaches
  2.3  Multiple Kernel Learning (MKL): Formulations
    2.3.1  Multiple Kernel Learning and Group Lasso
    2.3.2  Regularization in MKL
  2.4  Multiple Kernel Learning: Optimization Techniques
    2.4.1  Direct Approaches for MKL
      2.4.1.1  A Sequential Minimum Optimization (SMO) based Approach for MKL
    2.4.2  Wrapper Approaches for MKL
      2.4.2.1  A Semi-infinite Programming Approach for MKL (MKL-SIP)
      2.4.2.2  Subgradient Descent Approaches for MKL (MKL-SD & MKL-MD)
      2.4.2.3  An Extended Level Method for MKL (MKL-Level)
      2.4.2.4  An Alternating Optimization Method for MKL (MKL-GL)
    2.4.3  Online Learning Algorithms for MKL
    2.4.4  Computational Efficiency
  2.5  Experiments
    2.5.1  Data sets, Features and Kernels
    2.5.2  MKL Methods Used in Comparison
    2.5.3  Implementation
    2.5.4  Classification Performance of MKL
      2.5.4.1  Experiment 1: Classification Performance
      2.5.4.2  Experiment 2: Number of Kernels vs. Classification Accuracy
    2.5.5  Computational Efficiency
      2.5.5.1  Experiment 4: Evaluation of Training Time
      2.5.5.2  Experiment 5: Evaluation of Sparseness
    2.5.6  Large-scale MKL on ImageNet
  2.6  Summary and Conclusions

Chapter 3  Multi-label Multiple Kernel Learning by Stochastic Approximation
  3.1  Introduction
  3.2  Previous Work
  3.3  Multi-label Multiple Kernel Learning (ML-MKL)
    3.3.1  A Minimax Framework for Multi-label MKL
    3.3.2  Convergence Analysis
  3.4  Experimental Results
    3.4.1  Data Sets
    3.4.2  Baseline Methods
    3.4.3  Implementation
    3.4.4  Classification Performance
    3.4.5  Training Time
    3.4.6  Sensitivity to Parameters
    3.4.7  Large-scale MKL on ImageNet
  3.5  Conclusions and Future Work

Chapter 4  Image Categorization by Multi-label Ranking
  4.1  Introduction
  4.2  Previous Work
    4.2.1  Label Set Transformation Methods
      4.2.1.1  Problem Transformation Methods
      4.2.1.2  Label Set Projection Methods
    4.2.2  Supervised Algorithm Adaptation Methods
      4.2.2.1  Transfer Learning for Multi-label Classification
    4.2.3  Multi-label Ranking Methods
    4.2.4  Exploiting Label Correlation in Multi-label Learning
    4.2.5  Related Problems
  4.3  Maximum Margin Framework for Multi-label Ranking
  4.4  Approximate Formulation
    4.4.1  Relation to the One-vs-all Approach
    4.4.2  Proposed Approximation
  4.5  Efficient Algorithm
  4.6  Experimental Results
    4.6.1  Data Sets
    4.6.2  Baseline Methods
    4.6.3  Multi-label Ranking Performance
    4.6.4  Training Time
  4.7  Conclusions and Future Work

Chapter 5  Multi-label Ranking for Image Categorization with Incomplete Class Assignments
  5.1  Introduction
  5.2  A Framework for Multi-label Learning from Incompletely Labeled Data
  5.3  Optimization Algorithm
  5.4  Experimental Results
    5.4.1  Data Sets
    5.4.2  Baseline Methods
    5.4.3  Multi-label Ranking Performance on Incompletely Labeled Data
    5.4.4  Training Time
  5.5  Conclusions and Future Work

Chapter 6  Multiple Kernel Multi-label Ranking
  6.1  Introduction
  6.2  Previous Work
  6.3  Multiple Kernel Multi-Label Ranking (MK-MLR)
    6.3.1  A Minimax Framework for Multiple Kernel Multi-label Ranking
    6.3.2  Proposed Approximation
    6.3.3  Optimization via Semi-infinite Linear Programming
  6.4  Experimental Results
    6.4.1  Data Sets
    6.4.2  Baseline Methods
    6.4.3  Implementation
    6.4.4  Evaluation Measures
    6.4.5  Multi-label Learning Performance
    6.4.6  Training Efficiency
    6.4.7  Prediction Efficiency
  6.5  Conclusions and Future Work
Chapter 7  Contributions and Future Work
  7.1  Contributions
  7.2  Future Work

APPENDIX

BIBLIOGRAPHY
Chapter 1

Introduction

In this dissertation, we develop multiple kernel and multi-label learning algorithms for the image categorization problem. The goal of image categorization is labeling an image with the relevant categories from a predefined tag set. In other words, image categorization requires designing classifiers that answer the following type of question: "Does the query image have a cat in it?" Answering questions such as this (cat is one of the possible image labels) is also the goal of the visual object recognition and automatic image annotation tasks, which we consider to be two very closely related subsets of image categorization. Visual object recognition is defined as the task of determining whether any of the predefined objects (visible or tangible things) are present in an image. On the other hand, the automatic image annotation task differs from visual object recognition in that the goal is not only to look for the existence of tangible objects, but also for concepts like color (green, white), place (Paris, Ireland), and scene (sunset, fight). The methods we present in this dissertation are designed to be used in both of these tasks.

Image categorization is a very good fit as a benchmark to test multiple kernel and multi-label learning algorithms for several reasons. Firstly, we see that many state-of-the-art methods for image categorization use information fusion to combine different image representations. Therefore, multiple kernel learning (MKL), which is an information fusion technique, is expected to perform well in image categorization. Secondly, different classes in image categorization data sets require similar features (e.g., the scale-invariant feature transform, SIFT, works well for the majority of image classes).
Therefore, the assumption underlying our multiple kernel learning algorithms holds, namely that a single kernel combination benefiting all classes can be learned. Thirdly, only a small number of image representations are needed to obtain the optimal classification performance. This means that sparseness, one of the goals of the multiple kernel learning algorithms we develop, is a useful feature in image categorization. Fourthly, since image classes are often correlated with each other, multi-label learning is expected to work well for image categorization. Finally, incompletely labeled data, which is one of the problems we address in this dissertation, frequently occurs in image categorization applications.

1.1 Multiple Kernel Learning for Image Categorization

Given the variety of alternatives and the large number of ways for constructing image representations, one critical issue in developing statistical models for image categorization is how to effectively combine different image features. MKL presents a principled framework for combining multiple image representations: it creates a set of base kernels for the different representations and finds the optimal kernel combination as a linear combination of these base kernels.

We demonstrate MKL on a simple image categorization problem. We create two kernels: one based on the color histogram and one based on the texture distribution in the image. We choose three object classes (crocodile, snoopy, strawberry) from the Caltech 101 data set [3], each with 15 instances, and train one-vs-all support vector machines (SVM) for each of the three classes by using different combinations of these two kernels. To combine the kernels, we vary the combination coefficients in the set {0, 0.2, 0.4, 0.6, 0.8, 1}. In Figure 1.1 we generate a heat map to represent the classification performance of different linear combinations of the two kernels. We observe that the optimal combination varies from one class to another. For example, while the texture-based kernel is assigned a higher coefficient for the crocodile classification task, the color kernel should be used with a higher weight for the strawberry class.

Figure 1.1: The first column shows the surface graphs that demonstrate the influence of different kernel combination weights on the mean average precision score for three different classes. Four examples from each class are given in the second column. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis.

This simple example illustrates the significance of identifying the appropriate combination of kernels for recognizing a specific class of visual objects. It also motivates the need for developing automatic approaches for finding the optimal combination of kernels from training examples, as there is no universal solution for kernel combination that works well for all classes.

MKL has been successfully applied to a number of tasks in computer vision, particularly to image categorization. For instance, the winning group in the Pascal VOC 2010 object categorization challenge [4] used MKL to combine multiple sets of visual features. The best performance reported on the Caltech 101 data set was achieved by learning the optimal combination of multiple kernels [5]. Recent studies have also shown promising performance of MKL for object detection [6].
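The two-kernel sweep behind Figure 1.1 can be sketched as follows. This is a minimal illustration, not the thesis code: it assumes two precomputed base kernel matrices (e.g., built from the color and texture features), a one-vs-all label vector in {-1, +1}, and train/test index arrays, and it reports the average precision obtained for every pair of combination weights.

```python
# A minimal sketch (not the thesis code) of the two-kernel sweep behind Figure 1.1.
# Assumed inputs: precomputed base kernel matrices K_color and K_texture over all
# images, a one-vs-all label vector y in {-1, +1}, and train/test index arrays.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import average_precision_score

def sweep_two_kernels(K_color, K_texture, y, tr, te,
                      weights=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return {(w_color, w_texture): average precision on the held-out images}."""
    scores = {}
    for wc in weights:
        for wt in weights:
            if wc == 0.0 and wt == 0.0:
                continue                                # skip the all-zero combination
            K = wc * K_color + wt * K_texture           # linear kernel combination
            clf = SVC(kernel="precomputed", C=10.0)
            clf.fit(K[np.ix_(tr, tr)], y[tr])           # train on the combined kernel
            s = clf.decision_function(K[np.ix_(te, tr)])
            scores[(wc, wt)] = average_precision_score(y[te], s)
    return scores
```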
1.2 Multi-label Learning for Image Categorization

In multi-label learning, more than one class can be assigned to an instance. With the increase in the number of data sets where each image has multiple labels, there have been a large number of studies that focus on developing strong classification methods for image categorization [7–9]. Many researchers employ decomposition methods, particularly the one-vs-all framework, with SVM as a base classifier. In this setting, a separate classifier is trained for each image label, leading to an independent prediction for each label on a query image. Although decomposition based methods are frequently used to solve multi-label classification, they do have some limitations (see Chapter 4). To overcome the limitations of decomposition techniques, many direct multi-label learning methods have been proposed in the literature that do not decompose or transform the multi-label learning problem into a set of binary classification tasks [10–14]. In this dissertation, we are particularly interested in multi-label ranking, in which the learning task is formulated as a bipartite ranking problem. Multi-label ranking is an example of a direct multi-label learning approach that can exploit label correlations. Also, by avoiding a binary decision, multi-label ranking is usually more robust than the classification approaches, particularly when the number of classes is very large [10, 15].

Ranking has been successfully used in other application domains such as document classification and recommender systems. For example, it makes more sense in recommender systems to provide the user with an ordered list of items that she/he might be interested in. Also, since the preference ratings given by the users are not universal (i.e., the rating "7" is not the same for every user), ranking results are easier to obtain than predictions of the exact ratings. Similarly, ranking labels might be useful for image categorization systems. Consider an image search system where the search is based on image labels. Being able to rank image labels can be useful for refining the search. For example, if a user is interested in finding "cafe shop" images from the internet to decide where to go, then a system that only focuses on the label "cafe shop" would not help in refining the search. If the user is looking for images of pet-friendly cafe shops where more people read books than use computers, then ranking labels would be useful. Such a system would aim to retrieve images where the labels cafe shop, books, cats, and dogs have higher scores than the label computer. This does not mean that the image should not contain any computers, but the emphasis on the other labels is set to be higher.
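To make the ranking formulation concrete, the small sketch below (illustrative only; the class names and scores are hypothetical) orders the labels of a single test image by their classifier scores and measures the bipartite ranking error, i.e., the fraction of (relevant, irrelevant) label pairs whose order is violated.

```python
# Illustrative only: rank the labels of one test image by their scores f_k(x) and
# measure the bipartite ranking error. Class names and scores below are hypothetical.
import numpy as np

def rank_labels(scores, class_names):
    """Return the class names ordered from most to least relevant."""
    return [class_names[k] for k in np.argsort(-scores)]

def bipartite_ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) pairs where the relevant label is not ranked higher."""
    pos, neg = scores[relevant], scores[~relevant]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0
    violations = sum(p <= q for p in pos for q in neg)
    return violations / (len(pos) * len(neg))

classes = np.array(["cafe shop", "books", "cats", "dogs", "computer"])
f = np.array([2.1, 1.4, 0.9, 0.7, -0.3])              # hypothetical scores for one image
relevant = np.array([True, True, True, True, False])  # ground-truth relevance
print(rank_labels(f, classes))            # ['cafe shop', 'books', 'cats', 'dogs', 'computer']
print(bipartite_ranking_loss(f, relevant))  # 0.0: every relevant label outranks 'computer'
```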
1.3 Challenges

There are thousands of possible image classes and, as such, there is no optimal image representation technique that would work best for all of these classes. In fact, it is very difficult to find a salient representation for even a single image class due to large variations in the visual appearance of samples within a class, a phenomenon known as the intra-class variation problem [16, 17]. In addition to intra-class variation, challenges include translation [18], scale [19], rotation [20], affine transformation [21], viewpoint variation [22], occlusion [23], background clutter [24], and illumination [25]. Figure 1.2 shows example images that demonstrate some of these challenges.

Figure 1.2: Illustration of some image categorization challenges: (a) Blue Mosque under two different illumination conditions, (b) two miniatures with background clutter and object deformation, (c) two different views of the Topkapi Palace, (d) two ferry images, one being partially occluded.

The problems we have stated above often force recognition algorithms to utilize complex models. More specifically, kernel machines, which use non-linear functions of the features, generally work better than linear classification models. For instance, we see from the image categorization literature that using SVM with an RBF (radial basis function) or χ2 kernel gives superior performance compared to a linear SVM [26]. However, there are some challenges in using kernel machines for image categorization. We examine these under the following two topics: (i) challenges of multiple kernel learning and (ii) challenges of multi-label learning for image categorization.
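As a concrete reference point for the kernels mentioned above, the snippet below gives a minimal sketch of the exponential chi-squared kernel that is commonly paired with histogram features such as bag-of-words SIFT histograms. The bandwidth value and the toy histograms are placeholders, not the settings used in this dissertation.

```python
# A minimal sketch of the exponential chi-squared kernel often used with histogram
# features: k(x, z) = exp(-gamma * sum_i (x_i - z_i)^2 / (x_i + z_i)).
import numpy as np

def chi2_kernel_matrix(X, Z, gamma=0.5, eps=1e-10):
    """Chi-squared kernel between rows of X (a x d) and Z (b x d); inputs must be non-negative."""
    diff = X[:, None, :] - Z[None, :, :]
    summ = X[:, None, :] + Z[None, :, :] + eps
    d2 = np.sum(diff ** 2 / summ, axis=2)        # chi-squared distance matrix (a x b)
    return np.exp(-gamma * d2)

# Usage with toy data: eight L1-normalized 50-bin histograms.
H = np.random.default_rng(0).dirichlet(np.ones(50), size=8)
K = chi2_kernel_matrix(H, H)                     # 8 x 8 base kernel matrix
```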
1.3.1 Challenges in MKL for Image Categorization

• The application of MKL to multi-labeled data, such as in image categorization, is primarily limited to the one-vs-all framework, which fails to exploit label correlations. As the MKL solvers for each class operate independently, no interaction or information transfer between image classes takes place, leading to suboptimal performance [15, 27, 28].

• The training complexities of MKL algorithms are quadratic in terms of the number of training samples and linear in terms of the number of classes. More importantly, the prediction is computationally expensive. Once the distance between a query sample and the support vectors is calculated, a different kernel combination needs to be calculated for each class prior to prediction, which is a costly process.

1.3.2 Challenges in Multi-label Learning for Image Categorization

• Exploiting correlations or dependencies between different classes is an important research problem, and a number of approaches have been developed for multi-label learning that aim to capture dependencies among classes [10, 12, 13, 29, 30]. The majority of such methods make strong assumptions regarding the type of relationships that exist between class labels. Although these methods give promising results when the underlying assumptions hold, there is no guarantee that the assumptions would hold for all types of data.

• Formulating the multi-label learning problem as multi-label ranking is an effective approach that takes advantage of label correlations without making a strong assumption about the data structure. However, the bipartite ranking constraints make the computational complexity quadratic in the number of classes, making these algorithms computationally inefficient when the number of classes is large.

• It is unclear whether strong multi-label learning algorithms would work well in practice. One of the main concerns for real world systems is that the labeling process is very expensive and often inaccurate. In image categorization systems, the image annotations for the training data set are provided primarily by online users through services like Amazon Mechanical Turk [31]. As a result, the retrieved annotations are often incomplete; only a subset of the true image labels is given by the annotators. Therefore, it is important to build robust classifiers that would work well even when the full label information is not provided.
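The prediction-time cost raised in the first set of challenges can be illustrated with the sketch below. It is a simplified illustration under assumed inputs (per-class dual coefficient vectors, biases, and kernel weight vectors from already-trained one-vs-all kernel machines): with per-class weights, a separate combined kernel block must be formed for every one of the m classes, whereas a single shared combination, the strategy pursued in Chapter 3, is formed only once.

```python
# Simplified illustration of prediction with per-class kernel weights betas[k] versus one
# shared weight vector beta. alphas, biases, and the kernel blocks are assumed inputs.
import numpy as np

def combine(K_test_list, beta):
    """Combine s base kernel blocks K_j(test, train) with weights beta of length s."""
    return sum(b * K for b, K in zip(beta, K_test_list))

def predict_per_class(K_test_list, betas, alphas, biases):
    # betas has shape (m, s): m kernel combinations are constructed, one per class.
    return np.stack([combine(K_test_list, betas[k]) @ alphas[k] + biases[k]
                     for k in range(len(alphas))], axis=1)

def predict_shared(K_test_list, beta, alphas, biases):
    # beta has shape (s,): the combined kernel block is constructed a single time.
    K = combine(K_test_list, beta)
    return np.stack([K @ alphas[k] + biases[k] for k in range(len(alphas))], axis=1)
```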
1.4 Contributions

We can divide our contributions in this dissertation into two parts: (i) multiple kernel learning and (ii) multi-label learning for image categorization. Chapters 2 and 3 show how multiple kernel learning can be used to simultaneously improve the representation and learning stages. Chapters 4 and 5 discuss the multi-label learning problem, which is arguably the most appropriate formulation of the image categorization problem. We present our (single) kernel based multi-label learning algorithms in Chapters 4 and 5. Finally, we merge these two directions by developing a multiple kernel multi-label ranking approach in Chapter 6 and address our main goal, which is to develop efficient algorithms that outperform published classification methods when state-of-the-art image representations are used. We can list our contributions as follows:

• Our contribution in Chapter 3 is to improve the computational efficiency of MKL with respect to the number of classes for both the training and prediction steps. The majority of MKL methods require executing a binary MKL algorithm individually for each image class (see Figure 1.3), making the training and prediction complexities linear in terms of the number of classes. This is the reason that the existing MKL solvers do not scale well when the number of classes is large. We address this computational challenge by developing a framework for MKL that learns a single kernel combination benefiting all classes by combining a worst-case analysis with stochastic approximation (see Figure 1.4). Our analysis shows that the training complexity of our algorithm is O(m^{1/3} log m) in terms of the number of classes, m. Moreover, since our algorithm learns a single sparse kernel combination for all classes, the time consumed for the kernel construction step of the prediction phase is also reduced significantly.

Figure 1.3: In Chapter 2, we discuss binary MKL methods for the one-vs-all framework, where an individual MKL algorithm is trained for each class.

Figure 1.4: In Chapter 3, we present our multi-label MKL algorithm, which solves one MKL problem for all classes.

• Our contributions in Chapters 4 and 5 are efficient multi-label ranking algorithms. Given a test image, a multi-label ranking method aims to order all the object classes such that the relevant classes are ranked higher than the irrelevant classes (Figure 1.5). We present two efficient algorithms for multi-label ranking based on the idea of block coordinate descent. The proposed methods are computationally efficient; their computational complexity is linear in the number of classes, while the majority of the multi-label ranking schemes suffer from quadratic dependence on the number of classes. Our experimental results show that the proposed methods outperform state-of-the-art classification methods. Table 1.1 gives a comparison between the proposed multi-label ranking methods (MLR-L1 and MLR-GL) and two state-of-the-art approaches on two benchmark data sets, ESP Game and MIR Flickr25000, in terms of the AUC-ROC score. We use dense-SIFT features to generate the results in Table 1.1; however, the proposed methods consistently outperform the baselines even when different features are used.

Table 1.1: Multi-label ranking performance (AUC-ROC) for the ESP Game and MIR Flickr25000 data sets

            ESP Game    MIR Flickr25000
SVM         79.5        70.2
MLLS        79.4        75.9
MLR-L1      81.5        75.4
MLR-GL      80.5        76.2

• In Chapter 5 we present a robust multi-label learning method that performs well under the setting of limited annotations. Specifically, we consider a situation where the training example class assignments are incomplete (see Figure 1.6). Consider a training image whose true class assignment is (c1, c2, c3, c4), but which is only assigned to classes c1 and c4. We refer to this problem as multi-label learning with incomplete class assignments, which has not been addressed in the multi-label learning literature. Incompletely labeled data is frequently encountered when the number of classes is very large (hundreds, as in the MIR Flickr data set) or when there is a large ambiguity between classes (e.g., the labels jet and plane). In both cases, it is difficult for users to provide complete class assignments for objects.

• We propose a ranking based multi-label learning framework that explicitly addresses the challenge of learning from incompletely labeled data by exploiting the group lasso technique to combine the ranking errors. Table 1.2 reports the results on two benchmark data sets, ESP Game and MIR Flickr25000, in terms of the AUC-ROC score, in two scenarios: (i) the complete label information is provided, and (ii) 60% of the training labels are randomly removed. Based on the performance in Table 1.2 and the experimental results in Chapter 5, we claim that the proposed method MLR-GL outperforms the state-of-the-art multi-label classification methods on incompletely labeled data, including our other multi-label ranking approach MLR-L1.

• Finally, we propose a multiple kernel multi-label ranking method (MK-MLR) by combining the strengths of the algorithms in Chapters 2, 3, and 4. We extend the proposed MLR-L1 method to the multiple kernel setting by integrating it into the SILP (semi-infinite linear programming) based wrapper MKL solver, which is the most efficient L1-MKL optimization method according to our detailed analysis in Chapter 2. We also use the idea of learning a shared kernel combination for all image classes to improve the computational efficiency. The MK-MLR method addresses the two essential factors for improving the performance of image categorization: (i) heterogeneous information fusion, and (ii) exploiting label correlations in multi-label data.

Figure 1.5: The difference between the two proposed multi-label ranking approaches MLR-L1 (Chapter 4) and MLR-GL (Chapter 5) is that MLR-L1 strictly addresses the complete class assignment problem whereas MLR-GL can handle missing class assignments. For example, the complete and full annotations are provided with all four labels (soccer, referee, field, goalkeeper) for the given image.

Table 1.2: AUC-ROC (%) scores for the ESP Game and MIR Flickr25000 data sets for the missing label scenario.

            ESP Game                    MIR Flickr25000
            complete    60% missing     complete    60% missing
SVM         80.2        75.2            70.2        65.7
MLLS        79.8        75.0            75.9        71.5
MLR-L1      82.9        79.4            75.4        69.1
MLR-GL      83.8        82.1            76.2        74.1

Figure 1.6: The difference between the two proposed multi-label ranking approaches (a) MLR-L1 (Chapter 4) and (b) MLR-GL (Chapter 5) is that MLR-L1 strictly addresses the complete class assignment problem whereas MLR-GL can handle missing class assignments. For example, only two labels (soccer and field, written with bold characters) are given for the above image whereas two labels (goalkeeper and referee, underlined text) are missing.
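A hypothetical helper mirroring the evaluation protocol behind Table 1.2 is sketched below: it randomly removes a fraction of the positive training assignments (the "60% missing" setting) and computes a class-averaged AUC-ROC score. The function names are illustrative and the (n, m) label-matrix layout follows the notation introduced in Section 1.5.

```python
# Hypothetical sketch of the "missing label" protocol: drop a fraction of positive
# training assignments, then evaluate with a class-averaged AUC-ROC score.
import numpy as np
from sklearn.metrics import roc_auc_score

def drop_positive_labels(Y, missing_ratio=0.6, seed=0):
    """Return a copy of the (n, m) matrix Y in {-1, +1} with a fraction of +1 entries removed."""
    rng = np.random.default_rng(seed)
    Y_obs = Y.copy()
    pos = np.argwhere(Y == 1)                      # (row, column) indices of positive labels
    drop = pos[rng.random(len(pos)) < missing_ratio]
    Y_obs[drop[:, 0], drop[:, 1]] = -1             # missing positives look like negatives
    return Y_obs

def macro_auc(Y_true, scores):
    """Average AUC-ROC (%) over classes that have both positive and negative test labels."""
    aucs = [roc_auc_score(Y_true[:, k] == 1, scores[:, k])
            for k in range(Y_true.shape[1]) if len(np.unique(Y_true[:, k])) == 2]
    return 100.0 * float(np.mean(aucs))
```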
1.5 Notation

Let D = {x^1, ..., x^n} be a collection of n training instances drawn from a compact domain X ⊆ R^d. Each training example x^i is annotated by a set of class labels from L, denoted by a binary vector y^i = (y^i_1, ..., y^i_m) ∈ {−1, 1}^m, where m is the total number of classes, and y^i_k = 1 when x^i is assigned to class c_k and −1 otherwise. In multi-label ranking, we aim to learn m classification functions f_k(x): R^d → R, k = 1, ..., m, one for each class. We denote by {κ_j(x, x′): X × X → R, j = 1, ..., s} a set of s base kernels to be combined in multiple kernel learning (MKL). For each kernel function κ_j(·, ·), we construct a kernel matrix K_j = [κ_j(x^a, x^b)]_{n×n} by applying κ_j(·, ·) to the training instances in D. We denote by β = (β_1, ..., β_s)^⊤ ∈ R^s_+ the set of coefficients used to combine the base kernels, and denote by κ(x, x′; β) = Σ_{j=1}^{s} β_j κ_j(x, x′) and K(β) = Σ_{j=1}^{s} β_j K_j the combined kernel function and kernel matrix, respectively. We further denote by H_β the Reproducing Kernel Hilbert Space (RKHS) endowed with the combined kernel κ(x, x′; β).

The list of symbols and their descriptions is given in Table 1.3. Vectors and matrices are denoted by bold lowercase and uppercase characters, respectively. We use a superscript to indicate the training instance index and a subscript to indicate the class index for the feature and label vectors. For example, y^i ∈ R^m, with m being the number of labels, denotes the label vector of the multi-labeled training instance x^i. On the other hand, y_k ∈ R^n, where n is the number of training instances, is the label assignment vector over all training instances for class c_k. We use a scalar y^i_k ∈ {−1, +1} to indicate the label assignment of instance i for class c_k. For binary classification tasks, for example in Chapter 2, we drop the subscript, i.e., y^i ∈ {−1, +1}, for simplicity. For a matrix K, K_{:,i} and K_{j,:} denote the ith column and jth row vectors, respectively. In the multiple kernel learning chapters, K_j indicates the jth base kernel.

Table 1.3: The list of symbols used in this dissertation

Symbol                                     Definition
X ⊆ R^d                                    Instance space
L                                          Label set
d                                          Number of dimensions
n                                          Number of instances
m                                          Number of class labels
s                                          Number of base kernels for MKL
κ(·, ·)                                    Kernel function
1_k                                        k-dimensional vector of all ones
0_k                                        k-dimensional vector of all zeros
M_{:,i}                                    ith column vector of the matrix M
f_k(x): R^d → R                            Classification function for class k
H_β                                        Reproducing Kernel Hilbert Space (RKHS) endowed with the combined kernel
β = (β_1, ..., β_s)^⊤ ∈ R^s_+              Kernel coefficients for MKL
x^i = (x^i_1, x^i_2, ..., x^i_d) ∈ X       Training instance
y^i = (y^i_1, ..., y^i_m) ∈ {−1, 1}^m      Label vector
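As a small worked example of this notation (toy sizes, not thesis data), the snippet below builds a label matrix whose rows are the per-instance vectors y^i and whose columns are the per-class vectors y_k, forms s random positive semi-definite base kernel matrices, and combines them into K(β) with weights on the simplex.

```python
# A small worked example of the notation above with toy data.
import numpy as np

n, m, s = 5, 3, 4                                    # instances, classes, base kernels
rng = np.random.default_rng(0)
Y = np.where(rng.random((n, m)) > 0.5, 1, -1)        # Y[i] is y^i; Y[:, k] is y_k
K_list = []
for _ in range(s):                                   # build s positive semi-definite base kernels
    A = rng.normal(size=(n, n))
    K_list.append(A @ A.T)
beta = np.full(s, 1.0 / s)                           # kernel weights beta in R^s_+ with L1 norm 1
K_beta = sum(b * K for b, K in zip(beta, K_list))    # combined kernel matrix K(beta)
assert np.min(np.linalg.eigvalsh(K_beta)) > -1e-8    # a non-negative combination remains PSD
```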
, ym ) ∈ {−1, 1}m Chapter 2 Multiple Kernel Learning for Image Categorization: A Review 2.1 Introduction Kernel methods [32] have become popular in computer vision, particularly for image categorization. The key idea of kernel methods is to introduce nonlinearity into the decision function by mapping the original features to a higher dimensional space. Many studies [4, 33, 34] have shown that nonlinear kernels, such as radial basis functions (RBF) or chi-squared kernels, yield significantly higher accuracy for image categorization than a linear classification model. One difficulty in developing kernel classifiers is to design an appropriate kernel function for a given task. We often have multiple kernel candidates for image categorization. These kernels arise either because multiple feature representations are derived for images, or because different kernel functions (e.g., polynomial, RBF, and chi-squared) are used to measure the visual similarity between two images for a given feature representation. One of the key challenges in image categorization is to find the optimal combination of these kernels for a given object class. This is the central question addressed by Multiple Kernel Learning (MKL). 17 Table 2.1: Comparison of MKL baselines and simple baselines (“Single” for single best performing kernel and “AVG” for the average of all the base kernels) in terms of classification accuracy. The last three columns give the references in which either “method1” or “method2” performs better, or both methods give comparable results, respectively. meth1 MKL MKL L1 -MKL L1 -MKL L1 -MKL meth2 Single Single AVG AVG AVG dataset UCI UCI Cal-101 VOC07 Oxford Flowers Lp -MKL AVG VOC07 Lp -MKL AVG Cal-101 Lp -MKL AVG Oxford Flowers L1 -MKL Lp -MKL UCI L1 -MKL Lp -MKL VOC07 L1 -MKL Lp -MKL Cal-101 # samples # kernels [1-6K] [1-10] [1-2K] [10-200] [510-3K] [10-1K] 5011 [10-22] 680 [5-65] mtd1 [35] [37] [38], [9] [9], 5011 [1K-3K] 680 10 [24-1K] [5,65] [42] [41] [1-2K] 5011 [510-3K] [1-50] [10-22] [10-1K] [44] mtd2 comp. [36] [39], [40] [41] [41] [42] [43] [40] [41] [45], [46] [42], [41] [40] [47] [41] A lack of comprehensive studies has resulted in different, sometimes conflicting, statements regarding the effectiveness of various MKL methods on real-world problems, particularly for image categorization. For instance, some of the studies [5, 9, 41, 46] reported that MKL outperforms the average kernel baseline while other studies made the opposite conclusion [40,48,49], see Table 2.1. Moreover, as Table 2.2 shows, there are also some confusing results and statements about the efficiency of different MKL methods. Besides summarizing the latest developments in MKL and its application to image categorization, an important contribution of this chapter is to resolve the conflicting statements by conducting a comprehensive evaluation of state-of-the-art MKL algorithms under various experimental conditions. The main contributions of the survey we give in this chapter are: • A review of a wide range of MKL formulations that use different regularization mechanisms, and the related optimization algorithms. • A comprehensive study that evaluates and compares a representative set of MKL algorithms 18 Table 2.2: Comparison of computational efficiency of MKL methods. The last three columns give the references, where “method1” is better, “method2” is better, or both give similar results. 
meth1 meth2 datasets # samples # kernels training time L1 -MKL Lp -MKL MedMill 30,993 3 MKL-L1 Lp -MKL UCI [1-2K] [90-800] MKL-SD MKL-SIP UCI [1-2K] [50-200] MKL-SD MKL-SIP UCI [1-2K] [50-200] MKL-SD MKL-SIP Oxford 680 [5-65] Flowers MKL-SD MKL-MD Oxford 680 [5-65] Flowers MKL-SD MKL-MD Cal-101 3,060 9 MKL-SD MKL-MD VOC07 5,011 22 MKL-SD MKL-Lev UCI [1-2K] [50-200] MKL-SIP MKL-Lev UCI [1-2K] [50-200] # active kernels MKL-SD MKL-SIP UCI [1-2K] [50-200] MKL-SD MKL-SIP UCI [1-2K] [50-200] MKL-SD MKL-Lev UCI [1-2K] [50-200] MKL-SIP MKL-Lev UCI [1-2K] [50-200] mtd1 mtd2 cmp. [50] [48] [51], [52] [53], [46] [43] [39] [9] [9] [52] [52] [51] [53] [52] [52] for image categorization under different experimental settings. • An exposition of the conflicting statements regarding the performance of different MKL methods, particularly for image categorization. We attempt to understand these statements and determine to what degree and under what conditions these statements are correct. 2.2 Overview In this section we give an overview of multiple kernel learning. 19 2.2.1 Overview of Multiple Kernel Learning (MKL) MKL was first proposed in [54], where it was cast into a Semi-Definite Programming (SDP) problem. Most studies on MKL are centered around two issues, (i) how to improve the classification accuracy of MKL by exploring different formulations, and (ii) how to improve the learning efficiency of MKL by exploiting different optimization techniques (see Figure 2.1). In order to learn an appropriate kernel combination, various regularizers have been introduced for MKL, including L1 norm [55], Lp norm (p > 1) [56], entropy based [48], and mixed norms [57]. Among them, L1 norm is probably the most popular choice because it results in sparse solutions and could potentially eliminate irrelevant and noisy kernels. In addition, theoretical studies [58, 59] have shown that L1 norm will result in a small generalization error even when the number of kernels is very large. A number of empirical studies have compared the effect of different regularizers used for MKL [41,46,60]. Unfortunately, different studies arrive at contradictory conclusions. For instance, while many studies claim that L1 regularization yields good performance for object recognition [40, 61], others show that L1 regularization results in information loss by imposing sparseness over MKL solutions, thus leading to suboptimal performance [41, 46, 48, 60, 62]. In addition to a linear combination of base kernels, several algorithms have been proposed to find a nonlinear combination of base kernels [39, 45, 63–65]. Some of these algorithms try to find a polynomial combination of the base kernels [45, 63], while others aim to learn an instancedependent linear combination of kernels [5, 66, 67]. The main shortcoming of these approaches is that they have to deal with non-convex optimization problems, leading to poor computational efficiency and suboptimal performance. Given these shortcomings, we will not review them in detail. Despite significant efforts in improving the effectiveness of MKL, one of the critical questions remaining is whether MKL is more effective than the popular simple baselines, e.g., taking the average of the base kernels. While many studies show that MKL algorithms bring significant 20 improvement over the average kernel approach [46, 62, 68], opposite conclusions have been drawn by some other studies [40, 41, 48, 49]. 
Our empirical studies show that these conflicting statements are largely due to the variations in the experimental conditions, or in other words, the consequence of a lack of comprehensive studies on MKL. The second line of research in MKL is to improve the learning efficiency. Many efficient MKL algorithms [46,48,53,55,64,69,70] have been proposed, mostly for L1 regularized MKL, based on the first order optimization methods. We again observe conflicting statements in the MKL literature when comparing different optimization algorithms. For instance, while some studies [46,51,52] report that the subgradient descent (SD) algorithms [53] are more efficient in training MKL than the semi-infinite linear programming (SILP) based algorithm [71], an opposing statement was given in [61]. It is important to note that besides the training time, the sparseness of the solution also plays an important role in computational efficiency: both the number of active kernels and the number of support vectors affect the number of kernel evaluations and, consequentially, computational times for both training and testing. Unfortunately, most studies focus on only one aspect of computational efficiency: some only report the total training time [48, 61] while others focus on the number of support vectors (support set size) [46,67]. Another limitation of the previous studies is that they are mostly constrained to small data sets (around 1,000 samples) and limited number of base kernels (10 to 50), making it difficult to draw meaningful conclusions on the computational efficiency. 2.2.2 Relationship to the Other Approaches Multiple kernel learning is closely related to feature selection [72], where the goal is to identify a subset of features that are optimal for a given prediction task. This is evidenced by the equivalence between MKL and group lasso [73], a feature selection method where features are organized into groups, and the selection is conducted at the group level instead of at the level of individual features. 21 Feature selection and feature combination can be given among the main motivations of multiple kernel learning, particularly for the image categorization task. There is a vast amount of choices of image representations. Feature selection is related to choosing the correct image representation for the given classification task. In this manner, MKL is closely related to feature selection. However, selecting one type of representation might not be adequate, since image categorization often involves many classification tasks, one for each image class, and one representation that would work for some of the classes might not work for others. One way to tackle this problem is combining several features. The early approaches for feature combination includes unweighted combination of features [34] or employing brute force learning of feature combination parameters [74]. However, the goal of MKL is to find a more principled way of performing feature combination. It is important to note that equivalence between MKL and group lasso has been proven in [73] building a formal connection between MKL and feature selection. MKL is also related to metric learning [75], where the goal is to find a distance metric, or more generally a distance function, consistent with the class assignment. MKL generalizes metric learning by searching for a combination of kernel functions that gives a larger similarity to any instance pair from the same class than instance pairs from different classes. 
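As a concrete point of reference, the unweighted combination mentioned above amounts to averaging the base kernel matrices; this is the AVG baseline that MKL methods are compared against throughout this chapter. The minimal NumPy sketch below is illustrative only: the helper name and the random matrices standing in for real image kernels are ours.

```python
import numpy as np

def average_kernel(base_kernels):
    """Unweighted kernel combination (the AVG baseline).

    base_kernels : list of (n, n) precomputed kernel matrices K_1, ..., K_s.
    Returns the uniform combination (1/s) * sum_j K_j, i.e., beta_j = 1/s for all j.
    """
    s = len(base_kernels)
    return sum(base_kernels) / s

# Illustrative usage with random PSD matrices standing in for real image kernels.
rng = np.random.default_rng(0)
Ks = []
for _ in range(3):
    X = rng.normal(size=(50, 10))
    Ks.append(X @ X.T)          # each X @ X.T is a valid (linear) kernel matrix
K_avg = average_kernel(Ks)      # combined kernel that would be fed to a standard SVM solver
```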
Finally, it is important to note that multiple kernel learning is a special case of kernel learning. In addition to MKL, another popular approach for learning a linear combination of multiple kernels is kernel alignment [76], which finds the optimal combination of kernels by maximizing the alignment between the combined kernel and the class assignments matrix. More generally, kernel learning methods can be classified into two groups: parametric and non-parametric kernel learning. In parametric kernel learning, a parametric form is assumed for the combined kernel function [77, 78]. In contrast, nonparametric kernel learning does not make any parametric assumption about the target kernel function [76, 79, 80]. Multiple kernel learning belongs to the category of parametric kernel learning. Despite its generality, the high computational cost of nonparametric kernel learning limits its applications to real-world problems. Aside from supervised 22 kernel learning, both semi-supervised and unsupervised kernel learning have also been investigated [76, 78, 81]. We do not review them in detail here because of their limited success in practice and because of their high computational cost. 2.3 Multiple Kernel Learning (MKL): Formulations In this section, we first review the theory of multiple kernel learning for binary classification. We leave the discussion of the MKL methods for multi-class and multi-label learning to Chapter 3. Let D = {x1 , . . . , xn } be a collection of n training instances, where X ⊆ Rd is a compact domain. Let y = (y 1 , . . . , y n )⊤ ∈ {−1, +1}n be the vector of class assignments for the instances in D. We denote by {κj (x, x′ ) : X × X → R, j = 1, . . . , s} the set of s base kernels to be combined. For each kernel function κj (·, ·), we construct a kernel matrix Kj = [κj (x, x′ )]n×n by applying κj (·, ·) to the training instances in D. We denote by β = (β1 , . . . , βs )⊤ ∈ Rs+ the set of coefficients used to combine the base kernels, and denote by κ(x, x′ ; β) = and K(β) = s j=1 βj Kj s j=1 βj κj (x, x′ ) the combined kernel function and kernel matrix, respectively. We further denote by Hβ the Reproducing Kernel Hilbert Space (RKHS) endowed with the combined kernel κ(x, x′ ; β). In order to learn the optimal combination of kernels, we first define the regularized classification error L(β) for a combined kernel κ(·, ·; β), i.e., L(β) = min f ∈Hβ 1 ||f ||2Hβ + C 2 n ℓ(y i f (xi )), (2.1) i=1 where ℓ(z) = max(0, 1 − z) is the hinge loss and C > 0 is a regularization parameter. Given the regularized classification error, the optimal combination vector β is found by minimizing L(β), i.e., min β∈∆,f ∈Hβ 1 ||f ||2Hβ + C 2 n ℓ(y i f (xi )) (2.2) i=1 where ∆ is a convex domain for combination weights β that will be discussed later. As in [54], 23 the problem in Eq. (2.2) can be written into its dual form, leading to the following convex-concave optimization problem 1 min max L(α, β) = 1⊤ α − (α ◦ y)⊤ K(β)(α ◦ y), β∈∆ α∈Q 2 (2.3) where ◦ denotes the Hadamard (element-wise) product, 1 is a vector of all ones, and Q = {α ∈ [0, C]n } is the domain for dual variables α. The choice of domain ∆ for kernel coefficients can have a significant impact on both classification accuracy and efficiency of MKL. One common practice is to restrict β to a probability distribution, leading to the following definition of domain ∆ [54, 55], s ∆1 = β∈ Rs+ : β 1 = j=1 |βj | ≤ 1 . (2.4) Since ∆1 bounds β 1 , we also refer to MKL using ∆1 as the L1 regularized MKL, or L1 -MKL. 
The key advantage of using ∆1 is that it results in a sparse solution for β, leading to the elimination of irrelevant kernels and consequentially an improvement in computational efficiency as well as robustness in classification. 2.3.1 Multiple Kernel Learning and Group Lasso Lasso (least absolute shrinkage and selection operator), regression with L1 regularization, is a popular technique that performs feature selection and shrinkage [82]. Shrinkage in this context means producing sparse solutions, since the L1 -norm regularization forces some of the covariates to shrink to zero. An extension of the lasso technique, in which the L1 -norm is replaced by a block L1 -norm, is called the group lasso. In group lasso the covariates are assumed to be clustered and the absolute values of each group’s Euclidean norm are added when constructing the regularizer term. Therefore, the shrinkage is forced at the group level, meaning that all covariates within a 24 group are forced to be zero altogether. Let each training instance xi ∈ Rd have a block structure with m blocks, such that xi = m k=1 dk (xi1 , xi2 , . . . , xim ), where xik ∈ Rdk , k = 1, 2, . . . , m and = d. The group lasso can be formulated as the optimization problem in Eq. (2.5), n min w∈Rd ,b∈R m ℓ((xi , y i ); w) + C i=1 k=1 λk ||wk ||, (2.5) where w is a linear classifier, b is a bias term, C is a constant, and λk , k = 1, . . . , m are positive weights. Square of the block L1 -norm, ( m k=1 λk ||wk ||)2 , can also be used as an alternative group lasso regularizer and would give the same path of solutions [35, 73]. The group lasso formulation with the squared block L1 -norm, can be extended to nonlinear case by using functions and reproducing kernel Hilbert norms instead of linear predictors and Euclidean norms as expressed in Eq.(2.6), 1 ( {fk }k=1 ∈ 2 m min m k=1 n ||fk ||Hk )2 + C m ℓ(y i i=1 fk (xi )), (2.6) k=1 where Hk is the k-th Reproducing Kernel Hilbert Space (RKHS). Note that this formulation, which learns a sparse combination of functions, enables using an infinite dimensional space for each group. By following [46, 73], it is possible to show that this formulation is equivalent to learning a convex combination of kernel functions, each corresponding to one group and endows the corresponding RKHS. To prove this connection, we will use an alternative MKL formulation that is given by Eq. (2.7). min λ∈Rm +, min m k λk =1 {fk ∈Hk }k=1 1 2 m k=1 n λk ||fk ||2Hk +C m y i λk fk (xi )). ℓ( i=1 k=1 We provide the proof of equivalence between Eqs. (2.2) and (2.7) in the Appendix. 25 (2.7) Replacing λk fk with f˜k , we rewrite Eq. (2.7) as Eq. (2.8). min m λ∈R+ , min m ˜ k λk =1 {fk ∈Hk } k=1 1 2 m k=1 1 ˜ 2 ||fk ||Hk + C λk n m y i f˜k (xi )). ℓ( i=1 (2.8) k=1 It is straightforward to show that the expression in Eq. (2.9) is the minimizer of Eq. (2.8), λk = ˜ ||fk ||Hk . m ˜k ||H || f k k=1 (2.9) Substituting the expression in Eq. (2.9) into Eq. (2.8) leads to the following optimization problem, 1 ( {fk }k=1 ∈ 2 m min m k=1 n ||fk ||Hk )2 + C m ℓ(y i i=1 fk (xi )), (2.10) j=1 which is the same as Eq. (2.10), proving the equivalance between MKL and group lasso. 2.3.2 Regularization in MKL The robustness of L1 -MKL is verified by the analysis in [58], which states that the additional generalization error caused by combining multiple kernels is O( log s/n) when using ∆1 as the domain for β, implying that L1 -MKL is robust to the number of kernels as long as the number of training examples is not too small. 
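Since this equivalence is used repeatedly below, the chain of reformulations from Section 2.3.1 can be written compactly in the notation already introduced; the block is only a restatement of Eqs. (2.7)–(2.10), with the change of variables f̃_k = λ_k f_k.

```latex
% Compact restatement of the MKL <-> group-lasso equivalence (Eqs. (2.7)-(2.10)).
\begin{align*}
\min_{\lambda\in\mathbb{R}^m_+,\ \sum_k\lambda_k=1}\ \min_{\{f_k\in\mathcal{H}_k\}}\
  &\frac{1}{2}\sum_{k=1}^{m}\lambda_k\|f_k\|_{\mathcal{H}_k}^{2}
   + C\sum_{i=1}^{n}\ell\Big(y^i\sum_{k=1}^{m}\lambda_k f_k(\mathbf{x}^i)\Big) &&\text{(2.7)}\\
=\ \min_{\lambda\in\mathbb{R}^m_+,\ \sum_k\lambda_k=1}\ \min_{\{\tilde{f}_k\in\mathcal{H}_k\}}\
  &\frac{1}{2}\sum_{k=1}^{m}\frac{1}{\lambda_k}\|\tilde{f}_k\|_{\mathcal{H}_k}^{2}
   + C\sum_{i=1}^{n}\ell\Big(y^i\sum_{k=1}^{m}\tilde{f}_k(\mathbf{x}^i)\Big),
   \qquad \tilde{f}_k \equiv \lambda_k f_k. &&\text{(2.8)}
\end{align*}
The inner minimization over $\lambda$ is solved in closed form by
\begin{equation*}
\lambda_k \;=\; \frac{\|\tilde{f}_k\|_{\mathcal{H}_k}}{\sum_{j=1}^{m}\|\tilde{f}_j\|_{\mathcal{H}_j}},
\qquad\text{(2.9)}
\end{equation*}
and substituting this back yields the squared block-$L_1$ (group-lasso) objective
\begin{equation*}
\min_{\{f_k\in\mathcal{H}_k\}}\ \frac{1}{2}\Big(\sum_{k=1}^{m}\|f_k\|_{\mathcal{H}_k}\Big)^{2}
 + C\sum_{i=1}^{n}\ell\Big(y^i\sum_{k=1}^{m}f_k(\mathbf{x}^i)\Big).
\qquad\text{(2.10)}
\end{equation*}
```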
The advantage of L1 -MKL is further supported by the equivalence between L1 -MKL and feature selection using group Lasso [73]. Since group Lasso is proved to be effective in identifying the groups of irrelevant features, L1 -MKL is expected to be resilient to weak kernels. Despite the advantages of L1 -MKL, it was reported in [50] that sparse solutions generated by L1 -MKL might result in information loss and consequentially suboptimal performance. As a result, Lp regularized MKL (Lp -MKL), with p > 1, was proposed in [56, 61] in order to obtain a 26 smooth kernel combination, with the following definition for domain ∆ ∆p = β ∈ Rs+ : ||β||p ≤ 1 . (2.11) Among various choices of Lp -MKL (p > 1), L2 -MKL is probably the most popular one [49,50,56]. Other smooth regularizers proposed for MKL include negative entropy (i.e., s j=1 βj log βj ) [48] and Bregman divergence [70]. In addition, hybrid approaches have been proposed to combine different regularizers for MKL [49, 83, 84]. Although many studies compared L1 regularization to smooth regularizers for MKL, the results are inconsistent. While some studies claimed that L1 regularization yields better performance for image categorization [40, 61], others show that L1 regularization may result in suboptimal performance due to the sparseness of the solutions [41, 46, 48, 60, 62]. In addition, some studies reported that training an L1 -MKL is significantly more efficient than training a L2 -MKL [48], while others claimed that the training times for both MKL techniques are comparable [50]. A resolution to these contradictions, as revealed by our empirical study, depends on the number of training examples and the number of kernels. In terms of classification accuracy, smooth regularizers are more effective for MKL when the number of training examples is small. Given a sufficiently large number of training examples, particularly when the number of base kernels is large, L1 regularization is likely to outperform the smooth regularizers. In terms of computation time, we found that Lp -MKL methods are generally more efficient than L1 -MKL. This is because the objective function of Lp -MKL is smooth while the objective function of L1 -MKL is not 1 . As a result, Lp -MKL enjoys a significantly faster convergence rate (O(1/T 2)) than L1 -MKL (O(1/T )) according to [85], where T is the number of iterations. However, when the number of kernels is sufficiently large and kernel combination becomes the dominant computational cost at each iteration, L1 -MKL can be as efficient as Lp -MKL because 1 A function is smooth if its gradient is Lipschitz continuous 27 L1 -MKL produces sparse solutions. One critical question that remains to be answered is whether MKL is more effective than simple approaches for kernel combination, e.g., using the best single kernel (selected by cross validation) or the average kernel method. Most studies show that L1 -MKL outperforms the best performing kernel, although there are scenarios where kernel combination might not perform as well as the single best performing kernel [50]. Regarding the comparison of MKL to the average kernel baseline, the answer is far from conclusive (see Table 2.2). While some studies show that L1 -MKL brings significant improvement over the average kernel approach [46, 62, 68, 86], other studies claim the opposite [40, 41, 48, 49]. As revealed by the empirical study presented in Section 2.5, the answer to this question depends on the experimental setup. 
When the number of training examples is not sufficient to identify the strong kernels, MKL may not perform better than the average kernel approach. But, with a large number of base kernels and a sufficiently large number of training examples, MKL is very likely to outperform, or at least yield similar performance as, the average kernel technique. 2.4 Multiple Kernel Learning: Optimization Techniques A large number of algorithms have been proposed to solve the optimization problems posed in Eqs. (2.2) and (2.3). We can broadly classify them into two categories. The first group of approaches directly solve the primal problem in Eq. (2.2) or the dual problem in Eq. (2.3). We refer to them as the direct approaches. The methods of the second group solve the convex-concave optimization problem in Eq. (2.3) by alternating between two steps, i.e., the step for updating the kernel combination weights and the step for solving the SVM classifier for the given combination weights. We refer to them as the wrapper approaches. Figure 2.1 summarizes different optimization methods developed for MKL. We note that due to the scalability issue, almost all MKL algorithms are based on first order methods (i.e., iteratively updating the solutions which use the gradient of the 28 objective function or the most violated constraint). We refer the readers to [52, 60, 87] for more discussion about the equivalence or similarities among different MKL algorithms. 2.4.1 Direct Approaches for MKL Lanckriet et al. [54] showed that the problem in Eq. (2.2) can be cast into Semi-Definite Programming (SDP) problem, i.e., n min t/2 + C s. t.  z∈Rn ,β∈∆,t≥0 i=1 max(0, 1 − y i z i )   K(β) z    z⊤ t 0. (2.12) Although general-purpose optimization tools such as SeDuMi [88] and Mosek [89] can be used to directly solve the optimization problem in Eq. (2.12), they are computationally expensive and are unable to handle more than a few hundred training examples. Besides directly solving the primal problem, several algorithms have been developed to directly solve the dual problem in Eq. (2.3). Bach et al. [35] proposed to solve the dual problem using sequential minimal optimization (SMO) [90]. In [48], the authors applied the Nesterov’s method to solve the optimization problem in Eq. (2.3). Although both approaches are significantly more efficient than the direct approaches that solve the primal problem of MKL, they are generally less efficient than the wrapper approaches [55]. 2.4.1.1 A Sequential Minimum Optimization (SMO) based Approach for MKL This approach is designed for Lp -MKL. Instead of constraining β p ≤ 1, Vishwanathan et al. proposed to solve a regularized version of MKL in [70], and converted it into the following optimization problem, 29 1 max 1⊤ α − α∈Q 8λ 2 q s j=1 (α ◦ y)⊤ Kj (α ◦ y) q . (2.13) It can be shown that given α, the optimal solution for β is given by γj βj = 2λ q p where γj = (α ◦ y)⊤ Kj (α ◦ y) 1 − p1 q s k=1 (α ◦ y)⊤ Kk (α ◦ y) q (2.14) and q −1 + p−1 = 1. Since the objective given in Eq. (2.13) is differentiable, a Sequential Minimum Optimization (SMO) approach [70] can be used. 2.4.2 Wrapper Approaches for MKL The main advantage of the wrapper approaches is that they are able to effectively exploit the off-the-shelf SVM solvers, making them, in general, significantly more efficient than the direct approaches. 
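As a hedged illustration of this wrapper template, the sketch below alternates between an off-the-shelf SVM solve (scikit-learn's precomputed-kernel SVC) for a fixed β and an update of β. The particular update shown is the closed-form, group-lasso style step of the MKL-GL method described in Section 2.4.2.4; the function is our own simplification for exposition, not one of the cited implementations.

```python
import numpy as np
from sklearn.svm import SVC

def wrapper_mkl(base_kernels, y, C=1.0, n_iter=20, tol=1e-4):
    """Generic wrapper loop for L1-MKL (illustrative sketch).

    base_kernels : list of s precomputed (n, n) kernel matrices.
    y            : labels in {-1, +1}.
    Alternates between (i) solving a kernel SVM for the combined kernel K(beta)
    and (ii) updating beta; step (ii) here is the closed-form MKL-GL update
    beta_j proportional to ||f_j||_{H_j} = beta_j * sqrt((alpha o y)' K_j (alpha o y)).
    """
    s, n = len(base_kernels), len(y)
    beta = np.full(s, 1.0 / s)                      # start from the uniform combination
    for _ in range(n_iter):
        K = sum(b * Kj for b, Kj in zip(beta, base_kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        ay = np.zeros(n)                            # (alpha o y), nonzero only on support vectors
        ay[svm.support_] = svm.dual_coef_[0]
        norms = np.array([b * np.sqrt(max(ay @ Kj @ ay, 0.0))
                          for b, Kj in zip(beta, base_kernels)])
        if norms.sum() == 0:
            break
        new_beta = norms / norms.sum()              # normalization keeps beta on the simplex
        if np.max(np.abs(new_beta - beta)) < tol:   # stop when ||beta_t - beta_{t-1}||_inf is small
            beta = new_beta
            break
        beta = new_beta
    return beta, svm
```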
Below, we describe several representative wrapper approaches for MKL, including a semi-infinite programming (SIP) approach, a subgradient descent approach, an extended level method, an alternating optimization approach, and a sequential minimum optimization (SMO) based approach. 2.4.2.1 A Semi-infinite Programming Approach for MKL (MKL-SIP) It was shown in [71] that the dual problem in Eq. (2.3) can be cast into the following SIP problem: min θ∈R,β∈∆ θ (2.15) s s. t. 1 βj {α⊤ 1 − (α ◦ y)⊤ Kj (α ◦ y)} ≥ θ, 2 j=1 ∀α ∈ Q 30 When the domain ∆1 is used for β, the problem in Eq. (2.15) is reduced to a Semi-Infinite Linear Programming (SILP) problem. To solve Eq. (2.15), we first initialize the problem with a small number of linear constraints. Then the SIP problem in Eq. (2.15) is solved by alternating between two steps, i.e., (i) finding the optimal β and θ with fixed constraints, and (ii) finding the unsatisfied constraints with the largest violation under the fixed β and θ and adding them to the system. Note that in the second step, to find the most violated constraints, the following optimization problem, which is an SVM problem for the combined kernel κ(·, ·; β), needs to be solved: s max α∈Q 1 βj Sj (α) = α⊤ 1 − (α ◦ y)⊤ K(β)(α ◦ y). 2 j=1 2.4.2.2 Subgradient Descent Approaches for MKL (MKL-SD & MKL-MD) A popular wrapper approach for MKL is SimpleMKL [53], which solves the dual problem in Eq. (2.3) by a subgradient descent approach. The authors turn the convex concave optimization problem in Eq. (2.3) into a minimization problem min J(β), where the objective J(β) is defined β∈∆ as 1 J(β) = max − (α ◦ y)⊤ K(β)(α ◦ y) + 1⊤ α. α∈Q 2 (2.16) Since the partial gradient of J(β) is given by ∂βj J(β) = 1 − 21 (α∗ ◦ y)⊤ Kj (α∗ ◦ y), j = 1, . . . , s, where α∗ is an optimal solution to Eq. (2.16), following the subgradient descent algorithm, we update the solution β by β ← π∆ (β − η∂J(β)) where η > 0 is the step size determined by a line search [53] and π∆ (β) projects β into the domain ∆. Similar approaches were proposed in [62, 63]. A generalization of the subgradient descent method for MKL is a mirror descent method (MKL-MD) [39]. Given a proximity function w(β′ , β), the current solution β t and the subgra31 dient ∂J(β t ), the new solution β t+1 is obtained by solving the following optimization problem β t+1 = arg min η(β − β t )⊤ ∂J(β t ) + w(β t , β), (2.17) β∈∆ where η > 0 is the step size. The main shortcoming of SimpleMKL arises from the high computational cost of line search. It was indicated in [46] that many iterations may be needed by the line search to determine the optimal step size. Since each iteration of the line search requires solving a kernel SVM, it becomes computationally expensive when the number of training examples is large. Another subtle issue of SimpleMKL, as pointed out in [53], is that it may not converge to the global optimum if the kernel SVMs in the intermediate steps are not solved with high precision. 2.4.2.3 An Extended Level Method for MKL (MKL-Level) An extended level method is proposed for L1 -MKL in [52]. To solve the optimization problem in Eq. (2.3), at each iteration, the level method first constructs a cutting plane model g t (β) that provides a lower bound for the objective function J(β). Given {β a }ta=1 , the solutions obtained for the first t iterations, a cutting plane model is constructed as g t (β) = max1≤a≤t L(β, αa ), where αa = arg maxα∈Q L(β a , α). 
Given the cutting plane model, the level method then constructs a level set St as ¯ t + (1 − λ)Lt }, St = {β ∈ ∆1 : g t (β) ≤ lt = λL (2.18) ¯ t and Lt , the upper and lower and obtain the new solution β t+1 by projecting β t into St , where L ¯ t = min L(β a , αa ). bounds for the optimal value L(β ∗ , α∗ ), are given by Lt = min g t (β) and L β∈∆ 1≤a≤t Compared to the subgradient-based approaches, the main advantage of the extended level method is that it is able to exploit all the gradients computed in the past for generating new solutions, leading to a faster convergence to the optimal solution. 32 2.4.2.4 An Alternating Optimization Method for MKL (MKL-GL) This approach was proposed in [53, 56] for L1 -MKL. It is based on the equivalence between group Lasso and MKL, and solves the following optimization problem for MKL min β ∈ ∆1 1 2 s j=1 fj βj 2 Hj n s ℓ yi +C i=1 fj (xi ) (2.19) j=1 fj ∈ Hj The solution requires alternating between two steps, i.e., the step of optimizing fj under fixed β and the step of optimizing β given fixed fj . The first step is equivalent to solving a kernel SVM with a combined kernel κ(·, ·; β), and the optimal solution in the second step is given by βj = ||fj ||Hj ,j s k=1 ||fk ||Hk = 1, . . . , s. (2.20) It was shown in [46] that the above approach can be extended to Lp -MKL. 2.4.3 Online Learning Algorithms for MKL Online learning is computationally efficient as it only needs to process one training example at each iteration. In [91], the authors proposed several online learning algorithms for MKL that combine the Perceptron algorithm [92] with the Hedge algorithm [93]. More specifically, the authors applied the Perceptron algorithm to update the classifiers for the base kernels and the Hedge algorithm to learn the combination weights. In [38], Jie et al. presented an online learning algorithm for MKL, based on the follow-the-regularized-leader (FTRL) framework. One disadvantage of online learning for MKL is that it usually yields suboptimal recognition performance compared to the batch learning algorithms. As a result, we did not include online MKL in our empirical study. 33 2.4.4 Computational Efficiency In this section, we review the conflicting statements in MKL literature about the computational efficiency of different optimization algorithms for MKL. First, there is no consensus on the efficiency of the SIP based approach for MKL (MKL-SIP). While several studies show a slow convergence of MKL-SIP [52, 53, 68, 70], it was stated in [87] that only a few iterations would suffice when the number of relevant kernels is small. According to our empirical study, the SIP based approach can converge in a few iterations for Lp -MKL. On the other hand, MKL-SIP takes many more iterations to converge for L1 -MKL. Second, several studies evaluated the training time of SimpleMKL in comparison to the other approaches for MKL, but with different conclusions. In [46] MKL-SIP was found to be significantly slower than SimpleMKL while the studies in [51, 52] reported the opposite. The main reason behind the conflicting conclusions is that the size of test bed (i.e. the number of training examples and the number of base kernels) varies significantly from one study to another (Table 2.2). 
When the number of kernels and the number of training examples are large, calculation and combination of the base kernels take a significant amount of the computational load, while for small data sets, the computational efficiency is mostly decided by the iteration complexity of algorithms. In addition, implementation details, including the choice of stopping criteria and programming tricks for calculating the combined kernel matrix, can also affect the running time. Our empirical study for image categorization showed that SimpleMKL is less efficient than MKL-SIP. Although SimpleMKL requires a smaller number of iterations, it takes significantly longer time to finish one iteration compared to the other approaches for MKL, due to the high computational cost of the line search. Overall, we observed that MKL-SIP is more efficient than the other wrapper optimization techniques for MKL whereas MKL-SMO is the fastest method for solving Lp -MKL. 34 2.5 Experiments Our goal is to evaluate the classification performance of different MKL formulations and the efficiency of different optimization techniques for MKL. We focus on MKL algorithms for binary classification, and apply the one-vs-all strategy to convert a multi-label learning problem into a set of binary classification problems. Among various formulations for MKL, we only evaluate algorithms for L1 and Lp regularized MKL. As stated earlier, we do not consider (i) online MKL algorithms due to their suboptimal performance and (ii) nonlinear MKL algorithms due to their high computational costs. The first objective of this empirical study is to compare L1 -MKL algorithms to the two simple baselines of kernel combination mentioned in Section 2.2.1, i.e., the single best performing kernel and the average kernel approach. As already mentioned in Section 2.2.1, there are contradictory statements from different studies regarding the comparison of MKL algorithms to these two baselines. The goal of our empirical study is to examine and identify the factors that may contribute to the conflicting statements. The factors we consider here include (i) the number of training examples and (ii) the number of base kernels. The second objective of this study is to evaluate the classification performance of different MKL formulations for image categorization. In particular, we will compare L1 -MKL to Lp -MKL with p = 2 and p = 4. The final objective of this study is to evaluate the computational efficiency of different optimization algorithms for MKL. To this end, we choose seven representative MKL algorithms in our study (See Section 2.5.2). 2.5.1 Data sets, Features and Kernels Three benchmark data sets for image categorization are used in our study: Caltech 101 [3], Pascal VOC 2007 [94], and a subset of ImageNet (see Appendix A). All the experiments conducted in this study are repeated five times, each with an independent random partition of training and testing data. Average classification accuracies along with the associated standard deviation are reported. 35 The Caltech 101: To obtain the full spectrum of classification performance for MKL, we vary the number of training examples per class (10, 20, 30). We construct 48 base kernels (Table 2.3) for the Caltech 101 data set: 39 of them are built by following the procedure in [43] and the remaining 9 are constructed by following [69]. For all the feature sets except the one that is based on geometric blur, RBF kernel with χ2 distance is used as the kernel function [33]. 
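Most of these base kernels are "RBF kernels with the χ² distance" computed between histogram features. The sketch below shows one common convention for building such a kernel matrix (definitions vary slightly across papers, e.g., in whether a factor of 1/2 is applied to the distance); the bandwidth is set to the average pairwise χ² distance, matching the choice described later in Section 2.5.3. The helper name is ours.

```python
import numpy as np

def chi2_rbf_kernel(X, eps=1e-10):
    """Exponential chi-squared kernel for histogram features (one common convention).

    X : (n, d) array of non-negative histograms (e.g., bag-of-words counts).
    Uses d_chi2(x, x') = sum_b (x_b - x'_b)^2 / (x_b + x'_b) and
    k(x, x') = exp(-d_chi2(x, x') / sigma), with sigma set to the mean
    pairwise chi-squared distance over the training set.
    """
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        num = (X[i] - X) ** 2
        den = X[i] + X + eps                        # eps guards against empty histogram bins
        D[i] = (num / den).sum(axis=1)
    sigma = D[np.triu_indices(n, k=1)].mean()       # average pairwise chi^2 distance
    return np.exp(-D / sigma)
```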
For the geometric blur feature, RBF kernel with the average distance of the nearest descriptor pairs between two images is used [69]. Table 2.3: Description of the 48 kernels built for the Caltech 101 data set. Kernel indices 1-3 4 5-8 9-12 13-16 17-18 19-22 23-26 27-30 31,34, 33,34 35 36-38 39 40 41-43 44-46 47-48 Description LBP [95] LBP (combined histogram) BoW with dense-SIFT (300 bins) BoW with dense-SIFT (1000 bins) BoW with dense-SIFT (1000 bins) SIFT on 100 sub-windows [40] BoW with dense-SIFT (300 bins) Canny edge detector + histogram of unoriented gradient feature (40 bins) Canny edge detector + histogram of oriented gradient feature (40 bins) [96] Product of kernels: {20 to 23}, {24 to 27}, {16 to 19}, and {4 to 7} V1S+ feature [97] Region covariance [98] Product of kernels 4 to 7 Geometric blur [99] BoW with dense-SIFT (300 bins) BoW with dense-SIFT (300 bins) BoW (300 visual words) [100] with self-similarity features Color Space Gray Gray HSV Gray HSV Gray-HSV Gray Gray # levels for SPK 3 3 4 4 4 1 4 4 Gray 4 1 Gray Gray Gray Gray HSV Gray 1 3 1 1 4 4 2 The Pascal VOC 2007: Similar to the Caltech 101 data set, we vary the number of training examples, by randomly selecting 1%, 25%, 50%, and 75% of images to form the training set. Due 36 to the different characteristics of the two data sets, we choose a different set of image features for VOC 2007, suggested by the participants of the VOC Challenges. In particular, for the MKL experiments, we follow [101] and create 15 sets of features: (i) GIST features [102]; (ii) six sets of color features generated by two different spatial pooling layouts [103] (1 × 1 and 3 × 1), and three types of color histograms (i.e. RGB, LAB, and HSV). (iii) eight sets of local features generated by two key-point detection methods (i.e., dense sampling and Harris-Laplacian [104]), two spatial layouts (1 × 1 and 3 × 1), and two local descriptors (SIFT and robust hue descriptor [105]). An RBF kernel function with χ2 distance is applied to each of the 15 feature sets. A Subset of ImageNet: Following the protocol in [106], we use 81, 738 images from ImageNet that belong to the 18 (out of 20) categories specified in VOC 2007. This data set is significantly larger than Caltech 101 and VOC 2007, making it possible to examine the scalability of MKL methods for image categorization. Both dense sampling and Harris-Laplacian [104] are used for key-point detection, and SIFT is used as the local descriptor. We create four BoW models by setting the vocabulary size to be 1, 000 and applying two descriptor pooling techniques (i.e. maxpooling and mean-pooling) for two types of spatial partitioning (i.e. 1×1 and 2×2). We also create six color histograms by applying two pooling techniques (i.e. max-pooling and mean-pooling) to three different color spaces, namely RGB, LAB and HSV. In total, ten kernels are created for the ImageNet data set. We note that the number of base kernels we construct for the ImageNet data set is significantly smaller than the other two data sets because of the significantly larger number of images in the ImageNet data set. The common practice for large scale data sets has been to use a small number of features/kernels for scalability concerns [106]. 2.5.2 MKL Methods Used in Comparison We divide the MKL baselines into two groups. The first group consists of the two simple baselines for kernel combination, i.e., the average kernel method (AVG) and the best performing kernel selected by the cross validation method (Single). 
The second group includes seven MKL meth37 ods designed for binary classification. These are: GMKL [63], SimpleMKL [53], VSKL [64], MKL-GL [46], MKL-Level [52], MKL-SIP [56], MKL-SMO [70]. The difference between the two subgradient descent based methods, SimpleMKL and GMKL, is that SimpleMKL performs a golden section search to find the optimal step size while GMKL applies a simple backtracking method. In addition to different optimization algorithms, we use L1 -MKL and Lp -MKL with p = 2 and p = 4. For Lp -MKL, we apply MKL-GL, MKL-SIP, and MKL-SMO to solve the related optimization problems. 2.5.3 Implementation To make a fair comparison, we followed [46] and implemented all wrapper MKL methods within the framework of SimpleMKL using MATLAB, where we used LIBSVM [107] as the SVM solver. For MKL-SIP and MKL-Level, CVX [108] and MOSEK [89] were used to solve the related optimization problems, as suggested in [52]. The same stopping criteria were applied to all baselines. The algorithms were stopped when one of the following criteria is satisfied: (i) the maximum number of iterations (specified as 40 for wrapper methods) is reached, (ii) the difference in the kernel coefficients β between two consecutive iterations is small (i.e., ||βt − β t−1 ||∞ < 10−4 ), (iii) the duality gap drops below a threshold value (10−3 ). The regularization parameter C was chosen with a grid search over {10−2, 10−1 , . . . , 104 }. The bandwidth of RBF kernels was set to the average pair-wise χ2 distance of image features. In our empirical study, all the feature vectors were normalized to have the unit L2 norm before they are used to construct the base kernels. According to [109] and [56], kernel normalization can have a significant impact on the performance of MKL. Various normalization methods have been proposed, including unit trace normalization [109], normalization with respect to the variance of kernel features [56], and spherical normalization [56]. However, we did not observed significant 38 differences in the classification accuracy when applied the above normalization techniques. The experiments with varied numbers of kernels on the ImageNet data set were performed on a cluster of Sun Fire X4600 M2 nodes, each with 256 GB of RAM and 32 AMD Opteron cores. All other experiments were run on a different cluster, where each node has two four-core Intel Xeon E5620s at 2.4 GHz with 24 GB of RAM. We pre-computed all the kernel matrices and loaded them into the memory. This allowed us to avoid re-computing and loading kernel matrices at each iteration of optimization. 2.5.4 Classification Performance of MKL We evaluate the classification performance by the category based mean average precision (MAP) score. For convenience, we report normalized MAP scores (percentage). 2.5.4.1 Experiment 1: Classification Performance Table 2.4 summarizes the classification results for the Caltech 101 data set with 10, 20, and 30 training examples per class. First, we observe that both the MKL algorithms and the average kernel approach (AVG) outperform the best base kernel (Single). This is consistent with most of the previous studies [5, 69]. Compared to the average kernel approach, we observe that the L1 -MKL algorithms have the worst performance when the number of training examples per class is small (n = 10, 20), but significantly outperform the average kernel approach when n = 30. This result explains the seemingly contradictory conclusions reported in the literature. 
When the number of training examples is insufficient to determine the appropriate kernel combination, it is better to assign all the base kernels equal weights. MKL becomes effective only when the number of training examples is large enough to determine the optimal kernel combination. Next, we compare the performance of L1 -MKL to that of Lp -MKLs. We observe that L1 -MKL performs worse than Lp -MKLs (p = 2, 4) when the number of training examples is small (i.e., n = 10, 20), but outperforms Lp -MKLs when n = 30. This result again explains why conflicting 39 Table 2.4: Classification results (MAP) for the Caltech 101 data set. We report the average values over five random splits and the associated standard deviation. Baseline Norm Single Average GMKL p=1 SimpleMKL p = 1 VSKL p=1 level-MKL p = 1 MKL-GL p=1 MKL-GL p=2 MKL-GL p=4 MKL-SIP p=1 MKL-SIP p=2 MKL-SIP p=4 MKL-SMO p = 2 MKL-SMO p = 4 Number of training instances per class 10 20 30 45.3 ± 0.9 55.2 ± 0.9 70.6 ± 0.9 59.0 ± 0.7 69.7 ± 0.6 77.2 ± 0.5 54.2 ± 1.1 64.1 ± 0.7 84.8 ± 0.7 53.6 ± 0.9 63.4 ± 0.6 84.6 ± 0.5 53.9 ± 0.9 64.0± 0.6 85.3 ± 0.5 54.7 ± 1.0 63.4 ± 0.6 84.4 ± 0.4 54.3 ± 1.0 64.7 ± 0.7 85.4 ± 0.4 60.3 ± 0.6 70.7 ± 1.0 80.0 ± 0.6 60.1 ± 0.7 70.7 ± 1.0 80.0 ± 0.6 53.8 ± 0.6 63.8 ± 0.9 83.9 ± 0.7 60.1 ± 0.6 70.7 ± 1.0 79.1 ± 0.6 59.4 ± 0.6 70.0 ± 1.0 77.5 ± 0.5 59.8 ± 0.5 69.7.0 ± 0.9 79.3 ± 0.9 59.6 ± 0.4 69.6 ± 0.7 79.0 ± 0.5 results were observed in different MKL studies in the literature. Compared to Lp -MKL, L1 -MKL gives a sparser solution for the kernel combination weights, leading to the elimination of irrelevant kernels. When the number of training examples is small, it is difficult to determine the subset of kernels that are irrelevant to a given task. As a result, the sparse solution obtained by L1 -MKL may be inaccurate, leading to a relatively lower classification accuracy than Lp -MKL. L1 -MKL becomes advantageous when the number of training examples is large enough to determine the subset of relevant kernels. We observe that there is no significant difference in the classification performance between different MKL optimization techniques. This is not surprising since they solve the same optimization problem. It is interesting to note that although different optimization algorithms converge to the same solution, they could behave very differently over iterations. In Figures 2.2, 2.3, and 2.4, we show how the classification performances of the L1 -MKL algorithms change over the iterations for three classes from Caltech101 data set. We observe that, • SimpleMKL converges in a smaller number of iterations compared to the other L1 -MKL 40 Table 2.5: Classification results (MAP) for the VOC 2007 data set. We report the average values over five random splits and the associated standard deviation. baseline Single Average L1 -MKL L2 -MKL Percentage of the samples used for training 1% 25% 50% 75% 23.4 ± 0.1 44.7 ± 0.8 48.6 ± 0.8 50.0 ± 0.8 21.9 ± 0.5 48.2 ± 0.8 54.5 ± 0.8 57.5 ± 0.8 23.5 ± 0.7 51.9 ± 0.4 57.4 ± 0.4 59.9 ± 0.9 22.7 ± 0.4 49.8 ± 0.2 57.3 ± 0.2 60.6± 0.5 algorithms. Note that convergence in a smaller number of iterations does not necessarily mean a shorter training time, as SimpleMKL takes significantly longer time to finish one iteration. • The classification performance of MKL-SIP fluctuates significantly over iterations. This is due to the greedy nature of MKL-SIP as it selects the most violated constraints at each iteration of optimization. 
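The numbers in these tables and figures are category-based mean average precision (MAP) scores. A minimal sketch of how such a score can be computed from per-class decision values is given below; it assumes scikit-learn's average_precision_score and treats each class independently, which is our reading of "category based" MAP, not a verbatim reproduction of the evaluation script.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def category_mean_average_precision(Y_true, scores):
    """Category-based MAP: mean over classes of the average precision obtained
    by ranking all test images with that class's decision values.

    Y_true : (n, m) binary matrix of ground-truth labels (1 = relevant).
    scores : (n, m) real-valued classifier outputs, one column per class.
    """
    aps = [average_precision_score(Y_true[:, k], scores[:, k])
           for k in range(Y_true.shape[1])
           if Y_true[:, k].any()]                   # skip classes with no positive test image
    return float(np.mean(aps))
```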
For simplicity, from now on, unless specified, we will only report the results of one representative method for both L1 -MKL (Level-MKL) and Lp -MKL (MKL-SIP, p = 2). Table 2.5 shows the classification results for the VOC 2007 data set with 1%, 25%, 50%, and 75% of images used for training. These results confirm the conclusions drawn from the Caltech 101 data set: MKL methods do not outperform the simple baseline (i.e., the best single kernel) when the number of training examples is small (e.g., 1%); the advantage of MKL is clear only when the number of training examples is sufficiently large. Finally, we compare in Table 2.6 the performance of MKL to that of the state-of-the-art methods for image categorization on the Caltech 101 and VOC 2007 data sets. For Caltech 101, we use the standard splitting formed by randomly selecting 30 training examples for each class, and for VOC 2007, we use the default partitioning. We observe that the L1 -MKL achieves similar classification performance as the state-of-the-art approaches for the Caltech 101 data set. However, for the VOC 2007 data set, the performance of MKL is significantly worse than the best ones [112, 113]. 41 Table 2.6: Comparison with the state-of-the-art performance for object classification on the Caltech 101 (measured by classification accuracy) and VOC 2007 data sets (measured by MAP). Caltech 101 (30 per class) This paper state-of-the-art AVG : 77.09 [5]: 84.3 L1 -MKL : 79.93 [110]: 81.9 L2 -MKL : 77.94 [111]: 80.0 VOC 2007 This paper state-of-the-art AVG: 55.4 [112]: 73.0 L1 -MKL: 57.2 [113]: 63.5 L2 -MKL: 57.4 [114]: 61.7 The gap in the classification performance is because object detection (localization) methods are utilized in [112, 113] to boost the recognition accuracy for the VOC 2007 data set but not in this dissertation. We also note that the authors of [114] get a better result by using only one strong and well-designed (Fisher vector) representation compared to the MKL results we report. Interested readers are referred to [114], which provides an empirical study that shows how the different steps of the BoW model can affect the classification results. Note that the performance of MKL techniques can be improved further by using the different and stronger options discussed in [114]. 2.5.4.2 Experiment 2: Number of Kernels vs. Classification Accuracy In this experiment, we examine the performance of MKL methods with increasing numbers of base kernels. To this end, we rank the kernels in the descending order of their weights computed by L1 -MKL, and measure the performance of MKL and baseline methods by adding kernels sequentially. The number of kernels is varied from 2 to 48 for the Caltech 101 data set and from 2 to 15 for the VOC 2007 data set. Figures 2.5 and 2.6 summarizes the classification performance of MKL and baseline methods as the number of kernels is increased. We observe that when the number of kernels is small, all the methods are able to improve their classification performance with increasing number of kernels. But, the performance of average kernel and L2 -MKL starts to 42 drop as more and more weak kernels (i.e., kernels with small weights computed by L1 -MKL) are added. In contrast, we observe a performance saturation for L1 -MKL after five to ten kernels have been added. We thus conclude that L1 -MKL is more resilient to the introduction of weak kernels than the other kernel combination methods. 
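The protocol used in this experiment can be summarized in a short sketch: rank the base kernels by their L1-MKL weights and re-evaluate each method as kernels are added in that order. The helper below is illustrative only; the evaluation routine, which would internally run MKL, the average kernel, or the single best kernel on the selected subset, is left abstract, and the function name is ours.

```python
import numpy as np

def incremental_kernel_curve(base_kernels, beta_l1, evaluate, sizes):
    """Experiment-2 style protocol: add kernels in descending order of their
    L1-MKL weights and record performance for each subset size.

    base_kernels : list of precomputed kernel matrices.
    beta_l1      : kernel weights learned by L1-MKL on the full kernel set.
    evaluate     : callable mapping a list of kernels to a score (e.g., MAP).
    sizes        : iterable of subset sizes to evaluate (e.g., range(2, 49)).
    """
    order = np.argsort(beta_l1)[::-1]               # strongest kernels first
    curve = []
    for k in sizes:
        subset = [base_kernels[j] for j in order[:k]]
        curve.append((k, evaluate(subset)))
    return curve
```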
2.5.5 Computational Efficiency To evaluate the learning efficiency of MKL algorithms, we report training time for the experiments with different numbers of training examples and base kernels. Many studies on the computational efficiency of MKL algorithms focused on the convergence rate (i.e., number of iterations) [52], which is not necessarily the deciding factor in determining the training time. For instance, according to Figure 2.2, although SimpleMKL requires a smaller number of iterations to obtain the optimal solution than the other L1 -MKL approaches, it is significantly slower in terms of running time than the other algorithms because of its high computational cost per iteration. Thus, besides the training time, we also examine the sparseness of the kernel coefficients, which can significantly affect the efficiency of both training and testing. 2.5.5.1 Experiment 4: Evaluation of Training Time We first examine how the number of training examples affects the training time of the wrapper methods. Tables 2.8 and 2.9 summarize the training time of different MKL algorithms for the Caltech 101 and VOC 2007 data sets, respectively. We also include in the table the number of iterations and the time for computing the combined kernel matrices. We did not include the time for computing kernel matrices because it is shared by all the methods. We draw the following observations from Tables 2.8 and 2.9: • The Lp -MKL methods require a considerably smaller number of iterations than the L1 -MKL methods, indicating they are computationally more efficient. This is not surprising because 43 Lp -MKL employs a smooth objective function that leads to more efficient optimization [85]. • Since a majority of the training times is spent on computing combined kernel matrices, the time difference between different L1 -MKL methods is mainly due to the sparseness of their intermediate solutions. Since MKL-SIP yields sparse solutions throughout its optimization process, it is the most efficient wrapper algorithm for MKL. Although SimpleMKL converges in a smaller number of iterations than the other L1 -MKL methods, it is not as efficient as the MKL-SIP method because it does not generate sparse intermediate solutions. In the second set of experiments, we evaluate the training time as a function of the number of base kernels. For both the Caltech 101 and VOC 2007 data sets, we choose 15 kernels with the best classification accuracy, and create 15, 30, and 60 kernels by simply varying the kernel bandwidth (i.e., from 1 times, to 1.5 and 2 times the average χ2 distance). The number of training examples is set to be 30 per class for Caltech 101 and 50% of images are used for training for VOC 2007. Tables 2.10 and 2.11 summarize for different MKL algorithms, the training time, the number of iterations, and the time for computing the combined kernel matrices. Overall, we observe that Lp -MKL is still more efficient than L1 -MKL, even when the number of base kernels is large. But the gap in the training time between L1 -MKL and Lp -MKL becomes significantly smaller for the MKL-SIP method when the number of combined kernels is large. In fact, for the Caltech 101 data set with 108 base kernels, MKL-SIP for L1 -MKL is significantly more efficient than MKL-SIP for Lp -MKL (p > 1). This is because of the sparse solution obtained by MKL-SIP for L1 -MKL, which leads to less time on computing the combined kernels than MKL-SIP for Lp -MKL, as indicated in Tables 2.10 and 2.11. 
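The sparseness that drives much of this difference in training and testing cost can be quantified directly from the learned weights. The small helper below counts the "active" kernels in the sense used in Experiment 5; the relative threshold is our illustrative choice rather than a value taken from the cited implementations.

```python
import numpy as np

def active_kernel_count(beta, threshold=1e-3):
    """Number of active kernels, i.e., base kernels whose learned weight exceeds
    a small (relative) threshold. L1-MKL typically keeps far fewer kernels active
    than Lp-MKL (p > 1), which spreads weight over all of them."""
    beta = np.asarray(beta)
    return int(np.sum(beta > threshold * beta.max()))
```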
As discussed in Section 2.5.3, we cannot compare MKL-SMO directly with the other baselines in terms of training times since they are not coded in the same platform. Instead, we use the code provided by the authors of MKL-SMO [70] to compare it to the C++ implementation of MKL-SIP, the fastest wrapper approach, which is available within the Shogun package [115]. We fix p = 2, 44 Table 2.7: Comparison of training time between MKL-SMO and MKL-SIP Caltech 101 MKL-SIP MKL-SMO VOC 2007 MKL-SIP MKL-SMO Caltech 101 MKL-SIP MKL-SMO VOC 2007 MKL-SIP MKL-SMO Number of training samples n = 10 n = 20 n = 30 3.6 ±0.2 6.5± 0.3 11.8 ± 0.7 0.2 ±0.1 2.3 ± 0.2 3.8 ± 0.5 25% 15.5 ± 1.6 3.5 ± 0.7 50% 145.6 ± 3.9 14.2± 1.8 75% 360.7 ± 8.4 33.1± 3.0 Number of base kernels K = 48 K = 63 K = 108 6.5 ± 0.3 13.6 ± 2.9 19.8 ± 3.4 2.3 ± 0.2 3.2± 0.8 6.3± 1.0 K = 15 K = 30 K = 75 145.6 ± 3.9 542.0 ± 32.8 1412.1 ± 63.4 14.2 ± 1.8 29.1± 2.8 77.8± 10.3 vary the number of training samples for a fixed number of kernels (48 for Caltech 101 and 15 for VOC 2007) and the number of base kernels for a fixed number of samples (2,040 for Caltech 101 and 5,011 for VOC 2007). Table 2.7 shows that MKL-SMO is significantly faster than MKLSIP on both data sets, demonstrating the advantage of a well-designed direct MKL optimization method against the wrapper approaches for Lp -MKL. We finally note that MKL-SMO cannot be applied to L1 -MKL which often demonstrates better performance with a modest number of training examples. 2.5.5.2 Experiment 5: Evaluation of Sparseness We evaluate the sparseness of MKL algorithms by examining the sparsity of the solution for kernel combination coefficients. In Figures 2.7 and 2.8, we show how the size of active kernel set (i.e., kernels with non-zero combination weights) changes over the iterations for MKL-SIP with three types of regularizers: L1 -MKL, L2 -MKL and L4 -MKL. Note that it is difficult to distinguish the 45 results of L2 -MKL and L4 -MKL from each other as they are identical. As expected, L1 -MKL method produces significantly sparser solutions than Lp -MKL. As a result, although Lp -MKL is more efficient for training because it takes a smaller number of iterations to train Lp -MKL than L1 -MKL, we expect L1 -MKL to be computationally more efficient for testing than Lp -MKL as most of the base kernels are eliminated and need not to be considered. 2.5.6 Large-scale MKL on ImageNet To evaluate the scalability of MKL, we perform experiments on the subset of ImageNet consisting of 81, 738 images. Figure 3.10 shows the classification performance of MKL and baseline methods with the number of training images per class varied in powers of 2 (21 , 22 , ..., 211 ). Similar to the experimental results for Caltech 101 and VOC 2007, we observed that the difference between L1 MKL and the average kernel method is significant only when the number of training examples per class is sufficiently large (i.e. ≥ 16). We also observed that the difference between L1 -MKL and the average kernel method starts to diminish when the number of training examples is increased over 256 per class. We believe that the diminishing gap between MKL and the average kernel method with increasing number of training examples can be attributed to the fact that all the 10 base kernels constructed for the ImageNet data set are strong kernels and provide informative features for image categorization. 
This is reflected in the kernel combination weights learned by the MKL method: most of the base kernels receive significant non-zero weights. Figure 2.10 shows the running time of MKL for a varied number of training examples. Similar to the experimental results for Caltech 101 and VOC 2007, we observe that L2-MKL is significantly more efficient than L1-MKL. We also observe that the running time of both L1-MKL and L2-MKL increases almost quadratically with the size of the training data, making it difficult to scale to millions of training examples. We thus conclude that although MKL is effective in combining multiple image representations for image categorization, the scalability of MKL algorithms remains an open problem.

2.6 Summary and Conclusions

In this chapter, we have reviewed different formulations of multiple kernel learning and the related optimization algorithms, with an emphasis on the application to image categorization. We highlighted the conflicting conclusions drawn by published studies on the empirical performance of different MKL algorithms and attempted to resolve these inconsistencies by examining the experimental setups used in those studies. Through our extensive experiments on three standard data sets used for image categorization, we are able to draw the following conclusions:

• Overall, MKL is significantly more effective than the simple baselines for kernel combination (i.e., selecting the best kernel by cross validation or taking the average of multiple kernels), particularly when a large number of base kernels is available and the number of training examples is sufficiently large. However, MKL is not recommended for image categorization when the base kernels are strong and the number of training examples is sufficient to learn a reliable predictor for each base kernel.

• Compared to Lp-MKL, L1-MKL is overall more effective for image categorization and is significantly more robust to weak kernels with low classification performance.

• MKL-SMO, which is a direct optimization technique rather than a wrapper method, is the fastest MKL baseline. However, it does not handle the L1-MKL formulation.

• Among the various algorithms proposed for L1-MKL, MKL-SIP is overall the most efficient for image categorization because it produces sparse intermediate solutions throughout the optimization process.

• Lp-MKL is significantly more efficient than L1-MKL because it converges in a significantly smaller number of iterations. However, neither L1-MKL nor Lp-MKL scales well to very large data sets.

• L1-MKL can be more efficient than Lp-MKL in terms of prediction time. This is because L1-MKL generates sparse solutions and therefore uses only a small portion of the base kernels for prediction.

In summary, we conclude that MKL is an extremely useful tool for image categorization because it provides a principled way to combine the strengths of different image representations. Although MKL methods have demonstrated significant success for image categorization, there is still room for improvement. One of the most important directions for improving the accuracy of MKL methods is to develop MKL algorithms that address the needs of multi-label data, such as image categorization data sets. To this end, we propose a multiple kernel multi-label ranking method in Chapter 6. It is also critical to improve the overall computational efficiency of MKL.
The existing algorithms for MKL do not scale to large data sets with millions of images and thousands of classes. In the next chapter, we discuss our efforts on reducing the computational load of MKL for large-scale multi-label data sets.

Table 2.8: Total training time (seconds), number of iterations, and total time spent on combining the base kernels (seconds) for different MKL algorithms vs. number of training examples for Caltech 101.

10 training instances per class
Baseline         training         #iter          KerComb
GMKL-L1          34.6 ± 8.6       38.4 ± 2.0     27.9 ± 7.7
SimpleMKL-L1     55.7 ± 25.3      17.2 ± 6.8     46.1 ± 22.0
VSKL-L1          14.1 ± 2.3       38.3 ± 4.3     11.1 ± 1.7
MKL-GL-L1        21.9 ± 0.8       40.0 ± 0.0     19.5 ± 0.8
MKL-GL-L2        5.3 ± 0.6        8.8 ± 1.0      4.8 ± 0.6
MKL-GL-L4        3.5 ± 0.2        5.9 ± 0.4      3.2 ± 0.2
MKL-Level-L1     8.0 ± 2.3        33.0 ± 9.5     5.5 ± 1.4
MKL-SIP-L1       5.4 ± 0.9        39.4 ± 2.6     2.1 ± 0.3
MKL-SIP-L2       3.8 ± 1.2        5.6 ± 0.9      2.4 ± 1.1
MKL-SIP-L4       3.3 ± 0.6        4.4 ± 0.5      1.8 ± 0.6

30 training instances per class
Baseline         training         #iter          KerComb
GMKL-L1          256.7 ± 47.7     38.6 ± 1.8     212.5 ± 42.3
SimpleMKL-L1     585.6 ± 204.7    19.0 ± 7.5     494.4 ± 174.7
VSKL-L1          121.9 ± 22.4     36.6 ± 5.1     103.5 ± 17.7
MKL-GL-L1        197.1 ± 9.1      39.8 ± 1.0     178.3 ± 8.5
MKL-GL-L2        50.8 ± 5.6       9.3 ± 1.0      46.3 ± 5.2
MKL-GL-L4        32.5 ± 1.6       5.9 ± 0.3      29.6 ± 1.5
MKL-Level-L1     63.3 ± 22.1      27.5 ± 11.1    47.9 ± 14.9
MKL-SIP-L1       44.3 ± 6.1       39.7 ± 2.9     23.2 ± 2.7
MKL-SIP-L2       30.4 ± 4.2       6.3 ± 1.0      25.2 ± 3.9
MKL-SIP-L4       22.6 ± 2.6       4.7 ± 0.5      18.2 ± 2.1

Figure 2.1: A summary of representative MKL optimization schemes, grouping MKL algorithms into batch methods and online methods; the batch methods include direct methods (operating on the dual or the primal, which optimize the SVM and MKL parameters together) and wrapper methods based on semi-infinite programming (SIP), subgradient descent (SD), the level method, mirror descent (MD), or alternating updates, each with its own advantages and disadvantages.

Figure 2.2: Mean average precision (MAP) scores of different L1-MKL methods vs. number of iterations for the anchor class of the Caltech 101 data set.

Figure 2.3: Mean average precision (MAP) scores of different L1-MKL methods vs. number of iterations for the bonsai class of the Caltech 101 data set.

Figure 2.4: Mean average precision (MAP) scores of different L1-MKL methods vs. number of iterations for the camera class of the Caltech 101 data set.

Figure 2.5: The change in MAP score with respect to the number of base kernels for the Caltech 101 data set.
Figure 2.6: The change in MAP score with respect to the number of base kernels for the VOC 2007 data set.

Figure 2.7: Number of active kernels learned by the MKL-SIP algorithm vs. number of iterations for the Caltech 101 data set. Note that it is difficult to distinguish the results of L2-MKL and L4-MKL from each other as they are identical.

Figure 2.8: Number of active kernels learned by the MKL-SIP algorithm vs. number of iterations for the VOC 2007 data set. Note that it is difficult to distinguish the results of L2-MKL and L4-MKL from each other as they are identical.

Figure 2.9: Classification performance for different training set sizes for the ImageNet data set.

Figure 2.10: Training times for L1-MKL and L2-MKL on different training set sizes for the ImageNet data set.

Table 2.9: Total training time (seconds), number of iterations, and total time spent on combining the base kernels (seconds) for different MKL algorithms vs. number of training examples for the VOC 2007 data set.

2,500 training instances
Baseline         training          #iter          KerComb
GMKL-L1          117.6 ± 16.3      39.0 ± 0.0     67.4 ± 7.7
SimpleMKL-L1     175.1 ± 77.4      16.7 ± 7.3     112.9 ± 48.3
VSKL-L1          45.2 ± 6.1        37.0 ± 3.4     25.3 ± 2.2
MKL-GL-L1        62.6 ± 4.7        40.0 ± 0.0     43.5 ± 0.6
MKL-GL-L2        14.5 ± 1.3        9.3 ± 0.6      10.2 ± 0.7
MKL-GL-L4        8.0 ± 0.8         5.2 ± 0.4      5.6 ± 0.5
MKL-Level-L1     40.1 ± 10.8       35.0 ± 7.7     20.2 ± 4.0
MKL-SIP-L1       34.6 ± 6.8        39.9 ± 0.5     12.7 ± 1.4
MKL-SIP-L2       9.6 ± 1.9         5.7 ± 0.5      4.9 ± 0.4
MKL-SIP-L4       7.1 ± 1.1         4.0 ± 0.0      3.5 ± 0.1

7,500 training instances
Baseline         training          #iter          KerComb
GMKL-L1          1133.2 ± 252.8    39.0 ± 0.0     646.9 ± 98.2
SimpleMKL-L1     1671.3 ± 919.1    16.8 ± 6.4     1019.7 ± 424.8
VSKL-L1          330.0 ± 49.2      29.9 ± 3.8     190.9 ± 22.8
MKL-GL-L1        549.2 ± 79.8      40.0 ± 0.0     373.8 ± 4.2
MKL-GL-L2        130.1 ± 17.7      9.5 ± 0.5      89.4 ± 6.1
MKL-GL-L4        74.9 ± 11.1       5.3 ± 0.5      51.2 ± 4.5
MKL-Level-L1     297.3 ± 95.2      31.1 ± 8.1     151.9 ± 31.0
MKL-SIP-L1       309.0 ± 94.5      40.0 ± 0.0     117.0 ± 6.4
MKL-SIP-L2       84.3 ± 24.5       6.1 ± 0.3      47.3 ± 3.0
MKL-SIP-L4       56.4 ± 14.7       4.1 ± 0.3      31.5 ± 2.2

Table 2.10: Total training time (seconds), number of iterations, and total time spent on combining the base kernels (seconds) for different MKL algorithms vs. number of base kernels for the Caltech 101 data set.
63 base kernels
Baseline         training          #iter          KerComb
GMKL-L1          718.1 ± 169.8     38.8 ± 0.8     625.3 ± 152.9
SimpleMKL-L1     1255.2 ± 350.9    17.3 ± 6.5     1047.6 ± 285.8
VSKL-L1          398.1 ± 123.7     36.3 ± 5.2     345.6 ± 101.5
MKL-GL-L1        397.1 ± 30.0      39.8 ± 1.0     351.9 ± 26.7
MKL-GL-L2        118.8 ± 14.7      9.3 ± 1.0      108.5 ± 13.7
MKL-GL-L4        84.6 ± 5.8        6.0 ± 0.0      77.3 ± 4.8
MKL-Level-L1     204.1 ± 75.7      27.8 ± 10.4    167.2 ± 56.1
MKL-SIP-L1       147.8 ± 29.8      39.8 ± 2.4     85.3 ± 15.0
MKL-SIP-L2       114.7 ± 36.7      7.9 ± 0.7      102.7 ± 33.6
MKL-SIP-L4       111.1 ± 38.8      7.5 ± 0.8      98.3 ± 34.5

108 base kernels
Baseline         training          #iter          KerComb
GMKL-L1          1170.5 ± 208.7    38.9 ± 0.8     1049.2 ± 190.7
SimpleMKL-L1     2206.3 ± 580.1    17.2 ± 6.4     1960.3 ± 503.5
VSKL-L1          569.9 ± 160.3     35.6 ± 5.9     491.8 ± 131.2
MKL-GL-L1        604.6 ± 69.9      39.6 ± 1.6     546.6 ± 66.0
MKL-GL-L2        226.3 ± 24.8      9.5 ± 1.0      212.0 ± 23.6
MKL-GL-L4        169.1 ± 16.0      6.0 ± 0.1      158.2 ± 14.5
MKL-Level-L1     405.8 ± 152.7     29.5 ± 9.5     343.7 ± 121.3
MKL-SIP-L1       192.1 ± 41.3      39.9 ± 0.9     110.1 ± 18.1
MKL-SIP-L2       634.1 ± 107.2     6.8 ± 1.3      582.1 ± 106.3
MKL-SIP-L4       407.2 ± 80.2      4.6 ± 0.6      368.4 ± 67.9

Table 2.11: Total training time (seconds), number of iterations, and total time spent on combining the base kernels (seconds) for different MKL algorithms vs. number of base kernels for the VOC 2007 data set.

30 base kernels
Baseline         training          #iter          KerComb
GMKL-L1          1816.8 ± 405.8    37.8 ± 5.4     1186.9 ± 270.4
SimpleMKL-L1     2335.3 ± 991.9    11.2 ± 7.1     1581.6 ± 626.4
VSKL-L1          880.2 ± 128.5     30.6 ± 3.8     525.5 ± 75.3
MKL-GL-L1        853.5 ± 206.1     40.0 ± 0.0     561.8 ± 107.3
MKL-GL-L2        282.4 ± 64.2      9.6 ± 0.5      218.2 ± 46.3
MKL-GL-L4        190.1 ± 23.9      6.0 ± 0.0      147.4 ± 11.0
MKL-Level-L1     665.4 ± 114.7     36.8 ± 5.1     404.7 ± 40.2
MKL-SIP-L1       460.0 ± 135.5     40.0 ± 0.0     170.6 ± 23.1
MKL-SIP-L2       240.8 ± 62.5      8.7 ± 1.6      154.5 ± 43.5
MKL-SIP-L4       170.1 ± 16.5      6.2 ± 0.4      115.1 ± 15.4

75 base kernels
Baseline         training          #iter          KerComb
GMKL-L1          3975.3 ± 890.0    34.2 ± 8.8     3072.5 ± 724.5
SimpleMKL-L1     3416.3 ± 1299.7   8.3 ± 7.8      2776.4 ± 885.7
VSKL-L1          1587.9 ± 238.8    29.4 ± 3.7     909.3 ± 122.2
MKL-GL-L1        1500.4 ± 239.4    40.0 ± 0.0     1043.8 ± 87.6
MKL-GL-L2        629.5 ± 84.0      9.8 ± 0.4      520.4 ± 47.7
MKL-GL-L4        346.2 ± 45.3      6.0 ± 0.0      286.2 ± 31.9
MKL-Level-L1     1136.8 ± 328.9    36.7 ± 3.1     702.2 ± 177.7
MKL-SIP-L1       686.8 ± 262.9     40.0 ± 0.0     228.5 ± 46.0
MKL-SIP-L2       413.9 ± 258.1     3.8 ± 1.7      302.2 ± 135.7
MKL-SIP-L4       566.4 ± 141.9     5.0 ± 0.0      424.2 ± 81.5

Chapter 3

Multi-label Multiple Kernel Learning by Stochastic Approximation

3.1 Introduction

In Chapter 2, we provided a detailed review of MKL and a set of empirical analyses on image categorization data sets to demonstrate the effectiveness of MKL. The focus of Chapter 2 was the MKL methods for the binary classification problem, which constitutes the majority of the MKL literature. The application of MKL to multi-labeled data, such as image categorization data, is mostly limited to the use of the one-vs-all framework for MKL, which has two main drawbacks. First, the one-vs-all framework requires training an MKL algorithm separately for each class. Considering that there are thousands of training instances and hundreds of classes in recent image categorization data sets, training a one-vs-all MKL solver would be computationally demanding. Second, the one-vs-all framework cannot exploit label correlations, since the MKL solvers for the different classes are operated independently, meaning that no exchange of information between classes is possible.
It has been shown in many multi-label learning studies that learning independent classifiers for each class gives suboptimal performance compared to direct approaches which consider all classes together in the learning process. In this chapter, we present an efficient algorithm for multi-label multiple kernel learning (ML-MKL). We assume that all the classes under consideration share the same combination of kernel functions, and the objective is to find the optimal kernel combination that benefits all the classes. Although several algorithms have been developed for ML-MKL, their computational cost is linear in the number of classes; therefore, they do not scale well when the number of classes increases, a challenge frequently encountered in image categorization. We address this computational challenge by developing a framework for ML-MKL that combines a worst-case analysis with stochastic approximation. Our analysis shows that the complexity of our algorithm is O(m^{1/3}√(ln m)), where m is the number of classes.

This chapter is organized as follows: in Section 3.2, we provide a brief literature review on MKL for multi-class and multi-label learning. Next, we introduce our multi-label MKL formulation and give an efficient algorithm to solve it. A convergence analysis for the proposed algorithm is provided in Section 3.3.2. In Section 3.4, we provide empirical analyses that demonstrate the strength of the proposed framework on benchmark data sets. We end the chapter with concluding remarks and future directions in Section 3.5.

3.2 Previous Work

There is a large body of literature on MKL, and we provided a detailed review of binary MKL methods in Chapter 2. Although most efforts in MKL focus on binary classification problems, several studies have attempted to extend MKL to multi-class and multi-label learning [5, 68, 87, 116, 117]. Even though studies show that MKL for multi-class and multi-label learning can result in significant improvement in classification accuracy, the computational cost is often linear in the number of classes, making it computationally expensive when dealing with a large number of classes. Since most image categorization problems involve many image classes, whose number might go up to hundreds or sometimes even thousands, it is important to develop an efficient learning algorithm for multi-class and multi-label MKL that is sublinear in the number of classes.

In multi-class and multi-label learning, each instance can be simultaneously assigned to multiple classes. A straightforward approach for multi-label MKL (ML-MKL) is to decompose a multi-label learning problem into a number of binary classification tasks using either the one-vs-all or the one-vs-one approach. Varma et al. discussed and compared one-vs-all and one-vs-one schemes for MKL [69]. Tang et al. [116] evaluated three different strategies for multi-label MKL based on the one-vs-all approach: (i) learning one common kernel combination shared by all classes, (ii) learning a different kernel combination for each class independently, and (iii) a hybrid approach that allows partial sharing of kernel combinations among different classes. Based on their empirical study, they concluded that learning one common kernel combination shared by all classes not only is computationally efficient but also yields classification performance that is comparable to choosing different kernel combinations for different classes.
One drawback of the decomposition based approaches for multi-label learning is that they are unable to take into account the dependency between different classes or the correlation between data points. To overcome this drawback, Ji et al. [68] proposed to encode the instance-class correlation into a hypergraph, which is then used to embed the multi-label data into a lower-dimensional space. Zien et al. proposed MKL for joint feature maps Φ(x, y), learning a single multi-class classification function f_{w,b}(x, y) = ⟨w, Φ(x, y)⟩ + b from the training data [87]. They formulated the problem via several optimization methods, including quadratically constrained quadratic programming (QCQP) and SILP. Mei proposed a multi-label multi-kernel transfer learning method, which uses a one-vs-all classification scheme, for protein subcellular localization [118]. Gehler et al. proposed a two-step boosting approach that requires solving SVMs separately for each kernel, similar to wrapper approaches [43]. The method they presented learns nonlinear kernel combinations, which yield promising classification performance, but this also leads to a high computational load. In another nonlinear MKL method [5], group information between the classes has been incorporated into the multiple kernel learning framework (GSMKL) in order to improve the classification accuracy. Making use of class dependencies has been shown to improve the accuracy of multi-label learning [13], and GSMKL also benefits from this to yield improved classification performance, at the price of an increased computational load. In addition to the high computational load, another limitation of this approach is that it assumes that there is a group structure within the classes, bringing the need for effective tools to find the group structure (if it exists) within the classes.

In this chapter, we develop an efficient algorithm for Multi-Label MKL (ML-MKL) that assumes all the classifiers share the same linear combination of kernels. We note that although this assumption significantly constrains the choice of kernel functions for different classes, our empirical studies with image categorization show that the classification performance is not negatively affected. A naive implementation of ML-MKL with a shared kernel combination will lead to a computational cost linear in the number of classes. We alleviate this computational challenge by exploring the idea of combining worst-case analysis with stochastic approximation. Our analysis reveals that the convergence rate of the proposed algorithm is O(m^{1/3}√(ln m)), which is significantly better than a linear dependence on m, where m is the number of classes. Our empirical studies show that the proposed MKL algorithm yields similar performance as the state-of-the-art algorithms for ML-MKL, but with a significantly shorter running time, making it suitable for multi-label learning with a large number of classes.

3.3 Multi-label Multiple Kernel Learning (ML-MKL)

In this chapter, we use the same notation as in Chapter 2, with only a change in the notation of the label vector y, since the focus of this chapter is multi-label MKL. We introduce β = (β_1, . . . , β_s), a probability distribution, for combining the base kernels. We denote by K(β) = \sum_{j=1}^s β_j K_j the combined kernel matrix. We use the domain Δ_1 for the probability distribution β, i.e., Δ_1 = {β ∈ R_+^s : β^⊤ 1 = 1}. Our goal is to learn from the training examples the optimal kernel combination β for all m classes.
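To make the notation concrete, the following is a minimal Python illustration, under the assumption that the base kernels are available as precomputed n × n matrices, of the shared kernel combination K(β) used throughout this chapter; the function name and the uniform-weight example are illustrative and not taken from the thesis implementation.

import numpy as np

def combine_kernels(base_kernels, beta):
    """Form the combined kernel K(beta) = sum_j beta_j K_j from precomputed
    base kernel matrices, checking that beta lies on the simplex Delta_1."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0), "beta must be in Delta_1"
    return sum(b * K for b, K in zip(beta, base_kernels))

# Example: uniform weights beta = 1/s correspond to the average-kernel baseline of Chapter 2.
# s = 3; base_kernels = [np.eye(4)] * s; K = combine_kernels(base_kernels, np.ones(s) / s)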
The simplest approach for multi-label multiple kernel learning with a shared kernel combination is to find the optimal kernel combination β by minimizing the sum of the regularized loss functions of all m classes, leading to the following optimization problem:

\min_{\beta \in \Delta_1} \min_{\{f_k \in \mathcal{H}(\beta)\}_{k=1}^m} \sum_{k=1}^m H_k = \sum_{k=1}^m \left[ \frac{1}{2}\|f_k\|_{\mathcal{H}(\beta)}^2 + \sum_{i=1}^n \ell\big(y_k^i f_k(x_i)\big) \right],   (3.1)

where ℓ(z) = max(0, 1 − z) and H(β) is a Reproducing Kernel Hilbert Space endowed with the kernel κ(x, x′; β) = \sum_{j=1}^s β_j κ_j(x, x′). H_k is the regularized loss function for the kth class. It is straightforward to verify the following dual problem of Eq. (3.1):

\min_{\beta \in \Delta_1} \max_{\alpha \in Q_1} L(\beta, \alpha) = \sum_{k=1}^m \left[ [\alpha_k]^\top \mathbf{1} - \frac{1}{2} (\alpha_k \circ y_k)^\top K(\beta) (\alpha_k \circ y_k) \right],   (3.2)

where Q_1 = {α = (α_1, . . . , α_m) : α_k ∈ [0, C]^n, k = 1, . . . , m}. To solve the optimization problem in Eq. (3.2), we can view it as a minimization problem, i.e., \min_{\beta \in \Delta_1} A(\beta), where A(β) = \max_{\alpha \in Q_1} L(\beta, \alpha). We then follow the subgradient descent approach in [53] and compute the gradient of A(β) as

\partial_{\beta_j} A(\beta) = -\frac{1}{2} \sum_{k=1}^m \big(\alpha_k(\beta) \circ y_k\big)^\top K_j \big(\alpha_k(\beta) \circ y_k\big),

where \alpha_k(\beta) = \arg\max_{\alpha \in [0,C]^n} \alpha^\top \mathbf{1} - \frac{1}{2}(\alpha \circ y_k)^\top K(\beta)(\alpha \circ y_k). We refer to this approach as multi-label multiple kernel learning by sum, or ML-MKL-Sum. Note that this approach is similar to the one proposed in [116]. The main computational problem with ML-MKL-Sum is that, by treating every class equally, each iteration of subgradient descent requires solving m kernel SVMs, making it unscalable to a very large number of classes. Below we present a formulation for multi-label MKL whose computational cost is sublinear in the number of classes.

3.3.1 A Minimax Framework for Multi-label MKL

In order to alleviate the computational difficulty arising from a large number of classes, we search for the combined kernel matrix K(β) that minimizes the worst classification error among the m classes, i.e.,

\min_{\beta \in \Delta_1} \min_{\{f_k \in \mathcal{H}(\beta)\}_{k=1}^m} \max_{1 \le k \le m} H_k.   (3.3)

Eq. (3.3) differs from Eq. (3.1) in that it replaces \sum_k H_k with \max_{1 \le k \le m} H_k. The main computational advantage of using \max_k H_k instead of \sum_k H_k is that, by using an appropriately designed method, we may be able to identify the most difficult class, i.e., the class that yields the worst classification performance, in a few iterations, and spend most of the computational cycles on learning the optimal kernel combination for that most difficult class. In this way, we are able to achieve a running time that is sublinear in the number of classes. Below, we present an optimization strategy for Eq. (3.3) based on the idea of stochastic approximation.

A direct approach is to solve the optimization problem in Eq. (3.3) by its dual form. It is straightforward to show that the dual problem of Eq. (3.3) is Eq. (3.4) (see Proposition 4 in Section A.3 for the proof):

\min_{\beta \in \Delta_1} \max_{\rho \in B} L(\beta, \rho) = \sum_{k=1}^m [\rho_k]^\top \mathbf{1} - \frac{1}{2} \left( \sum_{k=1}^m \left[ (\rho_k \circ y_k)^\top K(\beta) (\rho_k \circ y_k) \right]^{1/2} \right)^2,   (3.4)

where

B = \left\{ (\rho_1, \ldots, \rho_m) : \rho_k \in \mathbb{R}_+^n, \; \rho_k \in [0, C\lambda_k]^n, \; k = 1, \ldots, m, \; \text{s.t.} \; \sum_{k=1}^m \lambda_k = 1 \right\}.

The challenge in solving Eq. (3.4) is that the solutions {ρ_1, . . . , ρ_m} in the domain B are correlated with each other, making it impossible to solve each ρ_k independently by an off-the-shelf SVM solver. Although a gradient descent approach can be developed for optimizing Eq. (3.4), it is unable to exploit the sparse structure in ρ_k, making it less efficient than state-of-the-art SVM solvers.
In order to effectively harness the power of off-the-shelf SVM solvers, we rewrite Eq. (3.3) as follows:

\min_{\beta \in \Delta_1} \max_{\gamma \in \Gamma} L(\beta, \gamma) = \max_{\alpha \in Q_1} \sum_{k=1}^m \gamma_k \left[ \alpha_k^\top \mathbf{1} - \frac{1}{2} (\alpha_k \circ y_k)^\top K(\beta) (\alpha_k \circ y_k) \right],   (3.5)

where Γ = {(γ_1, . . . , γ_m) ∈ R_+^m : γ^⊤ 1 = 1}. In Eq. (3.5), we replace \max_{1 \le k \le m} with \max_{\gamma \in \Gamma}. The advantage of using Eq. (3.5) is that we can resort to an SVM solver to efficiently find α_k for a given combination of kernels K(β). Given Eq. (3.5), we develop a subgradient descent approach for solving the optimization problem. In particular, in each iteration of subgradient descent, we compute the gradients of L(β, γ) with respect to β and γ as follows:

\nabla_{\beta_j} L(\beta, \gamma) = -\frac{1}{2} \sum_{k=1}^m \gamma_k (\alpha_k \circ y_k)^\top K_j (\alpha_k \circ y_k), \qquad
\nabla_{\gamma_k} L(\beta, \gamma) = [\alpha_k]^\top \mathbf{1} - \frac{1}{2} (\alpha_k \circ y_k)^\top K(\beta) (\alpha_k \circ y_k),   (3.6)

where \alpha_k = \arg\max_{\alpha \in [0,C]^n} \alpha^\top \mathbf{1} - (\alpha \circ y_k)^\top K(\beta)(\alpha \circ y_k)/2, i.e., an SVM solution for the combined kernel K(β). Following the mirror prox descent method [119], we define the potential functions \Phi_\beta = \frac{1}{\eta_\beta} \sum_{j=1}^s \beta_j \ln \beta_j for β and \Phi_\gamma = \frac{1}{\eta_\gamma} \sum_{k=1}^m \gamma_k \ln \gamma_k for γ, and have the following equations for updating β^t and γ^t:

\beta_j^{t+1} = \frac{\beta_j^t}{Z_\beta^t} \exp\big(-\eta_\beta \nabla_{\beta_j} L(\beta^t, \gamma^t)\big), \qquad
\gamma_k^{t+1} = \frac{\gamma_k^t}{Z_\gamma^t} \exp\big(\eta_\gamma \nabla_{\gamma_k} L(\beta^t, \gamma^t)\big),   (3.7)

where Z_\beta^t and Z_\gamma^t are normalization factors that ensure [\beta^{t+1}]^\top \mathbf{1} = [\gamma^{t+1}]^\top \mathbf{1} = 1, and \eta_\beta > 0 and \eta_\gamma > 0 are the step sizes for optimizing β and γ, respectively.

Unfortunately, the algorithm described above shares the same shortcoming as the other approaches for multi-label multiple kernel learning: it requires solving m SVM problems in each iteration; therefore, its computational complexity is linear in the number of classes. To alleviate this problem, we modify the above algorithm by introducing a stochastic approximation method. In particular, in each iteration t, instead of computing the full gradients, which requires solving m SVMs, we sample one classification task according to the multinomial distribution Multi(γ_1^t, . . . , γ_m^t). Let a_t be the index of the sampled classification task. Using the sampled task a_t, we estimate the gradients of L(β, γ) with respect to β_j and γ_k, denoted by g_j^β(β^t, γ^t) and g_k^γ(β^t, γ^t), as follows:

g_j^\beta(\beta^t, \gamma^t) = -\frac{1}{2} (\alpha_{a_t} \circ y_{a_t})^\top K_j (\alpha_{a_t} \circ y_{a_t}),   (3.8)

g_k^\gamma(\beta^t, \gamma^t) = \begin{cases} 0, & k \ne a_t, \\ \frac{1}{\gamma_k}\left[ \alpha_k^\top \mathbf{1} - \frac{1}{2} (\alpha_k \circ y_k)^\top K(\beta) (\alpha_k \circ y_k) \right], & k = a_t. \end{cases}   (3.9)

The computation of g_j^β(β^t, γ^t) and g_k^γ(β^t, γ^t) only requires α_{a_t}; therefore, it only needs to solve one SVM problem, instead of m SVMs. The key property of the estimated gradients in Eqs. (3.8) and (3.9) is that their expectations are equal to the true gradients, as summarized by Proposition 1. This property is the key to the correctness of our algorithm.

Proposition 1. We have E_t[g_j^β(β^t, γ^t)] = ∇_{β_j} L(β^t, γ^t) and E_t[g_k^γ(β^t, γ^t)] = ∇_{γ_k} L(β^t, γ^t), where E_t[·] stands for the expectation over the randomly sampled task a_t.

Given the estimated gradients, we follow Eq. (A.12) for updating β and γ in each iteration. Since g_k^γ(β^t, γ^t) is proportional to 1/γ_k^t, to ensure that the norm of g_k^γ(β^t, γ^t) is bounded, we need to smooth γ^{t+1}. In order to have a smoothing effect without modifying γ^{t+1}, we sample directly from the smoothed distribution \hat{γ}^{t+1}, defined as

\hat{\gamma}_k^{t+1} = (1 - \delta)\,\gamma_k^{t+1} + \frac{\delta}{m}, \quad k = 1, \ldots, m,

where δ > 0 is a small probability mass used for smoothing. Note that for any γ ∈ Γ, the smoothed distribution belongs to

\hat{\Gamma} = \left\{ \hat{\gamma} : \hat{\gamma}^\top \mathbf{1} = 1, \; \hat{\gamma}_k \ge \frac{\delta}{m}, \; k = 1, \ldots, m \right\}.

We refer to this algorithm as multi-label multiple kernel learning by stochastic approximation, or ML-MKL-SA for short. Algorithm 3 gives the detailed description.
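For concreteness, the following is a minimal sketch of a single iteration in the spirit of Algorithm 3, assuming precomputed base kernel matrices, label vectors in {−1, +1}^n, and scikit-learn's SVC with a precomputed kernel as the off-the-shelf SVM solver; the choice of sampling from the smoothed distribution, the default step sizes, and all variable names are illustrative assumptions rather than the thesis implementation.

import numpy as np
from sklearn.svm import SVC  # off-the-shelf SVM solver with precomputed-kernel support

def ml_mkl_sa_step(kernels, labels, beta, gamma, C=1.0,
                   eta_beta=0.01, eta_gamma=0.01, delta=0.2, rng=None):
    """One iteration in the spirit of ML-MKL-SA: sample a single class from the
    smoothed distribution, solve one SVM on K(beta), estimate the gradients as
    in Eqs. (3.8)-(3.9), and apply exponentiated-gradient updates to beta, gamma."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(labels)
    gamma_hat = (1 - delta) * gamma + delta / m            # smoothed sampling distribution
    a = rng.choice(m, p=gamma_hat)                         # sampled task a_t
    K = sum(b * Kj for b, Kj in zip(beta, kernels))        # combined kernel K(beta)
    y = labels[a]                                          # labels of task a_t, in {-1,+1}^n
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    ay = np.zeros(len(y))
    ay[svm.support_] = svm.dual_coef_.ravel()              # entries of alpha_a o y_a
    g_beta = np.array([-0.5 * ay @ Kj @ ay for Kj in kernels])        # Eq. (3.8)
    g_gamma = np.zeros(m)
    g_gamma[a] = (np.abs(ay).sum() - 0.5 * ay @ K @ ay) / gamma[a]    # Eq. (3.9)
    beta = beta * np.exp(-eta_beta * g_beta)               # descent step on beta
    gamma = gamma * np.exp(eta_gamma * g_gamma)            # ascent step on gamma
    return beta / beta.sum(), gamma / gamma.sum()

Averaging the iterates β^t and γ^t over T such steps yields the final solution (β̄, γ̄), as in the last step of Algorithm 3.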
3.3.2 Convergence Analysis

Since Eq. (3.5) is a convex-concave optimization problem, we introduce the following criterion for measuring the quality of a solution (β, γ):

\bar{\Delta}(\beta, \gamma) = \max_{\gamma' \in \Gamma} L(\beta, \gamma') - \min_{\beta' \in \Delta_1} L(\beta', \gamma).   (3.11)

We denote by (β*, γ*) the optimal solution to Eq. (3.5).

Proposition 2. We have the following properties for \bar{\Delta}(\beta, \gamma):
1. \bar{\Delta}(\beta, \gamma) \ge 0 for any solution β ∈ Δ_1 and γ ∈ Γ;
2. \bar{\Delta}(\beta^*, \gamma^*) = 0;
3. \bar{\Delta}(\beta, \gamma) is jointly convex in both β and γ.

We have the following theorem for the convergence rate of Algorithm 3. The detailed proof can be found in Section A.3.

Theorem 1. After running Algorithm 3 over T iterations, we have the following inequality for the solution \bar{\beta} and \bar{\gamma} obtained by Algorithm 3:

E\big[\bar{\Delta}(\bar{\beta}, \bar{\gamma})\big] \le \frac{m^2(\ln m + \ln s)}{\eta_\gamma T} + \frac{\eta_\gamma d}{2}\left( \lambda_0^2 n^2 C^4 + \frac{n^2 C^2}{2\delta} \right),

where d is a constant term, E[·] stands for the expectation over the sampled task indices of all iterations, and \lambda_0 = \max_{1 \le j \le s} \lambda_{\max}(K_j), where \lambda_{\max}(Z) stands for the maximum eigenvalue of the matrix Z.

Corollary 2. With \delta = m^{-2/3} and \eta_\gamma = \frac{1}{n} m^{-1/3} \sqrt{(\ln m)/T}, after running Algorithm 3 over T iterations, we have E[\bar{\Delta}(\bar{\beta}, \bar{\gamma})] \le O\big(n\, m^{1/3} \sqrt{(\ln m)/T}\big) in terms of m, n and T.

Since we only need to solve one kernel SVM at each iteration, the computational complexity of the proposed algorithm is on the order of O(m^{1/3}\sqrt{(\ln m)/T}), sublinear in the number of classes m.

3.4 Experimental Results

In this section, we empirically evaluate the proposed multi-label multiple kernel learning algorithm by demonstrating its efficiency and effectiveness on the image categorization task.

3.4.1 Data Sets

Following the MKL experiments in Chapter 2, we use the same three benchmark data sets and the same base kernels as in that chapter: Caltech 101 [3], Pascal VOC 2007 [94], and a subset of ImageNet. All the experiments conducted in this chapter are repeated five times, each with an independent random partition of training and testing data. Mean average precision scores along with the associated standard deviations are reported.

3.4.2 Baseline Methods

We compare four MKL methods and the average kernel baseline. The MKL baselines can be categorized into two groups. The first group is the one-vs-all MKL framework, which requires solving one MKL problem for each class separately. For this group, we use two base MKL solvers that are shown to be the most efficient L1-MKL methods in Chapter 2: (i) MKL-SIP, a semi-infinite programming (SIP) based method for MKL [71], and (ii) MKL-Level, an extended level based method for MKL [52]. We also use MKL-SIP-L2 to include a non-sparse MKL solver in the comparison. The second group of methods requires learning a single kernel combination simultaneously for all classes. The two baseline methods that fall into this group are: (i) ML-MKL-Sum, which learns a kernel combination shared by all classes, as described in Section 3.3, using the optimization method in [116], and (ii) the proposed ML-MKL-SA method.

Figure 3.1: For the 4 classes (ant, butterfly, ceiling fan, chair) taken from the Caltech 101 data set, the first row gives images which produced false negatives for the single kernel baseline and true positives for the ML-MKL-SA baseline. The second row gives images which produced false positives for the single kernel baseline and true negatives for the ML-MKL-SA baseline for the corresponding classes.
3.4.3 Implementation

The experiments with varied numbers of instances on the ImageNet data set were performed on a cluster of Sun Fire X4600 M2 nodes, each with 256 GB of RAM and 32 AMD Opteron cores, due to the need for high RAM capacity (over 100 GB). All other experiments were run on a different cluster, where each node has two four-core Intel Xeon E5620s at 2.4 GHz with 24 GB of RAM. We pre-compute all the kernel matrices and load them into memory. This allows us to avoid re-computing and loading kernel matrices at each iteration of the optimization. All the baseline methods are coded in MATLAB. For all the wrapper methods for MKL, LIBSVM [107] is used as the off-the-shelf SVM solver. For MKL-SIP and MKL-Level, MOSEK [89] is used to solve the related optimization problems, as suggested in [52].

The same stopping criteria are applied to all the MKL algorithms when applicable. The algorithms terminate when: (i) the relative change in the duality gap falls below a threshold (1 − Δ_t/Δ_{t−1} < 10^{-2}), (ii) the change in the cost function falls below a threshold (10^{-3}), (iii) the difference in the kernel coefficients β between two consecutive iterations is small (i.e., ||β^t − β^{t−1}||_∞ < 10^{-4}), and (iv) the maximum number of iterations is reached. A 2-fold cross-validation is applied to select the value of the regularization parameter C ∈ {10^{-2}, 10^{-1}, . . . , 10^4}. The bandwidth of the RBF kernel is set to the average pair-wise χ² distance between image pairs. Unless stated otherwise, the smoothing parameter δ is set to 0.2 for the proposed method. For simplicity, we take η = η_β = η_γ in all the following experiments. The step size η is chosen as 0.01 for the Caltech 101 data set and 0.001 for the VOC 2007 and ImageNet data sets in order to achieve the best computational efficiency.
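As a small illustration of how the stopping rules listed above combine, the helper below checks the four criteria, assuming they are applied disjunctively (training stops as soon as any one of them holds); the function and variable names are hypothetical and not taken from the thesis code.

import numpy as np

def should_stop(gap, prev_gap, cost, prev_cost, beta, prev_beta, iteration, max_iter):
    """Return True if any of the four stopping rules listed above is satisfied."""
    rel_gap_change = 1.0 - gap / prev_gap                  # (i) relative change in the duality gap
    return (rel_gap_change < 1e-2
            or abs(cost - prev_cost) < 1e-3                # (ii) change in the cost function
            or np.max(np.abs(beta - prev_beta)) < 1e-4     # (iii) ||beta_t - beta_{t-1}||_inf
            or iteration >= max_iter)                      # (iv) maximum number of iterations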
3.4.4 Classification Performance

To evaluate the effectiveness of different algorithms for multi-label multiple kernel learning, we report the category-based mean average precision (MAP) over all the classes. We evaluate the efficiency of the algorithms by their running times (seconds) for training.

Figure 3.2: For the 4 classes (bird, potted plant, dining table, train) taken from the VOC 2007 data set, the first row gives images which produced false negatives for the single kernel baseline and true positives for the ML-MKL-SA baseline. The second row gives images which produced false positives for the single kernel baseline and true negatives for the ML-MKL-SA baseline for the corresponding classes.

Table 3.1 summarizes the classification accuracies (MAP) of all the baseline methods on the Caltech 101 data set under three settings with 10, 20, and 30 training instances per class. The MKL-SIP-L2 and average kernel baselines yield the best performance for the first two settings, whereas MKL solvers with the L1 norm are superior for the last setting, where the number of training instances per class is 30. MKL-L1 methods give sparse solutions by eliminating irrelevant base kernels.

Table 3.1: Classification results (MAP) for the Caltech 101 data set. We report the average values over five random splits and the associated standard deviation.

Baseline        Number of training instances per class
                10            20            30
Average         59.0 ± 0.7    69.7 ± 0.6    77.2 ± 0.5
MKL-Level       54.7 ± 1.0    63.4 ± 0.6    84.4 ± 0.4
MKL-SIP-L1      53.8 ± 0.6    63.8 ± 0.9    83.9 ± 0.7
MKL-SIP-L2      60.1 ± 0.6    70.7 ± 1.0    79.1 ± 0.6
ML-MKL-Sum      55.1 ± 1.3    65.0 ± 0.7    85.6 ± 0.7
ML-MKL-SA       54.5 ± 0.7    66.1 ± 0.9    85.3 ± 0.8

Table 3.2: Classification results (MAP) for the VOC 2007 data set. We report the average values over five random splits and the associated standard deviation.

Baseline        Percentage of the samples used for training
                1%            5%            25%            50%           75%
Average         21.9 ± 0.5    42.4 ± 0.3    48.2 ± 0.8     54.5 ± 0.8    57.5 ± 0.8
MKL-Level       23.4 ± 0.6    44.4 ± 0.4    51.5 ± 0.5     57.1 ± 0.6    59.6 ± 0.9
MKL-SIP-L1      22.6 ± 1.0    44.2 ± 0.3    51.2 ± 0.33    56.6 ± 0.5    59.5 ± 0.9
MKL-SIP-L2      22.7 ± 0.4    42.6 ± 0.2    49.8 ± 0.2     57.3 ± 0.2    60.6 ± 0.5
ML-MKL-Sum      24.1 ± 0.4    43.5 ± 0.5    50.1 ± 0.4     55.8 ± 0.1    58.8 ± 0.2
ML-MKL-SA       24.6 ± 0.9    44.1 ± 0.6    50.6 ± 0.4     56.1 ± 0.2    58.9 ± 0.4

However, as discussed in Chapter 2, when the number of training examples is very small, it is difficult to determine the subset of kernels that are irrelevant to a given task. This is why MKL-L1 methods give better results than MKL-L2 methods on the Caltech 101 data set as the number of training instances increases.

Although the two multi-label MKL baselines, namely ML-MKL-Sum and ML-MKL-SA, are originally proposed as efficient approximations to the one-vs-all MKL framework, they match and sometimes even outperform the one-vs-all MKL methods, MKL-SIP and MKL-Level, that learn one kernel combination for each class. These results justify the assumption of using the same kernel combination for all the classes on the Caltech 101 data set. Note that the average kernel baseline (AVG), which is similar in that it uses the same kernel combination for all classes, yields reasonable performance, although its classification performance is significantly worse than the proposed approach ML-MKL-SA when there is a sufficient number of training instances (30 instances per class for the Caltech 101 data set).

We provide some example images from the Caltech 101 data set in Figure 3.1 to visualize the advantage MKL brings over using a single kernel. For the 4 classes (ant, butterfly, ceiling fan, chair) taken from the Caltech 101 data set, the first row gives images which produced true positives for the ML-MKL-SA baseline and false negatives when a single kernel (the best performing base kernel) is used. On the other hand, the second row gives images which produced false positives for the single kernel case and true negatives for the ML-MKL-SA baseline for the corresponding classes. Note the level of similarity in the shapes of the images in the same column, which is the likely cause of the errors for the single kernel case. By using different image representations, MKL avoids these errors on these sample images.

Table 3.2 summarizes the classification accuracies (MAP) for all the baseline methods on the VOC 2007 data set under five different settings, where 1%, 5%, 25%, 50%, and 75% of the whole data set is used as the training set. Table 3.2 confirms the conclusions drawn from Table 3.1: all the MKL methods, including ML-MKL-Sum and ML-MKL-SA, outperform the average kernel baseline as the number of training instances increases (for all settings except the 1% case). The difference between the Caltech 101 and VOC 2007 results is that we do not see a significant performance difference between MKL-L1 and MKL-L2 methods.
As discussed in Chapter 2, this is because the number of base kernels is smaller in the VOC 2007 experiments. Finally, we see that ML-MKL-Sum and ML-MKL-SA yield very close results compared to the other MKL baselines, despite learning one shared kernel combination for all classes.

We also provide some example images from the VOC 2007 data set in Figure 3.2 to visualize the strength of MKL. We take four object categories and two different test images from each category. The first row gives images which produced true positives for the ML-MKL-SA baseline and false negatives when a single kernel (the best performing base kernel) is used. The second row gives images which produced false positives for the single kernel case and true negatives for the ML-MKL-SA baseline for the corresponding classes. These examples demonstrate that MKL methods are able to avoid false positives and false negatives by successfully combining several image representations.

Table 3.3: Training time (seconds) for the Caltech 101 data set. We report the average values over five random splits and the associated standard deviation.

Baseline        Number of training instances per class
                10              20               30
MKL-Level       816.1 ± 125.6   3570.0 ± 519.0   6456.6 ± 664.2
MKL-SIP-L1      550.8 ± 91.8    2233.8 ± 871.5   4518.6 ± 501.2
MKL-SIP-L2      387.6 ± 72.4    1275.0 ± 201.6   3100.8 ± 314.6
ML-MKL-Sum      302.7 ± 4.8     1053.8 ± 201.3   3817.9 ± 308.1
ML-MKL-SA       119.2 ± 0.9     471.3 ± 16.9     1140.4 ± 276.5

3.4.5 Training Time

We provide Tables 3.3 and 3.4 to compare the running times of the MKL baseline methods. Observe that ML-MKL-SA and ML-MKL-Sum are in general more efficient than the other MKL methods in the Caltech 101 experiments. This is not surprising, as ML-MKL-SA and ML-MKL-Sum compute a single kernel combination for all classes. However, note that MKL-SIP-L2 is faster than ML-MKL-Sum when the number of training instances is 30 per class for the Caltech 101 data set. This is because of the fast convergence of the MKL-L2 problem (see Chapter 2 for details). Moreover, we see that MKL-SIP-L2 is faster than ML-MKL-Sum in most of the VOC 2007 settings. The main reason for this is that, in addition to the fast convergence of MKL-SIP-L2, the number of kernels and classes is smaller in the VOC 2007 data set. However, based on these observations, we expect ML-MKL-Sum to become faster as the number of classes and the number of kernels increase, since the MKL-L1 formulation often provides sparse solutions, which would significantly cut down the time spent on kernel computations.

The main advantage of the proposed algorithm is its computational efficiency. From Tables 3.3 and 3.4, we see that the proposed method requires less training time than the other baselines while providing comparable classification performance. Clearly, for data sets with a large number of categories, the two methods that learn one shared kernel combination for all labels (ML-MKL-SA and ML-MKL-Sum) are computationally more efficient than the methods that learn a kernel combination separately for each class.
Table 3.4: Training time (seconds) for the VOC 2007 data set. We report the average values over five random splits and the associated standard deviation.

Baseline        Percentage of the samples used for training
                1%           5%            25%             50%              75%
MKL-Level       4.5 ± 0.5    43.3 ± 7.1    802.0 ± 113.2   4332.6 ± 587.3   5946.0 ± 950.1
MKL-SIP-L1      6.4 ± 3.4    47.9 ± 10.6   692.0 ± 67.8    4396.8 ± 606.7   6180.0 ± 940.2
MKL-SIP-L2      16.4 ± 2.3   34.3 ± 7.4    192.0 ± 21.3    706.0 ± 178.3    1686.0 ± 246.5
ML-MKL-Sum      2.5 ± 0.3    57.4 ± 9.1    372.3 ± 26.6    2162.1 ± 175.3   3983.0 ± 402.2
ML-MKL-SA       1.2 ± 0.3    39.8 ± 4.8    234.1 ± 21.1    886.5 ± 101.7    1224.3 ± 136.2

In addition, the proposed method brings a further improvement in efficiency compared to ML-MKL-Sum. The reduction in computation time is more significant for the Caltech 101 data set than for the VOC 2007 data set. This is because the proposed algorithm employs an SVM solver for only one class per iteration, whereas ML-MKL-Sum has to train SVM solvers separately for all classes at each iteration. Since Caltech 101 has a larger number of classes, the proposed method shows a greater advantage on the Caltech 101 data set.

Figure 3.6 shows the change in the kernel weights over time for the proposed method (ML-MKL-SA), and Figures 3.3, 3.4, and 3.5 show the change in the kernel weights for three other baseline methods (ML-MKL-Sum, MKL-Level, and MKL-SIP-L1) on the Caltech 101 data set with 30 training instances per class. We observe that, overall, ML-MKL-SA shares a similar pattern with MKL-Level in the evolution curves of the kernel weights, but is much faster. We also obtain very similar curves when comparing MKL-SIP-L1 and ML-MKL-Sum, as expected, since these two baselines use the same solver. When comparing ML-MKL-Sum and ML-MKL-SA, which are significantly more efficient than the other two baselines, we see that the kernel weights learned by ML-MKL-Sum vary significantly, particularly at the beginning of the learning process, making it a less stable algorithm than the proposed algorithm ML-MKL-SA.

Figure 3.3: The evolution of kernel weights computed by the MKL-Level method over time for the Caltech 101 data set with 30 training instances per class.

Figure 3.4: The evolution of kernel weights computed by the MKL-SIP-L1 method over time for the Caltech 101 data set with 30 training instances per class.

Figure 3.5: The evolution of kernel weights computed by the ML-MKL-Sum method over time for the Caltech 101 data set with 30 training instances per class.

Figure 3.6: The evolution of kernel weights computed by the ML-MKL-SA method over time for the Caltech 101 data set with 30 training instances per class.

3.4.6 Sensitivity to Parameters

To evaluate the sensitivity of the proposed method to the parameters δ, η_β and η_γ, we conducted experiments with varied values of these three parameters. Figure 3.7 shows how the classification performance (MAP) of the proposed algorithm changes over iterations on Caltech 101 (30 training instances per class) using six different values of δ: {0, 0.2, 0.4, 0.6, 0.8, 1}. We observe that the final classification accuracy is comparable for different values of δ, demonstrating the robustness of the proposed method to the choice of δ.
However, we also note that the extreme case where δ = 0 gives the worst performance, indicating the importance of adding the uniform sampling component for increased stability.

Figure 3.7: Classification performance (MAP) of the proposed algorithm ML-MKL-SA on Caltech 101 with 30 training instances per class using different values of δ (for η_β = η_γ = 0.01).

Figure 3.8 shows the change in classification performance (MAP) for three different values of η_β for a fixed η_γ, whereas Figure 3.9 shows the change in classification performance (MAP) for three different values of η_γ for a fixed η_β, when 30 samples per class are used from the Caltech 101 data set. Based on these plots, we observe that a change in the value of η_β is likely to have a greater impact on the convergence speed than a change in the value of η_γ. In particular, we see that η_γ = 0.01 and η_γ = 0.001 produce very similar plots. This result demonstrates that the proposed algorithm is in general insensitive to the choice of the step size η_γ. On the other hand, a more careful selection still needs to be made for η_β in order to avoid slow convergence.

Figure 3.8: Classification performance (MAP) of the proposed algorithm ML-MKL-SA on Caltech 101 with 30 training instances per class using different values of η_β (for η_γ = 0.0001 and δ = 0.2).

Figure 3.9: Classification performance (MAP) of the proposed algorithm ML-MKL-SA on Caltech 101 with 30 training instances per class using different values of η_γ (η_β = 0.0001 and δ = 0.2).

3.4.7 Large-scale MKL on ImageNet

To evaluate the scalability of MKL, we perform experiments on the subset of ImageNet consisting of 81,738 images.

Figure 3.10: Comparison of the mean average precision scores for different training set sizes for the ImageNet data set.

Figure 3.10 shows the classification performance of ML-MKL-SA and the other baseline methods with the number of training images per class varied in powers of 2 (2^1, 2^2, . . . , 2^11). We used MKL-SIP for both the L1 and L2 norm MKL. Similar to the experimental results for Caltech 101 and VOC 2007, we observe that ML-MKL-SA and ML-MKL-Sum give comparable performance to the MKL solvers that learn a separate kernel combination for each class. In fact, for the settings with a smaller number of instances (100 to 18,000), ML-MKL-SA outperforms MKL-L2, whereas ML-MKL-Sum outperforms both MKL-L1 and MKL-L2. However, the difference between the baseline performances starts to diminish when the number of training examples is increased over 256 per class. As discussed in Chapter 2, this is because all the 10 kernels constructed for the ImageNet data set are strong kernels and provide informative features for image categorization. In other words, the main strength of MKL-L1, which is being able to remove irrelevant or weak kernels, does not bring any advantage.
Figure 3.11: Comparison of training times for different training set sizes for the ImageNet data set.

We also compare the training times of the baseline methods on the ImageNet data set. The comparison in Fig. 3.11 confirms our previous results and demonstrates the efficiency of the proposed ML-MKL-SA method.

3.5 Conclusions and Future Work

In this chapter, we present an efficient optimization framework for multi-label multiple kernel learning that combines a worst-case analysis with stochastic approximation. Compared to the other algorithms for ML-MKL, the key advantage of the proposed algorithm is that its computational cost is sublinear in the number of classes, making it suitable for handling a large number of classes. We verify the effectiveness of the proposed algorithm by experiments in image categorization on several benchmark data sets.

There are two main directions that we plan to explore in the future. The first one is improving the classification performance. Our experiments showed that, for the one-vs-all MKL framework, the proposed method improves the computational efficiency without causing a significant drop in performance. However, the accuracy in image categorization can be improved by replacing the one-vs-all framework with a multi-label learning formulation. To address this issue, we propose a multiple kernel multi-label ranking method in Chapter 6. The second future direction is improving the prediction speed, which is in general more crucial than training speed in real-world systems. To be able to cope with the increasing size of image data sets, the prediction step needs to use sparse kernel combinations and classification functions. It is also desirable to have a sublinear dependency of the prediction complexity on the number of classes.

Algorithm 3 The proposed ML-MKL-SA algorithm
1: Input
   • η_β, η_γ: step sizes
   • K_1, . . . , K_s: the base kernel matrices
   • y_1, . . . , y_m: the assignments of the m different classes to the n training instances
   • T: number of iterations
   • n, m, s: number of instances, classes, and kernels, respectively
   • δ: smoothing parameter
2: Initialization: γ^1 = 1/m and β^1 = 1/s
3: for t = 1, . . . , T do
4:   Sample a classification task a_t according to the distribution Multi(γ_1^t, . . . , γ_m^t).
5:   Compute α^{a_t} = arg max_{α ∈ [0,C]^n} α^⊤1 − (α ◦ y_{a_t})^⊤ K(β)(α ◦ y_{a_t})/2 using an off-the-shelf SVM solver.
6:   Compute the estimated gradients g_j^β(β^t, γ^t) and g_k^γ(β^t, γ^t) using Eqs. (3.8) and (3.9).
7:   Update β^{t+1}, γ^{t+1} and γ̂^{t+1} as follows:
       β_j^{t+1} = (β_j^t / Z_β^t) exp(−η_β g_j^β(β^t, γ^t)), j = 1, . . . , s,
       γ_k^{t+1} = (γ_k^t / Z_γ^t) exp(η_γ g_k^γ(β^t, γ^t)), k = 1, . . . , m,
       γ̂^{t+1} = (1 − δ) γ^{t+1} + (δ/m) 1.   (3.10)
8: end for
9: Compute the final solution β̄ and γ̄ as β̄ = (1/T) Σ_{t=1}^T β^t and γ̄ = (1/T) Σ_{t=1}^T γ^t.

Chapter 4

Image Categorization by Multi-label Ranking

4.1 Introduction

Image categorization requires an image to be assigned to a set of multiple classes, chosen from a large set of class labels. Therefore, image categorization can be cast into multi-label learning, in which each image can be simultaneously classified into more than one class. The most widely used approaches divide a multi-label learning problem into multiple independent binary labeling tasks. The division usually follows the one-vs-all, one-vs-one, or general error-correcting code framework [120, 121].
Most of these approaches suffer from imbalanced data distributions when constructing binary classifiers. This problem becomes more severe when the number of classes is large. Another limitation of these approaches is that they are unable to capture the correlation among classes [10]. In this chapter, we describe our multi-label ranking method, which addresses these two issues by simultaneously learning classifiers for all labels.

Our method tackles the multi-label learning problem using a multi-label ranking approach. For a given example, multi-label ranking aims to rank all relevant classes higher than irrelevant classes. By converting the classification problem into a ranking problem, multi-label ranking avoids constructing binary classifiers, which operate by distinguishing an individual class from the rest (one-vs-all) or a pair of classes from each other (one-vs-one), thus alleviating the problem of imbalanced data distributions. In addition, by avoiding the binary decision regarding which subset of classes should be assigned to each example, multi-label ranking is usually more robust than the classification approaches, particularly when the number of classes is large.

We propose an efficient algorithm to solve the multi-label ranking problem, which is based on a simple line search. One advantage of our method compared to the majority of ranking methods is that the proposed algorithm has a linear dependency on the number of classes. In contrast, most multi-label ranking methods have a quadratic dependency because of the pair-wise class comparisons. We show that our kernel based multi-label ranking formulation is closely related to the one-vs-all dual SVM objective. However, unlike the one-vs-all formulation, the proposed cost function cannot be divided into independent components, i.e., one for each class, for optimization. Instead, two features of the proposed method enable exploiting the relationships between labels without making explicit assumptions on the structure of the correlations. The first one is a balance constraint, which forces the sum of the dual variables that correspond to the positive classes to be equal to that of the negative classes. The second feature of the proposed method is the optimization scheme it employs, which solves the problem for all classes together and chooses the dual variables from a closed set.

4.2 Previous Work

The most widely used approach for multi-label learning is dividing the multi-label learning task into multiple independent binary classification tasks, i.e., learning a binary classifier for each label and deciding the label assignment of a test sample independently for each class. This method is called binary relevance (BR) or one-vs-all classification. Once a multi-label learning problem is decomposed into multiple binary classification problems, any binary classification algorithm can be employed as a base solver. However, this straightforward approach has several shortcomings. Therefore, we see several attempts in the literature to develop algorithms that specifically address the needs of multi-labeled data, instead of simplifying the multi-label learning task by transforming the problem into an easier one. There is a very rich literature on multi-label learning. We review multi-label learning methods in four subsections, which are not necessarily mutually exclusive. We also discuss problems related to multi-label learning in Section 4.2.5.
4.2.1 Label Set Transformation Methods

We categorize the methods that fall into this category into two groups: (i) problem transformation methods, and (ii) label set projection methods.

4.2.1.1 Problem Transformation Methods

With binary decomposition techniques like one-vs-all and one-vs-one, label set transformation methods were the popular choice for early multi-label learning studies [121]. In a binary decomposition framework, a multi-label learning problem is decomposed into a set of binary classification tasks, which can be easily solved by using well-studied binary classifiers such as SVM or naive Bayes. One of the shortcomings of the binary decomposition methods is that each classifier is trained independently, meaning that the correlations or dependencies between different classes are not exploited. Such dependencies can be very handy in many applications. Consider an example from automatic image annotation: if an image is tagged with the labels sun and clouds, it is very likely that the label sky is also a relevant label. Therefore, knowing the existence of the label sun in the image should make detecting the label sky in the image easier. Another problem with converting a multi-label learning problem into a set of binary classification tasks is the imbalanced (skewed) data distributions, particularly when the number of classes is large.

Another approach for label set transformation is to consider each possible combination of a binary label vector y_i = (y_1^i, . . . , y_m^i) ∈ {0, 1}^m as an individual class. This approach, which is known as the "label powerset" technique, leads to a multi-class single-label problem with a total of 2^m new labels, which are called powerset labels. However, label powerset is not a practical method since the number of classes (powerset labels) in the transformed problem is exponential in the number of original labels.

Dietterich et al. proposed a technique for encoding classifier outputs in a multi-class single-label setting to increase the performance and robustness of the base learners [120]. The authors borrowed the idea of error-correcting coding (ECC) from communication theory to create distributed output representations. Error-correcting coding is a robust coding scheme that makes detecting and correcting errors in the output code possible. The main idea in the error-correcting output codes (ECOC) scheme is to encode each class by a unique binary string (codeword) of length q. Then, a separate binary classifier is learned to predict each of these q bits. Once the functions for each codeword digit are learned, the outputs of these q functions are evaluated for each test instance and an output binary string is constructed, which is then compared to all class codewords.

4.2.1.2 Label Set Projection Methods

Projecting the label set into a lower-dimensional space before the learning step is a frequently used idea in the multi-label learning literature. The main motivation for using a projected label set instead of the original assignment vector is to increase the computational efficiency by decreasing the number of classes. The overall framework of the label set projection methods is illustrated by Figure 4.1. We can summarize the overall process in 4 steps:
We can summarize the overall process in 4 steps: 90 Learn the mapping/ projection Project the label set of each training sample Training classifier/regressor in the projected label space Back-projection of the outputs to the original label space Figure 4.1: A diagram summarizing the label set projection schemes for multi-label learning. 1. Learn or construct the projection operation (matrix) to be used to project the original label vector into (possibly, but not necessarily) a lower dimensional space. ′ 2. Perform the projections: ψ(·) : Rm → Rm , s.t. y ˘ i = ψ(yi ), where m is the number of original labels, m′ the projected space dimension, yi the original label vector and y ˘ i is the new label vector in the projected space. ′ 3. Learn a classification/regression model f (·) : Rd → Rm , s.t. f (xi ) = y ˘i. ′ 4. Perform back projection to the original label space from the projected space: ψ ′ (·) : Rm → Rm , s.t. y ˆi = ψ ′ (˘ yi ), where y ˆ i is the final label vector prediction. Hsu et al. proposed to use the compressed sensing technique as a label set projection algorithm [122]. With the underlying assumption that label vectors are sparse, their scheme uses random projections for Step 2 and performs regression in Step 3. They show that if the label vectors are k-sparse (average number of nonzero entries is k), then the number of projections would be in the order of k log m, where m is the number of classes. One drawback of this method is that Step 4, the mapping of the predictions back to the original label space, might be complicated since it requires solving an optimization problem for each test sample. Zhou et al. [123] proposed to use the sign of the random Gaussian projections instead of the projections themselves, thus making the projected label matrix Y ′ binary and allowing the use of binary classification, instead of regression, which 91 is employed by the other methods. The recovery step (Step 4) is also different from the original compressed based algorithm [122]. Zhou et al. proposed to use a technique they named as the label set distilling method. In order to reduce the back-projection step’s complexity, Tai et al. proposed a technique called principle label space transform (PLST) [124]. PLST differs from the compressed sensing approach in that its projection matrix is constructed by using the singular vectors of the label matrix Y. Since singular vectors are orthonormal, the projection back to the original label space can be completed simply with a round-based reconstruction: multiplying lower dimensional predictions by the projection matrix and then performing element-wise rounding. 4.2.2 Supervised Algorithm Adaptation Methods There are also methods that are specifically designed for multi-label learning by adapting the wellknown supervised binary classification methods for handling multi-label data. For example, Zhang et al. proposed a maximum a posteriori estimation (MAP) multi-label K-nearest neighbors method (ML-KNN) [125]. In ML-KNN, the estimation of the label vector for a query sample depends on the label prior probability and the probability of assigning a label to an instance conditioned on the number of neighboring instances with the same label. Schapire et al. proposed two extensions to the well-known Adaboost method for multi-label learning. The first one is Adaboost.MH, which minimizes the Hamming loss and uses a binary decomposition approach, in which each multi-labeled sample is replaced by m new binary sample, with m being the number of labels. 
The second extension, AdaBoost.MR, enforces a bipartite ranking of the labels through a set of pair-wise comparisons [126]. Many well-known decision tree algorithms have also been adapted to multi-label learning with some modifications. Clare et al. modified the C4.5 decision tree algorithm [127] for multi-label learning by changing the entropy definition used in the information-gain criterion [128]. Alternating decision trees (ADT) [129] and predictive clustering trees (PCT) [130] are other methods that have been extended to multi-label learning [131, 132]. In their PCT based multi-label learning study, Blockeel et al. showed that learning one tree for all labels simultaneously is better in terms of both speed and accuracy than learning an independent tree for each label [132].

4.2.2.1 Transfer Learning for Multi-label Classification

Han et al. proposed a transfer learning scheme for multi-label learning that transfers knowledge between different domains via a linear projection of the data points; this projection is formulated using the graph Laplacian of a label-induced hypergraph and an elastic-net regularizer [133]. Observing that most transfer learning methods focus on transferring knowledge between different sources or domains for the same class, Qi et al. presented a multi-label transfer learning approach that aims to perform inter-class knowledge transfer, within a single domain or across multiple domains [134]. They define a transfer function for each class; these functions depend on two types of similarity measures defined on the training samples. The first similarity is based on a kernel function that strictly uses the input features. The second similarity measure involves both the input features and the corresponding label information through a label affinity matrix S. The algorithm described in their study simultaneously optimizes the transfer functions and the label affinity matrix.

4.2.3 Multi-label Ranking Methods

One of the earliest multi-label ranking algorithms was proposed in [27]. Constraints derived from the multi-labeled instances were used to enforce that the relevant classes are ranked higher than the irrelevant ones [27]. Crammer et al. [135] improved the computational efficiency of [27] by exclusively considering the most violated constraint: only two labels are compared per instance, one being the positive label with the minimum output score and the other the negative label with the maximum output score. Elisseeff et al. proposed the RankSVM method, which uses a pair-wise label ranking loss in the SVM formulation [136]. Dekel et al. [137] and Shalev-Shwartz et al. [11] encoded the ranking problem using a preference graph. A boosting based algorithm was used in [137] to learn the classifiers from a set of given instances and the corresponding preference graphs. Although the framework described in [137] suits any type of ranking task, the multi-label learning problem is formulated as directed bipartite ranking. In [11], a generalization of the hinge loss function for preference graphs was used for multi-label ranking. In all these approaches, a ranking model is learned from pairwise constraints between the relevant and irrelevant classes. The number of pair-wise constraints has a quadratic dependence on the number of classes, making these approaches computationally expensive when the number of classes is large.
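To make the scaling issue concrete, the short fragment below counts the relevant/irrelevant label pairs that such pairwise formulations generate; the function and the toy assignment matrix are illustrative only, not part of any of the cited methods.

```python
import numpy as np

def num_pairwise_constraints(Y):
    """Y: n x m binary assignment matrix (1 = relevant). Each instance with a
    relevant labels contributes a * (m - a) relevant/irrelevant pairs."""
    a = Y.sum(axis=1)
    return int((a * (Y.shape[1] - a)).sum())

rng = np.random.default_rng(0)
Y = (rng.random((10_000, 500)) < 0.02).astype(int)  # ~10 relevant labels per image
print(num_pairwise_constraints(Y))                   # roughly 49 million constraints
```

Even with sparse label vectors, the number of pairwise constraints grows quickly with both the number of instances and the number of classes, which is the scaling issue the approximation in Section 4.4 is designed to avoid.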
In contrast, the proposed multi-label ranking framework that we discuss in this chapter is computationally efficient and can handle a large number of classes (on the order of hundreds).

4.2.4 Exploiting Label Correlation in Multi-label Learning

A number of approaches have been developed for multi-label learning that aim to capture the dependencies among classes. In [10], the authors proposed to model the dependencies among the classes using a generative model. Ghamrawi et al. [12] tried to capture the dependencies by defining a conditional random field over all possible combinations of the labels. In [13], a multi-label matrix factorization approach that captures the class correlation via a class co-occurrence matrix was used. A hierarchical Bayesian approach was introduced in [29] to capture the dependency among classes. There are several approaches [30, 138–141] for multi-label learning that encode the class dependencies under the assumption that some important features are shared among classes. Given the bag-of-words representation of documents, McCallum proposed an EM based scheme that not only estimates the source classes for each document, but also tries to find how the classes contribute to the generation process of the words [138]. By revealing the word-class relationship, this method can benefit from label correlations when classifying a document based on its word content. In their algorithm named MAHR, Huang et al. exploit label correlations automatically through the hypothesis reuse principle: a hypothesis extracted for one label can be used on other labels [142]. Guo et al. proposed to use conditional dependency networks to model label correlations [143]. A hypergraph representation, in which each vertex is a training instance and each hyperedge for a category is the collection of relevant training samples, has also been used to model higher-order label correlations [144, 145]. There are also stacking techniques (e.g., BR+) [146, 147] and classifier chains [148, 149], feature set transformation methods that exploit class correlations.

We emphasize that our work does not focus on modeling the class correlations explicitly. While indirectly benefiting from dependencies between class labels, we do not make any assumptions regarding the type of relationships that exist between class labels. It should be noted that our proposed multi-label ranking method can be combined with many of the above approaches to further improve the classification performance in multi-label learning.

4.2.5 Related Problems

It is important to note that multi-label learning, despite having a similar goal, differs from the related task of multi-task learning [150]. Multi-task learning can be thought of as a bridge between multi-label learning and binary decomposition methods. Similar to binary decomposition methods, a binary classifier is trained for each class. However, unlike binary decomposition methods, the classes are no longer assumed to be independent; rather, they are trained using information shared between classes. Multi-instance learning [151] is another task that can be confused with multi-label learning. The sole goal of multi-label learning is to find the relevant labels of an image. In contrast, multi-instance learning requires locating the concepts/objects in the image. In this thesis, our understanding of the image categorization problem requires that all categories are pre-defined and have at least one corresponding instance (image) for each category in the training step.
In other words, classifiers should be trained for all the classes that are going to be used in the prediction/testing phase. However, there is a group of studies that is not restricted to this definition. For example, in zero-shot learning [152] and transfer learning [29, 141] frameworks, labels that do not have any corresponding training instances can still be used in the prediction stage.

4.3 Maximum Margin Framework for Multi-label Ranking

Let $x_i$, $i = 1, \ldots, n$ be the collection of training instances, where each example $x_i \in \mathbb{R}^d$ is a vector of $d$ dimensions. Each training example $x_i$ is annotated by a set of class labels, denoted by a binary vector $\mathbf{y}^i = (y_1^i, \ldots, y_m^i) \in \{-1, 1\}^m$, where $m$ is the total number of classes, and $y_k^i = 1$ when $x_i$ is assigned to class $c_k$ and $-1$ otherwise. In multi-label ranking, we aim to learn $m$ classification functions $f_k(x): \mathbb{R}^d \rightarrow \mathbb{R}$, $k = 1, \ldots, m$, one for each class, such that for any example $x$, $f_k(x)$ is larger than $f_l(x)$ when $x$ belongs to class $c_k$ and does not belong to class $c_l$. We define the classification error $\varepsilon_i^{k,l}$ for an example $x_i$ with respect to any two classes $c_k$ and $c_l$ as follows:

$$\varepsilon_i^{k,l} = I(y_k^i \neq y_l^i)\, \ell\!\left(\frac{y_k^i - y_l^i}{2}\bigl(f_k(x_i) - f_l(x_i)\bigr)\right), \quad (4.1)$$

where $I(z)$ is an indicator function that outputs 1 when $z$ is true and zero otherwise. The loss $\ell(z)$ is defined to be the hinge loss, $\ell(z) = \max(0, 1 - z)$. Note that the above error function outputs 0 when $y_k^i = y_l^i$, namely when no classification error should be counted, i.e., when $x_i$ either belongs to both $c_k$ and $c_l$ or belongs to neither of the two classes. Following the maximum margin framework for classification, we search for classification functions $f_k(x)$, $k = 1, \ldots, m$ that simultaneously minimize the overall classification error. This is summarized in the following optimization problem:

$$\min_{\{f_k \in \mathcal{H}_\kappa\}_{k=1}^m} \; \frac{1}{2}\sum_{k=1}^m |f_k|^2_{\mathcal{H}_\kappa} + C \sum_{i=1}^n \sum_{k,l=1}^m \varepsilon_i^{k,l}, \quad (4.2)$$

where $\kappa(x, x'): \mathbb{R}^d \times \mathbb{R}^d \rightarrow \mathbb{R}$ is a kernel function, $\mathcal{H}_\kappa$ is the Hilbert space endowed with the kernel function $\kappa(\cdot,\cdot)$, and $C$ is a regularization parameter. Theorem 3 provides the representer theorem for $f_k(\cdot)$, $k = 1, \ldots, m$.

Theorem 3. The classification functions $f_k(x)$, $k = 1, \ldots, m$ that optimize Eq. (4.2) can be represented in the following form:

$$f_k(x) = \sum_{i=1}^n y_k^i\, [\Gamma_i]_k\, \kappa(x_i, x), \quad (4.3)$$

where $[\Gamma_i]_k = \sum_{l=1}^m \Gamma^i_{k,l}$. Here $\Gamma_i \in S^{m \times m}$, $i = 1, \ldots, n$ are symmetric matrices obtained by solving the following optimization problem:

$$\max \; \sum_{i=1}^n \sum_{k=1}^m [\Gamma_i]_k - \frac{1}{2}\sum_{k=1}^m \sum_{i,j=1}^n \kappa(x_i, x_j)\, y_k^i y_k^j\, [\Gamma_i]_k [\Gamma_j]_k$$
$$\text{s.t.} \;\; \Gamma^i_{k,l} \in \begin{cases} [0, C] & y_k^i \neq y_l^i \\ \{0\} & \text{otherwise} \end{cases}, \qquad \Gamma_i = \Gamma_i^\top, \quad i = 1, \ldots, n; \; k, l = 1, \ldots, m. \quad (4.4)$$

Proof. See the proof in Section A.4.1.

The constraints in Eq. (4.4) explicitly capture the relationship between the classes. When an instance $x_i$ belongs to class $c_k$ but does not belong to class $c_l$, the value of $\Gamma^i_{k,l}$ is positive, causing $x_i$ to be a support vector. The positive terms $\Gamma^i_{k,l}$ are combined into $[\Gamma_i]_k$, which is used in computing the ranking function for class $c_k$.

4.4 Approximate Formulation

A straightforward approach that directly solves Eq. (4.4) with a standard quadratic programming solver is computationally expensive when the number of classes $m$ is large, because the number of constraints is $O(m^2)$. We show that the relationship between multi-label ranking and the one-vs-all approach provides insight for deriving an approximate formulation of Eq. (4.4) that can be solved efficiently.
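Before turning to the approximation, the snippet below is a quick illustration of the pairwise error definition in Eq. (4.1) for a single toy example; the function name and the example values are ours and are only meant as a sketch of the definition.

```python
import numpy as np

def pairwise_rank_errors(f, y):
    """Errors of Eq. (4.1) for one example: f holds the scores f_k(x_i),
    y holds the labels y_k^i in {-1, +1}."""
    hinge = lambda z: max(0.0, 1.0 - z)
    m = len(y)
    eps = np.zeros((m, m))
    for k in range(m):
        for l in range(m):
            if y[k] != y[l]:  # an error is only counted across a relevant/irrelevant pair
                eps[k, l] = hinge(0.5 * (y[k] - y[l]) * (f[k] - f[l]))
    return eps

# Example: class 0 is relevant but is scored below the irrelevant class 2,
# so eps[0, 2] is positive; Eq. (4.2) sums such errors over all training examples.
print(pairwise_rank_errors(np.array([0.2, 1.5, 0.9]), np.array([1, 1, -1])))
```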
4.4.1 Relation to the One-vs-all Approach

Consider constructing $f_k(x)$ in Eq. (4.2) by the one-vs-all approach. The resulting representer theorem for $f_k(x)$ is

$$f_k(x) = \sum_{i=1}^n y_k^i \alpha_k^i\, \kappa(x_i, x), \quad k = 1, \ldots, m, \quad (4.5)$$

where the $\alpha_k^i$, $i = 1, \ldots, n$; $k = 1, \ldots, m$, are obtained by solving the following optimization problem:

$$\max \; \sum_{i=1}^n \sum_{k=1}^m \alpha_k^i - \frac{1}{2}\sum_{k=1}^m \sum_{i,j=1}^n \kappa(x_i, x_j)\, y_k^i y_k^j\, \alpha_k^i \alpha_k^j \quad \text{s.t.} \;\; \alpha_k^i \in [0, C], \;\; i = 1, \ldots, n; \; k = 1, \ldots, m. \quad (4.6)$$

Comparing this formulation to Eq. (4.4), we clearly see the mapping $[\Gamma_i]_k \leftrightarrow \alpha_k^i$. Hence, the first simplification is to relax Eq. (4.4) by treating each $[\Gamma_i]_k$ as an independent variable, which approximates Eq. (4.4) by the following optimization problem:

$$\max \; \sum_{i=1}^n \sum_{k=1}^m \alpha_k^i - \frac{1}{2}\sum_{k=1}^m \sum_{i,j=1}^n \kappa(x_i, x_j)\, y_k^i y_k^j\, \alpha_k^i \alpha_k^j \quad \text{s.t.} \;\; 0 \le \alpha_k^i \le C \sum_{l=1}^m I(y_k^i \neq y_l^i), \;\; i = 1, \ldots, n; \; k = 1, \ldots, m. \quad (4.7)$$

Note that the constraint $\alpha_k^i \le C \sum_{l=1}^m I(y_k^i \neq y_l^i)$ follows from

$$[\Gamma_i]_k = \sum_{l=1}^m I(y_k^i \neq y_l^i)\, \Gamma^i_{k,l} \le C \sum_{l=1}^m I(y_k^i \neq y_l^i).$$

While the problem in Eq. (4.7) can be decomposed into $m$ independent problems, similar to a one-vs-all SVM, this is not adequate for multi-label ranking, since the dependence between the functions $f_k(x)$, $k = 1, \ldots, m$ cannot be captured.

4.4.2 Proposed Approximation

In this section, we present a better approximation of Eq. (4.4) than the one presented in Eq. (4.7). Without loss of generality, consider a training example $x_i$ that is assigned to the first $a$ classes and is not assigned to the remaining $b = m - a$ classes. According to the definition of $\Gamma_i$ in Eq. (4.4), we can write $\Gamma$ as

$$\Gamma = \begin{pmatrix} 0 & Z \\ Z^\top & 0 \end{pmatrix}, \quad (4.8)$$

where $Z \in [0, C]^{a \times b}$. Using this notation, the variable $\tau_k = [\Gamma_i]_k$ is computed as

$$\tau_k = \begin{cases} \sum_{l=1}^b Z_{k,l} & 1 \le k \le a, \\ \sum_{l=1}^a Z_{l,k-a} & a + 1 \le k \le m, \end{cases}$$

where $Z_{k,l}$ is an element of $Z$ bounded between 0 and $C$. According to the above definition, for each instance, $\tau_k$ is the sum of either the $k$th row of $Z$ (when label $k$ is relevant to that instance) or the corresponding column of $Z$ (when it is not). Formulating $\tau_k$ through $Z$ brings several advantages. First, it enables us to derive explicit constraints on $\tau_k$ in the optimization. Second, all the $\tau_k$ variables depend on each other in the optimization, since their components are taken from the same closed domain defined by $Z$. This relationship is in fact a special case of the constraint given in Eq. (4.4). The constraint in Eq. (4.4) intuitively forces a balance between the irrelevant and relevant labels of an instance by requiring the sum of the upper bounds of the $[\Gamma_i]_k$ corresponding to relevant classes to be equal to that of the $[\Gamma_i]_k$ corresponding to irrelevant classes. Obtaining $\tau_k$ from $Z$ as formulated above introduces an additional constraint by forcing the sum of the weights corresponding to the relevant labels to be equal to the sum of the weights associated with the irrelevant labels. This constraint is useful in dealing with the imbalance between the number of relevant and irrelevant labels, as well as in capturing the dependencies between the classes for that instance. In order to convert $\tau_k$, $k = 1, \ldots, m$ into free variables, we need to derive explicit constraints on $\tau_k$ that ensure that each solution for $\tau_k$ results in a feasible solution for $Z$. Let us first consider a simple case in which we only require the elements of $Z$ to be non-negative. Theorem 4 provides the constraints on $\tau_k$.

Theorem 4. The following two domains $Q_1$ and $Q_2$ for the vector $\tau = (\tau_1, \ldots, \tau_m)$ are equivalent:
$$Q_1 = \left\{\tau \in \mathbb{R}^m : \exists\, Z \in \mathbb{R}_+^{a \times b} \;\text{s.t.}\; \tau_{1:a} = Z \mathbf{1}_b,\; \tau_{a+1:m} = Z^\top \mathbf{1}_a \right\} \quad (4.9)$$

$$Q_2 = \left\{\tau \in \mathbb{R}_+^m : \sum_{k=1}^a \tau_k = \sum_{k=a+1}^m \tau_k \right\} \quad (4.10)$$

Proof. See Section A.4.2 for the proof.

Theorem 4 states that the two domains $Q_1$ and $Q_2$ are equivalent for the vector $\tau$ and leads to the following corollary.

Corollary 5. Consider the following two domains $Q_1$ and $Q_2$ for the vector $\tau = (\tau_1, \ldots, \tau_m)$:

$$Q_1 = \left\{\tau \in \mathbb{R}^m : \exists\, Z \in [0, C]^{a \times b} \;\text{s.t.}\; \tau_{1:a} = Z \mathbf{1}_b,\; \tau_{a+1:m} = Z^\top \mathbf{1}_a \right\} \quad (4.11)$$

$$Q_2 = \left\{\tau \in [0, C]^m : \sum_{k=1}^a \tau_k = \sum_{k=a+1}^m \tau_k \right\} \quad (4.12)$$

Then $\tau \in Q_2 \Rightarrow \tau \in Q_1$.

The above corollary becomes the basis for our approximation. Instead of defining the matrix variables $\Gamma_i$, $i = 1, \ldots, n$ as in Eq. (4.4), we introduce the variable $\alpha_k^i$ for $[\Gamma_i]_k$. We furthermore restrict $\alpha^i = (\alpha_1^i, \ldots, \alpha_m^i)$ to lie in the domain $G = \{\tau \in [0, C]^m : \sum_{k=1}^a \tau_k = \sum_{k=a+1}^m \tau_k\}$ to ensure that a feasible $\Gamma_i$ can be recovered from a solution of the $\alpha_k^i$. The resulting approximate optimization is

$$\max \; \sum_{i=1}^n \sum_{k=1}^m \alpha_k^i - \frac{1}{2}\sum_{k=1}^m \sum_{i,j=1}^n \kappa(x_i, x_j)\, y_k^i y_k^j\, \alpha_k^i \alpha_k^j$$
$$\text{s.t.} \;\; \sum_{k=1}^m I(y_k^i = 1)\,\alpha_k^i = \sum_{k=1}^m I(y_k^i = -1)\,\alpha_k^i, \quad \alpha_k^i \in [0, C], \quad i = 1, \ldots, n; \; k = 1, \ldots, m. \quad (4.13)$$

Unlike Eq. (4.7), Eq. (4.13) cannot be solved as $m$ independent problems, since for each instance $x_i$ the $\alpha_k^i$ from all classes $c_k$, $k = 1, \ldots, m$ are involved in the constraint. According to these constraints, for each instance the sum of the weights corresponding to the relevant labels should be equal to the sum of the weights associated with the irrelevant labels. Theorem 4 shows that by adding this constraint to the problem, the relationships between the classes can be exploited without explicitly determining the matrix $Z$ and the matrices $\Gamma_i$. Another advantage of this formulation is that no assumptions on the form of these relationships (e.g., pairwise relationships between classes) are made.

4.5 Efficient Algorithm

We follow the work of Lin et al. [153] and solve Eq. (4.13) by coordinate descent. At each iteration, we choose one training example $(x_i, \mathbf{y}^i)$ and the related variables $\alpha^i = (\alpha_1^i, \ldots, \alpha_m^i)$, while fixing the remaining variables. The resulting optimization problem becomes

$$\max \; \sum_{k=1}^m \alpha_k^i - \frac{1}{2}\sum_{k=1}^m y_k^i f_k^{-i}(x_i)\, \alpha_k^i - \frac{\kappa(x_i, x_i)}{2}\sum_{k=1}^m (\alpha_k^i)^2 \quad \text{s.t.} \;\; \alpha^i \in [0, C]^m, \;\; (\mathbf{y}^i)^\top \alpha^i = 0, \quad (4.15)$$

where $f_k^{-i}(x_i)$ is the leave-one-out prediction that can be computed as $f_k^{-i}(x) = \sum_{j \neq i} y_k^j \alpha_k^j\, \kappa(x_j, x)$.

Theorem 6. The optimal solution to Eq. (4.15) is

$$\alpha_k^i = \pi_{[0,C]}\!\left(\frac{1 + \lambda y_k^i - \frac{1}{2} y_k^i f_k^{-i}(x_i)}{\kappa(x_i, x_i)}\right), \quad k = 1, \ldots, m, \quad (4.16)$$

where $\lambda$ is the solution to the following equation:

$$g(\lambda) = \sum_{k=1}^m h\!\left(\frac{y_k^i + \lambda - \frac{1}{2} f_k^{-i}(x_i)}{\kappa(x_i, x_i)},\; y_k^i C\right) = 0. \quad (4.17)$$

Here $h(x, y) = \pi_{[0,y]}(x)$ if $y > 0$ and $h(x, y) = \pi_{[y,0]}(x)$ if $y \le 0$, and the function $\pi_G(x)$ projects $x$ onto the region $G$.

Proof. See Section A.4.3 for the proof.

The function $g(\lambda)$ defined in Eq. (4.17) is a monotonically increasing function of $\lambda$, so Eq. (4.17) can be solved by a bisection search. The lower and upper bounds on $\lambda$ for the bisection search are given in the proposition below.

Proposition 3. The value of $\lambda$ that satisfies Eq. (4.17) is bounded by $\lambda_{\min}$ and $\lambda_{\max}$. Define $\kappa_{ii} = \kappa(x_i, x_i)$, $G = [0, C]$,

$$\eta_{k+}^{-i} = 1 + \frac{1}{2} f_k^{-i}(x_i), \qquad \eta_{k-}^{-i} = 1 - \frac{1}{2} f_k^{-i}(x_i),$$

$$\Delta = \sum_{k=1}^m \delta(y_k^i, 1)\, \pi_G\!\left(\frac{\eta_{k-}^{-i}}{\kappa_{ii}}\right) - \sum_{k=1}^m \delta(y_k^i, -1)\, \pi_G\!\left(\frac{\eta_{k+}^{-i}}{\kappa_{ii}}\right),$$

$$a_{\min} = -C\kappa_{ii} + \min_{y_k^i = -1} \eta_{k+}^{-i}, \qquad a_{\max} = C\kappa_{ii} - \min_{y_k^i = 1} \eta_{k-}^{-i}, \qquad b_{\min} = -\max_{y_k^i = 1} \eta_{k-}^{-i}, \qquad b_{\max} = \max_{y_k^i = -1} \eta_{k+}^{-i}.$$

If $\Delta < 0$, we have $\lambda_{\min} = 0$ and $\lambda_{\max} = \min(a_{\max}, b_{\max})$. If $\Delta > 0$, we have $\lambda_{\max} = 0$ and $\lambda_{\min} = \max(a_{\min}, b_{\min})$.

Proof. See Section A.4.4 for the proof.
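The following Python fragment is a minimal sketch of the per-instance update implied by Theorem 6: it solves Eq. (4.17) for λ by bisection and then applies Eq. (4.16). For simplicity it brackets the root numerically instead of using the closed-form bounds of Proposition 3; all names are ours, and the code assumes each instance has at least one relevant and one irrelevant label.

```python
import numpy as np

def h(x, b):
    # h(x, y) from Theorem 6: project x onto [0, y] if y > 0, onto [y, 0] otherwise
    return np.clip(x, 0.0, b) if b > 0 else np.clip(x, b, 0.0)

def solve_lambda(y, f_loo, kii, C, tol=1e-8):
    """Bisection for lambda in Eq. (4.17); y in {-1,+1}^m, f_loo = f^{-i}(x_i)."""
    g = lambda lam: sum(h((yk + lam - 0.5 * fk) / kii, yk * C)
                        for yk, fk in zip(y, f_loo))
    lo, hi = -1.0, 1.0             # g is monotone; expand until the root is bracketed
    while g(lo) > 0: lo *= 2.0
    while g(hi) < 0: hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi)

def update_alpha(y, f_loo, kii, C):
    lam = solve_lambda(y, f_loo, kii, C)
    return np.clip((1.0 + lam * y - 0.5 * y * f_loo) / kii, 0.0, C)  # Eq. (4.16)
```

In the coordinate descent procedure of Section 4.5, this update is applied to one training example at a time while the remaining dual variables are held fixed.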
Once $\lambda$ is computed by the bisection search between the bounds $\lambda_{\min}$ and $\lambda_{\max}$, it is straightforward to calculate the coefficients $\alpha_k^i$ and, finally, the ranking functions $f_k(x)$ for any new instance $x$.

4.6 Experimental Results

In this section, we empirically evaluate the proposed multi-label ranking algorithm by demonstrating its efficiency and effectiveness on the image categorization task.

4.6.1 Data Sets

In order to compare our proposed multi-label learning method to state-of-the-art methods, we use three benchmark data sets: VOC 2007, ESP Game, and MIR Flickr25000. For the VOC 2007 data set, we use the default partitioning suggested by the PASCAL VOC Challenge: 5,011 training images and 4,952 test images. We follow [101] and use dense-SIFT features. Note that the majority of the images in the VOC 2007 data set are labeled with a single class; in fact, the average number of labels per image is only 1.5. Because of this property, VOC 2007 is not an ideal data set for evaluating multi-label learning algorithms. Nevertheless, the performance on the VOC 2007 data set allows us to examine whether the proposed algorithm is effective for image categorization, since VOC 2007 is the most widely used image categorization benchmark.

The MIR Flickr25000 [154] data set is a subset of the MIR Flickr-1M data set. The original data set contains 25,000 images from 457 classes. However, to create a better test bed for multi-label learning, we remove the images that are assigned to fewer than three classes and take 75% of the instances to form the training set by random sampling. The bag-of-words model based on dense-SIFT features, provided by [101] and [155], is used for image representation.

We also use a subset of the ESP data set, in which the average number of labels per image is 8.3. To study the influence of the number of training samples and labels on multi-label learning performance, we vary both quantities. In total, we have 20 settings: four training sets with 10,000, 20,000, 30,000, and 40,000 images, and five different cases for the number of categories, {20, 50, 100, 200, 500}. After ranking the categories in terms of their frequency (the number of images annotated with them) in the data set, we pick the top 20, 50, 100, 200, and 500 categories to create these five test settings. The number of test images is 10,000. We use a dense-SIFT based BoW representation.

4.6.2 Baseline Methods

The following methods are evaluated:

• SVM: We use the LIBSVM [156] implementation of the one-vs-all SVM classifier, which is shown to outperform other multi-class SVM methods in [121].

• PLATT: We apply Platt's method to convert SVM scores to posterior probabilities [157]. This conversion makes it easier to compare the output scores of different SVM classifiers, leading to better performance for multi-label ranking in some cases.

• MLKNN: A nearest neighbor based multi-label classification method [125]. The number of nearest neighbors is chosen by cross-validation. MLKNN is a very popular baseline in the multi-label learning literature due to its simplicity.

• Multiple label shared space model with least square loss (MLLS): A direct multi-label learning method [139] that makes use of class correlations via a feature space transformation under the assumption of a shared subspace between the categories. MLLS is reported to outperform other state-of-the-art methods that explore class correlations [139].
• MLR-L1: Our proposed multi-label ranking method described in this chapter.

• MLR-GL: Our proposed group lasso based multi-label ranking method described in Chapter 5. The approximation parameter η is chosen by cross-validation.

For the kernel based methods, we use the RBF kernel with the χ2 distance in our experiments, which has been shown to outperform other kernels for image categorization. The regularization parameter $C$ is chosen with a grid search over $\{10^{-2}, 10^{-1}, \ldots, 10^{4}\}$. The bandwidth of the RBF kernel is set to the average pair-wise χ2 distance between the training image pairs.

4.6.3 Multi-label Ranking Performance

We first compare the ranking performance of the baseline methods in terms of the AUC-ROC and MAP scores, starting with the VOC 2007 data set. According to [94, 158], an SVM classifier with the RBF kernel and χ2 distance, which is one of the baselines (SVM) used in our study, yields performance comparable to the state-of-the-art methods in the PASCAL VOC evaluations. Table 4.1 shows that the proposed algorithm yields better performance than the one-vs-all SVM method in terms of both AUC-ROC and MAP, indicating that the proposed method is effective for image categorization.

Table 4.1: AUC-ROC and MAP results for the VOC 2007 data set
          AUC-ROC   MAP
SVM        90.7     65.6
PLATT      90.5     65.6
MLKNN      89.4     63.7
MLLS       90.7     66.0
MLR-L1     91.0     67.2
MLR-GL     91.0     67.2

As an illustration, Figure 4.2 shows examples of test images from the VOC 2007 data set and the categories predicted by the different methods. This figure supports the claim that the categories identified by the proposed ranking method are more relevant to the visual content of the images than those of the baseline methods, particularly for images that contain several objects.

Figure 4.2: For four images from the VOC 2007 data set, the original labels are given in addition to the outputs of the baseline methods.

It is important to note that the evaluation measure we use in this chapter is not the same as the one used in the VOC competition. In the VOC challenge, the performance is evaluated for each object class separately (category-based), based on the confidence values obtained by binary classifiers. This is not applicable in our case, as we propose a label ranking scheme: we rank the categories for each image in the descending order of their scores, and our AUC measure evaluates how successful the label ranking is. See Section A.1.5 for a detailed discussion of the evaluation measures used in this dissertation.

Tables 4.2 and 4.3 provide the AUC-ROC and MAP results for the baselines on the ESP Game data set when 10,000 images are used for training.

Table 4.2: AUC-ROC (%) for the ESP Game data set with 10,000 training images
# classes   SVM    PLATT   MLKNN   MLLS   MLR-L1   MLR-GL
20          79.3   79.1    78.6    79.7   81.7     80.2
50          79.5   79.4    78.8    79.4   81.5     80.5
100         79.8   79.5    79.5    79.4   82.1     81.8
200         80.2   79.9    81.3    79.8   82.9     83.8
500         80.4   80.0    83.5    80.2   83.9     85.4
From Tables 4.2 and 4.3, we reach the following conclusions:

• The proposed ranking methods, MLR-L1 and MLR-GL, consistently and significantly outperform the other baseline methods.

• Converting SVM scores to posterior probabilities does not improve the performance on this data set.

• The MLLS method performs better than the SVM, PLATT, and MLKNN baselines when the number of categories is small (20, 50, 100). However, this gap decreases as the number of categories increases, possibly because the assumption of a shared subspace that covers all categories is too restrictive when the number of categories is large.

• The relative performances of MLKNN and MLR-GL against the other baselines are better when evaluated by AUC-ROC than by MAP. This is because these two baselines do not focus on optimizing the performance at the top ranks (i.e., rank-1, rank-2, etc.).

• MLKNN, which is a very popular baseline in the multi-label learning literature due to its simplicity, is significantly outperformed by the other baselines.

• The methods that are specifically designed for multi-label learning, MLLS and the two ranking methods, outperform one-vs-all SVM in the majority of the settings.

Table 4.3: MAP (%) for the ESP Game data set with 10,000 training images
# classes   SVM    PLATT   MLKNN   MLLS   MLR-L1   MLR-GL
20          59.3   59.1    57.0    59.9   62.2     59.4
50          49.5   49.3    45.9    48.5   52.0     48.6
100         43.4   43.4    39.7    43.3   45.5     42.9
200         38.0   37.9    35.2    38.0   40.0     38.2
500         32.4   32.2    28.5    32.4   34.2     32.1

Figure 4.3 plots the change of the AUC-ROC score with respect to the number of training images (10,000, 20,000, 30,000, and 40,000) when the number of classes is 200. It should be noted that although increasing the number of samples boosts the performance of all the baselines, the relative performance of the baselines with respect to each other does not change. The only change in the relative ordering is seen for the MLLS method: compared to the one-vs-all SVM baselines and MLKNN, MLLS takes better advantage of the increased number of training instances and outperforms them in the settings where the number of training instances is greater than 10,000.

Figure 4.3: Change of the AUC-ROC score with respect to the number of training images.

We also examine the AUC-ROC and MAP results on the MIR Flickr25000 data set. We see from Table 4.4 that the proposed ranking methods and MLLS give better results than the one-vs-all SVM baselines, again showing the superiority of direct multi-label learning methods over problem transformation approaches like one-vs-all decomposition. Furthermore, the MLR-GL method, which is another multi-label ranking approach, significantly outperforms the other baselines in terms of the AUC-ROC score. However, similar to the ESP Game experiments, its relative performance drops in terms of MAP. The proposed MLR-L1 technique gives results comparable to MLLS, which seems to perform well on the MIR Flickr25000 data set, indicating that the shared subspace assumption is valid for this data set. This suggests that multi-label learning methods that make strong assumptions when exploiting label correlations have the potential to perform well when their assumptions hold, as shown by our MIR Flickr25000 experiments. However,
these methods might not perform well on data sets where the underlying assumption does not hold (e.g., ESP Game for the MLLS method).

Table 4.4: AUC-ROC and MAP results for the MIR Flickr25000 data set
          AUC-ROC   MAP
SVM        70.2     31.5
PLATT      68.7     31.4
MLKNN      68.6     28.9
MLLS       75.9     33.2
MLR-L1     75.4     33.3
MLR-GL     76.2     32.8

Finally, we show some example images from the ESP data set and the labels predicted by each baseline method in Table 4.5. The first row under the images gives the true image class labels. For each baseline, we provide the top six returned labels ranked from left to right, with the hits written in bold characters. For example, for the left-most image, SVM provides the following outputs, in descending relevance order: ad, computer, screen, book, woman, sign. Among these six labels, only ad and sign are correct, while the other four labels, which are irrelevant for the image, are ranked above the true labels logo and sign, causing them to become false negatives. On the other hand, the proposed ranking method MLR-L1 successfully ranks the labels ad, logo, and sign above all other labels.

Table 4.5: The label predictions by the baselines for four images from the ESP Game data set. The first row under the images gives the true image class labels. For each baseline, the top six returned labels are ranked from left to right, and the hits are written in bold characters.

4.6.4 Training Time

Figure 4.4 plots the training time of three baselines (MLR-L1, SVM, and MLLS; the computational efficiency of the MLR-GL method is analyzed in Chapter 5) for a fixed number of categories (100) with respect to the number of training samples for the ESP Game data set. In this experiment, we vary the number of training examples from 10,000 to 40,000. We observe that the MLLS method becomes computationally more efficient than SVM and MLR-L1 because of its subspace assumption, which allows learning in a lower dimensional space. The main bottleneck of the MLLS algorithm is the SVD operation on the data matrix. However, when the number of samples $n$ is large ($n \gg d$), the algorithm only computes $d$ singular values, where $d$ is the dimension of the feature vectors, and the corresponding $d$ singular vectors.

Figure 4.4: Training time of the three baselines for a fixed number of categories (100) with respect to the number of training samples for the ESP Game data set.

Figure 4.5: Training time of the three baselines for a fixed number of training samples (10,000) with respect to the number of categories for the ESP Game data set.
This is why the computational time of MLLS does not increase significantly as the number of samples increases. On the other hand, SVM and MLR-L1 have a quadratic dependency on the number of samples because they are both kernel-based methods. The difference between the speeds of these two methods increases in favor of MLR-L1 as the number of samples grows.

Figure 4.5 plots the training time of the three baselines (MLR-L1, SVM, and MLLS) for a fixed number of training samples (10,000) with respect to the number of categories for the ESP Game data set. This time, we vary the number of categories by using 20, 50, 100, 200, and 500 classes. Similar to the previous case, the MLLS method is the most efficient among the compared baselines. The training time of the MLR-L1 method has a linear dependency on the number of categories. Although we would expect SVM to show a very similar characteristic, it actually becomes more efficient than MLR-L1 as the number of categories increases. This is because the SVM optimization terminates early for the classes that cause the long-tail problem: the classes that have a very small number of positive samples.

To conclude, it is important to note that the proposed ranking method is comparable to, if not more efficient than, one-vs-all SVM in terms of training time. Considering that many researchers employ one-vs-all SVM in their image categorization studies, MLR-L1 emerges as a strong alternative. As our empirical analyses showed, shared subspace methods, or label set projection methods (e.g., compressed sensing based multi-label learning), which rely on a similar idea, are computationally more efficient for large scale data sets. However, the proposed method can be combined with such approaches to significantly reduce the training time.

4.7 Conclusions and Future Work

We have introduced an efficient multi-label ranking scheme that offers a direct solution to multi-label ranking, unlike the conventional methods that use a set of binary classifiers for multi-label learning. Our direct approach enables us to capture the relationships between the class labels without making any assumptions on how these relationships should be modeled. The strength of the proposed approach lies in establishing the relationships between the classifiers by treating them as ranking functions. An efficient algorithm is presented for solving the proposed multi-label ranking formulation. Our empirical study of image categorization with three benchmark data sets demonstrated that the proposed method outperforms state-of-the-art methods. Yet, there are several future directions that can be followed to improve the proposed method:

• Improving the computational efficiency: The computational efficiency of the proposed method can be improved by combining it with label set projection methods, such as compressed labeling [123], to obtain a sublinear dependency on the number of labels.

• Exploiting label correlations: If the data being used have a label structure that can be modeled explicitly, e.g., a hierarchical structure or known pair-wise correlations between classes, such structures can be integrated into the proposed multi-label learning framework.

• Robustness to incomplete label assignments: The label annotations for training images are often incomplete due to the high cost of the annotation process and the ambiguities between the class labels.
It is important to develop multi-label learning methods that are robust to incomplete label assignments. One possible solution is the method we present in Chapter 5.

• Multiple kernel learning for multi-label ranking: The proposed multi-label ranking method is limited to the use of a single kernel. In Chapter 3, we discussed how considering the labels together in a multiple kernel learning task can improve both the computational efficiency and the classification accuracy for multi-label data. The next step in this direction is to extend the proposed multi-label ranking method to multiple kernel learning. To address this issue, we extend the proposed MLR-L1 algorithm to the multiple kernel setting in Chapter 6.

Algorithm 2: Multi-label ranking algorithm
1: Input:
   • $x_1, \ldots, x_n$; $x_i \in \mathbb{R}^d$: training instances
   • $\mathbf{y}^1, \ldots, \mathbf{y}^n$; $\mathbf{y}^i \in \{-1, 1\}^m$: the assignments of the $m$ classes to the $n$ training instances
   • $K$: $n \times n$ kernel matrix
   • $T$: number of iterations
2: Initialization: $\alpha_k^i = 0$, $i = 1, \ldots, n$, $k = 1, \ldots, m$
3: for $t = 1, \ldots, T$ do
4:   for $i = 1, \ldots, n$ do
5:     Calculate the leave-one-out prediction: $f_k^{-i}(x_i) = \sum_{j=1}^n y_k^j \alpha_k^j K_{j,i} - y_k^i \alpha_k^i K_{i,i}$, $k = 1, \ldots, m$
6:     Compute $\eta_{k+}^{-i} = 1 + \frac{1}{2} f_k^{-i}(x_i)$ and $\eta_{k-}^{-i} = 1 - \frac{1}{2} f_k^{-i}(x_i)$
7:     Compute $\Delta = \sum_{k=1}^m \delta(y_k^i, 1)\, \pi_{[0,C]}\!\left(\eta_{k-}^{-i} / K_{i,i}\right) - \sum_{k=1}^m \delta(y_k^i, -1)\, \pi_{[0,C]}\!\left(\eta_{k+}^{-i} / K_{i,i}\right)$, where $\pi_G(x)$ projects $x$ onto the region $G$
8:     Calculate the bounds for the line search: $a_{\min} = -C K_{i,i} + \min_{y_k^i = -1} \eta_{k+}^{-i}$, $a_{\max} = C K_{i,i} - \min_{y_k^i = 1} \eta_{k-}^{-i}$, $b_{\min} = -\max_{y_k^i = 1} \eta_{k-}^{-i}$, $b_{\max} = \max_{y_k^i = -1} \eta_{k+}^{-i}$; if $\Delta < 0$, set $\lambda_{\min} = 0$ and $\lambda_{\max} = \min(a_{\max}, b_{\max})$; if $\Delta > 0$, set $\lambda_{\max} = 0$ and $\lambda_{\min} = \max(a_{\min}, b_{\min})$
9:     Solve for $\lambda$ by a line search between $\lambda_{\min}$ and $\lambda_{\max}$ on $g(\lambda) = \sum_{k=1}^m h\!\left(\frac{y_k^i + \lambda - \frac{1}{2} f_k^{-i}(x_i)}{K_{i,i}},\; y_k^i C\right) = 0$, where $h(x, y) = \pi_{[0,y]}(x)$ if $y > 0$ and $h(x, y) = \pi_{[y,0]}(x)$ if $y \le 0$
10:    Compute $\alpha_k^i = \pi_{[0,C]}\!\left(\frac{1 + \lambda y_k^i - \frac{1}{2} y_k^i f_k^{-i}(x_i)}{K_{i,i}}\right)$, $k = 1, \ldots, m$   (4.14)
11:  end for
12: end for

Chapter 5
Multi-label Ranking for Image Categorization with Incomplete Class Assignments

5.1 Introduction

In Chapter 4, we discussed the multi-label learning problem in detail and presented our multi-label ranking approach, MLR-L1. Our empirical analyses showed that the proposed MLR-L1 method outperforms state-of-the-art multi-label learning techniques. The strength of the MLR-L1 method is its label ranking formulation, which implicitly considers the pair-wise comparisons between the relevant and irrelevant labels of each training image. Simultaneously solving this formulation for all class labels enables the exploitation of label correlations, one of the main research directions in the multi-label learning literature. However, the performance of multi-label learning techniques, including the MLR-L1 method, depends on the quality of the training set and of the label supervision, and it is unclear whether even strong multi-label learning algorithms would work well in practice when this quality cannot be guaranteed. One of the main concerns about real-world systems is that the labeling process is very expensive and often inaccurate. In image categorization systems, the image annotations for the training data set are mostly provided by online users through services like Amazon Mechanical Turk. As a result, the retrieved annotations are often incomplete; only a subset of the true image labels is given by the annotators.
Figure 5.1: Some example images from the VOC 2007 (top row) and ESP Game (bottom row) data sets with their annotations. The labels written in italics are provided with the images, whereas the ones written in bold are the missing labels. These images, with their missing annotations, are examples of incompletely labeled data.

In this chapter, we consider the image categorization problem with incompletely labeled data. As an example, an image whose true class assignment is (c1, c2, c3) may be presented only with class c1 when it is used for training. Our goal is to learn a multi-label learning model from training examples that have incomplete class assignments. We refer to this problem as multi-label learning with incomplete class assignments, and to the training data as incompletely labeled data. Multi-label learning with incomplete class assignments is frequently encountered in automatic image annotation when the number of classes is very large, and it is only feasible for users to provide a limited number of class labels for a given instance, as seen in Figure 5.1. Incompletely labeled data also arise when there is a large ambiguity between class labels, making it difficult for annotators to decide on appropriate class assignments for given training instances. Figure 5.2 shows two examples of annotated images from the ESP Game data set. Some of the words used to describe these two images can cause ambiguity. For example, the keywords baby, kid, and boy can be used interchangeably; therefore, an annotator who picks any one of these labels would probably not include the other two in the annotation set. Also, note that these annotations are often generated by collapsing annotated words from multiple users. Therefore, it is very likely that some of the labels that cause the label ambiguity problem are missing from the final list of annotations. Both scenarios, missing labels and label ambiguity, are frequently encountered in the image categorization problem.

It is important to distinguish the learning scenario studied in this work from related ones in previous studies, such as partial labeling [159, 160] and weakly labeled data [161]. Table 5.1 lists some of the related concepts that can be confused with the multi-label learning with incomplete class assignments task and briefly highlights the differences. There is a rich body of literature on multi-label learning, ranging from simple approaches that divide multi-label learning into a set of binary classification problems [162] to more sophisticated approaches that explicitly explore the correlation among classes [10–13]. However, none of these approaches directly address the challenge of multi-label learning from incompletely labeled data, which is a more realistic scenario. To this end, we present a multi-label learning framework based on the idea of multi-label ranking [11, 15, 27, 137]. Unlike classification approaches that make a binary decision about the class assignment for a given instance, multi-label ranking methods rank the classes for a given instance such that the relevant classes are ranked before the irrelevant ones.
Figure 5.2: Example images from the ESP Game data set and their annotations. The annotations highlighted in bold font, which are used to annotate the same concept/object in the corresponding images, are examples of the label ambiguity problem.

Table 5.1: Some concepts that can be confused with the incomplete label assignment problem
problem                        bib          definition
partial labeling               [159, 160]   Only one of the positive class assignments is correct
weakly labeled data            [161]        A value indicating the correctness of predictions is given
weakly tagged images           [164]        Some of the class assignments are incorrect
partially labeled data         [165]        Another name for semi-supervised learning
bandit multi-class learning    [166, 167]   The learner receives partial feedback, e.g., click data

In order to handle the problem of incomplete class assignments, we extend multi-label ranking by exploiting the group lasso technique [163] to combine the errors in ranking the assigned and unassigned classes for each image. As will be seen in the following discussion, by using group lasso to combine the ranking errors, the proposed framework is able to automatically detect the missing class assignments in the training set and, consequently, improve the classification accuracy. We also present an efficient learning algorithm for the proposed framework. The efficiency of a multi-label ranking method is important, since a naive implementation would perform pairwise comparisons between all possible label pairs for each image, making it difficult to scale to a large number of classes and training instances. Our empirical studies on two benchmark data sets for image categorization indicate that (i) our framework is robust to the missing class assignment problem and performs better than state-of-the-art multi-label learning approaches in the case of incompletely labeled data, and (ii) the proposed approach is computationally efficient and scales well to large numbers of training examples and classes.

5.2 A Framework for Multi-label Learning from Incompletely Labeled Data

In order to handle incompletely labeled data, we explore a group lasso regularizer when estimating the error in ranking the assigned classes against the unassigned ones. The key idea is to selectively penalize the ranking errors. To facilitate our discussion, we follow the notation of Chapter 4 and consider an instance $x$ that is assigned to classes $c_1, \ldots, c_a$; consequently, classes $c_{a+1}, \ldots, c_m$ remain the unassigned classes for $x$. If example $x$ is fully labeled, following [15], the ranking error for the classification functions $f_k(x)$, $k \in [m]$ is expressed as

$$\sum_{k=1}^a \sum_{l=a+1}^m \max\bigl(0,\, f_l(x) - f_k(x) + 1\bigr). \quad (5.1)$$

However, when the data are only partially labeled, some of the unassigned class labels may indeed be true classes, and the above loss function may overestimate the classification error for $x$. To address this issue, we introduce a slack variable, denoted by $\varepsilon_{k,l}$, to account for the error of ranking an unassigned class $c_l$ before the assigned class $c_k$. This introduces the following constraint:

$$\varepsilon_{k,l} + f_k(x) \ge 1 + f_l(x). \quad (5.2)$$
Now, instead of adding all the errors for an instance $x$ together, i.e., $\sum_{k=1}^a \sum_{l=a+1}^m \varepsilon_{k,l}$, we combine the ranking errors $\varepsilon_{k,l}$ via a group lasso regularizer, i.e.,

$$\sum_{l=a+1}^m \sqrt{\sum_{k=1}^a \varepsilon_{k,l}^2}. \quad (5.3)$$

The motivation for using group lasso to aggregate the ranking errors is twofold. First, as stated in the general theory, group lasso is able to select a group of variables, which in our case means selecting the group of ranking errors $\{\varepsilon_{k,l},\, k = 1, \ldots, a\}$ for each unassigned class $c_l$. In particular, an unassigned class $c_l$ is likely to be a missing class assignment for an instance $x$ when many of its ranking errors $\{\varepsilon_{k,l}\}_{k=1}^a$ are non-zero, which coincides with the criterion of group selection by group lasso. Thus, by using the group lasso regularizer, we may be able to decide which unassigned classes are indeed missing correct class assignments. Second, group lasso usually results in a sparse solution in which most of the group variables are zero and only a small number of groups are assigned non-zero values. In our case, the sparse solution implies that most of the unassigned classes for $x$ are indeed correct, and only a few unassigned classes are true class assignments for $x$ that were missed during annotation.

Let $x_1, \ldots, x_n$ be the collection of training instances that are labeled by $Y_1, \ldots, Y_n$, where each $Y_i \subset \mathcal{Y}$. For convenience of presentation, we represent each class assignment $Y_i$ by a binary vector $\mathbf{y}^i = (y_1^i, \ldots, y_m^i) \in \{-1, +1\}^m$, where $y_k^i = +1$ if $k \in Y_i$ and $y_k^i = -1$ if $k \notin Y_i$. Using the group lasso regularizer described above, we have the following optimization problem:

$$\min_{f_k \in \mathcal{H}_\kappa} \; \frac{1}{2}\sum_{k=1}^m |f_k|^2_{\mathcal{H}_\kappa} + C \sum_{i=1}^n \sum_{l \notin Y_i} \sqrt{\sum_{k \in Y_i} \ell^2\bigl(f_k(x_i) - f_l(x_i)\bigr)}, \quad (5.4)$$

where $\ell(z) = \max(0, 1 - z)$ is the hinge loss function that assesses the error in ranking two classes $c_k$ and $c_l$. In the next section, we discuss a strategy for efficiently optimizing Eq. (5.4).

5.3 Optimization Algorithm

First, we have the following representer theorem for the $f(x)$ that optimizes Eq. (5.4).

Theorem 7. The optimal solution to Eq. (5.4) admits the following expression for $f(x)$:

$$f_k(x) = \sum_{i=1}^n y_k^i \alpha_k^i\, \kappa(x, x_i), \quad k = 1, \ldots, m,$$

where the $\alpha_k^i$, $i = 1, \ldots, n$ are the combination weights.

It is straightforward to verify the above representer theorem. Next, in order to solve Eq. (5.4) efficiently, we linearize the objective function in Eq. (5.4) using the following lemma.

Lemma 1. $\sum_{l=a+1}^m \sqrt{\sum_{k=1}^a \ell^2\bigl(f_k(x_i) - f_l(x_i)\bigr)}$ is equivalent to the following expression:

$$\max_{\gamma^i \in \mathbb{R}^{a \times (m-a)}} \; \sum_{l=a+1}^m \sum_{k=1}^a \gamma^i_{k,l}\, \ell\bigl(f_k(x_i) - f_l(x_i)\bigr) \quad \text{s.t.} \quad \max_{1 \le l \le m-a} |\gamma^i_{\cdot,l}|_2 \le 1, \quad (5.5)$$

where $\gamma_{\cdot,l}$ stands for the $l$th column vector of the matrix $\gamma^i$.

Lemma 1 follows directly from the fact that $\sum_{l=a+1}^m \sqrt{\sum_{k=1}^a \ell^2\bigl(f_k(x_i) - f_l(x_i)\bigr)}$ is an $L_{1,2}$ norm of the loss values $\ell\bigl(f_k(x) - f_l(x)\bigr)$ and the dual norm of $L_{1,2}$ is $L_{\infty,2}$. See Section A.5.1 for a detailed proof. Using Lemma 1, we turn Eq. (5.4) into a convex-concave optimization problem, as stated in the following theorem.

Theorem 8. The problem in Eq. (5.4) is equivalent to the following convex-concave optimization problem:

$$\max_{\{\gamma^i \in \Delta_i\}_{i=1}^n} \; \min_{\{f_k \in \mathcal{H}_\kappa\}_{k=1}^m} \; \mathcal{L} = \frac{1}{2}\sum_{k=1}^m |f_k|^2_{\mathcal{H}_\kappa} + C \sum_{i=1}^n \sum_{l \notin Y_i} \sum_{k \in Y_i} \gamma^i_{k,l}\, \ell\bigl(f_k(x_i) - f_l(x_i)\bigr), \quad (5.6)$$

where $\gamma^i = [\gamma^i_{k,l}]_{m \times m}$ and

$$\Delta_i = \left\{\gamma^i \in \mathbb{R}^{m \times m} : \; \gamma^i_{k,l} \ge 0,\; k, l = 1, \ldots, m; \;\; \gamma^i_{k,l} = 0 \;\text{if}\; l \in Y_i \;\text{or}\; k \notin Y_i; \;\; \max_{1 \le l \le m} |\gamma^i_{\cdot,l}|_2 \le 1 \right\}.$$

The above theorem follows by directly plugging the result of Lemma 1 into Eq. (5.4).
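Before interpreting Theorem 8, the fragment below is a toy numerical sketch contrasting the plain sum of ranking errors in Eq. (5.1) with the group-lasso aggregation of Eq. (5.3); the function and the example values are illustrative only.

```python
import numpy as np

def ranking_losses(f, y):
    """f: scores for one image, shape (m,); y: labels in {-1,+1}, shape (m,)."""
    rel, irr = f[y == 1], f[y == -1]
    # eps[k, l]: hinge error of ranking unassigned class l above assigned class k
    eps = np.maximum(0.0, 1.0 - (rel[:, None] - irr[None, :]))
    plain = eps.sum()                            # Eq. (5.1): every violation counted equally
    group = np.linalg.norm(eps, axis=0).sum()    # Eq. (5.3): one L2 norm per unassigned class
    return plain, group

# An unassigned class that outranks many assigned classes (a likely missing label)
# forms a single large group under Eq. (5.3) rather than many independent penalties.
print(ranking_losses(np.array([0.4, 0.3, 1.2, -0.8]), np.array([1, 1, -1, -1])))
```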
As indicated by the above theorem, the introduction of the group lasso is equivalent to introducing a different weight $\gamma^i_{k,l}$ for each comparison between an assigned class and an unassigned class. It is the introduction of these weights that allows us to determine which unassigned classes were missed in the annotation process.

Theorem 9. The optimal solution $f(x)$ to Eq. (5.6) can be expressed as follows:

$$f_k(x) = \sum_{i=1}^n y_k^i \alpha_k^i\, \kappa(x, x_i),$$

where $\alpha^i = (\alpha_1^i, \ldots, \alpha_m^i)^\top$, $i = 1, \ldots, n$ is the optimal solution to the following optimization problem:

$$\max_{\{\alpha^i \in \Omega_i\}_{i=1}^n} \; \sum_{k=1}^m \left( \sum_{i=1}^n \alpha_k^i - \frac{1}{2}\sum_{i,j=1}^n \alpha_k^i \alpha_k^j\, y_k^i y_k^j\, K_{i,j} \right), \quad (5.7)$$

where $\Omega_i = \left\{\alpha^i \in \mathbb{R}^m : \exists\, \gamma^i \in \Delta_i \;\text{s.t.}\; \alpha^i = C\gamma^i \mathbf{1} + C[\gamma^i]^\top \mathbf{1}\right\}$.

The proof of this theorem can be found in Section A.5.2. Note that although the objective function in Eq. (5.7) is similar to that of SVM, it is the constraints specified by the domain $\Omega_i$ that make this problem computationally more challenging.

Algorithm 3: Multi-label ranking algorithm with group lasso
1: Input:
   • $x_1, \ldots, x_n$; $x_i \in \mathbb{R}^d$: training instances
   • $\mathbf{y}^1, \ldots, \mathbf{y}^n$; $\mathbf{y}^i \in \{-1, 1\}^m$: the assignments of the $m$ classes to the $n$ training instances
   • $K$: $n \times n$ kernel matrix
   • $T$: number of iterations
2: Initialization: $\alpha_j^i = 0$, $i = 1, \ldots, n$, $j = 1, \ldots, m$
3: for $t = 1, \ldots, T$ do
4:   for $i = 1, \ldots, n$ do
5:     Calculate the leave-one-out prediction vector $f^{-i}$ with components $f_j^{-i} = y_j^i \sum_{p \neq i} y_j^p \alpha_j^p K_{p,i}$, $j = 1, \ldots, m$ (cf. Eq. (5.11))
6:     Set $a = \sum_{j=1}^m I(y_j^i = 1)$ and $b = \sum_{j=1}^m I(y_j^i = -1)$, where $I(z)$ is an indicator function that outputs 1 when $z$ is true and zero otherwise
7:     Split $f^{-i}$ into $f_a^{-i}$ and $f_b^{-i}$, the components of $f^{-i}$ corresponding to the positive labels ($y_j^i = 1$) and the negative labels ($y_j^i = -1$), respectively
8:     Compute the matrix $H \in \mathbb{R}^{a \times b}$: $H = \frac{1}{2}\bigl(\mathbf{1}_b \mathbf{1}_a^\top - f_b^{-i}\mathbf{1}_a^\top - \mathbf{1}_b [f_a^{-i}]^\top\bigr)^\top$
9:     Construct the matrix $\gamma \in \mathbb{R}^{a \times b}$: for $s = 1, \ldots, b$, set $\gamma_{:,s} = \frac{\pi_+(H_{:,s})}{|\pi_+(H_{:,s})|_2} \min\!\left(1, \frac{|\pi_+(H_{:,s})|_2}{\eta C K_{i,i}}\right)$, where $\pi_+(z)$ projects $z$ onto $\mathbb{R}_+^a$
10:    Compute $\alpha_a = C\gamma\mathbf{1}_b$ and $\alpha_b = C\gamma^\top\mathbf{1}_a$, and set $\alpha^i = (\alpha_a, \alpha_b)$
11:  end for
12: end for

In order to efficiently solve Eq. (5.7), we consider a block coordinate descent method. In particular, we optimize $\alpha^i$ while the other dual variables $\{\alpha^j, j \neq i\}$ are fixed. Without loss of generality, we assume that example $x_i$ is assigned to the first $a$ classes and is not assigned to the remaining $b = m - a$ classes. For convenience of presentation, we drop the index $i$ and write $\alpha^i$ as $\alpha$. We thus have the following optimization problem for $\alpha^i$:

$$\max_{\alpha \in \Omega} \; \sum_{k=1}^m \alpha_k - \frac{K_{i,i}}{2}\sum_{k=1}^m \alpha_k^2 - \sum_{k=1}^m y_k \alpha_k \sum_{j \neq i} \alpha_k^j y_k^j K_{i,j}, \quad (5.8)$$

where $\Omega$ is defined as

$$\Omega = \left\{\alpha \in \mathbb{R}^m : \exists\, \gamma \in \mathbb{R}_+^{a \times b},\; |\gamma_{\cdot,l}|_2 \le 1,\; l \in [b] \;\text{s.t.}\; \alpha_{1:a} = C\gamma\mathbf{1}_b,\; \alpha_{a+1:a+b} = C\gamma^\top\mathbf{1}_a \right\}.$$

In the above, we use the notation $\alpha_{i:j} = (\alpha_i, \ldots, \alpha_j)$ to represent the subvector of $\alpha$ whose indices range from $i$ to $j$, and $\mathbf{1}_a$ represents an $a$-dimensional vector with all elements equal to one. We now aim to simplify the problem in Eq. (5.8). First, for any $\alpha \in \Omega$ we have

$$\sum_{k=1}^m \alpha_k = 2C\,\bigl(\mathbf{1}_a^\top \gamma \mathbf{1}_b\bigr). \quad (5.9)$$

Second, we have

$$\sum_{k=1}^m \alpha_k^2 = \sum_{k=1}^a \alpha_k^2 + \sum_{k=a+1}^{a+b} \alpha_k^2 = C^2\bigl(\mathbf{1}_b^\top \gamma^\top \gamma \mathbf{1}_b + \mathbf{1}_a^\top \gamma \gamma^\top \mathbf{1}_a\bigr). \quad (5.10)$$

To simplify the last term in Eq. (5.8), we define

$$f_k^{-i}(x_i) = y_k \sum_{j \neq i} \alpha_k^j y_k^j\, \kappa(x_i, x_j), \quad (5.11)$$

and the vector $f^{-i} = \bigl(f_1^{-i}(x_i), \ldots, f_m^{-i}(x_i)\bigr) = (f_a^{-i}, f_b^{-i})$. Using these notations, the third term in Eq. (5.8) becomes

$$\sum_{k=1}^m \alpha_k f_k^{-i}(x_i) = \alpha^\top f^{-i} = C\,\mathrm{tr}\!\left(\bigl(\mathbf{1}_b [f_a^{-i}]^\top + f_b^{-i}\mathbf{1}_a^\top\bigr)\gamma\right). \quad (5.12)$$
Thus, we have the following optimization problem to solve:

$$\max_{\gamma \in \Delta} \; \mathbf{1}_a^\top \gamma \mathbf{1}_b - \frac{C K_{i,i}}{2}\bigl(\mathbf{1}_b^\top \gamma^\top \gamma \mathbf{1}_b + \mathbf{1}_a^\top \gamma \gamma^\top \mathbf{1}_a\bigr) - \mathrm{tr}\!\left(\bigl(f_b^{-i}\mathbf{1}_a^\top + \mathbf{1}_b [f_a^{-i}]^\top\bigr)\gamma\right), \quad (5.13)$$

where $\Delta = \{\gamma \in \mathbb{R}_+^{a \times b} : |\gamma_{\cdot,l}|_2 \le 1,\; l = 1, \ldots, b\}$. The problem in Eq. (5.13) is in fact a second order cone programming (SOCP) problem [168]. Although a SOCP problem can be solved by a standard tool like SeDuMi [88], it can still be computationally expensive to solve a large-scale SOCP problem. We therefore further simplify Eq. (5.13) using the following approximation:

$$\mathbf{1}_b^\top \gamma^\top \gamma \mathbf{1}_b + \mathbf{1}_a^\top \gamma \gamma^\top \mathbf{1}_a \approx \eta\,\mathrm{tr}(\gamma^\top\gamma + \gamma\gamma^\top) = 2\eta\,\mathrm{tr}(\gamma^\top\gamma), \quad (5.14)$$

where $\eta > 1$ is a parameter introduced for the approximation. Using the approximation in Eq. (5.14), we have

$$\max_{\gamma \in \Delta} \; \mathbf{1}_a^\top \gamma \mathbf{1}_b - C K_{i,i}\, \eta\, \mathrm{tr}(\gamma^\top \gamma) - \mathrm{tr}\!\left(\bigl(f_b^{-i}\mathbf{1}_a^\top + \mathbf{1}_b [f_a^{-i}]^\top\bigr)\gamma\right), \quad (5.15)$$

where we define

$$\bigl(\mathbf{1}_b \mathbf{1}_a^\top - f_b^{-i}\mathbf{1}_a^\top - \mathbf{1}_b [f_a^{-i}]^\top\bigr)^\top = 2H = (2h_1, \ldots, 2h_b). \quad (5.16)$$

Lemma 2 shows a closed-form solution to Eq. (5.15).

Lemma 2. The optimal solution to Eq. (5.15) is

$$\gamma_{\cdot,s} = \frac{\pi_G(h_s)}{|\pi_G(h_s)|_2}\, \min\!\left(1,\; \frac{|\pi_G(h_s)|_2}{C K_{i,i}\,\eta}\right), \quad s = 1, \ldots, b, \quad (5.17)$$

where $G = \{z : z \in \mathbb{R}_+^a\}$ and $\pi_G(h)$ projects the vector $h$ onto the domain $G$. The proof of this lemma can be found in Section A.5.3.

5.4 Experimental Results

5.4.1 Data Sets

In order to evaluate the proposed method for multi-label learning with incomplete class assignments, we use two multi-label data sets that were also used in Chapter 4: subsets of the ESP Game and MIR Flickr25000 data sets. For MIR Flickr25000, we remove the images that are assigned to fewer than three classes. This procedure gives us 10,199 images from 457 classes. We take 75% of the examples to form a training set by random sampling. The bag-of-words model based on dense-SIFT features, provided by [101] and [155], is used for image representation. We use a subset of the ESP data set, in which the average number of labels per image is 8.3. To study the influence of the number of training samples and labels on multi-label learning performance, we vary both quantities, following the protocol in Chapter 4. The number of test images is 10,000. We use a dense-SIFT based BoW representation to construct the image features.

To simulate the situation of incomplete class assignments, we conduct experiments in four different settings for the ESP Game and MIR Flickr25000 data sets. In the first setting, termed case-1, there is no missing class assignment for any training image. In the next three settings, termed case-2, case-3, and case-4, for each training image we randomly choose 20%, 40%, and 60% of the assigned class labels, respectively, and remove them from the training data. During the label removal process, we make sure that each image retains at least one positive class label.

Table 5.2: AUC-ROC (%) for the ESP Game data set with 10,000 training images and 200 classes
          SVM    PLATT   MLKNN   MLLS   MLR-L1   MLR-GL
case-1    80.2   80.1    81.3    79.8   82.3     83.8
case-2    79.2   79.5    72.5    78.9   82.2     83.4
case-3    77.5   77.9    72.3    77.3   81.1     82.8
case-4    75.2   75.9    72.1    75.0   79.4     82.1

Table 5.3: MAP (%) for the ESP Game data set with 10,000 training images and 200 classes
          SVM    PLATT   MLKNN   MLLS   MLR-L1   MLR-GL
case-1    38.0   37.9    35.2    38.0   40.0     38.2
case-2    36.2   36.5    26.4    37.0   38.0     37.5
case-3    34.0   34.5    25.8    35.5   37.1     36.8
case-4    31.0   31.8    25.6    33.1   35.2     35.4
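As a side note, the case-2/3/4 protocol described above can be simulated in a few lines of Python; the function below is our own sketch, not part of the experimental code, and `Y_train` in the usage comment is a hypothetical variable name.

```python
import numpy as np

def drop_labels(Y, frac, seed=0):
    """Remove `frac` of the positive labels of each training image, always keeping
    at least one positive label (sketch of the case-2/3/4 protocol).
    Y: n x m binary (0/1) class-assignment matrix."""
    rng = np.random.default_rng(seed)
    Y = Y.copy()
    for i in range(Y.shape[0]):
        pos = np.flatnonzero(Y[i])
        n_drop = min(int(round(frac * len(pos))), len(pos) - 1)  # keep >= 1 positive
        if n_drop > 0:
            Y[i, rng.choice(pos, size=n_drop, replace=False)] = 0
    return Y

# Example: case-3 removes 40% of the assigned labels of every training image.
# Y_case3 = drop_labels(Y_train, frac=0.40)
```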
5.4.2 Baseline Methods

We use the same baselines as in Chapter 4: SVM [121], PLATT [157], MLKNN [125], MLLS [139], MLR-L1, and MLR-GL, the proposed group lasso based multi-label ranking method that is described in this chapter and that specifically addresses the multi-label learning with incomplete class assignments problem. When calculating the kernel matrix, a modified chi-squared kernel with $d(x, x') = |x - x'|_2^2 / |x + x'|_2^2$ is used for the ESP Game and MIR Flickr25000 data sets, because it yields significantly better performance than the standard version. The bandwidth $\sigma$ of the chi-squared kernel is chosen as the mean of the pair-wise distances $d(x, x')$ [69]. The optimal values of the parameter $C$ and the approximation parameter $\eta$ are selected by cross-validation.

The parameter $\eta$ approximates $\bigl(\mathbf{1}_b^\top \gamma^\top \gamma \mathbf{1}_b + \mathbf{1}_a^\top \gamma \gamma^\top \mathbf{1}_a\bigr) \big/ \bigl(2\,\mathrm{tr}(\gamma^\top\gamma)\bigr)$ for the matrix $\gamma \in \mathbb{R}^{a \times b}$, where $a$ and $b$ are, respectively, the numbers of relevant and irrelevant labels of a training image. As the number of classes increases, we would expect both $a$ and $b$ to increase; consequently, larger values of $a$ and $b$ require a larger $\eta$ for a good approximation. This is confirmed by the cross-validation procedure used to choose $\eta$ in our experiments. For example, the selected $\eta$ value was 50 when the number of labels in the training data set was 50, whereas $\eta = 150$ gave the best performance among the values tried for the data subset with 500 image labels (for the experiments in Chapter 4). Therefore, we conclude that the optimal value of $\eta$ depends on factors such as the number of image labels and the nature of the data set (i.e., the average number of labels per image).

5.4.3 Multi-label Ranking Performance on Incompletely Labeled Data

Tables 5.2 and 5.3 show the results for the ESP Game data set in terms of AUC-ROC and MAP, respectively, for a training set with 10,000 images.

Table 5.4: The label predictions by the baselines for four images from the ESP Game data set when 40% of the training labels are missing. The first row under the images gives the true image class labels. For each baseline, the top nine returned labels are ranked from left to right and top to bottom, and the hits are written in bold characters.

Table 5.5: AUC-ROC results for the MIR Flickr data set
          SVM    PLATT   MLKNN   MLLS   MLR-L1   MLR-GL
case-1    70.2   70.0    68.7    75.9   75.4     76.2
case-2    69.1   68.8    67.6    74.6   72.7     75.7
case-3    67.6   67.3    66.1    72.7   71.7     75.0
case-4    65.7   65.0    64.3    71.5   69.1     74.1
We note that the classification results are consistent across experiments with different training set sizes, and we only report the results for the 10,000-image setting for brevity. From the tables, we first observe that the baseline PLATT, which converts SVM output scores into probabilistic scores, improves the performance of SVM in the missing-label settings. This is consistent with [169], where the conversion procedure makes the outputs from different SVM classifiers more comparable and consequently may lead to better performance for multi-label ranking. On the other hand, both SVM and PLATT are outperformed by the direct multi-label learning methods, namely MLR-GL, MLR-L1, and MLLS; this stresses the importance of developing multi-label ranking methods for multi-label learning.

Second, we observe a significant decrease in classification accuracy for all the methods when moving from case-1 to case-4, showing that missing class assignments can significantly degrade classification performance. On the other hand, compared to the other baseline methods, the proposed method (MLR-GL) is more resilient to missing class labels: it experiences a drop of less than 2% in the AUC-ROC metric when 60% of the assigned class labels are removed (case-4), whereas the other methods experience drops of 3% to 5%. Similarly, the performance drop from case-1 to case-4 is less than 3% for MLR-GL in terms of MAP score, whereas it is more than 5% for the other baselines. These results indicate the robustness of the proposed method in handling missing class assignments.

In Table 5.4, we provide results for sample test images from the ESP Game data set for the case-3 experiments, where 40% of the assigned class labels are missing from the training images. We give the label predictions by the baselines for four images, and the first row under the images gives the true image class labels. For each baseline, we provide the top nine returned labels, ranked from left to right and top to bottom; the correct matches are written in bold characters. In addition to the clear superiority of the proposed method's predictions over the other baselines, there is another point that needs to be emphasized. The analysis of the left-most image, whose labels are silver, circle, and round, shows how using label correlations helps to address the label ambiguity problem. We see that the three direct multi-label learning methods, MLR-GL, MLR-L1, and MLLS, successfully retrieve the label round in addition to circle, whereas the SVM baselines cannot. This is because certain label pairs, such as circle-round, girl-woman, and logo-ad, are mostly retrieved together by the direct multi-label learning methods, which makes these methods more robust to the label ambiguity problem.

We also report the results on the MIR Flickr25000 data set in terms of AUC-ROC score in Table 5.5. Similar to the ESP Game data set, we observe (i) a significant drop in AUC-ROC score for all the methods when some class assignments are missing from the training examples, and (ii) MLR-GL experiences the least degradation in AUC-ROC score, together with the MLLS method, compared to the other baseline methods. We also notice that, unlike the ESP Game data set, the baseline SVM slightly outperforms the baseline PLATT for the MIR Flickr25000 data set, showing that the probabilistic score conversion does not improve the SVM outputs for this data set.
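For reference, the MAP numbers reported above can be computed with the following minimal sketch. It assumes MAP is the per-class average precision of the ranked test images, averaged over classes; the exact averaging protocol is an assumption, and the helper names are ours.

```python
import numpy as np

def mean_average_precision(scores, Y):
    # scores[i, k]: prediction for test image i and class k; Y[i, k] in {0, 1}.
    aps = []
    for k in range(Y.shape[1]):
        order = np.argsort(-scores[:, k])            # rank images for class k
        rel = Y[order, k]
        if rel.sum() == 0:
            continue
        hits = np.cumsum(rel)
        prec_at_rank = hits / np.arange(1, len(rel) + 1)
        aps.append(float((prec_at_rank * rel).sum() / rel.sum()))
    return float(np.mean(aps))                       # average over classes
```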
To better understand why the proposed MLR-GL is more robust, we examine the outputs for the training samples after the learning step. Table 5.6 shows how the different methods perform in recovering the missing true labels of training examples, where only the underlined true labels are provided to the learning algorithms. We observe that MLR-GL is able to find more missing labels than the other baselines. Unlike the other baselines, when ranking the label scores for the training images, MLR-GL does not always put the assigned labels at the top of the ranking. Instead, it ranks some categories that are initially labeled as irrelevant higher than the relevant ones, meaning that MLR-GL does not overfit to the observed assignments. This is why the proposed method outperforms the baselines in this task. Table 5.7 shows examples of annotations generated for test images for case-4, where 60% of the positive labels are removed from the training data set. These examples confirm that the proposed method gives better annotation results than the baseline methods.

Based on the above results, we conclude that the proposed method for multi-label learning (i) is effective for image categorization, and (ii) is more effective in handling incompletely labeled data than the state-of-the-art methods for multi-label learning.

5.4.4 Training Time

In Chapter 4, we observed that the MLLS baseline is computationally more efficient than one-vs-all SVM and the MLR-L1 multi-label ranking method when the number of samples, n, is greater than the number of feature dimensions, d. Therefore, when comparing the proposed MLR-GL method to SVM and MLR-L1 in terms of training time, we exclude the MLLS baseline from the evaluations. Moreover, we also do not include the MLKNN algorithm, which is significantly faster than the other baselines because it only requires simple and fast operations, such as calculating label prior probabilities. However, MLKNN's efficiency comes at the price of lower classification performance.

Figure 5.3: The change in the baseline training times (seconds) with respect to the number of training images from the ESP Game data set.

Figures 5.3 and 5.4 plot the change in the baseline training times with respect to the number of training images and labels, respectively. We use the ESP Game data set, and the three baselines we compare are MLR-GL, MLR-L1, and one-vs-all SVM; all three methods are implemented in C. In this experiment, we vary the number of training examples from 1,000 to 40,000 and the number of labels from 10 to 500. Overall, we observe that the methods in comparison have similar running times. The computational complexity of MLR-L1 and MLR-GL per iteration is O(mn^2), where n is the number of training examples and m is the number of classes. Note that the time spent on kernel matrix construction is not included in this study because it is shared by all three methods in comparison. However, when the RAM capacity is not large enough to store the whole kernel matrix, using a pre-computed kernel matrix would not be possible. This would have a larger negative impact on one-vs-all SVM, since its computational complexity would become O(dmn^2): the kernel function computations would need to be performed separately for each class.
On the other hand, the computational complexity of the proposed multi-label ranking methods would be O(dn^2 + mn^2), since the classifiers for all labels are learned together by using a single kernel.

Figure 5.4: The change in the training time (seconds) for the proposed multi-label ranking algorithms and one-vs-all SVM with respect to the number of image labels (m).

5.5 Conclusions and Future Work

In this chapter, we have presented our multi-label ranking approach, which addresses the incomplete class assignment problem. By using the group lasso technique [163] to combine the errors in ranking the assigned and unassigned classes, our method is able to use the relationships between the class labels to detect the missing class assignments, making it more robust to incompletely labeled data. Our empirical study of image categorization with two benchmark data sets demonstrated that the proposed method outperforms state-of-the-art methods, particularly when the number of missing label assignments in the training set increases. We can list our contributions as follows:

• We have proposed a multi-label ranking approach that offers a direct solution to multi-label learning, unlike the conventional methods that use a set of binary classifiers. Our experiments have shown that the proposed method outperforms the multi-label learning techniques from the literature.

• The proposed method is robust to the incomplete class assignment problem. The performance difference between the proposed method and the multi-label learning baselines increases in favor of the proposed approach as the number of missing class labels in the training set increases.

• We have proposed an efficient algorithm that uses a closed-form solution. Its computational complexity is linear with respect to the number of class labels, and its computational load is comparable to that of one-vs-all SVM, which is one of the most efficient multi-label learning algorithms. The proposed algorithm can efficiently handle the majority of the available image categorization data sets with tens of thousands of images and hundreds of classes.

The proposed algorithm efficiently and effectively tackles the incomplete class assignment problem. However, there are three main issues that need to be addressed to improve this work further. The first is extending the proposed framework to multiple kernel learning. Like the multi-label ranking approach presented in Chapter 4, the multi-label ranking method we describe in this chapter is limited to using a single kernel function; extending it to the multiple kernel learning setting can bring a significant improvement in classification performance. The second issue is the computational complexity. The current algorithm can handle tens of thousands of samples and hundreds of classes. However, since the computational complexity is linear in the number of class labels and quadratic in the number of training instances, training the proposed algorithm on recent large-scale image categorization data sets (millions of images and thousands of class labels) would not be practical. One way to improve the training efficiency of the proposed multi-label ranking algorithm would be to incorporate label space projection methods such as compressed labeling [123].
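As a generic illustration of the label space projection idea (not the specific compressed-labeling algorithm of [123]), the label matrix can be compressed with a random projection so that far fewer ranking functions need to be trained; the function names below are hypothetical.

```python
import numpy as np

def compress_labels(Y, d, seed=0):
    # Compress the n x m label matrix to n x d codes with a random matrix,
    # so that d regressors are trained instead of m ranking functions.
    rng = np.random.default_rng(seed)
    m = Y.shape[1]
    P = rng.standard_normal((m, d)) / np.sqrt(d)   # random projection matrix
    return Y @ P, P

def decode_label_scores(code, P):
    # Map a predicted d-dimensional code back to m per-label scores by
    # correlating it with the projection directions.
    return P @ code
```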
Finally, the proposed method can be extended to the scenario where not only some of the “true” class assignments are missing, but some of the class labels are incorrectly assigned to the training instances. This is a more challenging problem in which we need to address the uncertainty arising from missing class assignment as well as from noisy class assignments. This scenario often encountered in the problem of image tagging/annotation [155]. 135 Table 5.6: Examples of training images from the ESP Game data set with true labels and annotations generated by different multi-label learning methods. Only the underlined true labels are provided to the methods for training. For each method, the correct (returned) keywords are highlighted by bold font whereas the incorrect ones are highlighted by italic font. Images Labels MLR-GL LIBSVM+Platt LIBSVM MLR-L1 brown girl grass green hair picture smile tree man black green people white red woman tree blue sky girl hair picture grass brown water light yellow old hat face smile house shirt eye girl green blue black face hair woman people white glasses man group tree grass sky light pink chinese eye red plant dress hand flower forest green girl space drink sky point face woman shop metal family pot machine light truck forest star guy sit glasses white night hair black usa green girl black tree people light hair man white metal dark band leaf star glasses sky space woman red night truck face street pot group 136 blue, building car city cloud sky street white window white man sky blue green red black woman water window tree people grass hair picture house yellow brown girl cloud building mountain smile face car window city black hair man white water yellow smile chinese line tree sky lake mountain pink blue computer wood green table woman boy house hat city window metal truck car ball lake lake building room fly line wing roof water website mountain road helmet white tent chinese chair pink silver small window city black sky water metal mountain pink wing car building hair boy computer lake truck insect person roof room man tree silver road ocean Table 5.7: Examples of test images from the ESP Game data set with annotations generated by different multi-label learning methods. The correct keywords are highlighted by bold font whereas the incorrect ones are highlighted by italic font. 
Images Labels MLR-GL LIBSVM LIBSVM+Platt MLR-L1 tree water black picture drawing sea art blue boat green city man white black woman people blue green red tree girl sky water hair picture old brown grass yellow face mountain book smile gray sun flag computer brick man yellow street machine sea leaf road ocean couple forest fly purple toy book man smile white blue sky black woman red green people tree water computer girl face old hair yellow leaf tree green hair movie white black people grass statue leaf orange old bike red flower mountain picture dance eye dirt 137 man woman people hair girl picture smile group photo kid family man woman black white people blue green red girl tree hair sky water picture old brown face yellow grass smile man hair black movie face food fire boy smile lady metal statue dance couple red table toy arm bike gold movie food man hair white smile woman blue face black people green red girl fire tree sky boy table eye hair tree black movie green man eye woman white hand face girl people smile dance red hat orange statue brown Chapter 6 Multiple Kernel Multi-label Ranking 6.1 Introduction In this chapter, we present a multiple kernel multi-label ranking (MK-MLR) algorithm for image categorization. The algorithm we propose combines different image representations to make the best multi-label prediction for a query image, by learning to rank relevant labels over irrelevant labels. To achieve this goal and build our algorithm, we combine several conclusions drawn in the previous chapters: • The experimental results in Chapter 2 showed that, given a sufficient number of training samples, learning a sparse combination of base kernels (MKL-L1 ) is advantageous for image categorization. Not only does it often improve accuracy when compared to the average kernel or MKL-L2 frameworks, but the sparse solutions also lead to a computationally efficient prediction step. Using a smaller number of base kernels as a result of sparsity brings a significant time gain in terms of feature extraction cost; one of the main bottlenecks of the prediction step. • Among the MKL-L1 baselines we evaluated in Chapter 2, MKL-SILP (Semi-Infinite Linear Programming) [71] is the most computationally efficient method. MKL-SILP is a wrapper approach, meaning that learning the kernel weights and classification functions can be separated in each iteration. Because of this, the inner SVM-solver can be replaced by other learning algorithms without modifying the linear programming solution that is used for updating the kernel weights. 138 • In Chapter 4, we formulated image categorization as a multi-label ranking problem. Our experimental results showed that learning classification functions for all the classes in a single framework (i.e., direct multi-label approaches) gives better prediction results compared to decomposing the problem into individual binary classification tasks, i.e., one-vs-all SVM. However, the algorithm we presented in Chapter 4 (MLR-L1 ) is designed for using a single kernel. In this chapter, our goal is not only finding the optimal multi-label ranking solution, but also the best linear kernel combination that would maximize multi-label prediction performance. • The experimental results provided in Chapter 3 showed that, for image classification, there is not a significant performance difference between using one shared kernel combination for all classes and learning a different kernel combination for each class. 
Therefore, in order to improve the computational efficiency of training and prediction steps, we propose to learn a single kernel combination that would benefit all the classes in a multiple kernel multi-label ranking framework. Based on these stated conclusions, we extend the MLR-L1 method by integrating it into a wrapper SILP MKL framework. The goal of developing a multiple kernel multi-label ranking method is to address the two essential factors for improving the performance of image categorization: (i) heterogeneous information fusion, and (ii) exploiting label correlation of multi-label data. The main difference between the algorithm proposed in Chapter 3, ML-MKL-SA, and the MK-MLR algorithm we present in this chapter is that the former aims to improve the training efficiency of MKL for one-vs-all framework. On the other hand, the goal of the MK-MLR algorithm is to improve the image categorization performance by exploiting label dependencies in multi-label data and optimizing the use of different image representations. This Chapter is organized as follows: in Section 6.2, we provide a literature review on MKL methods that are proposed for multi-label learning. Next, in Section 6.3, we introduce our multiple kernel multilabel ranking formulation and provide a computationally efficient algorithm, which is based on semi-infinite linear programming (SILP), to solve it. In Section 6.4, we provide empirical analyses that demonstrate the strength of the proposed framework on benchmark data sets. We end the chapter with the concluding remarks and future directions in Section 6.5. 139 6.2 Previous Work MKL is a very useful tool for the image categorization problem, since an image can be represented in many ways depending on the methods used for key-point detection, descriptor/feature extraction, and key-point quantization; each image representation has different strengths and weaknesses. MKL offers a systematic solution to image feature selection and combination for the image representation and learning problems. However, a vast majority of MKL studies in the literature address the binary classification task. Therefore, the use of MKL for image categorization is mostly limited to one-vs-all framework, which gives suboptimal performance (see Chapter 4). A detailed survey of binary MKL methods is presented in Chapter 2. We presented a multi-label multiple kernel method (ML-MKL-SA) in Chapter 3. Unlike the one-vs-all scheme, the proposed ML-MKL-SA method does not decompose the multi-label problem into individual binary problems. By learning a common kernel for all classes, ML-MKL-SA takes advantage of multi-label data by sharing information between the classes. However, the classification functions for each class are still trained independently, meaning that label correlations are not used when the classifiers are trained. One of the main conclusions of Chapter 4 is that direct methods for multi-label learning, which optimize classification functions together, are superior to decomposition based methods such as one-vs-all and onevs-one. However, there is a limited number of works that extend a direct multi-label learning method to multiple kernel setting in the literature. Kernel multiple linear regression (KMLR) and canonical correlation analysis (CCA) are two techniques that are employed in multi-label learning literature to compute a mapping between data samples and data labels [170]. Yakhnenko et al. 
extended the kernel regression model and canonical correlation analysis methods to the multiple kernel setting [171]. The authors proposed a reduced gradient method to solve for the optimal linear kernel combination for multi-label learning with KMLR and CCA. Ji et al. [68] proposed a multi-label multiple kernel learning method that can be considered as a generalization to KCCA. The goal of the method they proposed is to embed the data into a low-dimensional space by using a hypergraph, which encodes instance-label correlations. In addition to proposing a SILP solver, they also approximated the problem in order to use Nesterov’s method [85]. Zhang et al. used concept networks to model inter-label dependencies and similarity diversities [172]. Inter-label dependencies exploit the similarity between images that share a common label. For a pair of 140 images that share some common labels but also contain different labels from each other, similarity diversity is used to measure the dissimilarity between these two images. The authors proposed to learn an optimal kernel not only for each label, but also for each label pair in order to utilize the concept networks. Our method, MK-MLR is the first attempt of extending multi-label ranking to multiple kernel setting. One of the main advantages of MK-MLR compared to other multi-label MKL methods is that MK-MLR exploits label correlations without making explicit assumptions on the data. Moreover, learning one shared kernel combination for all classes is advantageous for classes with small number of positive samples. Since MKL-L1 methods require a sufficient number of training samples to perform well, sharing a kernel combination, which also means sharing information among different classes, benefits classes with a small number of samples. Finally, by imposing sparsity on the kernel combination vector, the proposed method improves the computational efficiency of training and prediction. 6.3 Multiple Kernel Multi-Label Ranking (MK-MLR) In this chapter, we use the same notation as in Chapter 3. We introduce β = (β1 , . . . , βs ), a probability distribution, for combining base kernels. We use the domain B1 for the probability distribution β, i.e., B1 = {β ∈ Rs+ : β ⊤ 1 = 1}. Our goal is to learn from the training examples the optimal kernel combination β for all m classes while simultaneously optimizing the corresponding ranking functions. 6.3.1 A Minimax Framework for Multiple kernel Multi-label Ranking In multiple kernel multi-label ranking, we aim to learn m classification functions fk (x; β) : Rd1 ×d2 ×...ds → R, k = 1, . . . , m, one for each class, such that for any example x, fk (x; β) is larger than fl (x; β) when x belongs to class ck and does not belong to class cl . Note that fk (x; β) is computed by using the kernel function κ(·, ·; β) = K s=1 βs κs (·, ·). i We define the classification error εk,l i for an example x with respect to any two classes ck and cl , as follows εik,l = I(yik = yil )ℓ yki − yli fk (xi ; β) − fl (xi ; β) 2 141 , (6.1) where I(z) is an indicator function that outputs 1 when z is true and zero, otherwise. The loss ℓ(z) is defined to be the hinge loss, where ℓ(z) = max(0, 1 − z). Following the framework in Chapter 4 and the multiple kernel learning problem, we aim to search for the classification functions fk (x; β), k = 1, . . . , m that simultaneously minimize the overall classification error. This is summarized into the following optimization problem. 
min min β∈B1 {fk ∈Hκ (β)}m k=1 1 2 m k=1 n |fk |2Hκ + C m εik,l , (6.2) i=1 k,l=1 where κ(x, x′ ) : Rd × R → R is a kernel function, Hκ (β) is a Hilbert space endowed with a kernel function κ(·, ·; β) = K s=1 βs κs (·, ·). and C is a regularization parameter. The domain B1 is defined in Eq. (6.3). B1 =    s β ∈ Rs+ : β 1 = j=1 By using the following definition for ∆ik,l , ∆ik,l =   |βj | ≤ 1 .  yki − yli fk − fl , κ(xi , ·) 2 Hκ . We can rewrite the objective function in Eq. (6.2) as follows 1 h(f ; β) = 2 m n fl , fl l=1 HK (β) m I(yli = yki )ℓ ∆ik,l . +C i=1 l,k=1 We then rewrite ℓ(z) as ℓ(z) = max (x − xz). x∈[0,1] Using the above expression for ℓ(z), the second term in h(f ; β) can be rewritten as, n m I(yli = yki ) max i=1 l,k=1 i ∈[0,C] γk,l 142 i i γk,l − γk,l ∆ik,l . (6.3) Then, the problem in Eq. (6.2) can be rewritten as follows, max min max i ∈[0,C] β∈B1 fl ∈H(β)m γl,k g(f, γ, β), where n m i I(yli = yki )γl,k + g(f, γ, β) = i=1 l,k=1 m n − 1 2 m fl , fl H(β)K l=1 i I(yil = yki )γl,k ∆ik,l . i=1 l,k=1 Next, we switch the order of minimization over f and maximization over γ. By taking the minimization over fl first, we have n m yli fl (x; β) = i=1 i I(yli = yki )γl,k κ(xi , x; β). k=1 In the above derivation, we use the relation I(yli = yki )(yli − yki ) = 2yli . To simplify our notation, we i if y i = y i and zero otherwise. Note that since γ i = γ i , we introduce Γi ∈ [0, C]m×m where Γil,k = γl,k l k l,k k,l have Γi = [Γi ]⊤ . We furthermore introduce the notation [Γi ]l as the sum of the elements in the lth row, i.e., [Γi ]l = m i k=1 Γl,k . Using these notations, we have fl (x; β) expressed as n yli [Γi ]l κ(xi , x; β). fl (x) = i=1 Finally, the remaining maximization problem becomes n m m n 1 min max [Γ ]k − κ(xi , x; β)yki ykj [Γi ]k [Γj ]k β∈B1 Γ 2 i=1 k=1 k=1 i,j=1    0 ≤ Γik,l ≤ C yki = yli i s. t. Γk,l =   0 otherwise i Γi = [Γi ]⊤ , i = 1, . . . , n; k, l = 1, . . . , m. Note that Eq. (6.4) is a generalized version of Eq. (4.4) and also might be expensive to solve, as the number 143 of constraints is O(m2 ), where m is the number of classes. Therefore, we propose a similar approximation. 6.3.2 Proposed Approximation Without a loss of generality, consider a training example xi that is assigned to the first a classes, and is not assigned to the remaining b = m − a classes. According to the definition of Γi in (6.4), we can rewrite Γ as   0 Z   Γ= , Z⊤ 0 (6.4) where Z ∈ [0, C]a×b . Using this notation, variable τk = [Γi ]k is computed as τk =      b l=1 Zk,l 1≤k≤a a l=1 Zl,k a + 1 ≤ k ≤ m, where Zk,l is an element in Z that is bounded by 0 and C. According to the above definition, for each instance, τk is the sum of either the kth column or the kth row of Z depending on whether the label k is relevant to that instance or not. As discussed in Chapter 4, formulating τk by using Z enables us to exploit label relationships during the optimization process. Using Theorem 4 and Corollary 5 from Chapter 4, we introduce the variable αik for [Γi ]k . We furthermore restrict αi = (αi1 , . . . , αik ) to be in the domain G = τ ∈ [0, C]m : a k=1 τk = m k=a+1 τk to ensure that feasible Γi can be recovered from a solution of αik . Then, using the vector notation, we can rewrite the new optimization problem for multiple kernel multi-label ranking (MK-MLR) as in Eq. (6.5). m minβ∈B1 maxα∈Q1 L(α, β) = 1 1⊤ αk − (αk ◦ yk )⊤ K(β)(αk ◦ yk ) , 2 k=1 m m I(yki s. t. 
k=1 αik ∈ where κ(x, x′ ; β) = s ′ j=1 βj κj (x, x ) = 1)αik = k=1 [0, C], and B1 = I(yki = −1)αik , i = 1, . . . , n, k = 1, . . . , m, β ∈ Rs+ : β 144 1 = s j=1 |βj | (6.5) ≤ 1 . It is important to note that the only difference between Eq. (6.5) and the optimization problem of ML-MKL-Sum (Eq. (3.2) in Chapter 3) is the domain defined for α. 6.3.3 Optimization via Semi-infinite Linear Programming One of the conclusions in Chapter 2 was that MKL-SILP (Semi-Infinite Linear Programming) [71] is the most efficient method among the MKL-L1 baselines. Therefore we will use SILP to optimize Eq. (6.5). Let’s define Ss (α) = − m k=1 1⊤ αk − 12 (αk ◦ yk )⊤ Ks (αk ◦ yk ) . Then, we can rewrite Eq. (6.5) as the following min-max problem, K maxβ∈B1 minα∈Q1 βs Ss (α), (6.6) s=1 For the optimal solution α∗ , θ ∗ = S(α∗ , β) would be minimal, meaning that S(α, β) ≥ θ for any α ∈ Q1 . Therefore, as proposed in [71], we need to solve the following SILP problem in order to find a saddle-point of Eq. (6.6). min θ∈R,β∈B1 θ (6.7) s s. t. j=1 m 1 βj {−α⊤ 1 + (αk ◦ yk )⊤ Kj (αk ◦ yk )} ≥ θ, 2 m I(yki = 1)αik = k=1 αik ∈ [0, C], k=1 I(yki = −1)αik , i = 1, . . . , n, k = 1, . . . , m. MKL-SILP is a wrapper method, meaning that learning the kernel weights and classification functions can be separated in each iteration of the optimization process. In this chapter, we use the MKL-SILP method with two modifications. Note that, unlike the binary MKL-SILP or ML-MKL-Sum formulations, we cannot use an off-the-shelf SVM solver to maximize Eq. (6.5) with respect to α because of the domain definition. Instead, we need to replace the SVM solver with the MLR-L1 method that we proposed in Chapter 4. In 145 addition, compared to binary MKL-SILP, the number of constraints in each step increases since each class generates its own constraints. In order to optimize Eq. (6.7), we use the column generation method that is used in [71] and [116] to solve the MKL-SILP problem: In an alternating optimization process, the optimal (β, θ) are calculated for a restricted set of constraints. Then, for fixed a β, new constraints that are determined by αk , k = 1, . . . , m are generated. This step corresponds to solving for the optimal α for fixed a β. Therefore, Eq. (6.7) can be solved by simply replacing the SVM solver within the off-the-shelf MKL-SIP solvers (Shogun, ML-MKLSum) with the MLR-L1 algorithm, which is presented in Chapter 4. 6.4 Experimental Results In this section, we empirically evaluate the proposed multiple kernel multi-label ranking algorithm by comparing it to other MKL baselines for the image categorization task. 6.4.1 Data Sets In order to compare our proposed multi-label learning method to state-of-the-art MKL methods, we use two benchmark multi-label data sets that we have discussed in Section A.1.6. The MIR Flickr25000 data set [154] is a subset of the MIR Flickr-1M data set that contains 25,000 images and 457 image tags. We followed [101] and created 15 sets of low level-features: (i) GIST features [102]; (ii) six sets of color features generated by two different spatial pooling layouts [103] (1 × 1 and 3 × 1) and three types of color histograms (i.e., RGB, LAB, and HSV). (iii) eight sets of local features generated by two key-point detection methods (i.e., dense sampling and Harris-Laplacian [104]), two spatial layouts (1 × 1 and 3 × 1), and two local descriptors (SIFT and robust hue descriptor [105]). A RBF kernel function with χ2 distance was applied to each of the 15 feature sets. 
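For concreteness, a minimal numpy sketch of such a χ²-distance RBF kernel on histogram features is given below. The exact exponential form and the bandwidth heuristic are assumptions, since the normalization used in the experiments is not spelled out here.

```python
import numpy as np

def chi2_rbf_kernel(X, gamma=None):
    # X: n x d matrix of non-negative histogram features (e.g., bag-of-words).
    # chi2(x, x') = sum_d (x_d - x'_d)^2 / (x_d + x'_d),  K = exp(-gamma * chi2)
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        num = (X - X[i]) ** 2
        den = np.maximum(X + X[i], 1e-12)          # avoid division by zero
        D[i] = (num / den).sum(axis=1)
    if gamma is None:
        # heuristic: inverse of the mean pairwise chi-squared distance
        gamma = 1.0 / D[np.triu_indices(n, k=1)].mean()
    return np.exp(-gamma * D)
```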
In addition to these 15 low-level features, we extracted 177 different kinds of object banks [173], which encode semantic and spatial information regarding an image. Each object bank is a 256-dimensional vector, which is a collection of response-maps of pre-trained generic object detectors. 146 In order to test how different baselines perform with respect to the numbers of training images, we created training subsets with different sizes (%2, %5, %25, and %50 of the whole data set). Also, after ranking the categories (image tags) in terms of their frequency (number of images annotated with them), we picked the top 200 categories for multi-label learning evaluation. The number of test samples is 12, 500. ESP Game data set The second data set we use in this chapter is a subset of the ESP Game data set. We computed nine base kernels by using low level features. The first kernel is based on dense-SIFT descriptors and a Bag-of-Words model with 1, 000 visual words. In addition to dense sampling, we also used the Harris-Laplacian (HarLap) [104] method for key-point detection. For HarLap based Bag-of-Words model, we created two visual dictionaries with sizes 250 and 1, 000 and used two types of spatial pyramid kernels (i.e., 1 × 1 and 2 × 2 spatial partitioning), leading to 4 different base kernels. We also created color histograms, each with 4, 096 bins, by using three different color spaces, namely RGB, LAB and HSV. Finally, we constructed a base kernel by using GIST features [102]. In addition to these low level features, we extracted 177 different kinds of object banks for ESP Game data set. In total, we have 186 base kernels for the ESP Game data set. To study the influence of the number of training samples and labels on multi-label learning performance, we varied the number of training samples and number of labels for the ESP Game data set as well. We created four subsets of the training data (with {1, 000, 2, 500, 5, 000, 10, 000} images). Also, after ranking the categories in terms of their frequency (number of images annotated with them) in the data set, we picked the top {20, 50, 100, 200, 500} categories to create five different test settings in terms of the number of classes. The number of test images is set to 5,000. 6.4.2 Baseline Methods Following the experiments in Chapter 3, we compare the proposed MK-MLR with four MKL methods, two single kernel baselines, and two average kernel baselines (AVG-SVM and AVG-MLR). The single kernel baselines are the single kernel one-vs-all SVM scheme (SK-SVM) and the single-kernel multi-label ranking method (SK-MLR) that we presented in Chapter 4 (as MLR-L1 ). We ran these two methods for each base kernel separately and reported the results for the kernel with the highest score. 147 The MKL baselines can be categorized into two groups. The first group is the one-vs-all MKL framework, which requires solving one MKL problem separately for each class. For this group, we use two base MKL solvers that are shown to be the most efficient wrapper MKL methods in Chapter 2 : (i) SILP (semiinfinite linear programming) solver for MKL-L1 [71], and (ii) SIP (semi-infinite programming) solver for MKL-L2 . The second group of methods requires learning a single kernel combination simultaneously for all classes. The two baseline methods that fall into this group are: (i) ML-MKL-Sum, which learns a kernel combination shared by all classes using the optimization method in [116], and (ii) ML-MKL-SA method: A stochastic sampling based algorithm we presented in Chapter 3. 
Note that all the baselines except MK-MLR, AVG-MLR, and SK-MLR are based on the one-vs-all framework.

6.4.3 Implementation

The experiments were run on a cluster where each node has two four-core Intel Xeon E5620s at 2.4 GHz with 24 GB of RAM. Since the number of kernels is not small (192 for MIR Flickr25000 and 186 for ESP Game), we did not store and use pre-computed kernel matrices; instead, all MKL baselines computed the kernels on the fly. All the baseline methods were coded in MATLAB. For the SVM-based MKL wrapper methods, we used LIBSVM [107] as the off-the-shelf SVM solver. MOSEK [89] was used for solving the related optimization problem for MKL-SIP, as suggested in [52]. For kernel-based methods, we used the RBF kernel in our experiments. The regularization parameter C is chosen with a grid search over {10^-4, 10^-1, ..., 10^3}. The bandwidth of the RBF kernel is set to the average pair-wise Euclidean distance between the training image pairs.

6.4.4 Evaluation Measures

To evaluate the effectiveness of different algorithms for multiple kernel multi-label learning, we first vary the number of selected categories and report the area under the ROC curve (AUC) over the selected classes. This procedure is termed category-based evaluation (see the appendix, Section A.1.5, for details), in which we rank the test images for each class and perform the evaluation on each label independently, before their average is taken over all classes. We also use image-based evaluation, particularly for comparing multi-label ranking performance; image-based MLR-AUC measures how accurate the ranking of the outcomes is. In addition, we evaluate the training efficiency of the algorithms by the level of sparsity and the training and prediction times (in seconds).

Table 6.1: The change of category-based AUC score (%) with respect to the number of selected classes for a subset of the ESP Game data set with 2,500 training images.

                 number of classes
                 50      100     200     500
SK-SVM           70.40   70.00   69.85   69.01
SK-MLR           71.84   71.32   70.55   70.04
AVG              75.86   75.61   75.43   73.66
MKL-L1           77.07   76.12   75.60   73.10
MKL-L2           76.43   76.05   75.78   73.19
ML-MKL-Sum       76.86   76.22   76.05   73.62
ML-MKL-SA        77.26   76.53   76.33   73.89
AVG-MLR          76.06   76.02   76.11   73.66
MK-MLR           78.39   77.69   77.58   74.87

6.4.5 Multi-label Learning Performance

We list the category-based and image-based AUC results for the ESP Game data set in Tables 6.1 and 6.2, respectively. The results in these two tables are obtained by varying the number of classes for the setting in which 2,500 images are used for training. For instance, in the setting where the number of classes is 200, we calculate the AUC score for the top 200 classes (column 3) after ranking them based on the number of positively labeled images they have. We draw the following conclusions from Table 6.1:

• Multiple kernel algorithms consistently outperform single kernel algorithms.

• Learning a sparse combination of base kernels via MKL-L1 gives better results compared to the average kernel and MKL-L2 methods.

• Learning one shared kernel combination for all classes does not cause a significant performance drop.

• Although the proposed multi-label ranking method is not designed to optimize category-based evaluation measures, it still gives comparable results to MKL-L1 and outperforms the remaining baselines.

• The proposed MK-MLR method clearly outperforms the SK-MLR and AVG-MLR baselines, demonstrating the effectiveness of multiple kernel learning for multi-label ranking.
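The category-based protocol used for these scores can be summarized by the following minimal sketch; the variable names are ours and not from the original implementation.

```python
import numpy as np

def category_based_auc(scores, Y):
    # Rank the test images for each class, compute the AUC of that ranking
    # independently for each class, and average the per-class AUC values.
    # scores[i, k]: prediction for image i and class k; Y[i, k] in {0, 1}.
    aucs = []
    for k in range(Y.shape[1]):
        pos = scores[Y[:, k] == 1, k]
        neg = scores[Y[:, k] == 0, k]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # AUC = probability that a positive image outranks a negative one
        wins = (pos[:, None] > neg[None, :]).sum() \
             + 0.5 * (pos[:, None] == neg[None, :]).sum()
        aucs.append(wins / (len(pos) * len(neg)))
    return float(np.mean(aucs))
```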
Table 6.2: The change of image-based AUC score (%) with respect to the number of selected classes for a subset of the ESP Game data set with 2,500 training images.

                 number of classes
                 50      100     200     500
SK-SVM           75.95   76.32   76.68   76.14
SK-MLR           77.73   78.97   80.41   79.90
AVG              80.81   81.44   82.06   81.67
MKL-L1           81.67   81.85   82.06   81.50
MKL-L2           79.09   80.14   81.24   80.82
ML-MKL-Sum       81.51   81.84   82.22   81.78
ML-MKL-SA        81.67   81.99   82.40   81.93
AVG-MLR          81.86   82.97   84.10   83.47
MK-MLR           83.28   84.04   84.93   84.68

Table 6.3: The change of category-based AUC score (%) with respect to the number of selected classes for a subset of the MIR Flickr data set with 6,250 training images.

                 number of classes
                 50      100     200     500
SK-SVM           65.14   64.83   63.75   62.16
SK-MLR           65.67   65.36   64.52   63.20
AVG              70.31   68.45   66.93   64.88
MKL-L1           70.98   69.03   66.96   64.98
MKL-L2           70.83   68.86   67.24   65.31
ML-MKL-Sum       71.00   69.53   67.93   65.97
ML-MKL-SA        71.28   69.83   68.21   66.05
AVG-MLR          72.10   70.16   68.35   66.30
MK-MLR           72.28   70.34   68.25   66.44

Table 6.4: The change of image-based AUC score (%) with respect to the number of selected classes for a subset of the MIR Flickr data set with 6,250 training images.

                 number of classes
                 50      100     200     500
SK-SVM           63.82   62.94   62.28   62.01
SK-MLR           64.67   63.96   63.35   62.88
AVG              72.89   71.99   71.10   70.69
MKL-L1           73.57   72.70   71.53   71.08
MKL-L2           73.13   72.35   71.64   70.71
ML-MKL-Sum       73.37   72.58   71.62   70.60
ML-MKL-SA        73.60   72.91   71.95   70.88
AVG-MLR          75.26   74.23   73.71   72.75
MK-MLR           75.26   74.40   73.70   72.91

The results in Table 6.1 are calculated by performing category-based evaluation. A better way to evaluate multi-label ranking performance is image-based evaluation: ranking all labels given a test image. By increasing the number of retrieved labels per image, we can obtain a sequence of true positive and false positive rates and calculate AUC values. Since the proposed MK-MLR method optimizes a ranking loss, it outperforms the other baselines, as expected (see Table 6.2). Also note that, compared to the other baselines, the relative performance of all the multi-label ranking methods (MK-MLR, SK-MLR, and AVG-MLR) increases, showing that multi-label ranking methods benefit from a larger number of labels. Another conclusion we draw from Table 6.2 is that multiple kernel methods outperform their single kernel counterparts.

Although the proposed method outperforms the other baselines in terms of the AUC score, it might not be clear how much impact this difference in the AUC score would make in a retrieval system. In order to get a better understanding of the classification accuracies (recall), we plot the classification accuracies of the different baselines vs. the number of retrieved labels (rank) in Figure 6.1. To generate this plot, we increase the number of retrieved labels per image from 5 to 30 (the maximum number of labels per image is 30 in the subset we are using).

Figure 6.1: The plot of recall vs. number of retrieved labels per image. The number of training images is 2,500.

We see from Figure 6.1 that the MLR methods, both AVG-MLR and MK-MLR, yield superior performance compared to the other baselines. In fact, the accuracy of MK-MLR is 2-3% better than that of AVG-MLR and at least 4-5% better than the remaining baselines.
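A minimal sketch of the recall-at-rank computation behind Figure 6.1 is given below; it assumes a score matrix over test images and labels, and the helper name is ours.

```python
import numpy as np

def recall_at_k(scores, Y, k):
    # For each test image, keep the k highest-ranked labels and measure the
    # fraction of the image's true labels that are recovered, then average
    # over images. scores[i, k]: prediction for image i and class k.
    recalls = []
    for i in range(Y.shape[0]):
        true = np.flatnonzero(Y[i] == 1)
        if len(true) == 0:
            continue
        topk = np.argsort(-scores[i])[:k]
        recalls.append(len(np.intersect1d(true, topk)) / len(true))
    return float(np.mean(recalls))
```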
In order to see how the image based AUC score changes with respect to number of samples, in Tables 6.5 and 6.6 , we report AUC and MAP scores for the top 200 classes in four settings, with different subsets of the training data with {1, 000, 2, 500, 5, 000, 10, 000} images. The following conclusions can be made from Tables 6.5 and 6.6 : • MK-MLR method is not outperformed by any other baseline in any setting. In fact, the proposed 152 Table 6.5: The change of category based AUC score (%) with respect to the number of training samples for a subset of the ESP Game data set. The AUC score is calculated using the top 200 classes. SK-SVM SK-MLR AVG-SVM MKL-L1 MKL-L2 ML-MKL-Sum ML-MKL-SA AVG-MLR MK-MLR number of training samples 1,000 2,500 5,000 10,000 67.45 69.85 70.13 70.01 68.14 70.71 71.02 70.85 72.09 75.43 77.71 79.11 72.27 75.60 77.57 79.36 72.40 75.78 77.92 80.23 72.69 76.05 78.03 80.56 72.85 76.33 78.87 81.02 72.62 76.11 78.27 80.90 74.12 77.58 79.48 81.61 Table 6.6: The change of image based AUC score (%) with respect to the number of training samples for a subset of the ESP Game data set. The AUC score is calculated using the top 200 classes. SK-SVM SK-MLR AVG-SVM MKL-L1 MKL-L2 ML-MKL-Sum ML-MKL-SA AVG-MLR MK-MLR number of training samples 1,000 2,500 5,000 10,000 75.97 76.68 78.31 78.69 79.95 80.41 82.49 83.27 80.41 82.05 84.21 84.82 80.52 82.06 84.01 84.99 80.78 81.23 84.36 85.01 80.79 82.21 83.07 83.86 80.93 82.79 83.82 84.80 82.25 84.10 85.35 86.05 83.08 84.93 85.87 86.45 153 Table 6.7: The change of category based AUC score (%) with respect to the number of training samples for a subset of the MIR Flickr data set. The AUC score is calculated using the top 200 classes. SK-SVM SK-MLR AVG-SVM MKL-L1 MKL-L2 ML-MKL-Sum ML-MKL-SA AVG-MLR MK-MLR number of training samples 500 1,250 6,200 12,500 59.72 62.23 63.75 64.51 60.31 62.41 64.19 64.97 60.46 64.02 66.93 67.85 61.14 64.93 66.96 68.34 60.76 64.32 67.24 68.59 62.20 65.71 67.93 69.23 62.21 65.78 68.21 69.61 61.11 65.02 68.35 69.97 63.06 66.89 68.85 70.33 MK-MLR algorithm significantly outperforms the competing algorithms in the majority of the experimental settings. • Using multiple kernels improves the performance. • All baselines experience an increase in their performance when the number of training instances increases. We provide the category based and image based AUC scores for the MIR Flickr25000 data set in Tables 6.7 and 6.8. We vary the number of training samples to see how the increase in the training data set size affects the performance. One thing to observe from these two tables is that the performance of the baselines is overall worse compared to the ESP Game data set experiments, particularly when the number of training images is small. Because of this reason, the performance gap between the baselines is not as high as it is for the ESP Game experiments. Further, we can make the following statements based on the results in Tables 6.7 and 6.8. • MKL methods that learn a single kernel combination for all classes (ML-MKL-Sum and ML-MKLSA) give slightly better results than training MKL for each class separately (MKL-L1 and MKL-L2 ) for the MIR Flickr25000 data set. 154 Table 6.8: The change of image based AUC score (%) with respect to the number of training samples for a subset of the MIR Flickr data set. The AUC score is calculated using the top 200 classes. 
                 number of training samples
                 500     1,250   6,250   12,500
SK-SVM           63.89   64.99   65.57   67.81
SK-MLR           64.76   67.83   68.21   69.12
AVG-SVM          65.06   68.26   71.10   71.80
MKL-L1           66.11   69.29   71.53   71.99
MKL-L2           65.40   68.59   70.94   71.86
ML-MKL-Sum       67.13   70.12   71.62   72.13
ML-MKL-SA        67.16   70.18   71.95   72.54
AVG-MLR          66.40   68.84   73.71   75.26
MK-MLR           68.12   70.93   73.70   75.91

• The performance difference between MK-MLR and AVG-MLR decreases as the number of training samples increases. As we have previously discussed in Chapter 2, this is because the quality of all the base kernels increases with an increased number of training samples, and the advantage that a sparse combination would bring, i.e., eliminating weak kernels, vanishes.

• MLR algorithms always perform better than their OvA counterparts, i.e., SK-MLR performs better than SK-SVM, and AVG-MLR outperforms AVG-SVM.

6.4.6 Training Efficiency

In this section, we compare the computational efficiency of the MK-MLR algorithm to the other MKL baselines in terms of training times. We group the MKL algorithms into two categories: (i) MKL methods that learn an individual kernel combination for each class, and (ii) MKL methods that learn a shared kernel combination for all classes. We report the training times for each method under various experimental settings with different numbers of training samples and classes.

Figures 6.2 and 6.3 compare the training times of the MKL baselines for a fixed training set size of 5,000 images under four settings with increasing numbers of classes: {50, 100, 200, 500}. It is clear from Figure 6.2 that the proposed method is significantly faster than the MKL methods that require learning a separate kernel combination for each class (MKL-L1 and MKL-L2).

Figure 6.2: Comparing MK-MLR to ML-MKL methods that learn an optimal kernel combination separately for each class in terms of training time. We use 5,000 training images and create four different settings by changing the number of classes: {50, 100, 200, 500}.

Figure 6.3: Comparing MK-MLR to ML-MKL methods that learn one optimal kernel combination for all classes in terms of training time. We use 5,000 training images and create four different settings by changing the number of classes: {50, 100, 200, 500}.

The main advantage of the proposed method is that it avoids repeatedly performing expensive kernel construction and combination operations for each class. The computational complexity of kernel construction is O(dn^2), where d is the dimension of the feature vectors and n is the number of samples. When the number of classes and base kernels is large (on the order of hundreds), MK-MLR has a significant advantage over these methods.

We also see from Figure 6.3 that MK-MLR is slower than the two MKL baselines that learn one shared kernel combination for all classes. In Chapter 3, we proved that the computational complexity of ML-MKL-SA is sublinear, O(m^{1/3} √(ln m)), in the number of classes, m. Therefore, it is not surprising to see that ML-MKL-SA is the fastest method. Moreover, we can expect the gap between the training times to increase as the number of classes increases. The reason for the performance gap between MK-MLR and ML-MKL-Sum, which use the same SILP solver for the kernel weights, is the difference in the implementation of the dual variable optimizers.
Recall from Chapter 4 that our multi-label ranking method and kernel SVM show very close performance in terms of computational complexity and yield almost equivalent training times when implemented in the same environment. On the other hand, since we use a MATLAB implementation for the MLR algorithm, MK-MLR algorithm gives higher training times compared to ML-MKL-Sum, which uses a very efficient SVM solver that is coded with C. However, note that the performance difference does not increase as the number of training sample increases, since these two methods have the same complexity. Figures 6.4 and 6.5, which compare the training times of the baselines over different data set sizes, {1, 000, 2, 500, 5, 000}, confirm the conclusions we drew from Figures 6.2 and 6.3. MKL-L1 and MKL-L2 methods are significantly slower, since they require expensive kernel computation and combination operations for each class. In addition, both ML-MKL-SA and ML-MKL-Sum methods are faster than MK-MLR. However, ML-MKL-SA does not have a computational advantage as it did when the comparison was made in terms of the change in the number of classes. All the baselines have similar dependency to the number of samples. Therefore, we see a similar growth in training times for them. 158 6 7 x 10 MK−MLR MKL−L Training time 6 2 MKL−L1 5 4 3 2 1 0 1,000 2500 5,000 Number of training images Figure 6.4: Comparing MK-MLR to ML-MKL methods that learn one optimal kernel combination separately for each class in terms of training time. We use images from 200 classes and create three settings by changing the data set size {1, 000, 2, 500, 5, 000} 159 4 18 16 x 10 MK−MLR ML−MKL−Sum ML−MKL−SA Training time 14 12 10 8 6 4 2 0 1,000 2500 5,000 Number of training images Figure 6.5: Comparing MK-MLR to ML-MKL methods that learn one optimal kernel combination for all classes in terms of training time. We use images from 200 classes and create three settings by changing the data set size {1, 000, 2, 500, 5, 000} 160 6.4.7 Prediction Efficiency Prediction speed is in general more crucial than training speed in real word systems. Given a query image, a multi-label prediction system requires calculating an output score for each class. For multiple kernel setting, an output score for class k can be computed as, n αik .κ(xi , x; β k ), fk (x) = i=1 where κ(., .; β k ) is the optimal kernel function (linear combination of the base kernels) for class k. Since the computation of output function score is standard for all baselines that use multiple kernels, only the following three factors affect the prediction speed: • Multi-label kernel combination: Do output functions for each class require a different kernel combination, or do they share a single kernel combination function? • Sparsity of kernel combination weights. • Sparsity of output functions. Therefore, in addition to reporting the actual prediction times, we also discuss these factors to get a better understanding of the prediction efficiency. For a fixed number of training samples (5, 000) and classes (200), we report the sparsity of kernel weights and dual variables in Table 6.9 for the multiple kernel baselines. We also compute two types of prediction times, both reported in seconds: (i) Average prediction time per single class, (ii) Total prediction time. Note that the average prediction time per class is not calculated simply by dividing the total prediction time by the number of classes, but it is the time to make a prediction if there was only one class needed (binary prediction). 
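A minimal sketch of this prediction step with a shared sparse kernel combination is given below. The callback base_kernel_fns is a hypothetical stand-in for on-the-fly base kernel evaluation, and the label signs are assumed to be folded into the dual variables for brevity.

```python
import numpy as np

def predict_scores(x, X_train, alpha_signed, beta, base_kernel_fns):
    # f_k(x) = sum_i alpha_signed[i, k] * kappa(x_i, x; beta), computed for all
    # classes at once. base_kernel_fns[j](X_train, x) returns the j-th base
    # kernel evaluated between the training images and the query image.
    n = X_train.shape[0]
    k_vec = np.zeros(n)
    for j, b in enumerate(beta):
        if b == 0.0:                   # sparsity: skip unused base kernels, so
            continue                   # their features are never extracted
        k_vec += b * base_kernel_fns[j](X_train, x)
    return alpha_signed.T @ k_vec      # one score per class
```

Skipping the zero-weight base kernels is exactly where the sparsity of the kernel combination pays off, since the corresponding (often expensive) features never need to be extracted for the query image.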
When the prediction scores for all classes need to be computed, the prediction time does not increase linearly, since feature extraction, which is the most time-consuming step, can be done once for all classes. An analysis on the sparsity values leads to the following conclusions: • The main bottleneck for prediction time is the feature extraction step. The time it takes to extract all 186 features that are being used for the ESP Game data set is 10.29 seconds per image. Therefore, 161 the level of sparsity in kernel coefficients vector is the major factor determining the prediction time efficiency. • The feature extraction time is not uniform among the features we use. Dense-SIFT based BoW representation is the one that takes the most time with 1.08 seconds. On the other hand, the average time to compute an object bank feature vector is 0.04 seconds. Therefore, sparseness by itself is not the only factor that affects the feature extraction time. For instance a less sparse solution that excludes dense-SIFT based BoW representation from the final feature combination might be more efficient than a sparser solution that requires using dense-SIFT. • AVG-SVM and MKL-L2 methods employ all base kernels, meaning that they require extracting all feature types. Since feature extraction is one of the most expensive steps of prediction, having a non-sparse kernel coefficient vector makes AVG-SVM and MKL-L2 slower in prediction, compared to the methods that learn a sparse kernel combination vector. • The average sparsity of kernel combination weights over all classes is 87.77% for MKL-L1 , making it the method with the fastest prediction step for a single class. However, when the kernel weights for all classes are considered together, we see that only 21 of the base kernels are not used for any class prediction function. Therefore, although individual binary classifiers have sparse kernel combinations, the overall multi-label prediction sparsity is 11.29% for MKL-L1 , making it significantly slower than methods that use a single kernel combination, namely MK-MLR and ML-MKL-Sum, when all the classes need to be evaluated. • The average sparsity of kernel combination weights that are learned throughout the ML-MKL-SA is 51.58% per iteration. However, since the final kernel combination is the mean of all previous kernel combination weights, the final sparsity becomes 11.29%, which is significantly lower compared to the ML-MKL-Sum and MK-MLR methods. • MK-MLR outputs a very sparse kernel combination. Because of this, MK-MLR enables a fast prediction by avoiding unnecessary feature extraction and kernel construction steps. 162 Table 6.9: Sparsity (%) of kernel weights and dual variables for the multiple kernel baselines and the resulting prediction times. These results are obtained from a subset of the ESP Game data set with 5, 000 training images and 200 classes. AVG-SVM Sparsity(β) Sparsity(α) Avg. pred. time per class Total pred. time MKL L2 0 59.53 MLMKLSum 77.42 57.88 MLMKLSA 11.29 60.81 MK-MLR 0 60.55 MKL L1 87.77 61.72 10.71 3.74 5.31 5.31 10.69 4.94 11.38 10.76 11.30 5.58 11.33 5.11 76.88 47.11 • All OvA based MKL methods produce similar sparsity percentages for dual variables. Although the sparsity of the proposed MK-MLR method is around 10% lower than others, MK-MLR also yields a sparse support vector set. Sparsity of the support set is crucial for reducing storage requirements, kernel construction, and output function calculation costs. 
However, its impact is much smaller compared to the sparsity of the kernel combination weights in our experimental settings. 6.5 Conclusions and Future Work In this chapter, we presented an efficient multiple kernel multi-label ranking method by putting together different ideas from the previous chapters. Our experiments in Chapter 4 showed that formulating image categorization as a multi-label ranking problem leads to superior performance compared to more widelyused formulations such as binary decomposition (e.g., OvO and OvA). Therefore, we extended multi-label ranking to multiple kernel setting and proposed the MK-MLR algorithm. Following the conclusions of Chapter 3, we proposed to learn a shared kernel combination for all classes. This approach improves the computational efficiency of both the training and prediction steps significantly. MK-MLR algorithm learns kernel weights and class output functions simultaneously using the semi-infinite linear programming (SILP) method, which is shown to be the most computationally efficient wrapper MKL solver. Our experimental results on two multi-label data sets, ESP Game and MIR Flickr25000 demonstrated 163 the superiority of the proposed MK-MLR algorithm. MK-MLR efficiently combines heterogeneous data sources and exploit label correlations to maximize image categorization performance. In addition to yielding strong prediction performance, MK-MLR is also faster than OvA MKL formulations, which require solving MKL for each class. The sparsity of kernel combination weights and dual variables also leads to a much faster prediction step. However, there is still room for improvement of the prediction speed. One of the drawbacks of MK-MLR is that the computational complexity of the prediction step is linear in the number of classes. A future direction would be employing label set projection methods, such as compressed sensing, to make the prediction complexity sublinear in the number of classes. 164 Chapter 7 Contributions and Future Work The main contributions of this thesis are efficient multiple kernel learning (MKL) and multi-label ranking algorithms that advance the state of the art in kernel learning for image categorization by combining different image representations and exploiting image label correlations for improved multi-label predictions. 7.1 Contributions In Chapter 3 we proposed a stochastic approximation based multi-label multiple kernel learning algorithm that makes the following contributions: • Developed a multi-label multiple kernel learning method that enables information sharing between class labels to improve the performance on the classes with a small number of training samples. • Demonstrated that learning a shared combination of kernels for all classes improves the computational efficiency significantly without adversely affecting the classification performance. • Proposed an stochastic optimization algorithm with a computational cost that is sublinear in the num√ ber of classes, O(m1/3 lnm), making it suitable for handling a large number of classes, m. The multi-label ranking method in Chapters 4 offers the following contributions: 165 • Formulated multi-label learning as a multi-label ranking task, which is more flexible than classification based on binary decisions because of the ability to provide an ordered list of image labels. 
• Developed an approximation that reduces the number of constraints in the optimization problem and makes it linear in the number of classes, compared to the quadratic dependency of the original ranking formulation. The approximation also enables class correlations to be implicitly included in the optimization process for improved multi-label learning performance.

• Proposed an efficient optimization algorithm based on block coordinate descent and a simple line search for which the search boundaries are provided. Experimental results demonstrate that the computational load of the multi-label ranking algorithm is of the same order as one-vs-all SVM.

• Showed superior performance compared to state-of-the-art multi-label learning methods on a data set in which full label information is available.

Studies on multi-label learning with incomplete class assignments in Chapter 5 offer the following contributions:

• Formally defined the problem of learning from multi-label data with incomplete class assignments.

• Developed a multi-label ranking method (MLR-GL) that explicitly addresses the challenge of learning from incompletely labeled data by exploiting the group lasso technique to combine the ranking errors.

• Proposed a computationally efficient optimization algorithm that has a closed-form solution. Experimental results demonstrate that the complexity of the multi-label ranking algorithm is of the same order as one-vs-all SVM.

• Empirically demonstrated the robustness of MLR-GL for the incomplete class assignment problem.

We proposed a multiple kernel multi-label ranking method (MK-MLR) in Chapter 6, which extends the MLR-L1 algorithm of Chapter 4 to the multiple kernel setting and makes the following contributions:

• Proposed a method (MK-MLR) that combines multiple kernel learning and multi-label ranking in a single framework.

• Developed an efficient semi-infinite linear programming (SILP) algorithm that learns a single kernel combination for all classes.

• Showed empirically that the MK-MLR algorithm finds an optimal shared sparse combination of the base kernels for all classes. Sparse solutions improve computational efficiency and robustness by eliminating weak or noisy kernels/features.

• Sparseness is particularly important for the prediction step, in which feature extraction is the main bottleneck. The experimental results showed that the sparsity of the kernel combination coefficient vector reduces the prediction time. Because of its sparse solutions, the MK-MLR algorithm reduces the prediction time significantly (on the order of seconds) compared to other methods that fail to yield sparse solutions.

Based on the extensive empirical evaluations in this dissertation, we make the following recommendations:

• Despite the high computational cost of the training step, multiple kernel learning is useful for image categorization. It not only optimizes the classification performance by choosing the best kernel combination, but sparse MKL also decreases the prediction time significantly by minimizing the time spent on feature extraction.

• MKL is particularly useful when the number of kernels/features is high and there are potentially weak or noisy kernels, which necessitates kernel selection for improved classification performance. In settings where there is a small number of strong base kernels, using the average of the base kernels gives results comparable to MKL.
• Learning a shared kernel combination for all classes is a good strategy to follow in multiple kernel learning for image categorization. Although the assumption that all classes share the same kernel might not hold for other application domains, it not only yields good classification performance here, but also reduces the training and prediction times significantly.

• Casting multi-label learning as a ranking problem is an effective way to boost the classification performance, particularly when the number of classes is high. The multi-label ranking methods presented in this dissertation are able to exploit label correlations without making strong assumptions about the data, which demonstrates both their classification effectiveness and their generalizability.

7.2 Future Work

Despite significant progress in the literature and in this dissertation, the current multiple kernel and multi-label learning methods for image categorization still have shortcomings. We point out the following research directions:

• Improving the scalability of multiple kernel learning methods: Although MKL methods have been shown to be very useful in learning an optimal combination of different image representations and the corresponding kernel functions, they do not scale well to training sets with millions of images and thousands of classes. In Chapter 3, we addressed the problem of a large number of classes. However, handling a large number of training samples is still the biggest challenge in using MKL. One of the priorities for MKL research should be making MKL methods scalable to data containing millions of samples.

• Computational efficiency in the prediction phase: In general, computational efficiency in the prediction step is more important than training efficiency for practical systems, since the training phase can be done off-line, whereas a server, for instance, might need to make a decision in a short time, making a fast prediction algorithm necessary. Therefore, it is important to develop efficient multiple kernel multi-label prediction algorithms. However, there are only a few studies in the machine learning literature that target improving the prediction speed.

Appendix A: Supplementary Materials

In this appendix, we first discuss the image categorization problem by briefly explaining the image representations, data sets, and evaluation measures we use in our experiments. Then, we provide the proofs of some theorems that were not included in the corresponding chapters.

A.1 Image Representation

We start with a brief background on image representations, and then briefly explain the bag-of-words (BoW) model, which is the most widely used low-level image representation technique. We also discuss the use of high-level (semantic) image representations for image categorization.

A.1.1 A Brief History

The history of published work on image categorization can be traced back to the 1960s [174]. The majority of the studies in the 1960s aimed to model and recognize simple geometric objects in an image. Such techniques are called “model-based recognition methods” [175, 176]. The goal in model-based recognition is to define or describe models for object categories and to find matches between the models and the detected objects in an image. In the 1990s, we saw a rapid growth in the object recognition literature, probably due to the improvements in imaging and processing technologies.
Although there were still methods using local shape-based features, i.e., modeling via small shape parts [177] and polygon approximation of object boundaries [178], researchers also started to use color [179, 180] and texture based representations [181, 182]. The early works on automatic image annotation, which can be considered a subset of the image categorization problem, used image segmentation to extract blobs/regions from the image. Once the features are extracted for each of these regions, the corresponding image labels are then assigned to these regions [183–186]. However, this approach requires a successful segmentation step, which is a very difficult task. Interest in extracting key points from an image and describing the local patches around these key points grew in the 1990s [187, 188]. The popularity of local features/descriptors increased even more rapidly with the success of the SIFT algorithm, the seminal work by Lowe [189]. The SIFT approach for local descriptor extraction enabled high accuracy for the image matching problem. The bag-of-words (BoW) model took key-point descriptors beyond the simple image matching problem by efficiently constructing, from local key-point descriptors such as SIFT features [190], the global representation of an entire image that is necessary for image categorization. Among the various approaches developed for image representation, the bag-of-words (BoW) model is the most popular due to its simplicity and success in practice. Most state-of-the-art methods use the bag-of-words model. Therefore, we also use the BoW model in our experiments.

A.1.2 The Bag-of-Words (BoW) Model

The first step in the BoW model is to detect key points or key regions in images. Many algorithms have been developed for key-point/region detection [104, 189, 191], each having its own strengths and weaknesses. For instance, although dense sampling has been shown to be superior to other techniques for image categorization, it usually yields a large number of key points and might lead to high computational costs. To have a richer variety of representations, in our experiments we used Harris-Laplacian [104] and Canny-edge-detector based key-point methods in addition to dense sampling.

The second step is to generate local descriptors for the detected key points/regions. There is a rich literature on local descriptors, among which the scale invariant feature transform (SIFT) [189] is, without doubt, the most popular. Other techniques that we use in our experiments to improve the recognition performance are local binary patterns (LBP) [95] and histograms of oriented gradients (HOG) [192].

Given the descriptors, the third step of the BoW model is to construct a visual vocabulary. Both the dictionary size and the technique used to create the dictionary can have a significant impact on the final recognition accuracy. In our experiments, we use the k-means clustering technique to generate the dictionary. Given the dictionary, the next step is to map each key point to a visual word in the dictionary, a step that is often referred to as the encoding module. Recent studies express a vast amount of interest in the encoding step, resulting in many alternatives to vector quantization (e.g., the Fisher kernel representation [193]). The last step in the BoW model is the pooling step, which pools the encoded local descriptors into a global histogram representation. Various pooling strategies have been proposed for the BoW model, such as mean and max-pooling, two techniques that we employ in our experiments. Studies [103] have shown that it is important to take into account the spatial layout of key points in the pooling step. One common approach is to divide an image into multiple regions and construct a histogram for each region separately. A well known example of this approach is spatial pyramid pooling [103], which divides an image into 1 × 1, 2 × 2, and 4 × 4 grids. Table A.1 lists different techniques for each module of the BoW model.

Table A.1: A list of techniques that can be used in each module of the Bag-of-Words (BoW) model
    Region detector:        Dense sampling, random sampling, Harris points, Harris-Laplace regions, Hessian-Laplace, Harris-Affine regions, Hessian-Affine regions
    Descriptor:             SIFT, GLOH, Shape context, PCA-SIFT, spin images, steerable filters, LBP, cross-correlation, color histograms, HOG
    Visual dictionary:      k-means, hierarchical k-means, GMM
    Encoding/quantization:  Vector quantization, Salient coding, LLC, LCC, Fisher vector, Sparse coding
    Pooling technique:      max-pooling, average pooling
    Spatial arrangement:    1×1, 2×2, 4×4, 1×3, 3×1
    Kernel function:        Linear, RBF, polynomial, χ²
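As a concrete illustration of the pipeline described above, the sketch below builds a BoW histogram using dense sampling, SIFT descriptors, a k-means visual dictionary, simple vector quantization, and global average pooling. The OpenCV and scikit-learn calls are standard, but the grid step, the dictionary size, and the placeholder image paths are illustrative choices rather than the exact settings of our experiments; the spatial pyramid and the alternative encodings listed in Table A.1 can be substituted for the simple quantization and pooling shown here.

    import cv2
    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    def dense_sift(gray, step=8, size=8):
        """Dense sampling: one SIFT descriptor per grid point (step and size are illustrative)."""
        sift = cv2.SIFT_create()
        keypoints = [cv2.KeyPoint(float(x), float(y), size)
                     for y in range(step, gray.shape[0] - step, step)
                     for x in range(step, gray.shape[1] - step, step)]
        _, descriptors = sift.compute(gray, keypoints)
        return descriptors                                   # (num_keypoints, 128)

    def build_dictionary(descriptor_list, k=1000):
        """Visual dictionary: k-means over descriptors pooled from the training images."""
        kmeans = MiniBatchKMeans(n_clusters=k, random_state=0)
        kmeans.fit(np.vstack(descriptor_list))
        return kmeans

    def bow_histogram(descriptors, kmeans):
        """Encoding by vector quantization, then average pooling into an L1-normalized histogram."""
        words = kmeans.predict(descriptors)
        hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

    # Usage sketch (train_paths and query.jpg are placeholders):
    # train_descs = [dense_sift(cv2.imread(p, cv2.IMREAD_GRAYSCALE)) for p in train_paths]
    # kmeans = build_dictionary(train_descs)
    # x = bow_histogram(dense_sift(cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)), kmeans)

A spatial pyramid version of this representation simply computes such a histogram for each cell of the 1 × 1, 2 × 2, and 4 × 4 grids and concatenates the results.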
Besides the BoW model, many alternative low-level image features have been proposed for object recognition, including GIST [102], color histograms, V1S+ [97], and geometric blur [99].

A.1.3 High-level Image Representations

Although most image categorization methods are based on low-level features, particularly the BoW model, the use of high-level features is growing. One of the popular high-level image representation tools is the object bank method [173]. Li et al. defined a total of 177 different pre-computed object detectors using large object recognition data sets such as ImageNet and LabelMe [194]. Each object detector is based on a multi-scale spatial pyramid representation and linear classifiers. An image can then be represented as a set of responses to these object detectors (classifiers). The object bank method is closely related to the image attributes method [195]. Attributes are human-designed names, such as “striped” or “has a tail”; by using a separate classifier for each attribute, an image can be described in terms of the attributes it has. In our multiple kernel learning experiments, we employ object bank representations in addition to several low-level features to increase the number of base kernels and the richness of the representations.

A.1.4 Data Sets

The majority of the data sets we use are multi-labeled. However, in order to compare different multiple kernel learning solvers, we also use multi-class single-label benchmarks. Table A.2 provides statistics of the data sets used in our experiments.

Table A.2: Data set statistics
                              Caltech 101   ImageNet subset   VOC 2007   MIR Flickr subset   ESP Game subset
    # samples                       8,677            81,738      9,963              10,199           100,000
    # classes                         101               101         20                 457               500
    avg. no. of labels/img              1                 1        1.5                 2.7               8.5
    avg. no. of img/label            85.9              85.9      729.9               145.4            1691.3

A.1.4.0.1 The Caltech 101 data set has been used in many MKL studies; therefore, we also use it in our MKL experiments. It is comprised of 9,146 images from 101 object classes and an additional class of “background” images. Caltech 101 is a multi-class single-label data set in which each image is assigned to one object class. As can be seen from the sample images in Figure A.1, the objects are generally center aligned, scaled, and not occluded.
For these reasons, Caltech 101 is considered a relatively easy data set for classification.

Figure A.1: Four example images from the Caltech 101 data set with their labels (cougar, strawberry, snoopy, crocodile).

A.1.4.0.2 The Pascal VOC 2007 data set is comprised of 9,963 images from 20 object classes. Unlike Caltech 101, more than half of the images in VOC 2007 are assigned to multiple classes. Overall, it is a more challenging data set than Caltech 101 because of the large variations in object size, orientation, and shape, as well as the occlusion problem.

A.1.4.0.3 A subset of the ImageNet data set is used in [106] for evaluating multiple kernel learning methods for image categorization. While the full ImageNet data set contains 14,197,122 images from 21,841 categories, the data set used in the ImageNet Large Scale Visual Recognition Challenges contains 1.2 million training images from 1,000 categories [196]. However, following the protocol in [106], we use 81,738 images from ImageNet that belong to 18 out of the 20 categories specified in VOC 2007; only 18 of the VOC 2007 categories are available within the ImageNet data set. This subset is significantly larger than Caltech 101 and VOC 2007, making it possible to examine the scalability of MKL methods for image categorization. Like Caltech 101, ImageNet is a multi-class single-label data set, and we use it exclusively for the MKL experiments. Although the objects in the images are not always well aligned and scaled, this data set is not considered challenging for classification, because the objects are roughly aligned and there is only mild object occlusion, as seen in Figure A.2. Therefore, we can still consider the subset of ImageNet that we are using a relatively easy data set. Note that, although the ImageNet data set has a hierarchical label structure, we do not consider this structure in our experiments. For instance, we label the two images in the bottom row of Figure A.2 as two instances of dog images, although their synsets (dalmatian and Mexican hairless) are different.

Figure A.2: Four example images from the ImageNet data set. A cat and a car image are shown in the top row. The second row shows two dog images, one from the dalmatian synset and one from the Mexican hairless synset.

A.1.4.0.4 MIR Flickr25000 is a subset of the MIR Flickr-1M data set [154] that is used for classification challenges. It was created for the visual concept detection and annotation tasks in the ImageCLEF Challenge [197]. The data set contains 25,000 images with 457 types of tags. MIR Flickr25000 can be considered a more difficult data set for classification than VOC 2007 and Caltech 101 because it is multi-labeled and has a larger number of classes. In addition to all the challenges listed for the VOC 2007 data set, the MIR Flickr25000 data set poses extra difficulties because of the camera effects used by the photographers who took the photos, such as tilt shifting, post-processing, and cinematic effects. Figure A.3 shows two images with such effects.

Figure A.3: Two example images from the MIR Flickr data set. The left image (reflection effect) is by Szymczak [1] and the right image (fish-eye effect) is by Wild [2].

A.1.4.0.5 ESP Game is an online game that involves comparing the annotations of multiple users (competitors) for an image to retrieve the relevant labels [198]. The labels that are agreed on by multiple annotators are treated as true labels, and the annotators who provide these true labels acquire points for each correct annotation they provide.
The ESP Game data set, which contains 100,000 images with 26,449 annotations, is also one of the more difficult data sets for multi-label learning. As can be seen from Figure A.4, the types of images (e.g., cartoons, video games, portraits) show an immense variety, and the images are not always of high quality (low resolution, occlusion). We pick the 500 most frequent labels and use the images that contain at least one of these 500 labels. Although most of the labels describe concrete objects, there are also abstract image labels such as fight, sale, view, and symbol.

Figure A.4: Four example images from the ESP Game data set.

A.1.5 Evaluation Measures

We use two approaches to evaluate an algorithm for image categorization. Given an image, the first approach is to rank the labels and measure the ability of the algorithm to rank the relevant labels higher than the irrelevant ones. In the second approach, given a category (label), the goal is to measure the performance of the algorithm in separating positive-labeled images from negative-labeled ones. The first approach is image based evaluation, whereas the second one is category based evaluation.

A.1.5.1 Image Based Evaluation: Since we focus on multi-label ranking, we rank the classes in descending order of their scores for a given image. The true label assignments (provided by human annotators) of an image are called relevant labels, and the remaining labels are called irrelevant labels. For each image, we predict its categories by retrieving the first k labels with the largest scores. We vary k, i.e., the number of retrieved labels, from 1 to the total number of categories, and compute the following scores for an image indexed with i:

• True Positive ($TP_i$): the number of correctly retrieved relevant labels
• False Positive ($FP_i$): the number of retrieved labels that are not relevant
• False Negative ($FN_i$): the number of relevant labels that are not retrieved
• True Negative ($TN_i$): the number of rejected irrelevant labels
• True Positive Rate: $TPR_i = TP_i/(TP_i + FN_i)$
• False Positive Rate: $FPR_i = FP_i/(FP_i + TN_i)$
• Recall $= TP_i/(TP_i + FN_i)$
• Precision $= TP_i/(TP_i + FP_i)$

Once the above scores are calculated for an image, we can obtain the AUC-ROC (area under the receiver operating characteristic curve) and AP (average precision) measures. The ROC curve plots TPR (y-axis) against FPR (x-axis), and the area under the curve (%), which is a value between 0 and 100, measures the ranking performance of an algorithm: higher scores are better. Following the PASCAL Visual Object Classes (VOC) challenge, we calculate the precision values corresponding to a set of evenly spaced recall levels {0, 0.1, ..., 1.0} and take the mean of these precision values to get the AP score. Once the AUC-ROC and AP scores are calculated for each image, we take the mean of these scores over all test images (micro-averaging).
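As a concrete reference for the image based measures above, the sketch below computes the per-image AUC-ROC and AP from a score vector over all classes and the set of relevant labels, and then micro-averages the results over a toy test set. It assumes the max-interpolated precision of the PASCAL VOC 11-point protocol at the recall levels {0, 0.1, ..., 1.0}; the function name, the toy scores, and the toy label sets are illustrative and not taken from our experiments.

    import numpy as np

    def image_based_scores(scores, relevant):
        """AUC-ROC (in %) and 11-point AP for a single image.

        scores   : (num_classes,) prediction scores for every class.
        relevant : indices of the relevant (ground-truth) labels of the image.
        """
        is_rel = np.zeros(len(scores), dtype=bool)
        is_rel[list(relevant)] = True
        rel_sorted = is_rel[np.argsort(-scores)]        # retrieve labels by decreasing score

        tp = np.cumsum(rel_sorted)                      # TP_i as k goes from 1 to num_classes
        fp = np.cumsum(~rel_sorted)                     # FP_i
        fn = is_rel.sum() - tp                          # FN_i
        tn = (~is_rel).sum() - fp                       # TN_i

        tpr = np.concatenate(([0.0], tp / np.maximum(tp + fn, 1)))
        fpr = np.concatenate(([0.0], fp / np.maximum(fp + tn, 1)))
        precision = tp / np.maximum(tp + fp, 1)
        recall = tp / np.maximum(tp + fn, 1)

        # Area under the ROC curve via the trapezoidal rule, reported in percent.
        auc = 100.0 * np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)

        # 11-point AP with max-interpolated precision (PASCAL VOC style).
        ap = np.mean([precision[recall >= r].max() if np.any(recall >= r) else 0.0
                      for r in np.linspace(0.0, 1.0, 11)])
        return auc, ap

    # Micro-averaging over a toy test set of 5 images and 10 classes.
    rng = np.random.default_rng(0)
    all_scores = rng.random((5, 10))
    all_labels = [{0, 3}, {1}, {2, 5, 7}, {4}, {8, 9}]
    aucs, aps = zip(*(image_based_scores(s, y) for s, y in zip(all_scores, all_labels)))
    print(np.mean(aucs), np.mean(aps))

The category based evaluation described next reuses the same machinery with the roles of images and labels exchanged: for a fixed category, the images are ranked by their scores, the per-category AP is computed from the resulting precision and recall values, and the MAP score is the mean of these per-category APs.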
A.1.5.2 Category Based Evaluation: We use category based evaluation for the multiple kernel learning experiments, which involves comparing binary MKL algorithms. Note that, unlike the previous case, we rank the images for each label. Let us redefine the measures we use for the classification performance:

• Category-based True Positive ($TP_c$): the number of images that are correctly assigned a positive label for a category
• Category-based False Positive ($FP_c$): the number of images that are falsely assigned a positive label for a category
• Category-based False Negative ($FN_c$): the number of images that are falsely assigned a negative label for a category
• Category-based Recall $= TP_c/(TP_c + FN_c)$
• Category-based Precision $= TP_c/(TP_c + FP_c)$

By using the category based precision and recall values, we can calculate the average precision (AP) score for each category. As suggested in the PASCAL Visual Object Classes challenge, we use only the mean average precision (MAP) score. The reason for this is that the three data sets we use for the MKL experiments, namely Caltech 101, VOC 2007, and a subset of ImageNet, give fairly high classification performance in terms of AUC-ROC, making it difficult to distinguish the performance differences between the baselines. Therefore, we report only the MAP score for the MKL experiments.

A.1.6 State-of-the-art Performance in Image Categorization

The winners of the ILSVRC (ImageNet) 2012 and 2013 challenges used deep convolutional networks on raw pixel data [199]. For example, the winning entry of ILSVRC 2012 is a trained neural network that has 60 million parameters and 650,000 neurons and consists of five convolutional layers. This deep convolutional neural network yielded an error rate of 0.15 for rank-5 predictions, improving over the second best method in the competition by 10%. The performance was further improved in ILSVRC by combining several CNNs, and an error rate of 11.74% was achieved. Deep convolutional networks produce very promising results for both classification and detection when the number of images is high (on the order of millions). In this dissertation, we are interested in developing classification algorithms that work with any image representation; in contrast, convolutional neural networks learn their own features.

The method ranked second in the ILSVRC 2012 challenge used a set of different BoW representations, including SIFT, LBP, and GIST based Fisher vector features. This approach, which produces an error rate of 0.26 for rank-5 predictions, learns a separate classifier for each feature type, whose representations are 262,144-dimensional vectors, and then calculates a weighted sum of these individual classifiers for the final predictions. Similar to the ILSVRC 2012 challenge, the top performing methods in the Pascal VOC categorization challenge combine different representations (mostly based on Fisher vectors) and build features that are over 300,000 dimensional. The winning group includes additional modules such as object detection/localization and subclass modeling. While the winning method in the VOC 2012 challenge, which utilizes object detection, yields a MAP score of 82%, the reported result on the VOC 2007 data set using a single feature is 61.7% (classification only). It is important to note that we are interested in developing methods that perform only categorization, meaning that our algorithms require only a single global descriptor for each image and do not need localization (i.e., bounding box) information in the training process. When very high dimensional feature vectors are used, linear SVMs yield results similar to kernel SVMs.
Although linear SVMs are more efficient than kernel SVMs, the main bottleneck for them in the prediction step is feature extraction. Our goal in this dissertation, in contrast, is to optimize the classification performance for features that are relatively low dimensional (1,000 to 10,000) by using kernel classifiers.

A.2 Proofs for Chapter 2

In this section we prove the equivalence between Eqs. (A.1) and (A.2) (originally Eqs. (2.2) and (2.7) in Chapter 2).

$$\min_{\beta\in\Delta,\ f\in\mathcal{H}_\beta}\ \frac{1}{2}\|f\|_{\mathcal{H}_\beta}^2 + C\sum_{i=1}^n \ell\big(y^i f(x^i)\big) \qquad \text{(A.1)}$$

$$\min_{\lambda\in\mathbb{R}_+^s,\ \sum_j \lambda_j=1}\ \min_{\{f_j\in\mathcal{H}_j\}_{j=1}^s}\ \frac{1}{2}\sum_{j=1}^s \lambda_j\|f_j\|_{\mathcal{H}_j}^2 + C\sum_{i=1}^n \ell\Big(y^i\sum_{j=1}^s \lambda_j f_j(x^i)\Big) \qquad \text{(A.2)}$$

We first rewrite $C\ell(z)$ as $\max_{\alpha\in[0,C]} \alpha(1-z)$ and place it into Eq. (A.2) to get Eq. (A.3),

$$\min_{\lambda\in\mathbb{R}_+^s,\ \sum_j\lambda_j=1}\ \min_{\{f_j\in\mathcal{H}_j\}_{j=1}^s}\ \max_{\alpha\in[0,C]^n}\ \frac{1}{2}\sum_{j=1}^s \lambda_j\|f_j\|_{\mathcal{H}_j}^2 + \sum_{i=1}^n \alpha_i\Big(1 - y^i\sum_{j=1}^s \lambda_j f_j(x^i)\Big). \qquad \text{(A.3)}$$

The problem in Eq. (A.3) becomes a convex-concave optimization problem and, according to von Neumann's lemma, we can switch the minimization with respect to $f_j$ and the maximization with respect to $\alpha$. It is straightforward to show that $f_j(x) = \sum_{i=1}^n \alpha_i y^i \kappa_j(x, x^i)$ is the minimizer. Using this expression, the optimization problem can be rewritten as in Eq. (A.4), which is exactly the same as the dual form of Eq. (2.2):

$$\min_{\beta\in\Delta}\ \max_{\alpha\in Q}\ L(\alpha,\beta) = \mathbf{1}^\top\alpha - \frac{1}{2}(\alpha\circ y)^\top K(\beta)(\alpha\circ y). \qquad \text{(A.4)}$$

This shows that Eqs. (A.1) and (A.2) are equivalent, which concludes the proof.

A.3 Proofs for Chapter 3

Proposition 4. Eq. (A.6) is the dual problem of Eq. (A.5),

$$\min_{\beta\in\Delta}\ \min_{\{f_k\in\mathcal{H}(\beta)\}_{k=1}^m}\ \sum_{k=1}^m H_k, \qquad H_k = \frac{1}{2}|f_k|_{\mathcal{H}(\beta)}^2 + C\sum_{i=1}^n \ell\big(y_k^i f_k(x^i)\big), \qquad \text{(A.5)}$$

where $\ell(z) = \max(0, 1-z)$ and $\mathcal{H}(\beta)$ is a Reproducing Kernel Hilbert Space endowed with kernel $\kappa(x, x'; \beta) = \sum_{j=1}^s \beta_j \kappa_j(x, x')$;

$$\min_{\beta\in\Delta}\ \max_{\alpha\in Q_1}\ L(\beta,\alpha) = \sum_{k=1}^m \Big([\alpha_k]^\top\mathbf{1} - \frac{1}{2}(\alpha_k\circ y_k)^\top K(\beta)(\alpha_k\circ y_k)\Big), \qquad \text{(A.6)}$$

where $Q_1 = \{\alpha = (\alpha_1, \ldots, \alpha_m) : \alpha_k\in[0,C]^n,\ k = 1,\ldots,m\}$.

Proof. We first rewrite $\ell(z)$ as $\ell(z) = \max_{x\in[0,1]} (x - xz)$. Using this expression for $\ell(z)$, the second term of $H_k$ can be rewritten as $\sum_{i=1}^n \max_{\alpha_k^i\in[0,C]} \big(\alpha_k^i - \alpha_k^i y_k^i f_k(x^i)\big)$. According to von Neumann's lemma, we can switch the minimization (over $f_k$) with the maximization (over $\alpha$). By taking the minimization over $f_k$ first, we have $f_k(x) = \sum_{i=1}^n y_k^i \alpha_k^i \kappa(x^i, x)$. Finally, the problem becomes

$$\min_{\beta\in\Delta}\ \max_{\alpha\in Q_1}\ L(\beta,\alpha) = \sum_{k=1}^m \Big([\alpha_k]^\top\mathbf{1} - \frac{1}{2}(\alpha_k\circ y_k)^\top K(\beta)(\alpha_k\circ y_k)\Big).$$

Proposition 5. Eq. (A.8) is the dual problem of Eq. (A.7),

$$\min_{\beta\in\Delta}\ \min_{\{f_k\in\mathcal{H}(\beta)\}_{k=1}^m}\ \max_{1\le k\le m} H_k, \qquad \text{(A.7)}$$

$$\min_{\beta\in\Delta}\ \max_{\rho\in B}\ L(\beta,\rho) = \Bigg(\sum_{k=1}^m \Big[[\rho_k]^\top\mathbf{1} - \frac{1}{2}(\rho_k\circ y_k)^\top K(\beta)(\rho_k\circ y_k)\Big]^{\frac{1}{2}}\Bigg)^2, \qquad \text{(A.8)}$$

where $B = \Big\{(\rho_1,\ldots,\rho_m) : \rho_k\in\mathbb{R}_+^n,\ \rho_k\in[0, C\lambda_k]^n,\ k=1,\ldots,m,\ \text{s.t.}\ \sum_{k=1}^m \lambda_k = 1\Big\}$.

Proof. We start by formulating Eq. (A.7) as

$$\min_{\beta\in\Delta}\ \min_{\{f_k\in\mathcal{H}(\beta)\}_{k=1}^m}\ \min_{t}\ t \quad \text{(A.9)} \qquad \text{subject to} \quad H_k \le t,\ k = 1,\ldots,m, \quad \text{(A.10)}$$

with an extra variable $t\in\mathbb{R}$. Introducing the multiplier $\lambda_k$ for $H_k\le t$, and using Proposition 1, the Lagrangian is

$$t + \sum_{k=1}^m \lambda_k\Big([\alpha_k]^\top\mathbf{1} - \frac{1}{2}(\alpha_k\circ y_k)^\top K(\beta)(\alpha_k\circ y_k) - t\Big) = (1 - \mathbf{1}^\top\lambda)\, t + \sum_{k=1}^m \lambda_k\Big([\alpha_k]^\top\mathbf{1} - \frac{1}{2}(\alpha_k\circ y_k)^\top K(\beta)(\alpha_k\circ y_k)\Big), \qquad \text{(A.11)}$$

where $\alpha_k\in[0,C]^n$. So the dual function is

$$g(\beta,\rho,\lambda) = \begin{cases} \displaystyle\sum_{k=1}^m \Big([\rho_k]^\top\mathbf{1} - \frac{1}{2}(\rho_k\circ y_k)^\top \frac{K(\beta)}{\lambda_k}(\rho_k\circ y_k)\Big) & \mathbf{1}^\top\lambda = 1,\\[2mm] -\infty & \text{otherwise}, \end{cases}$$

where $\rho_k = \lambda_k\alpha_k$. Then the dual problem is

$$\min_{\beta\in\Delta}\ \max_{\rho\in B}\ \max_{\lambda\in\Lambda}\ L(\beta,\rho,\lambda) = \sum_{k=1}^m \Big([\rho_k]^\top\mathbf{1} - \frac{1}{2}(\rho_k\circ y_k)^\top \frac{K(\beta)}{\lambda_k}(\rho_k\circ y_k)\Big),$$

where $B = \{(\rho_1,\ldots,\rho_m) : \rho_k\in\mathbb{R}_+^n,\ \rho_k\in[0,C\lambda_k]^n,\ k=1,\ldots,m\}$. Let $(\rho_k\circ y_k)^\top K(\beta)(\rho_k\circ y_k) = \psi_k$. To eliminate $\lambda$, we rewrite the dual problem as a maximization over $\lambda$ for the optimal $\psi_k$.
Then, the Lagrangian becomes max − λ∈Λ 1 2 m k=1 ψk +υ λk m k=1 λk − 1 . Maximizing over λ, we get 1 υ= 2 λk = 2 m ψk k=1 √ ψk m j=1 ψj By eliminating λ, we obtain the following dual of (A.7): min max β∈∆1 ρ∈B    m L(β, ρ) = k=1 1 [ρk ]⊤ 1 − (ρk ◦ yk )⊤ K(β)(ρk ◦ yk ) 2 Proposition 6. We define potential functions Φβ = ηβ ηγ 183 s j=1 βj ln βj for β and Φγ = 1 2 2    . m i i=1 γ ln γ i for γ, and have the following equations for updating β t and γ t as βjt+1 βjt γt = t exp(−ηβ ∇βj L(β t , γ t )), γkt+1 = kt exp(−ηγ ∇γk L(β t , γ t )), Zβ Zγ ⊤ (A.12) ⊤ where Zβt and Zγt are normalization factors that ensure β t 1 = γ t 1 = 1. Proof. We denote by DΦβ (β, β ′ ) : ∆ × ∆ → R+ and DΦγ (γ, γ ′ ) : Γ × Γ → R+ the Bregman distance functions for β and γ that are induced by Φβ and Φγ , respectively. Note that the Bregman distance between z and z′ induced by the strictly convex function Φ, denoted by DΦ (z, z ′ ), is defined as DΦ (z, z′ ) = Φ(z) − Φ(z′ ) − ∇Φ(z′ )⊤ (z − z′ ) Using the Bregman distance function, we introduce two projection operators: Aβ (gβ ; ∆) that projects solution β into domain ∆ along the direction gβ ∈ Rs and Bγ (gγ ; Γ) that projects solution γ into domain Γ along the direction gγ ∈ Rm . These two operators are defined as follows: gβ⊤ β ′ + DΦβ (β ′ , β), Bγ (gγ ) = min gγ⊤ γ ′ + DΦγ (γ ′ , γ) Aβ (gβ ) = min ′ ′ γ ∈Γ β ∈∆ Based on the mirror prox method, we can solve the optimization problem in Eq. (3.3) iteratively. Given the solution β t and γ t of the current iteration, the new solution, denoted by β t+1 and γ t+1 , is computed as β t+1 = Aβ t ηβ ∇β L(β t , γ t , αt ) , γ t+1 = Cγ t −ηγ ∇γ L(β t , γ t , αt ) , (A.13) where ηβ > 0 and ηγ > 0 are the step sizes. The two gradients are computed as m gj (β) = ∂L(β, γ, α) 1 =− ∂βj 2 gk (γ) = ∂L(β, γ, α) 1 = [αk ]⊤ 1 − (αk ◦ yk )⊤ K(β)(αk ◦ yk ), k = 1, . . . , m ∂γk 2 k=1 γk (αk ◦ yk )⊤ Kj (αk ◦ yk ), j = 1, . . . , s (A.14) (A.15) (A.16) 184 By choosing the potential functions as ηβ Φβ = ηγ m s βj ln βj , Φγ = j=1 γk ln γk , (A.17) k=1 t+1 ) we have the following updating rules for β t+1 = (β1t+1 , . . . , βst+1 ) and γ t+1 = (γ1t+1 , . . . , γm βjt+1 = βjt exp −ηγ gj (β t ) , j = 1, . . . , s Zβt (A.18) γkt+1 = γtk exp (ηγ gk (γt )) , k = 1, . . . , m Zγt (A.19) where Zβt and Zγt are defined as s Zβt m βjt = j=1 t exp −ηγ gj (β ) Zγt γkt exp ηγ gk (γ t ) = k=1 Theorem 10. After running Algorithm 3 over T iterations, we have the following inequality for the solution p and γ obtained by Algorithm 3 ˆ γ ˆ E ∆ β, ≤ 1 m2 (ln m + ln s) + ηγ d 2 λ20 n2 C 4 + n2 C 2 , ηγ T 2δ where d is a constant term and E[·] stands for the expectation over the sampled task indices of all iterations. Proof. Define γ gβ (β t , γ t ) = (g1β (β t , γ t ), . . . , gsβ (β t , γ t )), gγ (β t , γ t ) = (g1γ (β t , γ t ), . . . , gm (β t , γ t )). Using the result of variation inequality [119], we have the following inequality for any β ∈ ∆ and 185 γ∈Γ ∆ β t , γ t ≤ (β t − β)⊤ ∇β L(β t , γ t ) − (γ t − γ)⊤ ∇γ L(β t , γ t ). (A.20) According to Proposition 1, we have Et gβ (β t , γ t ) = ∇β L(β t , γ t ), Et gγ (β t , γ t ) = ∇γ L(β t , γ t ). We therefore can rewrite Eq. (A.20) as Et ∆ β t , γ t ≤ Et (β t − β)⊤ gβ (β t , γ t ) − (γ t − γ)⊤ gγ (β t , γ t ) . From [200] (chapter 11), we know that ηγ (β t − β)⊤ gβ (β t , γ t ) ≤ KL(β β t ) − KL(β β t+1 ) + KL(β t β t+1 ), and −ηγ (γ t − γ)⊤ gγ (β t , γ t ) ≤ KL(γ γ t ) − KL(γ γ t+1 ) + KL(γ t γ t+1 ). Therefore, we have T T t ηγ ∆ β ,γ t=1 t 1 1 ≤ KL(β β ) + KL(γ γ ) + KL(β t β t+1 ) + KL(γ t γ t+1 ) . 
t=1 We are going to bound each of the three terms on the right hand side of the inequality. First, it is obvious that KL(β β 1 ) ≤ ln s and KL(γ γ 1 ) ≤ ln m given both γ 1 and β 1 are uniform distributions. Second, we bound KL(β t β t+1 ) as follows 186 =     s s t     β ηβ ηβ j β t t t βj ln = βj ln Zβ exp{ηγ gj }  ηγ  βjt+1  ηγ  j=1 j=1   s s   ηβ βjt ηγ gjβ (β t , γ t ) + βjt ln(Zpt )  ηγ  j=1 j=1    s s s  ηβ  βjt ηγ gjβ (β t , γ t ) + βjt ln  βjt exp −ηγ gjβ (β t , γ t )   ηγ  ηβ ηγ − −ηγ E ≤ ηβ ηγ ηγ2 max [gβ (β t , γ t )]2 2 1≤j≤s j KL(β t |β t+1 ) = = = j=1 j=1 gjβ j=1 + ln E exp −ηγ gjβ (β t , γ t ) = cηγ2 β t t 2 |g (β , γ )|∞ , 2 where the inequality follows directly from the Hoeffiding inequality, and c is a constant such that ηp = cηγ . Similarly, we have KL(γ t γ t+1 ) ≤ ηγ2 γ t t 2 2 |g (β , γ )|∞ . By combining the above results together, we have T T t ηγ E ∆ β ,γ t t=1 ≤ ln m + ln s + ηγ2 t=1 E c|gβ (β t , γ t )|2∞ + |gγ (β t , γ t )|2∞ Using Eq. (A.14), we can bound |gβ (β t , γ t )|∞ as follows |gβ (β t , γ t )|∞ = max gjβ (β t , γ t ) 1≤j≤s 1 max − (αat ◦ yat )⊤ Ka (αat ◦ yat ) 1≤j≤s 2 λ0 λ0 1 ≤ (C1)⊤ VDV−1 (C1)] ≤ (C1)⊤ VIV−1 (C1) = (C1)⊤ I(C1) 2 2 2 1 ≤ nC 2 λ0 , 2 = where K = VDV−1 is the eigendecomposition of the PSD matrix K, λ0 = max λmax (Kj ), and 1≤j≤s λmax (Z) stands for the maximum eigenvalue of matrix Z. Similarly, by using Eq. (A.15) we can bound 187 |gγ (β t , γ t )|∞ as |gγ (β t , γ t )|∞ = max {gkγ (β t , γ t ) ≤ 1≤k≤m m λ0 max nC, nC 2 . δ 2 Next, we have the bound simplified as T ∆ βt, γ t E ≤ t=1 1 (ln m + ln s) + ηγ T ηγ d m2 2 2 4 λ n C + n2 C 2 , 2δ2 0 where d is a constant. We complete the proof by using the fact ∆(β, γ) is jointly convex in both β and γ; therefore, T t=1 ∆ ˆ γ ˆ . β t , γ t ≥ T ∆ β, 2 1 Corollary 11. With δ = m 3 and ηγ = n1 m− 3 ˆ γ ˆ )] ≤ O(m1/3 have E[∆(β, (ln m)/T , after running Algorithm 3 over T iterations, we (ln m)/T ) in terms of m and T . A.4 Proofs for Chapter 4 A.4.1 Proof of Theorem 3 For notational convenience, let us define ∆ik,l = yki − yli fk − fl , κ(xi , ·) 2 Hκ Using this, the objective function in Eq. (4.2) can be rewritten as follows h(f ) = 1 2 m n fl , fl l=1 HK m I(yli = yki )ℓ ∆ik,l +C i=1 l,k=1 We then rewrite ℓ(z) as ℓ(z) = max (x − xz) x∈[0,1] 188 Using the above expression for ℓ(z), the second term in h(f ) can be rewritten as, n m I(yli = yki ) max i ∈[0,C] γk,l i=1 l,k=1 i i γk,l − γk,l ∆ik,l The problem in Eq. (4.2) now becomes a convex-concave optimization problem as min max i ∈[0,C] fl ∈Hm γl,k g(f, γ) where n m I(yli g(f, γ) = = i=1 l,k=1 n m − i yki )γl,k 1 + 2 m fl , fl HK l=1 i I(yil = yki )γl,k ∆ik,l i=1 l,k=1 According to von Newman’s lemma, we can switch minimization with maximization. By taking the minimization over fl first, we have n m yli fl (x) = i=1 i I(yli = yki )γl,k κ(xi , x) k=1 In the above derivation, we use the relation I(yli = yki )(yli − yki ) = 2yli . To simplify our notation, we i if y i = y i and zero otherwise. Note that since γ i = γ i , we introduce Γi ∈ [0, C]m×m where Γil,k = γl,k l k l,k k,l have Γi = [Γi ]⊤ . We furthermore introduce the notation [Γi ]l as the sum of the elements in the lth row, i.e., [Γi ]l = m i k=1 Γl,k . Using these notations, we have fl (x) expressed as n yli [Γi ]l κ(xi , x) fl (x) = i=1 189 Finally, the remaining maximization problem becomes n m m n 1 max [Γ ]k − κ(xi , x)yki ykj [Γi ]k [Γj ]k 2 i=1 k=1 k=1 i,j=1    0 ≤ Γik,l ≤ C yki = yli i s. t. Γk,l =   0 otherwise i Γi = [Γi ]⊤ , i = 1, . . . 
, n; k, l = 1, . . . , m A.4.2 Proof of Theorem 4 . It is straightforward to shown τ ∈ Q1 → τ ∈ Q2 . The main challenge is to show the other direction, i.e., τ ∈ Q2 → τ ∈ Q1 . For a given τ , in order to check if there exists Z ∈ [0, C]a×b such that τ 1 : a = Z1b and τa+1:m = Z ⊤ 1a , we need show that the following optimization problem is feasible min 0 s. t. (A.21) ⊤ Z ∈ Ra×b + , τ 1 : a = Z1b , τa+1:m = Z 1a For the convenience of presentation, we denote by µa = τ1:a ∈ Ra , and by µb = τa+1:K ∈ Rb , and rewrite the above feasibility problem as min 0 s. t. (A.22) Z ∈ [0, C]a×b , µa = Z1b , µb = Z ⊤ 1a It is important to note that, for the above optimization problem, its optimal value is 0 when the solution is feasible, and +∞ when no feasible solution satisfies the condition. By introducing the Lagrangian multipliers λa ∈ Ra for µa = Z1b and λb ∈ Rb for µb = Z ⊤ 1b , we have ⊤ ⊤ min max λ⊤ a (µa − Z1b ) + λb (µb − Z 1a ) Z 0 λa ,λb 190 (A.23) By taking the minimization over Z, we have ⊤ max λ⊤ a µa + λb µb (A.24) λa ,λb s. t. ⊤ λa 1⊤ b + 1a λb 0 To decide if there is a feasible solution to Eq. (A.22), the necessary and sufficient condition is that the optimal value for Eq. (A.24) is zero. First, we show that the objective function of Eq. (A.24) is upper bounded by + 0. We denote by λ+ a and λb the maximum elements in vector ⊤ zero under the constraint λa 1⊤ b + 1a λb + i i λa and λb , respectively, i.e, λ+ a = max [λa ] and λb = max [λb ] . Evidently, according to the constraint λa 1⊤ b + 1a λ⊤ b 0, we have λ+ a 1≤i≤a + λ+ b ≤ 1≤i≤b 0. We then have the objective function bounded as ⊤ + ⊤ + ⊤ + + ⊤ λ⊤ a µa + λb µb ≤ λa 1a µa + λb 1b µb = (λa + λb )1a µa ≤ 0 Second, it is straightforward to verify that zero optimal value is obtainable by setting λa = 0a and λb = 0b . Combining the above two arguments, we have the optimal value for Eq. (A.24) is zero, which therefore indicates that there is a feasible solution to Eq. (A.22). By this, we prove that τ ∈ Q2 → τ ∈ Q1 . A.4.3 Proof of Theorem 6 We first turn the problem in Eq. (4.15) into the following min-max problem m max αi ∈[0,C]m αil min λ l=1 1 − 2 m k=1 m i i κ(x , x ) 2 yki fk−i (xi )αik − ⊤ [αik ]2 + λyi αi (A.25) k=1 Since the objective function in Eq. (A.25) is convex in λ and concave in αi , therefore according von Newman’s lemma, switching minimization with maximization will not affect the final solution. Thus, we could 191 obtain the solution by maximizing over α, i.e., αik = π[0,C] 1 + λyki − 12 yki fk−i (xi ) κ(xi , xi ) where π[0,C] (x) projects x onto the region [0, C]. To compute λ, we aim to solve the following equation m yki π[0,C] k=1 1 + λyki − 12 yki fk−i (xi ) κ(xi , xi ) =0 (A.26) Since when yki = 1, the projection in Eq. (A.26) is π[0,C] and when yki = −1, it is π[−C,0] , we could represent 1+λyki − 12 yki fk−i (xi ) κ(xi ,xi ) yki π[0,C] by h( yki +λ− 12 fk−i (xi ) i , yk C) κ(xi ,xi ) where h(x, y) is already defined in the theorem. ⊤ Since yi αi = 0, we have the following equation for λ m g(λ) = h k=1 yki + λ − 12 fk−i (xi ) i , yk C κ(xi , xi ) =0 (A.27) A.4.4 Proof of Proposition 3 To estimate λmin , we rewrite g(λ) as m I(yki = 1)π[0,C] g(λ) = k=1 m 1 + λ − 12 fk−i (xi ) κ(xi , xi ) − k=1 I(yki = −1)π[0,C] 1 − λ + 12 fk−i (xi ) κ(xi , xi ) To estimate λmin , we search for λmin such that g(λmin ) ≤ 0. To this end, we define the following quantity m I(yki = 1)π[0,C] ∆= k=1 1 − 12 fk−i (xi ) κ(xi , xi ) m − k=1 I(yki = −1)π[0,C] 1 + 12 fk−i (xi ) κ(xi , xi ) If ∆ ≤ 0, we have λmin = 0. 
Otherwise, we set λmin as the maximum of the following two quantities amin = −Cκ(xi , xi ) + min yki =−1 1 1 1 + fk−1 (xi ) , bmin = − max 1 − fk−i (xi ) 2 2 yki =1 It is evidently that one of the solutions will result into the negative value for g(λ) since (a) by setting λmin = bmin , we ensure that every π[0,C] (1 + λ − 12 fk−i (xi )) is zero, (b) by setting λmin = amin , we have 192 that every π[0,C] (1 − λ + 12 fk−i (xi )) being C. To obtain λmax , we again check ∆ ≥ 0. If so, λmax = 0. Otherwise, the solution for λmax should be the minimum of the following two quantities 1 1 amax = Cκ(xi , xi ) − min 1 − fk−i (xi ) , bmax = max 1 + fk−i (xi ) 2 2 yki =1 yki =−1 A.5 Proofs for Chapter 5 A.5.1 Proof of Lemma 1 We start proving Lemma 1 by writing the dual function of Eq. (5.5), which is as follows: g(λ) = sup L(γ, λ) = sup γi γ i l∈Y / i k∈Yi i γk,l ℓ(fk (xi ) − fl (xi )) + Since L(γ, λ) is a concave function, the upper bound is found by setting g(λ) = l∈Y / i k∈Yi ℓ2 (fk (xi ) − fl (xi )) + 4λl The Lagrange dual is to minimize g over all λ ≥ 0. k∈Yi l∈Y / i ∂L(γ,λ) ∂γ k∈Yi This concludes the proof. A.5.2 Proof of Theorem 9 We can rewrite ℓ(z) as ℓ(z) = max (x − xz) 193 k∈Yi = 0. λl The optimal λl can easily be found as ℓ2 (fk (xi ) − fl (xi )). x∈[0,1] 2 γk,l ) l∈Y / i ℓ2 (fk (xi ) − fl (xi ))/2. Therefore, the Lagrange dual form becomes l∈Y / i λl (1 − Using the above expression for ℓ(z), the objection function can be rewritten as 1 min max max i ∈∆ β i ∈[0,1] 2 fk ∈HK γk,l i k,l m k=1 |fk |2HK (A.28) n +C i=1 k∈Y i l∈Y / i i i γk,l βk,l (1 − fk (xi ) + fl (xi )) The problem now becomes a convex-concave optimization. By defining new variable Γik,l as i i i i Γik,l = γk,l βk,l + γl,k βl,k , we rewrite Eq. (A.29) as 1 2 max min fk ∈HK Γik,l ∈∆i n m k=1 |fk |2HK (A.29) m + i=1 k,l=1 Γik,l (1 − fk (xi ) + fl (xi )) Since Eq. (A.30) is a convex-concave optimization problem, according to von Newman’s lemma, we can switch minimization with maximization. By taking the minimization with respect to fk , we have n m m Γik,l fk (x) = C i=1 l=1 − Γil,k κ(x, xi ) (A.30) l=1 According to the definition of ∆i , Γik,l is nonzero only when k ∈ Y i (i.e., yki = 1) and l ∈ / Y i (i.e., yki = −1). We thus can rewrite fk (x) in Eq. (A.30) as n m i=1 By defining αik = m i l=1 Γk,l + m Γik,l + fk (x) = C m i l=1 Γl,k , l=1 Γil,k yik κ(x, xi ) l=1 we have the result in the theorem. 194 A.5.3 Proof of Lemma 2 First, using the notation of hk , we rewrite the objective function in Eq. (5.15) as b max −CKi,i η γ∈∆ s=1 b |γ ·,s |22 + 2 h⊤ s γ ·,s s=1 Since all γ·,s , s = 1, . . . , b are decoupled in both the domain ∆ and the objective function, we can decompose the above problem into b independent optimization problems, max γ ·,s ∈Ra + −CKi,i η|γ ·,s |22 + 2h⊤ s γ ·,s : |γ ·,s |2 ≤ 1 , (A.31) where s = 1, . . . , b. For each independent optimization problem, we introduce a Lagrangian multiplier λs ≥ 0 for constraint |γ ·,s |2 ≤ 1, and have min max −(CKi,i η + λs )|γ·,s |22 + 2h⊤ s γ ·,s + λs λs ≥0 γ ·,s ∈Ra + The optimal solution to the maximization of γ is γ ·,s = πG hs λs + CKi,i η In order to decide the value for λs , we use the complementary slackness condition, i.e., λs (|γ ·,s |22 − 1) = 0. There are two cases: λ = 0 implies |γ ·,s |22 ≤ 1, and λ > 0 implies |γ ·,s |22 = 1. This leads to the result stated in the Lemma. 195 BIBLIOGRAPHY 196 B IBLIOGRAPHY [1] M. P. Szymczak, Flickr photo, http://www.flickr.com/photos/marooned/. [2] D. Wild, Flickr photo, http://www.flickr.com/people/publicenergy/. [3] L. 
Fei-fei, R. Fergus, S. Member, and P. Perona, “One-shot learning of object categories,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594 – 611, 2006. [4] Q. Chen, Z. Song, S. Liu, X. Chen, X. Yuan, T.-S. Chua, S. Yan, Y. Hua, Z. Huang, and S. Shen, “Boosting classification with exclusive context,” in In PASCAL Visual Object Classes Challenge Workshop, 2010. [5] J. Yang, Y. Li, Y. Tian, L. Duan, and W. Gao, “Group-sensitive multiple kernel learning for object categorization,” in IEEE Int. Conference on Computer Vision, 2009, pp. 436–443. [6] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, “Multiple kernels for object detection,” in IEEE Int. Conference on Computer Vision, 2009, pp. 606–613. [7] A. Jiang, C. Wang, and Y. Zhu, “Calibrated rank-svm for multi-label image categorization,” in Proc. of IEEE Int. Joint Conference on Neural Networks, 2008, pp. 1450–1455. [8] A. Znaidia, H. Le Borgne, and C. Hudelot, “Belief theory for large-scale multi-label image classification,” in Belief Functions: Theory and Applications. Springer, 2012, vol. 164, pp. 205–212. [9] S. S. Bucak, R. Jin, and A. Jain, “Multi-label multiple kernel learning by stochastic approximation: Application to visual object recognition,” in Proc. of Neural Information Processing Systems, 2010, pp. 325–333. [10] N. Ueda and K. Saito, “Parametric mixture models for multi-labeled text,” in Proc. of Neural Information Processing Systems, 2002, pp. 721–728. [11] S. Shalev-Shwartz and Y. Singer, “Efficient learning of label ranking by soft projections onto polyhedra,” Journal of Machine Learning Research, vol. 7, pp. 1567–1599, 2006. [12] N. Ghamrawi and A. McCallum, “Collective multi-label classification,” in Proc. of ACM Int. Conference on Information and Knowledge Management, 2005, pp. 195–200. [13] Y. Liu, R. Jin, and L. Yang, “Semi-supervised multi-label learning by constrained non-negative matrix factorization,” in Proc. of Conference on Artificial Intelligence, 2006, pp. 421–426. [14] F. Sun, J. Tang, H. Li, G.-J. Qi, and T. Huang, “Multi-label image categorization with sparse factor representation,” IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1028–1037, 2014. [15] S. S. Bucak, P. K. Mallapragada, R. Jin, and A. K. Jain, “Efficient multi-label ranking for multi-class learning: Application to object recognition,” in Proc. of IEEE Int. Conference on Computer Vision, 2009, pp. 2098–2105. 197 [16] Y.-Y. Lin, J.-F. Tsai, and T.-L. Liu, “Efficient discriminative local learning for object recognition,” in Proc. of IEEE Int. Conference on Computer Vision, 2009, pp. 598–605. [17] D. Hall, “A system for object class detection,” in Cognitive Vision Systems. 73–85. Springer, 2006, pp. [18] B. M. Sadler and G. B. Giannakis, “Shift-and rotation-invariant object reconstruction using the bispectrum,” Journal of the Optical Society of America A, vol. 9, no. 1, pp. 57–69, 1992. [19] R. Fergus, P. Perona, and A. Zisserman, “Weakly supervised scale-invariant learning of models for visual recognition,” International Journal of Computer Vision, vol. 71, no. 3, pp. 273–303, 2007. [20] L. Spirkovska and M. B. Reid, “Robust position, scale, and rotation invariant object recognition using higher-order neural networks,” Pattern Recognition, vol. 25, no. 9, pp. 975–985, 1992. [21] T. Kadir, A. Zisserman, and M. Brady, “An affine invariant salient region detector,” in Computer Vision–ECCV. Springer, 2004, pp. 228–241. [22] A. Diplaros, T. Gevers, and I. 
Patras, “Combining color and shape information for illuminationviewpoint invariant object recognition,” IEEE Transactions on Image Processing, vol. 15, no. 1, pp. 1–11, 2006. [23] E. Hsiao and M. Hebert, “Occlusion reasoning for object detection under arbitrary viewpoint,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3146–3153. [24] A. Selinger and R. C. Nelson, “Improving appearance-based object recognition in cluttered backgrounds,” in Proc. of Int. Conference on Pattern Recognition, 2000, pp. 46–50. [25] S. Dickinson, “The evolution of object categorization and the challenge of image abstraction,” in Object Categorization: Computer and Human Vision Perspectives. Cambridge University Press, 2009, pp. 1–37. [26] S. Maji, A. C. Berg, and J. Malik, “Efficient classification for additive kernel SVMs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 66–77, 2013. [27] S. Har-Peled, D. Roth, and D. Zimak, “Constraint classification for multiclass classification and ranking,” in Proc. of Neural Information Processing Systems, 2002, pp. 809–816. [28] S. S. Bucak, R. Jin, and A. K. Jain, “Multi-label learning with incomplete class assignments,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2801–2808. [29] K. Yu and W. Chu, “Gaussian process models for link analysis and transfer learning,” in Proc. of European Symp. on Artificial Neural Networks, 2008, pp. 1657–1664. [30] N. Loeff and A. Farhadi, “Scene discovery by matrix factorization,” in Computer Vision–ECCV. Springer, 2008, pp. 451–464. [31] Amazon Mechanical Turk, https://www.mturk.com/mturk. [32] B. Scholkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA, USA: MIT Press, 2001. 198 [33] J. Zhang, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: a comprehensive study,” International Journal of Computer Vision, vol. 73, no. 2, pp. 213–238, 2007. [34] M. A. Tahir, K. van de Sande, J. Uijlings, F. Yan, X. Li, K. Mikolajczyk, J. Kittler, T. Gevers, and A. Smeulders, “SurreyUVA SRKDA method,” in PASCAL Visual Object Classes Challenge Workshop, 2008. [35] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernel learning, conic duality, and the SMO algorithm,” in Proc. of Int. Conference on Machine Learning, 2004, pp. 6–13. [36] Z. Wang, S. Chen, and T. Sun, “MultiK-MHKS: A novel multiple kernel learning algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 348–353, 2008. [37] D. P. Lewis, T. Jebara, and W. S. Noble, “Nonstationary kernel combination,” in Proc. of Int. Conference on Machine Learning, 2006, pp. 553–560. [38] L. Jie, F. Orabona, M. Fornoni, B. Caputo, and N. Cesa-bianchi, “OM-2: An online multi-class multikernel learning algorithm,” in Proc. of IEEE Online Learning for Computer Vision Workshop, 2010, pp. 43–50. [39] J. Saketha Nath, G. Dinesh, S. Raman, C. Bhattacharyya, A. Ben-Tal, and K. Ramakrishan, “On the algorithmics and applications of a mixed-norm based kernel learning formulation,” in Proc. of Neural Information Processing Systems, 2009, pp. 844–852. [40] P. V. Gehler and S. Nowozin, “Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers,” in Proc. of IEEE Int. Conference on Computer Vision, 2009, pp. 2836 – 2843. [41] F. Yan, K. Mikolajczyk, M. Barnard, H. Cai, and J. 
Kittler, “Lp norm multiple kernel fisher discriminant analysis for object and image categorisation,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3626–3632. [42] S. Nakajima, A. Binder, C. Mller, W. Wojcikiewicz, M. Kloft, U. Brefeld, K.-R. Mller, and M. Kawanabe, “Multiple kernel learning for object classification,” in Workshop on Information-based Induction Sciences, 2009. [43] P. V. Gehler and S.Nowozin, “On feature combination for multiclass object classification,” in Proc. of Int. Conference on Machine Learning, 2009, pp. 221–228. [44] J. Ren, Z. Liang, and S. Hu, “Multiple kernel learning improved by MMD,” in Proc. of Int. Conference on Advanced Data Mining and Applications, 2010, pp. 63–74. [45] C. Cortes, M. Mohri, and A. Rostamizadeh, “Learning non-linear combinations of kernels,” in Proc. of Neural Information Processing Systems, 2009, pp. 396–404. [46] Z. Xu, R. Jin, H. Yang, I. King, and M. R. Lyu, “Simple and efficient multiple kernel learning by group lasso,” in Proc. of Int. Conference on Machine Learning, 2010, pp. 1175–1182. [47] C. Cortes, M. Mohri, and A. Rostamizadeh, “L2 regularization for learning kernels,” in Proc. of Conference on Uncertainty in Artificial Intelligence, 2009, pp. 109–116. 199 [48] Z. Xu, R. Jin, S. Zhu, M. Lyu, and I. King, “Smooth optimization for effective multiple kernel learning,” in Proc. of Conference on Artificial Intelligence, 2010, pp. 637–642. [49] R. Tomioka and T. Suzuki, “Sparsity-accuracy trade-off in MKL,” in NIPS Workshop on Understanding Multiple Kernel Learning Methods, 2009. [50] F. Yan, K. Mikolajczyk, J. Kittler, and A. Tahir, “A comparison of L1 norm and L2 norm multiple kernel SVMs in image and video classification,” in Int. Workshop on Content-Based Multimedia Indexing, 2009. [51] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet, “More efficiency in multiple kernel learning,” in Proc. of Int. Conference on Machine Learning, 2007, pp. 775–782. [52] Z. Xu, R. Jin, I. King, and M. R. Lyu, “An extended level method for efficient multiple kernel learning,” in Proc. of Neural Information Processing Systems, 2009, pp. 1825–1832. [53] A. Rakotomamonjy, F. Bach, Y. Grandvalet, and S. Canu, “SimpleMKL,” Journal of Machine Learning Research, vol. 9, no. 11, pp. 2491–2521, 2008. [54] G. Lanckriet, N. Cristianini, P. Bartlett, and L. E. Ghaoui, “Learning the kernel matrix with semidefinite programming,” Journal of Machine Learning Research, vol. 5, pp. 27–72, 2004. [55] S. Sonnenburg, G. R¨atsch, and C. Sch¨afer, “A general and efficient multiple kernel learning algorithm,” in Proc. of Neural Information Processing Systems, 2006, pp. 1273–1280. [56] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien, “Lp-norm multiple kernel learning,” Journal of Machine Learning Research, vol. 12, pp. 953–997, 2011. [57] M. Kowalski, M. Szafranski, and L. Ralaivola, “Multiple indefinite kernel learning with mixed norm regularization,” in Proc. of Int. Conference on Machine Learning, 2009, pp. 545–552. [58] C. Cortes, M. Mohri, and A. Rostamizadeh, “Generalization bounds for learning kernels,” in Proc. of Int. Conference on Machine Learning, 2010, pp. 247–254. [59] Z. Hussain and J. Shawe-Taylor, “A note on improved loss bounds for multiple kernel learning,” arXiv preprint arXiv:1106.6258, 2011. [60] M. Kloft, U. R¨uckert, and P. L. Bartlett, “A unifying view of multiple kernel learning,” in Proc. of European Conference on Machine Learning and Knowledge Discovery in Databases, 2010, pp. 66–81. [61] M. Kloft, U. 
Brefeld, S. Sonnenburg, P. Laskov, K.-R. M¨uller, and A. Zien, “Efficient and accurate lp-norm multiple kernel learning,” in Proc. of Neural Information Processing Systems, 2009, pp. 997–1005. [62] K. Gai, G. Chen, and C. Zhang, “Learning kernels with radiuses of minimum enclosing balls,” in Proc. of Neural Information Processing Systems, 2010, pp. 649–657. [63] M. Varma and B. R. Babu, “More generality in efficient multiple kernel learning,” in Proc. of Int. Conference on Machine Learning, 2009, pp. 1065–1072. 200 [64] J. Aflalo, A. Ben-Tal, C. Bhattacharyya, J. S. Nath, and S. Raman, “Variable sparsity kernel learning,” Journal of Machine Learning Research, vol. 12, pp. 565–592, 2011. [65] F. Bach, “Exploring large feature spaces with hierarchical multiple kernel learning,” in Proc. of Neural Information Processing Systems, 2009, pp. 105–112. [66] J. Yang, Y. Li, Y. Tian, L.-Y. Duan, and W. Gao, “Per-sample multiple kernel approach for visual concept learning,” EURASIP Journal on Image and Video Processing, vol. 2010, no. 2, pp. 1–13, January 2010. [67] M. Gnen and E. Alpaydin, “Localized multiple kernel learning,” in Proc. of Int. Conference on Machine Learning, 2008, pp. 352–359. [68] S. Ji, L. Sun, R. Jin, and J. Ye, “Multi-label multiple kernel learning,” in Proc. of Neural Information Processing Systems, 2009, pp. 777–784. [69] M. Varma and D. Ray, “Learning the discriminative power-invariance trade-off,” in Proc. of IEEE Int. Conference on Computer Vision, 2007, pp. 1–8. [70] S. Vishwanathan, Z. Sun, N. Ampornpunt, and M. Varma, “Multiple kernel learning and the SMO algorithm,” in Proc. of Neural Information Processing Systems, 2010, pp. 2361–2369. [71] S. Sonnenburg, G. R¨atsch, C. Sch¨afer, and B. Sch¨olkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. 7, pp. 1531–1565, 2006. [72] H. Liu and L. Yu, “Toward integrating feature selection algorithms for classification and clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491–502, 2005. [73] F. Bach, “Consistency of the group lasso and multiple kernel learning,” Journal of Machine Learning Research, vol. 9, pp. 1179–1225, 2008. [74] A. Bosch, A. Zisserman, and X. Munoz, “Representing shape with a spatial pyramid kernel,” in Proc. of ACM Int. Conference on Image and Video Retrieval, 2007, pp. 401–408. [75] T. Hertz, “Learning distance functions: Algorithms and applications,” Ph.D. dissertation, The Hebrew University of Jerusalem, 2006. [76] N. Cristianini, J. Kandola, A. Elisseeff, and J. Shawe-Taylor, “On kernel-target alignment,” in Proc. of Neural Information Processing Systems, 2002, pp. 367–373. [77] O. Chapelle, J. Weston, and B. Schlkopf, “Cluster kernels for semi-supervised learning,” in Proc. of Neural Information Processing Systems, 2003, pp. 585–592. [78] R. I. Kondor and J. Lafferty, “Diffusion kernels on graphs and other discrete structures,” in Proc. of Int. Conference on Machine Learning, 2002, pp. 315–322. [79] J. Zhuang, I. W. Tsang, and S. C. H. Hoi, “SimpleNPKL: Simple non-parametric kernel learning,” in Proc. of Int. Conference on Machine Learning, 2009, pp. 1273–1280. [80] B. Kulis, M. Sustik, and I. Dhillon, “Learning low-rank kernel matrices,” in Proc. of Int. Conference on Machine Learning, 2006, pp. 505–512. 201 [81] S. C. H. Hoi and R. Jin, “Active kernel learning,” in Proc. of Int. Conference on Machine Learning, 2008, pp. 400–407. [82] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Royal. Statist. Soc B., vol. 
58, no. 1, pp. 267–288, 1996. [83] C. Longworth and M. J. Gales, “Multiple kernel learning for speaker verification,” in Proc. of IEEE Int. Conference on Acoustics, Speech and Signal Processing, 2008, pp. 1581–1584. [84] V. Sindhwani and A. C. Lozano, “Non-parametric group orthogonal matching pursuit for sparse learning with multiple kernels,” in Proc. of Neural Information Processing Systems, 2011, pp. 414–431. [85] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer, 1998. [86] A. Martins, N. Smith, E. Xing, P. Aguiar, and M. Figueiredo, “Online multiple kernel learning for structured prediction,” in NIPS Workshop on New Directions in Multiple Kernel Learning, 2010. [87] A. Zien and S. Cheng, “Multiclass multiple kernel learning,” in Proc. of Int. Conference on Machine Learning, 2007, pp. 1191–1198. [88] J. F. Sturm, “Using sedumi 1. 02, a matlab toolbox for optimization over symmetric cones,” Optimization Methods and Software, vol. 11-12, pp. 625–653, 1999. [89] “The MOSEK optimization software.” [Online]. Available: http://www.mosek.com/ [90] J. C. Platt, Fast Training of Support Vector Machines Using Sequential Minimal Optimization. Cambridge, MA, USA: MIT Press, 1999, pp. 185–208. [91] R. Jin, S. C. H. Hoi, and T. Yang, “Online multiple kernel learning: Algorithms and mistake bounds,” in Proc. of Int. Conference on Algorithmic Learning Theory, 2010, pp. 390–404. [92] F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, no. 6, pp. 386–408, 1958. [93] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997. [94] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html. [95] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution grayscale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002. [96] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509–522, 2002. [97] N. Pinto, D. D. Cox, and J. J. Dicarlo, “Why is real-world visual object recognition hard?” PLoS Computational Biology, vol. 4, no. 1, 2008. 202 [98] O. Tuzel, F. Porikli, and P. Meer, “Region covariance: A fast descriptor for detection and classification,” in Computer Vision–ECCV. Springer, 2006, pp. 589–600. [99] A. Berg and J. Malik, “Geometric blur for template matching,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2001, pp. 607–614. [100] E. Shechtman and M. Irani, “Matching local self-similarities across images and videos,” in Proc. of IEEE conference on Computer Vision and Pattern Recognition, 2007, pp. 607–614. [101] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, “TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation,” in Proc. of IEEE Int. Conference on Computer Vision, 2009, pp. 309–316. [102] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001. 
[103] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169–2178.
[104] K. Mikolajczyk and C. Schmid, “Indexing based on scale invariant interest points,” in Proc. of IEEE Int. Conference on Computer Vision, 2001, pp. 525–531.
[105] J. van de Weijer and C. Schmid, “Coloring local feature extraction,” in Computer Vision–ECCV. Springer, 2006, pp. 334–348.
[106] F. Perronnin, J. Sanchez, and Y. Liu, “Large-scale image categorization with explicit data embedding,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2297–2304.
[107] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 21–27, 2011.
[108] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming, version 1.21,” http://cvxr.com/cvx, April 2011.
[109] F. Bach, R. Thibaux, and M. I. Jordan, “Computing regularization paths for learning multiple kernels,” in Proc. of Neural Information Processing Systems, 2005, pp. 73–80.
[110] F. Li, J. Carreira, and C. Sminchisescu, “Object recognition as ranking holistic figure-ground hypotheses,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1712–1719.
[111] G. L. Oliveira, E. R. Nascimento, A. W. Vieira, and M. F. M. Campos, “Sparse spatial coding: A novel approach for efficient and accurate object recognition,” in Proc. of IEEE Int. Conference on Robotics and Automation, 2012, pp. 2592–2598.
[112] Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, “Contextualizing object detection and classification,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1585–1592.
[113] H. Harzallah, F. Jurie, and C. Schmid, “Combining efficient object localization and image classification,” in Proc. of Int. Conference on Computer Vision, 2009, pp. 237–244.
[114] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman, “The devil is in the details: An evaluation of recent feature encoding methods,” in Proc. of the British Machine Vision Conference, 2011, pp. 1–12.
[115] S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc, “The shogun machine learning toolbox,” The Journal of Machine Learning Research, vol. 99, pp. 1799–1802, 2010.
[116] L. Tang, J. Chen, and J. Ye, “On multiple kernel learning with multiple labels,” in Proc. of Int. Joint Conference on Artificial Intelligence, 2009, pp. 1255–1260.
[117] F. Orabona, L. Jie, and B. Caputo, “Online-batch strongly convex multi kernel learning,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 787–794.
[118] S. Mei, “Multi-kernel transfer learning based on Chou's PseAAC formulation for protein submitochondria localization,” Journal of Theoretical Biology, vol. 293, pp. 121–130, 2012.
[119] A. Nemirovski, “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems,” SIAM Journal on Optimization, vol. 15, no. 1, pp. 229–251, 2004.
[120] T. G. Dietterich and G. Bakiri, “Solving multiclass learning problems via error-correcting output codes,” Journal of Artificial Intelligence Research, vol. 2, no. 1, pp. 263–286, 1995.
[121] C.-W. Hsu and C.-J. Lin, “A comparison of methods for multi-class support vector machines,” IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415–425, 2002.
[122] D. Hsu, S. M. Kakade, J. Langford, and T. Zhang, “Multi-label prediction via compressed sensing,” in Proc. of Neural Information Processing Systems, 2009, pp. 772–780.
[123] T. Zhou, D. Tao, and X. Wu, “Compressed labeling on distilled labelsets for multi-label learning,” Machine Learning, vol. 88, no. 1-2, pp. 69–126, 2012.
[124] F. Tai and H.-T. Lin, “Multilabel classification with principal label space transformation,” Neural Computation, vol. 24, no. 9, pp. 2508–2542, 2012.
[125] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern Recognition, vol. 40, no. 7, pp. 2038–2048, 2007.
[126] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[127] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[128] A. Clare and R. D. King, “Knowledge discovery in multi-label phenotype data,” in Principles of Data Mining and Knowledge Discovery. Springer, 2001, pp. 42–53.
[129] Y. Freund and L. Mason, “The alternating decision tree learning algorithm,” in Proc. of Int. Conference on Machine Learning, 1999, pp. 124–133.
[130] H. Blockeel, M. Bruynooghe, S. Dzeroski, J. Ramon, and J. Struyf, “Hierarchical multi-classification,” in ACM SIGKDD Workshop on Multi-Relational Data Mining, 2002.
[131] F. De Comité, R. Gilleron, and M. Tommasi, “Learning multi-label alternating decision trees from texts and data,” in Machine Learning and Data Mining in Pattern Recognition. Springer, 2003, pp. 35–49.
[132] J. Struyf, S. Dzeroski, H. Blockeel, and A. Clare, “Hierarchical multi-classification with predictive clustering trees in functional genomics,” ser. EPIA. Springer-Verlag, 2005, pp. 272–285.
[133] Y. Han, F. Wu, Y. Zhuang, and X. He, “Multi-label transfer learning with sparse representation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 8, pp. 1110–1121, 2010.
[134] G.-J. Qi, C. Aggarwal, Y. Rui, Q. Tian, S. Chang, and T. Huang, “Towards cross-category knowledge propagation for learning visual concepts,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 897–904.
[135] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, “Online passive-aggressive algorithms,” Journal of Machine Learning Research, vol. 7, pp. 551–585, 2006.
[136] A. Elisseeff and J. Weston, “A kernel method for multi-labelled classification,” in Proc. of Neural Information Processing Systems, 2001, pp. 681–687.
[137] O. Dekel, C. D. Manning, and Y. Singer, “Log-linear models for label ranking,” in Proc. of Neural Information Processing Systems, 2003.
[138] A. McCallum, “Multi-label text classification with a mixture model trained by EM,” in AAAI Workshop on Text Learning, 1999.
[139] S. Ji, L. Tang, S. Yu, and J. Ye, “Extracting shared subspace for multi-label classification,” in Proc. of ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 2008, pp. 381–389.
[140] K. Yu, S. Yu, and V. Tresp, “Multi-label informed latent semantic indexing,” in Proc. of Annual ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 258–265.
[141] A. Quattoni, M. Collins, and T. Darrell, “Transfer learning for image classification with sparse prototype representations,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[142] S.-J. Huang, Y. Yu, and Z.-H. Zhou, “Multi-label hypothesis reuse,” in Proc. of ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 2012, pp. 525–533.
[143] Y. Guo and S. Gu, “Multi-label classification using conditional dependency networks,” in Proc. of Int. Joint Conference on Artificial Intelligence, 2011, pp. 1300–1305.
[144] G. Chen, J. Zhang, F. Wang, C. Zhang, and Y. Gao, “Efficient multi-label classification with hypergraph regularization,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1658–1665.
[145] L. Sun, S. Ji, and J. Ye, “Hypergraph spectral learning for multi-label classification,” in Proc. of ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 2008, pp. 668–676.
[146] S. Godbole and S. Sarawagi, “Discriminative methods for multi-labeled classification,” in Advances in Knowledge Discovery and Data Mining. Springer, 2004, pp. 22–30.
[147] N. Alaydie, C. K. Reddy, and F. Fotouhi, “Exploiting label dependency for hierarchical multi-label classification,” in Advances in Knowledge Discovery and Data Mining. Springer, 2012, pp. 294–305.
[148] A. Veloso, W. Meira Jr., M. Gonçalves, and M. Zaki, “Multi-label lazy associative classification,” in Knowledge Discovery in Databases. Springer, 2007, pp. 605–612.
[149] J. Read, L. Martino, and D. Luengo, “Efficient Monte Carlo optimization for multi-dimensional classifier chains,” arXiv preprint arXiv:1211.2190, 2012.
[150] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in Proc. of Int. Conference on Machine Learning, 2009, pp. 1113–1120.
[151] R. A. Amar, D. R. Dooly, S. A. Goldman, and Q. Zhang, “Multiple-instance learning of real-valued data,” in Proc. of Int. Conference on Machine Learning, 2001, pp. 3–10.
[152] M. Palatucci, D. Pomerleau, G. Hinton, and T. Mitchell, “Zero-shot learning with semantic output codes,” in Proc. of Neural Information Processing Systems, 2009, pp. 1410–1418.
[153] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. Keerthi, and S. Sundararajan, “A dual coordinate descent method for large-scale linear SVM,” in Proc. of Int. Conference on Machine Learning, 2008, pp. 408–415.
[154] M. J. Huiskes and M. S. Lew, “The MIR Flickr retrieval evaluation,” in Proc. of ACM Int. Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
[155] M. Guillaumin, J. Verbeek, and C. Schmid, “Multimodal semi-supervised learning for image classification,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 902–909.
[156] R.-E. Fan, P.-H. Chen, and C.-J. Lin, “Working set selection using second order information for training SVM,” Journal of Machine Learning Research, vol. 6, pp. 1889–1918, 2005.
[157] J. C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.
[158] M. Marszalek and C. Schmid, “Semantic hierarchies for visual object recognition,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–7.
[159] N. Nguyen and R. Caruana, “Classification with partial labels,” in Proc. of ACM SIGKDD Int. Conference on Knowledge Discovery and Data Mining, 2008, pp. 551–559.
[160] R. Jin and Z. Ghahramani, “Learning with multiple labels,” in Proc. of Neural Information Processing Systems, 2002, pp. 897–904.
[161] A. Pentland, “Expectation maximization for weakly labeled data,” in Proc. of Int. Conference on Machine Learning, 2001, pp. 218–225.
[162] K. Crammer and Y. Singer, “On the learnability and design of output codes for multiclass problems,” Machine Learning, vol. 47, no. 2, pp. 201–233, 2002.
[163] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Royal Statist. Soc. B, vol. 68, no. 1, pp. 49–67, 2006.
[164] J. Fan, Y. Shen, N. Zhou, and Y. Gao, “Harvesting large-scale weakly-tagged image databases from the web,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 802–809.
[165] M. Szummer, “Learning from partially labeled data,” Ph.D. dissertation, Massachusetts Institute of Technology, 2002.
[166] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Efficient bandit algorithms for online multiclass prediction,” in Proc. of Int. Conference on Machine Learning, 2008, pp. 440–447.
[167] S. Wang, R. Jin, and H. Valizadegan, “A potential-based framework for online multi-class learning with partial feedback,” in Proc. of Int. Conference on Artificial Intelligence and Statistics, 2010, pp. 900–907.
[168] F. Alizadeh and D. Goldfarb, “Second-order cone programming,” Mathematical Programming, vol. 95, no. 1, pp. 3–51, 2003.
[169] M. Petrovskiy, “Paired comparisons method for solving multi-label learning problem,” in Proc. of Int. Conference on Hybrid Intelligent Systems, 2006, pp. 42–42.
[170] L. Sun, S. Ji, and J. Ye, “Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 194–200, 2011.
[171] O. Yakhnenko and V. Honavar, “Multiple label prediction for image annotation with multiple kernel correlation models,” in IEEE Computer Vision and Pattern Recognition Workshops, 2009.
[172] W. Zhang, X. Xue, J. Fan, X. Huang, B. Wu, and M. Liu, “Multi-kernel multi-label learning with max-margin concept network,” in Proc. of Int. Joint Conference on Artificial Intelligence, 2011, pp. 1615–1620.
[173] L.-J. Li, H. Su, E. P. Xing, and F.-F. Li, “Object bank: A high-level image representation for scene classification & semantic feature sparsification,” in Proc. of Neural Information Processing Systems, 2010, p. 5.
[174] L. G. Roberts, “Machine perception of three-dimensional solids,” Ph.D. dissertation, Massachusetts Institute of Technology, 2007.
[175] T. O. Binford, “Survey of model-based image analysis systems,” The International Journal of Robotics Research, vol. 1, no. 1, pp. 18–64, 1982.
[176] R. T. Chin and C. R. Dyer, “Model-based recognition in robot vision,” ACM Computing Surveys, vol. 18, no. 1, pp. 67–108, 1986.
[177] R. Bergevin and M. Levine, “Generic object recognition: Building and matching coarse descriptions from line drawings,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 1, pp. 19–36, 1993.
[178] F. Stein and G. Medioni, “Structural indexing: Efficient 2D object recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 12, pp. 1198–1204, 1992.
[179] E. Saber, A. M. Tekalp, R. Eschbach, and K. Knox, “Automatic image annotation using adaptive color classification,” Graphical Models and Image Processing, vol. 58, no. 2, pp. 115–126, 1996.
[180] M. J. Swain and D. H. Ballard, “Color indexing,” International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
[181] J. Mao and A. K. Jain, “Texture classification and segmentation using multiresolution simultaneous autoregressive models,” Pattern Recognition, vol. 25, no. 2, pp. 173–188, 1992.
[182] B. S. Manjunath and W.-Y. Ma, “Texture features for browsing and retrieval of image data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 837–842, 1996.
[183] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, “Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary,” in Computer Vision–ECCV. Springer, 2002, pp. 97–112.
[184] J. Jeon, V. Lavrenko, and R. Manmatha, “Automatic image annotation and retrieval using cross-media relevance models,” in Proc. of Annual ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 119–126.
[185] F. Monay and D. Gatica-Perez, “On image auto-annotation with latent space models,” in Proc. of ACM Int. Conference on Multimedia, 2003, pp. 275–278.
[186] J.-Y. Pan, H.-J. Yang, P. Duygulu, and C. Faloutsos, “Automatic image captioning,” in Proc. of IEEE Int. Conference on Multimedia and Expo, 2004, pp. 1987–1990.
[187] C. Schmid and R. Mohr, “Matching by local invariants,” INRIA, Tech. Rep. RR-2644, 1995.
[188] C. Schmid, R. Mohr, and C. Bauckhage, “Comparing and evaluating interest points,” in Proc. of Int. Conference on Computer Vision, 1998, pp. 230–235.
[189] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 2004.
[190] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, “Visual categorization with bags of keypoints,” in ECCV Workshop on Statistical Learning in Computer Vision, 2004.
[191] C. Harris and M. Stephens, “A combined corner and edge detector,” in Proc. of Alvey Vision Conference, 1988, pp. 50–50.
[192] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[193] F. Perronnin, J. Sánchez, and T. Mensink, “Improving the Fisher kernel for large-scale image classification,” in Computer Vision–ECCV. Springer, 2010, pp. 143–156.
[194] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “LabelMe: A database and web-based tool for image annotation,” International Journal of Computer Vision, vol. 77, no. 1-3, pp. 157–173, 2008.
[195] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1778–1785.
[196] LSVRC Challenge, http://www.image-net.org/challenges/LSVRC/2013/.
[197] S. Nowak and M. J. Huiskes, “New strategies for image annotation: Overview of the photo annotation task at ImageCLEF 2010,” in CLEF (Notebook Papers/LABs/Workshops), vol. 1, no. 3, 2010, p. 4.
[198] L. von Ahn and L. Dabbish, “Labeling images with a computer game,” in Proc. of the Conference on Human Factors in Computing Systems, 2004, pp. 319–326.
[199] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. of Neural Information Processing Systems, 2012, pp. 1106–1114.
[200] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.